Categorizing NIPS papers using LDA topic modeling

The Annual Conference on Neural Information Processing Systems (NIPS) has recently listed this year’s accepted papers. There are 403 paper titles listed, which made for great morning coffee reading as I tried to pick out the ones that most interest me.

This being a machine learning conference, it’s only fitting that we apply a little machine learning to this (decidedly _small_) data.

Building off of the great example code in a post by Jordan Barber on Latent Dirichlet Allocation (LDA) with Python, I scraped the paper titles and built an LDA topic model with 5 topics. All of the code to reproduce this post is available on GitHub. Here are the top 10 most probable words from each of the derived topics (a rough sketch of the pipeline follows the table):

| Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 |
| --- | --- | --- | --- | --- |
| learning | learning | optimization | learning | via |
| models | inference | networks | bayesian | models |
| neural | sparse | time | sample | inference |
| high | models | stochastic | analysis | networks |
| stochastic | non | model | data | deep |
| dimensional | optimization | convex | inference | learning |
| networks | algorithms | monte | spectral | fast |
| graphs | multi | carlo | networks | variational |
| optimal | linear | neural | bandits | neural |
| sampling | convergence | information | methods | convolutional |
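The exact scraping-and-modeling code is in the GitHub repo linked above; as a minimal sketch of the pipeline, here is roughly what it looks like with gensim and nltk, following the structure of Jordan Barber's walkthrough (the filename `titles.txt` and the preprocessing details are placeholders, not the repo's actual code):

```python
from gensim import corpora, models
from nltk.corpus import stopwords          # requires nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

# Hypothetical input: one scraped paper title per line
with open("titles.txt") as f:
    titles = [line.strip() for line in f if line.strip()]

# Tokenize, remove stopwords, and stem each title
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()
texts = [[stemmer.stem(w) for w in title.lower().split() if w not in stop]
         for title in titles]

# Build the token-id mapping and the bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a 5-topic LDA model and show the 10 most probable words per topic
lda = models.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=20)
for t in range(lda.num_topics):
    print(t, [word for word, prob in lda.show_topic(t, topn=10)])
```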

Normally, we might try to attach some kind of label to each topic using our beefy human brains and subject matter expertise, but I didn’t bother with this, as nothing too obvious stuck out at me. If you think you have appropriate names for them, feel free to let me know. Given that we are only working with the titles (no abstracts or full paper text), it isn’t surprising that no obvious human-interpretable topics jump out. But let’s not let that stop us from proceeding.

We can also represent the inferred topics with the much maligned, but handy-dandy wordcloud visualization:

[Word cloud visualizations for Topics 0 through 4]

Since we are modeling the paper-title-generating process as a probability distribution over topics, each of which is a probability distribution over words, we can use this generating process to suggest keywords for each title. These keywords may or may not show up in the title itself. Here are some from the first 10 titles; a sketch of the sampling procedure follows the examples.


Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing
Generated Keywords: [u'iteration', u'inference', u'theory']


Learning with Symmetric Label Noise: The Importance of Being Unhinged
Generated Keywords: [u'uncertainty', u'randomized', u'neural']


Algorithmic Stability and Uniform Generalization
Generated Keywords: [u'spatial', u'robust', u'dimensional']


Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models
Generated Keywords: [u'rates', u'fast', u'based']


Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling
Generated Keywords: [u'monte', u'neural', u'stochastic']


Robust Portfolio Optimization
Generated Keywords: [u'learning', u'online', u'matrix']


Logarithmic Time Online Multiclass prediction
Generated Keywords: [u'complexity', u'problems', u'stein']


Planar Ultrametric Rounding for Image Segmentation
Generated Keywords: [u'deep', u'graphs', u'neural']


Expressing an Image Stream with a Sequence of Natural Sentences
Generated Keywords: [u'latent', u'process', u'stochastic']


Parallel Correlation Clustering on Big Graphs
Generated Keywords: [u'robust', u'learning', u'learning']
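The sampling procedure referenced above, roughly: draw a topic from the title's inferred topic distribution, then draw a word from that topic's word distribution. A sketch, with function and variable names of my own invention, assuming the `lda`, `corpus`, and `dictionary` objects from the earlier snippet:

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_keywords(lda, dictionary, bow, n_keywords=3):
    # P(topic | title): the title's inferred mixture over topics
    topic_ids, topic_probs = zip(*lda.get_document_topics(bow, minimum_probability=0.0))
    topic_probs = np.array(topic_probs) / np.sum(topic_probs)

    keywords = []
    for _ in range(n_keywords):
        # Draw a topic, then draw a word from that topic's P(word | topic)
        t = rng.choice(topic_ids, p=topic_probs)
        word_ids, word_probs = zip(*lda.get_topic_terms(t, topn=len(dictionary)))
        word_probs = np.array(word_probs) / np.sum(word_probs)
        keywords.append(dictionary[rng.choice(word_ids, p=word_probs)])
    return keywords

# Keywords for the first title, for example:
# generate_keywords(lda, dictionary, corpus[0])
```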

Entropy and the most “interdisciplinary” paper title

While some titles are strongly associated with a single topic, others seem to be generated from more even distributions over topics. Paper titles with more equal representation over topics could be considered to be, in some way, more interdisciplinary, or at least, intertopicular (yes, I just made that word up). To find these papers, we’ll find which paper titles have the highest information entropy in their inferred topic distribution.
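Computing this is a one-liner once you have each title's topic distribution. A sketch, under the same assumptions as the snippets above (note the entropies below are consistent with natural log, where the maximum for 5 topics is ln 5 ≈ 1.61):

```python
import numpy as np

def topic_entropy(lda, bow):
    # Shannon entropy (in nats) of the title's inferred topic distribution
    probs = np.array([p for _, p in lda.get_document_topics(bow, minimum_probability=0.0)])
    return -np.sum(probs * np.log(probs + 1e-12))   # epsilon guards against log(0)

# Titles ranked from most to least "intertopicular":
# ranked = sorted(zip(titles, corpus), key=lambda tc: -topic_entropy(lda, tc[1]))
```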

Here are the top 10 along with their associated entropies:

1.22769364291 Where are they looking?
1.1794725784 Bayesian dark knowledge
1.11261338284 Stochastic Variational Information Maximisation
1.06836891546 Variational inference with copula augmentation
1.06224431711 Adaptive Stochastic Optimization: From Sets to Paths
1.04994413148 The Population Posterior and Bayesian Inference on Streams
1.01801236048 Revenue Optimization against Strategic Buyers
1.01652797194 Fast Convergence of Regularized Learning in Games
0.993789478925 Communication Complexity of Distributed Convex Learning and Optimization
0.990764728084 Local Expectation Gradients for Doubly Stochastic Variational Inference

So it looks like, by this method, ‘Where are they looking?’ earns the highest entropy, though likely more as a result of topic uncertainty than any real multi-topic content.


auto·di·dact n.
A self-taught person.
From Greek autodidaktos, self-taught : auto-, auto- + didaktos, taught.


sim·u·late v.
To create a representation or model of (a physical system or particular situation, for example).
From Latin simulāre, simulāt-, from similis, like.

(If you can get past the mixing of Latin and Greek roots)

sim·u·di·dactic adj.
To learn by creating a representation or model of a physical system or particular situation. Particularly, using in silico computation to understand complex systems and phenomena.


This concept has been floating around in my head for a little while. I’ve written before on how I believe that simulation can be used to improve one’s understanding of just about anything, but have never had a nice shorthand for this process.

Simudidactic inquiry is the process of understanding aspects of the world by abstracting them into a computational model, then conducting experiments in this model world by changing the underlying properties and parameters. In this way, one can ask questions like:

  1. What type of observations might we make if x were true?
  2. If my model of the process is accurate, can I recapture the underlying parameters given the type of observations I can make in the real world? How often will I be wrong?
  3. Will I be able to distinguish between competing models given the observations I can make in the real world?

In addition to being able to ask these types of questions, the simudidact solidifies their understanding of the model by actually building it.

So go on, get simudidactic and learn via simulation!


Calculating AUC the hard way

The Area Under the Receiver Operator Curve (AUC) is a commonly used metric of model performance in machine learning and many other binary classification/prediction problems. The idea is to generate a threshold-independent measure of how well a model is able to distinguish between two possible outcomes. Threshold independent here just means that for any model which makes continuous predictions about binary outcomes, converting those continuous predictions to binary requires choosing an arbitrary threshold: predictions above it become a 1, and predictions below it become a 0.

AUC gets around this threshold problem by integrating across all possible thresholds. Typically, it is calculated by plotting the true positive rate against the false positive rate across the range of possible thresholds (this is the Receiver Operator Curve) and then integrating (calculating the area under the curve). The result is typically something like this:


I’ve implemented this algorithm in an R script which I use quite frequently. Whenever I am tasked with explaining the meaning of the AUC value, however, I will usually just say that you want it to be 1 and that 0.5 is no better than random. This usually suffices, but if my interlocutor is of the particularly curious sort, they will tend to want more. At which point I will offer the interpretation that the AUC gives you the probability that a randomly selected positive case (1) will be ranked higher in your predictions than a randomly selected negative case (0).

Which got me thinking – if this is true, why bother with all this false positive, false negative, ROC business in the first place? Why not just use Monte Carlo to estimate this probability directly?

So, of course, I did just that and by golly it works.


# Simulate some binary outcomes #
n <- 100
x <- runif(n, -3, 3)
p <- 1/(1 + exp(-x))    # predicted probabilities via the logistic function
y <- rbinom(n, 1, p)    # observed binary outcomes drawn with those probabilities

# Using my overly verbose code at

## The hard way (but with fewer lines of code) ##
N <- 10000000
r_pos_p <- sample(p[y == 1], N, replace = TRUE)
r_neg_p <- sample(p[y == 0], N, replace = TRUE)

# Monte Carlo probability of a randomly drawn 1 having a higher score than
# a randomly drawn 0 (AUC by definition):

rAUC <- mean(r_pos_p > r_neg_p)

By randomly sampling positive and negative cases to see how often the positives have larger predicted probability than the negatives, the AUC can be calculated without the ROC or thresholds or anything. Now, before you object that this is necessarily an approximation, I’ll stop you right there – it is. And it is more computationally expensive too. The real value of this method, for me, is what it does for my understanding of the meaning of AUC. I hope that it has helped yours too!
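For what it's worth, at this scale the approximation isn't even necessary: you can average the same comparison over every positive/negative pair directly (this is the Mann-Whitney U statistic, rescaled). A quick sketch in Python with numpy, mirroring the simulation in the R code above:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same setup as the R code: logistic probabilities, binomial outcomes
x = rng.uniform(-3, 3, 100)
p = 1 / (1 + np.exp(-x))
y = rng.binomial(1, p)

# Exact AUC: compare every positive score against every negative score
pos, neg = p[y == 1], p[y == 0]
auc = np.mean(pos[:, None] > neg[None, :])
print(auc)
```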

From Whale Calls to Dark Matter: Competitive Data Science with R and Python

Back in June I gave a fun talk at Montreal Python on some of my dabbling in the competitive data science scene. The good people at Savoir-faire Linux recorded the talk and have edited it all together into a pretty slick video. If you can spare twenty minutes or so, have a look.

If you want the slides, head on over to my speakerdeck page.


How likely is the NSA PRISM program to catch a terrorist?

Recent revelations about PRISM, the NSA’s massive program of surveillance of civilian communications, have caused quite a stir. And rightfully so, as it appears that the agency has been granted warrantless direct access to just about any form of digital communication engaged in by American citizens, and that their access to such data has been growing significantly over the past few years.

Some may argue that there is a necessary trade-off between civil liberties and public safety, and that others should just quit their whining. Let’s take a look at this proposition (not the whining part). Specifically, let’s ask: how much benefit, in terms of thwarted would-be attacks, does this level of surveillance confer?

Let’s start by recognizing that terrorism is extremely rare. So the probability that an individual under surveillance (and now everyone is under surveillance) is also a terrorist is extremely low. Let’s also assume that the neck-beards at the NSA are fairly clever, if exceptionally creepy. We assume that they have devised an algorithm that can detect ‘terrorist communications’ (as opposed to, for instance, pizza orders) with 99% accuracy.

P(+ |  bad guy) = 0.99

A job well done, and Murica lives to fight another day. Well, not quite. What we really want to know is: what is the probability that they’ve found a bad guy, given that they’ve gotten a hit on their screen? Or,

P(bad guy | +) =??

Which is quite a different question altogether. To figure this out, we need a bit more information. Recall that bad guys (specifically terrorists) are extremely rare, say on the order of one in a million (this is a wild overestimate, with the true rate being much lower, of course – but let’s not let that stop us). So,

P(bad guy) = 1/1,000,000

Further, let’s say that the spooks have a pretty good algorithm that only comes up falsely positive (i.e. when the person under surveillance is a good guy) one in one hundred times.

P(+ |  good guy) = 0.01

And now we have all that we need. Apply a little special Bayes sauce:

P(bad guy | +) = P(+ | bad guy) P(bad guy)  /  [ P(+ |  bad guy) P(bad guy) + P(+ |  good guy) P(good guy) ]

and we get:

P(bad guy | +) = 1/10,102

That is, for every positive (the NSA calls these ‘reports’) there is only a 1 in 10,102 chance (using our rough assumptions) that they’ve found a real bad guy.
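The arithmetic is easy to verify; a few lines of Python, using the rough assumptions above:

```python
# Rough assumptions from above
p_pos_given_bad = 0.99        # P(+ | bad guy)
p_pos_given_good = 0.01       # P(+ | good guy), the false positive rate
p_bad = 1.0 / 1000000         # P(bad guy): terrorists are extremely rare

# Bayes' rule
p_bad_given_pos = (p_pos_given_bad * p_bad) / (
    p_pos_given_bad * p_bad + p_pos_given_good * (1.0 - p_bad)
)

print(p_bad_given_pos)        # ~9.9e-05
print(1.0 / p_bad_given_pos)  # ~10102, i.e. 1 in 10,102
```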

UPDATE: While former NSA analyst turned whistleblower William Binney thinks this is a plausible estimate, the point here is not that this is the ‘correct probability’ involved (remember that we based our calculations on very rough assumptions). The takeaway message is simply that whenever the rate of an event of interest is extremely low, even a very accurate test will fail very often.

UPDATE 2: The Wall Street Journal’s Numbers Guy has written a piece on this in which several statisticians and security experts respond.

UPDATE 3: If you can read German, a reader reached out to let me know that Der Spiegel’s technology section picked up the story.

Big brother is always watching, but he’s still got a needle in a haystack problem.

[Image: Big Brother 11]

The television series doesn’t have this problem. On the show, they’re all bad guys.

What is probabilistic truth? Part 2 – Everything is conditional

Read Part 1

When making a statement of the form “1/2 is the correct probability that this coin will land tails”, there are a few things which are left unsaid, but which are typically implied.

The statement is one about the probability of an unknown event occurring, and it would seem reasonable to write it using probability notation as P(toss=tails) = 0.5. And indeed many people would express it this way. However, what is missing is the state of knowledge under which this statement has been made. For instance, is the coin yet to be flipped, or is it currently rolling in a circle on the table, leaning in toward its final resting position? Perhaps the flipping device can consistently throw a coin such that it rotates exactly 5 times in the air before landing flat on the table, or we know which side is up at the start of the flip. In these latter cases, the statement of probability would be made under considerably more knowledge than in the first, and would not tend to be 0.5. An observer placing a probability of P(toss=tails) = 0.99 at the moment when the coin is circling in on its resting position, leaning heavily toward a tails-up configuration, could be said to have the correct probability also. For fairness, let’s say that the first observer also makes her probability statement at the same moment, but from another room where she cannot see what has happened.

How can P(toss=tails) = 0.5, and P(toss=tails) = 0.99 be simultaneously correct?

The answer is conditioning. Each of the statements was made conditional on the observer’s state of knowledge. More completely, the two statements can be rewritten as:

P(toss=tails | knowledge of observer 1) = 0.5 , and

P(toss=tails | knowledge of observer 2) = 0.99

In practice, we often leave out the conditioning in our notation unless it is germane to the problem at hand. But there is no such thing as an unconditional probability. In fact, Harvard professor Joe Blitzstein calls conditioning the Soul of Statistics.
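You can watch both probabilities come out “correct” in a quick simulation. A sketch in Python, where the 99%-reliable “glimpse” is my stand-in for observer 2’s view of the circling coin:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000000

# Fair coin: each toss is tails with probability 0.5
tails = rng.random(n) < 0.5

# Observer 2's extra knowledge: a glimpse of the settling coin that
# points to the true outcome 99% of the time
glimpse_tails = np.where(rng.random(n) < 0.99, tails, ~tails)

# Observer 1 conditions only on the coin being fair
print(tails.mean())                  # ~0.5

# Observer 2 conditions on a glimpse suggesting tails
print(tails[glimpse_tails].mean())   # ~0.99
```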

In the next post in this series, we’ll start looking at how to assess the correctness of a (conditional) probability statement after having observed an outcome.

Here’s a bunch of random walks — just ’cause it’s neat.
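If you want to roll your own, something like this numpy/matplotlib sketch will produce a similar figure:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# 20 walks of 1,000 steps each: cumulative sums of +/-1 increments
walks = rng.choice([-1, 1], size=(20, 1000)).cumsum(axis=1)

plt.plot(walks.T, alpha=0.6)
plt.xlabel("step")
plt.ylabel("position")
plt.show()
```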

What is probabilistic truth?

I am currently working on a validation metric for binary prediction models. That is, models which make predictions about outcomes that can take on either of two possible states (e.g. dead/not dead, heads/tails, cat in picture/no cat in picture, etc.). The most commonly used metric for this class of models is AUC, which assesses the relative error rates (false positive, false negative) across the whole range of possible decision thresholds. The result is a curve that looks something like this:


The area under the curve (the curve itself is the Receiver Operator Curve, or ROC) is some value between 0 and 1. The higher this value, the better your model is said to perform. The problem with this metric, as many authors have pointed out, is that a model can perform very well in terms of AUC, but be completely miscalibrated in terms of the actual probabilities placed on each outcome.

A model which distinguishes perfectly between positive and negative cases (AUC=1) by placing a probability of 0.01 on positive cases and 0.001 on negative cases may be very far off in terms of the actual probability of a positive case. For instance, positive cases may actually occur with probability 0.6 and negative cases with 0.2. In most real situations, our models will predict a whole range of different probabilities, with a unique prediction for each data point, but the general idea remains. If your goal is simply to distinguish between cases, you may not care whether the probabilities are correct. However, if your model purports to quantify risk, then you very much want to know whether you are placing probabilistically true predictions on cases that are yet to be observed.
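To make the hypothetical concrete, here is a toy sketch in Python using the made-up numbers from the paragraph above: an oracle-like model that scores every positive case 0.01 and every negative case 0.001 gets a perfect AUC while its probabilities are wildly miscalibrated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of cases: positives actually occur with probability 0.6
# in one group and 0.2 in the other
outcome = rng.binomial(1, np.repeat([0.6, 0.2], 500))

# A model that separates positives from negatives perfectly,
# but with absurdly small probabilities
pred = np.where(outcome == 1, 0.01, 0.001)

# AUC = 1: every positive case outranks every negative case
pos, neg = pred[outcome == 1], pred[outcome == 0]
print(np.mean(pos[:, None] > neg[None, :]))   # 1.0

# Yet the predicted probabilities are nowhere near the event rate
print(pred.mean(), outcome.mean())            # ~0.005 vs ~0.4
```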

Which raises the question: What is probabilistic truth?

This question appears, at least at first, to be rather simple. A frequentist definition would say that the probability is correct, or true, if the predicted probability is equal to the long-run frequency of outcomes. Think of a die rolled over and over, counting the number of times a one is rolled. We would compare this frequency to our predicted probability of rolling a one (1/6 for a fair six-sided die) and would say that our predicted probability was true if the observed frequency matched 1/6.

But what about situations where we can’t re-run an experiment over and over again? How then would we evaluate the probabilistic truth of our predictions?

I’ll be working through this problem in a series of posts in the coming weeks. Stay tuned!

Read Part 2