Did the sun just explode? The last Dutch Book you’ll ever make

In today’s XKCD, a pair of (presumably) physicists are told by their neutrino detector that the sun has gone nova. Problem is, the machine rolls two dice and if they both come up six it lies, otherwise it tells the truth.

The Frequentist reasons that the probability of obtaining this result if the sun had not, in fact, gone nova is 1/36 (≈0.028, p<0.05) and concludes that the sun is exploding. Goodbye, cruel world.

The Bayesian sees things a little differently, and bets the Frequentist $50 that he is wrong.

Let’s set aside the obvious supremacy of the Bayesian’s position due to the fact that, were he to turn out to be wrong, come the fast-approaching sun-up-to-end-all-sun-ups he would have very little use for that fifty bucks anyway.

What prior probability would we have to ascribe to the sun succumbing to cataclysmic nuclear explosion on any given night in order to take the Bayesian’s bet?

Not surprisingly, we’ll need to use Bayes!

P(S|M) = \frac{P(M|S)P(S)}{P(M|S)P(S)+P(M|\neg S)P(\neg S)}

Where M is the event that the machine reports a solar explosion and S is the event of an actual solar explosion.

Assuming we are risk neutral, and that we take any bet with an expected value at least as great as its cost, we will take the Bayesian’s bet if P(S|M)>0.5. At this cutoff, the expected payout is 0.5*$0+0.5*$100=$50, and hence the bet is worth at least its $50 cost of entry.

The rub is that this value depends on our prior: the probability that we would have ascribed, before consulting the machine, to global annihilation by cataclysmic solar explosion. We can set P(S|M)=0.5 and solve for P(S) to get the threshold value for a prior that would make the bet a good one (i.e., not a Dutch book). This turns out to be:

P(S) = \frac{P(M|\neg S)}{P(M|S)+P(M|\neg S)}, where P(S|M) = 0.5

Which is ~0.0277 — the Frequentist’s p-value!
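
For those who want the algebra spelled out: setting P(S|M) = 0.5 in Bayes’ rule above makes the numerator equal to half the denominator, which reduces to

P(M|S)P(S) = P(M|\neg S)(1-P(S))

and solving for P(S) gives

P(S) = \frac{P(M|\neg S)}{P(M|S)+P(M|\neg S)} = \frac{1/36}{35/36+1/36} = \frac{1}{36}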

So, assuming 1:1 payout odds on the bet, we should only take it if we already thought that there was at least a 2.7% chance that the sun would explode, before even asking the neutrino detector. From this, we can also see what odds we would be willing to take on the bet for any level of prior belief about the end of the world.

# Posterior probability that the sun exploded, given that the machine says YES
sun_explode <- function(P_S) {
  P_MgS  <- 35/36    # P(machine says YES | sun exploded): it told the truth
  P_MgNS <- 1/36     # P(machine says YES | no explosion): double sixes, it lied
  P_NS   <- 1 - P_S  # prior probability that the sun did not explode

  # Bayes' rule
  P_SgM <- (P_MgS * P_S) / (P_MgS * P_S + P_MgNS * P_NS)
  return(P_SgM)
}
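
As a quick sanity check, a prior sitting exactly on the 1/36 threshold should leave us right on the fence:

sun_explode(1/36)
# [1] 0.5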

# Plot the posterior as a function of the prior
par(cex = 1.3, lwd = 2, mar = c(5, 5, 1, 2))
curve(sun_explode(x),
      xlim = c(0, 0.1),
      ylab = 'P(Sun Exploded | Neutrino Machine = YES)',
      xlab = 'P(Sun Exploded) - aka your prior')

text(0.018, 0.2, 'No\n thanks')
text(0.07, 0.6, 'A good bet,\n but frightful existence')

# Dashed lines mark the break-even point: a posterior of 0.5 at a prior of ~1/36
abline(h = 0.5, lty = 2)
abline(v = 0.0277, lty = 2)

Variable probability Bernoulli outcomes – Fast and Slow

I am working on a project that requires the generation of Bernoulli outcomes. Typically, I would go about this using the built-in sample() function like so:

sample(1:0,n,prob=c(p,1-p),replace=TRUE)

This works great and is fast, even for large n. Problem is, I want to generate each sample with its own unique probability. Seems straightforward enough; I just wrapped the function and vectorized it to allow passing a vector of p.

# Draw a single Bernoulli outcome with success probability p
binomial_sampler <- function(p) {
  return(sample(1:0, 1, prob = c(p, 1 - p)))
}
bs <- Vectorize(binomial_sampler)

Naming this function bs() turned out to be rather prophetic. Nevertheless, I can call this function by passing my unique vector of outcome probabilities. And indeed I get the result I’m looking for.

bs(my_p_vec)

Problem is, this turns out to be very slow. It would seem that there is quite a bit of overhead in calling sample() one draw at a time. R’s RNGs are very fast at generating many iid samples, so I started thinking like my old C++ programming self and tried a different approach.

Nbs <- function(p) {
  # One iid uniform draw per outcome; the i-th outcome is a success
  # exactly when U[i] < p[i], which happens with probability p[i]
  U <- runif(length(p), 0, 1)
  outcomes <- U < p
  return(outcomes)
}

I call the new version Nbs for “New Bernoulli Sampler”, or “Not Bull Shit”. And what a difference it made indeed!

library(rbenchmark)
p<-runif(1000)
res <- benchmark(bs(p), Nbs(p))
print(res)
test replications elapsed relative user.self sys.self user.child sys.child
2 Nbs(p)          100   0.007        1     0.008    0.000          0         0
1  bs(p)          100   1.099      157     1.080    0.016          0         0

157x faster! Now that’s a speedup to write home about.
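
As an aside, base R can do this directly too: rbinom() recycles its prob argument across draws, so the one-liner below should be equivalent to Nbs(p) (I haven’t benchmarked it here):

# One Bernoulli draw per element of p, each with its own success probability
rbinom(length(p), size = 1, prob = p)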

Dan “The Man” Bernoulli

Montreal R User Group meetup Nov. 14th

After a bit of a summer lull, the Montreal R User Group is meeting up again! We’re trying out a new venue this time. Notman House is the home of the web in Montreal. They hold hackathons and other tech user group meetups, and they are all around great people in an all around great space in downtown Montreal.

Our meetup will feature R super-user Etienne Low-Decarie, who will give a walk-through of some of the most powerful packages in R, many of which were built by rstats rock star Hadley Wickham.

I will also kick off the meetup with a short session on how R is revolutionizing data science in academia, journalism, business and beyond.

  • November 14th, 7pm at 51 Sherbrooke W.
  • BYOL&D (Bring Your Own Laptop & Data)

Don’t forget to RSVP. Hope to see you there!

Introduction to Bayesian lecture: Accompanying handouts and demos

I recently posted the slides from a guest lecture that I gave on Bayesian methods for biologists/ecologists. In an effort to promote active learning, the class was not a straightforward lecture, but rather a combination of informational input from me and opportunities for students to engage with the concepts via activities and discussion of demonstrations. These active components were designed with the goal of promoting students’ construction of knowledge, as opposed to a passive transfer from teacher to learner.

In order to bring the online reader into closer alignment with the experience of attending the class, I have decided to provide the additional materials that I used to promote active learning.

1) Monte-Carlo activity:

In pairs, students are provided with a random number sheet and a circle plot handout:

One student is the random number generator, the other is the plotter. After students plot a few points, we collect all the data and walk through a discussion of why this works. We then scale up and take a look at the same experiment using a computer simulation to see how our estimate converges toward the correct value.
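
For readers following along at home, here is a bare-bones version of that scaled-up simulation. This is only a sketch, assuming the classic setup of estimating pi from the fraction of uniformly scattered points that land inside a circle (the variable names and plotting details are mine, not the handout’s):

# Scatter points uniformly over the square [-1, 1] x [-1, 1]
n <- 10000
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside <- x^2 + y^2 <= 1                    # TRUE when a point lands inside the unit circle
pi_hat <- 4 * cumsum(inside) / seq_len(n)   # running estimate: 4 * (fraction inside)
plot(pi_hat, type = 'l', xlab = 'Number of points', ylab = 'Estimate of pi')
abline(h = pi, lty = 2)                     # the true value, for reference

Watch how the running estimate wanders wildly at first and then settles in toward the dashed line.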

2) Metropolis-Hastings in action:

In this demonstration, we walk through the steps of the MH algorithm visually.

Discussion is then facilitated regarding the choice of proposal distribution, autocorrelation, and convergence diagnosis around this demonstration.
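
If you want to poke at the algorithm outside of class, here is a minimal random-walk Metropolis-Hastings sketch. It is not the demo code itself: it assumes a standard normal target and a Gaussian proposal, just to make the moving parts visible:

# Random-walk Metropolis-Hastings targeting a standard normal
mh_sample <- function(n_iter, proposal_sd = 1) {
  x <- numeric(n_iter)   # chain starts at 0
  for (i in 2:n_iter) {
    proposal <- rnorm(1, mean = x[i - 1], sd = proposal_sd)
    # Accept with probability min(1, target(proposal) / target(current))
    log_alpha <- dnorm(proposal, log = TRUE) - dnorm(x[i - 1], log = TRUE)
    x[i] <- if (log(runif(1)) < log_alpha) proposal else x[i - 1]
  }
  return(x)
}

chain <- mh_sample(5000)
plot(chain, type = 'l')   # trace plot: handy for eyeballing autocorrelation and convergence

Try a few values of proposal_sd: tiny steps get accepted often but crawl through the target, while huge steps are mostly rejected. That trade-off is what the discussion of proposal distributions gets at.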

I hope that you find this helpful. If you are teaching this topic in your class, feel free to borrow, and improve upon, these materials. If you do, drop me a note and let me know how it went!

Dark matter benchmarks: All over the map

The three benchmark algorithms for predicting the location of dark matter halos are, for the most part, all over the map. Most of the test skies look something like this:

There are, however, some skies with rather strong halo signals that get a decent amount of agreement:

The Lenstool MLE algorithm is the current state of the art. As such, it’s the algo to beat. As of this morning, there was only one entry on the leaderboard with a score topping this benchmark.

*cracks fingers* – Let’s see if we can give it a run for its money.

Observing Dark Worlds – Visualizing dark matter’s distorting effect on galaxies

Some people like to do crossword puzzles. I like to do machine learning puzzles.

Lucky for me, a new contest was just posted yesterday on Kaggle. So naturally, my lazy Saturday was spent getting elbow deep into the data.

The training set consists of a series of ‘skies’, each containing a bunch of galaxies. Normally, these galaxies would exhibit random ellipticity. That is, if it weren’t for all that dark matter out there! The dark matter, while itself invisible (it is dark after all), tends to aggregate and do some pretty funky stuff. These aggregations of dark matter produce massive halos which bend the heck out of spacetime itself! The result is that any galaxies behind these halos (from our perspective here on earth) appear contorted around the halo.

The tricky bit is to distinguish between the background noise in the ellipticity of galaxies, and the regular effect of the dark matter halos. How hard could it be?

Step one, as always, is to have a look at what you’re working with using some visualization.

An example of the training data. This sky has 3 dark matter halos. If you squint, you can kind of see the effect on the ellipticity of the surrounding galaxies.

If you want to try it yourself, I’ve posted the code here.
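
If you just want the gist of the visualization, here is a stripped-down sketch of the idea. It assumes the competition’s galaxy columns x, y, e1, e2 (it is not the posted code):

# Draw each galaxy as a short segment oriented along its ellipticity
plot_sky <- function(gal, scale = 1000) {
  theta <- 0.5 * atan2(gal$e2, gal$e1)   # position angle from the two ellipticity components
  e <- sqrt(gal$e1^2 + gal$e2^2)         # ellipticity magnitude
  dx <- scale * e * cos(theta)
  dy <- scale * e * sin(theta)
  plot(gal$x, gal$y, pch = 20, cex = 0.3, asp = 1, xlab = 'x', ylab = 'y')
  segments(gal$x - dx, gal$y - dy, gal$x + dx, gal$y + dy)
}

A halo should show up as a whorl of segments wrapping tangentially around a common point.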

If you don’t feel like running it yourself, here are all 300 skies from the training set.


Now for the simple matter of the predictions. Looks like Sunday will be a fun day too! Stay tuned…