IPython vs RStudio+knitr

At a meeting last night with some collaborators at the Vélobstacles project, I was excitedly told about the magic of IPython and its notebook functionality for reproducible research. This sounds familiar, I thought to myself. Using a literate programming approach to integrate computation with the communication of methodology and results has been at the core of the development of the RStudio IDE and associated tools such as knitr.

Here is Fernando Pérez speaking at PyCon Canada 2012 in Toronto about IPython for reproducible scientific computing.


This looks like convergent evolution in the R and Python communities, and I’m sure these projects can learn (and already have learned) a lot from each other.

What does R do? Bring people together, of course!

Last night we had a great meetup of the Montreal R User Group. I got things started with a little presentation asking the question “What does R do?” (slides). I made the presentation using Montreal R User Group member Ramnath Vaidyanathan’s Slidify package. Slidify allows you to generate rather handsome HTML5 slides directly from R markdown.

We were then treated to a great workshop by Etienne Low-Decarie. He gave us a flyover of some of the most powerful R packages for wrangling data, namely plyr, reshape and ggplot.

Here are Etienne’s slides.

You can also follow along with the code posted here.

We met a lot of people who are doing very cool things using R. I’m looking forward to our next meetup!

I’m no shutterbug – drop me a note if you came and have any better pictures.

We had clementines!

Also, thanks to Notman House for hosting us. The haunted house feeling wasn’t enough to scare off this hardy group of data geeks.

Did the sun just explode? The last Dutch Book you’ll ever make

In today’s XKCD, a pair of (presumably) physicists are told by their neutrino detector that the sun has gone nova. Problem is, the machine rolls two dice and if they both come up six it lies, otherwise it tells the truth.

The Frequentist reasons that the probability of obtaining this result if the sun had not, in fact, gone nova is 1/36 (0.028, p<0.05) and concludes that the sun is exploding. Goodbye cruel world.

The Bayesian sees things a little differently, and bets the Frequentist $50 that he is wrong.

Let’s set aside the obvious supremacy of the Bayesian’s position due to the fact that were he to turn out to be wrong, come the fast approaching sun-up-to-end-all-sun-ups, he would have very little use for that fifty bucks anyway.

What prior probability would we have to ascribe to the sun succumbing to cataclysmic nuclear explosion on any given night in order to take the Bayesian’s bet?

Not surprisingly, we’ll need to use Bayes!

P(S|M) = \frac{P(M|S)P(S)}{P(M|S)P(S)+P(M|\neg S)P(\neg S)}

Where M is the machine reading a solar explosion, S is the event of an actual solar explosion, and \neg S is the event of no explosion.

Assuming we are risk neutral, and we take any bet with an expected value greater than the cost, we will take the Bayesian’s bet if P(S|M)>0.5. At this cutoff, the expected value is 0.5*0+0.5*100=50 and hence the bet is worth at least the cost of entry.

The rub is that this value depends on our prior. That is, the prior probability that we ascribe to global annihilation by complete solar nuclear fusion. We can set P(S|M)=0.5 and solve for P(S) to get the threshold value for a prior that would make the bet a good one (i.e., not a Dutch book). This turns out to be:

P(S) = \frac{P(M|\neg S)}{P(M|S)+P(M|\neg S)}, where P(S|M) = 0.5

Which is ~0.0277 — the Frequentist’s p-value!
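
For anyone who wants the intermediate algebra, the threshold can be worked out in a few lines. This is a sketch that writes \neg S for "no explosion" and plugs in the detector's rates from the setup, P(M|S) = 35/36 and P(M|\neg S) = 1/36:

```latex
\begin{aligned}
\frac{P(M|S)\,P(S)}{P(M|S)\,P(S) + P(M|\neg S)\,(1 - P(S))} &= \frac{1}{2} \\
P(M|S)\,P(S) &= P(M|\neg S)\,(1 - P(S)) \\
P(S)\,\bigl[P(M|S) + P(M|\neg S)\bigr] &= P(M|\neg S) \\
P(S) &= \frac{P(M|\neg S)}{P(M|S) + P(M|\neg S)} = \frac{1/36}{35/36 + 1/36} = \frac{1}{36} \approx 0.0278
\end{aligned}
```

The denominator equals 1 only because this detector's two rates happen to sum to one, which is why the threshold lands exactly on the p-value here.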

So, assuming 1:1 payout odds on the bet, we should only take it if we already thought that there was at least a 2.8% chance that the sun would explode, before even asking the neutrino detector. From this, we can also see what odds we would be willing to take on the bet for any level of prior belief about the end of the world.

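sun_explode<-function(P_S)
{
  # The detector tells the truth unless both dice come up six
  P_MgS<-35/36   # P(M | S): says YES given an explosion
  P_MgNS<-1/36   # P(M | not S): says YES given no explosion
  P_NS<-1-P_S    # prior probability of no explosion

  # Bayes' rule for P(S | M)
  P_SgM<-(P_MgS*P_S)/(P_MgS*P_S + P_MgNS*P_NS)

  return(P_SgM)
}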

par(cex=1.3,lwd=2,mar=c(5,5,1,2))
curve(sun_explode(x),
xlim=c(0,0.1),
ylab='P(Sun Exploded | Neutrino Machine = YES)',
xlab='P(Sun Exploded) - aka your prior')

text(0.018,0.2,'No\n thanks')
text(0.07,0.6,'A good bet,\n but frightful existence')

abline(h=0.5,lty=2)
abline(v=0.0277,lty=2)
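
To make the "what odds would we take" idea concrete, here is a rough sketch that finds the break-even prior numerically and then tabulates the payout a risk-neutral bettor would need at a few priors. The function names `posterior_explode` and `break_even_payout` and the example priors are my own illustrative choices; the detector rates are the 35/36 and 1/36 used above.

```r
# Posterior probability that the sun exploded, given the detector says YES.
# Detector rates follow the setup above: it lies with probability 1/36.
posterior_explode <- function(prior, p_yes_given_s = 35/36,
                              p_yes_given_not_s = 1/36) {
  (p_yes_given_s * prior) /
    (p_yes_given_s * prior + p_yes_given_not_s * (1 - prior))
}

# Prior at which the posterior is exactly 0.5 (break-even at 1:1 odds),
# located numerically rather than algebraically.
threshold <- uniroot(function(p) posterior_explode(p) - 0.5,
                     interval = c(1e-9, 0.5), tol = 1e-12)$root

# Smallest net payout per dollar staked that makes a bet on "exploded"
# worth taking for a risk-neutral bettor: (1 - posterior) / posterior.
break_even_payout <- function(prior) {
  post <- posterior_explode(prior)
  (1 - post) / post
}

priors <- c(1e-6, 1e-3, 1/36, 0.1)  # illustrative priors only
print(threshold)
print(data.frame(prior = priors,
                 posterior = posterior_explode(priors),
                 payout_needed = break_even_payout(priors)))
```

At a prior of 1/36 the required payout is 1 per dollar staked, which is the 1:1 case discussed above; smaller priors demand steeply longer odds.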

Variable probability Bernoulli outcomes – Fast and Slow

I am working on a project that requires the generation of Bernoulli outcomes. Typically, I would go about this using the built-in sample() function like so:

sample(1:0,n,prob=c(p,1-p),replace=TRUE)

This works great and is fast, even for large n. Problem is, I want to generate each sample with its own unique probability. Seems straightforward enough, so I just wrapped the function and vectorized it to allow passing a vector of p.

binomial_sampler<-function(p){
  return(sample(1:0,1,prob=c(p,1-p)))
}
bs<-Vectorize(binomial_sampler)

Naming this function bs() turned out to be rather prophetic. Nevertheless, I can call this function by passing my unique vector of outcome probabilities. And indeed I get the result I’m looking for.

bs(my_p_vec)

Problem is, this turns out to be very slow. It would seem that there is quite a bit of overhead to calling sample() for one sample at a time. R’s RNGs are very fast for generating many iid samples, so I started thinking like my old C++ programming self and tried a different approach.

Nbs<-function(p)
{
  U<-runif(length(p),0,1)
  outcomes<-U<p
  return(outcomes)
}

I call the new version Nbs for “New Bernoulli Sampler”, or “Not Bull Shit”. And what a difference it made indeed!

library(rbenchmark)
p<-runif(1000)
res <- benchmark(bs(p), Nbs(p))
print(res)
    test replications elapsed relative user.self sys.self user.child sys.child
2 Nbs(p)          100   0.007        1     0.008    0.000          0         0
1  bs(p)          100   1.099      157     1.080    0.016          0         0

157x faster! Now that’s a speedup to write home about.
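
As a sanity check on the uniform-threshold trick, here is a small sketch comparing it with base R's rbinom(), which is vectorized over prob and so also handles one probability per draw. The rbinom() route is an alternative, not something used above, and the function names `bernoulli_unif` and `bernoulli_rbinom`, the seed, and the test probabilities are all made up for illustration.

```r
# Two vectorized Bernoulli samplers: one draw per probability in p.
# bernoulli_unif() mirrors the uniform-threshold idea, returning 0/1 integers.
bernoulli_unif <- function(p) {
  as.integer(runif(length(p)) < p)  # 1 with probability p[i], else 0
}

# rbinom() with size = 1 is vectorized over prob, so no wrapper is needed.
bernoulli_rbinom <- function(p) {
  rbinom(length(p), size = 1, prob = p)
}

set.seed(1)                    # arbitrary seed, for repeatability only
p <- c(0, 0.25, 0.5, 0.75, 1)  # illustrative probabilities
n_rep <- 20000                 # draws per probability

# Each column holds n_rep draws for one element of p.
draws_unif   <- matrix(bernoulli_unif(rep(p, each = n_rep)), nrow = n_rep)
draws_rbinom <- matrix(bernoulli_rbinom(rep(p, each = n_rep)), nrow = n_rep)

# Column means should sit close to p for both samplers.
freq_unif   <- colMeans(draws_unif)
freq_rbinom <- colMeans(draws_rbinom)
print(round(rbind(p, freq_unif, freq_rbinom), 3))
```

Note that Nbs() as written returns TRUE/FALSE rather than 1/0; wrapping the comparison in as.integer(), as in the sketch, gives the same type that bs() returns.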

Dan “The Man” Bernoulli

Montreal R User Group meetup Nov. 14th

After a bit of a summer lull, the Montreal R User Group is meeting up again! We’re trying out a new venue this time. Notman House is the home of the web in Montreal. They hold hackathons and other tech user group meetups, and they are all-around great people in an all-around great space in downtown Montreal.

Our meetup will feature R super-user Etienne Low-Decarie, who will give a walkthrough of some of the most powerful packages in R, many of which were built by rstats rock star Hadley Wickham.

I will also kick off the meetup with a short session on how R is revolutionizing data science in academia, journalism, business and beyond.

  • November 14th, 7pm at 51 Sherbrooke W.
  • BYOL&D (Bring Your Own Laptop & Data)

Don’t forget to RSVP. Hope to see you there!

Introduction to Bayesian lecture: Accompanying handouts and demos

I recently posted the slides from a guest lecture that I gave on Bayesian methods for biologists/ecologists. In an effort to promote active learning, the class was not a straightforward lecture, but rather a combination of informational input from me and opportunities for students to engage with the concepts via activities and discussion of demonstrations. These active components were designed with the goal of promoting students’ construction of knowledge, as opposed to a passive transfer from teacher to learner.

In order to bring the online reader into closer alignment with the experience of attending the class, I have decided to provide the additional materials that I used to promote active learning.

1) Monte-Carlo activity:

In pairs, students are provided with a random number sheet and a circle plot handout:

One student is the random number generator, the other is the plotter. After students plot a few points, we collect all the data and walk through a discussion of why this works. We then scale up and take a look at the same experiment using a computer simulation to see how our estimate converges toward the correct value.
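
For readers without the handouts, here is a rough computer version of the exercise. The post does not say which quantity the circle plot estimates; a common version of this activity, assumed here, estimates π from the fraction of uniformly random points in a square that land inside the inscribed circle. The seed and sample size are arbitrary.

```r
# Monte Carlo sketch: estimate pi from random points in the square [-1, 1]^2.
# Assumption: this matches the spirit of the classroom activity; the
# handouts themselves are not reproduced here.
set.seed(42)   # arbitrary seed, for repeatability only
n <- 100000    # number of simulated points

x <- runif(n, -1, 1)
y <- runif(n, -1, 1)

# A point is inside the unit circle when x^2 + y^2 <= 1.
inside <- x^2 + y^2 <= 1

# Area ratio circle/square is pi/4, so 4 * (fraction inside) estimates pi.
# The running estimate shows how the value settles as points accumulate.
pi_hat <- 4 * cumsum(inside) / seq_len(n)

checkpoints <- c(10, 100, 1000, 10000, 100000)
print(data.frame(points = checkpoints, estimate = pi_hat[checkpoints]))
```

Plotting pi_hat against the point count gives the convergence picture that the class discussion builds toward.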

2) Metropolis-Hastings in action:

In this demonstration, we walk through the steps of the MH algorithm visually.

We then use this demonstration to anchor a discussion of the choice of proposal distribution, autocorrelation, and convergence diagnosis.
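
To give a feel for what the demonstration steps through, here is a minimal random-walk Metropolis sketch. It is not the code behind the classroom demo: the standard normal target, the normal proposal, the deliberately poor starting value, the burn-in length, and the function name `metropolis` are all assumptions chosen to keep the example short.

```r
# Minimal random-walk Metropolis sampler (the symmetric-proposal special
# case of Metropolis-Hastings). Illustrative only.
metropolis <- function(log_target, n_iter, init, proposal_sd) {
  chain <- numeric(n_iter)
  current <- init
  current_lp <- log_target(current)
  n_accept <- 0
  for (i in seq_len(n_iter)) {
    # Propose a move centred on the current value.
    proposal <- rnorm(1, mean = current, sd = proposal_sd)
    proposal_lp <- log_target(proposal)
    # The proposal is symmetric, so the Hastings correction cancels and
    # the acceptance ratio is just the ratio of target densities.
    if (log(runif(1)) < proposal_lp - current_lp) {
      current <- proposal
      current_lp <- proposal_lp
      n_accept <- n_accept + 1
    }
    chain[i] <- current
  }
  list(chain = chain, accept_rate = n_accept / n_iter)
}

set.seed(123)  # arbitrary seed, for repeatability only
log_target <- function(x) dnorm(x, mean = 0, sd = 1, log = TRUE)
fit <- metropolis(log_target, n_iter = 20000, init = 5, proposal_sd = 1)

kept <- fit$chain[-(1:2000)]  # drop an assumed burn-in of 2000 draws
print(c(mean = mean(kept), sd = sd(kept), accept = fit$accept_rate))
# Lag-1 autocorrelation of the retained draws.
print(acf(kept, lag.max = 1, plot = FALSE)$acf[2])
```

Re-running with a much smaller or larger proposal_sd changes the acceptance rate and the lag-1 autocorrelation, which is a handy way into the proposal-tuning discussion.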

I hope that you find this helpful. If you are teaching this topic in your class, feel free to borrow, and improve upon, these materials. If you do, drop me a note and let me know how it went!