Olympic body match and 1:1 BMI

In my morning attempt to read the whole internet before beginning work, I came across a program on the BBC website which allows you to see which Olympic athletes are your body doubles. Or rather, which athletes share your height and weight, and therefore your body mass index. Being a Canadian, I exist in an uneasy hybrid world of measurements, comfortable with metric units for some things and imperial for others. I know my height in cm (185), and my weight in lbs (185). This makes my Olympic ‘body double’ Elizabeth Gleadle, the Canadian javelin thrower.

Apart from discovering that I may have a body type for women’s competitive javelin, I was intrigued by the symmetry of height(cm) = weight(lbs). Over what range is this equivalence healthy? A quick googling of BMI tells me that ‘normal’ range of BMI is from 18.5 to 25. I come in at 24.5; well fed, to be sure, but in the safe zone.

Scratching this curiosity itch is nothing that can’t be handled by a quick R script. Looks like you can carry the height = weight equivalence, adding a pound for each cm of height, all the way up to about 8 feet tall (~245lbs) before you’re underweight, according to BMI.

Okay, now I should get to work.

Code for the curious.


for(i in hw)

plot(hw,BMI,type='l',xlab='Height(cm) = Weight(lbs)',main='Same height as weight BMI')
legend('topright',legend="The sweet spot\n 185cm, 185lbs\n", pch=8,col='blue',bg='white')
#underweight = <18.5
#Normal weight = 18.5–24.9
#Overweight = 25–29.9

Dynamical systems: Mapping chaos with R

Chaos. Hectic, seemingly unpredictable, complex dynamics. In a word: fun. I usually stick to the warm and fuzzy world of stochasticity and probability distributions, but this post will be (almost) entirely devoid of randomness. While chaotic dynamics are entirely deterministic, their sensitivity to initial conditions can trick the observer into seeing iid.

In ecology, chaotic dynamics can emerge from a very simple model of population.

x_{t+1} = r x_t(1-x_t)

Where the population in time-step t+1 is dependent on the population at time step t, and some intrinsic rate of growth, r. This is known as the logistic (or quadratic) map. For any starting value of x at t0, the entire evolution of the system can be computed exactly. However, there some values of r for which the system will diverge substantially with even a very slight change in the initial position.

We can see the behaviour of this model by simply plotting the time series of population sizes. Another, and particularly instructive way of visualizing the dynamics, is through the use of a cobweb plot. In this representation, we can see how the population x at time t maps to population x at time t+1 by reflecting through the 1:1 line. Each representation is plotted here:

You can plot realizations of the system using the following R script.

############# Trace #############
for(i in 2:N)


##########  Quadradic Map ########
for(i in 1:length(x))


lines(x=c(start,start),y=c(0,r*start*(1-start)) )
for(i in 1:(2*N))
lines(x=c(start,start),y=c(start,r*start*(1-start)) )
r*start*(1-start)) )

To use, simply call the function with any value of r, and a starting position between 0 an 1.


Fun right?

Now that you’ve tried a few different values of r at a few starting positions, it’s time to look a little closer at what ranges of r values produce chaotic behaviour, which result in stable orbits, and which lead to dampening oscillations toward fixed points. There is a rigorous mathematics behind this kind of analysis of dynamic systems, but we’re just going to do some numerical experimentation using trusty R and a bit of cpu time.

To do this, we’ll need to iterate across a range of r values, and at each one start a dynamical system with a random starting point (told you there would be some randomness in this post). After some large number of time-steps, we’ll record where the system ended up. Plotting the results, we can see a series of period doubling (2,4,8, etc) bifurcations interspersed with regions of chaotic behaviour.

for(i in 1:res)


for(i in 2:N)




#warning: Even in parallel with 4 cores, this is by no means fast code!
plot(bi,col='green',xlab='R',ylab='n --> inf',main='',pch=15,cex=0.2)

This plot is known as a bifurcation diagram and is likely a familiar sight.

Hopefully working through the R code and running it yourself will help you interpret cobweb plots, as well as bifurcation diagrams. It is really quite amazing how the simple looking logistic map equation can lead to such interesting behaviour.

R Workshop: Introducing Slidify – HTML5 slides from R markdown

Thursday, June 28th, 2012  19h. <–  new evening time!

Tomson House: 650 McTavish, H3A 1Y2, Montréal, QC <– new social setting!

guRu: Ramnath Vaidyanathan (McGill University)

Ramnath Vaidyanathan will introduce the group to slidify, his brand new R package.

From the slidify website:

“The objective of slidify is to make it easy to create reproducible HTML5 presentations from .Rmd files. The guiding philosophy of slidify is to completely separate writing of content from its rendering, so that content can be written once in R Markdown, and rendered as an HTML5 presentation using any of the HTML5 slide frameworks supported.”

The package is currently in alpha and therefore not yet available on cran. You can find install instructions here.

Ramnath Vaidyanathan is Assistant Professor of Operations Management at McGill University’s Desautels Faculty of Management.

This is a meeting of the Montreal R Users Group. We’re open to everyone! Sign up to RSVP!

Simulating Euro 2012

Why settle for just one realisation of this year’s UEFA Euro when you can let the tournament play out 10,000 times in silico?

Since I already had some code lying around from my submission to the Kaggle hosted 2010 Take on the Quants challenge, I figured I’d recycle it for the Euro this year. The model takes a simulation based approach, using Poisson distributed goals. The rate parameters in a given match are determined by the relative strength of each team as given by their ELO rating.

The advantage of this approach is that the tournament structure (which teams are assigned to which groups, how points are assigned, quarter final structure, etc) can be accounted for. If we wanted to take this into account when making probabilistic forecasts of tournament outcomes, we would need to calculate many conditional probability statements, following the extremely large number of possible outcomes. However, the downside is that it is only as reliable as the team ratings themselves. In my submission to Kaggle, I used a weighted average of ELO and FIFA ratings.

After simulating the tournament 10,000 times, the probability of victory for each team is just the number of times that team arose victorious divided by 10,000.

Feel free to get the code here and play around with it yourself. You can use your own rating system and see how the predicted outcomes change. If you are of a certain disposition, you might even find it more fun than the human version of the tournament itself!

Distribution of Oft-Used Bash Commands

Browsing commandlinefu.com today, I came across this little one-liner to display which commands I use most often.

$ history | awk '{a[$2]++}END{for(i in a){print a[i] " " i}}' \
| sort -rn | head

Here’s what I got:

283 ls
236 cd
52 cat
40 vim
36 sudo
27 ssh
27 rm
23 git
21 screen
21 R

Yep, seems legit. I navigate and look at files a whole bunch (ls, cd, cat), and I do a butt tonne of editing (vim). I sudo like a boss, hop onto various servers (ssh), clean up after myself (rm), commit commit commit (git), tuck away interactive sessions for later (screen), and of course, do mad stats (R).

Now, my bash.rc set up is set to save the 1000 most recent commands. Given that it is Friday afternoon and I’m avoiding real work while waiting for the softball game to start, I thought I’d have a look at my whole usage distribution. So, lets just collect it up into a file comme ca:

$history | awk '{a[$2]++}END{for(i in a){print a[i] " " i}}' \
| sort -rn > cmd_hist.txt

Then crack open an R session and have a look:


Cool. Looks like my command usage pattern is roughly power-law distributed! Now, to . publish . my findings . in . Nature!

The R bits:


More Bixi Data Visualization

I mentioned in a previous post that our team at the recent Hack/Reduce hackathon had some fun with a data set which consisted of Bixi station states at minute level temporal resolution. In addition to pulling out and plotting the flux at each station on an hourly basis, we also plotted the system state (number of bikes at each station) at each time-step we had. This totalled to 24,217 individual plots. Each plot was generated using an R script which took in the system state at each time-step, and output a png.

Team member Kamal Marhubi also did some nice post-processing to overlay the information on a map. The results are a little mesmerising. Things don’t get fun until about 40s into the video, as the first part mostly just shows the stations coming online for the first part of the season.

And for the non-Montrealers out there, here’s an image of a Bixi bike; our durable, data generating little hero.

Heartbeat of a Cycling City: Update

I recently posted about some Bixi data our group analysed at the Hack/Reduce Montreal 2 event. One of the observations I made was that it seemed as though the evening rush was generally stronger than the morning rush. This seemed to be true at least for the week of April 11th to 14th. I even speculated that this was because riding home might be a great way to relax after work. Reader Joey Berger contacted me with an alternative take on this:

You were surprised (as I was) by this, in part I think because downtown is downhill for a lot of Bixi users. When I used to commute downtown I always rode in the morning and took the bus in the evening.
Anyway, I got curious so I looked up some Environment Canada data.

As you can see, there wasn’t much rain to affect bike use, but April mornings are a lot cooler than April evenings. I suspect two things.  First, below about 12 degrees, riding a Bixi isn’t as comfortable as it needs to be for mass use. Especially if it’s windy and you don’t have gloves on. Second, I assume there’s a lot more downtown traffic in the evening, especially among pedestrians/bikers who are both commuting from work and entering downtown for dinner, movies, etc.
Keep the comments coming and check back here for an analysis of Bixi traffic during the STM outage on Thursday morning.