CAISN

Reblogged from Zero to R Hero:

Click to visit the original post

Canadian Aquatic Invasive Species Networks Annual General Meeting in Kananaskis, Alberta. May 03, 3:25-5:30.

This 2-hour workshop will focus on how and why we do numerical simulation in R. Time permitting, we will also look at how to build and fit likelihood based statistical models.

We ask that you bring your laptop with both R and R-Studio installed. If you've never worked with R before, please have a look at the…

Read more… 94 more words

I can think of worse places to get down and dirty with R than Kananaskis, Alberta.

Mathematical abstraction and the robustness to assumptions

I’ve been showing my new favourite toys to just about anyone foolish enough to actually engage me in conversation. I described how my shiny new set of non-transitive dice work here, complete with a map showing all the relevant probabilities.

All was neat and tidy and wonderful until fellow ecologist, Aaron Ball, tried to burst my bubble.

Nope. I couldn’t find the error. Fortunately, he works across the hall so I just went and asked him.

The problem he found, it turns out, was not with my calculations but with my assumptions. Aaron told me that dice constructed with rounded corners and hollowed out pips for the numbers on the faces tend to be biased in the frequency at which each face rolls up. I had assumed, of course, that each side of each of the five dice would roll with the same probability (ie. 1 in 6).

As with any model of a real world system, the mathematics were carried out on a simplified abstraction of the system being modelled. There are always, by necessity, assumptions being made. The important thing is to make these assumptions as explicit as possible and, where possible, to test the robustness of the model predictions to violations of the assumptions. Implicit to my calculations of the odds of the non-transitive Grime dice was the assumption that the dice are fair.

To check the model for robustness to this assumption, we can relax it and find out if we still get the same behaviour. Specifically, we can ask here whether some sort of pip-and-rounded-corner-induced bias can lead to a change in the Grime dice non-transitive cycles.

It seems a natural place to look would be between the dice pairings which have the closest to even odds. We can find out what level of bias would be required to switch the directionality of the odds (or at least erase the tendency for one die to roll higher than the other). Lets try looking at Magenta and Red, which under the fair dice assumption have odds p(Magenta > Red)=5/9. What kind of bias will change this relationship? The odds can be evened out by either Magenta rolling ones more often, or red rolling nine more often. The question is then, how much bias would there need to in the dice in order to even out the odds between Magenta and Red?

Lets start with Red biasing toward rolling nine more often (recall that nine appears on only one face). Under the fair dice hypothesis, Red can roll nine (1/6 of the time) and win no matter what Magenta rolls, or by rolling four (5/6 of the time) and win when Magenta rolls one (1/3 of the time).

P(Red > Magenta) = 1/6 + 5/6 * 1/3, which is 4/9.

If we set this probability equal to 1/2, and replace the fraction of times that Red rolls nine with x, we can solve for the frequency needed to even the odds.

x + 5/6 * 1/3 = 1/2

x = 2/9

Meaning that the Red die would have to be biased toward rolling nine with 2/9 odds. That’s equivalent to rolling a nine 1 and 1/3 times (33%) more often than you would expect if the die were fair!

Alternatively, the other way the odds between Red and Magenta could be evened is if Magenta biased towards rolling ones more often. We can do the same kind of calculation as above to figure out how much bias would be needed.

1/6 + 5/6 * x = 1/2

x = 2/5

Which corresponds to Magenta having  a 20% bias toward rolling ones. Of course, some combination of these biases could also be possible.

I leave it to the reader to work out the other pairings, but from the Red-Magenta analysis we can see that even if the dice deviated quite a bit from the expected 1/6 probability for each side, the edge afforded to Magenta is retained. I couldn’t find any convincing  evidence for the extent of bias caused by pipping and rounded corners but it seems unlikely that it would be strong enough to change the structure of the game.

A quick guide to non-transitive Grime Dice

A very special package that I am rather excited about arrived in the mail recently. The package contained a set of 6-sided dice. These dice, however, don’t have the standard numbers one to six on their faces. Instead, they have assorted numbers between zero and nine. Here’s the exact configuration:

red<-c(4,4,4,4,4,9)
blue<-c(2,2,2,7,7,7)
olive<-c(0,5,5,5,5,5)
yellow<-c(3,3,3,3,8,8)
magenta<-c(1,1,6,6,6,6)

Aside from maybe making for a more interesting version of snakes and ladders, why the heck am I so excited about these wacky dice? To find out what makes them so interesting, lets start by just rolling one against another and seeing which one rolls the higher number. Simple enough. Lets roll Red against Blue. Until you get your own set, you can roll in silico.

That was fun. We can do it over and over again and we’ll find that Red beats Blue more often than not. So it seems like Red is a pretty good bet. Now lets try rolling Olive against Red. I’ll wait.

Hey, look at that, the mighty Red has fallen. Olive tends to roll a higher number than Red more often than it doesn’t. So far, we have discovered this relationship:

Olive > Red > Blue

All hail the dominant Olive! Out of these three dice, if we want the best chance of winning, we should always pick Olive right? No dice, as they say. When we roll Olive against Blue, we find that Blue wins more often!

For any one of these three dice, there is another that will roll a higher number more often than not.

Olive > Red > Blue > Olive > Red > Blue > Olive > Red > Blue..

This forms a chain of dominance relationships that is a closed cycle. This property is called intransivity, and you can use it to win riches beyond your wildest dreams, er, well, at least to impress your friends.

Neat, right? But there’s more! We can do the same trick with Yellow, Magenta, and Red (Red > Magenta > Yellow > Red > …). With all five dice, there is a chain for which the order is given by that length of the word for each colour.

Red > Blue > Olive > Yellow > Magenta > …

Awesome. But that’s not it, either! You may have noticed from our three way comparisons that there is another five way chain. This time, the chain order is given by the alphabetical order of the words for each of the colours.

Blue > Magenta > Olive > Red > Yellow > …

What are the odds?

So far I’ve just asked you to take my word for it that the dominance relationships are as I described. Working out the odds of winning for any given pairing of dice as actually quite straightforward. Start by looking at the number on each side of the first die, one at a time. Count how many sides on the opposing die are less than the current number and divide by six. Since each side on the first die has a 1/6 chance of appearing, divide by 6 again. Sum these values for all six sides and you will have the probability that the first die will roll a higher number than the second.

For example, P(Red > Blue) = 5/6 x 1/2 + 1/6, which is 7/12.

Here I’ve worked out all of the pairwise odds:

Grime_dice

So, you can always win in this game as long as you get to be second to choose a colour. The odds are strongest in your favour when your opponent either chooses Magenta or Red, and you choose Olive or Yellow, respectively. Isn’t probability wonderful!

And if you still want more, it turns out that if you roll the Grime dice in pairs, the order of the word length chain reverses!

Open Data Exchange 2013, April 6. Montreal

UPDATE: The day was great! There are many people doing really amazing things with open data and it was amazing to meet them. Here are my slides from the panel talk.
Next Saturday, I’ll be sitting on a panel discussing future avenues for open data at ODX13.
From the odx13 site:

Odx13 is a mini-conference to discuss the successes and challenges of extracting value from Open Data for civic engagement, international aid transparency, scientific research, and more!

Program

Morning Session – Open Data Stories; Panel Discussions

9:00 AM    Introduction and Welcome

9:15 AM    Winning with Open Data – Panel 1

10:10 AM    Les données ouvertes en action - Panel 2 (en français)

  • Guillaume Ducharme, gestionnaire dans le réseau de la santé et membre du collectif Démocratie Ouverte
  • Sébastien Pierre, fondateur, FFunction & Montréal Ouvert
  • Josée Plamondon, co-conceptrice, ContratsNet
  • Jean-Noé Landry (l’animateur de discussion), fondateur, Montréal Ouvert et Québec Ouvert

11:05 AM    Future Avenues for Open Data – Panel 3

12:00 PM    Lunch will be provided

Afternoon Session – Digging into Data; Workshop and Lightning Talks

1:00 PM    Data Dive Intro – Exploratory Data Analysis with Trudat

1:30 PM    Data Dive

We will dive into interesting Open Data sets with experts on hand to guide us through the weeds, including data on

  • International Aid
  • Government contracts
  • Biodiversity
  • and more…

3:00 PM   Lightning Talks

4:00 PM    Present data insights

4:45 PM    Closing remarks

Introduction to Simulation using R

We had a great turnout yesterday for our Zero to R Hero workshop at the Quebec Centre for Biodiversity Science. We went from the absolute basics of the command line, to the intricacies of importing data, and finally we had a look at plotting using ggplot2. We didn’t have time to get to this extra module introducing simulation, but if you want to work through it on your own you can find the slides here:

intro_sim

The script file to follow along with is here:

https://gist.github.com/cjbayesian/5220711

The Gambling Machine Puzzle

This puzzle came up in the New York Times Number Play blog. It goes like this:

An entrepreneur has devised a gambling machine that chooses two independent random variables x and y that are uniformly and independently distributed between 0 and 100. He plans to tell any customer the value of x and to ask him whether y > x or x > y.

If the customer guesses correctly, he is given y dollars. If x = y, he’s given y/2 dollars. And if he’s wrong about which is larger, he’s given nothing.

The entrepreneur plans to charge his customers $40 for the privilege of playing the game. Would you play?

I figured I’d give it a go.  Since I was feeling lazy, and already had my computer in front of me, I thought that I’d do it via simulation rather than working out the exact maths. I tried playing the game with the first strategy that came to mind. If x<50, I would choose y>x, and if x>50, I’d choose y<x, figuring I’d maximize my probability of winning something rather than nothing. This was probably due to the inherent risk aversion of system one. Let’s see how that works out:

N<-100000
x<-sample.int(100,N,replace=TRUE)
y<-sample.int(100,N,replace=TRUE)
dec_rule=50
payout<-numeric(N)
for(i in 1:N)
{
## Correct Guess (playing simple max p(!0) strategy)
if( (x[i]>dec_rule & y[i]<x[i]) | (x[i]<=dec_rule & y[i]>x[i]) )
payout[i]<-y[i]

## Incorrect Guess (playing simple max p(!0) strategy)
if( (x[i]>dec_rule & y[i]>x[i]) | (x[i]<=dec_rule & y[i]<x[i]) )
payout[i]<-0

## Tie pays out y/2
if(x[i] == y[i])
payout[i]<-y[i]/2
}
## Expected Payout ##
print(paste(dec_rule,mean(payout)))

Which leads to an expected payout of $37.75. Playing the risk averse strategy leads to an expected value less than the cost of admission, loosing on average 25 cents per play. No deal, Mr entrepreneur, I had something else in mind for my forty bucks anyway.

Lets try alternate strategies, and see if we can’t play in such a way as to improve our outlook.

## Gambling Machine Puzzle ##
## Puzzle presented in http://wordplay.blogs.nytimes.com/2013/03/04/machine/

result<-numeric(100)
for(dec_rule in 1:100)
{
N<-10000
x<-sample.int(100,N,replace=TRUE)
y<-sample.int(100,N,replace=TRUE)

payout<-numeric(N)

for(i in 1:N)
{
## Correct Guess (playing dec_rule strategy)
if( (x[i]>dec_rule & y[i]<x[i]) | (x[i]<=dec_rule & y[i]>x[i]) )
payout[i]<-y[i]

## Incorrect Guess (playing dec_rule strategy)
if( (x[i]>dec_rule & y[i]>x[i]) | (x[i]<=dec_rule & y[i]<x[i]) )
payout[i]<-0

## Tie pays out y/2
if(x[i] == y[i])
payout[i]<-y[i]/2
}

## Expected Payout ##
print(paste(dec_rule,mean(payout)))
result[dec_rule]<-mean(payout)
}
par(cex=1.5)
plot(result,xlab='Decision rule',ylab='E(payout)',pch=20)

abline(v=which.max(result))
abline(h=max(result))
abline(h=40,lty=3)

gmachine

According to which, the best case scenario is an expected payout of $40.66, or an expected net of 66 cents per bet, if you were to play the strategy of choosing y>x for any x<73 and y<x for any x>73. You’re on, Mr entrepreneur!

To calculate the exactly optimal strategy and expected payout, we would need to compute the derivative of the expected payout function with respect to the within game decision threshold. I leave this fun stuff to the reader ;)

Montreal R User group meetup at Wajam

This Thursday (Jan 24th), 5:30pm, the good folks at Wajam are hosting a meetup of the Montreal R User Group.

The event will be at Bolidea at 4115 St Laurent, Montréal, QC. Be sure to RSVP.

From Benjamin Rollert:

This is an opportunity for people interested in R to hang out at our office, eat pizza and drink beer! We’ll also show some of the cool stuff we’ve done with R as part of live applications for our business intelligence.

Hope to see you there!

http://photos1.meetupstatic.com/photos/event/d/c/e/8/highres_101276552.jpeg

Dark matter top 10, but an hour too late

Well, that’s embarrassing. A little tweak to my dark matter model resulted in a leaderboard score in the top 10. The only problem is that the contest closed about an hour ago.

top_10

I ran this final prediction earlier today but then simply forgot to go back to it and submit!! On the bright side, I learned a lot of really interesting things about gravitational lensing and had a tonne of fun doing it. I’ll probably write a post-mortem sometime in the next few days, but for now I’m just kicking myself.

Simulating weak gravitational lensing

In the search for dark matter, I have been having mixed success. It seems that locating DM in single halo skies is a fairly straightforward problem. However, when there are more than one halo, things get quite a bit trickier.

As I have advocated many times before, including here and here, simulation can provide deep insights into many (if not all) problems. I never trust my own understanding of a complicated problem until I have simulated from a hypothesized model of that problem. Once I have a simulation in place, I can then test out all kinds of hypotheses about the system by manipulating the component parts. I think of this process as a kind of computer-assisted set of thought experiments.

So, when I was hitting a wall with the dark matter challenge, I of course turned to simulation for insights. Normally this would have been my very first step, however in this case my level of understanding of the physics involved was insufficient when I started out. After having done a bit of reading on the topic, I built a model which implements a weak lensing regime on an otherwise random background of galaxies. The model assumes an Einasto profile of dark matter mass density, with parameters A and α determining the strength of the tangential shearing caused by foreground dark matter.

A=0.2, alpha=0.5

I can then increase the strength of the lens by either increasing the mass of the dark matter, or by varying the parameters of the Einasto profile.

A=0.059, alpha=0.5

A=0.03, alpha=0.5

You can check out this visualization over a range of A values.

I can also see how two halos interact in terms of the induced tangential ellipticity profile by simulating two halos and then moving them closer to one another.

You can see the effect here. You get the idea – I can also try out any combination of configurations, shapes, and strengths of interacting halos. I can then analyse the characteristics of the resulting observable factors (in this case, galaxy location and ellipticities) in order to build better a predictive model.

Unfortunately, since this is a competition with cold hard cash on the line, I am not releasing the source for this simulation at this time. I will, however, open source the whole thing when the competition ends.

IPython vs RStudio+knitr

At a meeting last night with some collaborators at the Vélobstacles project, I was excitedly told about the magic of IPython and it’s notebook functionality for reproducible research. This sounds familiar, I thought to myself. Using a literate programming approach to integrate computation with the communication of methodology and results has been at the core of the development of the RStudio IDE and associated tools such as knitr.

Here is Fernando Pérez speaking at PyCon Canada 2012 in Toronto about IPython for reproducible scientific computing.


This looks like convergent evolution in the R and Python communities, and I’m sure these projects can (and have already) learn a lot from each other.