From Whale Calls to Dark Matter: Competitive Data Science with R and Python

Back in June I gave a fun talk at Montreal Python on some of my dabbling in the competitive data science scene. The good people at Savoir-faire Linux recorded the talk and have edited it all together into a pretty slick video. If you can spare twenty minutes or so, have a look.

If you want the slides, head on over to my speakerdeck page.

The Professor, the Bikini Model and the 5 Sigma Mistake

Today in The New York Times Magazine, Maxine Swann tells the curious story of Paul Frampton, a 68-year-old theoretical particle physicist who was apparently duped into becoming a drug mule by a bikini model he met online. The story is a fascinating tale of a giant academic ego and the seemingly infinite gullibility of this scientist.

One thing in particular stood out for me. During the trial, Frampton was asked about several notes and calculations that were found on him when he was arrested. He had jotted: “5 standard deviations 99.99994%”, which he explained in court to be the criterion for the discovery of the Higgs boson, a result that is unlikely to occur by chance. He further explained that he was “calculating the probability that Denise Milani would become my second wife, which was almost a certainty.” Apparently, he took the messages and love notes that he had exchanged online with the purported ‘Milani’ to be strong evidence that she loved him. Under the null hypothesis (she doesn’t love me), these behaviours would have been very unlikely indeed.
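For the curious, the 99.99994% figure can be reproduced in a couple of lines. This is a minimal sketch that assumes a standard normal distribution and a two-sided interval; the function name `two_sided_coverage` is mine, not anything from the trial notes. (Particle physicists usually quote the one-sided tail for a 5 sigma discovery, which is half the two-sided p-value computed here.)

```python
import math

def two_sided_coverage(n_sigma):
    """Probability that a standard normal draw falls within +/- n_sigma of the mean."""
    return math.erf(n_sigma / math.sqrt(2))

coverage = two_sided_coverage(5)
# Complementary probability: a draw at least 5 sigma from the mean, in either direction.
p_value = math.erfc(5 / math.sqrt(2))

print(f"coverage = {coverage:.5%}")   # coverage = 99.99994%
print(f"p-value  = {p_value:.3e}")
```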

Aside from committing the p-value fallacy, what else is wrong with Frampton’s logic?

The fact that Frampton was being set up was immediately obvious to his friend, who warned him about what was up in no uncertain terms. Most of us would have used all the information available to us to draw a conclusion. How often do young bikini models fall for older professors with a poor relationship track record, for instance? However, Frampton chose to base his inference on only a select set of observations. Had he incorporated prior information, or updated his beliefs as new evidence became available, he might have been able to avoid his 5 sigma mistake, and the nearly 5 years of jail time to which he was sentenced for it.
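To make the point about prior information concrete, here is a rough sketch of Bayes' rule applied to the situation. Every number in it is a made-up assumption chosen purely for illustration (the prior, and how likely love notes are under each hypothesis), and the names are hypothetical. The key observation is that love notes are also quite likely if the whole thing is a scam, so they barely move a tiny prior.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Bayes' rule for a single hypothesis H versus its complement."""
    numerator = prior * p_evidence_given_h
    return numerator / (numerator + (1 - prior) * p_evidence_given_not_h)

# All values below are illustrative assumptions, not estimates from the article.
prior_love = 1e-6        # prior probability that the model genuinely loves him
p_notes_if_love = 0.9    # chance of receiving love notes if she does
p_notes_if_not = 0.5     # chance of receiving love notes otherwise (e.g. a scam)

post = posterior(prior_love, p_notes_if_love, p_notes_if_not)
print(f"posterior probability of love: {post:.2e}")
```

Under these assumed numbers the posterior stays within a factor of two of the prior, which is a long way from "almost a certainty".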

Dark matter top 10, but an hour too late

Well, that’s embarrassing. A little tweak to my dark matter model resulted in a leaderboard score in the top 10. The only problem is that the contest closed about an hour ago.

I ran this final prediction earlier today but then simply forgot to go back to it and submit!! On the bright side, I learned a lot of really interesting things about gravitational lensing and had a tonne of fun doing it. I’ll probably write a post-mortem sometime in the next few days, but for now I’m just kicking myself.

Simulating weak gravitational lensing

In the search for dark matter, I have been having mixed success. It seems that locating DM in single-halo skies is a fairly straightforward problem. However, when there is more than one halo, things get quite a bit trickier.

As I have advocated many times before, including here and here, simulation can provide deep insights into many (if not all) problems. I never trust my own understanding of a complicated problem until I have simulated from a hypothesized model of that problem. Once I have a simulation in place, I can then test out all kinds of hypotheses about the system by manipulating the component parts. I think of this process as a kind of computer-assisted set of thought experiments.

So, when I was hitting a wall with the dark matter challenge, I of course turned to simulation for insights. Normally this would have been my very first step; in this case, however, my understanding of the physics involved was insufficient when I started out. After having done a bit of reading on the topic, I built a model which implements a weak lensing regime on an otherwise random background of galaxies. The model assumes an Einasto profile of dark matter mass density, with parameters A and α determining the strength of the tangential shearing caused by foreground dark matter.

A=0.2, alpha=0.5

I can then increase the strength of the lens by either increasing the mass of the dark matter, or by varying the parameters of the Einasto profile.

A=0.059, alpha=0.5

A=0.03, alpha=0.5

You can check out this visualization over a range of A values.
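To give a flavour of what a simulation along these lines might look like, here is a minimal Python sketch. It is not the simulation used for the figures above. The shear profile g(r) = exp(-A r^α), the 4200-unit sky, the Gaussian intrinsic ellipticities and all function names are assumptions made for illustration. Under this assumed form, a smaller A gives a shear that decays more slowly with distance from the halo.

```python
import numpy as np

def simulate_sky(n_galaxies, halo_xy, A, alpha, size=4200.0, noise=0.2, seed=0):
    """Random galaxies with an assumed Einasto-like tangential shear around one halo."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, size, n_galaxies)
    y = rng.uniform(0, size, n_galaxies)
    # Intrinsic (unlensed) ellipticity components: pure noise.
    e1 = rng.normal(0, noise, n_galaxies)
    e2 = rng.normal(0, noise, n_galaxies)
    dx, dy = x - halo_xy[0], y - halo_xy[1]
    r = np.hypot(dx, dy)
    phi = np.arctan2(dy, dx)
    g = np.exp(-A * r**alpha)      # assumed shear magnitude profile
    # Add a shear that aligns each galaxy tangentially to the halo.
    e1 = e1 - g * np.cos(2 * phi)
    e2 = e2 - g * np.sin(2 * phi)
    return x, y, e1, e2

def tangential_ellipticity(x, y, e1, e2, centre):
    """Ellipticity component tangential to the direction from `centre`."""
    phi = np.arctan2(y - centre[1], x - centre[0])
    return -(e1 * np.cos(2 * phi) + e2 * np.sin(2 * phi))

halo = (2100.0, 2100.0)
for A in (0.2, 0.059, 0.03):
    x, y, e1, e2 = simulate_sky(500, halo, A=A, alpha=0.5)
    et = tangential_ellipticity(x, y, e1, e2, halo)
    print(f"A={A}: mean tangential ellipticity about the halo = {et.mean():.3f}")
```

The mean tangential ellipticity about a candidate position is a convenient summary because the random intrinsic ellipticities average out, leaving mostly the lensing signal.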

I can also see how two halos interact in terms of the induced tangential ellipticity profile by simulating two halos and then moving them closer to one another.

You can see the effect here. You get the idea: I can also try out any combination of configurations, shapes, and strengths of interacting halos. I can then analyse the characteristics of the resulting observables (in this case, galaxy locations and ellipticities) in order to build a better predictive model.
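The two-halo experiment can be sketched in the same spirit. Again, this is an illustration rather than the actual simulation: it assumes the same exp(-A r^α) shear profile, no intrinsic ellipticity noise, and that the shears from the two halos simply add, which is a reasonable approximation only in the weak lensing regime. All names and default values are hypothetical.

```python
import numpy as np

def halo_shear(x, y, halo_xy, A, alpha):
    """Tangential shear components (e1, e2) induced by one halo (assumed profile)."""
    dx, dy = x - halo_xy[0], y - halo_xy[1]
    r = np.hypot(dx, dy)
    phi = np.arctan2(dy, dx)
    g = np.exp(-A * r**alpha)
    return -g * np.cos(2 * phi), -g * np.sin(2 * phi)

def two_halo_signal(separation, A=0.059, alpha=0.5, n=2000, size=4200.0, seed=1):
    """Mean tangential ellipticity about halo 1 with halo 2 `separation` units away."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0, size, n)
    y = rng.uniform(0, size, n)
    c = size / 2
    h1 = (c - separation / 2, c)
    h2 = (c + separation / 2, c)
    e1a, e2a = halo_shear(x, y, h1, A, alpha)
    e1b, e2b = halo_shear(x, y, h2, A, alpha)
    e1, e2 = e1a + e1b, e2a + e2b   # assumed linear superposition
    phi = np.arctan2(y - h1[1], x - h1[0])
    return np.mean(-(e1 * np.cos(2 * phi) + e2 * np.sin(2 * phi)))

for sep in (0.0, 500.0, 1000.0, 2000.0):
    print(f"separation={sep:6.0f}: signal about halo 1 = {two_halo_signal(sep):.3f}")
```

Sweeping the separation like this shows how the second halo distorts the tangential signal measured around the first one.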

Unfortunately, since this is a competition with cold hard cash on the line, I am not releasing the source for this simulation at this time. I will, however, open source the whole thing when the competition ends.