Time-series forecasting: Bike Accidents

About a year ago I posted this video visualization of all the reported accidents involving bicycles in Montreal between 2006 and 2010. In the process I also calculated and plotted the accident rate using a monthly moving average. The results followed a pattern that was, for the most part, to be expected: the rate shoots up in the spring and declines to only a handful of accidents per month during the winter.

It’s now 2013 and unfortunately our data ends in 2010. However, the pattern does seem to be quite regular (that is, it exhibits annual periodicity), so I decided to have a go at forecasting the time series for the missing years. I used a seasonal decomposition of time series by LOESS (STL) to accomplish this.
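If you'd like to try this yourself, a minimal sketch of the approach in R looks something like the following (this is not the original code; it assumes a numeric vector, here called collisions, holding the monthly accident counts from January 2006 through December 2010):

    # Assumed input: 'collisions' is a numeric vector of monthly accident counts
    collisions_ts <- ts(collisions, start = c(2006, 1), frequency = 12)

    # Seasonal decomposition of the time series by LOESS (STL)
    decomp <- stl(collisions_ts, s.window = "periodic")
    plot(decomp)  # the data plus the seasonal, trend, and remainder components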

You can see the code on GitHub, but here are the results. First, I looked at the four components of the decomposition:

[Figure: decomp_collisions]

Indeed the seasonal component is quite regular and does contain the intriguing dip in the middle of the summer that I mentioned in the first post.

[Figure: seasonal_collisions]

This figure shows just the seasonal deviation from the average rates. The peaks seem to be in early July and again in late September. Before doing any seasonal aggregation, I thought that the mid-summer dip might correspond with the mid-August construction holiday; however, it now looks like a broader, summer-long reprieve. It could be a population-wide vacation effect.

Finally, I used an exponential smoothing model to project the accident rates into the 2011-2013 seasons.
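Roughly speaking (the original code may differ in its details), this can be done in R with the forecast package:

    # Hedged sketch: fit an exponential smoothing model and project 36 months ahead
    library(forecast)

    fit <- ets(collisions_ts)     # automatically selects an exponential smoothing model
    fc  <- forecast(fit, h = 36)  # 2011 through 2013, with prediction intervals
    plot(fc)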

[Figure: forecast_collisions]

It would be great to get the data from these years to validate the forecast, but for now let’s just hope that we’re not pushing up against those upper confidence bounds.

From Whale Calls to Dark Matter: Competitive Data Science with R and Python

Back in June I gave a fun talk at Montreal Python on some of my dabbling in the competitive data science scene. The good people at Savoir-faire Linux recorded the talk and have edited it all together into a pretty slick video. If you can spare twenty minutes or so, have a look.

If you want the slides, head on over to my speakerdeck page.

[Image: whaledarkmattercover]

Uncertainty matters

In a post I wrote earlier this year, I noted a sentiment expressed in The Economist about understanding and embracing uncertainty.

…recent reforms to the IPCC’s procedures will do little to change its tendency to focus on the areas where there is greater consensus, avoiding the uncertainties which, though unpalatable for scientists, are important to policy. (link)

This struck me as contrary to the way we, as scientists, speak among ourselves about policy makers: specifically, that it is they who fear and misunderstand the implications of uncertainty.

This same perception has led the group Sense About Science to launch, today, a publication titled Making Sense of Uncertainty: Why uncertainty is part of science.

Launching a guide to Making Sense of Uncertainty at the World Conference of Science Journalists today, researchers working in some of the most significant, cutting edge fields say that if policy makers and the public are discouraged by the existence of uncertainty, we miss out on important discussions about the development of new drugs, taking action to mitigate the impact of natural hazards, how to respond to the changing climate and to pandemic threats.

Interrogated with the question ‘But are you certain?’, they say, they have ended up sounding defensive or as though their results are not meaningful. Instead we need to embrace uncertainty, especially when trying to understand more about complex systems, and ask about operational knowledge: ‘What do we need to know to make a decision? And do we know it?’

The report seems to be in line with arguments I have made about uncertainty and decision making as they pertain to ecological research, management, and policy.

Among the contributors to the report is someone whom I consider to be one of the best when it comes to understanding and communicating uncertainty, David Spiegelhalter. While I haven’t made my way all the way through it yet, it looks like this report will be an informative read for both scientists and policy makers (oh yeah, and journalists — can’t forget about them).

Who knows, we might be able to stop the finger pointing and work together in mutual understanding of the importance of uncertainty.

How likely is the NSA PRISM program to catch a terrorist?

Recent revelations about PRISM, the NSA’s massive program of surveillance of civilian communications, have caused quite a stir. And rightfully so, as it appears that the agency has been granted warrantless direct access to just about any form of digital communication engaged in by American citizens, and that their access to such data has been growing significantly over the past few years.

Some may argue that there is a necessary trade-off between civil liberties and public safety, and that others should just quit their whining. Let’s take a look at this proposition (not the whining part). Specifically, let’s ask: how much benefit, in terms of thwarted would-be attacks, does this level of surveillance confer?

Let’s start by recognizing that terrorism is extremely rare, so the probability that an individual under surveillance (and now everyone is under surveillance) is a terrorist is also extremely low. Let’s also assume that the neck-beards at the NSA are fairly clever, if exceptionally creepy, and that they have devised an algorithm that can detect ‘terrorist communications’ (as opposed to, for instance, pizza orders) with 99% accuracy.

P(+ | bad guy) = 0.99

A job well done, and Murica lives to fight another day. Well, not quite. What we really want to know is: what is the probability that they’ve found a bad guy, given that they’ve gotten a hit on their screen? Or,

P(bad guy | +) = ??

Which is quite a different question altogether. To figure this out, we need a bit more information. Recall that bad guys (specifically terrorists) are extremely rare, say on the order of one in a million (this is a wild overestimate, with the true rate being much lower, of course – but let’s not let that stop us). So,

P(bad guy) = 1/1,000,000

Further, let’s say that the spooks have a pretty good algorithm that only produces a false positive (i.e. when the person under surveillance is a good guy) one in one hundred times.

P(+ | good guy) = 0.01

And now we have all that we need. Apply a little special Bayes sauce:

P(bad guy | +) = P(+ | bad guy) P(bad guy) / [ P(+ | bad guy) P(bad guy) + P(+ | good guy) P(good guy) ]

and we get:

P(bad guy | +) = 1/10,102

That is, for every positive (the NSA calls these ‘reports’) there is only a 1 in 10,102 chance (using our rough assumptions) that they’ve found a real bad guy.
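If you'd like to check the arithmetic, here is the same calculation in R, using the rough assumptions from above:

    # Bayes' theorem with the assumed rates
    p_bad      <- 1 / 1e6   # P(bad guy)
    p_pos_bad  <- 0.99      # P(+ | bad guy)
    p_pos_good <- 0.01      # P(+ | good guy)

    p_bad_pos <- (p_pos_bad * p_bad) /
      (p_pos_bad * p_bad + p_pos_good * (1 - p_bad))

    p_bad_pos       # ~9.9e-05
    1 / p_bad_pos   # ~10102, i.e. about 1 in 10,102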

UPDATE: While former NSA analyst turned whistleblower William Binney thinks this is a plausible estimate, the point here is not that this is the ‘correct probability’ involved (remember that we based our calculations on very rough assumptions). The take-away message is simply that whenever the rate of an event of interest is extremely low, even a very accurate test will fail very often.

UPDATE 2: The Wall Street Journal’s Numbers Guy has written a piece on this in which several statisticians and security experts respond.

UPDATE 3: If you can read German, a reader reached out to point out that Der Spiegel’s technology section picked up the story.

Big brother is always watching, but he’s still got a needle in a haystack problem.

[Image: Big Brother 11]

The television series doesn’t have this problem. On the show, they’re all bad guys.

What is probabilistic truth? Part 2 – Everything is conditional

Read Part 1

When making a statement of the form “1/2 is the correct probability that this coin will land tails”, there are a few things which are left unsaid, but which are typically implied.

The statement is one about the probability of an unknown event occurring, and it would seem reasonable to write it using probability notation as P(toss=tails) = 0.5. Indeed, many people would express it this way. However, what is missing is the state of knowledge under which the statement has been made. For instance, is the coin yet to be flipped, or is it currently rolling in a circle on the table, leaning in toward its final resting position? Perhaps the flipping device can consistently throw a coin such that it rotates exactly 5 times in the air before landing flat on the table, or we know which side is up at the start of the flip. In these latter cases, the statement of probability would be made under considerably more knowledge than in the first, and would not tend to be 0.5. An observer placing a probability of P(toss=tails) = 0.99 at the moment when the coin is circling in on its resting position, leaning heavily toward a tails-up configuration, could be said to have the correct probability also. For fairness, let’s say that the first observer also makes her probability statement at the same moment, but from another room where she cannot see what has happened.

How can P(toss=tails) = 0.5, and P(toss=tails) = 0.99 be simultaneously correct?

The answer is conditioning. Each of the statements was made conditional on the observer’s state of knowledge. More completely, the two statements can be rewritten as:

P(toss=tails | knowledge of observer 1) = 0.5 , and

P(toss=tails | knowledge of observer 2) = 0.99

In practice, however, we often leave out the conditional part of the notation unless it is germane to the problem at hand. But there is no such thing as unconditional probability. In fact, Harvard professor Joe Blitzstein calls conditioning the Soul of Statistics.
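As a quick illustration (a toy example of my own, not from the original post), here is a simulation in R in which two observers assign very different probabilities to the same toss, and both are correct conditional on what they know:

    # Observer 1 knows nothing about the toss; observer 2 gets a 'peek'
    # that reports the true outcome 99% of the time.
    set.seed(1)
    n      <- 1e5
    tosses <- rbinom(n, 1, 0.5)   # 1 = tails
    peek   <- ifelse(rbinom(n, 1, 0.99) == 1, tosses, 1 - tosses)

    mean(tosses)              # ~0.50 : P(toss = tails | knowledge of observer 1)
    mean(tosses[peek == 1])   # ~0.99 : P(toss = tails | knowledge of observer 2)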

In the next post in this series, we’ll start looking at how to assess the correctness of a (conditional) probability statement after having observed an outcome.


Here’s a bunch of random walks — just ’cause it’s neat.
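If you'd like to make your own, here is a minimal sketch in R (not necessarily the code behind the figure above):

    # Simulate and plot a bunch of one-dimensional Gaussian random walks
    set.seed(42)
    n_steps <- 1000
    n_walks <- 20

    walks <- replicate(n_walks, cumsum(rnorm(n_steps)))
    matplot(walks, type = "l", lty = 1, xlab = "Step", ylab = "Position")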

What is probabilistic truth?

I am currently working on a validation metric for binary prediction models. That is, models which make predictions about outcomes that can take on either of two possible states (e.g. dead/not dead, heads/tails, cat in picture/no cat in picture, etc.). The most commonly used metric for this class of models is AUC, which assesses the relative error rates (false positive, false negative) across the whole range of possible decision thresholds. The result is a curve that looks something like this:

[Figure: auc (an example ROC curve)]

The area under the curve (the curve itself being the receiver operating characteristic, or ROC, curve) is some value between 0 and 1. The higher this value, the better your model is said to perform. The problem with this metric, as many authors have pointed out, is that a model can perform very well in terms of AUC but be completely miscalibrated in terms of the actual probabilities placed on each outcome.

A model which distinguishes perfectly between positive and negative cases (AUC=1) by placing a probability of 0.01 on positive cases and 0.001 on negative cases may be very far off in terms of the actual probability of a positive case. For instance, positive cases may actually occur with probability 0.6 and negative cases with 0.2. In most real situations, our models will predict a whole range of different probabilities, with a unique prediction for each data point, but the general idea remains. If your goal is simply to distinguish between cases, you may not care whether the probabilities are correct. However, if your model purports to quantify risk, then you very much want to know whether you are placing probabilistically true predictions on cases that are yet to be observed.
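To make this concrete, here is a small toy example in R, mirroring the hypothetical numbers above, in which a model has perfect discrimination but badly miscalibrated probabilities:

    # 60% of cases are positive, but the model predicts 0.01 for every positive
    # and 0.001 for every negative: perfect ranking, terrible calibration.
    outcome <- c(rep(1, 60), rep(0, 40))
    pred    <- ifelse(outcome == 1, 0.01, 0.001)

    # AUC as the proportion of (positive, negative) pairs ranked correctly
    pos <- pred[outcome == 1]
    neg <- pred[outcome == 0]
    mean(outer(pos, neg, ">"))   # 1: perfect AUC

    mean(pred)      # ~0.006: average predicted probability
    mean(outcome)   # 0.6: the observed event rate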

Which raises the question: What is probabilistic truth?

This question appears, at least at first, to be rather simple. A frequentist definition would say that the probability is correct, or true, if the predicted probability is equal to the long-run frequency of the outcome. Think of a die rolled over and over, counting the number of times a one is rolled. We would compare this frequency to our predicted probability of rolling a one (1/6 for a fair six-sided die) and would say that our predicted probability was true if the observed frequency matched 1/6.
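In R, the long-run frequency idea takes only a couple of lines to demonstrate:

    # Long-run frequency of rolling a one with a fair six-sided die
    rolls <- sample(1:6, 1e6, replace = TRUE)
    mean(rolls == 1)   # settles near 1/6 (about 0.167) as the number of rolls grows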

But what about situations where we can’t re-run an experiment over and over again? How then would we evaluate the probabilistic truth of our predictions?

I’ll be working through this problem in a series of posts in the coming weeks. Stay tuned!

Read Part 2

Simulation and Likelihood Methods Workshop in Kananaskis

Corey Chivers:

I can think of worse places to get down and dirty with R than Kananaskis, Alberta.

Originally posted on Zero to R Hero:

[Image: CAISN logo]

Canadian Aquatic Invasive Species Network’s Annual General Meeting in Kananaskis, Alberta. May 3, 3:25-5:30.

This 2-hour workshop will focus on how and why we do numerical simulation in R. Time permitting, we will also look at how to build and fit likelihood-based statistical models.

We ask that you bring your laptop with both R and RStudio installed. If you’ve never worked with R before, please have a look at the getting started with R document. You can also check out the slides from our more introductory workshops.

Outline

Section 1: Introduction to Simulation (script)

  •     What is (numerical) simulation?
  •     Drawing random samples from a set
  •     Drawing random samples from a probability distribution
  •     Describing models in terms of their deterministic and stochastic parts
  •     Simulating data from a model
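As a small taste of the Section 1 topics listed above (an illustrative snippet only, not taken from the workshop script):

    # Random samples from a set and from a probability distribution
    set.seed(1)
    sample(c("a", "b", "c"), size = 5, replace = TRUE)
    rnorm(5, mean = 0, sd = 1)

    # Simulating data from a model: a deterministic part (2 + 0.5 * x)
    # plus a stochastic part (normally distributed noise)
    x <- 1:20
    y <- 2 + 0.5 * x + rnorm(length(x), sd = 1)
    plot(x, y)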

Section 2: Likelihood Methods (script)

  •     The Likelihood Principle
  •     The Ecologist’s Quarter
  •     Maximum…

View original 30 more words