Heartbeat of a Cycling City: Bixi data at Hack/Reduce

Corey Chivers:

With spring finally making its presence known, I thought I’d re-share this cycling data analysis and visualization I did with some great people a while back. Get out there and feel that wind in your hair!

Originally posted on bayesianbiologist:

The recent Hack/Reduce hackathon in Montreal was a tonne of fun. Our team tackled a data set consisting of Bixi (Montreal’s bicycle share system) station states at one minute temporal resolution. We used Hadoop and MapReduce to pull out some features of user behaviours. One of the things we extracted was the flux at each station, which we defined as the number of bikes arriving and departing from a given station per unit time. When you plot the total system flux across all stations against time, you can see the pulse of the city. Here are the first few weeks of this year’s Bixi season. (Click to enlarge.)

A few things jump out: 1) There are clearly defined peaks at both the morning and evening rush hours, but it looks like the evening rush is typically a little stronger. I guess cycling home is a great way to relax after…


Online R and Plotly Graphs: Canadian and U.S. Maps, Old Faithful with Multiple Axes, & Overlaid Histograms

Guest post by Matt Sundquist of plot.ly.

Plotly is a social graphing and analytics platform. Plotly’s R library lets you make and share publication-quality graphs online. Your work belongs to you, you control privacy and sharing, and public use is free (like GitHub). We are in beta, and would love your feedback, thoughts, and advice.

1. Installing Plotly

Let’s install Plotly. Our documentation has more details.

install.packages("devtools")
library("devtools")
devtools::install_github("R-api","plotly")

Then sign up online, or like this:

library(plotly)
response = signup(username = 'yourusername', email = 'youremail')


Thanks for signing up to plotly!
Your username is: MattSundquist
Your temporary password is: pw. You use this to log into your plotly account at https://plot.ly/plot.
Your API key is: “API_Key”. You use this to access your plotly account through the API.

2. Canadian Population Bubble Chart

Our first graph was made at a Montreal R Meetup by Plotly’s own Chris Parmer. We’ll be using the maps package. You may need to install it first:

install.packages("maps")

Then:

library(plotly)
p <- plotly(username="MattSundquist", key="4om2jxmhmn")
library(maps)
data(canada.cities)
trace1 <- list(x=map(regions="canada")$x,
  y=map(regions="canada")$y)

trace2 <- list(x= canada.cities$long,
  y=canada.cities$lat,
  text=canada.cities$name,
  type="scatter",
  mode="markers",
  marker=list(
    "size"=sqrt(canada.cities$pop/max(canada.cities$pop))*100,
    "opacity"=0.5)
  )

response <- p$plotly(trace1,trace2)
url <- response$url
filename <- response$filename
browseURL(response$url)

In our graph, the bubble size represents the city population size. Shown below is the GUI, where you can annotate, select colors, analyze and add data, style traces, place your legend, change fonts, and more.

map1

Editing from the GUI, we make a styled version. You can zoom in and hover on the points to find out about the cities. Want to make one for another country? We’d love to see it.

map2

And, here is said meetup, in action:

plotly_mtlRmeetup

You can also add in usa and us.cities:

map3

3. Old Faithful and Multiple Axes

Ben Chartoff’s graph shows the correlation between a bimodal eruption time and a bimodal distribution of eruption length. The key series are: a histogram scale of probability, Eruption Time scale in minutes, and a scatterplot showing points within each bin on the x axis. The graph was made with this gist.

old_faithful

4. Plotting Two Histograms Together

Suppose you are studying correlations in two series (a popular Stack Overflow question). You want to see the overlap. You can plot two histograms together, one for each series. The overlapping sections are the darker orange, rendered automatically if you set barmode to 'overlay'.

library(plotly)
p <- plotly(username="Username", key="API_KEY")

x0 <- rnorm(500)
x1 <- rnorm(500)+1

data0 <- list(x=x0,
  name = "Series One",
  type='histogramx',
  opacity = 0.8)

data1 <- list(x=x1,
  name = "Series Two",
  type='histogramx',
  opacity = 0.8)

layout <- list(
  xaxis = list(
  ticks = "",
  gridcolor = "white",zerolinecolor = "white",
  linecolor = "white"
 ),
 yaxis = list(
  ticks = "",
  gridcolor = "white",
  zerolinecolor = "white",
  linecolor = "white"
 ),
 barmode='overlay',
 # style background color. You can set the alpha by adding an a.
 plot_bgcolor = 'rgba(249,249,251,.85)'
)

response <- p$plotly(data0, data1, kwargs=list(layout=layout))
url <- response$url
filename <- response$filename
browseURL(response$url)

plotly5

5. Plotting y1 and y2 in the Same Plot

Plotting two lines or graph types in Plotly is straightforward. Here we show y1 and y2 together (a popular Stack Overflow question).

library(plotly)
p <- plotly(username="Username", key="API_KEY")

# enter data
x <- seq(-2, 2, 0.05)
y1 <- pnorm(x)
y2 <- pnorm(x,1,1)

# format, listing y1 as your y.
First <- list(
  x = x,
  y = y1,
  type = 'scatter',
  mode = 'lines',
  marker = list(
   color = 'rgb(0, 0, 255)',
   opacity = 0.5)
  )

# format again, listing y2 as your y.
Second <- list(
  x = x,
  y = y2,
  type = 'scatter',
  mode = 'lines',
  opacity = 0.8,
  marker = list(
   color = 'rgb(255, 0, 0)')
  )
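The post stops short of actually sending these two traces; following the pattern of the earlier examples (and assuming the same p client object), the call would presumably be:

response <- p$plotly(First, Second)
browseURL(response$url)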

plotly6

And a shot of the Plotly gallery, as seen at the Montreal meetup. Happy plotting!

plotly_mtlRmeetup2

What’s Warren Buffett’s $1 Billion Basketball Bet Worth?

A friend of mine just alerted me to a story on NPR describing a prize on offer from Warren Buffett and Quicken Loans. The prize is a billion dollars (1B USD) for correctly predicting all 63 games in the men’s Division I college basketball tournament this March. The Facebook page announcing the contest puts the odds at 1:9,223,372,036,854,775,808, which they note "may vary depending upon the knowledge and skill of entrant".

Being curious, I thought I’d see what the assumptions were that went into that number. It would make sense to start with the assumption that you don’t know a lick about college basketball and you just guess using a coin flip for every match-up. In this scenario you’re pretty bad, but you are no worse than random. If we take this assumption, we can calculate the odds as 1/(0.5)^63.  To get precision down to a whole integer I pulled out trusty bc for the heavy lifting:

$ echo "scale=50;  1/(0.5^63)" | bc
9223372036854775808.000000

Well, that was easy. So if you were to just guess randomly, your odds of winning the big prize would be those published on the contest page. We can easily calculate the expected value of entering the contest as P(win)*prize, or about one 9,223,372,037th of a dollar (that's roughly a tenth of a nanodollar, if you're paying attention). You've literally already spent that (and then some) in opportunity cost sunk into the time you are spending thinking about this contest and reading this post (but read on, 'cause it's fun!).

But of course, you're cleverer than that. You know everything about college basketball - or more likely if you are reading this blog - you have a kickass predictive model that is going to up your game and get your hands into the pocket of the Oracle from Omaha.

What level of predictiveness would you need to make this bet worthwhile? Let's have a look at the expected value as a function of our individual game probability of being correct.
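Here is a minimal R sketch of that calculation (not the gist linked at the end of this post): the expected value is just the $1B prize times the probability of calling all 63 games correctly.

# expected value of an entry as a function of per-game accuracy (63 independent games assumed)
p_game <- seq(0.5, 0.9, by = 0.01)
ev <- 1e9 * p_game^63
plot(p_game, ev, type = "l", log = "y",
     xlab = "Per-game probability of a correct pick",
     ylab = "Expected value of an entry ($, log scale)")
1e9 * 0.85^63   # roughly $35,000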

buffet1

And if you think that you're really good, we can look at the 0.75 to 0.85 range:

buffet2

So it's starting to look enticing; you might even be willing to take off work for a while if you thought you could get your model up to a consistent 85% correct game predictions, giving you an expected return of ~$35,000. A recent paper found that even after observing the first 40 scoring events, the outcome of NBA games is only predictable at 80%. In order to be eligible to win, you've obviously got to submit your picks before the playoff games begin, but even at this herculean level of accuracy, the expected value of an entry in the contest plummets to $785.

Those are the odds for an individual entrant, but what are the chances that Buffett and co. will have to pay out? That, of course, depends on the number of entrants. Let's assume that the skill of all entrants is the same, though they all have unique models which make different predictions. In this case we can get the probability of at least one of them hitting it big. It will be the complement of no one winning. We already know the odds for a single entrant with a given level of accuracy, so we can just take the probability that each one doesn't win, then take 1 minus that value.
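A minimal sketch of that complement calculation, under the stated assumption of independent, equally skilled entrants:

# chance that at least one of n entrants submits a perfect bracket
p_entry  <- function(acc) acc^63                      # single-entrant win probability
p_payout <- function(acc, n) 1 - (1 - p_entry(acc))^n
p_payout(0.80, 1e6)   # a bit over 50%
p_payout(0.70, 1e6)   # about 0.017%, or roughly 1 in 5739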

buffet3

Just as we saw that the expected value is very sensitive to the predictive accuracy of the participant, so too is the probability that the prize will be awarded at all. If 1 million super talented sporting sages with 80% game-level accuracy enter the contest, there will only be a slightly greater than 50% chance of anyone actually winning. If we substitute in a more reasonable (but let's face it, still wildly high) figure for participants' accuracy of 70%, the chance that the top prize will be awarded at all drops to about 1 in 5,739 (0.017%), even with a million-strong entrant pool.

tl;dr You're not going to win, but you're still going to play.

If you want to reproduce the numbers and plots in this post, check out this gist.

In praise of exploratory statistics

Originally posted on Dynamic Ecology:

There has been a lot of discussion of researcher degrees of freedom lately (e.g. Jeremy here or Andrew Gelman here – PS by my read Gelman got the specific example wrong because I think the authors really did have a genuine a priori hypothesis but the general point remains true and the specific example is revealing of how hard this is to sort out in the current research context).

I would argue that this problem comes about because people fail to be clear about their goals in using statistics (mostly the researchers, this is not a critique of Jeremy or Andrew’s posts). When I teach a 2nd semester graduate stats class, I teach that there are three distinct goals for which one might use statistics:

  1. Hypothesis testing
  2. Prediction
  3. Exploration

These three goals are all pretty much mutually exclusive (although there is some overlap between prediction and exploration). Hypothesis testing is of…


Simulation and Likelihood Methods Workshop in Kananaskis

Corey Chivers:

I can think of worse places to get down and dirty with R than Kananaskis, Alberta.

Originally posted on Zero to R Hero:

CAISN_Primary_trans

Canadian Aquatic Invasive Species Network's Annual General Meeting in Kananaskis, Alberta. May 3, 3:25-5:30.

This 2-hour workshop will focus on how and why we do numerical simulation in R. Time permitting, we will also look at how to build and fit likelihood based statistical models.

We ask that you bring your laptop with both R and RStudio installed. If you’ve never worked with R before, please have a look at the getting started with R document. You can also check out the slides from our more introductory workshops.

Outline

Section 1: Introduction to Simulation (script)

  •     What is (numerical) simulation?
  •     Drawing random samples from a set
  •     Drawing random samples from a probability distribution
  •     Describing models in terms of their deterministic and stochastic parts
  •     Simulating data from a model

Section 2: Likelihood Methods (script)

  •     The Likelihood Principle
  •     The Ecologist’s Quarter
  •     Maximum…


A quick guide to non-transitive Grime Dice

A very special package that I am rather excited about arrived in the mail recently. The package contained a set of 6-sided dice. These dice, however, don’t have the standard numbers one to six on their faces. Instead, they have assorted numbers between zero and nine. Here’s the exact configuration:

red<-c(4,4,4,4,4,9)
blue<-c(2,2,2,7,7,7)
olive<-c(0,5,5,5,5,5)
yellow<-c(3,3,3,3,8,8)
magenta<-c(1,1,6,6,6,6)

Aside from maybe making for a more interesting version of snakes and ladders, why the heck am I so excited about these wacky dice? To find out what makes them so interesting, let's start by just rolling one against another and seeing which one rolls the higher number. Simple enough. Let's roll Red against Blue. Until you get your own set, you can roll in silico.
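If your own set hasn't arrived yet, a minimal in silico roll using the vectors above looks like this:

sample(red, 1) > sample(blue, 1)                           # roll Red against Blue once
mean(replicate(10000, sample(red, 1) > sample(blue, 1)))   # Red wins about 58% of the time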

That was fun. We can do it over and over again and we'll find that Red beats Blue more often than not. So it seems like Red is a pretty good bet. Now let's try rolling Olive against Red. I'll wait.

Hey, look at that, the mighty Red has fallen. Olive tends to roll a higher number than Red more often than it doesn't. So far, we have discovered this relationship:

Olive > Red > Blue

All hail the dominant Olive! Out of these three dice, if we want the best chance of winning, we should always pick Olive right? No dice, as they say. When we roll Olive against Blue, we find that Blue wins more often!

For any one of these three dice, there is another that will roll a higher number more often than not.

Olive > Red > Blue > Olive > Red > Blue > Olive > Red > Blue..

This forms a chain of dominance relationships that is a closed cycle. This property is called intransitivity, and you can use it to win riches beyond your wildest dreams, er, well, at least to impress your friends.

Neat, right? But there's more! We can do the same trick with Yellow, Magenta, and Red (Red > Magenta > Yellow > Red > ...). With all five dice, there is a chain for which the order is given by the length of the word for each colour.

Red > Blue > Olive > Yellow > Magenta > ...

Awesome. But that's not it, either! You may have noticed from our three way comparisons that there is another five way chain. This time, the chain order is given by the alphabetical order of the words for each of the colours.

Blue > Magenta > Olive > Red > Yellow > ...

What are the odds?

So far I've just asked you to take my word for it that the dominance relationships are as I described. Working out the odds of winning for any given pairing of dice is actually quite straightforward. Start by looking at the number on each side of the first die, one at a time. Count how many sides on the opposing die are less than the current number and divide by six. Since each side on the first die has a 1/6 chance of appearing, divide by 6 again. Sum these values for all six sides and you will have the probability that the first die will roll a higher number than the second.

For example, P(Red > Blue) = 5/6 x 1/2 + 1/6, which is 7/12.
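Here's a small sketch (mine, not from the original post) that computes the full table of pairwise odds directly from the face vectors defined above:

dice <- list(red = red, blue = blue, olive = olive, yellow = yellow, magenta = magenta)

# P(die a rolls strictly higher than die b), averaging over all 36 equally likely face pairings
p_beats <- function(a, b) mean(outer(a, b, ">"))

# rows beat columns with the probabilities shown
round(sapply(dice, function(b) sapply(dice, function(a) p_beats(a, b))), 3)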

Here I've worked out all of the pairwise odds:

Grime_dice

So, you can always win in this game as long as you get to be second to choose a colour. The odds are strongest in your favour when your opponent either chooses Magenta or Red, and you choose Olive or Yellow, respectively. Isn't probability wonderful!

And if you still want more, it turns out that if you roll the Grime dice in pairs, the order of the word length chain reverses!

Corey Chivers:

As a follow-up to my simulation-based approximate solution to the Gambling Machine Puzzle, here is the exact solution from mathematician Michael Lugo, with a nice explanation.

Originally posted on God plays dice:

From the New York Times “Numberplay” blog:

An entrepreneur has devised a gambling machine that chooses two independent random variables x and y that are uniformly and independently distributed between 0 and 100. He plans to tell any customer the value of x and to ask him whether y > x or x > y.

If the customer guesses correctly, he is given y dollars. If x = y, he’s given y/2 dollars. And if he’s wrong about which is larger, he’s given nothing.

The entrepreneur plans to charge his customers $40 for the privilege of playing the game. Would you play?

Clearly the strategy is to guess that y > x if x is small, and to guess that y < x if x is large. Say you’re told x = 60. If you guess x is the larger variable, then conditional on your guess being correct (which…

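For reference, here is a minimal Monte Carlo sketch in the spirit of the simulation-based approximation mentioned above; the threshold strategy (guess y > x whenever x is below some cutoff t) and the candidate cutoffs are my own illustration, not the original code:

set.seed(1)
n <- 1e6
x <- runif(n, 0, 100)
y <- runif(n, 0, 100)

expected_payout <- function(t) {
  correct <- ifelse(x < t, y > x, y < x)   # guess "y is larger" below the cutoff
  mean(ifelse(correct, y, 0))              # ties (x == y) have probability zero here
}

sapply(c(50, 60, 70, 80), expected_payout)  # compare against the $40 entry fee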

Predictive Ecology and Management Decisions Under Uncertainty

In February, I was honoured with the inaugural McGill Biology Graduate Students Association Organismal Seminar Award. This is the talk I gave on uncertainty and predictions in ecological management. I have inserted selected visuals, but you can get the complete slides which accompanied the talk here.

———————————————————————————————————–

Predictive invasion ecology and management decisions under uncertainty

When I was told that I had been selected to give this talk I spent a rather long time thinking about how one goes about preparing a presentation of this type. The typical graduate student experience presenting at a conference is that we are expected to detail a single piece of research to a relatively small audience of fellow researchers who, it is hoped, have at least a tertiary familiarity with whatever esoteric domain we’ve been recently slaving away in. The format usually goes: Introduction – methods – results – conclusion – next speaker. However, it seemed to me that if I was going to ask you all to come out and listen during what would otherwise be a prime manuscript writing Thursday afternoon, I was going to have to do something a little different, and I hope, more interesting.

I’ve titled the talk ‘Predictive invasion ecology and management decisions under uncertainty’, and indeed that is what I’ll be talking about. However, I’ve decided to structure this talk as something more akin to an argumentative essay. And the argument that I am going to be making has to do with how, as I see it, we should be doing ecology as it pertains to informing management and policy decisions.

By this I mean, how do we go from data, which is often sparse or limited, and theory, which is our (also limited) abstracted understanding of ecological processes, to making inference, predictions, and ultimately, to help inform decisions. I am going to lay out, and advocate, a framework for how I think that we ought to go about this. I’ll outline a sort of methodological recipe for doing this, and then I’ll go through how I have applied this general methodological approach in my own research. First on measuring the impacts of invasive species (specifically forest pests), then on various aspects of the spread, and management, of aquatic invasives.

The paper that I gave to Carly to send around as recommended reading (which I’m sure you all diligently read), is this one. This is one of the first papers that I came across when I started graduate work here at McGill. Clark and colleagues published this paper in Science back in 2001, but its message is just as, if not more, important today. In it, the authors argue that if we are going to make decisions about what to do to conserve biological diversity, ecosystem function, and the valuable ecosystem services provided to us by nature, then we will need to make forecasts about what we expect to happen in the future. I was struck by something that Justin Travis said last week during his talk about the idea of engaging in ‘ecological meteorology’, which might bring to mind something like this.

ecological_met

When we think of forecasting, we usually think about the best guess at what some future state of nature is going to be. When I speak of forecasts, however, I mean something a little bit different than what is communicated by weather forecasters. When I talk about forecasting, I mean specifically placing probability distributions over the range of possible future outcomes, and in that sense, a forecast is not a single prediction about a future state of nature, but rather is a way to communicate uncertainty about the future. And indeed, our level of uncertainty may well be quite high. It was the Danish physicist Niels Bohr, who once said:

Prediction is very difficult, especially about the future.

–Niels Bohr, Danish physicist (1885-1962)

Now, the temptation can be to suggest that because this is true, we should not act. Many will call for ‘further study’, or cite ‘insufficient data available’ as a reason to delay decisions. Of course, we can always collect more data, and there is always room to further refine and develop our theory and models. And yes, indeed, we ought to continue to do both of these things. However, the approach of not making a decision has been thought of as being a precautionary one, on the basis that action can have unintended consequences. I want to outline why I believe that this is not the correct way to think about things.

decisions

  1. Not making a decision is, itself a decision!
  2. As such, in a changing world, doing nothing can also have unintended consequences!

Delaying decision making in the hope that things will stay the same may very well have the unintended consequence that they will not. By doing away with this we can focus on the problem at hand. And that is: how do we make the best use of the data that we have to make predictions which take into account (potentially large) inherent uncertainties?

That is, how do we use all of the information that we have available to us to make probabilistic predictions, with their many sources of uncertainty, about the things that we are interested in? For instance, in my own work, these are things like: How much damage are non-native forest insects causing, and how much damage should we expect to be caused by the next one? At current import rates, what is the probability of a new fish species invading via the aquarium trade? Or, after initial introduction, how is a novel invasive likely to spread across the landscape? And how is that spread likely to be altered in the presence of targeted mitigation strategies?

The framework that I am advocating approaches all of these problems in a similar way. At this point, for those of you who know me, you’re probably making a guess about where I’m going here.

clark_bayesian

Here again is Jim Clark, this time outlining all the wonderful things that the Bayesian can do. Bayes is indeed a useful tool for doing inference and prediction while accounting for uncertainty and variability. However, I am not going to argue here specifically for the use of Bayesian methods (though they can help.) In fact, in my own work I have used both classical maximum likelihood, as well as Bayesian methods to get the job done. Instead, I am arguing for the use of a broader framework. The important elements of this framework are that we use the information (data) that we have available to us to the maximum extent possible while taking account of our uncertainties in the form of probability distributions. It turns out that Bayesian methods are great at this. They do, however suffer from other problems, not the least of which can be the computational costs of estimating models.

Let’s start outlining a procedure for this general methodological framework which I am proposing. The first step is data.

—> What information do I have? What can I go out and observe? <—

Some may suggest starting with theory, but I would argue that if the theory or models don’t make predictions about observable, quantifiable phenomena, then we’re done before we even get started. So, start with data. In ecology, we often have observational data (as opposed to experimental data). From the information (or data) that we have, we can then start asking the question:

—> What are the hypothesized processes which generated these data? <—

process1

This is where our ecological understanding comes in in the form of models. Models are our abstractions of what we believe to be the key processes involved in some real world phenomenon. These abstractions of real world processes, by definition contain only a subset of the real components which are involved. Therefore, something we need to always keep in mind was best said by George Box:

All models are wrong, but some are useful.

–George E.P. Box.

This is a quote that I am sure many of you are familiar with, but I find it to be a very useful one to remind myself of from time to time. Any model or theory that we have of the world will be incomplete. The challenge we face is to build our abstractions of real world processes such that they contain the main relevant features of the phenomena that we are interested in, and produce predictions about observable quantities in the real world.

Once we have formalized our theories about how we think that our observable data is being generated into one or more models, the next step is to build an artificial world and in that world, to simulate the data generating process from your models.

By this I mean stochastic simulation – the in silico use of random number generation to probabilistically mimic the behaviour of individuals or populations. Random number generation is really the best thing since sliced bread. It can be used as a tool to generate and test theories and as a platform to conduct thought experiments about complex systems.

process2

This is a step that I believe is most often skipped in studies with data. Many studies go straight from formulating a model to confronting that model with data. Instead, the next step should be to confront your model, or models, with the pseudo-data, which is in the exact same form as your real world data.

I have found this step to be critically important as a safeguard against faulty logic. If you’re anything like me, you will have many of these moments when building and testing models.

A rather nefarious thing about models is that even broken ones can produce seemingly reasonable predictions when confronted with data. The point is that simulation provides a test suite (a massive laboratory right in your computer) for experimenting and testing your logic before confronting models with data.
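To make that loop concrete, here is a toy sketch of my own (not from the talk): simulate pseudo-data from a model with a known deterministic part and a known stochastic part, then check that the fitting procedure recovers the generating parameters.

# deterministic part: log-linear mean; stochastic part: Poisson observation error
set.seed(42)
a_true <- 0.5; b_true <- 1.2                   # known generating parameters
x <- runif(200, 0, 2)
y <- rpois(200, lambda = exp(a_true + b_true * x))

fit <- glm(y ~ x, family = poisson)            # confront the model with the pseudo-data
coef(fit)                                      # estimates should land close to (0.5, 1.2)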

Once we have gone around this loop a few (or many) times, only then do we go ahead and test it out on our real data. At this point you may still need to go back to the theoretical drawing board, but hopefully you’ve ironed out all the kinks in your logic. It is here that you estimate model parameters and distinguish between competing models.

You are now ready to start testing hypotheses about how a system may behave if altered, and to make probabilistic forecasts about things you care about. Here, as a bonus, you’ve already set up a simulation model of your system which you can now use to push forward in time under any given scenario or intervention you may be considering.

process3

It is then possible to run all kinds of analyses of risk based on our probabilistic forecasts in order to help make decisions about management that will maximize our probability of getting a desired outcome.

I have presented this as a general framework that I think is applicable to a broad range of ecological problems, as a way to link data and theory to make inference and predictions, and to inform decisions under uncertainty. What I want to do now is to give you a look at how I’ve used this general methodological framework in my own research, which is specific to invasive species.

The first project that I worked on when I arrived for graduate study at McGill was on estimating the impacts of non-native forest insects in the US. I don’t know how I swung this one, but no sooner that I had arrived, Brian was bringing me down to California to work with an amazing interdisciplinary team of forest ecologists, entomologists, and resource economists on this project at NCEAS.

First a bit about invasive forest insects. We know that while international trade provides many economic benefits, there are also many externalities that are not taken into consideration. When we move massive amounts of stuff around the globe, we inevitably also move species, and some of these species can become problematic in their adoptive landscape. We know that some of these species are causing large ecological and economic damages, but we don’t really have any estimates of what the total damages are amounting to. Without any estimate, these damages can not be incorporated into the cost of trade, nor can we make informed decisions about what costs we would be willing to pay to avoid them.

Given that there are many non-native species being introduced and only a fraction of those are highly damaging, we also wanted to get an estimate of the probability of seeing another high impact pest under current import levels and phytosanitary legislation. Further, since the mode by which forest pests are being transported is highly associated with the feeding guild of the insects, we wanted to get separate estimates grouped by guild. And since economic impact is not likely to be evenly distributed, we set out to determine who is paying for these impacts.

So, what information did we have? Not much, really. The entomologists set out to compile a list of all known non-native forest pests, and to identify a short list of those species that had been documented to have caused some noticeable damage. While we would have liked them to have done all of the species, the lazy economists agreed to do a national scale, full cost analysis of the most damaging pest in each of the three feeding guilds (borers, sap suckers, and foliage feeders). And they did this analysis across three broad economic sectors (costs borne to government, households, and to the lumber market).

The information which we had then, for each guild and economic sector, was a count of the number of species causing very little damage (most of them), the number of species causing some intermediate level of damage (a few), and one estimate of the damage caused by the most damaging species.

How do we go from this information to an estimate of the frequency distribution of costs across all species? We assumed, from what we know about there being many innocuous species, fewer intermediate species, and only one real ‘poster’ species in each guild, that the distribution was likely to be concave across most of its range and that the frequency of species causing infinite damage should asymptote to zero.

There are only a handful of functional forms that meet our asymptotic and concavity assumptions, and fitting them using maximum likelihood would be fairly straightforward if we had a cost estimate for every species. But unfortunately, what we have are frequencies of species falling in different cost ranges.

The solution to this is to integrate the curve in these ranges and use that integral in our likelihood function.
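As a minimal sketch of that idea (my own illustration, with a hypothetical gamma cost distribution and made-up bins and counts, not the NCEAS analysis): the probability of a species falling in a given cost range is the difference of the CDF at the range's endpoints, and those bin probabilities enter the likelihood.

breaks <- c(0, 1, 10, Inf)     # cost categories (hypothetical boundaries, in $ millions)
counts <- c(20, 5, 1)          # number of species observed in each category (hypothetical)

nll <- function(par) {
  shape <- exp(par[1]); rate <- exp(par[2])     # log-parameterization keeps both positive
  p_bin <- diff(pgamma(breaks, shape, rate))    # integral of the density over each cost range
  -sum(counts * log(p_bin))                     # multinomial negative log-likelihood (up to a constant)
}

fit <- optim(c(0, 0), nll)
exp(fit$par)                   # maximum likelihood estimates of shape and rate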

nceas1

So to recap, we have counts of species in each of three cost categories. From that information we want to determine which functional form is likely to have generated that observed data. The parameters of each functional form themselves are uncertain, but we want to estimate a probability distribution of which values are more likely. And from there, with the distribution of possible distributions in hand, we can extract any number of useful quantities, for instance, the total cost of all species, the expected cost of a new species, or the probability that a new species will be a highly damaging one.

Okay, nice system, but will it work? To find out if our scheme would work, we simulated from each of the models, actually generating each data point, then collected ‘pseudo-data’ in the form which we actually had in real life, and went ahead to see if we could, in fact get back both the generating model, and the generating parameters.

The great thing about simulations is that you know what the model and parameters are that generated your simulated data. So when you apply your estimation procedures, you can assess how well you’ve done. So here we had four possible generating functional forms (power, log-normal, gamma, Weibull), and what I checked for was how often the estimation procedure was able to correctly identify the generating model. While some models were less distinct from others, we were in fact able to recapture the generating model.

We also checked to see that we could recapture the parameter values and any derived quantities of interest. Again the panels are the four models, and in green are samples from the posterior distribution of parameter values, with the generating value shown in red. Shown here is only one instance, but what we actually did was re-simulate thousands of times to determine whether the generating parameter values were distributed as random draws from the posterior distribution, since for any one instance we wouldn’t expect the posterior to be centred exactly on the true value, but rather that the true value would be a random draw from the posterior distribution.

nceas2

Also shown in these plots as a color ramp is the total cost associated with a curve having those parameter values. We also checked to ensure that there was no bias in our estimations of our derived quantities of interest.

I’m showing here the procedure that worked, however this took some doing to get right – highlighting the importance of testing models via simulation before sending out in prime-time.

So we found that the bulk of the cost of invasive forest insects was being placed on local governments, and that it was borers doing all the damage at around 1.7 billion USD per annum. We also found that the single most damaging pest in each guild was responsible for between 25 and 50% of the total impacts, and that at the current rates of introduction, there is about a 32% chance that another highly damaging pest will occur in the next 10 years.

Staying on the theme of imports, I want to move now to the live fish trade, and specifically the invasion risk posed by aquarium fish. This is some work, forthcoming in Diversity and Distributions, which I did with lab mate Johanna Bradie, wherein we had some fairly high resolution data on fish species being imported to Canada. We had records of the number of individuals at the species level that were reported at customs, and we wanted to know ‘given only this information, can we estimate the risk of establishment?’.

So we know that, due to species-specific traits such as life history and environmental tolerances, not all species are equally likely to establish. However, what we wondered was: in the absence of this information, can we estimate an overall pathway-level risk? So if you look at the plot, what I am showing is that the relationship between propagule pressure and probability of establishment is not likely to be the same across different species. But the question is ‘how much does this matter in determining a risk estimate in the absence of species-specific information?’

aquarium1

What we showed using simulation was that in most cases, it was possible to estimate a pathway level establishment curve, without significant bias. The only place where it particularly failed was when the distribution of q values, which is essentially a measure of environmental suitability, was strongly bimodal. Or, in other words, where there are two very different types of species represented in the import pool.

In short we found that at an import level of 100,000 individuals, there was pathway level establishment risk of 19%, and that importing 1 million individuals leads to just under a 1 in 2 chance of establishment for a species chosen at random from the import pool.

These studies so far have looked at invasions at very broad spatial scales, but now I want to drill down a little and talk about the process of species spread across a landscape once a local population has been established. Specifically, I investigate the way in which human behaviour mediates the dispersal of aquatic invasive species at the landscape level, and how these patterns of dispersal interact with the population dynamics of the invading species to affect the spatial patterns of spread.

We’ve known for a while now that freshwater species are transported from lake to lake primarily by hitchhiking to recreational vessels like this one here. If we are going to try to make predictions about how species will spread, it stands to reason that we should probably try to understand the behaviour of the humans who are pulling these boats around. There have been two main models of boater behaviour proposed in the literature, the gravity model, which makes use of an analogy to the gravitational ‘pull’ of large bodies, in this case lakes, and the random utility model, which models boaters as rational utility maximizers. But at their core, what both of these models do is to try to estimate which lakes a given boater is likely to visit. That is, they both try to estimate a boater’s probability distribution over lakes. Essentially, they are just alternative functional forms of the same decision process.

I wanted to observe boater choice in order to predict how propagules are likely to flow between lakes. To do this I conducted an online survey. In the survey I asked boaters to identify which lakes they visited during the boating season, as well as how many times they visited each lake. Respondents were able to identify their responses on an interactive map, and were able to reminisce about all the great fish they caught that summer while they did it (they were doing it in December).

obs_boater_choice

We ended up with 510 respondents, 146 of whom indicated that they visited multiple lakes.

In keeping with the methodology that I have outlined, I of course simulated some boaters making trip taking decisions in a simulated landscape. I ran simulations with the boaters behaving according to each of the two behavioural models, and in each simulation I had those boaters fill out a simulated survey just like the one I had online. By collecting this pseudo-data, I then went and looked at whether I could recapture the parameter values underlying the data.

Shown here are the results of that experiment, with the 1:1 line indicating perfect agreement between the actual and predicted values of each of the four parameters for each model. For the same set of simulations, I also checked to see whether I could correctly identify when the generating model was the gravity model, or the RUM.

So what were the results of the survey (the real, flesh and blood survey)?

Well, for the boaters I surveyed in Ontario, the gravity model functional form fit the observed data considerably better. This table just shows the delta AIC values for each of the models tested. Each value is relative to the best model with increasing values indicating decreasing fit.

Okay, so that’s maybe a mildly interesting. But it only matters insofar as whether it has an effect our prediction about spreading invasions. It turns out, that it does.

Because both models were fit using the same data and covariates, they both predicted the same amount of overall boater traffic between lakes, and hence the same net connectivity. The feature in which they differed, however, was in how that connectivity was distributed across the landscape.

We can measure this difference in connectivity distribution using a simple evenness measure like Shannon entropy. High entropy means a very even distribution, whereas low entropy means that the distribution of connectivity is more highly concentrated around a few edges of the network. This connectivity distribution particularly matters when the spreading population experiences an Allee effect in its population dynamics. That is, when populations are disproportionately likely to go extinct at low population sizes, whether due to mate limitation in the case of sexually reproducing species, or other forms of density dependent growth. The reason this matters is that in dispersal networks with high entropy (evenly spread) connectivity distributions, emigrants do not arrive in sufficient numbers to overcome the demographic barriers to establishment. There is a diluting effect in this situation. Whereas in low entropy connectivity networks, there are sufficient ‘hub’ patches, providing emigrants in numbers large enough to overcome initial barriers to growth.
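For concreteness, a small sketch (mine, with made-up edge weights) of the evenness measure just described:

# Shannon entropy of a connectivity (edge weight) distribution:
# high entropy = boater traffic spread evenly over edges; low entropy = a few dominant 'hub' edges
shannon_entropy <- function(w) {
  p <- w / sum(w)
  -sum(p * log(p))
}

even_edges <- rep(10, 10)         # hypothetical evenly distributed traffic
hub_edges  <- c(91, rep(1, 9))    # hypothetical traffic concentrated on one edge

shannon_entropy(even_edges)       # higher entropy
shannon_entropy(hub_edges)        # lower entropy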

We can see this happening when we look at rates of spread in the GM and RUM cases. As Allee effects get stronger (going from left to right), the rates of spread in the early phases are much lower under the evenly distributed RUM dispersal network than under the more concentrated edge distribution of the gravity model.

While this is an important finding in the field of invasion dynamics, I think that this has implications more broadly and could be applied when looking at, instead of a population which we would like to suppress, a conservation setting where we are looking to preserve habitat patch connectivity.

But for now, back to populations which we are trying to suppress. I’ve shown so far how we can model the spread of aquatic invasives via their primary vector, which is recreational boaters. But ultimately, we want to know where they are likely to spread so that ideally, we can get out in front of them and actually do something about it.

One measure that has been proposed in some districts is to place mandatory cleaning stations at strategically selected launch sites, the idea being to prevent both outgoing and incoming organisms at a lake. The problem, just like any lunch ever consumed in history, is that these are not free. There is a cost borne by individual boaters, in the form of time. And there is a cost associated with operating these stations, which may be passed on to boaters in the form of a mandatory usage fee.

boatwash

The question becomes, then, how will boaters react in the face of such a cost, and how will their behaviours affect the efficacy of the policy?

To look at this question, I extended the original behavioural model to incorporate responses to management action. Boaters may choose a substitute lake to visit, redistributing their trips, or they may simply decide to stay home, reducing the total number of trips taken. Or they may do some combination of both. The observation model employs a binomial distribution of trip outcomes to estimate the behavioural parameters.

In order to estimate the behavioural responses, I conducted something that economists call ‘counterfactual’ analysis. This is a fancy word for ‘what if’ scenarios. After having identified which lakes they visited, I presented boaters with a hypothetical situation in which there were a mandatory cleaning procedure at one of the lakes which they indicated visiting. They were then asked how many times they would have used that lake, and how many times they would have used another lake which they had also identified visiting.

As always, using simulation I checked that my observation model was able to recapture the true parameters, given a sample of the same form and size as that which I would observe from the survey, with the expected amount of uncertainty around those estimates.

Once we have estimated the behavioural model parameters from the survey responses, we can start doing things like scenario analysis, wherein we impose any number of strategic interventions to see how the predicted rates of spread will respond.

scenario_analysis

So, I’ve outlined my case for how I think that we should approach ecological problems within the domain of management decisions, and I’ve given a few examples from my own work of this general methodology in action. Hopefully I’ve peeked your interest in this framework, and hopefully we’ll have lots to discuss, and potentially boisterously argue about over beers.

What does R do? Bring people together, of course!

Last night we had a great meet up of the Montreal R User Group. I got things started with a little presentation asking the question “What does R do?” (slides). I made the presentation using Montreal R User Group member Ramnath Vaidyanathan’s Slidify package. Slidify allows you to generate rather handsome HTML5 slides directly using R markdown.

We were then treated to a great workshop by Etienne Low-Decarie. He gave us a flyover of some of the most powerful R packages for wrangling data, namely plyr, reshape and ggplot.

Here are Etienne’s slides.

You can also follow along with the code posted here.

We met a lot of people who are doing very cool things using R. I’m looking forward to our next meetup!

I’m no shutterbug – drop me a note if you came and have any better pictures.

We had clementines!

Also, thanks to Notman House for hosting us. The haunted house feeling wasn’t enough to scare off this hardy group of data geeks.

Variable probability Bernoulli outcomes – Fast and Slow

I am working on a project that requires the generation of Bernoulli outcomes. Typically, I would go about this using the built in sample() function like so:

sample(1:0,n,prob=c(p,1-p),replace=TRUE)

This works great and is fast, even for large n. Problem is, I want to generate each sample with its own unique probability. Seems straightforward enough: I just wrapped the function and vectorized it to allow the passing of a vector of p.

binomial_sampler<-function(p){
  return(sample(1:0,1,prob=c(p,1-p)))
}
bs<-Vectorize(binomial_sampler)

Naming this function bs() turned out to be rather prophetic. Nevertheless, I can call this function by passing my unique vector of outcome probabilities. And indeed I get the result I'm looking for.

bs(my_p_vec)

Problem is, this turns out to be very slow. It would seem that there is quite a bit of overhead to calling sample() for one sample at a time. R's RNGs are very fast for generating many iid samples, so I started thinking like my old C++ programming self and tried a different approach.

Nbs<-function(p)
{
  U<-runif(length(p),0,1)
  outcomes<-U<p
  return(outcomes)
}

I call the new version Nbs for "New Bernoulli Sampler", or "Not Bull Shit". And what a difference it made indeed!

library(rbenchmark)
p<-runif(1000)
res <- benchmark(bs(p), Nbs(p))
print(res)
test replications elapsed relative user.self sys.self user.child sys.child
2 Nbs(p)          100   0.007        1     0.008    0.000          0         0
1  bs(p)          100   1.099      157     1.080    0.016          0         0

157x faster! Now that's a speedup to write home about.
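Worth noting as a follow-up: base R's rbinom() is also vectorized over prob, so the same variable-probability Bernoulli draws are available as a built-in one-liner:

# one Bernoulli outcome per element of p, each with its own success probability
rbinom(length(p), size = 1, prob = p)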

Dan "The Man" Bernoulli