January 23, 2016

A probabilistic justification to carpe diem

By Corey Chivers ¶ Posted in decision making, health care, Probability, Teaching, Uncategorized, uncertainty ¶ Tagged monte carlo, python, risk, simulation, statistics ¶ 3 Comments

There’s a curious thing about unlikely independent events: no matter how rare, they’re most likely to happen right away.

Let’s get hypothetical

You’ve taken a bet that pays off if you guess the exact date of the next occurrence of a rare event (p = 0.0001 on any given day i.i.d). What day do you choose? In other words, what is the most likely day for this rare event to occur?

Setting aside for now why in the world you’ve taken such a silly sounding bet, it would seem as though a reasonable way to think about it would be to ask: what is the expected number of days until the event? That must be the best bet, right?

We can work out the expected number of days quite easily as 1/p = 10000. So using the logic of expectation, we would choose day 10000 as our bet.

Let’s simulate to see how often we would win with this strategy. We’ll simulate the outcomes by flipping a weighted coin until it comes out heads. We’ll do this 1,000,000 times and record how many flips it took each time.
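Here is a minimal sketch of that experiment in Python. It is not the code behind the figures: the function name, the seed, and the use of inverse-transform sampling (rather than literally flipping a coin day by day) are assumptions made for illustration. The run count of one million is also an assumption, chosen because at p = 0.0001 that is roughly the number of runs at which about 100 first-day events would be expected.

```python
import math
import random
from collections import Counter


def days_until_event(p, rng):
    """Day on which a daily i.i.d. event with probability p first occurs."""
    # Inverse-transform draw from a geometric distribution on 1, 2, 3, ...
    # This has the same distribution as flipping a weighted coin once per
    # day until it comes up heads, without looping over every single day.
    u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is always defined
    return int(math.log(u) / math.log1p(-p)) + 1


p = 0.0001
n_sims = 1_000_000          # illustrative choice, see the note above
rng = random.Random(2016)   # fixed seed so the run is repeatable

days = [days_until_event(p, rng) for _ in range(n_sims)]
counts = Counter(days)

print("mean days until event:", sum(days) / n_sims)  # should be close to 1/p
print("runs ending on day 1:", counts[1])
print("runs ending on day 10000:", counts[10000])
```

Plotting a histogram of `days` should give the same right-skewed shape as the figure: most of the mass sits at small day numbers, even though the mean is near 10,000.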

[Figure: histogram of simulated time to event, p = 0.0001]

The event occurred on day 10,000 exactly 35 times. However, if we look at a histogram of our simulation experiment, we can see that the time it took for the rare event to happen was more often short than long. In fact, the event occurred 103 times on the very first flip (the most common time to event in our set)!

So from the experiment it would seem that the most likely amount of time to pass until the rare event occurs is 0. Maybe our hypothetical event was just not rare enough. Let’s try it again with p = 0.000001, an event with a 1 in 1 million chance of occurring each day.

[Figure: histogram of simulated time to event for the rarer event]

While now our event is extremely unlikely to occur, it’s still most likely to occur right away.
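The pattern in both simulations follows from the distribution of the waiting time. As a short worked derivation (the symbol T, for the day on which the event first occurs, is notation introduced here and not part of the original post):

```latex
P(T = k) = p\,(1 - p)^{k-1}, \qquad k = 1, 2, 3, \dots

\frac{P(T = k + 1)}{P(T = k)} = 1 - p < 1
\quad\Longrightarrow\quad
\arg\max_k P(T = k) = 1

\mathbb{E}[T] = \frac{1}{p}, \qquad
\frac{P(T = 1/p)}{P(T = 1)} = (1 - p)^{1/p - 1} \approx e^{-1} \approx 0.37
```

Each day is slightly less likely than the one before it, so the first day is always the single most likely one. For small p, the expected day 1/p is only about 37% as likely as day 1, which is roughly the ratio of 35 to 103 seen in the first simulation.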

Existential Risk

What does this all have to do with seizing the day? Everything we do in a given day comes with some degree of risk. The Stanford professor Ronald A. Howard conceived of a way of measuring the riskiness of various day-to-day activities, which he termed the micromort. One micromort is a unit of risk equal to p = 0.000001 (1 in a million chance) of death. We are all subject to a baseline level of risk in micromorts, and additional activities may add or subtract from that level (skiing, for instance, adds 0.7 micromorts per day).

While minimizing the risks we assume in our day-to-day lives can increase our expected life span, the most likely exact day of our demise is always our next one. So carpe diem!!

Post Script:

Don’t get too freaked out by all of this. It’s just a bit of fun that comes from viewing the problem in a very specific way. That is, as a question of which exact day is most likely. The much more natural way to view it is to ask: what is the relative probability of the unlikely event occurring tomorrow vs. any other day but tomorrow? I leave it to the reader to confirm that for events with p < 0.5, the latter is always more likely.

November 23, 2013

Simudidactic

By Corey Chivers ¶ Posted in Machine learning, Probability, Rstats, Teaching, uncertainty ¶ Tagged autodidact, computation, computational model, learning, modelling, physical system, programming, randomization, representation, simulation ¶ 2 Comments

auto·di·dact n.
A self-taught person.
From Greek autodidaktos, self-taught : auto-, auto- + didaktos, taught;

sim·u·late v.
To create a representation or model of (a physical system or particular situation, for example).
From Latin simulāre, simulāt-, from similis, like;

=
(If you can get past the mixing of Latin and Greek roots)

sim·u·di·dactic adj.
To learn by creating a representation or model of a physical system or particular situation. Particularly, using in silico computation to understand complex systems and phenomena.

———————————————————————

This concept has been floating around in my head for a little while. I’ve written before on how I believe that simulation can be used to improve one’s understanding of just about anything, but have never had a nice shorthand for this process.

Simudidactic inquiry is the process of understanding aspects of the world by abstracting them into a computational model, then conducting experiments in this model world by changing the underlying properties and parameters. In this way, one can ask questions like:

What type of observations might we make if x were true?
If my model of the process is accurate, can I recapture the underlying parameters given the type of observations I can make in the real world? How often will I be wrong?
Will I be able to distinguish between competing models given the observations I can make in the real world?

In addition to being able to ask these types of questions, the simudidact solidifies their understanding of the model by actually building it.

So go on, get simudidactic and learn via simulation!

March 2, 2013

Predictive Ecology and Management Decisions Under Uncertainty

By Corey Chivers ¶ Posted in decision making, ecology, Probability, Uncategorized, uncertainty ¶ Tagged biology graduate students, decisions, ecology, graduate student experience, invasions, predictions, probabilistic forecasts, research, risk, science, simulation, uncertainty ¶ 3 Comments

In February, I was honoured with the inaugural McGill Biology Graduate Students Association Organismal Seminar Award. This is the talk I gave on uncertainty and predictions in ecological management. I have inserted selected visuals, but you can get the complete slides which accompanied the talk here.

———————————————————————————————————–

Predictive invasion ecology and management decisions under uncertainty

When I was told that I had been selected to give this talk I spent a rather long time thinking about how one goes about preparing a presentation of this type. The typical graduate student experience presenting at a conference is that we are expected to detail a single piece of research to a relatively small audience of fellow researchers who, it is hoped, have at least a tertiary familiarity with whatever esoteric domain we’ve been recently slaving away in. The format usually goes: Introduction – methods – results – conclusion – next speaker. However, it seemed to me that if I was going to ask you all to come out and listen during what would otherwise be a prime manuscript writing Thursday afternoon, I was going to have to do something a little different, and I hope, more interesting.

I’ve titled the talk ‘Predictive invasion ecology and management decisions under uncertainty’, and indeed that is what I’ll be talking about. However, I’ve decided to structure this talk as something more akin to an argumentative essay. And the argument that I am going to be making has to do with how, as I see it, we should be doing ecology as it pertains to informing management and policy decisions.

By this I mean, how do we go from data, which is often sparse or limited, and theory, which is our (also limited) abstracted understanding of ecological processes, to making inference, predictions, and ultimately, to help inform decisions. I am going to lay out, and advocate, a framework for how I think that we ought to go about this. I’ll outline a sort of methodological recipe for doing this, and then I’ll go through how I have applied this general methodological approach in my own research. First on measuring the impacts of invasive species (specifically forest pests), then on various aspects of the spread, and management, of aquatic invasives.

The paper that I gave to Carly to send around as recommended reading (which I’m sure you all diligently read), is this one. This is one of the first papers that I came across when I started graduate work here at McGill. Clark and colleagues published this paper in Science back in 2001, but its message is just as important today, if not more so. In it, the authors argue that if we are going to make decisions about what to do to conserve biological diversity, ecosystem function, and the valuable ecosystem services provided to us by nature, then we will need to make forecasts about what we expect to happen in the future. I was struck by something that Justin Travis said last week during his talk about the idea of engaging in ‘ecological meteorology’, which might bring to mind something like this.

When we think of forecasting, we usually think about the best guess at what some future state of nature is going to be. When I speak of forecasts, however, I mean something a little bit different than what is communicated by weather forecasters. When I talk about forecasting, I mean specifically placing probability distributions over the range of possible future outcomes, and in that sense, a forecast is not a single prediction about a future state of nature, but rather is a way to communicate uncertainty about the future. And indeed, our level of uncertainty may well be quite high. It was the Danish physicist Niels Bohr who once said:

Prediction is very difficult, especially about the future.

–Niels Bohr, Danish physicist (1885-1962)

Now, the temptation can be to suggest that because this is true, we should not act. Many will call for ‘further study’, or cite ‘insufficient data available’ as a reason to delay decisions. Of course, we can always collect more data, and there is always room to further refine and develop our theory and models. And yes, indeed, we ought to continue to do both of these things. However, the approach of not making a decision has been thought of as being a precautionary one, on the basis that action can have unintended consequences. I want to outline why I believe that this is not the correct way to think about things.

Not making a decision is, itself a decision!
As such, in a changing world, doing nothing can also have unintended consequences!

Delaying decision making in the hope that things will stay the same may very well have the unintended consequence that they will not. By doing away with this we can focus on the problem at hand. And that is: How do we make the best use of the data that we have to make predictions which take into account (potentially large) inherent uncertainties?

That is, how do we use all of the information that we have available to us to make probabilistic predictions, with their many sources of uncertainty, about the things that we are interested in? For instance, in my own work, these are things like: How much damage are non-native forest insects causing, and how much damage should we expect to be caused by the next one? At current import rates, what is the probability of a new fish species invading via the aquarium trade? Or, after initial introduction, how is a novel invasive likely to spread across the landscape? And how is that spread likely to be altered in the presence of targeted mitigation strategies?

The framework that I am advocating approaches all of these problems in a similar way. At this point, for those of you who know me, you’re probably making a guess about where I’m going here.

Here again is Jim Clark, this time outlining all the wonderful things that the Bayesian can do. Bayes is indeed a useful tool for doing inference and prediction while accounting for uncertainty and variability. However, I am not going to argue here specifically for the use of Bayesian methods (though they can help). In fact, in my own work I have used both classical maximum likelihood and Bayesian methods to get the job done. Instead, I am arguing for the use of a broader framework. The important elements of this framework are that we use the information (data) that we have available to us to the maximum extent possible while taking account of our uncertainties in the form of probability distributions. It turns out that Bayesian methods are great at this. They do, however, suffer from other problems, not the least of which can be the computational costs of estimating models.

Let’s start outlining a procedure for this general methodological framework which I am proposing. The first step is data.

—> What information do I have? What can I go out and observe? <—

Some may suggest starting with theory, but I would argue that if the theory or models don’t make predictions about observable, quantifiable phenomena, then we’re done before we even get started. So, start with data. In ecology, we often have observational data (as opposed to experimental data). From the information (or data) that we have, we can then start asking the question:

—> What are the hypothesized processes which generated these data? <—

This is where our ecological understanding comes in, in the form of models. Models are our abstractions of what we believe to be the key processes involved in some real world phenomenon. These abstractions of real world processes, by definition, contain only a subset of the real components which are involved. Therefore, something we need to always keep in mind was best said by George Box:

All models are wrong, but some are useful.

–George E.P. Box.

This is a quote that I am sure many of you are familiar with, but I find it to be a very useful one to remind myself of from time to time. Any model or theory that we have of the world will be incomplete. The challenge we face is to build our abstractions of real world processes such that they contain the main relevant features of the phenomena that we are interested in, and produce predictions about observable quantities in the real world.

Once we have formalized our theories about how we think that our observable data is being generated into one or more models, the next step is to build an artificial world and in that world, to simulate the data generating process from your models.

By this I mean stochastic simulation – the in silico use of random number generation to probabilistically mimic the behaviour of individuals or populations. Random number generation is really the best thing since sliced bread. It can be used as a tool to generate and test theories and as a platform to conduct thought experiments about complex systems.

This is a step that I believe is most often skipped in studies with data. Many studies go straight from formulating a model to confronting that model with data. Instead, the next step should be to confront your model, or models, with the pseudo-data, which is in the exact same form as your real world data.

I have found this step to be critically important as a safeguard against faulty logic. If you’re anything like me, you will have many of these moments when building and testing models.

A rather nefarious thing about models is that even broken ones can produce seemingly reasonable predictions when confronted with data. The point is that simulation provides a test suite (a massive laboratory right in your computer) for experimenting and testing your logic before confronting models with data.

Once we have gone around this loop a few (or many) times, only then do we go ahead and test it out on our real data. At this point you may still need to go back to the theoretical drawing board, but hopefully you’ve ironed out all the kinks in your logic. It is here that you estimate model parameters and distinguish between competing models.

You are now ready to start testing hypotheses about how a system may behave if altered, and to make probabilistic forecasts about things you care about. Here, as a bonus, you’ve already set up a simulation model of your system which you can now use to push forward in time under any given scenario or intervention you may be considering.

It is then possible to run all kinds of analyses of risk based on our probabilistic forecasts in order to help make decisions about management that will maximize our probability of getting a desired outcome.

I have presented this as a general framework that I think is applicable to a broad range of ecological problems as a way to link data and theory to make inference, predictions and inform decisions under uncertainty. What I want to do now is to give you a look at how I’ve used this general methodological framework in my own research, which is specific to invasive species.

The first project that I worked on when I arrived for graduate study at McGill was on estimating the impacts of non-native forest insects in the US. I don’t know how I swung this one, but no sooner had I arrived than Brian was bringing me down to California to work with an amazing interdisciplinary team of forest ecologists, entomologists, and resource economists on this project at NCEAS.

First a bit about invasive forest insects. We know that while international trade provides many economic benefits, there are also many externalities that are not taken into consideration. When we move massive amounts of stuff around the globe, we inevitably also move species, and some of these species can become problematic in their adoptive landscape. We know that some of these species are causing large ecological and economic damages, but we don’t really have any estimates of what the total damages are amounting to. Without any estimate, these damages can not be incorporated into the cost of trade, nor can we make informed decisions about what costs we would be willing to pay to avoid them.

Given that there are many non-native species being introduced and only a fraction of those are highly damaging, we also wanted to get an estimate of the probability of seeing another high impact pest under current import levels and phytosanitary legislation. Further, since the mode by which forest pests are being transported is highly associated with the feeding guild of the insects, we wanted to get separate estimates grouped by guild. And since economic impact is not likely to be evenly distributed, we set out to determine who is paying for these impacts.

So, what information did we have? Not much, really. The entomologists set out to compile a list of all known non-native forest pests, and to identify a short list of those species that had been documented to have caused some noticeable damage. While we would have liked them to have done all of the species, the lazy economists agreed to do a national scale, full cost analysis of the most damaging pest in each of the three feeding guilds (borers, sap suckers, and foliage feeders). And they did this analysis across three broad economic sectors (costs borne to government, households, and to the lumber market).

The information which we had then, for each guild and economic sector, was a count of the number of species causing very little damage (most of them), the number of species causing some intermediate level of damage (a few), and one estimate of the damage caused by the most damaging species.

How do we go from this information to an estimate of the frequency distribution of costs across all species? We assumed from what we know about there being many innocuous species, fewer intermediate species and only one real ‘poster’ species in each guild, that the distribution was likely to be concave across most of its range and that the frequency of species causing infinite damage should asymptote to zero.

There are only a handful of functional forms that meet our asymptotic and concavity assumptions, and fitting these distributions using maximum likelihood would be fairly straightforward if we had a cost estimate for each species. Unfortunately, what we have are frequencies of species in different cost ranges.

The solution to this is to integrate the curve in these ranges and use that integral in our likelihood function.
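To make the idea concrete, here is a minimal sketch of a binned likelihood in Python. It uses a Weibull curve only because its integral has a simple closed form. The cost ranges, parameter values, sample size, and grid search are all hypothetical choices for illustration, and the sketch ignores the additional point estimate for the most damaging species that the real analysis had available.

```python
import math
import random


def weibull_cdf(x, shape, scale):
    """Integral of the Weibull density from 0 to x."""
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    return 1.0 - math.exp(-((x / scale) ** shape))


def binned_log_likelihood(counts, edges, shape, scale):
    """Multinomial log-likelihood (up to a constant) of counts per cost range."""
    total = 0.0
    for n, lo, hi in zip(counts, edges[:-1], edges[1:]):
        # Probability of a range = integral of the curve over that range.
        prob = weibull_cdf(hi, shape, scale) - weibull_cdf(lo, shape, scale)
        if n > 0:
            if prob <= 0.0:
                return float("-inf")
            total += n * math.log(prob)
    return total


# Pseudo-data: simulate individual costs from known parameters, then keep
# only the counts per range, mimicking the form of the real data.
rng = random.Random(1)
true_shape, true_scale = 0.5, 10.0        # hypothetical generating values
edges = [0.0, 1.0, 100.0, float("inf")]   # hypothetical cost ranges
costs = [rng.weibullvariate(true_scale, true_shape) for _ in range(2000)]
counts = [sum(lo <= c < hi for c in costs)
          for lo, hi in zip(edges[:-1], edges[1:])]

# Crude grid search for the maximum likelihood estimate.
shape_grid = [0.20 + 0.01 * i for i in range(131)]
scale_grid = [10 ** (-1 + 0.02 * j) for j in range(151)]
_, shape_hat, scale_hat = max(
    (binned_log_likelihood(counts, edges, k, s), k, s)
    for k in shape_grid
    for s in scale_grid
)

print("counts per range:", counts)
print("estimated shape and scale:", round(shape_hat, 2), round(scale_hat, 2))
```

Because the generating values are known, the estimates can be compared directly against them, which is the same check the simulation step of the framework asks for.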

So to recap, we have counts of species in each of three cost categories. From that information we want to determine which functional form is likely to have generated that observed data. The parameters of each functional form themselves are uncertain, but we want to estimate a probability distribution of which values are more likely. And from there, with the distribution of possible distributions in hand, we can extract any number of useful quantities, for instance, the total cost of all species, the expected cost of a new species, or the probability that a new species will be a highly damaging one.

Okay, nice system, but will it work? To find out if our scheme would work, we simulated from each of the models, actually generating each data point, then collected ‘pseudo-data’ in the form which we actually had in real life, and went ahead to see if we could, in fact get back both the generating model, and the generating parameters.

The great thing about simulations is that you know what the model and parameters are that generated your simulated data. So when you apply your estimation procedures, you can assess how well you’ve done. So here we had four possible generating functional forms (power, log-normal, gamma, Weibull), and what I check for is how often the estimating procedure is able to correctly identify the generating model. While some models were less distinct from others, we were in fact able to recapture the generating model.

We also check to see that we can recapture the parameter values and any derived quantities of interest. Again the panels are the four models, and in green are samples from the posterior distribution of parameter values with the generating value shown in red. Shown here is only one instance, but what we actually did was to re-simulate thousands of times to determine whether the generating parameter values were distributed as random draws from the posterior distribution. This is the right check because, for any one instance, we wouldn’t expect the posterior distribution to be centred exactly on the true value, but rather that the true value would be a random draw from the posterior distribution.
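A toy version of this repeated check can be sketched as follows. It swaps in a deliberately simple model (a binomial likelihood with a uniform prior, so the posterior is a known Beta distribution) rather than the cost curves used in the study, and it draws the generating value from the prior on each replicate. Both choices, along with the function name and sample sizes, are assumptions made for illustration.

```python
import random


def calibration_ranks(n_reps, n_trials, n_draws, seed=0):
    """Rank of each generating value among draws from its own posterior."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_reps):
        theta = rng.betavariate(1, 1)  # generating value, drawn from the prior
        # Pseudo-data: number of successes in n_trials Bernoulli trials.
        k = sum(rng.random() < theta for _ in range(n_trials))
        # Conjugate posterior for a binomial likelihood and Beta(1, 1) prior.
        draws = [rng.betavariate(1 + k, 1 + n_trials - k)
                 for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))
    return ranks


ranks = calibration_ranks(n_reps=2000, n_trials=20, n_draws=99)

# With 99 draws the rank takes values 0..99, and a well-calibrated procedure
# gives ranks that are uniform over that range. Ranks 10..89 correspond to
# the central 80% of the posterior.
inside = sum(10 <= r <= 89 for r in ranks) / len(ranks)
print("share of generating values in the central 80%:", inside)
```

If the estimation procedure is sound, that share should sit near 0.8. A value well below it suggests posteriors that are too narrow or biased, which is the kind of fault this check is meant to expose before real data are involved.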

Also shown in these plots as a color ramp is the total cost associated with a curve having those parameter values. We also checked to ensure that there was no bias in our estimations of our derived quantities of interest.

I’m showing here the procedure that worked, however this took some doing to get right – highlighting the importance of testing models via simulation before sending out in prime-time.

So we found that the bulk of the cost of invasive forest insects was being placed on local governments, and that it was borers doing all the damage at around 1.7 billion USD per annum. We also found that the single most damaging pest in each guild was responsible for between 25 and 50% of the total impacts, and that at the current rates of introduction, there is about a 32% chance that another highly damaging pest will occur in the next 10 years.

Staying on the theme of imports, I want to move now to the live fish trade, and specifically invasion risk posed by aquarium fish. This is some work that is forthcoming in Diversity and Distributions which I did with lab mate Johanna Bradie, wherein we had some fairly high resolution data on fish species that were being imported to Canada. We had records of the number of individuals at the species level that were reported at customs, and we wanted to know ‘given only this information, can we estimate the risk of establishment?’.

So we know that, due to species specific traits such as life history and environmental tolerances, not all species are equally likely to establish. However, what we wondered was, in the absence of this information, can we estimate an overall pathway level risk? So if you look at the plot, what I am showing is that the relationship between propagule pressure and probability of establishment is not likely to be the same across different species. But the question is ‘how much does this matter in determining a risk estimate in the absence of species specific information?’

What we showed using simulation was that in most cases, it was possible to estimate a pathway level establishment curve, without significant bias. The only place where it particularly failed was when the distribution of q values, which is essentially a measure of environmental suitability, was strongly bimodal. Or, in other words, where there are two very different types of species represented in the import pool.

In short we found that at an import level of 100,000 individuals, there was a pathway level establishment risk of 19%, and that importing 1 million individuals leads to just under a 1 in 2 chance of establishment for a species chosen at random from the import pool.

These studies so far have looked at invasions at very broad spatial scales, but now I want to drill down a little and talk about the process of species spread across a landscape once a local population has been established. Specifically, I investigate the way in which human behaviour mediates the dispersal of aquatic invasive species at the landscape level, and how these patterns of dispersal interact with the population dynamics of the invading species to affect the spatial patterns of spread.

We’ve known for a while now that freshwater species are transported from lake to lake primarily by hitchhiking to recreational vessels like this one here. If we are going to try to make predictions about how species will spread, it stands to reason that we should probably try to understand the behaviour of the humans who are pulling these boats around. There have been two main models of boater behaviour proposed in the literature, the gravity model, which makes use of an analogy to the gravitational ‘pull’ of large bodies, in this case lakes, and the random utility model, which models boaters as rational utility maximizers. But at their core, what both of these models do is to try to estimate which lakes a given boater is likely to visit. That is, they both try to estimate a boater’s probability distribution over lakes. Essentially, they are just alternative functional forms of the same decision process.
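To illustrate what "alternative functional forms of the same decision process" can mean, here is a minimal sketch of both models as probability distributions over lakes. The talk does not give the equations, so the specific forms below (a size-over-distance gravity weight and a multinomial-logit utility), the parameter names, and the three example lakes are all assumptions made for illustration.

```python
import math


def gravity_probs(sizes, distances, size_exp, dist_exp):
    """Gravity-style choice: pull grows with lake size, decays with distance."""
    weights = [s ** size_exp / d ** dist_exp
               for s, d in zip(sizes, distances)]
    total = sum(weights)
    return [w / total for w in weights]


def rum_probs(sizes, distances, beta_size, beta_dist):
    """Random-utility (multinomial logit) choice from a linear utility."""
    utilities = [beta_size * math.log(s) - beta_dist * d
                 for s, d in zip(sizes, distances)]
    top = max(utilities)  # subtract the max for numerical stability
    expu = [math.exp(u - top) for u in utilities]
    total = sum(expu)
    return [e / total for e in expu]


# Three hypothetical lakes as seen by one boater: area (ha) and distance (km).
sizes = [500.0, 50.0, 5.0]
distances = [40.0, 15.0, 5.0]

gm = gravity_probs(sizes, distances, size_exp=1.0, dist_exp=2.0)
rum = rum_probs(sizes, distances, beta_size=1.0, beta_dist=0.1)

print("gravity model:", [round(p, 3) for p in gm])
print("random utility model:", [round(p, 3) for p in rum])
```

Both functions take the same covariates and return a distribution that sums to one. They differ only in how size and distance are traded off against each other, which is what makes it possible to compare them on the same survey data.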

I wanted to observe boater choice in order to predict how propagules are likely to flow between lakes. To do this I conducted an online survey. In the survey I asked boaters to identify which lakes they visited during the boating season, as well as how many times they visited each lake. Respondents were able to identify their responses on an interactive map, and were able to reminisce about all the great fish they caught that summer while they did it (they were doing it in December).

We ended up with 510 respondents, 146 of whom indicated that they visited multiple lakes.

In keeping with the methodology that I have outlined, I of course simulated some boaters making trip taking decisions in a simulated landscape. I had simulations with the boaters behaving according to each of the two behavioural models and in each simulation, I had those boaters fill out a simulated survey just like the one I had online. By collecting this pseudo-data, I then went and looked at whether I could recapture the parameter values underlying the data.

Shown here are the results of that experiment, with the 1:1 line indicating perfect agreement between the actual and predicted values of each of the four parameters for each model. For the same set of simulations, I also checked to see whether I could correctly identify when the generating model was the gravity model, or the RUM.

So what were the results of the survey (the real, flesh and blood survey)?

Well, for the boaters I surveyed in Ontario, the gravity model functional form fit the observed data considerably better. This table just shows the delta AIC values for each of the models tested. Each value is relative to the best model with increasing values indicating decreasing fit.
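For readers unfamiliar with the table's metric, the sketch below shows how delta AIC values are computed from fitted log-likelihoods. The log-likelihood numbers are made up for illustration and are not the survey results. The parameter count of four per model follows the description of the simulation experiment above.

```python
def delta_aic(models):
    """Map model name -> AIC difference from the best (lowest-AIC) model.

    `models` maps a name to (maximised log-likelihood, number of parameters).
    """
    aic = {name: 2 * k - 2 * loglik for name, (loglik, k) in models.items()}
    best = min(aic.values())
    return {name: value - best for name, value in aic.items()}


# Made-up log-likelihoods, four parameters per model.
fits = {"gravity": (-1250.0, 4), "RUM": (-1275.0, 4)}
deltas = delta_aic(fits)

for name, delta in sorted(deltas.items(), key=lambda item: item[1]):
    print(name, delta)
```

The best model always gets a value of zero, and with equal parameter counts the difference is simply twice the gap in log-likelihood.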

Okay, so that’s maybe mildly interesting. But it only matters insofar as it has an effect on our predictions about spreading invasions. It turns out that it does.

Because both models were fit using the same data and covariates, they both predicted the same amount of overall boater traffic between lakes, and hence the same net connectivity. The feature in which they differed, however, was in how that connectivity was distributed across the landscape.

We can measure this difference in connectivity distribution using a simple evenness measure like Shannon entropy. High entropy means a very even distribution, whereas low entropy means that the distribution of connectivity is more highly concentrated around few edges of the network. This connectivity distribution particularly matters when the spreading population experiences an Allee effect in its population dynamics. That is, when populations are disproportionately likely to go extinct at low population sizes, whether due to mate limitation in the case of sexually reproducing species, or other forms of density dependent growth. The reason this matters is that in dispersal networks with high entropy connectivity distributions, emigrants are not arriving in sufficient numbers to overcome the demographic barriers to establishment. There is a diluting effect in this situation, whereas in low entropy connectivity networks there are sufficient ‘hub’ patches providing emigrants in sufficient numbers to overcome initial barriers to growth.
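As a small worked example of the evenness measure, the sketch below computes Shannon entropy for two hypothetical sets of outgoing connections from a single lake. Both sets carry the same total traffic and differ only in how it is spread across edges. The numbers and the function name are illustrative assumptions.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (in nats) of a set of non-negative edge weights."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)


# Same total traffic (40 trips) spread over four edges in two different ways.
even = [10, 10, 10, 10]
concentrated = [37, 1, 1, 1]

h_even = shannon_entropy(even)
h_concentrated = shannon_entropy(concentrated)

print("even network entropy:", round(h_even, 3))
print("concentrated network entropy:", round(h_concentrated, 3))
```

The even network reaches the maximum possible entropy for four edges, ln(4), while the concentrated one scores much lower. With an Allee threshold of, say, 20 arrivals, only the concentrated network delivers enough emigrants along any single edge.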

We can see this happening when we look at rates of spread in the GM and RUM cases. As Allee effects get stronger (going from left to right), the rates of spread in the early phases are much lower under the evenly distributed RUM dispersal network than under the more concentrated edge distribution of the gravity model.

While this is an important finding in the field of invasion dynamics, I think that this has implications more broadly and could be applied when looking at, instead of a population which we would like to suppress, a conservation setting where we are looking to preserve habitat patch connectivity.

But for now, back to populations which we are trying to suppress. I’ve shown so far how we can model the spread of aquatic invasives via their primary vector, which is recreational boaters. But ultimately, we want to know where they are likely to spread so that ideally, we can get out in front of them and actually do something about it.

One measure that has been proposed in some districts is to place mandatory cleaning stations at strategically selected launch sites, the idea being to prevent organisms from both leaving and entering a lake. The problem, just like any lunch ever consumed in history, is that these are not free. There is a cost borne by the individual boaters, in the form of time. And there is a cost associated with operating these stations, which may be passed on to boaters in the form of a mandatory usage fee.

The question becomes, then, how will boaters react in the face of such a cost, and how will their behaviours affect the efficacy of the policy?

To look at this question, I extended the original behavioral model to incorporate responses to management action. Boaters may choose a substitute lake to visit, redistributing their trips, or they may simply decide to stay home, reducing the total number of trips taken. Or they may do some combination of both. The observation model employs a binomial distribution of trip outcomes to estimate the behavioral parameters.

In order to estimate the behavioural responses, I conducted something that economists call ‘counterfactual’ analysis. This is a fancy word for ‘what if’ scenarios. After they had identified which lakes they visited, I presented boaters with a hypothetical situation in which there was a mandatory cleaning procedure at one of the lakes which they indicated visiting. They were then asked how many times they would have used that lake, and how many times they would have used another lake which they had also identified visiting.

As always, using simulation I checked that my observation model was able to recapture the true parameters, given a sample of the same form and size as that which I would observe from the survey, with the expected amount of uncertainty around those estimates.
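A stripped-down version of that check is sketched below. It collapses the behavioural model to a single assumed parameter, the probability that a trip to the managed lake is still taken, and treats each boater's answer as a binomial outcome. That simplification, the range of trips per boater, and the function names are assumptions for illustration. The sample size of 146 is the number of multi-lake respondents reported above.

```python
import math
import random


def simulate_survey(n_boaters, retain_prob, rng):
    """Pseudo-survey: (trips reported, trips still taken under the policy)."""
    responses = []
    for _ in range(n_boaters):
        trips = rng.randint(1, 12)  # hypothetical range of trips per season
        kept = sum(rng.random() < retain_prob for _ in range(trips))
        responses.append((trips, kept))
    return responses


def estimate_retention(responses):
    """Binomial maximum likelihood estimate and its approximate standard error."""
    total_trips = sum(t for t, _ in responses)
    total_kept = sum(k for _, k in responses)
    p_hat = total_kept / total_trips
    se = math.sqrt(p_hat * (1.0 - p_hat) / total_trips)
    return p_hat, se


rng = random.Random(42)
true_retain = 0.6  # hypothetical generating value
estimates, covered = [], 0
n_reps = 500
for _ in range(n_reps):
    p_hat, se = estimate_retention(simulate_survey(146, true_retain, rng))
    estimates.append(p_hat)
    # Does the approximate 95% interval contain the generating value?
    covered += (p_hat - 1.96 * se) <= true_retain <= (p_hat + 1.96 * se)

mean_estimate = sum(estimates) / n_reps
coverage = covered / n_reps
print("mean estimate:", round(mean_estimate, 3))
print("share of 95% intervals covering the truth:", coverage)
```

Recovering the generating value on average, with intervals that cover it about as often as advertised, is what "the expected amount of uncertainty" amounts to in this simplified setting.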

Once we have estimated the behavioural model parameters from the survey responses, we can start doing things like scenario analysis, wherein we impose any number of strategic interventions to see how the predicted rates of spread will respond.

So, I’ve outlined my case for how I think that we should approach ecological problems within the domain of management decisions, and I’ve given a few examples from my own work of this general methodology in action. Hopefully I’ve piqued your interest in this framework, and hopefully we’ll have lots to discuss, and potentially boisterously argue about, over beers.

November 24, 2012

Simulating weak gravitational lensing

By Corey Chivers ¶ Posted in Machine learning, physics, Probability, Rstats ¶ Tagged cosmology, dark matter, gravitational lensing, kaggle, physics, shear, simulation, statistical simulation ¶ 7 Comments

In the search for dark matter, I have been having mixed success. It seems that locating DM in single halo skies is a fairly straightforward problem. However, when there is more than one halo, things get quite a bit trickier.

As I have advocated many times before, including here and here, simulation can provide deep insights into many (if not all) problems. I never trust my own understanding of a complicated problem until I have simulated from a hypothesized model of that problem. Once I have a simulation in place, I can then test out all kinds of hypotheses about the system by manipulating the component parts. I think of this process as a kind of computer-assisted set of thought experiments.

So, when I was hitting a wall with the dark matter challenge, I of course turned to simulation for insights. Normally this would have been my very first step; however, in this case my level of understanding of the physics involved was insufficient when I started out. After having done a bit of reading on the topic, I built a model which implements a weak lensing regime on an otherwise random background of galaxies. The model assumes an Einasto profile of dark matter mass density, with parameters A and α determining the strength of the tangential shearing caused by foreground dark matter.

A=0.2, alpha=0.5

I can then increase the strength of the lens by either increasing the mass of the dark matter, or by varying the parameters of the Einasto profile.

A=0.059, alpha=0.5

A=0.03, alpha=0.5

You can check out this visualization over a range of A values.

I can also see how two halos interact in terms of the induced tangential ellipticity profile by simulating two halos and then moving them closer to one another.

You can see the effect here. You get the idea – I can also try out any combination of configurations, shapes, and strengths of interacting halos. I can then analyse the characteristics of the resulting observable factors (in this case, galaxy location and ellipticities) in order to build a better predictive model.

Unfortunately, since this is a competition with cold hard cash on the line, I am not releasing the source for this simulation at this time. I will, however, open source the whole thing when the competition ends.

November 1, 2012

Variable probability Bernoulli outcomes – Fast and Slow

By Corey Chivers ¶ Posted in Probability, Rstats, Uncategorized ¶ Tagged bernoulli trials, c++, code optimization, computer, iid, rbenchmark, rng, sampling, simulation, software, uniform random number ¶ 6 Comments

I am working on a project that requires the generation of Bernoulli outcomes. Typically, I would go about this using the built in sample() function like so:

sample(1:0,n,prob=c(p,1-p),replace=TRUE)

This works great and is fast, even for large n. Problem is, I want to generate each sample with its own unique probability. Seems straightforward enough: I just wrapped the function and vectorized it to allow passing a vector of p.

binomial_sampler<-function(p){
  return(sample(1:0,1,prob=c(p,1-p)))
}
bs<-Vectorize(binomial_sampler)

Naming this function bs() turned out to be rather prophetic. Nevertheless, I can call this function by passing my unique vector of outcome probabilities. And indeed I get the result I’m looking for.

bs(my_p_vec)

Problem is, this turns out to be very slow. It would seem that there is quite a bit of overhead to calling sample() for one sample at a time. R’s RNGs are very fast for generating many iid samples, so I started thinking like my old C++ programming self and tried a different approach.

Nbs<-function(p)
{
  U<-runif(length(p),0,1)
  outcomes<-U<p
  return(outcomes)
}

I call the new version Nbs for “New Bernoulli Sampler”, or “Not Bull Shit”. And what a difference it made indeed!

library(rbenchmark)
p<-runif(1000)
res <- benchmark(bs(p), Nbs(p))
print(res)
    test replications elapsed relative user.self sys.self user.child sys.child
2 Nbs(p)          100   0.007        1     0.008    0.000          0         0
1  bs(p)          100   1.099      157     1.080    0.016          0         0

157x faster! Now that’s a speedup to write home about.
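For completeness, base R’s rbinom() is itself vectorised over its prob argument, so the same kind of draw can be written in one call. This is offered as an alternative sketch rather than as part of the benchmark above; I have not timed it against Nbs() here.

```r
## Alternative sketch: one Bernoulli draw per element of p using rbinom(),
## which recycles/vectorises over the prob argument.
set.seed(42)
p <- runif(1000)                                  # one probability per outcome
outcomes <- rbinom(length(p), size = 1, prob = p) # 0/1 outcomes

## Degenerate probabilities behave as expected
edge <- rbinom(2, size = 1, prob = c(0, 1))
```

Note that rbinom() returns 0/1 values like bs(), whereas Nbs() returns a logical vector; wrap its result in as.integer() if 0/1 values are needed downstream.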

Dan “The Man” Bernoulli

August 10, 2012

Simulation: The modeller’s laboratory

By Corey Chivers ¶ Posted in Probability, Rstats ¶ Tagged experiment, gravity model, invasive species, model selection, random utility model, simulation, spread ¶ 6 Comments

In his 2004 paper in Trends in Ecology and Evolution, Steven Peck argues:

Simulation models can be used to mimic complex systems, but unlike nature, can be manipulated in ways that would be impossible, too costly or unethical to do in natural systems. Simulation can add to theory development and testing, can offer hypotheses about the way the world works and can give guidance as to which data are most important to gather experimentally.

A sentiment I agree with fully. However, another important use of simulation is in the experimentation phase of model development before confronting models with data. In ecology, epidemiology and related fields, it is common to have observational data (as opposed to controlled, randomized experiments). In these situations, two questions need to be asked:

1. What can we observe about the system? (i.e., what will the data look like?)
2. Given what we can observe (1), will our model(s) be able to capture the underlying process and parameters?

In order to answer these questions, we need to be able to simulate the hypothesized processes and use the simulated observations to fit our model(s). This process has the additional benefit of forcing us to understand the process that we are modelling. In fact, I find that the act of formalizing the hypothesized process into a coded simulation makes the formulation of the likelihood function more straightforward.

Let’s look at an example. In a recent paper in the Journal of Applied Ecology, we model the trip taking behaviour of recreational boaters in Ontario. Through a survey, we observed the trip outcomes of a sample of boaters. We wanted to compare two models of the trip taking process (see the paper for model descriptions).


###############################
##### RUM v GM sim tests ######
##
## Corey Chivers, 2012
##
###############################

n_boaters<-50
n_lakes<-750
n_trips<-rpois(n_boaters,16)

euclidian<-function(from_x,from_y,to_x,to_y)
{
return( sqrt( (from_x-to_x)^2 + (from_y-to_y)^2 ) )
}
sim_boaters<-function(N=n_boaters)
{
##Home locations
return(as.matrix(cbind(runif(N,0,100),runif(N,0,100)) ))
}
sim_lakes<-function(N=n_lakes)
{
x<-cbind(runif(N,0,100))
y<-cbind(runif(N,0,100))
size<-abs(rnorm(N,x+y,x+y))
return(cbind(x,y,size))
}
get_d_mat<-function()
{
## matrix of boaterxlake distances
d_mat<-array(dim=c(n_boaters,n_lakes))
for(b in 1:n_boaters)
{
d_mat[b,]<-euclidian( boater[b,1],boater[b,2],lakes[,1],lakes[,2] )
}
return(d_mat)
}
sim_trips<-function(par,M)
{
trips<-list()
for(b in 1:n_boaters)
{
P<-M(par,b)
trips[[b]]<-sample(1:n_lakes,n_trips[b],prob=P,replace=TRUE)
}
return(trips)
}

RUM<-function(par,b) ## par{b1,b2}
{
exp_sum_V<-sum(exp(par[1]*lakes[,3]+par[2]*d_mat[b,]) )
V<-exp(par[1]*lakes[,3]+par[2]*d_mat[b,])
p_j<-V/exp_sum_V
return(p_j)
}
GM<-function(par,b) ## par{e,d}
{
Ai<-sum(lakes[,3]^(par[1]) * d_mat[b,]^(-par[2]))
WD<-lakes[,3]^(par[1]) * d_mat[b,]^(-par[2])
p_j<-WD/Ai
return(p_j)
}
ll<-function(par,M,data,give_neg=-1)
{
ll<-0
ll<-sapply(1:n_boaters,function(b){
P_b<-M(par,b)
l_tmp<-0
for(t in 1:length(data[[b]]))
l_tmp<-l_tmp+log(P_b[ data[[b]][t] ] )
return(l_tmp)
})
return(give_neg*sum(ll))
}
dAIC<-function(ll)
{
delta_aic<-numeric(length(ll[,1]))
for(i in 1:length(ll[,1]))
{
delta_aic[i]<-(2*ll[i,2]+(2*n_par))-(2*ll[i,1]+(2*n_par))
}
return(delta_aic)
}
plot_trips<-function(col='green',add_trips=TRUE,pdffile=NULL)
{
if(!is.null(pdffile))
pdf(pdffile)

par(mgp=c(0.5,0,0),cex=1.2,mfrow=c(1,2),mar=c(1,2,1,0.5),pty='s')
col=rgb(0,0,1,0.05)
plot(lakes[,1],lakes[,2],cex=lakes[,3]/200, xlab='Lon', ylab='Lat',xaxt='n',yaxt='n')
if(add_trips)
{
for(i in 1:n_boaters)
{
for(tr in 1:length(trips[[i]]))
{
## draw one segment per trip so that repeat visits overlay and darken
segments(boater[i,1],boater[i,2],lakes[trips[[i]][tr],1],lakes[trips[[i]][tr],2],col=col,lwd=2)
}
points(boater[i,1],boater[i,2],pch=15,col='blue',cex=0.5)
}
}

### Make a legend ###
plot(1,1,col='white', xlab='', ylab='',xaxt='n',yaxt='n',xlim=c(0,1),ylim=c(0,1))
yloc<-c(0.9,0.8,0.7,0.6)
n_tr<-c(5,10,15,20)
for(i in 1:4)
{
for(n in 1:n_tr[i])
segments(0.1,yloc[i],0.5,yloc[i],col=col,lwd=3)
text(0.6,yloc[i],n_tr[i])
}
text(0.82,0.75,"Number\nof trips")

yloc<-c(0.4,0.3,0.2,0.1)
l_size<-c(200,400,600,800)
for(i in 1:4)
{
for(n in 1:n_tr[i])
points(0.3,yloc[i],cex=l_size[i]/200)
text(0.5,yloc[i],l_size[i])
}
text(0.7,0.25,"Size\nof lake")

if(!is.null(pdffile))
dev.off()
}

With all of the relevant functions defined, we then simulate behaviour under each model, fit to each model and compare our ability to both capture the generating parameter values and to distinguish between the competing models.

boater<-sim_boaters()
lakes<-sim_lakes()
d_mat<-get_d_mat()

n_sims<-1000
n_par<-2

save_par<-array(dim=c(n_sims,2*n_par))
ll_save<-array(dim=c(n_sims,2))

### RUM GENERATING ###
for(i in 1:n_sims)
{
lakes<-sim_lakes()
d_mat<-get_d_mat()
## pars{b1,b2}
pars<-c(runif(1,0.01,0.1),runif(1,-10.5,-1.5))
trips<-sim_trips(pars,RUM)
opt<-optim(pars,ll,M=RUM,data=trips)

save_par[i,]<-c(pars,opt$par)

pars<-c(0.5,1) #seeds (GM has two parameters: e,d)
opt2<-optim(pars,ll,M=GM,data=trips)

ll_save[i,]<-c(opt$value,opt2$value)
print(i)
}
par(mfrow=c(1,2),pty='s')
plot(ll_save)
abline(0,1,lty=2)

hist(dAIC(ll_save))
ll_saveRUM<-ll_save
save_parRUM<-save_par

par(mfrow=c(2,2),pty='s')
param_names<-c('B1','B2')
for(i in 1:n_par)
{
plot(save_parRUM[,i],save_parRUM[,i+n_par],xlab=paste('Generating ',param_names[i]),ylab=paste('Fit ',param_names[i]))
abline(0,1)
}

### GM GENERATING ###
for(i in 1:n_sims)
{
lakes<-sim_lakes()
d_mat<-get_d_mat()
## par{e,d}
e<-runif(1,0.1,0.9)
d<-runif(1,1,5)
pars<-c(e,d)
trips<-sim_trips(pars,GM)
opt<-optim(pars,ll,M=GM,data=trips)

save_par[i,]<-c(pars,opt$par)

pars<-c(0.001,-2) #seeds
opt2<-optim(pars,ll,M=RUM,data=trips)

ll_save[i,]<-c(opt$value,opt2$value)
print(i)
}

dev.new() ## open a new plotting window (portable alternative to x11())
par(mfrow=c(1,2),pty='s')
plot(ll_save)
abline(0,1,lty=2)

hist(dAIC(ll_save))
ll_saveGM<-ll_save
save_parGM<-save_par

par(mfrow=c(2,2),pty='s')
param_names<-c('e','d')
for(i in 1:n_par)
{
plot(save_parGM[,i],save_parGM[,i+n_par],xlab=paste('Generating ',param_names[i]),ylab=paste('Fit ',param_names[i]))
abline(0,1)
}

### Plot deltaAIC distributions ###
par(mfrow=c(1,2),pty='s')
hist(dAIC(ll_saveGM),breaks=30)
hist(dAIC(ll_saveRUM),breaks=30)

plot_trips()

After verifying that we can both distinguish between the two models, and can recapture the generating parameters, we then use simulation, in the manner described by Peck (2004), to analyse the implications of using alternative models of boater behaviour on the spread of invasive species.
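One way to put a number on "can distinguish between the two models" is the proportion of simulations in which the generating model has the lower AIC. The helper below is a hypothetical addition, not part of the paper's code. It assumes a matrix laid out like ll_saveRUM and ll_saveGM in the listing above: negative log-likelihoods, with column 1 holding the generating model and column 2 the competitor. The toy matrix is made up purely to show the call.

```r
## Hypothetical helper: how often does AIC prefer the generating model?
## nll: matrix of negative log-likelihoods (col 1 = generating model,
##      col 2 = competing model); k: number of parameters in each model.
prop_correct <- function(nll, k = c(2, 2)) {
  aic <- cbind(2 * nll[, 1] + 2 * k[1],
               2 * nll[, 2] + 2 * k[2])
  ## Positive difference means the generating model has the lower AIC
  mean(aic[, 2] - aic[, 1] > 0)
}

## Made-up values for illustration only (not simulation results)
toy <- cbind(c(100, 120, 90, 110),
             c(105, 118, 95, 130))
prop_correct(toy)
```

After running the simulations above, prop_correct(ll_saveRUM) and prop_correct(ll_saveGM) would summarise the two histograms of delta AIC as single proportions.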

Insights derived from simulation need to be complemented with controlled, randomized experiments. However, the use of simulation as a laboratory for the statistical modeller is indispensable, and you can do it all without putting a white coat on – unless you’re into that kind of thing.

June 11, 2012

Simulating Euro 2012

By Corey Chivers ¶ Posted in Probability, Rstats ¶ Tagged conditional probability, ELO, euro cup, euro2012, football, gaming, in silico, monte carlo, probabilistic forecasts, probability statements, simulation, soccer, sports, uefa euro ¶ 8 Comments

Why settle for just one realisation of this year’s UEFA Euro when you can let the tournament play out 10,000 times in silico?

Since I already had some code lying around from my submission to the Kaggle-hosted 2010 Take on the Quants challenge, I figured I’d recycle it for the Euro this year. The model takes a simulation-based approach, using Poisson-distributed goals. The rate parameters in a given match are determined by the relative strength of each team as given by their Elo rating.

The advantage of this approach is that the tournament structure (which teams are assigned to which groups, how points are assigned, quarter-final structure, etc.) can be accounted for. To take this into account analytically when making probabilistic forecasts of tournament outcomes, we would need to calculate many conditional probability statements, following the extremely large number of possible outcomes. The downside, however, is that the forecast is only as reliable as the team ratings themselves. In my submission to Kaggle, I used a weighted average of Elo and FIFA ratings.

After simulating the tournament 10,000 times, the probability of victory for each team is just the number of times that team emerged victorious divided by 10,000.
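To make the single-match building block concrete, here is a minimal sketch. It is not the code linked from this post. The Elo expected-score formula is the standard one, but the way it is turned into Poisson scoring rates (match_rates(), base_rate, and the split of goals in proportion to expected score) is an assumption made purely for illustration, as are the two ratings used in the example call.

```r
## Sketch of one simulated fixture with Poisson-distributed goals.
## The rating-to-rate mapping below is a hypothetical choice.
set.seed(2012)

match_rates <- function(elo_a, elo_b, base_rate = 1.3, scale = 400) {
  ## Standard Elo expected score for team A
  w <- 1 / (1 + 10^((elo_b - elo_a) / scale))
  ## Assumed: total expected goals 2*base_rate, shared in proportion to w
  c(a = 2 * base_rate * w, b = 2 * base_rate * (1 - w))
}

sim_match <- function(elo_a, elo_b, n = 10000) {
  lam     <- match_rates(elo_a, elo_b)
  goals_a <- rpois(n, lam[["a"]])
  goals_b <- rpois(n, lam[["b"]])
  ## Outcome probabilities are just relative frequencies over the n replays
  c(win_a = mean(goals_a > goals_b),
    draw  = mean(goals_a == goals_b),
    win_b = mean(goals_a < goals_b))
}

res <- sim_match(2000, 1800)  # illustrative ratings, not real team values
res
```

A full tournament simulation repeats this for every fixture, applies the group and knockout rules, and tallies the winner over all replays.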

Feel free to get the code here and play around with it yourself. You can use your own rating system and see how the predicted outcomes change. If you are of a certain disposition, you might even find it more fun than the human version of the tournament itself!

April 24, 2012

Insights into Quantile Regression from Arthur Charpentier

By Corey Chivers ¶ Posted in Probability, Rstats ¶ Tagged hurricane intensity, hurricanes, quantile regression, regression coefficient, risk, rstats, science, simulation ¶ 2 Comments

At this Monday’s Montreal R User Group meeting, Arthur Charpentier gave an interesting talk on the subject of quantile regression.

One of the main messages I took away from the workshop was that quantile regression can be used to determine if extreme events are becoming more extreme. The example given was hurricane intensity since 1978. It may be that the average intensity is not increasing and therefore a standard linear regression would show no trend (since linear regression predicts expected, or mean values), but that’s not really what we are interested in anyway. If we are going to formulate proper risk models, what we want to know is whether the strong hurricanes are getting stronger. This is where quantile regression comes in.

I always find that the best way for me to check my understanding is to simulate some data and check to see that things are behaving the way I expect them to. The advantage of doing this rather than just playing with data is that you know what the real process is, since you defined it. To get a handle on this stuff, I simulated some non-Gaussian (gamma distributed) data to mimic the hurricane data. I set it up so that the mean intensity stays constant across years while the variance increases steadily over time, such that the intense (simulated) hurricanes get more intense over time.

## Simulate some non-Gaussian data with constant mean
## and increasing variance
n_i<-80
d<-array(dim=c(n_i*20,2))
for(i in 1:20)
{
d[((i-1)*n_i+1):(i*n_i),2]<-rgamma(n_i,i,i)
d[((i-1)*n_i+1):(i*n_i),1]<-21-i
}
plot(d)

I then followed the procedure suggested by Arthur, which is to conduct quantile regressions across the quantile range (0,1). The results can be plotted as quantile vs the regression coefficient in order to see the magnitude and direction of the relationship across the quantile range.

## Run quantile regression on the simulated data
## across a range of quantiles
library(quantreg) # provides rq()
u=seq(.025,.975,by=.01)

coefstd=function(u) summary(rq(d[,2]~d[,1],tau=u))$coefficients[,2]
coefest=function(u) summary(rq(d[,2]~d[,1],tau=u))$coefficients[,1]

CS=Vectorize(coefstd)(u)
CE=Vectorize(coefest)(u)

## Plot the results
k=2
plot(u,CE[k,],type='l',xlab=expression(tau),ylab='Coefficient')
polygon(c(u,rev(u)),c(CE[k,]+1.96*CS[k,],rev(CE[k,]-1.96*CS[k,])),col='grey')
lines(u,CE[k,])

So, the coefficient seems to be an increasing function of the quantile (tau). But how do we interpret this? Low-intensity (simulated) storms are becoming less intense, given that the regression coefficient at low quantiles is negative. More importantly, however, the high-intensity (simulated) storms are becoming more intense. We can see this by noting that the regression coefficients in the high quantile range are positive and increasing.
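Because the generating process is known, the sign pattern can also be checked without fitting anything. The simulation draws from rgamma(n_i, i, i) at x = 21 - i, so the mean is fixed at 1 while the spread grows with x. The sketch below simply tabulates the theoretical 5% and 95% quantiles with qgamma(); the choice of those two quantile levels is mine.

```r
## Theoretical quantiles of the simulated process, ordered by x.
shape <- 1:20            # shape = rate = i in the simulation above
x     <- 21 - shape      # the predictor used in the regression

q_low  <- qgamma(0.05, shape = shape, rate = shape)
q_high <- qgamma(0.95, shape = shape, rate = shape)

ord <- order(x)
quantile_table <- cbind(x = x[ord], q05 = q_low[ord], q95 = q_high[ord])
quantile_table
```

Reading down the table, the lower quantile falls and the upper quantile rises as x increases, which is the negative-then-positive coefficient pattern seen in the quantile regression plot.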

Another way to visualise the quantile regression results is by animating the regressions together to see how the relationship changes across the quantiles (tau).

Click the image to see the animated GIF.

Arthur has a prettier animation of this type using the actual hurricane data here.

March 14, 2012

π Day Special! Estimating π using Monte Carlo

By Corey Chivers ¶ Posted in Probability, Rstats, Teaching ¶ Tagged integration, monte carlo, pi, pi day, rstats, simulation ¶ 5 Comments

In honour of π day (03.14 – can’t wait until 2015~), I thought I’d share this little script I wrote a while back for an introductory lesson I gave on using Monte Carlo methods for integration.

The concept is simple – we can estimate the area of an object which is inside another object of known area by drawing many points at random in the larger area and counting how many of those land inside the smaller one. The ratio of this count to the total number of points drawn will approximate the ratio of the areas as the number of points grows large.

If we do this with a circle of diameter 1 (radius 0.5) inscribed in a unit square, the ratio of the areas is π/4, so we can re-arrange our area estimate to yield an estimate of π!

This R script lets us see this Monte Carlo routine in action:

##############################################
### Monte Carlo Simulation estimation of pi ##
## Author: Corey Chivers                    ##
##############################################

rm(list=ls())

options(digits=4)

## initialize ##
N=500 # Number of MC points
points <- data.frame(x=numeric(N),y=numeric(N))
pi_est <- numeric(N)
inner <-0
outer <-0

## BUILD Circle ##
circle <- data.frame(x=1:360,y=1:360)

for(i in 1:360)
{
circle$x[i] <-0.5+cos(i/180*pi)*0.5
circle$y[i] <-0.5+sin(i/180*pi)*0.5
}

## SIMULATE ##
pdf('MCpiT.pdf')

layout(matrix(c(2,3,1,1), 2, 2, byrow = TRUE))
for(i in 1:N)
{

# Draw a new point at random
points$x[i] <-runif(1)
points$y[i] <-runif(1)

# Check if the point is inside
# the circle
if( (points$x[i]-0.5)^2 + (points$y[i]-0.5)^2 > 0.25 )
{
outer=outer+1
}else
{
inner=inner+1
}

current_pi<-(inner/(outer+inner))/(0.25)
pi_est[i]= current_pi
print(current_pi)

par(mar = c(5, 4, 4, 2),pty='m')
plot(pi_est[1:i],type='l',
main=i,col="blue",ylim=c(0,5),
lwd=2,xlab="# of points drawn",ylab="estimate")
# Draw true pi for reference
abline(pi,0,col="red",lwd=2)

par(mar = c(1, 4, 4, 1),pty='s')
plot(points$x[1:i],points$y[1:i],
col="red",
main=c('Estimate of pi: ',formatC(current_pi, digits=4, format="g", flag="#")),
cex=0.5,pch=19,ylab='',xlab='',xlim=c(0,1),ylim=c(0,1))
lines(circle$x,circle$y,lwd=4,col="blue")
frame() #blank

}
dev.off()
##############################################
##############################################

The resulting plot (multi-page pdf) lets us watch the estimate of π converge toward the true value.

At 500 sample points, I got an estimate of 3.122 – not super great. If you want to give your computer a workout, you can ramp up the number of iterations (N) and see how close your estimate can get. It should be noted that this is not an efficient way of estimating π, but rather a nice and simple example of how Monte Carlo can be used for integration.
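If the goal is just the number rather than the animation, the same estimator can be written without the loop or the plotting, which makes large N cheap. This is a sketch of the identical calculation (points in the unit square, circle of radius 0.5 centred at (0.5, 0.5), area ratio π/4); the seed, the value of N, and the standard error line are my additions.

```r
## Vectorised version of the same Monte Carlo estimate of pi.
set.seed(314)
N <- 1e5                       # number of random points (illustrative choice)
x <- runif(N)
y <- runif(N)

## TRUE when the point lands inside the inscribed circle
inside <- (x - 0.5)^2 + (y - 0.5)^2 <= 0.25

pi_hat <- 4 * mean(inside)     # area ratio is pi/4, so multiply by 4
## Binomial standard error of the estimate
mc_se  <- 4 * sqrt(mean(inside) * (1 - mean(inside)) / N)
c(estimate = pi_hat, std_error = mc_se)
```

The standard error shrinks with the square root of N, which is why this is a nice teaching example but a slow route to many digits.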

In the lesson, before showing the simulation, I started by having students pair up and manually draw points, plot them, and calculate their own estimate.

If you use this in your classroom, drop me a note and let me know how it went!

bayesianbiologist

Corey Chivers on P(A|B) ∝P(B|A)P(A)
