Predictive Ecology and Management Decisions Under Uncertainty

In February, I was honoured with the inaugural McGill Biology Graduate Students Association Organismal Seminar Award. This is the talk I gave on uncertainty and predictions in ecological management. I have inserted selected visuals, but you can get the complete slides which accompanied the talk here.


Predictive invasion ecology and management decisions under uncertainty

When I was told that I had been selected to give this talk, I spent a rather long time thinking about how one goes about preparing a presentation of this type. The typical graduate student experience presenting at a conference is that we are expected to detail a single piece of research to a relatively small audience of fellow researchers who, it is hoped, have at least a passing familiarity with whatever esoteric domain we've been recently slaving away in. The format usually goes: introduction – methods – results – conclusion – next speaker. However, it seemed to me that if I was going to ask you all to come out and listen during what would otherwise be a prime manuscript-writing Thursday afternoon, I was going to have to do something a little different and, I hope, more interesting.

I’ve titled the talk ‘Predictive invasion ecology and management decisions under uncertainty’, and indeed that is what I’ll be talking about. However, I’ve decided to structure this talk as something more akin to an argumentative essay. And the argument that I am going to be making has to do with how, as I see it, we should be doing ecology as it pertains to informing management and policy decisions.

By this I mean, how do we go from data, which is often sparse or limited, and theory, which is our (also limited) abstracted understanding of ecological processes, to making inference, predictions, and ultimately, to help inform decisions. I am going to lay out, and advocate, a framework for how I think that we ought to go about this. I’ll outline a sort of methodological recipe for doing this, and then I’ll go through how I have applied this general methodological approach in my own research. First on measuring the impacts of invasive species (specifically forest pests), then on various aspects of the spread, and management, of aquatic invasives.

The paper that I gave to Carly to send around as recommended reading (which I'm sure you all diligently read), is this one. This is one of the first papers that I came across when I started graduate work here at McGill. Clark and colleagues published this paper in Science back in 2001, but its message is just as, if not more, important today. In it, the authors argue that if we are going to make decisions about what to do to conserve biological diversity, ecosystem function, and the valuable ecosystem services provided to us by nature, then we will need to make forecasts about what we expect to happen in the future. I was struck by something that Justin Travis said last week during his talk about the idea of engaging in 'ecological meteorology'. Which might bring to mind something like this.


When we think of forecasting, we usually think about the best guess at what some future state of nature is going to be. When I speak of forecasts, however, I mean something a little bit different than what is communicated by weather forecasters. When I talk about forecasting, I mean specifically placing probability distributions over the range of possible future outcomes, and in that sense, a forecast is not a single prediction about a future state of nature, but rather is a way to communicate uncertainty about the future. And indeed, our level of uncertainty may well be quite high. It was the Danish physicist Niels Bohr, who once said:

Prediction is very difficult, especially about the future.

–Niels Bohr, Danish physicist (1885-1962)

Now, the temptation can be to suggest that because this is true, we should not act. Many will call for 'further study', or cite 'insufficient data' as a reason to delay decisions. Of course, we can always collect more data, and there is always room to further refine and develop our theory and models. And yes, indeed, we ought to continue to do both of these things. However, the approach of not making a decision has been thought of as a precautionary one, on the basis that action can have unintended consequences. I want to outline why I believe that this is not the correct way to think about things.


  1. Not making a decision is, itself a decision!
  2. As such, in a changing world, doing nothing can also have unintended consequences!

Delaying decision making in the hope that things will stay the same may very well have the unintended consequence that they will not. Once we set aside the notion that inaction is the safe default, we can focus on the problem at hand. And that is: how do we make the best use of the data that we have to make predictions which take into account (potentially large) inherent uncertainties?

That is, how do we use all of the information that we have available to us to make probabilistic predictions, with their many sources of uncertainty, about the things that we are interested in? For instance, in my own work, these are things like: How much damage are non-native forest insects causing, and how much damage should we expect to be caused by the next one? At current import rates, what is the probability of a new fish species invading via the aquarium trade? Or, after initial introduction, how is a novel invasive likely to spread across the landscape? And how is that spread likely to be altered in the presence of targeted mitigation strategies?

The framework that I am advocating approaches all of these problems in a similar way. At this point, for those of you who know me, you’re probably making a guess about where I’m going here.


Here again is Jim Clark, this time outlining all the wonderful things that the Bayesian can do. Bayes is indeed a useful tool for doing inference and prediction while accounting for uncertainty and variability. However, I am not going to argue here specifically for the use of Bayesian methods (though they can help). In fact, in my own work I have used both classical maximum likelihood and Bayesian methods to get the job done. Instead, I am arguing for the use of a broader framework. The important elements of this framework are that we use the information (data) available to us to the maximum extent possible, while accounting for our uncertainties in the form of probability distributions. It turns out that Bayesian methods are great at this. They do, however, suffer from other problems, not the least of which can be the computational cost of estimating models.

Let’s start outlining a procedure for this general methodological framework which I am proposing. The first step is data.

—> What information do I have? What can I go out and observe? <—

Some may suggest starting with theory, but I would argue that if the theory or models don't make predictions about observable, quantifiable phenomena, then we're done before we even get started. So, start with data. In ecology, we often have observational data (as opposed to experimental data). From the information (or data) that we have, we can then start asking the question:

—> What are the hypothesized processes which generated these data? <—


This is where our ecological understanding comes in in the form of models. Models are our abstractions of what we believe to be the key processes involved in some real world phenomenon. These abstractions of real world processes, by definition contain only a subset of the real components which are involved. Therefore, something we need to always keep in mind was best said by George Box:

All models are wrong, but some are useful.

–George E.P. Box.

This is a quote that I am sure many of you are familiar with, but I find it to be a very useful one to remind myself of from time to time. Any model or theory that we have of the world will be incomplete. The challenge we face is to build our abstractions of real world processes such that they contain the main relevant features of the phenomena that we are interested in, and produce predictions about observable quantities in the real world.

Once we have formalized our theories about how we think that our observable data is being generated into one or more models, the next step is to build an artificial world and in that world, to simulate the data generating process from your models.

By this I mean stochastic simulation – the in silico use of random number generation to probabilistically mimic the behaviour of individuals or populations. Random number generation is really the best thing since sliced bread. It can be used as a tool to generate and test theories and as a platform to conduct thought experiments about complex systems.


This is a step that I believe is most often skipped in studies with data. Many studies go straight from formulating a model to confronting that model with data. Instead, the next step should be to confront your model, or models, with the pseudo-data, which is in the exact same form as your real world data.

I have found this step to be critically important as a safeguard against faulty logic. If you’re anything like me, you will have many of these moments when building and testing models.

A rather nefarious thing about models is that even broken ones can produce seemingly reasonable predictions when confronted with data. The point is that simulation provides a test suite (a massive laboratory right in your computer) for experimenting and testing your logic before confronting models with data.

Once we have gone around this loop a few (or many) times, only then do we go ahead and test it out on our real data. At this point you may still need to go back to the theoretical drawing board, but hopefully you've ironed out all the kinks in your logic. It is here that you estimate model parameters and distinguish between competing models.

You are now ready to start testing hypotheses about how a system may behave if altered, and to make probabilistic forecasts about things you care about. Here, as a bonus, you’ve already set up a simulation model of your system which you can now use to push forward in time under any given scenario or intervention you may be considering.


It is then possible to run all kinds of analyses of risk based on our probabilistic forecasts in order to help make decisions about management that will maximize our probability of getting a desired outcome.

I have presented this as a general framework that I think is applicable to a broad range of ecological problems, as a way to link data and theory in order to make inferences and predictions, and to inform decisions under uncertainty. What I want to do now is to give you a look at how I've used this general methodological framework in my own research, which is specific to invasive species.

The first project that I worked on when I arrived for graduate study at McGill was on estimating the impacts of non-native forest insects in the US. I don't know how I swung this one, but no sooner had I arrived than Brian was bringing me down to California to work with an amazing interdisciplinary team of forest ecologists, entomologists, and resource economists on this project at NCEAS.

First, a bit about invasive forest insects. We know that while international trade provides many economic benefits, there are also many externalities that are not taken into consideration. When we move massive amounts of stuff around the globe, we inevitably also move species, and some of these species can become problematic in their adoptive landscape. We know that some of these species are causing large ecological and economic damages, but we don't really have any estimates of what the total damages amount to. Without an estimate, these damages cannot be incorporated into the cost of trade, nor can we make informed decisions about what costs we would be willing to pay to avoid them.

Given that there are many non-native species being introduced and only a fraction of those are highly damaging, we also wanted to get an estimate of the probability of seeing another high impact pest under current import levels and phytosanitary legislation. Further, since the mode by which forest pests are being transported is highly associated with the feeding guild of the insects, we wanted to get separate estimates grouped by guild. And since economic impact is not likely to be evenly distributed, we set out to determine who is paying for these impacts.

So, what information did we have? Not much, really. The entomologists set out to compile a list of all known non-native forest pests, and to identify a short list of those species that had been documented to have caused some noticeable damage. While we would have liked them to cover all of the species, the lazy economists agreed to do a national-scale, full-cost analysis of the most damaging pest in each of the three feeding guilds (borers, sap suckers, and foliage feeders). And they did this analysis across three broad economic sectors (costs borne by government, by households, and by the lumber market).

The information which we had, then, for each guild and economic sector, was a count of the number of species causing very little damage (most of them), the number of species causing some intermediate level of damage (a few), and one estimate of the damage caused by the most damaging species.

How do we go from this information to an estimate of the frequency distribution of costs across all species? From what we know about there being many innocuous species, fewer intermediate species, and only one real 'poster' species in each guild, we assumed that the distribution was likely to be concave across most of its range, and that the frequency of species should asymptote to zero as damages grow large.

There are only a handful of functional forms that meet our asymptotic and concavity assumptions, so if we had an exact cost estimate for every species, fitting these distributions using maximum likelihood would be fairly straightforward. Unfortunately, what we have are frequencies of species falling within broad damage ranges.

The solution to this is to integrate the curve in these ranges and use that integral in our likelihood function.
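In practice, for standard candidate distributions, this integral is just a difference of CDF values, and the binned counts then enter a multinomial likelihood. Here is a minimal sketch in Python of that idea; the bin boundaries, the counts, and the choice of a Weibull form are all invented for illustration, not the study's actual numbers:

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical binned data: counts of species whose annual cost falls in
# each damage range (low, intermediate, high). Illustrative numbers only.
bin_edges = np.array([0.0, 1e5, 1e7, np.inf])  # cost category boundaries ($)
counts = np.array([55, 6, 1])                  # species per category

def neg_log_lik(params):
    """Multinomial likelihood whose cell probabilities come from
    integrating a Weibull density over each cost range (CDF differences)."""
    shape, scale = np.exp(params)              # log-parameterised: stay positive
    cell_p = np.diff(stats.weibull_min.cdf(bin_edges, c=shape, scale=scale))
    cell_p = np.clip(cell_p, 1e-12, 1.0)       # guard against log(0)
    return -np.sum(counts * np.log(cell_p))

res = optimize.minimize(neg_log_lik, x0=np.log([0.5, 1e4]), method="Nelder-Mead")
shape_hat, scale_hat = np.exp(res.x)
```

Swapping in another candidate form (power law, log-normal, gamma) only changes the CDF call, so comparing maximised likelihoods across forms is straightforward.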


So to recap, we have counts of species in each of three cost categories. From that information we want to determine which functional form is likely to have generated that observed data. The parameters of each functional form themselves are uncertain, but we want to estimate a probability distribution of which values are more likely. And from there, with the distribution of possible distributions in hand, we can extract any number of useful quantities, for instance, the total cost of all species, the expected cost of a new species, or the probability that a new species will be a highly damaging one.

Okay, nice scheme, but will it work? To find out, we simulated from each of the models, actually generating each data point, then collected 'pseudo-data' in the same form as the data we actually had in real life, and went ahead to see if we could, in fact, get back both the generating model and the generating parameters.

The great thing about simulations is that you know which model and parameters generated your simulated data. So when you apply your estimation procedures, you can assess how well you've done. Here we had four possible generating functional forms (power, log-normal, gamma, Weibull), and what I checked was how often the estimation procedure was able to correctly identify the generating model. While some models were less distinct from others, we were in fact able to recapture the generating model.
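Stripped to its essentials, that simulate-and-recover check looks something like the following sketch, with two candidate families instead of four and made-up parameter values:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Pseudo-data from a known generating model (gamma), refit with two
# candidate families; count how often the true family wins on likelihood.
true_model = stats.gamma(a=2.0, scale=3.0)
n_reps, n_obs = 50, 200
recovered = 0
for _ in range(n_reps):
    x = true_model.rvs(n_obs, random_state=rng)
    g_params = stats.gamma.fit(x, floc=0)       # fit candidate 1: gamma
    l_params = stats.lognorm.fit(x, floc=0)     # fit candidate 2: log-normal
    ll_gamma = stats.gamma.logpdf(x, *g_params).sum()
    ll_lnorm = stats.lognorm.logpdf(x, *l_params).sum()
    recovered += ll_gamma > ll_lnorm

print(f"generating model recovered in {recovered}/{n_reps} replicates")
```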

We also checked that we could recapture the parameter values and any derived quantities of interest. Again, the panels are the four models, and in green are samples from the posterior distribution of parameter values, with the generating value shown in red. Shown here is only one instance, but what we actually did was to re-simulate thousands of times to determine whether the generating parameter values were distributed as random draws from the posterior distribution. For any one instance we wouldn't expect the posterior distribution to be centred exactly on the true value, but rather that the true value would be a random draw from the posterior distribution.
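The logic of that check can be shown with a toy conjugate model (invented here, not the study's model): if the true parameter really behaves like a random draw from the posterior, then a central 90% credible interval should cover it about 90% of the time across repeated simulations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Normal model with known sd = 1 and a conjugate normal prior on the mean
n_reps, n_obs = 2000, 25
prior_mu, prior_sd = 0.0, 2.0
cover = 0
for _ in range(n_reps):
    theta = rng.normal(prior_mu, prior_sd)       # draw a "true" mean
    y = rng.normal(theta, 1.0, size=n_obs)       # simulate data under it
    # Conjugate posterior for the mean
    post_var = 1.0 / (1.0 / prior_sd**2 + n_obs)
    post_mu = post_var * (prior_mu / prior_sd**2 + y.sum())
    lo, hi = post_mu + np.array([-1.645, 1.645]) * np.sqrt(post_var)
    cover += lo < theta < hi

print(f"90% interval coverage: {cover / n_reps:.2f}")
```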


Also shown in these plots as a color ramp is the total cost associated with a curve having those parameter values. We also checked to ensure that there was no bias in our estimations of our derived quantities of interest.

I'm showing here the procedure that worked; however, this took some doing to get right, highlighting the importance of testing models via simulation before sending them out in prime time.

So we found that the bulk of the cost of invasive forest insects was being placed on local governments, and that it was borers doing the bulk of the damage, at around $1.7 billion USD per annum. We also found that the single most damaging pest in each guild was responsible for between 25% and 50% of the total impacts, and that at current rates of introduction there is about a 32% chance that another highly damaging pest will arrive in the next 10 years.

Staying on the theme of imports, I want to move now to the live fish trade, and specifically invasion risk posed by aquarium fish. This is some work that is forthcoming in Diversity and Distributions which I did with lab mate, Johanna Bradie wherein we had some fairly high resolution data on fish species that were being imported to Canada. We had records of the number of individuals at the species level that were reported at customs, and we wanted to know ‘given only this information, can we estimate the risk of establishment?’.

So we know that, due to species-specific traits such as life history and environmental tolerances, not all species are equally likely to establish. What we wondered, however, was whether, in the absence of this information, we can estimate an overall pathway-level risk. If you look at the plot, what I am showing is that the relationship between propagule pressure and probability of establishment is not likely to be the same across different species. But the question is: how much does this matter in determining a risk estimate in the absence of species-specific information?


What we showed using simulation was that in most cases, it was possible to estimate a pathway level establishment curve, without significant bias. The only place where it particularly failed was when the distribution of q values, which is essentially a measure of environmental suitability, was strongly bimodal. Or, in other words, where there are two very different types of species represented in the import pool.
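To make the pathway-level idea concrete, here is a toy calculation. The establishment curve p(N) = 1 - exp(-qN) is one commonly assumed form (not necessarily the paper's exact parameterisation), and the q values and import level below are invented:

```python
import numpy as np

rng = np.random.default_rng(7)

# Assumed per-species establishment curve: probability that at least one
# propagule establishes given N imported individuals, with q a
# species-specific suitability parameter.
def establish_prob(N, q):
    return 1.0 - np.exp(-q * N)

# Invented import pools: a unimodal spread of q values vs a strongly
# bimodal one (two very different kinds of species in the pool)
q_uni = rng.lognormal(mean=np.log(2e-6), sigma=0.5, size=1000)
q_bi = np.concatenate([np.full(500, 1e-8), np.full(500, 1e-5)])

N = 100_000
# Pathway-level risk: establishment probability for a species drawn at
# random from the import pool, i.e. the mean over the pool
risk_uni = establish_prob(N, q_uni).mean()
risk_bi = establish_prob(N, q_bi).mean()
```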

In short, we found that at an import level of 100,000 individuals there was a pathway-level establishment risk of 19%, and that importing 1 million individuals leads to just under a 1 in 2 chance of establishment for a species chosen at random from the import pool.

These studies so far have looked at invasions at very broad spatial scales, but now I want to drill down a little and talk about the process of species spread across a landscape once a local population has been established. Specifically, I investigate the way in which human behaviour mediates the dispersal of aquatic invasive species at the landscape level, and how these patterns of dispersal interact with the population dynamics of the invading species to affect the spatial patterns of spread.

We've known for a while now that freshwater species are transported from lake to lake primarily by hitchhiking on recreational vessels like this one here. If we are going to try to make predictions about how species will spread, it stands to reason that we should probably try to understand the behaviour of the humans who are pulling these boats around. There have been two main models of boater behaviour proposed in the literature: the gravity model, which makes use of an analogy to the gravitational 'pull' of large bodies, in this case lakes, and the random utility model (RUM), which models boaters as rational utility maximizers. But at their core, what both of these models do is to try to estimate which lakes a given boater is likely to visit. That is, they both try to estimate a boater's probability distribution over lakes. Essentially, they are just alternative functional forms of the same decision process.
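As a sketch of what I mean by 'alternative functional forms of the same decision process', here are toy versions of both models turning the same lake attributes into a probability distribution over lakes; the covariates and coefficients are made up:

```python
import numpy as np

# Toy attributes for three lakes, as seen by one boater. All numbers and
# parameter values here are invented for illustration.
W = np.array([4.0, 1.0, 2.5])      # lake attractiveness (e.g. area, amenities)
d = np.array([10.0, 3.0, 25.0])    # travel distance to each lake (km)

# Gravity model: attraction grows with W and decays as a power of distance
e = 2.0
grav = W * d**(-e)
p_gravity = grav / grav.sum()

# Random utility model (conditional logit): utility linear in covariates
beta_W, beta_d = 0.8, -0.15
utility = beta_W * W + beta_d * d
p_rum = np.exp(utility) / np.exp(utility).sum()
```

Both end up as probability distributions over the same set of lakes; they differ only in how attributes are mapped to choice probabilities.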

I wanted to observe boater choice in order to predict how propagules are likely to flow between lakes. To do this, I conducted an online survey in which I asked boaters to identify which lakes they visited during the boating season, as well as how many times they visited each lake. Respondents were able to identify their responses on an interactive map, and were able to reminisce about all the great fish they caught that summer while they did it (they were doing it in December).

We ended up with 510 respondents, 146 of whom indicated that they visited multiple lakes.

In keeping with the methodology that I have outlined, I of course simulated some boaters making trip-taking decisions in a simulated landscape. I ran simulations with the boaters behaving according to each of the two behavioural models, and in each simulation I had those boaters fill out a simulated survey just like the one I had online. By collecting this pseudo-data, I could then check whether I was able to recapture the parameter values underlying the data.

Shown here are the results of that experiment, with the 1:1 line indicating perfect agreement between the actual and predicted values of each of the four parameters for each model. For the same set of simulations, I also checked to see whether I could correctly identify when the generating model was the gravity model, or the RUM.

So what were the results of the survey (the real, flesh and blood survey)?

Well, for the boaters I surveyed in Ontario, the gravity model functional form fit the observed data considerably better. This table shows the delta AIC (ΔAIC) values for each of the models tested. Each value is relative to the best model, with larger values indicating poorer fit.
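For reference, ΔAIC is just each model's AIC minus the smallest AIC in the candidate set, so the best model scores zero. A quick sketch with made-up log-likelihoods and parameter counts:

```python
# (log-likelihood, number of parameters) per model; values are invented
models = {"gravity": (-1200.5, 4), "RUM": (-1215.2, 4)}

aic = {m: 2 * k - 2 * ll for m, (ll, k) in models.items()}
best = min(aic.values())
delta_aic = {m: a - best for m, a in aic.items()}  # 0 for the best model
```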

Okay, so that's maybe mildly interesting. But it only matters insofar as it affects our predictions about spreading invasions. It turns out that it does.

Because both models were fit using the same data and covariates, they both predicted the same amount of overall boater traffic between lakes, and hence the same net connectivity. The feature in which they differed, however, was in how that connectivity was distributed across the landscape.

We can measure this difference in connectivity distribution using a simple evenness measure like Shannon entropy. High entropy means a very even distribution, whereas low entropy means that the distribution of connectivity is more highly concentrated around a few edges of the network. This connectivity distribution particularly matters when the spreading population experiences an Allee effect in its population dynamics. That is, when populations are disproportionately likely to go extinct at low population sizes, whether due to mate limitation in the case of sexually reproducing species, or other forms of density dependent growth. The reason this matters is that in dispersal networks with high entropy connectivity distributions, emigrants are spread so evenly that they do not arrive anywhere in sufficient numbers to overcome the demographic barriers to establishment. There is a diluting effect in this situation. Whereas in low entropy connectivity networks, there are sufficient 'hub' patches receiving emigrants in large enough numbers to overcome initial barriers to growth.
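The entropy measure itself is simple. A small sketch, showing that the same total connectivity scores lower entropy when most of the traffic is concentrated on a single hub edge:

```python
import numpy as np

def shannon_entropy(weights):
    """Shannon entropy of a network's edge-weight (connectivity) distribution."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()          # normalise weights to a distribution
    p = p[p > 0]             # treat 0 * log(0) as 0
    return -np.sum(p * np.log(p))

# Same total traffic, spread evenly vs concentrated on one hub edge
even = [1.0, 1.0, 1.0, 1.0]      # RUM-like: evenly distributed connectivity
hub = [3.7, 0.1, 0.1, 0.1]       # gravity-like: one dominant edge
```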

We can see this happening when we look at rates of spread in the GM and RUM cases. As Allee effects get stronger (going from left to right), the rates of spread in the early phases are much lower under the evenly distributed RUM dispersal network than under the more concentrated edge distribution of the gravity model.

While this is an important finding in the field of invasion dynamics, I think it has broader implications and could be applied not only to populations which we would like to suppress, but also in conservation settings where we are looking to preserve habitat patch connectivity.

But for now, back to populations which we are trying to suppress. I’ve shown so far how we can model the spread of aquatic invasives via their primary vector, which is recreational boaters. But ultimately, we want to know where they are likely to spread so that ideally, we can get out in front of them and actually do something about it.

One measure that has been proposed in some districts is to place mandatory cleaning stations at strategically selected launch sites, the idea being to prevent both outgoing and incoming organisms at a lake. The problem, just like with any lunch ever consumed in history, is that these are not free. There is a cost borne by individual boaters, in the form of time. And there is a cost associated with operating these stations, which may be passed on to boaters in the form of a mandatory usage fee.


The question then becomes: how will boaters react in the face of such a cost, and how will their behaviour affect the efficacy of the policy?

To look at this question, I extended the original behavioural model to incorporate responses to management action. Boaters may choose a substitute lake to visit, redistributing their trips; they may simply decide to stay home, reducing the total number of trips taken; or they may do some combination of both. The observation model employs a binomial distribution of trip outcomes to estimate the behavioural parameters.
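A toy version of that observation model, with invented parameter values, might look like the sketch below: each of a boater's original trips to the managed lake is either retaken there, redistributed to a substitute lake, or cancelled, and the two behavioural probabilities are recovered by maximum likelihood from the binomial counts:

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(3)

# Invented behavioural parameters: a trip is retaken at the managed lake
# with prob p_stay, moved to a substitute with prob p_switch, else cancelled
p_stay_true, p_switch_true = 0.5, 0.3
n_trips = rng.integers(1, 20, size=300)         # original trips per respondent

stayed = rng.binomial(n_trips, p_stay_true)
# Of the trips not retaken, the fraction switched (vs cancelled)
switched = rng.binomial(n_trips - stayed, p_switch_true / (1 - p_stay_true))

def neg_log_lik(params):
    ps, pw = params
    if not (0 < ps < 1 and 0 < pw < 1 - ps):
        return np.inf                           # enforce valid probabilities
    ll = stats.binom.logpmf(stayed, n_trips, ps).sum()
    ll += stats.binom.logpmf(switched, n_trips - stayed, pw / (1 - ps)).sum()
    return -ll

res = optimize.minimize(neg_log_lik, x0=[0.4, 0.4], method="Nelder-Mead")
p_stay_hat, p_switch_hat = res.x
```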

In order to estimate the behavioural responses, I conducted something that economists call 'counterfactual' analysis. This is a fancy term for 'what if' scenarios. After having identified which lakes they visited, I presented boaters with a hypothetical situation in which there was a mandatory cleaning procedure at one of the lakes which they had indicated visiting. They were then asked how many times they would have used that lake, and how many times they would have used another lake which they had also identified visiting.

As always, using simulation I checked that my observation model was able to recapture the true parameters, given a sample of the same form and size as that which I would observe from the survey, with the expected amount of uncertainty around those estimates.

Once we have estimated the behavioural model parameters from the survey responses, we can start doing things like scenario analysis, wherein we impose any number of strategic interventions to see how the predicted rates of spread will respond.


So, I've outlined my case for how I think we should approach ecological problems within the domain of management decisions, and I've given a few examples from my own work of this general methodology in action. Hopefully I've piqued your interest in this framework, and hopefully we'll have lots to discuss, and potentially boisterously argue about, over beers.

Who is uncomfortable with uncertainty?

In The Economist's The World in 2013 edition, I came across a very interesting statement about forecast uncertainty, and about just who is doing all the squirming over it.

…recent reforms to the IPCC’s procedures will do little to change its tendency to focus on the areas where there is greater consensus, avoiding the uncertainties which, though unpalatable for scientists, are important to policy. (link)

What struck me about this claim is that it runs completely counter to what I’ve been told during my training as a scientist. It is the scientist, it goes, that possesses a deep understanding of uncertainty. The policy maker, on the other hand, is an oaf who wishes to hear only of black and white pronouncements about the effect of x on y. Could it be that this perception is inverted in each camp?

Certainly the scientist and policy maker each wish to decrease uncertainty. However, it ought to be that neither finds it ‘unpalatable’ in and of itself, but rather an inextricable part of our predictions about complex systems (or even the simplest ones, for that matter). Acknowledgement, understanding, and quantification of uncertainty are absolutely crucial to conducting good science as well as informing science directed policy.

Dark matter top 10, but an hour too late

Well, that’s embarrassing. A little tweak to my dark matter model resulted in a leaderboard score in the top 10. The only problem is that the contest closed about an hour ago.


I ran this final prediction earlier today but then simply forgot to go back to it and submit!! On the bright side, I learned a lot of really interesting things about gravitational lensing and had a tonne of fun doing it. I’ll probably write a post-mortem sometime in the next few days, but for now I’m just kicking myself.

Simulating weak gravitational lensing

In the search for dark matter, I have been having mixed success. It seems that locating DM in single-halo skies is a fairly straightforward problem. However, when there is more than one halo, things get quite a bit trickier.

As I have advocated many times before, including here and here, simulation can provide deep insights into many (if not all) problems. I never trust my own understanding of a complicated problem until I have simulated from a hypothesized model of that problem. Once I have a simulation in place, I can then test out all kinds of hypotheses about the system by manipulating the component parts. I think of this process as a kind of computer-assisted set of thought experiments.

So, when I was hitting a wall with the dark matter challenge, I of course turned to simulation for insights. Normally this would have been my very first step, however in this case my level of understanding of the physics involved was insufficient when I started out. After having done a bit of reading on the topic, I built a model which implements a weak lensing regime on an otherwise random background of galaxies. The model assumes an Einasto profile of dark matter mass density, with parameters A and α determining the strength of the tangential shearing caused by foreground dark matter.
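As a rough illustration of the setup (this is not the competition code, and the normalisation, units, and numbers are invented), tangential shearing of a random galaxy background under an Einasto-like strength profile can be mimicked like so:

```python
import numpy as np

rng = np.random.default_rng(0)

# Random background galaxies around a dark-matter halo at the origin
n_gal = 500
x, y = rng.uniform(-1, 1, n_gal), rng.uniform(-1, 1, n_gal)
e1, e2 = 0.05 * rng.standard_normal((2, n_gal))  # intrinsic ellipticities

# Einasto-like radial strength profile for the tangential shear
A, alpha = 0.2, 0.5
r = np.hypot(x, y)
phi = np.arctan2(y, x)
shear = A * np.exp(-(2.0 / alpha) * (r**alpha - 1.0))
shear = np.clip(shear, 0.0, 0.9)                 # keep ellipticities sensible

# Tangential shear aligns galaxies perpendicular to the radial direction
e1_obs = e1 - shear * np.cos(2 * phi)
e2_obs = e2 - shear * np.sin(2 * phi)

# Mean tangential ellipticity: positive when a lens is present
e_tan = -(e1_obs * np.cos(2 * phi) + e2_obs * np.sin(2 * phi))
```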

A=0.2, alpha=0.5

I can then increase the strength of the lens by either increasing the mass of the dark matter, or by varying the parameters of the Einasto profile.

A=0.059, alpha=0.5

A=0.03, alpha=0.5

You can check out this visualization over a range of A values.

I can also see how two halos interact in terms of the induced tangential ellipticity profile by simulating two halos and then moving them closer to one another.

You can see the effect here. You get the idea – I can also try out any combination of configurations, shapes, and strengths of interacting halos. I can then analyse the characteristics of the resulting observable factors (in this case, galaxy locations and ellipticities) in order to build a better predictive model.

Unfortunately, since this is a competition with cold hard cash on the line, I am not releasing the source for this simulation at this time. I will, however, open source the whole thing when the competition ends.

Did the sun just explode? The last Dutch Book you’ll ever make

In today’s XKCD, a pair of (presumably) physicists are told by their neutrino detector that the sun has gone nova. Problem is, the machine rolls two dice and if they both come up six it lies, otherwise it tells the truth.

The Frequentist reasons that the probability of obtaining this result if the sun had not, in fact, gone nova is 1/36 (≈0.028, p<0.05) and concludes that the sun is exploding. Goodbye cruel world.

The Bayesian sees things a little differently, and bets the Frequentist $50 that he is wrong.

Let’s set aside the obvious supremacy of the Bayesian’s position due to the fact that were he to turn out to be wrong, come the fast approaching sun-up-to-end-all-sun-ups, he would have very little use for that fifty bucks anyway.

What prior probability would we have to ascribe to the sun succumbing to cataclysmic nuclear explosion on any given night in order to take the Bayesian’s bet?

Not surprisingly, we’ll need to use Bayes!

P(S|M) = \frac{P(M|S)P(S)}{P(M|S)P(S)+P(M|\neg S)P(\neg S)}

Where M is the machine reading a solar explosion and S is the event of an actual solar explosion.

Assuming we are risk neutral and take any bet with an expected value at least equal to its cost, we will take the Bayesian’s bet if P(S|M) > 0.5. At this cutoff, the expected value is 0.5×0 + 0.5×100 = 50, so the bet is worth at least the $50 cost of entry.

The rub is that this value depends on our prior: the probability we ascribe, before consulting the machine, to global annihilation by cataclysmic solar explosion. We can set P(S|M) = 0.5 and solve for P(S) to get the threshold prior above which the bet is a good one (i.e. not a Dutch book). This turns out to be:

P(S) = \frac{P(M|\neg S)}{P(M|S)+P(M|\neg S)}, where P(S|M) = 0.5

Which is ≈0.0278: the Frequentist’s p-value!
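A quick numerical check of that threshold, plugging the dice probabilities into the formula:

```r
P_MgS  <- 35/36  # P(machine says YES | sun exploded): no double sixes
P_MgNS <- 1/36   # P(machine says YES | no explosion): double sixes, a lie
P_S <- P_MgNS / (P_MgS + P_MgNS)  # threshold prior
P_S  # 1/36

## posterior at the threshold prior is exactly 0.5
post <- (P_MgS * P_S) / (P_MgS * P_S + P_MgNS * (1 - P_S))
```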

So, assuming 1:1 payout odds on the bet, we should only take it if we already thought that there was at least a 2.7% chance that the sun would explode, before even asking the neutrino detector. From this, we can also see what odds we would be willing to take on the bet for any level of prior belief about the end of the world.


P_MgS<-35/36                    # P(Machine = YES | Sun exploded)
P_MgNS<-1/36                    # P(Machine = YES | Sun did not explode)
P_S<-seq(0,0.2,length.out=500)  # a range of priors
P_NS<-1-P_S

P_SgM<-(P_MgS*P_S)/(P_MgS*P_S + P_MgNS*P_NS)

plot(P_S,P_SgM,type='l',
ylab='P(Sun Exploded | Neutrino Machine = YES)',
xlab='P(Sun Exploded) - aka your prior')
abline(h=0.5,v=1/36,lty=2)      # mark the break-even point
text(0.018,0.2,'No\n thanks')
text(0.07,0.6,'A good bet,\n but frightful existence')


Variable probability Bernoulli outcomes – Fast and Slow

I am working on a project that requires the generation of Bernoulli outcomes. Typically, I would go about this using the built-in sample() function like so:
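A minimal sketch of such a call (n and p are illustrative values):

```r
n <- 1000
p <- 0.3  # a single, fixed probability of success
## draw n Bernoulli(p) outcomes in one vectorized call
x <- sample(c(0, 1), size = n, replace = TRUE, prob = c(1 - p, p))
```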


This works great and is fast, even for large n. Problem is, I want to generate each sample with its own unique probability. Seems straightforward enough: I just wrapped the function and vectorized it to allow the passing of a vector of p.
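One way to write that wrapper, vectorized over p:

```r
## One sample() call per outcome, applied across the probability vector
bs <- Vectorize(function(p) sample(c(0, 1), size = 1, prob = c(1 - p, p)))
```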


Naming this function bs() turned out to be rather prophetic. Nevertheless, I can call this function by passing my unique vector of outcome probabilities. And indeed I get the result I’m looking for.


Problem is, this turns out to be very slow. It would seem that there is quite a bit of overhead in calling sample() one sample at a time. R’s RNGs are very fast at generating many iid samples, so I started thinking like my old C++ programming self and tried a different approach.
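The trick is to draw all n uniform variates in a single call and threshold each against its own p. A sketch of that approach:

```r
## runif(n) < p is TRUE with probability exactly p, elementwise,
## so one vectorized RNG call replaces n calls to sample()
Nbs <- function(p) as.numeric(runif(length(p)) < p)
```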


I call the new version Nbs for “New Bernoulli Sampler”, or “Not Bull Shit”. And what a difference it made indeed!

library(rbenchmark)
res <- benchmark(bs(p), Nbs(p))
test replications elapsed relative user.self sys.self user.child sys.child
2 Nbs(p)          100   0.007        1     0.008    0.000          0         0
1  bs(p)          100   1.099      157     1.080    0.016          0         0

157x faster! Now that’s a speedup to write home about.

Dan “The Man” Bernoulli

Introduction to Bayesian lecture: Accompanying handouts and demos

I recently posted the slides from a guest lecture that I gave on Bayesian methods for biologists/ecologists. In an effort to promote active learning, the class was not a straightforward lecture, but rather a combination of informational input from me and opportunities for students to engage with the concepts via activities and discussion of demonstrations. These active components were designed to promote students’ construction of knowledge, as opposed to a passive transfer from teacher to learner.

In order to bring the online reader into closer alignment with the experience of attending the class, I have decided to provide the additional materials that I used to promote active learning.

1) Monte-Carlo activity:

In pairs, students are provided with a random number sheet and a circle plot handout:

One student is the random number generator, the other is the plotter. After students plot a few points, we collect all the data and walk through a discussion of why this works. We then scale up and take a look at the same experiment using a computer simulation to see how our estimate converges toward the correct value.
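The scaled-up version of the activity can be sketched in a few lines, assuming the classic darts-in-a-circle Monte-Carlo estimate of π:

```r
set.seed(1)
n <- 1e5
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
inside <- x^2 + y^2 <= 1      # did the point land inside the unit circle?
pi_hat <- 4 * mean(inside)    # area ratio: circle / square = pi / 4
pi_hat
```

With n = 100,000 points the estimate typically lands within a few hundredths of the true value, which is what the class sees as the simulation converges.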

2) Metropolis-Hastings in action:

In this demonstration, we walk through the steps of the MH algorithm visually.

Discussion is then facilitated around this demonstration, covering the choice of proposal distribution, autocorrelation, and convergence diagnostics.
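A minimal version of the algorithm the demonstration walks through, targeting a standard normal (the target, starting value, and proposal sd here are illustrative choices, not the classroom code):

```r
set.seed(42)
target <- function(x) dnorm(x, 0, 1)   # toy target density
n_iter <- 10000
prop_sd <- 1                           # proposal sd: a key tuning choice
chain <- numeric(n_iter)
chain[1] <- 5                          # deliberately poor starting value
for (t in 2:n_iter) {
  proposal <- rnorm(1, chain[t - 1], prop_sd)  # symmetric random walk
  accept_p <- min(1, target(proposal) / target(chain[t - 1]))
  chain[t] <- if (runif(1) < accept_p) proposal else chain[t - 1]
}
```

Plotting the chain shows the burn-in from the bad start, and varying prop_sd shows the autocorrelation trade-off discussed in class.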

I hope that you find this helpful. If you are teaching this topic in your class, feel free to borrow, and improve upon, these materials. If you do, drop me a note and let me know how it went!

Observing Dark Worlds – Visualizing dark matter’s distorting effect on galaxies

Some people like to do crossword puzzles. I like to do machine learning puzzles.

Lucky for me, a new contest was just posted yesterday on Kaggle. So naturally, my lazy Saturday was spent getting elbow deep into the data.

The training set consists of a series of ‘skies’, each containing a bunch of galaxies. Normally, these galaxies would exhibit random ellipticity. That is, if it weren’t for all that dark matter out there! The dark matter, while itself invisible (it is dark after all), tends to aggregate and do some pretty funky stuff. These aggregations of dark matter produce massive halos which bend the heck out of spacetime itself! The result is that any galaxies behind these halos (from our perspective here on earth) appear contorted around the halo.

The tricky bit is to distinguish between the background noise in the ellipticity of galaxies, and the regular effect of the dark matter halos. How hard could it be?

Step one, as always, is to have a look at what you’re working with using some visualization.

An example of the training data. This sky has 3 dark matter halos. If you squint, you can kind of see the effect on the ellipticity of the surrounding galaxies.
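What you are squinting for is usually summarized as tangential ellipticity: how strongly each galaxy’s shape aligns perpendicular to the line joining it to a candidate halo. A hypothetical helper (the formula is the standard weak-lensing one; the function and argument names are my own):

```r
## Tangential ellipticity of a galaxy about a candidate halo position.
## e1, e2 are the galaxy's ellipticity components; positive output means
## the galaxy is stretched tangentially around the halo.
e_tangential <- function(x, y, e1, e2, halo_x, halo_y) {
  phi <- atan2(y - halo_y, x - halo_x)   # angle from halo to galaxy
  -(e1 * cos(2 * phi) + e2 * sin(2 * phi))
}

## A galaxy due 'east' of the halo, stretched vertically (e1 = -0.3),
## is tangentially aligned:
e_tangential(1, 0, -0.3, 0, 0, 0)  # 0.3
```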

If you want to try it yourself, I’ve posted the code here.

If you don’t feel like running it yourself, here are all 300 skies from the training set.


Now for the simple matter of the predictions. Looks like Sunday will be a fun day too! Stay tuned…

Continuous dispersal on a discrete lattice

*Edit: I made a video!

Dispersal is a key process in many domains, and particularly in ecology. Individuals move in space, and this movement can be modelled as a random process following some kernel. The dispersal kernel is simply a probability distribution describing the distance travelled in a given time frame. Since space is continuous, it is natural to use a continuous kernel. However, some modelling frameworks are formulated on a lattice, or discrete array of patches.

So how can we implement a continuous kernel in discrete space?

As with many modelling situations, we could approach this in a number of ways. Here is the implementation that I came up with, and I welcome your suggestions, dear reader, for alternatives or improvements to this approach.

  1. Draw a random variate d from the dispersal kernel.
  2. Draw a uniform random number θ between 0 and 2π, which we will use to choose a direction.
  3. Calculate the relative location (in continuous space) of the dispersed individuals using some trig:
         x = cos(θ)·d
         y = sin(θ)·d
  4. Determine the new location on the lattice for each individual by adding the relative x and y positions to the current location. Round these locations to the nearest integer and take the modulo of this integer and the lattice size. This creates a torus out of the lattice such that the outer edges loop back on each other. If you remember the old Donkey Kong games, you can think of this like how when you leave out the right side of the screen you enter from the left.
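The wrap-around in step 4 can be tried in isolation (lattice_size is illustrative):

```r
lattice_size <- 50
## map any integer position onto the 1..lattice_size torus
wrap <- function(pos) (round(pos) - 1) %% lattice_size + 1
wrap(51)  # 1: exit right, re-enter left (the Donkey Kong effect)
wrap(0)   # 50: exit left, re-enter right
wrap(25)  # 25: interior positions are unchanged
```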

I implemented this approach in R as a function that takes in a population in a lattice, and returns a lattice with the dispersed population. The user can also specify which dispersal kernel to use. Here is the result using a negative-exponential kernel on a 50×50 lattice.

Created by iterating over the dispersal function:

## General function to take in a lattice and disperse
## according to a user provided dispersal kernel
## Author: Corey Chivers
disperse<-function(d,kernel,lattice_size){
   new_d<-array(0,dim=lattice_size)
   for(i in 1:lattice_size[1])
   for(j in 1:lattice_size[2]){
      N<-d[i,j]                  # individuals in patch (i,j)
      dist<-kernel(N)            # 1. draw N dispersal distances
      theta<-runif(N,0,2*pi)     # 2. draw N directions
      x<-cos(theta)*dist         # 3. relative continuous positions
      y<-sin(theta)*dist
      for(k in seq_len(N)){      # 4. map back onto the lattice torus
         x_ind<-(round(i+x[k])-1) %% lattice_size[1] + 1
         y_ind<-(round(j+y[k])-1) %% lattice_size[2] + 1
         new_d[x_ind,y_ind]<-new_d[x_ind,y_ind]+1
      }
   }
   return(new_d)
}

For comparison, I also ran the same population using a Gaussian kernel. I defined the parameters of both kernels to have a mean dispersal distance of 1.

Here is the result using a Gaussian kernel:

The resulting population after 35 time steps has a smaller range than when using the exponential kernel, highlighting the importance of the shape of the dispersal kernel for spreading populations (remember that in both cases the average dispersal distance is the same).

Code for generating the plots:

############## Run and plot #######################

## Custom colour ramp
colours<-c('grey90','blue','red')
cus_col<-colorRampPalette(colors=colours, space="rgb", interpolate="linear")

## Initialize population array
## (disperse() is the dispersal function defined above)
lattice_size<-c(50,50)
Time<-35
d<-array(0,dim=c(lattice_size,Time))
d[25,25,1]<-1000        # start all individuals in the centre patch
times<-c(2,5,15,35)     # time steps to plot

### Normal Kernel ###
## half-normal with sd sqrt(pi/2) has mean dispersal distance 1
norm_kern<-function(N) abs(rnorm(N,0,sqrt(pi/2)))
for(i in 2:Time)
   d[,,i]<-disperse(d[,,i-1],norm_kern,lattice_size)

## Plot
png('normal_kern.png', width = 800, height = 800)
par(mfrow=c(2,2),oma=c(0,0,2,0))
for(i in times)
   image(d[,,i],col=cus_col(100),xaxt='n',yaxt='n',
   main=paste("Time =",i))
mtext("Gaussian Kernel",outer=TRUE,cex=1.5)
dev.off()

### Exponential Kernel ###
exp_kern<-function(N) rexp(N,rate=1)   # mean dispersal distance 1
for(i in 2:Time)
   d[,,i]<-disperse(d[,,i-1],exp_kern,lattice_size)

## Plot
png('exp_kern.png',  width = 800, height = 800)
par(mfrow=c(2,2),oma=c(0,0,2,0))
for(i in times)
   image(d[,,i],col=cus_col(100),xaxt='n',yaxt='n',
   main=paste("Time =",i))
mtext("Exponential Kernel",outer=TRUE,cex=1.5)
dev.off()