On Friday, I gave a guest lecture on statistics to a group of biology students at Dawson College in Montreal. We had some fun testing whether name brand cookies contained more chocolate chips, on average, than generic brand cookies.

# Padding integers for use in filenames

If you’ve ever written code that generates a whole whack of files, you may have came across the following problem when processing them. Using a naming convention wherein files are numbered will gum up any ordering which is based on string sorting (ls, for example). What you end up with is something like this:

results10.txt results11.txt results12.txt results1.txt results2.txt ...

Which is just no good, no good at all. A solution to this is to pad the file number with zeros, like so:

results0001.txt results0002.txt ... results0010.txt

I wrote a little function to make this easy:

pad_int<-function(n,scale){ out_string<-paste(10*scale + n,sep='') out_string<-substr(out_string,2,nchar(out_string)) return(out_string) }

***EDIT:** Very soon after posting this, ggplot creator and general rstats rockstar Hadley Wickham noted on twitter that you can do this in one line using:

sprintf(“%03d”, 1)

Then, simply pass this function your file number (n) and the number of zeros that you’d like to pad it with (scale). This should be the order of magnitude of the number of files you’re creating. For example:

for(i in 1:1000){ padded<-pad_int(i,1000) file_name<-paste('results_',padded,'.txt',sep='') print(file_name) }

This little bit of code came in quite handy in generating this video of dispersal on a discrete lattice. Enjoy!

# Continuous dispersal on a discrete lattice

***Edit:** I made a video!

Dispersal is a key process in many domains, and particularly in ecology. Individuals move in space, and this movement can be modelled as a random process following some *kernel.* The dispersal kernel is simply a probability distribution describing the distance travelled in a given time frame. Since space is continuous, it is natural to use a continuous kernel. However, some modelling frameworks are formulated on a lattice, or discrete array of patches.

So how can we implement a continuous kernel in discrete space?

As with many modelling situations, we could approach this in a number of ways. Here is the implementation that I can up with, and I welcome your suggestions, dear reader, for alternatives or improvements to this approach.

- Draw a random variate
*d*from the dispersal kernel. - Draw a uniform random number
*θ*between 0 and 2π, which we will use to choose a direction. - Calculate the relative location (in continuous space) of the dispersed individuals using some trig:

*x = cos(θ) d*

*y = sin(θ) d* - Determine the new location on the lattice for each individual by adding the relative x and y positions to the current location. Round these locations to the nearest integer and take the modulo of this integer and the lattice size. This creates a torus out of the lattice such that the outer edges loop back on each other. If you remember the old Donkey Kong games, you can think of this like how when you leave out the right side of the screen you enter from the left.

I implemented this approach in R as a function that takes in a population in a lattice, and returns a lattice with the dispersed population. The user can also specify which dispersal kernel to use. Here is the result using a negative-exponential kernel on a 50×50 lattice.

Created by iterating over the dispersal function:

## General function to take in a lattice and disperse ## according to a user provided dispersal kernel ## Author: Corey Chivers lat_disp<-function(pop,kernel,...) { lattice_size<-dim(pop) new_pop<-array(0,dim=lattice_size) for(i in 1:lattice_size[1]) { for(j in 1:lattice_size[2]) { N<-pop[i,j] dist<-kernel(N,...) theta<-runif(N,0,2*pi) x<-cos(theta)*dist y<-sin(theta)*dist for(k in 1:N) { x_ind<-(round(i+x[k])-1) %% lattice_size[1] + 1 y_ind<-(round(j+y[k])-1) %% lattice_size[2] + 1 new_pop[x_ind,y_ind]<-new_pop[x_ind,y_ind]+1 } } } return(new_pop) }

For comparison, I also ran the same population using a Gaussian kernel. I defined the parameters of both kernels to have a mean dispersal distance of 1.

Here is the result using a Gaussian kernel:

The resulting population after 35 time steps has a smaller range than when using the exponential kernel, highlighting the importance of the shape of the dispersal kernel for spreading populations (remember that in both cases the average dispersal distance is the same).

Code for generating the plots:

############## Run and plot ####################### ## Custom colour ramp colours<-c('grey','blue','black') cus_col<-colorRampPalette(colors=colours, space = c("rgb", "Lab"),interpolate = c("linear", "spline")) ## Initialize population array Time=35 pop<-array(0,dim=c(Time,50,50)) pop[1,25,25]<-10000 ### Normal Kernel ### par(mfrow=c(1,1)) for(i in 2:Time) { image(pop[i-1,,],col=cus_col(100),xaxt='n',yaxt='n') pop[i,,]<-lat_disp(pop[i-1,,],kernel=rnorm,mean=0,sd=1) } ## Plot png('normal_kern.png', width = 800, height = 800) par(mfrow=c(2,2),pty='s',omi=c(0.1,0.1,0.5,0.1),mar=c(2,0,2,0)) times<-c(5,15,25,35) for(i in times) image(pop[i-1,,], col=cus_col(100), xaxt='n', yaxt='n', useRaster=TRUE, main=paste("Time =",i)) mtext("Gaussian Kernel",outer=TRUE,cex=1.5) dev.off() ### Exponential Kernel ### par(mfrow=c(1,1)) for(i in 2:Time) { image(pop[i-1,,],col=cus_col(100),xaxt='n',yaxt='n') pop[i,,]<-lat_disp(pop[i-1,,],kernel=rexp,rate=1) } ## Plot png('exp_kern.png', width = 800, height = 800) par(mfrow=c(2,2),pty='s',omi=c(0.1,0.1,0.5,0.1),mar=c(2,0,2,0)) times<-c(5,15,25,35) for(i in times) image(pop[i-1,,], col=cus_col(100), xaxt='n', yaxt='n', useRaster=TRUE, main=paste("Time =",i)) mtext("Exponential Kernel",outer=TRUE,cex=1.5) dev.off() ############################################################

# Mapping Bike Accidents in R

At last weekend’s Hack Ta Ville event here in Montreal, I joined up with some talented urban planners and web devs to realize Vélobstacles. The idea of the project is to crowd source information on cycling conditions around the city. As with any crowd sourcing project, we were faced with the problem of seeding the site with some data to draw the attention of users to get the ball rolling.

Fortunately, we had access to a data set of all reported cycling accidents between 2006-2010. Once we seeded Vélobstacles with this data, the web devs went to town adding features to the site, and I had outlived my usefulness as a data geek. So I decided to play with the accident data a little and produce some visualization. I plotted all the accidents on a map and animated it through time. I also calculated and plotted the monthly accident rate using a moving average.

Be sure to select HD quality:

Not surprisingly, the accident rate goes way up in the summer months as Montreal winters are braved on two wheels by only a rarefied few. What is interesting is the mid-summer dip in the accident rate. This dip is notably correlated with Montreal’s much beloved construction holiday – though the causal relationship is unclear. If you have any alternative explanations, or an idea about how to test the construction holiday hypothesis, drop a note in the comments.

As always, you can get the code on my github page.

# The future of Artificial Intelligence – as imagined in 1989

This image comes from the cover of *Preliminary Papers of the Second International Workshop on Artificial Intelligence and Statistics (1989)*. Someone abandoned it in the lobby of my building at school. Whatever for, I’ll never know.

I just love the idea of machine learning/AI/Statistics evoking a robot hand drawing a best fit line through some points on graph paper with a pen.

What other funny, or interesting, metaphorical visual representations of machine learning have you seen? Drop a link in the comments and I’ll get my robot arm to compile the results.

# Next Post

~~The horror of left over variance…~~

After some discussion with Jeremy Fox, and a re-reading of Fox (2006), my understanding of the Price Equation partitioning of effects of biodiversity loss on ecosystem function has been improved. What I thought was a term for residual variance is the context dependent effect. This is analogous to transmission bias in the evolutionary interpretation of the Price Equation.

# Walmart Invasion

As an invasion biologist, the process of spatial spread is at the heart of what I do. When I came across this dataset of Walmart store openings since 1962 I couldn’t help but see it as an invasion front which looks a lot like a biological invasion or (albeit slow) epidemic.

The video shows monthly store openings (in red) between 1962 and 2006.

You can get the code I wrote for generating this visualization on my github page.