# Dark matter benchmarks: All over the map

The three benchmark algorithms for predicting the location of dark matter halos are, for the most part, all over the map. Most of the test skies look something like this:

There are, however, some skies with rather strong halo signals that get a decent amount of agreement:

The Lenstool MLE algorithm is the current state of the art. As such, it’s the algo to beat. As of this morning, there was only one entry on the leader board with a score topping this benchmark.

*cracks fingers* – Let’s see if we can give it a run for it’s money.

# Observing Dark Worlds – Visualizing dark matter’s distorting effect on galaxies

Some people like to do crossword puzzles. I like to do machine learning puzzles.

Lucky for me, a new contest was just posted yesterday on Kaggle. So naturally, my lazy Saturday was spent getting elbow deep into the data.

The training set consists of a series of ‘skies’, each containing a bunch of galaxies. Normally, these galaxies would exhibit random ellipticity. That is, if it weren’t for all that dark matter out there! The dark matter, while itself invisible (it is dark after all), tends to aggregate and do some pretty funky stuff. These aggregations of dark matter produce massive halos which bend the heck out of spacetime itself! The result is that any galaxies behind these halos (from our perspective here on earth) appear contorted around the halo.

The tricky bit is to distinguish between the background noise in the ellipticity of galaxies, and the regular effect of the dark matter halos. How hard could it be?

Step one, as always, is to have a look at what you’re working with using some visualization.

An example of the training data. This sky has 3 dark matter halos. I f you squint, you can kind of see the effect on the ellipticity of the surrounding galaxies.

If you want to try it yourself, I’ve posted the code here.

If you don’t feel like running it yourself, here are all 300 skies from the training set.

Now for the simple matter of the predictions. Looks like Sunday will be a fun day too! Stay tuned…

# Guest lecture at Dawson College

On Friday, I gave a guest lecture on statistics to a group of biology students at Dawson College in Montreal. We had some fun testing whether name brand cookies contained more chocolate chips, on average, than generic brand cookies.

# Padding integers for use in filenames

If you’ve ever written code that generates a whole whack of files, you may have came across the following problem when processing them. Using a naming convention wherein files are numbered will  gum up any ordering which is based on string sorting (ls, for example). What you end up with is something like this:

```results10.txt
results11.txt
results12.txt
results1.txt
results2.txt
...
```

Which is just no good, no good at all. A solution to this is to pad the file number with zeros, like so:

```results0001.txt
results0002.txt
...
results0010.txt
```

I wrote a little function to make this easy:

```pad_int<-function(n,scale){
out_string<-paste(10*scale + n,sep='')
out_string<-substr(out_string,2,nchar(out_string))
return(out_string)
}
```

*EDIT: Very soon after posting this, ggplot creator and general rstats rockstar Hadley Wickham noted on twitter that you can do this in one line using:

```sprintf(“%03d”, 1)
```

Then, simply pass this function your file number (n) and the number of zeros that you’d like to pad it with (scale). This should be the order of magnitude of the number of files you’re creating. For example:

```for(i in 1:1000){
print(file_name)
}
```

This little bit of code came in quite handy in generating this video of dispersal on a discrete lattice. Enjoy!

# Continuous dispersal on a discrete lattice

Dispersal is a key process in many domains, and particularly in ecology. Individuals move in space, and this movement can be modelled as a random process following some kernel. The dispersal kernel is simply a probability distribution describing the distance travelled in a given time frame. Since space is continuous, it is natural to use a continuous kernel. However, some modelling frameworks are formulated on a lattice, or discrete array of patches.

So how can we implement a continuous kernel in discrete space?

As with many modelling situations, we could approach this in a number of ways.  Here is the implementation that I can up with, and I welcome your suggestions, dear reader, for alternatives or improvements to this approach.

1. Draw a random variate d from the dispersal kernel.
2. Draw a uniform random number θ between 0 and 2π, which we will use to choose a direction.
3. Calculate the relative location (in continuous space) of the dispersed individuals using some trig:
x  = cos(θ) d
y = sin(θ) d
4. Determine the new location on the lattice for each individual by adding the relative x and y positions to the current location. Round these locations to the nearest integer and take the modulo of this integer and the lattice size. This creates a torus out of the lattice such that the outer edges loop back on each other. If you remember the old Donkey Kong games, you can think of this like how when you leave out the right side of the screen you enter from the left.

I implemented this approach in R as a function that takes in a population in a lattice, and returns a lattice with the dispersed population. The user can also specify which dispersal kernel to use. Here is the result using a negative-exponential kernel on a 50×50 lattice.

Created by iterating over the dispersal function:

```## General function to take in a lattice and disperse
## according to a user provided dispersal kernel
## Author: Corey Chivers
lat_disp<-function(pop,kernel,...)
{
lattice_size<-dim(pop)
new_pop<-array(0,dim=lattice_size)
for(i in 1:lattice_size[1])
{
for(j in 1:lattice_size[2])
{
N<-pop[i,j]
dist<-kernel(N,...)
theta<-runif(N,0,2*pi)
x<-cos(theta)*dist
y<-sin(theta)*dist

for(k in 1:N)
{
x_ind<-(round(i+x[k])-1) %% lattice_size[1] + 1
y_ind<-(round(j+y[k])-1) %% lattice_size[2] + 1
new_pop[x_ind,y_ind]<-new_pop[x_ind,y_ind]+1
}
}
}
return(new_pop)
}
```

For comparison, I also ran the same population using a Gaussian kernel. I defined the parameters of both kernels to have a mean dispersal distance of 1.

Here is the result using a Gaussian kernel:

The resulting population after 35 time steps has a smaller range than when using the exponential kernel, highlighting the importance of the shape of the dispersal kernel for spreading populations (remember that in both cases the average dispersal distance is the same).

Code for generating the plots:

```############## Run and plot #######################

## Custom colour ramp
colours<-c('grey','blue','black')
cus_col<-colorRampPalette(colors=colours, space = c("rgb", "Lab"),interpolate = c("linear", "spline"))

## Initialize population array
Time=35
pop<-array(0,dim=c(Time,50,50))
pop[1,25,25]<-10000

### Normal Kernel ###
par(mfrow=c(1,1))
for(i in 2:Time)
{
image(pop[i-1,,],col=cus_col(100),xaxt='n',yaxt='n')
pop[i,,]<-lat_disp(pop[i-1,,],kernel=rnorm,mean=0,sd=1)
}

## Plot
png('normal_kern.png', width = 800, height = 800)
par(mfrow=c(2,2),pty='s',omi=c(0.1,0.1,0.5,0.1),mar=c(2,0,2,0))
times<-c(5,15,25,35)
for(i in times)
image(pop[i-1,,],
col=cus_col(100),
xaxt='n',
yaxt='n',
useRaster=TRUE,
main=paste("Time =",i))

mtext("Gaussian Kernel",outer=TRUE,cex=1.5)
dev.off()

### Exponential Kernel ###
par(mfrow=c(1,1))
for(i in 2:Time)
{
image(pop[i-1,,],col=cus_col(100),xaxt='n',yaxt='n')
pop[i,,]<-lat_disp(pop[i-1,,],kernel=rexp,rate=1)
}

## Plot
png('exp_kern.png',  width = 800, height = 800)
par(mfrow=c(2,2),pty='s',omi=c(0.1,0.1,0.5,0.1),mar=c(2,0,2,0))
times<-c(5,15,25,35)
for(i in times)
image(pop[i-1,,],
col=cus_col(100),
xaxt='n',
yaxt='n',
useRaster=TRUE,
main=paste("Time =",i))

mtext("Exponential Kernel",outer=TRUE,cex=1.5)
dev.off()

############################################################
```

# Mapping Bike Accidents in R

At last weekend’s Hack Ta Ville event here in Montreal, I joined up with some talented urban planners and web devs to realize Vélobstacles. The idea of the project is to crowd source information on cycling conditions around the city. As with any crowd sourcing project, we were faced with the problem of seeding the site with some data to draw the attention of users to get the ball rolling.

Fortunately, we had access to a data set of all reported cycling accidents between 2006-2010. Once we seeded Vélobstacles with this data, the web devs went to town adding features to the site, and I had outlived my usefulness as a data geek. So I decided to play with the accident data a little and produce some visualization. I plotted all the accidents on a map and animated it through time. I also calculated and plotted the monthly accident rate using a moving average.

Be sure to select HD quality:

Not surprisingly, the accident rate goes way up in the summer months as Montreal winters are braved on two wheels by only a rarefied few. What is interesting is the mid-summer dip in the accident rate. This dip is notably correlated with Montreal’s much beloved construction holiday – though the causal relationship is unclear. If you have any alternative explanations, or an idea about how to test the construction holiday hypothesis, drop a note in the comments.

As always, you can get the code on my github page.