Montreal R User Group meetup Nov. 14th

After a bit of a summer lull, the Montreal R User Group is meeting up again! We’re trying out a new venue this time. Notman House is the home of the web in Montreal. They hold hackathons and other tech user group meetups, and they are all-around great people in an all-around great space in downtown Montreal.

Our meetup will feature R super-user Etienne Low-Decarie, who will give a walk-through of some of the most powerful packages in R, many of which were built by rstats rock star Hadley Wickham.

I will also kick off the meetup with a short session on how R is revolutionizing data science in academia, journalism, business and beyond.

  • November 14th, 7pm at 51 Sherbrooke W.
  • BYOL&D (Bring Your Own Laptop & Data)

Don’t forget to RSVP. Hope to see you there!

Observing Dark Worlds – Visualizing dark matter’s distorting effect on galaxies

Some people like to do crossword puzzles. I like to do machine learning puzzles.

Lucky for me, a new contest was posted just yesterday on Kaggle. So naturally, my lazy Saturday was spent getting elbow-deep into the data.

The training set consists of a series of ‘skies’, each containing a bunch of galaxies. Normally, these galaxies would exhibit random ellipticity. That is, if it weren’t for all that dark matter out there! Dark matter, while itself invisible (it is dark, after all), tends to aggregate and do some pretty funky stuff. These aggregations of dark matter produce massive halos which bend the heck out of spacetime itself! The result is that any galaxies behind these halos (from our perspective here on Earth) appear contorted around the halo.

The tricky bit is to distinguish between the background noise in the ellipticity of the galaxies and the systematic effect of the dark matter halos. How hard could it be?

Step one, as always, is to have a look at what you’re working with using some visualization.

An example of the training data. This sky has 3 dark matter halos. If you squint, you can kind of see the effect on the ellipticity of the surrounding galaxies.

If you want to try it yourself, I’ve posted the code here.
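For a sense of what the plotting involves: each galaxy gets drawn at its (x, y) position, with a line segment along the major axis implied by its two ellipticity components. Here’s a minimal sketch, assuming a sky has been read into a data frame sky with columns x, y, e1 and e2 (names and scaling are assumptions for illustration, not the actual code):

## A minimal sketch, assuming a data frame `sky` with galaxy
## positions (x, y) and ellipticity components (e1, e2)
theta <- atan2(sky$e2, sky$e1) / 2   ## orientation of each galaxy's major axis
e     <- sqrt(sky$e1^2 + sky$e2^2)   ## magnitude of the ellipticity
len   <- 1000 * e                    ## arbitrary scaling, just for visibility
plot(sky$x, sky$y, pch = 19, cex = 0.3, asp = 1, xlab = 'x', ylab = 'y')
segments(sky$x - len * cos(theta), sky$y - len * sin(theta),
         sky$x + len * cos(theta), sky$y + len * sin(theta))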

If you don’t feel like running it yourself, here are all 300 skies from the training set.

 

Now for the simple matter of the predictions. Looks like Sunday will be a fun day too! Stay tuned…

Mapping Bike Accidents in R

At last weekend’s Hack Ta Ville event here in Montreal, I joined up with some talented urban planners and web devs to realize Vélobstacles. The idea of the project is to crowdsource information on cycling conditions around the city. As with any crowdsourcing project, we were faced with the problem of seeding the site with some data to draw the attention of users and get the ball rolling.

Fortunately, we had access to a data set of all reported cycling accidents between 2006 and 2010. Once we seeded Vélobstacles with this data, the web devs went to town adding features to the site, and I had outlived my usefulness as a data geek. So I decided to play with the accident data a little and produce some visualizations. I plotted all the accidents on a map and animated it through time. I also calculated and plotted the monthly accident rate using a moving average.

Be sure to select HD quality:

Not surprisingly, the accident rate goes way up in the summer months, as Montreal winters are braved on two wheels by only a rarefied few. What is interesting is the mid-summer dip in the accident rate. This dip is notably correlated with Montreal’s much-beloved construction holiday – though the causal relationship is unclear. If you have any alternative explanations, or an idea about how to test the construction holiday hypothesis, drop a note in the comments.

As always, you can get the code on my github page.
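If you just want the gist of the moving-average calculation, it boils down to something like this sketch (the data frame and column names here are assumptions for illustration, not the actual project code):

## A minimal sketch, assuming a data frame `accidents` with a
## Date-class column `date`, one row per reported accident
library(zoo)
monthly  <- table(format(accidents$date, '%Y-%m'))          ## accidents per month
smoothed <- rollmean(as.numeric(monthly), k = 3, fill = NA) ## 3-month moving average
plot(as.numeric(monthly), type = 'h', xlab = 'Month', ylab = 'Accidents')
lines(smoothed, col = 'red', lwd = 2)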

Walmart Invasion

The process of spatial spread is at the heart of what I do as an invasion biologist. So when I came across this dataset of Walmart store openings since 1962, I couldn’t help but see it as an invasion front, one which looks a lot like a biological invasion or an (albeit slow) epidemic.

The video shows monthly store openings (in red) between 1962 and 2006.

You can get the code I wrote for generating this visualization on my github page.
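The core of any such animation is just a loop that draws one frame per time-step. Roughly, and with made-up column names, it looks like this sketch:

## A rough sketch, assuming a data frame `stores` with columns
## lon, lat and a Date-class column `opened` (hypothetical names)
library(maps)
months <- sort(unique(format(stores$opened, '%Y-%m')))
for (m in months)
{
  png(sprintf('frame_%s.png', m))
  map('state')   ## continental US outline
  old <- format(stores$opened, '%Y-%m') < m
  points(stores$lon[old], stores$lat[old], pch = 19, cex = 0.3, col = 'grey')
  new <- format(stores$opened, '%Y-%m') == m
  points(stores$lon[new], stores$lat[new], pch = 19, cex = 0.5, col = 'red')
  title(m)
  dev.off()
}

The resulting frames can then be stitched into a video with a tool like ffmpeg.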

The essence of a handwritten digit

If you haven’t yet discovered the competitive machine learning site kaggle.com, please do so now. I’ll wait.

Great – so, you checked it out, fell in love and have made it back. I recently downloaded the data for the getting-started competition. It consists of 42,000 labelled images (28×28 pixels) of handwritten digits 0-9. The competition is a straightforward supervised learning problem of OCR (optical character recognition). There are two sample R scripts on the site to get you started; they implement the k-nearest neighbours and random forest algorithms.

I wanted to get started by visualizing all of the training data, rendering some sort of average image of each character. Visualizing the data is a great first step to developing a model. Here’s how I did it:

## Read in the training data
train <- read.csv("../data/train.csv", header = TRUE)
train <- as.matrix(train)

## Colour ramp definition: white (no ink) to black (full ink)
colors <- c('white', 'black')
cus_col <- colorRampPalette(colors = colors)

## Plot the average image of each digit
par(mfrow = c(4, 3), pty = 's', mar = c(1, 1, 1, 1), xaxt = 'n', yaxt = 'n')
all_img <- array(dim = c(10, 28 * 28))
for (di in 0:9)
{
  print(di)  ## progress indicator
  ## Sum each pixel across all images of digit di, then rescale to 0-255
  all_img[di + 1, ] <- apply(train[train[, 1] == di, -1], 2, sum)
  all_img[di + 1, ] <- all_img[di + 1, ] / max(all_img[di + 1, ]) * 255

  z <- array(all_img[di + 1, ], dim = c(28, 28))
  z <- z[, 28:1]  ## flip so the digit is right side up
  image(1:28, 1:28, z, main = di, col = cus_col(256))
}

Which gives you:
Notice the wobbly-looking ‘1’. You can see that there is some variance in the angle of the slant, with a tendency toward leaning right. I imagine that this is due to the bias toward right-handed individuals in the sample.
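If you want to see that variance directly, a quick follow-up sketch is to plot the per-pixel standard deviation instead of the average (reusing train and cus_col from above):

## Per-pixel standard deviation across all the 1s in the training set
ones <- train[train[, 1] == 1, -1]
z <- array(apply(ones, 2, sd), dim = c(28, 28))
z <- z[, 28:1]  ## right side up
image(1:28, 1:28, z, main = 'sd of 1s', col = cus_col(256))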

I also wanted to generate a pdf plot of all of the training set, to get an idea of the kinds of anomalous instances I should expect.

If you are interested, dear reader, here is my code to do just that.

pdf('train_letters.pdf')
par(mfrow = c(4, 4), pty = 's', mar = c(3, 3, 3, 3), xaxt = 'n', yaxt = 'n')
for (i in 1:nrow(train))
{
  ## Reshape the i-th row into a 28x28 image, titled with its label
  z <- array(train[i, -1], dim = c(28, 28))
  z <- z[, 28:1]  ## flip so the digit is right side up
  image(1:28, 1:28, z, main = train[i, 1], col = cus_col(256))
  print(i)  ## progress indicator
}
dev.off()

This will give you a 2,625-page PDF of every character in the training set, which you can, um, casually peruse.

At the time of writing, the leading submission has a classification accuracy of 99.27%. There is no cash for this competition, but the knowledge gained from taking a stab at it is priceless. So give it a shot!

More Bixi Data Visualization

I mentioned in a previous post that our team at the recent Hack/Reduce hackathon had some fun with a data set consisting of Bixi station states at minute-level temporal resolution. In addition to pulling out and plotting the flux at each station on an hourly basis, we also plotted the system state (the number of bikes at each station) at each time-step we had. This totalled 24,217 individual plots. Each plot was generated by an R script which took in the system state at a given time-step and output a PNG.

Team member Kamal Marhubi also did some nice post-processing to overlay the information on a map. The results are a little mesmerising. Things don’t get fun until about 40 seconds into the video; the opening mostly just shows the stations coming online at the start of the season.

And for the non-Montrealers out there, here’s an image of a Bixi bike: our durable, data-generating little hero.

Heartbeat of a Cycling City: Bixi data at Hack/Reduce

The recent Hack/Reduce hackathon in Montreal was a tonne of fun. Our team tackled a data set consisting of Bixi (Montreal’s bicycle share system) station states at one-minute temporal resolution. We used Hadoop and MapReduce to pull out some features of user behaviour. One of the things we extracted was the flux at each station, which we defined as the number of bikes arriving at and departing from a given station per unit time. When you plot the total system flux across all stations against time, you can see the pulse of the city. Here are the first few weeks of this year’s Bixi season. (click to enlarge)

A few things jump out:

  • There are clearly defined peaks at both the morning and evening rush hours, but it looks like the evening rush is typically a little stronger. I guess cycling home is a great way to relax after a day at work.
  • The data collector seems to have gone offline overnight on April 18th.
  • Related to the first point, weekdays and weekends have distinct signatures. In fact, you can see a clear signal of Easter Monday, in that it looks like a weekend day. (click to enlarge)

When the system was first being installed, I had the impression that it would be used primarily by tourists. Owning a bike myself, I figured that if other Montrealers wanted to cycle in the city, they would do so on their own rides. From this data, it really seems as though Montrealers themselves are using the Bixi system, substituting it for other modes of transit when commuting.
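Incidentally, the Hadoop job aside, the flux calculation itself is simple to express in R: just difference each station’s bike count through time. Here’s a minimal sketch, with a hypothetical data frame states holding columns station, time and bikes (one row per station per minute):

## A minimal sketch of the flux calculation
library(plyr)
station_flux <- ddply(states, 'station', function(d) {
  d <- d[order(d$time), ]
  ## note: |diff| measures net change per minute, a lower bound
  ## on the true number of arrivals plus departures
  data.frame(time = d$time[-1], flux = abs(diff(d$bikes)))
})
## total system flux at each time-step
total <- aggregate(flux ~ time, data = station_flux, FUN = sum)
plot(total$time, total$flux, type = 'l', xlab = 'Time', ylab = 'Total flux')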

We also took the spatial information in the data and plotted the flux at the station level, then animated this across time. Here, I used a kernel smoother from the KernSmooth package to estimate the flux density in space. This lets us see the spatial configuration of flux a little better than points would, since the spatial density of stations is heterogeneous. The result is this pulsating video:
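The smoothing step comes down to a call to bkde2D() from KernSmooth. Here’s a sketch for a single time-step, using a hypothetical data frame site_flux with each station’s coordinates and flux (the bandwidth here is made up; in practice you would tune it by eye):

## A sketch, assuming a data frame `site_flux` with columns
## lon, lat and flux (one row per station, for one time-step)
library(KernSmooth)
## crude way to weight stations by flux: repeat each station's
## point in proportion to its flux before smoothing
idx <- rep(seq_len(nrow(site_flux)), times = round(site_flux$flux))
est <- bkde2D(cbind(site_flux$lon[idx], site_flux$lat[idx]),
              bandwidth = c(0.005, 0.005),  ## chosen by eye
              gridsize  = c(200, 200))
image(est$x1, est$x2, est$fhat, xlab = 'Longitude', ylab = 'Latitude')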

For the R users out there, I also found the package lubridate to be extremely helpful for wrangling the dates in this project.
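For example, going from raw timestamp strings to the components you actually want to group and plot by takes just a line or two:

library(lubridate)
t <- ymd_hms('2012-04-18 08:31:00')
wday(t, label = TRUE)  ## day of week (weekday/weekend signatures)
hour(t)                ## hour of day (rush hour peaks)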



Credits (Team Ctr-Freak)

Julia Evans
Kamal Marhubi
Victor Parmar
Pierre-Alexandre Lacerte
Mansoor Siddiqui
Rafik Draoui
Corey Chivers