The recent Hack/Reduce hackathon in Montreal was a tonne of fun. Our team tackled a data set consisting of Bixi (Montreal’s bicycle share system) station states at one-minute temporal resolution. We used Hadoop and MapReduce to pull out some features of user behaviour. One of the things we extracted was the flux at each station, which we defined as the number of bikes arriving at and departing from a given station per unit time. When you plot the total system flux across all stations against time, you can see the pulse of the city. Here are the first few weeks of this year’s Bixi season.
A few things jump out: 1) There are clearly defined peaks at both the morning and evening rush hours, but the evening rush is typically a little stronger. I guess cycling home is a great way to relax after a day at work. 2) The data collector seems to have gone offline during the night of April 18th. 3) Related to the first point, weekdays and weekends have distinct signatures. In fact, you can see a clear signal of Easter Monday, which looks like a weekend day.
When the system was first being installed, I had the impression that it would be used primarily by tourists. Owning a bike myself, I figured that if other Montrealers wanted to cycle in the city, they would do so on their own rides. From this data, it really seems as though Montrealers themselves are using the Bixi system, choosing it over other modes of transit for commuting.
We also took the spatial information in the data and plotted the flux at the station level, then animated this across time. Here, I used a kernel smoother from the KernSmooth package to estimate the flux density in space. Because the spatial density of stations is heterogeneous, a smoothed surface shows the spatial configuration of flux a little better than points do. The result is this pulsating video:
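For anyone curious how the flux defined above could be pulled out of per-minute station states, here is a minimal Python sketch. It is not the Hadoop job we ran at the hackathon. The input layout (station ID, minute, bikes available) and the function names `station_flux` and `system_flux` are assumptions made for illustration. It also assumes flux can be approximated by the absolute change in bike count between consecutive snapshots, which undercounts whenever an arrival and a departure fall within the same minute.

```python
from collections import defaultdict


def station_flux(snapshots):
    """Approximate per-station flux from (station_id, minute, bikes) records.

    Flux at a given minute is taken to be the absolute change in the number
    of bikes since the previous snapshot (an assumed approximation).
    """
    # "Map" step: group the records by station.
    by_station = defaultdict(list)
    for station_id, minute, bikes in snapshots:
        by_station[station_id].append((minute, bikes))

    # "Reduce" step: difference consecutive snapshots within each station.
    flux = {}
    for station_id, rows in by_station.items():
        rows.sort()  # order by minute
        flux[station_id] = {
            later_minute: abs(later_bikes - earlier_bikes)
            for (_, earlier_bikes), (later_minute, later_bikes)
            in zip(rows, rows[1:])
        }
    return flux


def system_flux(flux):
    """Sum the per-station flux at each minute to get the city-wide total."""
    total = defaultdict(int)
    for per_minute in flux.values():
        for minute, value in per_minute.items():
            total[minute] += value
    return dict(total)
```

Plotting the output of `system_flux` against time gives the kind of city-wide curve discussed above.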
For the R users out there, I also found the package lubridate to be extremely helpful for wrangling the dates in this project.
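As a rough illustration of the spatial smoothing idea described above, here is a small Python sketch of a flux-weighted Gaussian kernel estimate. It does not reproduce the KernSmooth API or the settings used for the video. The function name `smooth_flux`, the input layout (x, y, flux per station) and the single shared bandwidth are all assumptions.

```python
import math


def smooth_flux(stations, grid_points, bandwidth):
    """Estimate a smooth flux surface from scattered station values.

    stations:    list of (x, y, flux) tuples, one per station
    grid_points: list of (x, y) locations at which to evaluate the surface
    bandwidth:   standard deviation of the Gaussian kernel, in the same
                 units as the coordinates
    """
    # Normalising constant of a 2D isotropic Gaussian kernel.
    norm = 2 * math.pi * bandwidth ** 2
    surface = []
    for gx, gy in grid_points:
        total = 0.0
        for sx, sy, flux in stations:
            dist_sq = (gx - sx) ** 2 + (gy - sy) ** 2
            # Each station contributes its flux, down-weighted by distance.
            total += flux * math.exp(-dist_sq / (2 * bandwidth ** 2))
        surface.append(total / norm)
    return surface
```

Evaluating this on a regular grid for each time step and drawing the result as a heat map gives one frame of an animation. The bandwidth controls how far each station's flux is spread, so it is worth trying a few values.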
Julia Evans Kamal Marhubi Victor Parmar Pierre-Alexandre Lacerte Mansoor Siddiqui Rafik Draoui Corey Chivers
Excellent, many thks for that ! ^^
Many thanks for that, I suspected this pulse to be somewhere in there !
Great! I was only dreaming about the Bixi dataset!! Was it made available to you just for that event or is it “open”?
It is data grabbed (in real time) from the Bixi site. f8full has it, I think.
Yep, some complete records are thus preserved, but I haven’t ever (yet? :D) gotten around to creating a read API.
Both sample URLs go to August, but there is one for each month; just substitute the month name in the URL, in English.
The cycling (pun intended) is so beautiful that I just have to know what happens when you apply a fast Fourier transform (FFT) to the data. You should be able to extract the weekly cycle, the daily cycle, and the within-day cycles, and see the proportion of flux due to each of these cycles. Cheers.
In Paris, locals took advantage of the Vélib rather than using their own bikes: (a) free maintenance, (b) ability to come home by métro if it rains, (c) no worry about the bike being stolen, (d) practically free if living less than 30 minutes from work [27 euros per year], (e) station network growing to the suburbs… So some of my colleagues moved from their own bikes to Vélib! I have not; they are way too heavy!
I confirm that info. I can gather up-to-date data and provide it (a collection of XML status files). The dataset used at the hackathon was a CSV version provided by Sara and Karam, if I remember correctly.
Nice work! Have you thought of correlating weather conditions and bicycle flux?
I think the rise in the use of apps and open source cycling data has led to an explosion of interest in cycling. Here in London there is a big push to get people to use bicycles as opposed to other forms of transportation.
I’ve written a brief piece on how bicycle commuting data could have broader implications on city transportation in the future.
Reblogged this on bayesianbiologist and commented:
With spring finally making its presence known, I thought I’d re-share this cycling data analysis and visualization I did with some great people a while back. Get out there and feel that wind in your hair!
So, I’m curious: How essential was the Hadoop map-reduce framework to these assessments? If the data was large, could it have been done with sampling instead? Given that counts from specific sites are tantamount to sampling surveys, was there any attempt to equalize, say, variances? Any stratification?
I feel these are more than pedantic critiques. The point of the exercise is to ascertain people’s interactions with bicycles and bicycle-providing systems. What’s really wanted is that, and not that conditional upon a particular laydown and policy in a particular city.
I am especially interested in how methods for Bayesian finite population sampling might relate here. Time series are nice, and are one thing, but why aren’t things like “Easter Monday” attributions “just” data dredging? What basis do they have in models? Don’t aspects like a collection “going offline” need to be incorporated into models of conclusions? Any attempts at imputation there? Equivalently, incorporating those kinds of things into a sampling distribution?
All good questions, but my main answer is that this was really just a fun little exercise (done in a single day at a hackathon) and not meant to be a rigorous analysis for the purpose of providing policy suggestions. As for the necessity of Hadoop, the same answer applies: this was done at a for-fun event that was designed in part to give people some exposure to map-reduce, so this was more of a ‘hello world’ than a production-grade use case. Thanks for the note!