Plotly is a platform for making, editing, and sharing graphs. If you are used to making plots with ggplot2, you can call ggplotly() to make your plots interactive, web-based, and collaborative. For example, see plot.ly/~ggplot2examples/211, shown below and in this Notebook. Notice the hover text!

Visit http://plot.ly. Here, you’ll find a GUI that lets you create graphs from data you enter manually, or upload as a spreadsheet (or CSV file). From there you can edit graphs! Change between types (from bar charts to scatter charts), change colors and formatting, add fits and annotations, try other themes…

Our R API lets you use Plotly with R. Once you have your R visualization in Plotly, you can use the web interface to edit it, or to extract its data. Install and load package “plotly” in your favourite R environment. For a quick start, follow: https://plot.ly/ggplot2/getting-started/

Go social! Like, share, comment, fork and edit plots… Export them, embed them in your website. Collaboration has never been so sweet!

Not ready to publish? Set detailed permissions for who can view and who can edit your project.

Baseball data is the best! Let’s plot a histogram of batting averages. I downloaded data here.

Load the CSV file of interest, take a look at the data, subset at will:

```r
library(RCurl)
online_data <- getURL("https://raw.githubusercontent.com/mkcor/baseball-notebook/master/Batting.csv")
batting_table <- read.csv(textConnection(online_data))
head(batting_table)
summary(batting_table)
batting_table <- subset(batting_table, yearID >= 2004)
```

The batting average is defined as the number of hits (H) divided by at bats (AB):

batting_table$Avg <- with(batting_table, H / AB)

You may want to explore the distribution of your new variable as follows:

```r
library(ggplot2)
ggplot(data=batting_table) + geom_histogram(aes(Avg), binwidth=0.05)
# Let's filter out entries where players were at bat less than 10 times.
batting_table <- subset(batting_table, AB >= 10)
hist <- ggplot(data=batting_table) + geom_histogram(aes(Avg), binwidth=0.05)
hist
```

We have created a basic histogram; let us share it, so we can get input from others!

```r
# Install the latest version of the "plotly" package and load it
library(devtools)
install_github("ropensci/plotly")
library(plotly)
# Open a Plotly connection
py <- plotly("ggplot2examples", "3gazttckd7")
```

Use your own credentials if you prefer. You can sign up for a Plotly account online.

Now call the `ggplotly()` method:

collab_hist <- py$ggplotly(hist)

And boom!

You get a nice interactive version of your plot! Go ahead and hover…

Your plot lives at this URL (`collab_hist$response$url`) alongside the data. How great is that?!

If you wanted to keep your project private, you would use your own credentials and specify:

```r
py <- plotly()
py$ggplotly(hist, kwargs=list(filename="private_project", world_readable=FALSE))
```

Now let us click “Fork and edit”. You (and whoever you’ve added as a collaborator) can make edits in the GUI. For instance, you can run a Gaussian fit on this distribution:

You can give a title, edit the legend, add notes, etc.

You can add annotations in a very flexible way, controlling what the arrow and text look like:

When you’re happy with the changes, click “Share” to get your plot’s URL.

If you append a supported extension to the URL, Plotly will translate your plot into that format. Use this to export static images, embed your graph as an iframe, or translate the code between languages. Supported file types include .json and .r, among others.

Isn’t life wonderful?

The JSON file specifies your plot completely (it contains all the data and layout info). You can view it as your plot’s DNA. The R file (https://plot.ly/~mkcor/305.r) is a conversion of this JSON into a nested list in R. So we can interact with it by programming in R!

Access a plot which lives on plot.ly with the well-named method `get_figure()`:

enhanc_hist <- py$get_figure("mkcor", 305)

Take a look:

```r
str(enhanc_hist)
# Data for second trace
enhanc_hist$data[[2]]
```

The second trace is a vertical line at 0.300 named “Good”. Say we get more ambitious and we want to show a vertical line at 0.350 named “Very Good”. We overwrite old values with our new values:

```r
enhanc_hist$data[[2]]$name <- "Very Good"
enhanc_hist$data[[2]]$x[[1]] <- 0.35
enhanc_hist$data[[2]]$x[[2]] <- 0.35
```

Send this new plot back to plot.ly!

```r
enhanc_hist2 <- py$plotly(enhanc_hist$data, kwargs=list(layout=enhanc_hist$layout))
enhanc_hist2$url
```

Visit the above URL (`enhanc_hist2$url`).

How do you like this workflow? Let us know!

Tutorials are at plot.ly/learn. You can see more examples and documentation at plot.ly/ggplot2 and plot.ly/r. Our gallery has the following examples:

**Acknowledgments**

This presentation benefited tremendously from comments by Matt Sundquist and Xavier Saint-Mleux.

Plotly’s R API is part of rOpenSci. It is under active development; you can find it on GitHub. Your thoughts, issues, and pull requests are always welcome!



With spring finally making its presence known, I thought I’d re-share this cycling data analysis and visualization I did with some great people a while back. Get out there and feel that wind in your hair!

Originally posted on bayesianbiologist:

The recent Hack/Reduce hackathon in Montreal was a tonne of fun. Our team tackled a data set consisting of Bixi (Montreal’s bicycle share system) station states at one-minute temporal resolution. We used Hadoop and MapReduce to pull out some features of user behaviours. One of the things we extracted was the flux at each station, which we defined as the number of bikes arriving at and departing from a given station per unit time. When you plot the total system flux across all stations against time, you can see the *pulse* of the city. Here are the first few weeks of this year’s Bixi season. (Click to enlarge.)

A few things jump out: 1) There are clearly defined peaks at both the morning and evening rush hours, but it looks like the evening rush is typically a little stronger. I guess cycling home is a great way to relax after…



Plotly is a social graphing and analytics platform. Plotly’s R library lets you make and share publication-quality graphs online. Your work belongs to you, you control privacy and sharing, and public use is free (like GitHub). We are in beta, and would love your feedback, thoughts, and advice.

**1. Installing Plotly**

Let’s install Plotly. Our documentation has more details.

```r
install.packages("devtools")
library("devtools")
devtools::install_github("R-api", "plotly")
```

Then sign up online, or from R:

```r
library(plotly)
response <- signup(username='yourusername', email='youremail')
```

…

```
Thanks for signing up to plotly!

Your username is: MattSundquist
Your temporary password is: pw. You use this to log into your plotly account at https://plot.ly/plot.
Your API key is: "API_Key". You use this to access your plotly account through the API.
```

**2. Canadian Population Bubble Chart**

Our first graph was made at a Montreal R Meetup by Plotly’s own Chris Parmer. We’ll be using the maps package. You may need to load it:

install.packages("maps")

Then:

```r
library(plotly)
p <- plotly(username="MattSundquist", key="4om2jxmhmn")
library(maps)
data(canada.cities)
trace1 <- list(x=map(regions="canada")$x,
               y=map(regions="canada")$y)
trace2 <- list(x=canada.cities$long,
               y=canada.cities$lat,
               text=canada.cities$name,
               type="scatter",
               mode="markers",
               marker=list(
                 "size"=sqrt(canada.cities$pop/max(canada.cities$pop))*100,
                 "opacity"=0.5))
response <- p$plotly(trace1, trace2)
url <- response$url
filename <- response$filename
browseURL(response$url)
```

In our graph, the bubble size represents the city population size. Shown below is the GUI, where you can annotate, select colors, analyze and add data, style traces, place your legend, change fonts, and more.

Editing from the GUI, we make a styled version. You can zoom in and hover on the points to find out about the cities. Want to make one for another country? We’d love to see it.

And, here is said meetup, in action:

You can also do the same for the US, swapping in `usa` and `us.cities`:

**3. Old Faithful and Multiple Axes**

Ben Chartoff’s graph shows the correlation between a bimodal eruption time and a bimodal distribution of eruption length. The key series are a probability histogram, an eruption-time scale in minutes, and a scatterplot showing the points within each bin on the x axis. The graph was made with this gist.

**4. Plotting Two Histograms Together**

Suppose you are studying correlations in two series (a popular Stack Overflow question). You want to find overlap. You can plot two histograms together, one for each series. The overlapping sections are the darker orange, automatically rendered if you set barmode to ‘overlay’.

```r
library(plotly)
p <- plotly(username="Username", key="API_KEY")
x0 <- rnorm(500)
x1 <- rnorm(500) + 1
data0 <- list(x=x0, name="Series One", type='histogramx', opacity=0.8)
data1 <- list(x=x1, name="Series Two", type='histogramx', opacity=0.8)
layout <- list(
  xaxis = list(ticks="", gridcolor="white", zerolinecolor="white", linecolor="white"),
  yaxis = list(ticks="", gridcolor="white", zerolinecolor="white", linecolor="white"),
  barmode = 'overlay',
  # style background color. You can set the alpha by adding an a.
  plot_bgcolor = 'rgba(249,249,251,.85)')
response <- p$plotly(data0, data1, kwargs=list(layout=layout))
url <- response$url
filename <- response$filename
browseURL(response$url)
```

**5. Plotting y1 and y2 in the Same Plot**

Plotting two lines or graph types in Plotly is straightforward. Here we show y1 and y2 together (another popular Stack Overflow question).

```r
library(plotly)
p <- plotly(username="Username", key="API_KEY")

# enter data
x  <- seq(-2, 2, 0.05)
y1 <- pnorm(x)
y2 <- pnorm(x, 1, 1)

# format, listing y1 as your y
First <- list(x = x, y = y1,
              type = 'scatter', mode = 'lines',
              marker = list(color = 'rgb(0, 0, 255)', opacity = 0.5))

# format again, listing y2 as your y
Second <- list(x = x, y = y2,
               type = 'scatter', mode = 'lines', opacity = 0.8,
               marker = list(color = 'rgb(255, 0, 0)'))

# send both traces to Plotly as a single figure
response <- p$plotly(First, Second)
browseURL(response$url)
```

And a shot of the Plotly gallery, as seen at the Montreal meetup. Happy plotting!


Being curious, I thought I’d see what the assumptions were that went into that number. It would make sense to start with the assumption that you don’t know a lick about college basketball and you just guess using a coin flip for every match-up. In this scenario you’re pretty bad, but you are no worse than random. If we take this assumption, we can calculate the odds as 1/(0.5)^63. To get precision down to a whole integer I pulled out trusty bc for the heavy lifting:

```
$ echo "scale=50; 1/(0.5^63)" | bc
9223372036854775808.000000
```
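For the record, the same number falls out of R in one line (a trivial check, included since the rest of this blog lives in R):

```r
# Odds of a perfect bracket when coin-flipping all 63 games
1 / (0.5^63)
# [1] 9.223372e+18
```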

Well, that was easy. So if you were to just guess randomly, your odds of winning the big prize would be those published on the contest page. We can easily calculate the expected value of entering the contest as P(win)*prize, or 9,223,372,036ths of a dollar (that’s about 0.1 nanodollars, if you’re paying attention). You’ve literally already spent that (and then some) in opportunity cost sunk into the time you are spending thinking about this contest and reading this post (but read on, ’cause it’s fun!).

But of course, you’re cleverer than that. You know everything about college basketball – or, more likely if you are reading this blog, you have a kickass predictive model that is going to up your game and get your hands into the pocket of the Oracle of Omaha.

What level of predictiveness would you need to make this bet worthwhile? Let’s have a look at the expected value as a function of our individual game probability of being correct.
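That expected-value curve is a short computation in R (a sketch assuming a $1 billion prize and 63 independent picks, each correct with probability p):

```r
# Expected value of one entry as a function of per-game accuracy p.
# Assumes a $1e9 prize and 63 independent game predictions.
prize <- 1e9
ev <- function(p) p^63 * prize

p <- seq(0.5, 0.9, by = 0.01)
plot(p, ev(p), type = "l", log = "y",
     xlab = "Per-game probability of a correct pick",
     ylab = "Expected value ($)")
```

At p = 0.85 this gives roughly $35,000, and at p = 0.80 roughly $785, matching the figures discussed below.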

And if you think that you’re *really* good, we can look at the 0.75 to 0.85 range:

So it’s starting to look enticing, you might even be willing to take off work for a while if you thought you could get your model up to a consistent 85% correct game predictions, giving you an expected return of ~$35,000. A recent paper found that even after observing the first 40 scoring events, the outcome of NBA games is only predictable at 80%. In order to be eligible to win, you’ve obviously got to submit your picks *before* the playoff games begin, but even at this herculean level of accuracy, the expected value of an entry in the contest plummets down to $785.

Those are the odds for an individual entrant, but what are the chances that Buffett and co. will have to pay out? That, of course, depends on the number of entrants. Let’s assume that the skill of all entrants is the same, though they all have unique models which make different predictions. In this case we can get the probability of *at least one* of them hitting it big. It will be the complement of *no one* winning. We already know the odds for a single entrant with a given level of accuracy, so we can just take the probability that each one doesn’t win, then take 1 minus that value.
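In R, that complement calculation is a one-liner (a sketch assuming independent brackets from equally skilled entrants):

```r
# P(at least one of n entrants wins), each with per-game accuracy p
p_payout <- function(p, n) 1 - (1 - p^63)^n

p_payout(0.80, 1e6)  # ~0.54: roughly a coin flip, even for a million experts
p_payout(0.70, 1e6)  # ~0.00017, i.e. about 1 in 5739
```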

Just as we saw that the expected value is very sensitive to the predictive accuracy of the participant, so too is the probability that the prize will be awarded at all. If 1 million super-talented sporting sages with 80% game-level accuracy enter the contest, there will be only a slightly greater than 50% chance of anyone actually winning. If we substitute in a more reasonable (but let’s face it, still wildly high) figure of 70% for participants’ accuracy, the chance that the top prize will even be awarded drops to about 1 in 5739 (0.017%), even with a 1-million-strong entrant pool.

tl;dr You’re not going to win, but you’re still going to play.

If you want to reproduce the numbers and plots in this post, check out this gist.


From Greek

+

**sim·u·late** *v.* To create a representation or model of (a physical system or particular situation, for example).

From Latin

=

(If you can get past the mixing of Latin and Greek roots)

**sim·u·di·dactic**

———————————————————————

This concept has been floating around in my head for a little while. I’ve written before on how I believe that simulation can be used to improve one’s understanding of just about anything, but have never had a nice shorthand for this process.

**Simudidactic inquiry** is the process of understanding aspects of the world by abstracting them into a computational model, then conducting experiments in this model world by changing the underlying properties and parameters. In this way, one can ask questions like:

- What type of observations might we make if *x* were true?
- If my model of the process is accurate, can I recapture the underlying parameters given the type of observations I can make in the real world? How often will I be wrong?
- Will I be able to distinguish between competing models given the observations I can make in the real world?

In addition to being able to ask these types of questions, the simudidact solidifies their understanding of the model by actually building it.

So go on, get simudidactic and learn via simulation!


Monday, October 28, 2013. 6:00pm at Notman House 51 Sherbrooke W., Montreal, QC.

We are very pleased to welcome back Dr. Ramnath Vaidyanathan for a talk on interactive documents, as they relate to his excellent rCharts package.

Bringing a laptop to follow along is highly encouraged. I would recommend installing rCharts prior to the workshop.

```r
library(devtools)
pkgs <- c('rCharts', 'slidify', 'slidifyLibraries')
install_github(pkgs, 'ramnathv', ref = 'dev')
```

Alternatively, you can also try out rCharts online at

RSVP at http://www.meetup.com/Montreal-R-User-Group/events/144636812/



Originally posted on Dynamic Ecology:

There has been a lot of discussion of researcher degrees of freedom lately (e.g. Jeremy here or Andrew Gelman here – PS by my read Gelman got the specific example wrong because I think the authors really did have a genuine *a priori* hypothesis but the general point remains true and the specific example is revealing of how hard this is to sort out in the current research context).

I would argue that this problem comes about because people fail to be clear about their goals in using statistics (mostly the researchers, this is not a critique of Jeremy or Andrew’s posts). When I teach a 2nd semester graduate stats class, I teach that there are three distinct goals for which one might use statistics:

- Hypothesis testing
- Prediction
- Exploration

These three goals are all pretty much mutually exclusive (although there is some overlap between prediction and exploration). Hypothesis testing is of…



AUC gets around this threshold problem by integrating across all possible thresholds. Typically, it is calculated by plotting the rate of true positives against the rate of false positives across the range of possible thresholds (this is the Receiver Operating Characteristic, or ROC, curve) and then integrating (calculating the area under the curve). The result is typically something like this:
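That threshold sweep is compact in base R (a minimal sketch, not the gist the post links to):

```r
# Minimal ROC/AUC: sweep thresholds, compute true- and false-positive
# rates at each one, and integrate with the trapezoid rule.
roc_auc <- function(labels, scores) {
  thresholds <- sort(unique(scores), decreasing = TRUE)
  tpr <- sapply(thresholds, function(t) mean(scores[labels == 1] >= t))
  fpr <- sapply(thresholds, function(t) mean(scores[labels == 0] >= t))
  tpr <- c(0, tpr); fpr <- c(0, fpr)  # the curve starts at (0, 0)
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
```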

I’ve implemented this algorithm in an R script (https://gist.github.com/cjbayesian/6921118) which I use quite frequently. Whenever I am tasked with explaining the *meaning* of the AUC value however, I will usually just say that you want it to be 1 and that 0.5 is no better than random. This usually suffices, but if my interlocutor is of the particularly curious sort they will tend to want more. At which point I will offer the interpretation that the AUC gives you the probability that a randomly selected positive case (1) will be ranked higher in your predictions than a randomly selected negative case (0).

Which got me thinking – if this is true, why bother with all this false positive, false negative, ROC business in the first place? Why not just use Monte Carlo to estimate this probability directly?

So, of course, I did just that and by golly it works.

```r
source("http://polaris.biol.mcgill.ca/AUC.R")

bs <- function(p) {
  U <- runif(length(p), 0, 1)
  outcomes <- U < p
  return(outcomes)
}

# Simulate some binary outcomes #
n <- 100
x <- runif(n, -3, 3)
p <- 1/(1 + exp(-x))
y <- bs(p)

# Using my overly verbose code at https://gist.github.com/cjbayesian/6921118
AUC(d=y, pred=p, res=500, plot=TRUE)

## The hard way (but with fewer lines of code) ##
N <- 10000000
r_pos_p <- sample(p[y==1], N, replace=TRUE)
r_neg_p <- sample(p[y==0], N, replace=TRUE)

# Monte Carlo probability of a randomly drawn 1 having a higher
# score than a randomly drawn 0 (AUC by definition):
rAUC <- mean(r_pos_p > r_neg_p)
print(rAUC)
```

By randomly sampling positive and negative cases to see how often the positives have larger predicted probability than the negatives, the AUC can be calculated without the ROC or thresholds or anything. Now, before you object that this is necessarily an approximation, I’ll stop you right there – it is. And it is more computationally expensive too. The real value for me in this method is for my understanding of the meaning of AUC. I hope that it has helped yours too!


It’s now 2013 and unfortunately our data ends in 2010. However, the pattern does seem to be quite regular (that is, exhibits annual periodicity) so I decided to have a go at forecasting the time series for the missing years. I used a seasonal decomposition of time series by LOESS to accomplish this.
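The decomposition itself is a one-liner in base R via `stl()`. Here is a sketch with placeholder noise standing in for the real accident-rate series (which lives with the code linked below):

```r
# Sketch: seasonal decomposition of a weekly series by LOESS (STL).
# 'rates' here is placeholder data; the actual analysis uses the
# 2006-2010 accident-rate series.
rates <- ts(rnorm(52 * 5, mean = 10, sd = 2), frequency = 52)
fit <- stl(rates, s.window = "periodic")
plot(fit)  # shows the seasonal, trend, and remainder panels
```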

You can see the code on github but here are the results. First, I looked at the four components of the decomposition:

Indeed the seasonal component is quite regular and does contain the intriguing dip in the middle of the summer that I mentioned in the first post.

This figure shows just the seasonal deviation from the average rates. The peaks seem to be in early July and again in late September. Before doing any seasonal aggregation, I thought that the mid-summer dip might correspond with the mid-August construction holiday; however, it now looks like a broader, summer-long reprieve. It could be a population-wide vacation effect.

Finally, I used an exponential smoothing model to project the accident rates into the 2011-2013 seasons.
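A sketch of that projection, assuming the `forecast` package (again with placeholder data in place of the real series):

```r
library(forecast)
# stlf() pairs an STL decomposition with exponential smoothing on the
# seasonally adjusted series, then re-seasonalizes the forecast.
rates <- ts(rnorm(52 * 5, mean = 10, sd = 2), frequency = 52)
fc <- stlf(rates, h = 52 * 3)  # project three seasons ahead
plot(fc)  # point forecast with confidence bounds
```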

It would be great to get the data from these years to validate the forecast, but for now let’s just hope that we’re not pushing up against those upper confidence bounds.


If you want the slides, head on over to my speakerdeck page.
