Another great turnout at the DataPhilly meetup last night. Was great to see all you random data nerds!
Code snippets to generate animated examples here.
6:00 PM to 9:00 PM
Abstract: Corey will present a brief introduction to machine learning. In his talk he will demystify what is often seen as a dark art. Corey will describe how we “teach” machines to learn patterns from examples by breaking the process into its easy-to-understand component parts. By using examples from fields as diverse as biology, health-care, astrophysics, and NBA basketball, Corey will show how data (both big and small) is used to teach machines to predict the future so we can make better decisions.
Bio: Corey Chivers is a Senior Data Scientist at Penn Medicine where he is building machine learning systems to improve patient outcomes by providing real-time predictive applications that empower clinicians to identify at-risk individuals. When he’s not poring over data, he’s likely to be found cycling around his adoptive city of Philadelphia or blogging about all things probability and data at bayesianbiologist.com.
Automating data science through tree-based pipeline optimization
Abstract: Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in business, academia, and government. In this talk, I’m going to introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. All of the work presented in this talk is based on the open source Tree-based Pipeline Optimization Tool (TPOT), which is available on GitHub at https://github.com/rhiever/tpot.
Bio: Randy Olson is an artificial intelligence researcher at the University of Pennsylvania Institute for Biomedical Informatics, where he develops state-of-the-art machine learning algorithms to solve biomedical problems. He regularly writes about his latest adventures in data science at RandalOlson.com/blog, and tweets about the latest data science news at http://twitter.com/randal_olson.
Abstract: Bayesian optimization is a technique for finding the extrema of functions which are expensive, difficult, or time-consuming to evaluate. It has many applications to optimizing the hyperparameters of machine learning models, optimizing the inputs to real-world experiments and processes, etc. This talk will introduce the Gaussian process approach to Bayesian optimization, with sample code in Python.
Bio: Austin Rochford is a Data Scientist at Monetate. He is a former mathematician who is interested in Bayesian nonparametrics, multilevel models, probabilistic programming, and efficient Bayesian computation.
The default plot method for dataframes in R is to show each numeric variable in a pair-wise scatter plot. I find this to be a really useful first look at a dataset, both to see correlations and joint distributions between variables and to quickly diagnose potential strangeness like bands of repeating values or outliers.
From what I can tell, there are no builtins in the Python data ecosystem (numpy, pandas, matplotlib) for this, so I coded up a function to emulate the R behaviour. You can get it in this gist (feedback welcomed).
Here’s an example of it in action showing derived time-series features (12 hour rates of change) for some clinical variables.
Over the years of my graduate studies I made a lot of plots. I mean tonnes. To get an extremely conservative estimate I grep’ed for every instance of “plot\(” in all of the many R scripts I wrote over the past five years.
    find . -iname "*.R" -print0 | xargs -L1 -0 egrep -r "plot\(" | wc -l
    2922
The actual number is very likely orders of magnitude larger as 1) many of these plot statements are in loops, 2) it doesn’t capture how many times I may have run a given script, 3) it doesn’t look at previous versions, 4) plot is not the only command to generate figures in R (e.g. hist), and 5) early in my graduate career I mainly used gnuplot and near the end I was using more and more matplotlib. But even at this lower bound, that’s nearly 3,000 plots. A quick look at the TOC of my thesis reveals a grand total of 33 figures. Were all the rest a waste? (Hint: No.)
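For anyone without `find` and `egrep` handy, here is a rough Python sketch of the same count. It is an approximation under my own assumptions, not the original command: it walks a directory, looks only at files ending in `.R` (case-insensitive), and counts lines containing `plot(`. Like the grep, it also catches `boxplot(`, `barplot(` and friends. The function name and the demo files are made up for illustration.

```python
import os
import re
import tempfile

PLOT_CALL = re.compile(r"plot\(")

def count_plot_lines(root):
    """Count lines containing 'plot(' in every *.R file under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".r"):
                path = os.path.join(dirpath, name)
                with open(path, errors="ignore") as handle:
                    total += sum(1 for line in handle if PLOT_CALL.search(line))
    return total

# Tiny demo on throwaway files (hypothetical content).
with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "a.R"), "w") as f:
        f.write("plot(x)\nhist(x)\nfor (i in 1:3) plot(i)\n")
    with open(os.path.join(root, "notes.txt"), "w") as f:
        f.write("plot(ignored)\n")
    demo_count = count_plot_lines(root)

print(demo_count)  # 2
```

Note that the `plot(i)` inside the loop counts once, which is exactly why the grep-based number is a lower bound.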
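The overwhelming majority of the plots that I created served a very different function than these final, publication-ready figures. Generally, visualizations are either 1) a way for you to explore the data yourself, or 2) a way to tell a story about the data to someone else.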
These two modes serve very different purposes and can require taking different approaches in their creation. Visualizations in the first mode need only be quick and dirty. You can often forget about all that nice axis labeling, optimal color contrast, and whiz-bang interactivity. By my estimate above, plots in this mode outnumbered the rest by at least 10 to 1. The important thing is that, in this mode, you already have all of the context. You know what the variables are, you know what the colors, shapes, sizes, and layouts mean – after all, you just coded it. The beauty of this is that you can iterate on these plots very quickly. The conversation between you and the data can go back and forth as you intrepidly explore and shine your light into all of its dark little corners.
In the second mode, you are telling a story to someone else. Much more thought and care needs to be placed on ensuring that the whole story is being told with the visualization. It is all too easy to produce something that makes sense to you, but is completely unintelligible to your intended audience. I’ve learned the hard way that this kind of visual should always be test-driven by someone who, ideally, is a member of your intended audience. When you are as steeped in the data as you most likely are, your mind will fill in any missing pieces of the story – something your audience won’t do.
In my new role as part of the Data Science team at Penn Medicine, I’ll be making more and more data visualizations in the second mode. A little less talking to myself with data, and a little more communicating with others through data. I’ll be sharing some of my experiences, tools, wins, and disasters here. Stay tuned!
Plotly is a platform for making, editing, and sharing graphs. If you are used to making plots with ggplot2, you can call ggplotly() to make your plots interactive, web-based, and collaborative. For example, see plot.ly/~ggplot2examples/211, shown below and in this Notebook. Notice the hover text!
Visit http://plot.ly. Here, you’ll find a GUI that lets you create graphs from data you enter manually, or upload as a spreadsheet (or CSV file). From there you can edit graphs! Change between types (from bar charts to scatter charts), change colors and formatting, add fits and annotations, try other themes…
Our R API lets you use Plotly with R. Once you have your R visualization in Plotly, you can use the web interface to edit it, or to extract its data. Install and load package “plotly” in your favourite R environment. For a quick start, follow: https://plot.ly/ggplot2/getting-started/
Go social! Like, share, comment, fork and edit plots… Export them, embed them in your website. Collaboration has never been so sweet!
Not ready to publish? Set detailed permissions for who can view and who can edit your project.
Baseball data is the best! Let’s plot a histogram of batting averages. I downloaded data here.
Load the CSV file of interest, take a look at the data, subset at will:
    library(RCurl)
    online_data <- getURL("https://raw.githubusercontent.com/mkcor/baseball-notebook/master/Batting.csv")
    batting_table <- read.csv(textConnection(online_data))
    head(batting_table)
    summary(batting_table)
    batting_table <- subset(batting_table, yearID >= 2004)
The batting average is defined as the number of hits divided by the number of at bats:
batting_table$Avg <- with(batting_table, H / AB)
You may want to explore the distribution of your new variable as follows:
    library(ggplot2)
    ggplot(data=batting_table) + geom_histogram(aes(Avg), binwidth=0.05)

    # Let's filter out entries where players were at bat less than 10 times.
    batting_table <- subset(batting_table, AB >= 10)
    hist <- ggplot(data=batting_table) + geom_histogram(aes(Avg), binwidth=0.05)
    hist
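If you think better in Python, the same two steps (compute H / AB, drop players with fewer than 10 at bats, then bin with a width of 0.05) can be sketched without any plotting library. The rows below are hypothetical stand-ins for Batting.csv and the function names are mine. The bins here start at 0, which may not match where ggplot2 places its bin edges, so treat the counts as illustrative only.

```python
from collections import Counter

# Hypothetical (playerID, AB, H) rows standing in for Batting.csv.
rows = [
    ("player_a", 500, 160),
    ("player_b", 8, 4),    # fewer than 10 at bats: filtered out
    ("player_c", 400, 110),
    ("player_d", 0, 0),    # no at bats: average undefined, filtered out
    ("player_e", 30, 10),
]

def batting_averages(rows, min_ab=10):
    """Return {playerID: H / AB} for players with at least min_ab at bats."""
    return {pid: h / ab for pid, ab, h in rows if ab >= min_ab}

def bin_counts(values, binwidth=0.05):
    """Count values per bin, where bin k covers [k*binwidth, (k+1)*binwidth)."""
    return Counter(int(v // binwidth) for v in values)

averages = batting_averages(rows)
counts = bin_counts(averages.values())

for k in sorted(counts):
    print(f"[{k * 0.05:.2f}, {(k + 1) * 0.05:.2f}): {counts[k]}")
```

Filtering before dividing also sidesteps the division by zero for players with no at bats, which R quietly turns into NaN.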
We have created a basic histogram; let us share it, so we can get input from others!
    # Install the latest version
    # of the “plotly” package and load it
    library(devtools)
    install_github("ropensci/plotly")
    library(plotly)

    # Open a Plotly connection
    py <- plotly("ggplot2examples", "3gazttckd7")
Use your own credentials if you prefer. You can sign up for a Plotly account online.
Now call the `ggplotly()` method:
collab_hist <- py$ggplotly(hist)
You get a nice interactive version of your plot! Go ahead and hover…
Your plot lives at this URL (`collab_hist$response$url`) alongside the data. How great is that?!
If you wanted to keep your project private, you would use your own credentials and specify:
    py <- plotly()
    py$ggplotly(hist, kwargs=list(filename="private_project", world_readable=FALSE))
Now let us click “Fork and edit”. You (and whoever you’ve added as a collaborator) can make edits in the GUI. For instance, you can run a Gaussian fit on this distribution:
You can give a title, edit the legend, add notes, etc.
You can add annotations in a very flexible way, controlling what the arrow and text look like:
When you’re happy with the changes, click “Share” to get your plot’s URL.
If you append a supported extension to the URL, Plotly will translate your plot into that format. Use this to export static images, embed your graph as an iframe, or translate the code between languages. Supported file types include:
Isn’t life wonderful?
The JSON file specifies your plot completely (it contains all the data and layout info). You can view it as your plot’s DNA. The R file (https://plot.ly/~mkcor/305.r) is a conversion of this JSON into a nested list in R. So we can interact with it by programming in R!
Access a plot which lives on plot.ly with the well-named method `get_figure()`:
enhanc_hist <- py$get_figure("mkcor", 305)
Take a look:
    str(enhanc_hist)

    # Data for second trace
    enhanc_hist$data[[2]]
The second trace is a vertical line at 0.300 named “Good”. Say we get more ambitious and we want to show a vertical line at 0.350 named “Very Good”. We overwrite old values with our new values:
    enhanc_hist$data[[2]]$name <- "VeryGood"
    enhanc_hist$data[[2]]$x[[1]] <- 0.35
    enhanc_hist$data[[2]]$x[[2]] <- 0.35
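Since the figure is just nested lists, the same edit is easy to picture in any language. Here is a small Python sketch on a made-up figure. The dictionary below is a hypothetical stand-in, not the real JSON behind the plot; only the "Good" trace at 0.300 comes from the description above. Keep in mind that R lists are 1-indexed, so the second trace is index 1 in Python.

```python
# Hypothetical stand-in for a figure: a list of traces plus layout info.
figure = {
    "data": [
        {"type": "histogram", "name": "Avg", "x": [0.26, 0.31, 0.28]},
        {"type": "scatter", "name": "Good", "x": [0.3, 0.3], "y": [0, 100]},
    ],
    "layout": {"title": "Batting averages"},
}

second_trace = figure["data"][1]   # second trace (index 1 in Python)
second_trace["name"] = "VeryGood"
# Both endpoints move together, so the line stays vertical.
second_trace["x"] = [0.35, 0.35]

print(second_trace)
```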
Send this new plot back to plot.ly!
    enhanc_hist2 <- py$plotly(enhanc_hist$data, kwargs=list(layout=enhanc_hist$layout))
    enhanc_hist2$url
Visit the above URL (`enhanc_hist2$url`).
How do you like this workflow? Let us know!
This presentation benefited tremendously from comments by Matt Sundquist and Xavier Saint-Mleux.
Guest post by Matt Sundquist of plot.ly.
Plotly is a social graphing and analytics platform. Plotly’s R library lets you make and share publication-quality graphs online. Your work belongs to you, you control privacy and sharing, and public use is free (like GitHub). We are in beta, and would love your feedback, thoughts, and advice.
1. Installing Plotly
Let’s install Plotly. Our documentation has more details.
    install.packages("devtools")
    library("devtools")
    devtools::install_github("R-api","plotly")
Then sign up online or like this:
    library(plotly)
    response = signup(username = 'yourusername', email = 'youremail')
Thanks for signing up to plotly! Your username is: MattSundquist Your temporary password is: pw. You use this to log into your plotly account at https://plot.ly/plot. Your API key is: “API_Key”. You use this to access your plotly account through the API.
2. Canadian Population Bubble Chart
    library(plotly)
    p <- plotly(username="MattSundquist", key="4om2jxmhmn")

    library(maps)
    data(canada.cities)
    trace1 <- list(x=map(regions="canada")$x,
                   y=map(regions="canada")$y)
    trace2 <- list(x=canada.cities$long,
                   y=canada.cities$lat,
                   text=canada.cities$name,
                   type="scatter",
                   mode="markers",
                   marker=list(
                     "size"=sqrt(canada.cities$pop/max(canada.cities$pop))*100,
                     "opacity"=0.5)
                   )
    response <- p$plotly(trace1,trace2)
    url <- response$url
    filename <- response$filename
    browseURL(response$url)
In our graph, the bubble size represents the city population size. Shown below is the GUI, where you can annotate, select colors, analyze and add data, style traces, place your legend, change fonts, and more.
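The size formula in the code is worth a second look: sqrt(pop / max(pop)) * 100. Here is a small Python sketch of the same scaling on made-up populations (the city names and numbers are hypothetical, and the function name is mine). Assuming the marker size is treated as a diameter, taking the square root makes the bubble's area, rather than its width, proportional to population.

```python
import math

# Hypothetical populations, in thousands; the original uses maps::canada.cities.
populations = {"city_a": 400, "city_b": 100, "city_c": 25}

def bubble_sizes(pops, max_size=100):
    """Scale each population to a marker size, with the largest city at max_size."""
    largest = max(pops.values())
    return {city: math.sqrt(p / largest) * max_size for city, p in pops.items()}

sizes = bubble_sizes(populations)
print(sizes)  # {'city_a': 100.0, 'city_b': 50.0, 'city_c': 25.0}
```

A city with a quarter of the largest population gets half the diameter, and so a quarter of the area.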
Editing from the GUI, we make a styled version. You can zoom in and hover on the points to find out about the cities. Want to make one for another country? We’d love to see it.
And, here is said meetup, in action:
You can also add in usa and us.cities:
3. Old Faithful and Multiple Axes
Ben Chartoff’s graph shows the correlation between a bimodal eruption time and a bimodal distribution of eruption length. The key series are: a histogram scale of probability, Eruption Time scale in minutes, and a scatterplot showing points within each bin on the x axis. The graph was made with this gist.
4. Plotting Two Histograms Together
Suppose you are studying correlations in two series (Popular Stack Overflow ?). You want to find overlap. You can plot two histograms together, one for each series. The overlapping sections are the darker orange, automatically rendered if you set barmode to ‘overlay’.
    library(plotly)
    p <- plotly(username="Username", key="API_KEY")

    x0 <- rnorm(500)
    x1 <- rnorm(500)+1
    data0 <- list(x=x0, name = "Series One", type='histogramx', opacity = 0.8)
    data1 <- list(x=x1, name = "Series Two", type='histogramx', opacity = 0.8)

    layout <- list(
      xaxis = list(
        ticks = "", gridcolor = "white", zerolinecolor = "white", linecolor = "white"
      ),
      yaxis = list(
        ticks = "", gridcolor = "white", zerolinecolor = "white", linecolor = "white"
      ),
      barmode='overlay',
      # style background color. You can set the alpha by adding an a.
      plot_bgcolor = 'rgba(249,249,251,.85)'
    )

    response <- p$plotly(data0, data1, kwargs=list(layout=layout))
    url <- response$url
    filename <- response$filename
    browseURL(response$url)
5. Plotting y1 and y2 in the Same Plot
    library(plotly)
    p <- plotly(username="Username", key="API_KEY")

    # enter data
    x <- seq(-2, 2, 0.05)
    y1 <- pnorm(x)
    y2 <- pnorm(x,1,1)

    # format, listing y1 as your y.
    First <- list(
      x = x,
      y = y1,
      type = 'scatter',
      mode = 'lines',
      marker = list(
        color = 'rgb(0, 0, 255)',
        opacity = 0.5)
    )

    # format again, listing y2 as your y.
    Second <- list(
      x = x,
      y = y2,
      type = 'scatter',
      mode = 'lines',
      opacity = 0.8,
      marker = list(
        color = 'rgb(255, 0, 0)')
    )

    # send both traces to Plotly, as in the earlier examples
    response <- p$plotly(First, Second)
    browseURL(response$url)
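If you want to check what the two curves should look like before plotting, here is a minimal Python sketch of the same data using only the standard library. `pnorm(x)` is the standard normal CDF and `pnorm(x, 1, 1)` is the CDF of a normal with mean 1 and standard deviation 1; `statistics.NormalDist` (Python 3.8+) provides both. The variable names are mine.

```python
from statistics import NormalDist

# Equivalent of seq(-2, 2, 0.05): 81 points, rounded to dodge float drift.
xs = [round(-2 + 0.05 * i, 2) for i in range(81)]

y1 = [NormalDist(0, 1).cdf(x) for x in xs]  # pnorm(x)
y2 = [NormalDist(1, 1).cdf(x) for x in xs]  # pnorm(x, 1, 1)

# Each CDF crosses 0.5 at its own mean.
print(xs[40], y1[40])
print(xs[60], y2[60])
```

The second curve is the first one shifted one unit to the right, which is what you should see once both traces are on the same plot.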
And a shot of the Plotly gallery, as seen at the Montreal meetup. Happy plotting!
A friend of mine just alerted me to a story on NPR describing a prize on offer from Warren Buffett and Quicken Loans. The prize is a billion dollars (1B USD) for correctly predicting all 63 games in the men’s Division I college basketball tournament this March. The Facebook page announcing the contest puts the odds at 1:9,223,372,036,854,775,808, which they note “may vary depending upon the knowledge and skill of entrant”.
Being curious, I thought I’d see what the assumptions were that went into that number. It would make sense to start with the assumption that you don’t know a lick about college basketball and you just guess using a coin flip for every match-up. In this scenario you’re pretty bad, but you are no worse than random. If we take this assumption, we can calculate the odds as 1/(0.5)^63. To get precision down to a whole integer I pulled out trusty bc for the heavy lifting:
    $ echo "scale=50; 1/(0.5^63)" | bc
    9223372036854775808.000000
Well, that was easy. So if you were to just guess randomly, your odds of winning the big prize would be those published on the contest page. We can easily calculate the expected value of entering the contest as P(win)*prize, or one 9,223,372,036th of a dollar (that’s about a tenth of a nano dollar, if you’re paying attention). You’ve literally already spent that (and then some) in opportunity cost sunk into the time you are spending thinking about this contest and reading this post (but read on, ’cause it’s fun!).
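If bc isn't your thing, the coin-flip numbers are easy to double check in Python, where integers and fractions are exact. This is just a sketch of the arithmetic above; the variable names are mine.

```python
from fractions import Fraction

GAMES = 63
PRIZE = 1_000_000_000  # 1B USD

# Number of equally likely brackets when every game is a coin flip.
brackets = 2 ** GAMES
p_win = Fraction(1, brackets)

# Expected value of one entry: P(win) * prize.
expected_value = p_win * PRIZE

print(brackets)               # 9223372036854775808
print(float(expected_value))  # roughly 1.08e-10 dollars
```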
But of course, you’re cleverer than that. You know everything about college basketball – or more likely if you are reading this blog – you have a kickass predictive model that is going to up your game and get your hands into the pocket of the Oracle from Omaha.
What level of predictiveness would you need to make this bet worthwhile? Let’s have a look at the expected value as a function of our individual game probability of being correct.
And if you think that you’re really good, we can look at the 0.75 to 0.85 range:
So it’s starting to look enticing. You might even be willing to take off work for a while if you thought you could get your model up to a consistent 85% correct game predictions, giving you an expected return of ~$35,000. A recent paper found that even after observing the first 40 scoring events, the outcome of NBA games is only predictable at 80%. In order to be eligible to win, you’ve obviously got to submit your picks before the playoff games begin, but even at this herculean 80% level of accuracy, the expected value of an entry in the contest plummets down to $785.
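The expected values quoted above follow from one line of arithmetic: if each game is called correctly with probability p, independently of the others, a perfect bracket has probability p^63, and the expected value is that times the prize. Here is a minimal Python sketch under that independence assumption (the function name is mine).

```python
def expected_value(p_game, games=63, prize=1e9):
    """Expected winnings of one entry with per-game accuracy p_game."""
    return (p_game ** games) * prize

for p in (0.5, 0.7, 0.75, 0.8, 0.85):
    print(f"{p:.2f} -> ${expected_value(p):,.2f}")
```

At 0.80 this comes out near $785, and at 0.85 a little under $36,000, in line with the figures in the text.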
Those are the odds for an individual entrant, but what are the chances that Buffett and co will have to pay out? That, of course, depends on the number of entrants. Let’s assume that the skill of all entrants is the same, though they all have unique models which make different predictions. In this case we can get the probability of at least one of them hitting it big. It will be the complement of no one winning. We already know the odds for a single entrant with a given level of accuracy, so, treating the entries as independent, we can multiply together the probabilities that each one doesn’t win, then take 1 minus that value: 1 - (1 - p^63)^N for N entrants with per-game accuracy p.
Just as we saw that the expected value is very sensitive to the predictive accuracy of the participant, so too is the probability that the prize will be awarded at all. If 1 million super talented sporting sages with 80% game-level accuracy enter the contest, there will only be a slightly greater than 50% chance of anyone actually winning. If we substitute in a more reasonable (but let’s face it, still wildly high) figure for participants’ accuracy of 70%, the chance becomes only 1 in 5739 (0.017%) that the top prize will be awarded, even with a 1 million strong entrant pool.
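Here is a short Python sketch of the payout probability, assuming (as above) that every entrant has the same per-game accuracy and that entries are independent. The chance that at least one of N entrants is perfect is 1 - (1 - p^63)^N. The function name is mine, and the use of `log1p` and `expm1` is my own choice to keep precision when p^63 is tiny.

```python
import math

def p_any_winner(p_game, entrants, games=63):
    """Probability that at least one of `entrants` independent entries is perfect."""
    p_perfect = p_game ** games
    # 1 - (1 - q)^N, computed as -expm1(N * log1p(-q)) for numerical stability.
    return -math.expm1(entrants * math.log1p(-p_perfect))

for p in (0.7, 0.8):
    chance = p_any_winner(p, 1_000_000)
    print(f"accuracy {p:.0%}: {chance:.4%} (about 1 in {1 / chance:,.0f})")
```

With a million entrants this gives a bit over 54% at 80% accuracy and roughly 1 in 5,739 at 70%.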
tl;dr You’re not going to win, but you’re still going to play.
If you want to reproduce the numbers and plots in this post, check out this gist.