
When reading headlines about findings from data, always ask: “To what population does this conclusion apply?” Brian D’Alessandro explains eloquently why sampling matters.

*This is a guest post by Brian D’Alessandro, who daylights as the Head of Data Science at Zocdoc and as an Adjunct Professor with NYU’s Center for Data Science. When not thinking probabilistically, he’s drumming with the indie surf rock quartet Coastgaard.*

I’d like to address the recent study by Roland Fryer Jr from Harvard University, and associated NY Times coverage, that claims to show zero racial bias in police shootings. While this paper certainly makes an honest attempt to study this very important and timely problem, it ultimately suffers from issues of data sampling and subjective data preparation. Given the media attention it is receiving, and the potential policy and public perceptual implications of this attention, we as a community of data people need to comb through this work and make sure the headlines are consistent with the underlying statistics.

First things first: is there really zero…


]]>

Code snippets to generate animated examples here.

]]>

@CjBayesian giving an awesome talk on #machinelearning at tonight’s dataphilly! pic.twitter.com/Hojn5t7tDl

— DataPhilly (@DataPhilly) February 19, 2016

]]>

6:00 PM to 9:00 PM

**Speakers:**

- Corey Chivers
- Randy Olson
- Austin Rochford

**Abstract**: Corey will present a brief introduction to machine learning. In his talk he will demystify what is often seen as a dark art. Corey will describe how we “teach” machines to learn patterns from examples by breaking the process into its easy-to-understand component parts. By using examples from fields as diverse as biology, health-care, astrophysics, and NBA basketball, Corey will show how data (both big and small) is used to teach machines to predict the future so we can make better decisions.

**Bio**: Corey Chivers is a Senior Data Scientist at Penn Medicine where he is building machine learning systems to improve patient outcomes by providing real-time predictive applications that empower clinicians to identify at-risk individuals. When he’s not poring over data, he’s likely to be found cycling around his adoptive city of Philadelphia or blogging about all things probability and data at bayesianbiologist.com.

**Randy Olson** (University of Pennsylvania Institute for Biomedical Informatics):

**Automating data science through tree-based pipeline optimization**

**Abstract**: Over the past decade, data science and machine learning have grown from a mysterious art form to a staple tool across a variety of fields in business, academia, and government. In this talk, I’m going to introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning: pipeline design. All of the work presented in this talk is based on the open source Tree-based Pipeline Optimization Tool (TPOT), which is available on GitHub at https://github.com/rhiever/tpot.

**Bio**: Randy Olson is an artificial intelligence researcher at the University of Pennsylvania Institute for Biomedical Informatics, where he develops state-of-the-art machine learning algorithms to solve biomedical problems. He regularly writes about his latest adventures in data science at RandalOlson.com/blog, and tweets about the latest data science news at http://twitter.com/randal_olson.

**Abstract**: Bayesian optimization is a technique for finding the extrema of functions which are expensive, difficult, or time-consuming to evaluate. It has many applications to optimizing the hyperparameters of machine learning models, optimizing the inputs to real-world experiments and processes, etc. This talk will introduce the Gaussian process approach to Bayesian optimization, with sample code in Python.

**Bio**: Austin Rochford is a Data Scientist at Monetate. He is a former mathematician who is interested in Bayesian nonparametrics, multilevel models, probabilistic programming, and efficient Bayesian computation.

]]>

**Let’s get hypothetical**

You’ve taken a bet that pays off if you guess the exact date of the next occurrence of a rare event (p = 0.0001 on any given day, i.i.d.). What day do you choose? In other words, what is the most likely day for this rare event to occur?

Setting aside for now why in the world you’ve taken such a silly sounding bet, it would seem as though a reasonable way to think about it would be to ask: what is the expected number of days until the event? That must be the best bet, right?

We can work out the expected number of days quite easily as 1/p = 10000. So using the logic of expectation, we would choose day 10000 as our bet.

Let’s simulate to see how often we would win with this strategy. We’ll simulate the outcomes by flipping a weighted coin until it comes out heads. We’ll do this 1,000,000 times and record how many flips it took each time.
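Here is a minimal sketch of such a simulation in Python, using only the standard library. It is not the code behind the numbers in this post. Rather than literally flipping the coin day by day, it draws each waiting time directly with the inverse-transform trick for the geometric distribution, which gives the same distribution much faster. The function name, the seed, and the number of runs are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def simulate_time_to_event(p, n_trials, seed=0):
    """Draw n_trials waiting times (in days, counting from 1) for an
    event that occurs independently each day with probability p."""
    rng = random.Random(seed)
    log_q = math.log1p(-p)  # log(1 - p), computed accurately for small p
    times = []
    for _ in range(n_trials):
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        # Inverse transform: P(T > k) = (1 - p)^k
        times.append(int(math.log(u) / log_q) + 1)
    return times

p = 0.0001
n_trials = 1_000_000  # assumption of this sketch; about n_trials * p runs end on day 1
times = simulate_time_to_event(p, n_trials)
counts = Counter(times)

print("mean time to event:", sum(times) / len(times))
print("runs ending on day 1:", counts[1])
print("runs ending on day 10000:", counts[10000])
```

The exact counts will vary with the seed and the number of runs, but the mean should land near 1/p while day 1 collects more runs than any single later day.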

The event occurred on day 10,000 exactly 35 times. However, if we look at a histogram of our simulation experiment, we can see that the time it took for the rare event to happen was more often short than long. In fact, the event occurred 103 times on the very first flip (the most common Time to Event in our set)!

So from the experiment it would seem that the most likely amount of time to pass until the rare event occurs is 0. Maybe our hypothetical event was just not rare *enough*. Let’s try it again with p = 0.000001, or an event with a 1 in 1 million chance of occurring each day.

While now our event is extremely unlikely to occur, it’s still *most likely* to occur right away.
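The simulation result is not a fluke. Under the i.i.d. setup, the waiting time follows a geometric distribution: the chance that the event first occurs on day k is p(1 − p)^(k − 1). Each extra day multiplies that probability by (1 − p), which is less than 1, so day 1 is always the single most likely day no matter how small p is, even though the mean is 1/p. A small sketch to check this numerically (the function name is just for illustration):

```python
import math

def geometric_pmf(k, p):
    """P(event first occurs on day k) when each day independently
    has probability p of the event."""
    return p * (1.0 - p) ** (k - 1)

p = 0.0001
print("expected day :", 1 / p)
print("P(day 1)     :", geometric_pmf(1, p))
print("P(day 10000) :", geometric_pmf(10000, p))
# The ratio is (1 - p)^(1/p - 1), which is close to exp(-1) for small p
print("ratio        :", geometric_pmf(10000, p) / geometric_pmf(1, p))
print("exp(-1)      :", math.exp(-1))
```

Multiplying these probabilities by the number of simulated runs gives the expected count for each day.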

**Existential Risk**

What does this all have to do with seizing the day? Everything we do in a given day comes with some degree of risk. The Stanford professor Ronald A. Howard conceived of a way of measuring the riskiness of various day-to-day activities, which he termed the **micromort**. One micromort is a unit of risk equal to p = 0.000001 (1 in a million chance) of death. We are all subject to a baseline level of risk in micromorts, and additional activities may add or subtract from that level (skiing, for instance, adds 0.7 micromorts per day).

While minimizing the risks we assume in our day-to-day lives can increase our expected life span, the most likely exact day of our demise is always our next one. So *carpe diem*!!

Post Script:

Don’t get too freaked out by all of this. It’s just a bit of fun that comes from viewing the problem in a very specific way. That is, as a question of which exact day is most likely. The much more natural way to view it is to ask, what is the relative probability of the unlikely event occurring tomorrow vs *any other day but tomorrow*. I leave it to the reader to confirm that for events with

]]>

Being a machine learning conference, it’s only reasonable that we apply a little machine learning to this (decidedly _small_) data.

Building off of the great example code in a post by Jordan Barber on Latent Dirichlet Allocation (LDA) with Python, I scraped the paper titles and built an LDA topic model with 5 topics. All of the code to reproduce this post is available on GitHub. Here are the top 10 most probable words from each of the derived topics:

| | Topic 0 | Topic 1 | Topic 2 | Topic 3 | Topic 4 |
|---|---|---|---|---|---|
| 0 | learning | learning | optimization | learning | via |
| 1 | models | inference | networks | bayesian | models |
| 2 | neural | sparse | time | sample | inference |
| 3 | high | models | stochastic | analysis | networks |
| 4 | stochastic | non | model | data | deep |
| 5 | dimensional | optimization | convex | inference | learning |
| 6 | networks | algorithms | monte | spectral | fast |
| 7 | graphs | multi | carlo | networks | variational |
| 8 | optimal | linear | neural | bandits | neural |
| 9 | sampling | convergence | information | methods | convolutional |

Normally, we might try to attach some kind of label to each topic using our beefy human brains and subject matter expertise, but I didn’t bother with this — nothing too obvious stuck out at me. If you think that you have appropriate names for them feel free to let me know. Given that we are only working with the titles (no abstracts or full paper text), I think that there aren’t obvious human-interpretable topics jumping out. But let’s not let that stop us from proceeding.

We can also represent the inferred topics with the much maligned, but handy-dandy wordcloud visualization:

Since we are modeling the paper title generating process as a probability distribution of topics, each of which is a probability distribution of words, we can use this generating process to suggest keywords for each title. These keywords may or may not show up in the title itself. Here are some from the first 10 titles:

    ================
    Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing
    Generated Keywords: [u'iteration', u'inference', u'theory']
    ================
    Learning with Symmetric Label Noise: The Importance of Being Unhinged
    Generated Keywords: [u'uncertainty', u'randomized', u'neural']
    ================
    Algorithmic Stability and Uniform Generalization
    Generated Keywords: [u'spatial', u'robust', u'dimensional']
    ================
    Adaptive Low-Complexity Sequential Inference for Dirichlet Process Mixture Models
    Generated Keywords: [u'rates', u'fast', u'based']
    ================
    Covariance-Controlled Adaptive Langevin Thermostat for Large-Scale Bayesian Sampling
    Generated Keywords: [u'monte', u'neural', u'stochastic']
    ================
    Robust Portfolio Optimization
    Generated Keywords: [u'learning', u'online', u'matrix']
    ================
    Logarithmic Time Online Multiclass prediction
    Generated Keywords: [u'complexity', u'problems', u'stein']
    ================
    Planar Ultrametric Rounding for Image Segmentation
    Generated Keywords: [u'deep', u'graphs', u'neural']
    ================
    Expressing an Image Stream with a Sequence of Natural Sentences
    Generated Keywords: [u'latent', u'process', u'stochastic']
    ================
    Parallel Correlation Clustering on Big Graphs
    Generated Keywords: [u'robust', u'learning', u'learning']
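To make the generating process concrete, here is a rough sketch of how keywords like these can be sampled: pick a topic from the title’s topic distribution, then pick a word from that topic’s word distribution. The probabilities below are made-up toy values, not the fitted model from this post, and `generate_keywords` is a hypothetical helper rather than the function used to produce the output above.

```python
import random

# Hypothetical toy model: two topics, each a distribution over words.
topic_word = {
    0: {"learning": 0.5, "neural": 0.3, "networks": 0.2},
    1: {"inference": 0.4, "bayesian": 0.4, "sampling": 0.2},
}

def generate_keywords(doc_topic, topic_word, n_keywords=3, seed=None):
    """Sample keywords by first drawing a topic from the title's topic
    mix, then drawing a word from that topic's word distribution."""
    rng = random.Random(seed)
    topics = list(doc_topic)
    topic_weights = [doc_topic[t] for t in topics]
    keywords = []
    for _ in range(n_keywords):
        topic = rng.choices(topics, weights=topic_weights)[0]
        words = list(topic_word[topic])
        word_weights = [topic_word[topic][w] for w in words]
        keywords.append(rng.choices(words, weights=word_weights)[0])
    return keywords

# Hypothetical inferred topic mix for a single title
doc_topic = {0: 0.8, 1: 0.2}
print(generate_keywords(doc_topic, topic_word, seed=1))
```

Because each draw is independent, the same word can come up more than once, as it does for the last title in the list.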

While some titles are strongly associated with a single topic, others seem to be generated from more even distributions over topics. Paper titles with more equal representation over topics could be considered to be, in some way, more *interdisciplinary*, or at least, *intertopicular* (yes, I just made that word up). To find these papers, we’ll find which paper titles have the highest information entropy in their inferred topic distribution.
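As a sketch of the ranking step, the entropy of a topic distribution can be computed as the negative sum of p log p over topics. The titles and topic distributions below are hypothetical placeholders, not the fitted values; in practice they would come from the LDA model.

```python
import math

def topic_entropy(topic_probs):
    """Shannon entropy (in nats) of a topic distribution.
    Topics with zero probability contribute nothing."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

# Hypothetical inferred topic distributions over 5 topics
inferred = {
    "title A": [0.92, 0.02, 0.02, 0.02, 0.02],  # dominated by one topic
    "title B": [0.40, 0.30, 0.10, 0.10, 0.10],
    "title C": [0.20, 0.20, 0.20, 0.20, 0.20],  # spread evenly
}

# Rank titles from highest to lowest entropy
ranked = sorted(inferred, key=lambda t: topic_entropy(inferred[t]), reverse=True)
for title in ranked:
    print(title, round(topic_entropy(inferred[title]), 3))
```

A perfectly even spread over 5 topics gives the maximum possible entropy, log(5), while a title assigned entirely to one topic has entropy 0.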

Here are the top 10 along with their associated entropies:

So by this method, ‘Where are they looking’ has the highest entropy, though this looks like a result of topic uncertainty more than any real multi-topic content.

]]>

Also on the line-up was (sometimes contributor to bayesianbiologist) Matt Sundquist. He demo’d some of plot.ly’s most recent features to audible gasps of delight from the audience.

]]>

From what I can tell, there are no builtins in the Python data ecosystem (numpy, pandas, matplotlib) for this, so I coded up a function to emulate the R behaviour. You can get it in this gist (feedback welcome).

Here’s an example of it in action showing derived time-series features (12 hour rates of change) for some clinical variables.

plot_correlogram(df)

]]>

    find . -iname "*.R" -print0 | xargs -L1 -0 egrep -r "plot\(" | wc -l
    2922

The actual number is very likely orders of magnitude larger as 1) many of these plot statements are in loops, 2) it doesn’t capture how many times I may have run a given script, 3) it doesn’t look at previous versions, 4) plot is not the only command to generate figures in R (e.g. hist), and 5) early in my graduate career I mainly used gnuplot and near the end I was using more and more matplotlib. But even at this lower bound, that’s nearly 3,000 plots. A quick look at the TOC of my thesis reveals a grand total of 33 figures. Were all the rest a waste? (Hint: No.)
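For anyone who wants to run the same kind of count without the shell, here is a rough Python equivalent. It walks a directory tree and counts the lines containing `plot(` in files ending in `.R` (case-insensitive). The function name and defaults are hypothetical, and like the shell version it counts matching lines, so it also picks up calls such as `barplot(`.

```python
import os

def count_plot_calls(root, needle="plot(", extension=".r"):
    """Count lines containing `needle` in all files under `root`
    whose names end in `extension` (compared case-insensitively)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(extension):
                path = os.path.join(dirpath, name)
                # Ignore undecodable bytes rather than failing on odd files
                with open(path, errors="ignore") as handle:
                    total += sum(1 for line in handle if needle in line)
    return total
```

Calling `count_plot_calls(".")` from the top of a project directory gives a lower bound of the same kind as the one discussed above.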

The overwhelming majority of the plots that I created served a very different function than these final, publication-ready figures. Generally, visualizations are either:

- A) Communication between **you and data**, or
- B) Communication between **you and someone else**, *through* data.

These two modes serve very different purposes and can require taking different approaches in their creation. Visualizations in the first mode need only be quick and dirty. You can often forget about all that nice axis labeling, optimal color contrast, and whiz-bang interactivity. By my estimates above, plots of this kind outnumbered the polished ones by *at the very least* 10:1. The important thing is that, in this mode, **you already have all of the context**. You know what the variables are, you know what the colors, shapes, sizes, and layouts mean – after all, you just coded it. The beauty of this is that you can iterate on these plots very quickly. The conversation between you and the data can go back and forth as you intrepidly explore and shine your light into all of its dark little corners.

In the second mode, you are telling a story to someone else. Much more thought and care needs to be placed on ensuring that *the whole story is being told with the visualization*. It is all too easy to produce something that makes sense to you, but is completely unintelligible to your intended audience. I’ve learned the hard way that this kind of visual should always be test-driven by someone who, ideally, is a member of your intended audience. When you are as steeped in the data as you most likely are, your mind will fill in any missing pieces of the story – something your audience won’t do.

In my new role as part of the Data Science team at Penn Medicine, I’ll be making more and more data visualizations in the second mode. A little less talking to myself with data, and a little more communicating with others *through* data. I’ll be sharing some of my experiences, tools, wins, and disasters here. Stay tuned!

]]>

Plotly is a platform for making, editing, and sharing graphs. If you are used to making plots with ggplot2, you can call ggplotly() to make your plots interactive, web-based, and collaborative. For example, see plot.ly/~ggplot2examples/211, shown below and in this Notebook. Notice the hover text!

Visit http://plot.ly. Here, you’ll find a GUI that lets you create graphs from data you enter manually, or upload as a spreadsheet (or CSV file). From there you can edit graphs! Change between types (from bar charts to scatter charts), change colors and formatting, add fits and annotations, try other themes…

Our R API lets you use Plotly with R. Once you have your R visualization in Plotly, you can use the web interface to edit it, or to extract its data. Install and load package “plotly” in your favourite R environment. For a quick start, follow: https://plot.ly/ggplot2/getting-started/

Go social! Like, share, comment, fork and edit plots… Export them, embed them in your website. Collaboration has never been so sweet!

Not ready to publish? Set detailed permissions for who can view and who can edit your project.

Baseball data is the best! Let’s plot a histogram of batting averages. I downloaded data here.

Load the CSV file of interest, take a look at the data, subset at will:

    library(RCurl)
    online_data <- getURL("https://raw.githubusercontent.com/mkcor/baseball-notebook/master/Batting.csv")
    batting_table <- read.csv(textConnection(online_data))
    head(batting_table)
    summary(batting_table)
    batting_table <- subset(batting_table, yearID >= 2004)

The batting average is defined by the number of hits divided by at bats:

batting_table$Avg <- with(batting_table, H / AB)

You may want to explore the distribution of your new variable as follows:

    library(ggplot2)
    ggplot(data=batting_table) + geom_histogram(aes(Avg), binwidth=0.05)
    # Let's filter out entries where players were at bat less than 10 times.
    batting_table <- subset(batting_table, AB >= 10)
    hist <- ggplot(data=batting_table) + geom_histogram(aes(Avg), binwidth=0.05)
    hist

We have created a basic histogram; let us share it, so we can get input from others!

    # Install the latest version
    # of the “plotly” package and load it
    library(devtools)
    install_github("ropensci/plotly")
    library(plotly)
    # Open a Plotly connection
    py <- plotly("ggplot2examples", "3gazttckd7")

Use your own credentials if you prefer. You can sign up for a Plotly account online.

Now call the `ggplotly()` method:

collab_hist <- py$ggplotly(hist)

And boom!

You get a nice interactive version of your plot! Go ahead and hover…

Your plot lives at this URL (`collab_hist$response$url`) alongside the data. How great is that?!

If you wanted to keep your project private, you would use your own credentials and specify:

    py <- plotly()
    py$ggplotly(hist, kwargs=list(filename="private_project", world_readable=FALSE))

Now let us click “Fork and edit”. You (and whoever you’ve added as a collaborator) can make edits in the GUI. For instance, you can run a Gaussian fit on this distribution:

You can give a title, edit the legend, add notes, etc.

You can add annotations in a very flexible way, controlling what the arrow and text look like:

When you’re happy with the changes, click “Share” to get your plot’s URL.

If you append a supported extension to the URL, Plotly will translate your plot into that format. Use this to export static images, embed your graph as an iframe, or translate the code between languages. Supported file types include:

Isn’t life wonderful?

The JSON file specifies your plot completely (it contains all the data and layout info). You can view it as your plot’s DNA. The R file (https://plot.ly/~mkcor/305.r) is a conversion of this JSON into a nested list in R. So we can interact with it by programming in R!

Access a plot which lives on plot.ly with the well-named method `get_figure()`:

enhanc_hist <- py$get_figure("mkcor", 305)

Take a look:

    str(enhanc_hist)
    # Data for second trace
    enhanc_hist$data[[2]]

The second trace is a vertical line at 0.300 named “Good”. Say we get more ambitious and we want to show a vertical line at 0.350 named “Very Good”. We overwrite old values with our new values:

    enhanc_hist$data[[2]]$name <- "VeryGood"
    enhanc_hist$data[[2]]$x[[1]] <- 0.35
    enhanc_hist$data[[2]]$x[[2]] <- 0.35

Send this new plot back to plot.ly!

    enhanc_hist2 <- py$plotly(enhanc_hist$data, kwargs=list(layout=enhanc_hist$layout))
    enhanc_hist2$url

Visit the above URL (`enhanc_hist2$url`).

How do you like this workflow? Let us know!

Tutorials are at plot.ly/learn. You can see more examples and documentation at plot.ly/ggplot2 and plot.ly/r. Our gallery has the following examples:

**Acknowledgments**

This presentation benefited tremendously from comments by Matt Sundquist and Xavier Saint-Mleux.

Plotly’s R API is part of rOpenSci. It is under active development; you can find it on GitHub. Your thoughts, issues, and pull requests are always welcome!

]]>