Lately I’ve been thinking a lot about the connection between prediction models and the decisions that they influence. There is a lot of theory around this, but communicating how the various pieces all fit together with the folks who will use and be impacted by these decisions can be challenging.
One of the important conceptual pieces is the link between the decision threshold (how high does the score need to be to predict positive) and the resulting distribution of outcomes (true positives, false positives, true negatives and false negatives). As a starting point, I’ve built this interactive tool for exploring this.
The idea is to take a validation sample of predictions from a model and experiment with the consequences of varying the decision threshold. The hope is that the user will be able to develop an intuition around the tradeoffs involved by seeing the link to the individual data points involved.
Code for this experiment is available here. I hope to continue to build on this with other interactive, visual tools aimed at demystifying the concepts at the interface between predictions and decisions.
I’ve had this book on pre-order since spring and it finally arrived on Friday. I subsequently devoured it over the weekend.
The book lays out a clear and compelling case for how data-driven algorithms can become — in contrast to their promise of amoral objectivism — efficient means for reproducing and even exacerbating social inequalities and injustices. From predictive policing and recidivism risk models to targeted marketing for predatory loans and for-profit universities, O’Neil explains how to recognize WMDs by 3 distinct features:
- The model is either hidden, or opaque to the individuals affected by its calculations, restricting any possibility of seeking recourse against – or understanding of – its results or conclusions.
- The model works against the subject’s interest (eg. it is unfair).
- The model scales, giving it the opportunity to negatively affect a very large segment of the population.
The taxonomy provides a simple framework for identifying WMDs in the wild. However, importantly for data scientists and other data practitioners, it forms a checklist (or rather an anti-checklist) to keep in mind when developing models that will be deployed into the real world. As data scientists, many of us are strongly incentivized to achieve feature 3, and doing so only makes it increasingly important to be constantly questioning the degree to which our models could fall victim to features 2 and 1.
Feature 2, as O’Neil lays out, can occur despite the best intentions of a model’s creators. This can (and does!) happen in two ways: First, when a modeler seeks to create an objective system for rating individuals (say, for acceptance to a prestigious university, or for a payday loan), the data used to build the model is already encoded with the socially constructed biases of the conditions under which it was generated. Even when attempting to exclude potentially bias-laden factors such as race or gender, this information seeps into the model nonetheless via correlations to seemingly benign variables such as zip codes or the makeup of a subject’s social connections.
Second, when the outcome of the model results in the reinforcement of the unjust conditions from which it was created, a negative feedback loop is created. Such a negative feedback loop is particularly present and pernicious in the use of recidivism risk models to guide sentencing decisions. An individual may be labeled as high risk due not to qualities of the individual himself, but his circumstances of living in a poor, high crime neighborhood. Being incarcerated based on the results of this model renders him more likely to end up back in that neighborhood, subject to continued poverty and disproportionate policing. Thus the model has set up the conditions to fulfill its own prediction.
As machine learning algorithms become more and more accurate at a variety of tasks, their inner workings become harder and harder to understand. The trend will make it increasingly difficult to avoid feature 1 of the WMD taxonomy. Current advanced techniques like deep learning are creating models that are remarkably performant, yet not fully understood by the researchers creating them, much less the individuals affected by their results. In light of this, we need to think carefully as data scientists about how to communicate these models with as much transparency as possible. How to do so remains an open question. But the internal ‘black box’ nature of these algorithms does not obviate our responsibility to disclose exactly what input data went into a given model, what assumptions were made of that data, and on what criteria the model was trained.
Overall, WMD provides an incredibly important framework for thinking about the consequences of uncritically applying data and algorithms to people’s lives. For those of us, like O’Neil herself, who make our living using mathematics to create data-driven algorithms, taking to heart the lessons contained in Weapons Of Math Destruction will be our best defense against unwittingly creating the bomb ourselves.
Another great turnout at the DataPhilly meetup last night. Was great to see all you random data nerds!
Code snippets to generate animated examples here.
Thursday, February 18, 2016
6:00 PM to 9:00 PM
- Corey Chivers
- Randy Olson
- Austin Rochford
Abstract: Corey will present a brief introduction to machine learning. In his talk he will demystify what is often seen as a dark art. Corey will describe how we “teach” machines to learn patterns from examples by breaking the process into its easy-to-understand component parts. By using examples from fields as diverse as biology, health-care, astrophysics, and NBA basketball, Corey will show how data (both big and small) is used to teach machines to predict the future so we can make better decisions.
Bio: Corey Chivers is a Senior Data Scientist at Penn Medicine where he is building machine learning systems to improve patient outcomes by providing real-time predictive applications that empower clinicians to identify at risk individuals. When he’s not pouring over data, he’s likely to be found cycling around his adoptive city of Philadelphia or blogging about all things probability and data at bayesianbiologist.com.
Automating data science through tree-based pipeline optimization
Abstract: Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in business, academia, and government. In this talk, I’m going to introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning — pipeline design. All of the work presented in this talk is based on the open source Tree-based Pipeline Optimization Tool (TPOT), which is available on GitHub at https://github.com/rhiever/tpot.
Bio: Randy Olson is an artificial intelligence researcher at the University of Pennsylvania Institute for Biomedical Informatics, where he develops state-of-the-art machine learning algorithms to solve biomedical problems. He regularly writes about his latest adventures in data science at RandalOlson.com/blog, and tweets about the latest data science news at http://twitter.com/randal_olson.
Abstract: Bayesian optimization is a technique for finding the extrema of functions which are expensive, difficult, or time-consuming to evaluate. It has many applications to optimizing the hyperparameters of machine learning models, optimizing the inputs to real-world experiments and processes, etc. This talk will introduce the Gaussian process approach to Bayesian optimization, with sample code in Python.
Bio: Austin Rochford is a Data Scientist at Monetate. He is a former mathematician who is interested in Bayesian nonparametrics, multilevel models, probabilistic programming, and efficient Bayesian computation.
The default plot method for dataframes in R is to show each numeric variable in a pair-wise scatter plot. I find this to be a really useful first look at a dataset, both to see correlations and joint distributions between variables, but also to quickly diagnose potential strangeness like bands of repeating values or outliers.
From what I can tell, there are no builtins in the python data ecosystem (numpy, pandas, matplotlib) for this so I coded up a function to emulate the R behaviour. You can get it in this gist (feedback welcomed).
Here’s an example of it in action showing derived time-series features (12 hour rates of change) for some clinical variables.
Over the years of my graduate studies I made a lot of plots. I mean tonnes. To get an extremely conservative estimate I grep’ed for every instance of “plot\(” in all of the many R scripts I wrote over the past five years.
find . -iname "*.R" -print0 | xargs -L1 -0 egrep -r "plot(" | wc -l 2922
The actual number is very likely orders of magnitude larger as 1) many of these plot statements are in loops, 2) it doesn’t capture how many times I may have ran a given script, 3) it doesn’t look at previous versions, 4) plot is not the only command to generate figures in R (eg hist), and 5) early in my graduate career I mainly used gnuplot and near the end I was using more and more matplotlib. But even at this lower bound, that’s nearly 3,000 plots. A quick look at the TOC of my thesis reveals a grand total of 33 figures. Were all the rest a waste? (Hint: No.)
The overwhelming majority of the plots that I created served a very different function than these final, publication-ready figures. Generally, visualizations are either:
- A) Communication between you and data, or
- B) Communication between you and someone else, through data.
These two modes serve very different purposes and can require taking different approaches in their creation. Visualizations in the first mode need only be quick and dirty. You can often forget about all that nice axis labeling, optimal color contrast, and whiz-bang interactivity. As per my estimates above, this made up at the very least 10:1 of visuals created. The important thing is that, in this mode, you already have all of the context. You know what the variables are, you know what the colors, shapes, sizes, and layouts mean – after all, you just coded it. The beauty of this is that you can iterate on these plots very quickly. The conversation between you and the data can dialogue back and forth as you intrepidly explore and shine your light into all of it’s dark little corners.
In the second mode, you are telling a story to someone else. Much more thought and care needs to be placed on ensuring that the whole story is being told with the visualization. It is all too easy to produce something that makes sense to you, but is completely unintelligible to your intended audience. I’ve learned the hard way that this kind of visual should always be test-driven by someone who, ideally, is a member of your intended audience. When you are as steeped in the data as you most likely are, your mind will fill in any missing pieces of the story – something your audience won’t do.
In my new role as part of the Data Science team at Penn Medicine, I’ll be making more and more data visualizations in the second mode. A little less talking to myself with data, and a little more communicating with others through data. I’ll be sharing some of my experiences, tools, wins, and disasters here. Stay tuned!