If you build and/or use classifiers in your life, feel free to print this out and keep it above you desk.

# Category Archives: decision making

# Weapons of Math Destruction – A Data Scientist’s Guide to Disarmament

I’ve had this book on pre-order since spring and it finally arrived on Friday. I subsequently devoured it over the weekend.

The book lays out a clear and compelling case for how data-driven algorithms can become — in contrast to their promise of amoral objectivism — efficient means for reproducing and even exacerbating social inequalities and injustices. From predictive policing and recidivism risk models to targeted marketing for predatory loans and for-profit universities, O’Neil explains how to recognize WMDs by 3 distinct features:

- The model is either hidden, or opaque to the individuals affected by its calculations, restricting any possibility of seeking recourse against – or understanding of – its results or conclusions.
- The model works against the subject’s interest (eg. it is unfair).
- The model scales, giving it the opportunity to negatively affect a very large segment of the population.

The taxonomy provides a simple framework for identifying WMDs in the wild. However, importantly for data scientists and other data practitioners, it forms a checklist (or rather an anti-checklist) to keep in mind when developing models that will be deployed into the real world. As data scientists, many of us are strongly incentivized to achieve feature 3, and doing so only makes it increasingly important to be constantly questioning the degree to which our models could fall victim to features 2 and 1.

Feature 2, as O’Neil lays out, can occur despite the best intentions of a model’s creators. This can (and does!) happen in two ways: First, when a modeler seeks to create an objective system for rating individuals (say, for acceptance to a prestigious university, or for a payday loan), the data used to build the model is *already encoded with the socially constructed biases* of the conditions under which it was generated. Even when attempting to exclude potentially bias-laden factors such as race or gender, this information seeps into the model nonetheless via correlations to seemingly benign variables such as zip codes or the makeup of a subject’s social connections.

Second, when the outcome of the model results in the reinforcement of the unjust conditions from which it was created, a negative feedback loop is created. Such a negative feedback loop is particularly present and pernicious in the use of recidivism risk models to guide sentencing decisions. An individual may be labeled as high risk due not to qualities of the individual himself, but his circumstances of living in a poor, high crime neighborhood. Being incarcerated based on the results of this model renders him more likely to end up back in that neighborhood, subject to continued poverty and disproportionate policing. Thus the model has set up the conditions to fulfill its own prediction.

As machine learning algorithms become more and more accurate at a variety of tasks, their inner workings become harder and harder to understand. The trend will make it increasingly difficult to avoid feature 1 of the WMD taxonomy. Current advanced techniques like deep learning are creating models that are remarkably performant, yet not fully understood by the researchers creating them, much less the individuals affected by their results. In light of this, we need to think carefully as data scientists about how to communicate these models with as much transparency as possible. How to do so remains an open question. But the internal ‘black box’ nature of these algorithms does not obviate our responsibility to disclose exactly what input data went into a given model, what assumptions were made of that data, and on what criteria the model was trained.

Overall, WMD provides an incredibly important framework for thinking about the consequences of uncritically applying data and algorithms to people’s lives. For those of us, like O’Neil herself, who make our living using mathematics to create data-driven algorithms, taking to heart the lessons contained in *Weapons Of Math Destruction* will be our best defense against unwittingly creating the bomb ourselves.

# Introduction to Machine Learning Talk

There was an amazing turnout at last night’s DataPhilly meetup (~200 people!). I was completely delighted by the turnout and people’s engagement level. Here are the slides of the talk I gave to set up the evening with a high-level introduction to machine learning.

@CjBayesian giving an awesome talk on #machinelearning at tonight’s dataphilly! pic.twitter.com/Hojn5t7tDl

— DataPhilly (@DataPhilly) February 19, 2016

# Speaking at DataPhilly February 2016

The next DataPhilly meetup will feature a medley of machine-learning talks, including an Intro to ML from yours truly. Check out the speakers list and be sure to RSVP. Hope to see you there!

### Thursday, February 18, 2016

6:00 PM to 9:00 PM

**Speakers:**

- Corey Chivers
- Randy Olson
- Austin Rochford

**Abstract**: Corey will present a brief introduction to machine learning. In his talk he will demystify what is often seen as a dark art. Corey will describe how we “teach” machines to learn patterns from examples by breaking the process into its easy-to-understand component parts. By using examples from fields as diverse as biology, health-care, astrophysics, and NBA basketball, Corey will show how data (both big and small) is used to teach machines to predict the future so we can make better decisions.

**Bio**: Corey Chivers is a Senior Data Scientist at Penn Medicine where he is building machine learning systems to improve patient outcomes by providing real-time predictive applications that empower clinicians to identify at risk individuals. When he’s not pouring over data, he’s likely to be found cycling around his adoptive city of Philadelphia or blogging about all things probability and data at bayesianbiologist.com.

**Randy Olson** (University of Pennsylvania Institute for Biomedical Informatics):

**Automating data science through tree-based pipeline optimization**

**Abstract**: Over the past decade, data science and machine learning has grown from a mysterious art form to a staple tool across a variety of fields in business, academia, and government. In this talk, I’m going to introduce the concept of tree-based pipeline optimization for automating one of the most tedious parts of machine learning — pipeline design. All of the work presented in this talk is based on the open source Tree-based Pipeline Optimization Tool (TPOT), which is available on GitHub at https://github.com/rhiever/tpot.

**Bio**: Randy Olson is an artificial intelligence researcher at the University of Pennsylvania Institute for Biomedical Informatics, where he develops state-of-the-art machine learning algorithms to solve biomedical problems. He regularly writes about his latest adventures in data science at RandalOlson.com/blog, and tweets about the latest data science news at http://twitter.com/randal_olson.

**Abstract**: Bayesian optimization is a technique for finding the extrema of functions which are expensive, difficult, or time-consuming to evaluate. It has many applications to optimizing the hyperparameters of machine learning models, optimizing the inputs to real-world experiments and processes, etc. This talk will introduce the Gaussian process approach to Bayesian optimization, with sample code in Python.

**Bio**: Austin Rochford is a Data Scientist at Monetate. He is a former mathematician who is interested in Bayesian nonparametrics, multilevel models, probabilistic programming, and efficient Bayesian computation.

# A probabilistic justification to carpe diem

There’s a curious thing about unlikely independent events: no matter how rare, they’re most likely to happen right away.

**Let’s get hypothetical**

You’ve taken a bet that pays off if you guess the exact date of the next occurrence of a rare event (p = 0.0001 on any given day i.i.d). What day do you choose? In other words, what is the most likely day for this rare event to occur?

Setting aside for now why in the world you’ve taken such a silly sounding bet, it would seem as though a reasonable way to think about it would be to ask: what is the expected number of days until the event? That must be the best bet, right?

We can work out the expected number of days quite easily as 1/p = 10000. So using the logic of expectation, we would choose day 10000 as our bet.

Let’s simulate to see how often we would win with this strategy. We’ll simulate the outcomes by flipping a weighted coin until it comes out heads. We’ll do this 100,000 times and record how many flips it took each time.

The event occurred on day 10,000 exactly 35 times. However, if we look at a histogram of our simulation experiment, we can see that the time it took for the rare event to happen was more often short, than long. In fact, the event occurred 103 times on the very first flip (the most common Time to Event in our set)!

So from the experiment it would seem that the most likely amount of time to pass until the rare event occurs is 0. Maybe our hypothetical event was just not rare *enough*. Let’s try it again with p=0.0000001, or an event with a 1 in 1million chance of occurring each day.

While now our event is extremely unlikely to occur, it’s still *most likely* to occur right away.

**Existential Risk**

What does this all have to do with seizing the day? Everything we do in a given day comes with some degree of risk. The Stanford professor Ronald A. Howard conceived of a way of measuring the riskiness of various day-to-day activities, which he termed the **micromort**. One micromort is a unit of risk equal to p = 0.000001 (1 in a million chance) of death. We are all subject to a baseline level of risk in micromorts, and additional activities may add or subtract from that level (skiing, for instance adds 0.7 micromorts per day).

While minimizing the risks we assume in our day-to-day lives can increase our expected life span, the most likely exact day of our demise is always our next one. So *carpe diem*!!

Post Script:

Don’t get too freaked out by all of this. It’s just a bit of fun that comes from viewing the problem in a very specific way. That is, as a question of which exact day is most likely. The much more natural way to view it is to ask, what is the relative probability of the unlikely event occurring tomorrow vs * any other day but tomorrow*. I leave it to the reader to confirm that for events with

*p < 0.5*, the latter is always more likely.

# What’s Warren Buffett’s $1 Billion Basketball Bet Worth?

A friend of mine just alerted me to a story on NPR describing a prize on offer from Warren Buffett and Quicken Loans. The prize is a billion dollars (1B USD) for correctly predicting all 63 games in the men’s Division I college basketball tournament this March. The facebook page announcing the contest puts the odds at 1:9,223,372,036,854,775,808, which they note “may vary depending upon the knowledge and skill of entrant”.

Being curious, I thought I’d see what the assumptions were that went into that number. It would make sense to start with the assumption that you don’t know a lick about college basketball and you just guess using a coin flip for every match-up. In this scenario you’re pretty bad, but you are no worse than random. If we take this assumption, we can calculate the odds as 1/(0.5)^63. To get precision down to a whole integer I pulled out trusty bc for the heavy lifting:

$ echo "scale=50; 1/(0.5^63)" | bc 9223372036854775808.000000

Well, that was easy. So if you were to just guess randomly, your odds of winning the big prize would be those published on the contest page. We can easily calculate the expected value of entering the contest as P(win)*prize, or 9,223,372,036ths of a dollar (that’s 9 nano dollars, if you’re paying attention). You’ve literally already spent that (and then some) in opportunity cost sunk into the time you are spending thinking about this contest and reading this post (but read on, ’cause it’s fun!).

But of course, you’re cleverer than that. You know everything about college basketball – or more likely if you are reading this blog – you have a kickass predictive model that is going to up your game and get your hands into the pocket of the Oracle from Omaha.

What level of predictiveness would you need to make this bet worth while? Let’s have a look at the expected value as a function of our individual game probability of being correct.

And if you think that you’re *really* good, we can look at the 0.75 to 0.85 range:

So it’s starting to look enticing, you might even be willing to take off work for a while if you thought you could get your model up to a consistent 85% correct game predictions, giving you an expected return of ~$35,000. A recent paper found that even after observing the first 40 scoring events, the outcome of NBA games is only predictable at 80%. In order to be eligible to win, you’ve obviously got to submit your picks *before* the playoff games begin, but even at this herculean level of accuracy, the expected value of an entry in the contest plummets down to $785.

Those are the odds for an individual entrant, but what are the chances that Buffet and co will have to pay out? That, of course, depends on the number of entrants. Lets assume that the skill of all entrants is the same, though they all have unique models which make different predictions. In this case we can get the probability of at *least one of them* hitting it big. It will be the complement of *no one* winning. We already know the odds for a single entrant with a given level of accuracy, so we can just take the probability that each one doesn’t win, then take 1 minus that value.

Just as we saw that the expected value is very sensitive to the predictive accuracy of the participant, so too is the probability that the prize will be awarded at all. If 1 million super talented sporting sages with 80% game-level accuracy enter the contest, there will only be a slightly greater than 50% chance of anyone actually winning. If we substitute in a more reasonable (but let’s face it, still wildly high) figure for participants’ accuracy of 70%, the chance becomes only 1 in 5739 (0.017%) that the top prize will even be awarded even with a 1 million strong entrant pool.

tl;dr You’re not going to win, but you’re still going to play.

If you want to reproduce the numbers and plots in this post, check out this gist.

# Calculating AUC the hard way

The Area Under the Receiver Operator Curve is a commonly used metric of model performance in machine learning and many other binary classification/prediction problems. The idea is to generate a threshold independent measure of how well a model is able to distinguish between two possible outcomes. Threshold independent here just means that for any model which makes continuous predictions about binary outcomes, the conversion of the continuous predictions to binary requires making the choice of an arbitrary threshold above which will be a prediction of 1, below which will be 0.

AUC gets around this threshold problem by integrating across all possible thresholds. Typically, it is calculated by plotting the rate of false positives against false negatives across the range of possible thresholds (this is the Receiver Operator Curve) and then integrating (calculating the area under the curve). The result is typically something like this:

I’ve implemented this algorithm in an R script (https://gist.github.com/cjbayesian/6921118) which I use quite frequently. Whenever I am tasked with explaining the *meaning* of the AUC value however, I will usually just say that you want it to be 1 and that 0.5 is no better than random. This usually suffices, but if my interlocutor is of the particularly curious sort they will tend to want more. At which point I will offer the interpretation that the AUC gives you the probability that a randomly selected positive case (1) will be ranked higher in your predictions than a randomly selected negative case (0).

Which got me thinking – if this is true, why bother with all this false positive, false negative, ROC business in the first place? Why not just use Monte Carlo to estimate this probability directly?

So, of course, I did just that and by golly it works.

source("http://polaris.biol.mcgill.ca/AUC.R") bs<-function(p) { U<-runif(length(p),0,1) outcomes<-U<p return(outcomes) } # Simulate some binary outcomes # n <- 100 x <- runif(n,-3,3) p <- 1/(1+exp(-x)) y <- bs(p) # Using my overly verbose code at https://gist.github.com/cjbayesian/6921118 AUC(d=y,pred=p,res=500,plot=TRUE) ## The hard way (but with fewer lines of code) ## N <- 10000000 r_pos_p <- sample(p[y==1],N,replace=TRUE) r_neg_p <- sample(p[y==0],N,replace=TRUE) # Monte Carlo probability of a randomly drawn 1 having a higher score than # a randomly drawn 0 (AUC by definition): rAUC <- mean(r_pos_p > r_neg_p) print(rAUC)

By randomly sampling positive and negative cases to see how often the positives have larger predicted probability than the negatives, the AUC can be calculated without the ROC or thresholds or anything. Now, before you object that this is necessarily an approximation, I’ll stop you right there – it is. And it is more computationally expensive too. The real value for me in this method is for my understanding of the meaning of AUC. I hope that it has helped yours too!