Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Black Boxes and Unicorns | Jeremy Achin | Data Scientist & CEO | DataRobot

Transcript of Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 1: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Black Boxes and Unicorns

Jeremy Achin | Data Scientist & CEO | DataRobot

Page 2: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Jeremy Achin?

Page 3: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]


DataRobot Company History

Timeline: 2012 2H through 2015 2H, in half-year steps

June '12: Founded

June '13: Seed Funding, $3.3M

July '14: Series A, $21M

2015 2H: Bigger & Better Announcements Coming Soon!

Page 4: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]
Page 5: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

DataRobot: better predictive models faster

Page 6: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]
Page 7: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 8: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

Leo Breiman (classification & regression trees, random forest, and my personal hero)

2001: Statistical Modeling: The Two Cultures

● An attack on statisticians who rely solely on regression models

● Argued we should be using the techniques that obtain the best results

● Even a carefully built regression model is just one of many possible representations of the underlying reality

Page 9: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

“If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data [regression] models and adopt a more diverse set of tools.”

https://projecteuclid.org/download/pdf_1/euclid.ss/1009213726

Page 10: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

14 Years Later

Excellent progress in recent years but...

● still armies of people taking months to manually build regression models (especially in larger companies)

● non-regression methods still thought of as “black box”

Page 11: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 12: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Black Box (n) /blak bäks/

A phrase people use when they’re scared of technology they don’t understand and want to keep doing the same thing they’ve been doing for the last twenty years.

Page 13: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]
Page 14: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 15: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

What do we really need to know about a predictive model?

1. Overall Performance on Out-of-Sample (Validation) Data

2. Predicted vs Actual by Variable

3. How a model’s predictions change as values of input variables change

None of these depend on the specific algorithm you are using. Even #3!

Page 16: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Overall Out-of-Sample Performance

Mean Absolute Error
Weighted Mean Absolute Error
Root Mean Squared Error
Root Mean Squared Logarithmic Error
Mean F Score
Mean Consequential Error
Mean Average Precision
Multi-class Log Loss
Hamming Loss
Mean Utility
Continuous Ranked Probability Score
AUC
Gini
Average Among Top P
Average Precision (column-wise)
Mean Average Precision (row-wise)
Normalized Discounted Cumulative Gain@k
Mean Average Precision@n
Levenshtein Distance
Average Precision
Absolute Error
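Three of the metrics named on this slide can be sketched in a few lines. This is a minimal illustration, not DataRobot's implementation: the function names and the tiny validation set are made up for the example. Each function takes only actual and predicted values, so none of them needs to know which algorithm produced the predictions.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def log_loss(actual, predicted, eps=1e-15):
    """Binary log loss: actual values are 0 or 1, predicted are probabilities."""
    total = 0.0
    for a, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += a * math.log(p) + (1 - a) * math.log(1 - p)
    return -total / len(actual)


# Hypothetical out-of-sample labels and model outputs (illustrative only).
y_valid = [0, 1, 1, 0]
p_valid = [0.1, 0.8, 0.6, 0.3]

mae = mean_absolute_error(y_valid, p_valid)
rmse = root_mean_squared_error(y_valid, p_valid)
ll = log_loss(y_valid, p_valid)
```

Log loss is the metric reported for the three regression models later in the deck; lower is better.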

Page 17: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Hospital Readmission Model Assessment and Interpretation

[Chart: x-axis: Number of Prior Visits to Hospital; y-axis: Hospital Readmission Rate]

Page 18: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Hospital Readmission Model Assessment and Interpretation

[Chart: x-axis: Number of Prior Visits to Hospital; y-axis: Hospital Readmission Rate; series: Actual Hospital Readmission Rate]

Page 19: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Hospital Readmission Model Assessment and Interpretation

[Chart: x-axis: Number of Prior Visits to Hospital; y-axis: Hospital Readmission Rate; series: Predicted Hospital Readmission Rate]

Page 20: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Hospital Readmission Model Assessment and Interpretation

[Chart: x-axis: Number of Prior Visits to Hospital; y-axis: Hospital Readmission Rate]
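A predicted-vs-actual-by-variable view like the one charted here can be computed by grouping validation rows on one input variable and averaging both the outcome and the model's prediction within each group. The sketch below assumes made-up data and a hypothetical helper name; it is one plausible way to build such a table, not the method used to produce the slide.

```python
from collections import defaultdict


def predicted_vs_actual_by_variable(values, actual, predicted):
    """Group rows by one input variable.

    Returns {value: (mean_actual, mean_predicted, row_count)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for v, a, p in zip(values, actual, predicted):
        entry = sums[v]
        entry[0] += a
        entry[1] += p
        entry[2] += 1
    return {v: (s[0] / s[2], s[1] / s[2], s[2]) for v, s in sorted(sums.items())}


# Made-up validation rows (illustrative only).
prior_visits = [0, 0, 1, 1, 2, 2]
readmitted = [0, 0, 0, 1, 1, 1]
model_output = [0.1, 0.2, 0.4, 0.5, 0.7, 0.9]

table = predicted_vs_actual_by_variable(prior_visits, readmitted, model_output)
for visits, (actual_rate, predicted_rate, n) in table.items():
    print(visits, round(actual_rate, 2), round(predicted_rate, 2), n)
```

Plotting the two averaged columns against the grouping variable gives the actual and predicted curves; large gaps between them show where the model fits poorly.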

Page 21: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Hospital Readmission Model Assessment and Interpretation

[Chart: x-axis: Number of Prior Visits to Hospital; y-axis: Hospital Readmission Rate; overlay: Partial Dependence]

Page 22: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Partial Dependence

10.13.2 Partial Dependence Plots (p. 369)

https://web.stanford.edu/~hastie/local.ftp/Springer/OLD/ESLII_print4.pdf
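Partial dependence can be sketched without reference to any particular algorithm: for each candidate value of one input, set that input to the value in every row, score all rows, and average. The code below is a minimal sketch under stated assumptions: `toy_model` is a hypothetical stand-in for any fitted model's predict function, and the rows are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """For each grid value, force `feature` to that value in every row
    and average the model's predictions over all rows."""
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = dict(row)  # copy so the original data is untouched
            modified[feature] = value
            total += predict(modified)
        curve.append((value, total / len(rows)))
    return curve


def toy_model(row):
    """Hypothetical stand-in for a fitted model; any callable would work."""
    return min(1.0, 0.1 + 0.05 * row["prior_visits"] ** 2 + 0.002 * row["age"])


# Invented rows (illustrative only).
rows = [
    {"prior_visits": 0, "age": 30},
    {"prior_visits": 3, "age": 50},
    {"prior_visits": 1, "age": 70},
]

curve = partial_dependence(toy_model, rows, "prior_visits", [0, 1, 2, 3])
```

Because the function only calls `predict`, the same code applies to a regression model, a tree ensemble, or anything else that returns a prediction per row.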

Page 23: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 24: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Compliance (n) /kəmˈplīəns/

A word people use as a last resort to defend the status quo after they realize that their 100-variable regression model is an arbitrary representation of reality that is less accurate, robust, and interpretable than modern alternatives.

Page 25: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Arbitrary Representations of Reality

Three statisticians sitting at a bar...

One more round?

ftp://ftp.nhtsa.dot.gov/GES/GES12/

● 153,077 Police-reported accidents

● 58 Variables

Goal: Try to Predict Probability of a Fatality

Page 26: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 27: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 28: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Arbitrary Representations of Reality

Statistician #1

Model Performance (Log Loss): 0.469

Variable Name: Regression Coefficient
Restraint Misuse: 0.509
Roll Over: 0.355
Alcohol Involved: 0.089
Is Driver: -0.694

"Looks like as long as we use seat belts and don't roll over, we’ll survive. Having alcohol in the system doesn’t make much of a difference. Also, being the driver is safe, so I'm driving home."

Statistician #2

Model Performance (Log Loss): 0.467

Variable Name: Regression Coefficient
Alcohol Involved: 1.866
Age: 0.008
Restraint Misuse: 0.000
Hour of Accident: -0.019

"Hmmm... looks like drinking and driving leads to fatal crashes. Probably shouldn't have another round. Also, the later the better, so let's just wait here until midnight."

Statistician #3

Model Performance (Log Loss): 0.422

Variable Name: Regression Coefficient
Opening Door In Motion: 0.449
Is Police Officer: -0.412
Booster Seat Used: -0.787
Lap And Shoulder Belt: -1.897

"No, no, no, we just need to wear lap and shoulder belts with our booster seats, and be police officers. Look at those coefficients! Furthermore, my model is better, so I'm right."
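The point that a coefficient depends on which other variables are in the model can be reproduced with a small sketch. The assumptions are loud ones: the data is synthetic, the variable names `x1` and `x2` are hypothetical, and ordinary least squares is swapped in for brevity in place of the logistic-style regression the slides appear to use. The outcome is built from `x2` alone, yet `x1` gets a sizeable coefficient whenever `x2` is left out.

```python
def center(xs):
    """Subtract the mean, which absorbs the intercept term."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_one(x, y):
    """Slope of the least-squares line y ~ x (with intercept)."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) / dot(xc, xc)


def fit_two(x1, x2, y):
    """Slopes of the least-squares plane y ~ x1 + x2 (with intercept),
    solved from the 2x2 normal equations on centered data."""
    a, b, yc = center(x1), center(x2), center(y)
    s11, s12, s22 = dot(a, a), dot(a, b), dot(b, b)
    s1y, s2y = dot(a, yc), dot(b, yc)
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Synthetic data: x1 and x2 are correlated, and y depends on x2 only.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [0, 0, 1, 1, 2, 3]
y = [3 * v + 1 for v in x2]

coef_x1_alone = fit_one(x1, y)           # x1 looks important on its own
coef_x1_joint, coef_x2_joint = fit_two(x1, x2, y)  # x1 vanishes once x2 is included
```

Both fits describe the same data, but a reader looking only at the first model's coefficient would draw the wrong conclusion about `x1`, which is the trap the three statisticians fall into.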

Page 29: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

The Killer Potato

Page 30: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 31: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]
Page 32: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Obligatory Data Scientist Definition Slide

[Venn diagram: Hacking Skills, Maths & Stats, and Domain Knowledge, overlapping at Data Science]

Hacking Skills

● Programming
○ Get Data
○ Manipulate Data
○ Explore Data
○ Build Models
○ Implement Models

Maths & Stats

● Foundational Statistics
● Internals of Algorithms
● Practical Knowledge and Experience

Domain Knowledge

● Understand the Business Problem
● Understanding of the Data

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Page 33: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

The current path to becoming a Data Scientist

Page 34: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

A Better Way

AUTOMATED USING MODERN TOOLS AND COMPUTATIONAL POWER

Page 35: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Takeaways

● There are technique-agnostic ways to assess and interpret predictive models.

● The shortage of Data Scientists will be solved by a combination of pragmatic education and levels of automation currently not thought possible.

Page 36: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Three quick tips for entrepreneurs

Page 37: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Page 38: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Watch out for Lean Startup & MVP Zealots

Minimum viable product (MVP) is the product with the highest return on investment versus risk.

Minimum viable product (MVP): get the smallest functional product into the market ASAP to de-risk the investment.

Page 39: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Be Paranoid and Don’t Rely on Hope.

Page 40: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

Choose the Right Investors & Advisors

Chris Lynch

Harry Weller

Jason Seats

Jit Saxena

Kevin Dick

Ray Tacoma

Brad Gillespie

Page 41: Black Boxes and Unicorns // Jeremy Achin, DataRobot [FirstMark's Data Driven]

© DataRobot, Inc. All rights reserved. Confidential
