Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Transcript of Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016

Page 1: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016

23 June 2016

AUC - at what cost(s)? Evaluating and comparing machine learning models
Alex Korbonits, Data Scientist

Page 2: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Introduction

• Remitly, Inc.

• Founded in 2011

• Largest independent digital remittance company in the US

• Sending over USD $1.4 billion overseas annually

• Series C

• CEO Matt Oppenheimer is EY’s 2016 “Entrepreneur of the Year” in the Services category

About Remitly

• Alex Korbonits

• Data Scientist at Remitly, Inc.

• Seattleite

• Co-organizer of Seattle DAML

• Previous work in Data Science at Nuiku, Inc.

About Me

Page 3: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Introduction

• Model selection: data and algorithms aren’t the only knobs

• Problems with typical model selection strategies

• Review of typical model evaluation metrics

• Augmenting these metrics to address practical problems

• Why this matters to Remitly

Agenda

Page 4: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016

You may think that, in order to solve all of your machine learning problems, you only need to have…

Page 5: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Page 6: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016
Page 7: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016

… but that is not the case. You need to think carefully about model evaluation and comparison metrics.

Page 8: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Why is model selection important?

• Big data is not enough:

• Not everyone has it. Or maybe the big data you have isn’t useful.

• Fancy algorithms are not enough:

• No Free Lunch Theorem (Wolpert, 1997): there isn't a "one-size-fits-all" model class, and deep learning is not a silver bullet.

• Inadequate coverage in the literature:

• This is a practical problem; it's hard, and it matters.

• Problems such as class imbalance and inclusion of economic constraints.

Model Selection

Page 9: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


ML + Economics

• Loss matrices inadequate:

• Penalty of misclassification may vary per instance.

• E.g., the size of a transaction: not all misclassifications carry the same penalty, even when they come from the same class (see the sketch below).

• Indifference curves good for post-training selection:

• We can compare tradeoffs of selecting different classification thresholds.

• EXTREMELY IMPORTANT when costs of false positives and false negatives are very, very different.

Economics: including costs/revenue into model selection
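As a rough illustration of why a single loss matrix falls short, here is a toy per-instance cost calculation, assuming a fraud-style classifier where a missed fraud (false negative) costs the transaction amount and a false positive costs a flat review fee. The names, numbers, and cost model are illustrative assumptions, not Remitly's.

```python
# Sketch: per-instance misclassification costs instead of a fixed loss matrix.
# Assumption: a false negative (missed fraud) costs the transaction amount,
# a false positive (blocked good customer) costs a flat review fee.
import numpy as np

def total_cost(y_true, y_score, amounts, threshold, review_cost=5.0):
    """Total cost of classifying at `threshold` when the FN penalty varies per instance."""
    y_pred = (y_score >= threshold).astype(int)
    fn = (y_true == 1) & (y_pred == 0)   # missed fraud: lose the transaction amount
    fp = (y_true == 0) & (y_pred == 1)   # blocked good customer: flat review cost
    return amounts[fn].sum() + review_cost * fp.sum()

# Toy data: three frauds among eight transactions, with very different amounts.
y_true  = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.1, 0.3, 0.8, 0.4, 0.35, 0.2, 0.6, 0.9])
amounts = np.array([50, 20, 500, 80, 2000, 10, 60, 40], dtype=float)

for t in (0.3, 0.5, 0.7):
    print(f"threshold={t}: cost={total_cost(y_true, y_score, amounts, t):.2f}")
```

Two thresholds can yield similar confusion-matrix counts yet very different total costs, because which instances get misclassified matters.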

Page 10: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Classic machine learning

• Test positive and test negative (prediction outcomes)

• Condition positive and condition negative (actual values)

• True positive: condition positive and test positive

• True negative: condition negative and test negative

• False positive (Type I error): condition negative and test positive

• False negative (Type II error): condition positive and test negative

Confusion matrix
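A minimal sketch of the four cells above with scikit-learn; the toy labels are illustrative.

```python
# Sketch: extracting TP/TN/FP/FN from scikit-learn's confusion matrix.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # condition positive / negative (actual values)
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # test positive / negative (predictions)

# With labels=[0, 1], rows are actual classes and columns are predictions:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}  TN={tn}  FP={fp} (Type I)  FN={fn} (Type II)")
```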

Page 11: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Radar in WWII

• The classic approach: measuring the area under the receiver operating characteristic (ROC) curve.

• Pros:

• Standard in the literature

• Descriptive of predictive power across thresholds

• Cons:

• Ignores class imbalances

• Ignores constraints such as costs of FP vs. FN

My curve is better than your curve
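A minimal sketch of comparing two models by ROC AUC with scikit-learn; the synthetic data and model choices are placeholders, not anything from the talk.

```python
# Sketch: ROC curves and AUC for two candidate models on the same test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for model in (LogisticRegression(max_iter=1000), RandomForestClassifier(random_state=0)):
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)        # the points you would plot
    print(type(model).__name__, "ROC AUC =", round(roc_auc_score(y_te, scores), 3))
```

A single AUC number summarizes the whole curve across thresholds, which is exactly why it can hide the imbalance and cost issues discussed next.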

Page 12: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Metrics affected by class imbalance

• X axis is recall == tpr == TP / (TP + FN)

• I.e., of the total positive instances, what proportion did our model classify as positive?

• Y axis is precision == TP / (TP + FP).

• I.e., of the positive classifications, what proportion were positive instances?

• Class imbalance affects this: WLOG, class imbalance shifts curves down (for smaller positive classes).

• There exists a one-to-one mapping from ROC space to PR space. But optimizing ROC AUC != optimizing PR AUC.

Precision and Recall curves
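A minimal sketch contrasting ROC AUC with the area under the precision-recall curve (average precision) on a heavily imbalanced synthetic set; the 2% positive rate, data, and model are illustrative assumptions.

```python
# Sketch: on imbalanced data, ROC AUC can look healthy while PR AUC is modest.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, _ = precision_recall_curve(y_te, scores)   # the points you would plot
print("ROC AUC:", round(roc_auc_score(y_te, scores), 3))
print("PR AUC (average precision):", round(average_precision_score(y_te, scores), 3))
```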

Page 13: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Inclusion of costs in ROC Space

• Indifference Curve:

• Level set that defines, e.g., where your classifier implies business profitability vs. loss.

• Defined via constraint optimization (e.g., costs of quadrants in your confusion matrix).

• Points above this curve satisfy the constraint and are good. Points below == bad.

• Why we care:

• The orange model in the slide's plot doesn't have a threshold that crosses your indifference curve, even though its AUC is larger: no threshold of that model can satisfy your constraint.

Cost curves in ROC Space
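One simple form of indifference curve is a straight break-even line in ROC space under an assumed linear profit model: each true positive earns benefit_tp and each false positive costs cost_fp. The sketch below uses that assumed model with placeholder numbers; the curves in the talk (and Remitly's actual economics) may be more complex.

```python
# Sketch: a break-even ("indifference") line in ROC space under an assumed
# linear profit model. Costs and benefits are placeholders, not real numbers.
import numpy as np
from sklearn.metrics import roc_curve

def thresholds_above_indifference(y_true, y_score, benefit_tp=100.0, cost_fp=10.0):
    """Return the thresholds whose ROC points lie above the break-even line."""
    y_true = np.asarray(y_true)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
    # Break-even when TPR * n_pos * benefit_tp == FPR * n_neg * cost_fp,
    # i.e. a line through the origin with this slope:
    slope = (n_neg * cost_fp) / (n_pos * benefit_tp)
    return thresholds[tpr > slope * fpr]
```

If this function returns an empty array for a model, no threshold of that model satisfies the constraint, no matter how large its AUC is.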

Page 14: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


How do I pick the right threshold?

• Threshold choices:

• Find point with maximum distance from indifference curve.

• Of your threshold choices, this point maximizes your utility.

• Technically you’re at a higher indifference curve

• Other things to consider:

• Changes in your constraints: costs change, therefore your indifference curve can change.

• Update models and thresholds subject to such changes.

Picking the right classifier threshold
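One way to operationalize "maximum distance from the indifference curve" is to score every candidate threshold by its expected profit under the same assumed linear profit model as above and take the argmax; this is a sketch of that idea, not the talk's exact procedure, and the costs are placeholders.

```python
# Sketch: pick the threshold that reaches the highest indifference curve,
# i.e. maximizes expected profit under an assumed linear profit model.
import numpy as np
from sklearn.metrics import roc_curve

def best_threshold(y_true, y_score, benefit_tp=100.0, cost_fp=10.0):
    y_true = np.asarray(y_true)
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    n_pos, n_neg = (y_true == 1).sum(), (y_true == 0).sum()
    profit = tpr * n_pos * benefit_tp - fpr * n_neg * cost_fp
    i = int(np.argmax(profit))
    return thresholds[i], profit[i]

# When costs change, rerun with the new benefit_tp / cost_fp: the same model
# can end up with a different best threshold, or no profitable threshold at all.
```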

Page 15: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Other topics

• F1 Score

• Harmonic mean of precision and recall

• Matthews Correlation Coefficient

• Balanced measure which can be used with imbalanced data.

• Proficiency

• The mutual information between ground truth and predictions, divided by the entropy of the ground truth, i.e., what proportion of the ground truth's uncertainty your model explains (see the sketch below).

• Cf. Sam Steingold's MLconf Seattle 2016 talk.

Other interesting metrics
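A minimal sketch computing all three; proficiency is taken as the mutual information between truth and prediction divided by the entropy of the truth, per the definition above. The toy labels are illustrative, and both quantities are computed in nats so the units cancel.

```python
# Sketch: F1, Matthews correlation coefficient, and proficiency.
import numpy as np
from scipy.stats import entropy
from sklearn.metrics import f1_score, matthews_corrcoef, mutual_info_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0, 0, 0])

print("F1 :", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))

h_truth = entropy(np.bincount(y_true))                      # H(truth), in nats
proficiency = mutual_info_score(y_true, y_pred) / h_truth   # I(truth; pred) / H(truth)
print("Proficiency:", proficiency)
```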

Page 16: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Citing our sources

Bibliography

Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (pp. 233-240). ACM.

Raghavan, V., Bollmann, P., & Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Transactions on Information Systems, 7, 205-229.

Provost, F., Fawcett, T., & Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the 15th International Conference on Machine Learning (pp. 445-453). Morgan Kaufmann.

Drummond, C., & Holte, R. (2000). Explicitly representing expected cost: An alternative to ROC representation. Proceedings of Knowledge Discovery and Data Mining (pp. 198-207).

Drummond, C., & Holte, R. C. (2004). What ROC curves can't do (and cost curves can). ROCAI (pp. 19-26).

Bradley, A. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30, 1145-1159.

Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861-874.

Metz, C. E. (1978). Basic principles of ROC analysis. Seminars in Nuclear Medicine, 8(4), 283-298.

Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE, 10(3), e0118432.

Steingold, S. (2016). Information Theoretic Metrics for Multi-class Predictor Evaluation. MLconf Seattle 2016. Accessed 23 June 2016, http://www.slideshare.net/SessionsEvents/sam-steingold-lead-data-scientist-magnetic-media-online-at-mlconf-sea-5201

Datacratic (2016). Machine Learning Meets Economics. Accessed 23 June 2016, http://blog.mldb.ai/blog/posts/2016/01/ml-meets-economics/

Page 17: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


What we talked about

• Model selection: data and algorithms aren’t the only knobs

• Problems with typical model selection strategies

• Review of typical model evaluation metrics

• Augmenting these metrics to address practical problems

• Why this matters to Remitly

Summary

Page 18: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


Remitly’s Data Science team uses ML for a variety of purposes.

ML applications are core to our business – therefore our business must be core to our ML applications.

Machine learning at Remitly

Page 19: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016


What we do at Remitly

• Fraud classification: Left swipe for fraudsters; Right swipe for lovers

• Anomaly detection: stopping money launderers in their tracks

• Customer behavior: learning why users love using Remitly so much and forecasting future use/growth

• Market behavior: forecasting supply and demand for our services

… and we’re just getting started

ML Applications

Page 20: Alex Korbonits, "AUC at what costs?" Seattle DAML June 2016

www.remitly.com/careers

We’re [email protected]