![Page 1: Three lessons learned from building a ... - Michael Manapatmlmanapat.com/talks/dataengconf2016.pdf · Three lessons learned from building a production machine learning system Michael](https://reader034.fdocuments.net/reader034/viewer/2022042801/5af889967f8b9a8d1c91a3c0/html5/thumbnails/1.jpg)
Three lessons learned from building a production machine learning system
Michael Manapat Stripe @mlmanapat
Fraud
• Card numbers are stolen by hacking, malware, etc.
• “Dumps” are sold in “carding” forums
• Fraudsters use numbers in dumps to buy goods, which they then resell
• Cardholders dispute transactions
• Merchant ends up bearing cost of fraud
• We train binary classifiers to predict fraud
• We use open source tools
  • Scalding/Summingbird for feature generation
  • scikit-learn for model training (eventually: github.com/stripe/brushfire)
1
Don’t treat models as black boxes
Early ML at Stripe
• Focused on training with more and more data and adding more and more features
• Didn’t think much about
  • ML algorithms (tuning hyperparameters, e.g.)
  • The deeper reasons behind any particular set of results
Substantial reduction in fraud rate
Product development
From a product standpoint:
• We were blocking high-risk charges and surfacing just the decision
• We wanted to provide Stripe users insight into our actions: reasons for scores
Score reasons
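X = 5, Y = 3: score = 0.1
Which feature is “driving” the score more?
Decision tree (each leaf shows its score, with the leaf's count in parentheses):
• X < 10
  • True: Y < 5
    • True: 0.1 (20)
    • False: 0.3 (30)
  • False: X < 15
    • True: 0.5 (10)
    • False: 0.9 (40)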
Score reasons
Hold out X (X = ?, Y = 3): (20/70) * 0.1 + (10/70) * 0.5 + (40/70) * 0.9 = 0.61
Score Δ = |holdout - original| = |0.61 - 0.1| = 0.51
Now producing richer reasons with multiple predicates
[Figure: the same decision tree with the X thresholds masked (X < ?, Y < 5, X < ?); leaves 0.1 (20), 0.3 (30), 0.5 (10), 0.9 (40)]
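The holdout calculation on this slide can be sketched in a few lines of Python. This is only an illustration on the toy tree from the slides, not Stripe's implementation: the `Leaf`/`Split` classes and the `score` and `score_deltas` helpers are hypothetical names, and the counts in parentheses are assumed to be the leaf weights, as the (20/70), (10/70), (40/70) terms suggest.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional, Tuple, Union


@dataclass
class Leaf:
    score: float
    count: int  # number in parentheses on the slide, used as a weight


@dataclass
class Split:
    feature: str
    threshold: float
    if_true: "Node"   # branch taken when feature < threshold
    if_false: "Node"  # branch taken otherwise


Node = Union[Leaf, Split]

# The toy tree from the slide.
TREE = Split("X", 10,
             Split("Y", 5, Leaf(0.1, 20), Leaf(0.3, 30)),
             Split("X", 15, Leaf(0.5, 10), Leaf(0.9, 40)))


def reachable_leaves(node: Node, features: Dict[str, float],
                     held_out: Optional[str]) -> Iterator[Tuple[float, int]]:
    """Yield (score, count) for every leaf reachable when `held_out` is unknown."""
    if isinstance(node, Leaf):
        yield node.score, node.count
    elif node.feature == held_out:
        # Unknown feature: follow both branches.
        yield from reachable_leaves(node.if_true, features, held_out)
        yield from reachable_leaves(node.if_false, features, held_out)
    else:
        taken = node.if_true if features[node.feature] < node.threshold else node.if_false
        yield from reachable_leaves(taken, features, held_out)


def score(features: Dict[str, float], held_out: Optional[str] = None) -> float:
    """Count-weighted average of the reachable leaf scores."""
    leaves = list(reachable_leaves(TREE, features, held_out))
    total = sum(count for _, count in leaves)
    return sum(s * count for s, count in leaves) / total


def score_deltas(features: Dict[str, float]) -> Dict[str, float]:
    """|holdout - original| for each feature, held out one at a time."""
    original = score(features)
    return {name: abs(score(features, held_out=name) - original) for name in features}
```

For X = 5, Y = 3, holding out X gives a score of about 0.61 (a delta of about 0.51), while holding out Y gives 0.22 (a delta of 0.12), so X is the stronger driver in this toy tree.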
Model introspection
If a model didn’t look good in validation, it wasn’t clear what to do (besides trying more features/data)
What if we used our “score reasons” to debug model issues?
• Take all false positives (in validation data or in production) and group by generated reason
• Were a substantial fraction of the false positives driven by a few features?
• Did all the comparisons in the explanation predicates make sense? (Were they comparisons a human might make for fraud?)
• Our models were overfit!
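A minimal sketch of the grouping step, assuming each scored record carries the model's decision, the eventual label, and the top score reason. The record fields (`blocked`, `is_fraud`, `reason`) and the function name are hypothetical and chosen only for illustration.

```python
from collections import Counter


def false_positive_reasons(records):
    """Group false positives by generated reason, most common first.

    Returns a list of (reason, count, share_of_false_positives).
    """
    counts = Counter(
        r["reason"] for r in records
        if r["blocked"] and not r["is_fraud"]  # blocked but legitimate
    )
    total = sum(counts.values())
    return [(reason, n, n / total) for reason, n in counts.most_common()]


# Hypothetical validation records.
records = [
    {"blocked": True, "is_fraud": False, "reason": "card_country"},
    {"blocked": True, "is_fraud": False, "reason": "card_country"},
    {"blocked": True, "is_fraud": False, "reason": "amount"},
    {"blocked": True, "is_fraud": True, "reason": "amount"},
    {"blocked": False, "is_fraud": False, "reason": "amount"},
]
summary = false_positive_reasons(records)
```

If a handful of reasons account for most of the false positives, those features and their split thresholds are the first place to look.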
Actioning insights
• Hyperparameter optimization
• Feature selection
[Figure: precision vs. recall plot]
Summary
• Don’t treat models as black boxes
• Thinking about the learning process (vs. just features and data) can yield significant payoffs
• Tooling for introspection can accelerate model development/“debugging”
Julia Evans, Alyssa Frazee, Erik Osheim, Sam Ritchie, Jocelyn Ross, Tom Switzer
2
Have a plan for counterfactual evaluation
• December 31st, 2013
• Train a binary classifier for disputes on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Based on validation data, pick a policy for actioning scores: block if score > 50
Questions (1)
• Business complains about high false positive rate: what would happen if we changed the policy to "block if score > 70"?
• What are the production precision and recall of the model?
• December 31st, 2014. We repeat the exercise from a year earlier
• Train a model on data from Jan 1st to Sep 30th
• Validate on data from Oct 1st to Oct 31st (need to wait ~60 days for labels)
• Validation results look ~ok (but not great)
• We put the model into production and the results are terrible
Questions (2)
• Why did the validation results for the new model look so much worse?
• How do we know if the retrained model really is better than the original model?
Counterfactual evaluation
• Our model changes reality (the world is different because of its existence)
• We can answer some questions (around model comparisons) with A/B tests
• For all these questions, we want an approximation of the charge/outcome distribution that would exist if there were no model
One approach
• Probabilistically reverse a small fraction of our block decisions
• The higher the score, the lower the probability that we let the charge through
• Weight samples by 1 / P(allow)
• Get information on the area we want to improve on
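The approach above can be sketched as follows. The slides do not give the actual propensity function, so `p_allow` here is a made-up linear decay used purely for illustration, as are the `threshold`, `base`, and `floor` parameters and the `decide` helper. The one essential detail is that P(allow) is logged with every decision, so that samples can later be weighted by 1 / P(allow).

```python
import random


def p_allow(score, threshold=50, max_score=100, base=0.3, floor=0.001):
    """Hypothetical propensity: probability of letting a charge through.

    Charges at or below the block threshold are always allowed. Above it,
    the probability shrinks as the score rises and never reaches zero.
    """
    if score <= threshold:
        return 1.0
    frac = (score - threshold) / (max_score - threshold)
    return max(floor, base * (1.0 - frac))


def decide(score, rng, threshold=50):
    """Return the original and selected actions plus the logged propensity."""
    original = "Block" if score > threshold else "Allow"
    p = p_allow(score, threshold=threshold)
    selected = "Allow" if rng.random() < p else "Block"
    return {"score": score, "p_allow": p,
            "original_action": original, "selected_action": selected}


rng = random.Random(0)  # seeded only to make the sketch reproducible
decisions = [decide(s, rng) for s in (10, 45, 55, 65, 100, 60)]
```

Keeping the floor above zero means every charge has some chance of being observed, which is what keeps the 1 / P(allow) weights finite.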
| ID | Score | P(Allow) | Original Action | Selected Action | Outcome |
|---|---|---|---|---|---|
| 1 | 10 | 1.0 | Allow | Allow | OK |
| 2 | 45 | 1.0 | Allow | Allow | Fraud |
| 3 | 55 | 0.30 | Block | Block | - |
| 4 | 65 | 0.20 | Block | Allow | Fraud |
| 5 | 100 | 0.0005 | Block | Block | - |
| 6 | 60 | 0.25 | Block | Allow | OK |
| ID | Score | P(Allow) | Weight | Original Action | Selected Action | Outcome |
|---|---|---|---|---|---|---|
| 1 | 10 | 1.0 | 1 | Allow | Allow | OK |
| 2 | 45 | 1.0 | 1 | Allow | Allow | Fraud |
| 4 | 65 | 0.20 | 5 | Block | Allow | Fraud |
| 6 | 60 | 0.25 | 4 | Block | Allow | OK |

Evaluating the "block if score > 50" policy
Precision = 5 / 9 = 0.56
Recall = 5 / 6 = 0.83
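The precision and recall figures above can be reproduced with a short weighted calculation. This is a sketch over the six example rows from the slides; the tuple layout and the function name `weighted_precision_recall` are assumptions made for illustration.

```python
# (id, score, p_allow, selected_action, outcome); outcome is None when blocked
ROWS = [
    (1, 10, 1.0, "Allow", "OK"),
    (2, 45, 1.0, "Allow", "Fraud"),
    (3, 55, 0.30, "Block", None),
    (4, 65, 0.20, "Allow", "Fraud"),
    (5, 100, 0.0005, "Block", None),
    (6, 60, 0.25, "Allow", "OK"),
]


def weighted_precision_recall(rows, threshold):
    """Estimate precision and recall of the policy "block if score > threshold".

    Only charges that were actually allowed have outcomes, so each one is
    weighted by 1 / P(allow) to stand in for the charges that were blocked.
    """
    blocked = fraud = blocked_fraud = 0.0
    for _id, score, p_allow, action, outcome in rows:
        if action != "Allow":
            continue  # no outcome observed for blocked charges
        weight = 1.0 / p_allow
        would_block = score > threshold
        is_fraud = outcome == "Fraud"
        if would_block:
            blocked += weight
        if is_fraud:
            fraud += weight
        if would_block and is_fraud:
            blocked_fraud += weight
    precision = blocked_fraud / blocked if blocked else None
    recall = blocked_fraud / fraud if fraud else None
    return precision, recall


precision, recall = weighted_precision_recall(ROWS, threshold=50)
```

With a threshold of 70, none of the allowed rows in this tiny table would be blocked, so precision cannot be estimated (the function returns `None`). That is the situation the exploration traffic is meant to avoid.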
• The propensity function controls the exploration/exploitation tradeoff
• Precision, recall, etc. are estimators
• Variance of the estimators decreases the more we allow through
• Bootstrap to get error bars (pick rows from the table uniformly at random with replacement)
• Li, Chen, Kleban, Gupta: "Counterfactual Estimation and Optimization of Click Metrics for Search Engines"
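A minimal sketch of the bootstrap step, assuming the four allowed rows from the example table and a simple percentile interval. The helper names (`precision`, `bootstrap_interval`), the resample count, and the 95% level are illustrative choices, not taken from the talk.

```python
import random

# Allowed rows from the example table: (score, p_allow, outcome)
ROWS = [(10, 1.0, "OK"), (45, 1.0, "Fraud"), (65, 0.20, "Fraud"), (60, 0.25, "OK")]


def precision(rows, threshold=50):
    """Weighted precision of "block if score > threshold"; None if undefined."""
    blocked = sum(1.0 / p for s, p, _ in rows if s > threshold)
    hits = sum(1.0 / p for s, p, o in rows if s > threshold and o == "Fraud")
    return hits / blocked if blocked else None


def bootstrap_interval(rows, estimator, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample rows uniformly with replacement."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        sample = rng.choices(rows, k=len(rows))
        value = estimator(sample)
        if value is not None:  # skip resamples where the estimate is undefined
            estimates.append(value)
    estimates.sort()
    lo = estimates[int((alpha / 2) * len(estimates))]
    hi = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lo, hi


lo, hi = bootstrap_interval(ROWS, precision)
```

With only four rows the interval is extremely wide. On real traffic the same procedure shows how the error bars tighten as more charges are allowed through.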
Summary
• Have a plan for counterfactual evaluation before you productionize your first model
• You can back yourself into a corner (with no data to retrain on) if you address this later
• You should be monitoring the production performance of your model anyway (cf. next lesson)
Alyssa Frazee, Julia Evans, Roban Kramer, Ryan Wang
3
Invest in production monitoring for your models
Production vs. data stack
• Ruby/Mongo vs. Scala/Hadoop/Thrift
• Some issues
  • Divergence between production and training definitions
  • Upstream changes to library code in production feature generation can change feature definitions
  • True vs. “True”
[Diagram: Domain-specific scoring service (business logic); “Pure” model evaluation service; Logged scoring requests; Aggregation jobs]
Aggregation jobs keep track of
• Overall action rate and rate per Stripe user
• Score distributions
• Feature distributions (% null, p50/p90 for numerical values, etc.)
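As a rough illustration of the feature-distribution aggregates, the sketch below computes % null and p50/p90 from logged scoring requests. It assumes each logged request is a plain dict of feature name to numeric value (or `None`), which is a simplification, and it uses a nearest-rank percentile. All function names are hypothetical.

```python
import math


def percentile(sorted_values, q):
    """Nearest-rank percentile of a non-empty sorted list, q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_feature(values):
    """% null plus p50/p90 of the non-null values for one feature."""
    present = sorted(v for v in values if v is not None)
    summary = {"pct_null": 100.0 * (len(values) - len(present)) / len(values)}
    if present:
        summary["p50"] = percentile(present, 50)
        summary["p90"] = percentile(present, 90)
    return summary


def summarize_requests(requests):
    """Aggregate every feature seen in the logged requests.

    A feature missing from a request counts as null.
    """
    names = sorted({name for request in requests for name in request})
    return {name: summarize_feature([request.get(name) for request in requests])
            for name in names}


# Hypothetical logged requests.
requests = [{"amount": 10}, {"amount": 20}, {"amount": None}, {"amount": 40}]
aggregates = summarize_requests(requests)
```

Tracking these aggregates per model over time makes it easier to spot a shift in null rates or percentiles after a deploy.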
[Diagram: Domain-specific scoring service (business logic); “Pure” model evaluation service; Logged scoring requests; Aggregation jobs (get all aggregates per model)]
Summary
• Monitor the production inputs to and outputs of your models
• Have dashboards that can be watched on deploys and alerting for significant anomalies
• Bake the monitoring into generic ML infrastructure (so that each ML application isn’t redoing this)
Steve Mardenfeld, Tom Switzer
• Don’t treat models as black boxes
• Have a plan for counterfactual evaluation before productionizing your first model
• Build production monitoring for action rates, score distributions, and feature distributions (and bake into ML infra)
Thanks
Stripe is hiring data scientists, engineers, and engineering managers!
[email protected] | @mlmanapat