Machine learning the high interest credit card of technical debt [PWL]


Transcript of Machine learning the high interest credit card of technical debt [PWL]

Page 1: Machine learning the high interest credit card of technical debt [PWL]

Machine Learning: The High Interest Credit Card of Technical Debt

The Market Intelligence Company of the Digital World

Page 2: Machine learning the high interest credit card of technical debt [PWL]

$65M Funding

2007 Founded

6 Offices

300+ Employees

The Market Intelligence Company of the Digital World

Page 3: Machine learning the high interest credit card of technical debt [PWL]
Page 4: Machine learning the high interest credit card of technical debt [PWL]

Learned | Estimated

Page 5: Machine learning the high interest credit card of technical debt [PWL]

Machine learning: The high interest credit card of technical debt (2014)

Hidden technical debt in machine learning systems (2015)

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison

A Few Words About The Papers

Page 6: Machine learning the high interest credit card of technical debt [PWL]

Systems engineering papers

About Machine Learning systems

Give a lot of names to a lot of things (which we know is hard)

We found them in 2015 and liked them a lot

A Few Words About The Papers

Page 7: Machine learning the high interest credit card of technical debt [PWL]

What is ML and what is Technical Debt?

Sources of Technical Debt in ML systems

Mitigation

Today

Page 8: Machine learning the high interest credit card of technical debt [PWL]

Machine Learning

[Diagram: Train and Predict stages; labels: Data, Algorithm, Hyperparameters]

Page 9: Machine learning the high interest credit card of technical debt [PWL]

Why Machine Learning?

Allows us to convert data to software

We often already have data

Some problems are hard or impossible to solve otherwise

http://xkcd.com/1425/

Page 10: Machine learning the high interest credit card of technical debt [PWL]

A metaphor for the long term costs of moving quickly

Lack of testing, bad modularity, non-redundant systems, etc.

Somewhat similar to fiscal debt

There are good reasons to take it on, but it needs to be serviced

Hidden technical debt - a special, evil, variant

Technical Debt

Page 11: Machine learning the high interest credit card of technical debt [PWL]

Boundary Erosion

Page 12: Machine learning the high interest credit card of technical debt [PWL]

Components, interfaces, all that jazz

Think MVC, microservices

Implicitly assumed in “good” systems

Makes components easy to:
- Test
- Change
- Reason about
- Monitor

Boundaries in Systems Engineering

Page 13: Machine learning the high interest credit card of technical debt [PWL]

Entanglement

ML System “Inputs”

Learning settings

Hyperparams

Data prep settings

Real world inputs

?

Other systems' outputs

Issues

Change in distribution of any input influences all outputs

Adding/Removing a feature changes the model and output distribution

Any configuration parameter is just as coupled

Retraining not reproducible

Changing Anything Changes Everything (CACE)

Model parts
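
A minimal sketch of CACE, assuming a toy least-squares model with two strongly correlated features. The data and the helper names `fit_one` and `fit_two` are made up for illustration and are not from the papers. Removing one feature visibly changes the weight learned for the other.

```python
# Toy illustration of entanglement: the weight learned for x1 depends on
# whether the correlated feature x2 is present. All data is made up.

def fit_one(xs, ys):
    """Least-squares weight for y ~ w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_two(x1, x2, ys):
    """Least-squares weights for y ~ w1 * x1 + w2 * x2 via the normal equations."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    d = sum(v * v for v in x2)
    p = sum(u * y for u, y in zip(x1, ys))
    q = sum(v * y for v, y in zip(x2, ys))
    det = a * d - b * b
    return (d * p - b * q) / det, (a * q - b * p) / det

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.1, 1.9, 3.2, 3.8, 5.1]               # almost a copy of x1
ys = [2 * u + v for u, v in zip(x1, x2)]      # true weights: 2 and 1

w1_both, w2_both = fit_two(x1, x2, ys)
w1_alone = fit_one(x1, ys)                    # same data, x2 removed

print(f"x1 weight with x2 present:   {w1_both:.2f}")
print(f"x1 weight after removing x2: {w1_alone:.2f}")
```

With both features the model recovers weights close to 2 and 1. Once x2 is dropped, x1 absorbs most of its effect, so anything downstream that relied on the old weight sees a different model.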

Page 14: Machine learning the high interest credit card of technical debt [PWL]

Correction Cascades

[Diagram: models A, B and C, each producing an output]

We sometimes use the output of an existing model as a feature to get a small correction

Easier than training a new model

Easier than teaching an existing model new tricks

Page 15: Machine learning the high interest credit card of technical debt [PWL]

Correction Cascades

[Diagram: models A, B and C, each producing an output; labels: Improvement, Degradation]

Model improvements cause degradation down the line

Corrections might lead to an “improvement deadlock”
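
A small sketch of how a correction cascade can turn an upstream improvement into a downstream regression. Both models and the data are hypothetical: model B simply adds a learned constant to model A's output.

```python
# Toy correction cascade: model B adds a learned constant to model A's output.
# Both models and the data are hypothetical.

def mean_abs_error(predictions, targets):
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2 * x for x in xs]                      # ground truth

def model_a_old(x):
    return 2 * x - 1.0                        # systematically 1.0 too low

def model_a_new(x):
    return 2 * x                              # the "improved" model A

# B is fitted on top of the old A: it learns the average residual.
offset = sum(y - model_a_old(x) for x, y in zip(xs, ys)) / len(xs)

def model_b(a_output):
    return a_output + offset

err_b_before = mean_abs_error([model_b(model_a_old(x)) for x in xs], ys)
err_b_after = mean_abs_error([model_b(model_a_new(x)) for x in xs], ys)

print(f"B's error on top of old A: {err_b_before}")
print(f"B's error on top of new A: {err_b_after}")
```

In this toy setup, fixing A's bias makes B worse, because B's correction was fitted to the old bias. Shipping the better A now requires retraining B at the same time, which is the deadlock the slide describes.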

Page 16: Machine learning the high interest credit card of technical debt [PWL]

Outputs of ML systems include:
- Predictions
- Weights and other state

This data is easy for other systems to consume

In turn, that makes it hard to improve the model

May create hidden feedback loops

Undeclared Consumers
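
One possible guard against undeclared consumers is to make reads conditional on registration. The sketch below assumes a hypothetical in-memory `PredictionStore`; the class, its methods and the consumer names are illustrative only.

```python
class PredictionStore:
    """Hypothetical store that only serves consumers that registered first."""

    def __init__(self):
        self._predictions = {}
        self._consumers = set()

    def register_consumer(self, name):
        self._consumers.add(name)

    def write(self, key, value):
        self._predictions[key] = value

    def read(self, consumer, key):
        if consumer not in self._consumers:
            raise PermissionError(f"undeclared consumer: {consumer}")
        return self._predictions[key]

    def consumers(self):
        """The map of who depends on this model's output."""
        return sorted(self._consumers)


store = PredictionStore()
store.write("user-42", 0.87)
store.register_consumer("ranking-service")
score = store.read("ranking-service", "user-42")

try:
    store.read("adhoc-dashboard", "user-42")   # never registered
    blocked = False
except PermissionError:
    blocked = True

print(score, blocked, store.consumers())
```

The useful by-product is `consumers()`: before changing the model, the team can see who will be affected.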

Page 17: Machine learning the high interest credit card of technical debt [PWL]

Data Dependencies

Page 18: Machine learning the high interest credit card of technical debt [PWL]

Data Dependencies

[Diagram: a regular system of components, each with an Input and an Output]

Page 19: Machine learning the high interest credit card of technical debt [PWL]

Data Dependencies

[Diagram: a regular system of components, each with an Input and an Output; label: Data dependency]

Page 20: Machine learning the high interest credit card of technical debt [PWL]

Data Dependencies

[Diagram: a regular system (components with Input and Output) beside an ML system; ML component labels: Trainer, Predict, Input Logs, Weights, Input, Output; label: Data dependency]

Page 21: Machine learning the high interest credit card of technical debt [PWL]


Page 22: Machine learning the high interest credit card of technical debt [PWL]

Features for training can be outputs of other models

IDF tables, Word2Vec embeddings...

Logs, intermediate results, monitoring feeds...

But what if they change schema, stop being updated, or disappear?

Unstable Dependencies
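
A rough sketch of checking an upstream signal before training on it, covering the three failure modes above: schema change, staleness, and an unexpected new version. The snapshot layout, field names, version label and freshness budget are all assumptions made for this example.

```python
from datetime import datetime, timedelta

EXPECTED_FIELDS = {"term": str, "idf": float}    # assumed schema
MAX_AGE = timedelta(days=7)                       # assumed freshness budget
PINNED_VERSION = "2016-01-15"                     # assumed version label

def check_dependency(snapshot, now):
    """Return a list of problems found in an upstream snapshot."""
    problems = []
    if snapshot["version"] != PINNED_VERSION:
        problems.append("version changed")
    if now - snapshot["updated_at"] > MAX_AGE:
        problems.append("stale")
    for row in snapshot["rows"]:
        for field, expected_type in EXPECTED_FIELDS.items():
            if not isinstance(row.get(field), expected_type):
                problems.append(f"bad field: {field}")
                break                              # one report per row
    return problems

now = datetime(2016, 1, 18)
good = {
    "version": "2016-01-15",
    "updated_at": datetime(2016, 1, 16),
    "rows": [{"term": "pokemon", "idf": 3.2}],
}
bad = {
    "version": "2016-02-01",
    "updated_at": datetime(2016, 1, 1),
    "rows": [{"term": "pokemon", "score": 3.2}],   # "idf" was renamed upstream
}

good_problems = check_dependency(good, now)
bad_problems = check_dependency(bad, now)
print(good_problems)
print(bad_problems)
```

Pinning a version means an upstream change shows up as an explicit failure rather than a silent shift in model behaviour.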

Page 23: Machine learning the high interest credit card of technical debt [PWL]

Legacy features - Nobody maintains / wants to maintain them

Bundled features - Not sure which ones we need

Correlated features - May mask features with actual causality

Epsilon features - Improve the result by very little

Underutilized Dependencies
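
Epsilon features can be found with a leave-one-feature-out pass. In the sketch below, `evaluate` is a stand-in for "retrain and measure" and is faked with a lookup table; the feature names, contributions and threshold are invented for the example.

```python
def leave_one_feature_out(features, evaluate):
    """Map each feature to the metric drop caused by removing it."""
    baseline = evaluate(features)
    return {
        f: baseline - evaluate([g for g in features if g != f])
        for f in features
    }

def epsilon_features(drops, threshold):
    """Features whose removal costs less than `threshold`."""
    return sorted(f for f, drop in drops.items() if drop < threshold)

# Stand-in for retraining: each feature adds a fixed amount of accuracy.
# A real evaluate() would retrain the model on the given feature subset.
FAKE_CONTRIBUTION = {"clicks": 0.20, "country": 0.05, "legacy_id": 0.001}

def evaluate(features):
    return 0.5 + sum(FAKE_CONTRIBUTION[f] for f in features)

drops = leave_one_feature_out(list(FAKE_CONTRIBUTION), evaluate)
candidates = epsilon_features(drops, threshold=0.01)
print(candidates)
```

Features that land in `candidates` are the ones whose maintenance cost is worth weighing against their small gain.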

Page 24: Machine learning the high interest credit card of technical debt [PWL]

Software Issues

Page 25: Machine learning the high interest credit card of technical debt [PWL]

ML as Software

Actual machine learning is a lot more than modeling

[Diagram labels: Configuration, Data Collection, Feature Extraction, Data Verification, Process Management, Resource Management, Analysis Tools, Serving Infrastructure, Monitoring, Model, Glue code]

Page 26: Machine learning the high interest credit card of technical debt [PWL]

Software issues

Pipeline jungles

Dead experimental paths

Abstraction Debt

Multiple languages, systems, packages

Page 27: Machine learning the high interest credit card of technical debt [PWL]

Need to configure/test/deploy:
- Hyper-parameters
- Schema (including semantics)
- Data dependencies

Hard to understand or visualize what changed

Configuration Debt
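
A sketch of a configuration that can be validated and diffed, so that "what changed" has a concrete answer. The `TrainingConfig` fields and the validation rules are assumptions chosen for illustration.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TrainingConfig:
    learning_rate: float
    num_trees: int
    feature_table: str       # a data dependency, named explicitly
    label_column: str

    def validate(self):
        if not 0 < self.learning_rate <= 1:
            raise ValueError("learning_rate must be in (0, 1]")
        if self.num_trees < 1:
            raise ValueError("num_trees must be positive")

def diff_configs(old, new):
    """Return {field: (old_value, new_value)} for every field that changed."""
    old_d, new_d = asdict(old), asdict(new)
    return {k: (old_d[k], new_d[k]) for k in old_d if old_d[k] != new_d[k]}

v1 = TrainingConfig(0.1, 200, "features_v3", "clicked")
v2 = TrainingConfig(0.05, 200, "features_v4", "clicked")
v1.validate()
v2.validate()

changes = diff_configs(v1, v2)
for name, (before, after) in changes.items():
    print(f"{name}: {before} -> {after}")
```

Keeping hyper-parameters and data dependencies in one typed object lets a code review show both kinds of change in the same diff.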

Page 28: Machine learning the high interest credit card of technical debt [PWL]

Interactions

Page 29: Machine learning the high interest credit card of technical debt [PWL]

Experience has shown that the external world is rarely stable

- Word2Vec for “Pokemon”
- Population of Sudan
- Gregorian dates of holidays

Makes monitoring essential.

Makes testing very hard.

Changes in The External World
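
A minimal monitoring sketch: compare the live mean of a feature with its training baseline, measured in training standard deviations. The numbers and the alert threshold are made up, and a real monitor would likely track more than the mean.

```python
from statistics import mean, stdev

def drift_score(training_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    shift = abs(mean(live_values) - mean(training_values))
    return shift / stdev(training_values)

ALERT_THRESHOLD = 3.0                       # assumed; tune per feature

training = [10.0, 12.0, 11.0, 9.0, 13.0]
stable_week = [11.0, 10.0, 12.0]
shifted_week = [20.0, 22.0, 21.0]           # the world changed

for name, week in [("stable", stable_week), ("shifted", shifted_week)]:
    score = drift_score(training, week)
    status = "ALERT" if score > ALERT_THRESHOLD else "ok"
    print(f"{name}: drift={score:.2f} {status}")
```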

Page 30: Machine learning the high interest credit card of technical debt [PWL]

A model sometimes influences its future training data

This is common in:
- Recommendation systems
- Ad placement
- Systems that affect the physical world

Especially hard if change is gradual and model updates infrequently

Direct Feedback Loops
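
A deterministic toy simulation of a direct feedback loop, assuming a recommender that always shows the item with the most observed clicks. The items, click rates and starting counts are invented, and expected clicks are used instead of random draws to keep the run repeatable.

```python
# The model only collects training data for the item it chose to show,
# so an early lead is never challenged.

true_click_rate = {"a": 0.10, "b": 0.12}    # b is actually the better item
observed_clicks = {"a": 1.0, "b": 0.0}      # a got one lucky early click
impressions_per_round = 100

for _ in range(10):
    # "Model": show whichever item has the most observed clicks so far.
    shown = max(observed_clicks, key=observed_clicks.get)
    # Only the shown item can collect clicks (expected value, no randomness).
    observed_clicks[shown] += impressions_per_round * true_click_rate[shown]

print(observed_clicks)
```

Item b never gets a chance to prove itself, because the model's own choices decide which data it will see next. Showing a small share of traffic at random is one common way to break the loop.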

Page 31: Machine learning the high interest credit card of technical debt [PWL]

Often happen when two different systems learn from each other’s outputs

Classic example is algo-trading

But two independent content generation systems running on the same page also qualify

Undeclared consumers can be a cause

Hidden Feedback Loops

Page 32: Machine learning the high interest credit card of technical debt [PWL]

...But wait, There’s More!

Page 33: Machine learning the high interest credit card of technical debt [PWL]

Data Testing

Reproducibility

Process Management

Cultural Debt

More!

Page 34: Machine learning the high interest credit card of technical debt [PWL]

Mitigation

Page 35: Machine learning the high interest credit card of technical debt [PWL]

How easily can an entirely new algorithmic approach be tested at full scale?

What is the transitive closure of all data dependencies?

How precisely can the impact of a new change to the system be measured?

Be Aware of Debt
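
The transitive closure question can be answered mechanically once direct dependencies are written down. The sketch below assumes they are kept as a simple mapping; the signal names are hypothetical.

```python
def transitive_dependencies(graph, root):
    """All signals that `root` depends on, directly or indirectly."""
    seen = set()
    stack = list(graph.get(root, []))
    while stack:
        node = stack.pop()
        if node not in seen:                 # also guards against cycles
            seen.add(node)
            stack.extend(graph.get(node, []))
    return seen

# Direct dependencies only; names are made up for the example.
deps = {
    "ranking_model": ["click_features", "topic_model"],
    "topic_model": ["word_embeddings", "crawl_logs"],
    "click_features": ["click_logs"],
    "word_embeddings": ["crawl_logs"],
}

closure = transitive_dependencies(deps, "ranking_model")
print(sorted(closure))
```

The ranking model declares two dependencies but actually rests on five signals, which is the kind of gap this question is meant to expose.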

Page 36: Machine learning the high interest credit card of technical debt [PWL]

Does improving one model or signal degrade others?

How quickly can new members of the team be brought up to speed?

Be Aware of Debt

Page 37: Machine learning the high interest credit card of technical debt [PWL]

Merge mature models into a single, well defined, well tested system

Prune experimental code paths

Make each feature count

Monitor

Map consumers

Test data

Paying per model

Page 38: Machine learning the high interest credit card of technical debt [PWL]

Configuration system - versioned, comprehensive, testable

Data dependency system - versioned, comprehensive, testable

Consolidate mature systems

Reproducibility is awesome

Pay off cultural debt

Paying for Systems
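
A sketch of two small habits that help reproducibility: fingerprinting the full configuration of a run, and seeding a local random generator instead of relying on global state. The `train` function is a stand-in for a real training loop, and the config keys are assumptions.

```python
import hashlib
import json
import random

def fingerprint(config):
    """Stable short hash of a config dict, usable as a run identifier."""
    payload = json.dumps(config, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]

def train(config, data):
    """Stand-in for training: a seeded shuffle and an order-dependent update."""
    rng = random.Random(config["seed"])      # local RNG, no global state
    order = data[:]
    rng.shuffle(order)
    lr = config["learning_rate"]
    weight = 0.0
    for x in order:
        weight = (1 - lr) * weight + lr * x  # result depends on the order
    return weight

config = {"seed": 7, "learning_rate": 0.1}
data = [1.0, 5.0, 2.0, 8.0, 3.0]

run_1 = train(config, data)
run_2 = train(config, data)
print(fingerprint(config), run_1 == run_2)
```

Because the result depends on the shuffle order, the two runs match only thanks to the seed stored in the config. The fingerprint ties a stored model back to the exact settings that produced it.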

Page 39: Machine learning the high interest credit card of technical debt [PWL]

Other Questions?

Page 40: Machine learning the high interest credit card of technical debt [PWL]

We Are Hiring! similarweb.com/corp/jobs