DAWN: Infrastructure for Usable Machine Learning
Source: dawn.cs.stanford.edu/assets/dawn-overview.pdf

DAWN: Infrastructure for Usable Machine Learning Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia


Page 1:

DAWN: Infrastructure for Usable Machine Learning
Peter Bailis, Kunle Olukotun, Chris Ré, Matei Zaharia

Page 2:

It’s the Golden Age of Data*

Incredible advances in image recognition, natural language processing, planning, info retrieval

Society-scale impact: autonomous vehicles, personalized medicine, combating human trafficking

No end in sight for advances in ML

*for the best-funded, best-trained engineering teams

Page 3:

Building ML Products is Too Hard

Major successes (e.g., AlphaGo, ImageNet) require hundreds to thousands of engineers

Huge effort in data preparation, model tuning, experimentation, and productionizing

Domain experts cannot easily or cheaply build ML products

Page 4:

“Only a fraction of real-world ML systems is composed of ML code”

Page 5:

The DAWN Question

What if anyone with domain expertise could build their own production-quality ML products?
• Without a PhD in machine learning
• Without being an expert in systems
• Without understanding the latest hardware

It’s happened before

Page 6:

It’s happened before: Search

Before: decades of research on information retrieval, indexes, ranking, etc.

After: any developer can add search to an application by linking a library (e.g. Solr, Lucene); everyone (i.e., non-expert users) uses search

Page 7:

It’s happened before: SQL

Before: raw access to disk, manual layout of records, network databases (CODASYL)

After: SQL forms basis for transactional engines, data warehousing, business intelligence tools

Key idea: end-to-end systems that tackle the barriers to access & production use

Page 8:

The DAWN Stack

Pipeline stages: Data Acquisition, Feature Engineering, Model Training, Productionizing

Layers: Interfaces, Algorithms, Systems, Hardware

Projects: Snorkel; DeepDive; MacroBase (Streaming Data); NoScope (Video); AutoRec, SimDex (Recommendation); Data Fusion; Mulligan (SQL+graph+ML); ModelQA; ModelSnap

End-to-End Compilers: Weld, Delite

Hardware targets: CPU, GPU, FPGA, Cluster, Mobile

New Hardware: FuzzyBit, Plasticine CGRA

Page 9:

Example: MacroBase for Continuous Analytics

End-to-end system to prioritize user attention

[Diagram: multi-dimensional data streams → MacroBase → anomalies & explanations]
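The flow on this slide (data streams in, anomalies and explanations out) can be illustrated with a small sketch. This is not MacroBase's code or API: the robust z-score detector, the relative-risk explanation, the function names, the threshold, and the sample data are all assumptions chosen for illustration.

```python
from collections import Counter
from statistics import median

def mad_outliers(values, threshold=3.0):
    """Flag values whose robust z-score (median/MAD based) exceeds threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    # 1.4826 rescales the MAD so it is comparable to a standard deviation.
    return [abs(v - med) / (1.4826 * mad) > threshold for v in values]

def explain(rows, flags, attribute):
    """Rank attribute values by relative risk of being an outlier."""
    total, total_out = len(rows), sum(flags)
    all_counts = Counter(r[attribute] for r in rows)
    out_counts = Counter(r[attribute] for r, f in zip(rows, flags) if f)
    scores = {}
    for value, n in all_counts.items():
        a = out_counts.get(value, 0)
        risk_in = a / n                      # outlier rate with this value
        rest_total = total - n
        risk_out = (total_out - a) / rest_total if rest_total else 0.0
        if risk_out == 0:
            scores[value] = float("inf") if risk_in > 0 else 0.0
        else:
            scores[value] = risk_in / risk_out
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical readings: device B occasionally reports very high latency.
rows = (
    [{"device": "A", "latency_ms": v} for v in (10, 11, 9, 10, 12, 11)]
    + [{"device": "B", "latency_ms": v} for v in (10, 95, 98)]
)
flags = mad_outliers([r["latency_ms"] for r in rows])
ranking = explain(rows, flags, "device")
```

Here the two high readings are flagged, and ranking puts device B first, which is the kind of "anomaly plus explanation" output the diagram describes.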

Page 10:

Too much data for manual inspection

Even harder when data is streaming

Page 11:
Page 12:
Page 13:
Page 14:

MacroBase Summary

End-to-end system to prioritize user attention
• No ML expertise needed: MacroBase uses general models and tunes them automatically
• No separate step for production use
• Co-design from algorithms to HW

Open source: github.com/stanford-futuredata/macrobase

Early users: automotive, cloud, mobile apps, manufacturing

Page 15:

The DAWN Stack (same diagram as Page 8)

Page 16:

Weld: Rethinking the Interface to Data Analytics Libraries

Standard approach: users combine libraries using function calls that pass data via memory

Problem: for data-intensive apps, data movement cost dominates on modern hardware!

5-30x slowdowns in NumPy, Spark, TensorFlow, …

[Diagram: func1 and func2 exchanging data through memory]
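To make the data-movement problem concrete, here is a minimal pure-Python sketch. The two "library" functions are hypothetical stand-ins for func1 and func2: composing them writes a full intermediate list to memory and reads it back, while the fused version computes the same answer in one pass. It does not reproduce the 5-30x figures on the slide.

```python
def lib_square(xs):
    # "Library A": one full pass over the input; materializes a new list.
    return [x * x for x in xs]

def lib_sum(xs):
    # "Library B": a second full pass, this time over the intermediate list.
    total = 0.0
    for x in xs:
        total += x
    return total

def composed(xs):
    # Standard approach: the intermediate result travels through memory.
    return lib_sum(lib_square(xs))

def fused(xs):
    # Same computation in a single pass with no intermediate list.
    total = 0.0
    for x in xs:
        total += x * x
    return total

data = [float(i) for i in range(1000)]
composed_result = composed(data)
fused_result = fused(data)
```

Both calls return the same value; the difference is only in how much data is written and re-read between the two steps.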

Page 17:

Weld’s Approach

[Diagram: Diverse Analytics Tasks (SQL, machine learning, graph algorithms) → Weld IR / Common Runtime → Diverse Hardware Platforms (CPUs, GPUs, FPGAs)]

Open source: weld.stanford.edu
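One way to read the diagram is that libraries describe their work in a shared intermediate form and a common runtime executes the combined description. The toy below sketches that idea only; the Lazy class, its methods, and the two library functions are hypothetical and are not Weld's IR or API.

```python
class Lazy:
    """A deferred element-wise computation over a source list (toy IR)."""

    def __init__(self, source, ops=()):
        self.source = source
        self.ops = tuple(ops)

    def map(self, fn):
        # Record the operation instead of running it.
        return Lazy(self.source, self.ops + (fn,))

    def sum(self):
        # "Runtime": evaluate all recorded operations in a single pass.
        total = 0
        for x in self.source:
            for fn in self.ops:
                x = fn(x)
            total += x
        return total

def lib_a_scale(v, k):
    # Hypothetical library A: contributes a fragment, computes nothing yet.
    return v.map(lambda x: x * k)

def lib_b_shift(v, c):
    # Hypothetical library B: extends the same deferred computation.
    return v.map(lambda x: x + c)

pipeline = lib_b_shift(lib_a_scale(Lazy([1, 2, 3]), 2), 1)
result = pipeline.sum()  # (1*2+1) + (2*2+1) + (3*2+1)
```

Because neither library materializes its output, the runtime sees both steps together and can run them as one loop.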

Page 18:

Results: Existing Frameworks

[Figure: three runtime plots.
TPC-H: Runtime [secs] for workloads TPC-H Q1 and TPC-H Q6; series SparkSQL, Weld.
Vector Sum: Runtime [secs]; series NP, NExpr, Weld.
Logistic Regression: Runtime [secs; log10] for workloads LR (1T) and LR (12T), labeled 1 Core and 12 Cores; series TF, Hand-opt, Weld.]

Integration effort: 500 lines glue, 30 lines/operator

Page 19:

Results: Cross-Library Optimization

[Figure: two runtime plots.
Pandas + NumPy: Runtime (sec, log10); series Current; Weld, no CLO; Weld, CLO; Weld, 12 core; annotated speedups 290x and 31x.
Spark SQL UDF: Runtime (sec); series Scala UDF, Weld; annotated speedup 14x.]

CLO = cross-library optimization

Open source: weld.stanford.edu

Page 20:

The DAWN Stack (same diagram as Page 8)

Page 21:

NoScope: Fast CNN-Based Video Queries

Opportunity: CNNs allow more accurate queries on visual data than ever

Challenge: processing 1 video in real time requires a $1000 GPU

Result: same accuracy but 100-3000x faster through:
• Scene-specific distillation
• Temporal + spatial locality
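The two ideas in the bullets can be sketched as a small cascade: skip frames that barely differ from the last processed frame (temporal locality), and ask an expensive reference model only when a cheap, scene-specific model is unsure. This is an illustrative sketch, not NoScope's implementation; the function names, thresholds, and toy models are assumptions.

```python
def mean_abs_diff(a, b):
    """Average per-pixel absolute difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cascade(frames, cheap_model, reference_model,
            diff_threshold=2.0, low=0.2, high=0.7):
    labels, reference_calls = [], 0
    last_frame, last_label = None, None
    for frame in frames:
        # Temporal locality: reuse the label of the last processed frame
        # when the new frame is nearly identical to it.
        if last_frame is not None and mean_abs_diff(frame, last_frame) < diff_threshold:
            labels.append(last_label)
            continue
        score = cheap_model(frame)  # small, scene-specific model
        if score <= low:
            label = False
        elif score >= high:
            label = True
        else:
            # Only uncertain frames reach the expensive model.
            label = reference_model(frame)
            reference_calls += 1
        labels.append(label)
        last_frame, last_label = frame, label
    return labels, reference_calls

# Toy stand-ins: frames are flat lists of pixel brightness values.
def cheap(frame):
    return sum(frame) / len(frame) / 255

def reference(frame):
    return sum(frame) / len(frame) > 100

frames = [[0] * 4, [1] * 4, [200] * 4, [201] * 4, [120] * 4]
labels, reference_calls = cascade(frames, cheap, reference)
```

On this toy input the expensive model runs once for five frames; two frames are skipped as near-duplicates and two are settled by the cheap model.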

bit.ly/NoScopeArxiv

Page 22:

The DAWN Stack (same diagram as Page 8)

Page 23:

Training data is key enabler, barrier to entry

How can we leverage data that’s expensive to label at scale?

Page 24:

Snorkel’s Approach: Weak Supervision

github.com/HazyResearch/snorkel

1) User writes labeling functions (LFs): short programs that may not always give the right label
• E.g. regex to search in text

2) Snorkel simultaneously learns noise in LFs and a noise-aware target model (e.g. LSTM)
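A minimal sketch of step 1, written in plain Python rather than with Snorkel's API: each labeling function is a short regex-based program that votes or abstains. For step 2 the sketch substitutes a simple majority vote, which is much cruder than the learned noise model the slide describes. The function names, patterns, and example sentences are hypothetical.

```python
import re

ABSTAIN, NEGATIVE, POSITIVE = None, 0, 1

def lf_disease_suffix(text):
    # Words ending in common disease suffixes suggest a disease mention.
    return POSITIVE if re.search(r"\b\w+(itis|oma|osis)\b", text, re.I) else ABSTAIN

def lf_negation(text):
    # Negated findings suggest no disease is being reported.
    return NEGATIVE if re.search(r"\bno (evidence|sign)s? of\b", text, re.I) else ABSTAIN

def lf_diagnosed(text):
    return POSITIVE if re.search(r"\bdiagnosed with\b", text, re.I) else ABSTAIN

LFS = [lf_disease_suffix, lf_negation, lf_diagnosed]

def majority_vote(text, lfs=LFS):
    """Combine noisy LF votes; abstain when there are no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    positives = sum(votes)
    negatives = len(votes) - positives
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE

examples = [
    "Patient was diagnosed with hepatitis.",
    "No evidence of carcinoma.",
    "No signs of infection.",
    "The weather was fine.",
]
weak_labels = [majority_vote(t) for t in examples]
```

The second example shows why modeling LF noise matters: two functions disagree, and a plain vote can only abstain, whereas a model of each function's accuracy could still produce a label.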

4 hours of LF coding with bio experts matches months of hand-labeling

high-quality models from low-quality, scalable labeling functions

System                        NCBI Disease (F1)  CDR Disease (F1)  CDR Chem. (F1)
TaggerOne (Dogan, 2012)*      81.5               79.6              88.4
Snorkel: Logistic Regression  79.1               79.6              88.4
Snorkel: LSTM + Embeddings    79.2               80.4              88.2

Page 25:

DAWN: machine learning for everyone via novel techniques and interfaces that span hardware, systems, and algorithms

Find out more at dawn.cs.stanford.edu

Peter Bailis Chris Ré Kunle Olukotun Matei Zaharia