mlbase stanford sparkrezab/sparkclass/slides/ameet...interactive environment for numerical...

Towards an Optimizer for MLbase. Ameet Talwalkar, UC Berkeley. Collaborators: Evan Sparks, Michael Franklin, Michael I. Jordan, Tim Kraska.

Transcript of mlbase stanford sparkrezab/sparkclass/slides/ameet...interactive environment for numerical...

Page 1:

Towards an Optimizer for MLbase

Ameet Talwalkar

UC Berkeley

Collaborators: Evan Sparks, Michael Franklin, Michael I. Jordan, Tim Kraska

Page 2:

Problem: Scalable implementations difficult for ML Developers…

[Figure: MLbase architecture diagram. Labels: ML Developer, Meta-Data, Statistics, User, Declarative ML Task, ML Contract + Code, Master Server, result (e.g., fn-model & summary), Optimizer, Parser, Executor/Monitoring, ML Library, DMX Runtime (×4), LLP, PLP, Master, Slaves]

Page 3:

Problem: Scalable implementations difficult for ML Developers…

[Figure: the MLbase architecture diagram shown with a screenshot of the MATLAB product page, "The Language of Technical Computing": "MATLAB® is a high-level language and interactive environment for numerical computation, visualization, and programming."]

Page 5:

Problem: Scalable implementations difficult for ML Developers…

[Figure: MLbase architecture diagram. Labels: ML Developer, Meta-Data, Statistics, User, Declarative ML Task, ML Contract + Code, Master Server, result (e.g., fn-model & summary), Optimizer, Parser, Executor/Monitoring, ML Library, DMX Runtime (×4), LLP, PLP, Master, Slaves]

CHALLENGE: Can we simplify distributed ML development?

Page 12:

Problem: ML is difficult for End Users…

Too many ways to preprocess…

Too many algorithms…

Too many knobs…

Difficult to debug…

Doesn’t scale…

CHALLENGE: Can we automate ML pipeline construction?

Page 18:

MLbase

MLbase aims to simplify development and deployment of ML pipelines

[Figure: MLbase stack: MLOpt, MLI, and MLlib on top of Apache Spark]

Spark: Cluster computing system designed for iterative computation

MLlib: Spark’s core ML library

MLI: API to simplify ML development

MLOpt: Declarative layer that aims to automate ML pipeline construction via search over feature extractors and models

MLOpt and MLI are experimental testbeds

Page 19:

Vision

MLlib and MLI

MLOpt

Page 20:

MLlib

+ Scalable and fast

+ Simple development environment

+ Part of Spark’s robust ecosystem

[Figure: Spark stack: Spark SQL, Spark Streaming, MLlib (machine learning), and GraphX (graph) on top of Apache Spark]

Page 23:

Active Development

Initial Release

• Developed by MLbase team in AMPLab (11 contributors)

• Scala, Java

• Shipped with Spark v0.8 (Sep 2013)

11 months later…

• 55+ contributors from various organizations

• Scala, Java, Python

• Improved documentation / code examples, API stability

• Latest release part of Spark v1.0 (May 2014)

Page 24:

Algorithms in v0.8

• classification: logistic regression, linear support vector machines (SVM)

• regression: linear regression

• collaborative filtering: alternating least squares (ALS)

• clustering: k-means

• optimization: stochastic gradient descent (SGD)

Page 25:

Algorithms in v1.0

• classification: logistic regression, linear support vector machines (SVM), naive Bayes, decision trees

• regression: linear regression, regression trees

• collaborative filtering: alternating least squares (ALS)

• clustering: k-means

• optimization: stochastic gradient descent (SGD), limited-memory BFGS (L-BFGS)

• dimensionality reduction: singular value decomposition (SVD), principal component analysis (PCA)

Page 31:

MLlib, MLI and Roadmap

• MLI: Shield ML Developers from low-level details

  • Provide familiar mathematical operators in distributed setting (tables, matrices, optimization primitives)

  • Standard APIs defining ML algorithms and feature extractors

• Many of these ideas are (or soon will be) in MLlib

• Next release of Spark and MLlib being tested now

  • statistical toolbox, Python decision tree API, online logistic regression, …

• Longer term

  • Scalable implementations of standard ML algorithms and underlying optimization primitives

  • Support for ML pipeline development (related to MLOpt)

Feedback and Contributions Encouraged!

Page 32:

Vision

MLlib and MLI

MLOpt

Page 33:

Grand Vision

✦ User declaratively specifies a task

✦ Search through MLlib/MLI to find the best model/pipeline

[Figure: SQL is to Result as ‘MQL’ is to Model]

Page 40:

A Standard ML Pipeline

Data → Feature Extraction → Model Training → Final Model

✦ In practice, model building is an iterative process of continuous refinement

✦ Our grand vision is to automate the construction of these pipelines

Page 45:

Training A Model

✦ For each point in dataset

  ✦ compute gradient

  ✦ update model

  ✦ repeat until converged

✦ Requires multiple passes

✦ Common access pattern

  ✦ Naive Bayes, Trees, etc.

✦ Minutes to train an SVM on 200GB of data on a 16-node cluster
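
The training loop on this slide can be sketched in plain Python. This is a minimal, single-machine illustration rather than MLlib code: it assumes logistic regression with per-point (stochastic) gradient updates, and the names `sgd_logistic` and `predict`, the step size, and the convergence test are all choices made for the example.

```python
import math
import random


def sgd_logistic(points, labels, learning_rate=0.1, max_passes=100, tol=1e-4, seed=0):
    """Fit logistic-regression weights with one gradient update per data point."""
    rng = random.Random(seed)
    dim = len(points[0])
    weights = [0.0] * dim
    order = list(range(len(points)))
    for _ in range(max_passes):              # each pass is one full scan of the data
        rng.shuffle(order)
        largest_step = 0.0
        for i in order:
            x, y = points[i], labels[i]
            score = sum(w * xi for w, xi in zip(weights, x))
            score = max(-30.0, min(30.0, score))      # keep exp() in a safe range
            prediction = 1.0 / (1.0 + math.exp(-score))
            # Gradient of the log loss for one point is (prediction - y) * x.
            for j in range(dim):
                step = learning_rate * (prediction - y) * x[j]
                weights[j] -= step
                largest_step = max(largest_step, abs(step))
        if largest_step < tol:               # crude "repeat until converged" check
            break
    return weights


def predict(weights, x):
    """Return class 1 when the linear score is non-negative, else class 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else 0
```

The outer loop is what makes the workload iterative: every pass rereads the whole dataset, which is the access pattern the slide refers to.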

Page 48:

The Tricky Part

✦ Algorithms

  ✦ Logistic Regression, SVM, Tree-based, etc.

✦ Algorithm hyper-parameters

  ✦ Learning Rate, Regularization, etc.

✦ Featurization

  ✦ Text: n-grams, TF-IDF

  ✦ Images: Gabor filters, random convolutions

  ✦ Random projection? Scaling?

[Figure: search space with axes Algorithms, Hyper Parameters, Featurization]
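
As one concrete instance of the text featurization choices listed here, the sketch below builds n-grams and TF-IDF weights in plain Python. It assumes whitespace tokenization and the common variant tf = count / document length, idf = log(N / document frequency); other variants exist, and the function names are made up for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents, n=1):
    """Return one {term: weight} dict per document using n-gram terms."""
    term_lists = [ngrams(doc.lower().split(), n) for doc in documents]
    doc_freq = Counter()
    for terms in term_lists:
        doc_freq.update(set(terms))          # count each term once per document
    num_docs = len(documents)
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        vectors.append({
            term: (count / total) * math.log(num_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Even in this small case there are knobs (the value of `n`, the weighting variant, the tokenizer), which is why featurization adds another axis to the search.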

Page 51:

A Standard ML Pipeline

Data → Feature Extraction → Model Training → Final Model

✦ In practice, model building is an iterative process of continuous refinement

✦ Our grand vision is to automate the construction of these pipelines

✦ Start with one aspect of the pipeline - model selection

Automated Model Selection

Page 63:

One Approach

✦ Try it all!

  ✦ Search over all hyperparameters, algorithms, features, etc.

✦ Drawbacks

  ✦ Expensive to compute models

  ✦ Hyperparameter space is large

✦ Some version of this still often done in practice!

[Figure: grid over Learning Rate and Regularization with the best answer marked]
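
The "try it all" strategy amounts to an exhaustive grid search. The sketch below is a generic illustration, not MLOpt's implementation; `toy_validation_error` is a hypothetical stand-in for training a model and measuring held-out error, and the grid values are arbitrary.

```python
import itertools


def grid_search(train_and_score, grid):
    """Evaluate every combination in `grid`; return (best_config, best_error, num_models)."""
    names = sorted(grid)
    best_config, best_error, num_models = None, float("inf"), 0
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        error = train_and_score(config)      # in practice: one full training run
        num_models += 1
        if error < best_error:
            best_config, best_error = config, error
    return best_config, best_error, num_models


def toy_validation_error(config):
    # Hypothetical error surface with its minimum at (0.1, 0.01).
    return (config["learning_rate"] - 0.1) ** 2 + (config["regularization"] - 0.01) ** 2


grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "regularization": [0.0, 0.01, 0.1],
}
best_config, best_error, num_models = grid_search(toy_validation_error, grid)
```

The number of models trained is the product of the grid sizes (here 4 × 3 = 12), so each extra hyperparameter, algorithm, or featurization choice multiplies the cost. That is the drawback the slide names.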

Page 67:

A Better Approach

✦ Better resource utilization

  ✦ through batching

✦ Algorithmic Speedups

  ✦ via early stopping

✦ Improved Search

  ✦ e.g., via randomization

[Figure: points over Learning Rate and Regularization with the best answer marked]
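
Two of these ideas, randomized search and early stopping, can be combined in a few lines. The sketch below is illustrative only: the slides do not specify the stopping rule, so the "abandon a configuration whose probe error exceeds `slack` times the best final error so far" heuristic, the sampling ranges, and the `toy_error_after` learning curve are all assumptions.

```python
import math
import random


def toy_error_after(config, iterations):
    # Hypothetical learning curve: error decays toward a floor set by the config.
    floor = (math.log10(config["learning_rate"]) + 1) ** 2 \
        + (math.log10(config["regularization"]) + 2) ** 2
    return floor + 1.0 / (1 + iterations)


def random_search(error_after, num_configs, full_iters=100, probe_iters=10,
                  slack=2.0, seed=0):
    """Sample configurations at random and stop unpromising ones after a short probe."""
    rng = random.Random(seed)
    best_config, best_error = None, float("inf")
    iterations_spent = 0
    for _ in range(num_configs):
        config = {
            "learning_rate": 10 ** rng.uniform(-3, 0),     # log-uniform sample
            "regularization": 10 ** rng.uniform(-4, 0),
        }
        probe_error = error_after(config, probe_iters)
        iterations_spent += probe_iters
        if probe_error > slack * best_error:
            continue                                       # early stop: skip the rest
        final_error = error_after(config, full_iters)
        iterations_spent += full_iters - probe_iters
        if final_error < best_error:
            best_config, best_error = config, final_error
    return best_config, best_error, iterations_spent
```

The heuristic can discard a configuration that would eventually have won. In exchange, poor configurations cost only `probe_iters` iterations each.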

Page 68:

A Tale Of 3 Optimizations

Better Resource Utilization

Algorithmic Speedups

Improved Search

Page 71:

Better Resource Utilization

✦ Modern memory slower than processors

✦ Can read: 0.6b doubles/sec/core (4.8 GB/s)

✦ Can compute: 15b flops/sec/core

✦ We can do 25 flops/double read
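
The arithmetic behind these figures is short enough to check directly. The only number added here is the 8-byte size of a double; the rest comes from the slide.

```python
BYTES_PER_DOUBLE = 8

doubles_read_per_sec = 0.6e9     # per core, from the slide
flops_per_sec = 15e9             # per core, from the slide

# 0.6 billion doubles/sec * 8 bytes = 4.8 GB/s of memory bandwidth.
bandwidth_gb_per_sec = doubles_read_per_sec * BYTES_PER_DOUBLE / 1e9

# Compute budget available for every double fetched from memory.
flops_per_double_read = flops_per_sec / doubles_read_per_sec

print(bandwidth_gb_per_sec, flops_per_double_read)
```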

Page 75

What Does This Mean For Modeling?

✦ Typical model update requires 2-4 flops/double
  ✦ recall: 25 flops / double read

✦ Can do 7-10 model updates per double we read
  ✦ Assuming that models fit in cache

✦ Train multiple models simultaneously

[Figure: example data table with two models, Model 1 and Model 2, drawn beside it]

A  B  C
1  a  Dog
1  b  Cat
2  c  Cat
2  d  Cat
3  e  Dog
3  f  Horse
4  g  Doge
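A minimal sketch of what training multiple models simultaneously can look like: each row is read once per pass and reused to update every model. The linear model, squared loss, SGD step and the function name `train_batched` are assumptions chosen for illustration; the slides do not specify them.

```python
def train_batched(rows, learning_rates, epochs=200):
    """Train one linear model per learning rate, sharing every row read."""
    n_features = len(rows[0][0])
    # One small weight vector per model (the slide assumes models fit in cache).
    models = [[0.0] * n_features for _ in learning_rates]
    for _ in range(epochs):
        for x, y in rows:                               # row is read once per pass
            for w, lr in zip(models, learning_rates):   # and reused by every model
                pred = sum(wi * xi for wi, xi in zip(w, x))
                err = pred - y
                for j in range(n_features):
                    w[j] -= lr * err * x[j]             # squared-loss SGD step
    return models


# Toy data from y = 2*x + 1; the second feature is a constant bias input.
rows = [([x, 1.0], 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
rates = [0.01, 0.1]
models = train_batched(rows, rates)
for lr, w in zip(rates, models):
    print(lr, [round(wi, 3) for wi in w])
```

The inner loop over `models` is where the spare flops go: the cost of reading `x` is paid once, however many models consume it.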

Page 76

What Do We See In Spark?

✦ 2x and 5x increase in models trained/sec with batching

✦ Overhead from virtualization, network, etc.

Page 80

What Do We See In Spark?

✦ These numbers are with vector-matrix multiplies

✦ Can do better when rewriting in terms of matrix-matrix multiplies
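The rewrite mentioned here can be illustrated by stacking the weight vectors of several models as columns of one matrix, so that scoring every model becomes a single matrix-matrix product. The pure-Python sketch below only shows that the two formulations give identical results; the speedup itself would come from an optimized matrix-matrix kernel, which this sketch does not use. All names and values are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matmul(A, B):
    """Multiply A (n x d) by B (d x k); both are lists of rows."""
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]


X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # 3 examples, 2 features
w1 = [0.5, -1.0]                            # weights of model 1
w2 = [2.0, 0.25]                            # weights of model 2

# One model at a time: a separate matrix-vector product per model.
per_model = [[dot(x, w) for x in X] for w in (w1, w2)]

# Batched: weight vectors become the columns of W, then one product X * W.
W = [list(row) for row in zip(w1, w2)]      # 2 features x 2 models
batched = matmul(X, W)                      # 3 examples x 2 models

# Column m of the batched result holds the predictions of model m.
print([row[0] for row in batched] == per_model[0])
print([row[1] for row in batched] == per_model[1])
```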

Page 81

A Tale Of 3 Optimizations

Better Resource Utilization

Algorithmic Speedups

Improved Search

Page 85

Algorithmic Speedups

[Figure: hyperparameter space with axes Learning Rate and Regularization; one point is labeled "Best answer"]

✦ Each point in hyper-parameter space represents trained model

✦ Sometimes we see early on that a model is no good

✦ So we stop early - give up on models that are not progressing
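One simple reading of early stopping is sketched below: all candidate models advance together, and after a few rounds any model whose validation error trails the current best by more than a fixed slack is dropped. The pruning rule, the parameter names (`check_at`, `slack`) and the synthetic error curves are assumptions for illustration; the slides do not say which criterion the optimizer uses.

```python
def synthetic_curve(floor, decay):
    """Hypothetical validation-error curve: starts at 1.0, decays toward `floor`."""
    t = 0
    while True:
        yield floor + (1.0 - floor) * decay ** t
        t += 1


def search_with_early_stopping(curves, max_rounds=20, check_at=5, slack=0.1):
    """Advance all models together; from round `check_at` on, drop any model
    whose latest error exceeds the current best by more than `slack`."""
    active = dict(curves)                      # name -> error iterator
    latest = {}                                # name -> most recent error
    rounds_used = {name: 0 for name in curves}
    for r in range(1, max_rounds + 1):
        for name, curve in active.items():
            latest[name] = next(curve)         # one more training round
            rounds_used[name] += 1
        if r >= check_at:
            best = min(latest[n] for n in active)
            active = {n: c for n, c in active.items()
                      if latest[n] <= best + slack}
    winner = min(active, key=lambda n: latest[n])
    return winner, rounds_used


curves = {
    "model_a": synthetic_curve(0.60, 0.95),    # slow and poor
    "model_b": synthetic_curve(0.25, 0.70),    # best in the long run
    "model_c": synthetic_curve(0.30, 0.60),    # fast start, slightly worse
}
winner, rounds_used = search_with_early_stopping(curves)
print(winner, rounds_used)
```

The slack matters: set too tight, a model that starts slowly but finishes best can be discarded before it catches up.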

Page 89

A Tale Of 3 Optimizations

Better Resource Utilization

Algorithmic Speedups

Improved Search

Page 93

What Search Method?

✦ Various derivative-free optimization techniques
  ✦ Simple ones (Grid, Random)
  ✦ Classic Derivative-Free (Nelder-Mead, Powell's method)
  ✦ Bayesian (SMAC, TPE, Spearmint)

✦ What should we do?
  ✦ Tried on 5 datasets, optimized over 4 hyperparameters!

[Figure: "Comparison of Search Methods Across Learning Problems". Columns: method (GRID, NELDER_MEAD, POWELL, RANDOM, SMAC, SPEARMINT, TPE). Rows: dataset (australian, breast, diabetes, fourclass, splice). x-axis: method and maximum calls (16, 81, 256, 625). y-axis: dataset and validation error (0.0 to 0.5).]
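The two simple methods can be sketched in a few lines. The call budgets in the figure (16, 81, 256, 625) match grids of 2 to 5 points per axis over 4 hyperparameters; that reading is an inference from the numbers, not something the slides state. The `toy_error` function is a hypothetical stand-in for training a model and measuring its validation error.

```python
import itertools
import random


def grid_candidates(bounds, points_per_axis):
    """Every combination of evenly spaced values along each axis (needs >= 2 points)."""
    axes = []
    for low, high in bounds:
        step = (high - low) / (points_per_axis - 1)
        axes.append([low + i * step for i in range(points_per_axis)])
    return list(itertools.product(*axes))


def random_candidates(bounds, budget, seed=0):
    """`budget` points drawn uniformly at random from the same box."""
    rng = random.Random(seed)
    return [tuple(rng.uniform(low, high) for low, high in bounds)
            for _ in range(budget)]


def best_of(candidates, validation_error):
    """Evaluate every candidate and keep the one with the lowest error."""
    return min(candidates, key=validation_error)


def toy_error(params):
    """Hypothetical objective with its minimum at an arbitrary target point."""
    target = (0.3, 0.7, 0.1, 0.9)
    return sum((p - t) ** 2 for p, t in zip(params, target))


bounds = [(0.0, 1.0)] * 4                     # 4 hyperparameters
for n in (2, 3, 4, 5):
    grid = grid_candidates(bounds, n)         # n**4 evaluations
    rand = random_candidates(bounds, len(grid))
    print(len(grid),
          round(toy_error(best_of(grid, toy_error)), 4),
          round(toy_error(best_of(rand, toy_error)), 4))
```

Both methods spend the same number of evaluations here, so any difference in the best error found comes from where the points are placed.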

Page 102

Putting It All Together

✦ First version of MLbase optimizer

✦ 30GB dense images (240K x 16K)

✦ 2 model families, 5 hyperparams

✦ Baseline: grid search

✦ Our method: combination of
  ✦ Batching
  ✦ Early stopping
  ✦ Random or TPE

20x speedup compared to grid search: 15 minutes vs 5 hours!

[Figure: "Model Convergence Over Time". x-axis: time elapsed (m), 0 to 800. y-axis: best validation error seen so far (0.25 to 0.75). Series by search method: Grid - Unoptimized, Random - Optimized, TPE - Optimized.]

Page 105

Does It Scale?

✦ 1.5TB dataset (1.2M x 160K)

✦ 128 nodes, thousands of passes over data

✦ Tried 32 models in 15 hours

✦ Good answer after 11 hours

[Figure: "Convergence of Model Accuracy on 1.5TB Dataset". x-axis: time elapsed (h). y-axis: best validation error seen so far (0.25 to 0.75).]
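The dataset sizes quoted in the talk are consistent with dense storage at 8 bytes per value. A quick check, assuming double precision and decimal K and M (both are assumptions; the slides give only dimensions and totals):

```python
BYTES_PER_DOUBLE = 8  # assumption: dense double-precision storage


def dense_size_bytes(rows, cols):
    """Bytes needed to store a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


small = dense_size_bytes(240_000, 16_000)       # the "30GB" image dataset
large = dense_size_bytes(1_200_000, 160_000)    # the "1.5TB" dataset

print(f"{small / 1e9:.2f} GB")
print(f"{large / 1e12:.3f} TB")
print(f"{large // small}x larger")
```

The larger dataset has 5 times the rows and 10 times the columns of the smaller one.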

Page 107

Future Work

[Figure: pipeline stages: Data, Feature Extraction, Model Training, Final Model]

Automated Model Selection

Page 109

Future Work

[Figure: pipeline stages: Data, Feature Extraction, Model Training, Final Model]

Automated ML Pipeline Construction

Page 110

A Real Pipeline for Image Classification

Inspired by Coates & Ng, 2012

[Figure: pipeline diagram. Node labels: Data, Image Parser, Normalizer, Convolver, Zipper, Linear Solver, Symmetric Rectifier, Global Pooling, Pooler, Patch Extractor, Patch Whitener, KMeans Clusterer, Feature Extractor, Label Extractor, Model, Linear Mapper, Test Data, Label Extractor, Feature Extractor, Test Error, Error Computer. Other labels: "sqrt,mean", "ident,abs", "ident,mean".]

Slide courtesy of Evan Sparks

Page 111

[Figure: the image classification pipeline diagram, with the same node labels as on the previous slide, plus a legend: No Hyperparameters, A few Hyperparameters, Lotsa Hyperparameters.]

Slide courtesy of Evan Sparks

Page 116

Other Future Work

✦ Ensembling

✦ Leverage sampling

✦ Better parallelism for smaller datasets

✦ Multiple hypothesis testing issues

Page 117

MLOpt: Declarative layer that aims to automate ML pipeline construction

MLlib: Spark's core ML library

MLI: API to simplify ML development

Spark: Cluster computing system designed for iterative computation

THANKS! QUESTIONS?

www.mlbase.org

[Figure: MLbase logo and software stack diagram with components MLOpt, MLI, MLlib, Apache Spark]