MLlib sparkmeetup_8_6_13_final_reduced

Transcript
Page 1: MLlib sparkmeetup_8_6_13_final_reduced

Evan Sparks and Ameet Talwalkar
UC Berkeley

MLbase


Page 6: MLlib sparkmeetup_8_6_13_final_reduced

Three Converging Trends

Big Data
Distributed Computing
Machine Learning

MLbase

Page 7: MLlib sparkmeetup_8_6_13_final_reduced

Vision | MLlib | MLI | ML Optimizer | Release Plan

Page 8: MLlib sparkmeetup_8_6_13_final_reduced

Problem: Scalable implementations difficult for ML Developers…

(Architecture diagram: a User submits a Declarative ML Task to the Master Server, whose Parser, Optimizer, and Executor/Monitoring drive an ML Library and return a result (e.g., fn-model & summary); the ML Developer supplies ML Contract + Code along with Meta-Data and Statistics; DMX Runtimes on Slaves execute the LLP/PLP plans.)



Page 15: MLlib sparkmeetup_8_6_13_final_reduced

Problem: ML is difficult for End Users…

Too many algorithms…
Too many knobs…
Difficult to debug…
Doesn't scale…

What users want: Reliable, Fast, Accurate, Provable


Page 18: MLlib sparkmeetup_8_6_13_final_reduced

1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)

Along the way, we gain insight into data-intensive computing

MLbase: ML Experts + Systems Experts


Page 23: MLlib sparkmeetup_8_6_13_final_reduced

Matlab Stack (Single Machine)

Matlab Interface
Lapack

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python


Page 29: MLlib sparkmeetup_8_6_13_final_reduced

MLbase Stack

ML Optimizer
MLI
MLlib
Spark / Runtime(s)

✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
  ✦ Callable from Scala, Java
✦ MLI: API / platform for feature extraction and algorithm development
  ✦ Includes higher-level functionality with a faster dev cycle than MLlib
✦ ML Optimizer: automates model selection
  ✦ Solves a search problem over feature extractors and algorithms in MLI

(Shown alongside the single-machine Matlab stack: Matlab Interface over Lapack.)


Page 33: MLlib sparkmeetup_8_6_13_final_reduced

MLbase Stack Status

End User → ML Optimizer / MLI / MLlib / Spark

Goal 1: Summer Release
Goal 2: Winter Release

(Architecture diagram as before.)


Page 37: MLlib sparkmeetup_8_6_13_final_reduced

Example: MLlib

✦ Goal: classification of a text file
✦ Featurize data manually
✦ Calls MLlib's LR function

MLI version:

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n=2, top=30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize=0.1, numIter=12)
}

MLlib version:

def main(args: Array[String]) {
  val sc = new SparkContext("local", "SparkLR")

  // Load data from HDFS
  val data = sc.textFile(args(0)) // RDD[String]

  // User is responsible for formatting/featurizing/normalizing their RDD!
  val featurizedData: RDD[(Double, Array[Double])] = processData(data)

  // Train the model using MLlib.
  val model = new LogisticRegressionLocalRandomSGD()
    .setStepSize(0.1)
    .setNumIterations(50)
    .train(featurizedData)
}

(Text classification code in MLI (top) and against MLlib directly (bottom).)


Page 41: MLlib sparkmeetup_8_6_13_final_reduced

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
  ✦ Embed in a cross-validation routine
  ✦ Use different feature extractors / algorithms, or write new ones

(Same code as on the previous slide.)

Page 42: MLlib sparkmeetup_8_6_13_final_reduced

Example: ML Optimizer

var X = load("text_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies the task
✦ ML Optimizer searches through MLI

Page 43: MLlib sparkmeetup_8_6_13_final_reduced

Vision | MLlib | MLI | ML Optimizer | Release Plan


Page 48: MLlib sparkmeetup_8_6_13_final_reduced

Lay of the Land

(Plot of Ease of use vs. Performance/Scalability, with Matlab, R; Mahout; GraphLab, VW; and MLlib marked as points.)


Page 50: MLlib sparkmeetup_8_6_13_final_reduced

MLlib

Classification: Logistic Regression, Linear SVM (+L1, L2)
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares
Clustering: K-Means
Optimization Primitives: SGD, Parallel Gradient

Included within the Spark codebase (example invocation sketched below)
✦ Unlike Mahout/Hadoop
✦ Part of the Spark 0.8 release
✦ Continued support via the Spark project
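Since MLlib ships inside Spark, invoking one of these algorithms takes only a few lines of Scala. A minimal sketch, assuming the Spark 0.8-era API (package and method names may differ slightly across versions):

import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.KMeans

object KMeansExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "KMeansExample")

    // Each input line is a space-separated numeric feature vector.
    val parsed = sc.textFile(args(0)).map(_.split(' ').map(_.toDouble))

    // Cluster into k = 10 groups, running at most 20 iterations.
    val model = KMeans.train(parsed, 10, 20)

    println("Found " + model.clusterCenters.length + " cluster centers")
  }
}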


Page 55: MLlib sparkmeetup_8_6_13_final_reduced

MLlib Performance

✦ Walltime: elapsed time to execute the task
✦ Weak scaling (formal definitions below)
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow the cluster
✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speedup as we grow the cluster
✦ EC2 experiments
  ✦ m2.4xlarge instances, up to 32-machine clusters
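For reference, the two regimes can be written down precisely. With $T(p, n)$ the walltime on $p$ machines for total problem size $n$ (standard definitions, stated here for concreteness):

$$\text{weak-scaling relative walltime} = \frac{T(p,\; p \cdot n_0)}{T(1,\; n_0)} \;(\text{ideally } 1), \qquad \text{strong-scaling speedup} = \frac{T(1,\; n)}{T(p,\; n)} \;(\text{ideally } p).$$

These are exactly the quantities plotted in the weak- and strong-scaling figures that follow.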


Page 59: MLlib sparkmeetup_8_6_13_final_reduced

Logistic Regression - Weak Scaling

✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW's walltime

[Fig. 5: Walltime for weak scaling for logistic regression. Bar chart of walltime (s), 0-4000, for MLbase (MLlib), VW, and Matlab at n = 6K-200K, d = 160K.]

[Fig. 6: Weak scaling for logistic regression. Relative walltime vs. # machines (0-32) for MLbase (MLlib), VW, and Ideal.]

[Fig. 7: Walltime for strong scaling for logistic regression. Bar chart of walltime (s), 0-1400, for MLbase (MLlib), VW, and Matlab on 1-32 machines.]

[Fig. 8: Strong scaling for logistic regression. Speedup vs. # machines (0-32) for MLbase (MLlib), VW, and Ideal.]

… with respect to computation. In practice, we see comparable scaling results as more machines are added.

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), we see that our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K-point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, and never twice as fast. These times do not include the time spent preparing data for input to VW, which was significant, but we expect it would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

System        Lines of Code
MLbase        32
GraphLab      383
Mahout        865
MATLAB-Mex    124
MATLAB        20

TABLE II: Lines of code for various implementations of ALS

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix, and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right). \qquad (2)$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = \left( V_{\Omega_i}^T V_{\Omega_i} + \lambda I_k \right)^{-1} V_{\Omega_i}^T M_{i\Omega_i}^T,$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of the observed entries in the $i$-th row of $M$.


Page 63: MLlib sparkmeetup_8_6_13_final_reduced

Logistic Regression - Strong Scaling

✦ Fixed dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines

(Figures 5-8 and the accompanying paper excerpt as shown earlier.)


Page 68: MLlib sparkmeetup_8_6_13_final_reduced

ALS - Walltime

✦ Dataset: scaled version of Netflix data (9x in size)
✦ Cluster: 9 machines
✦ MLlib an order of magnitude faster than Mahout
✦ MLlib within a factor of 2 of GraphLab

System     Walltime (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481


Page 71: MLlib sparkmeetup_8_6_13_final_reduced

Deployment Considerations

Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on a cluster
✦ No fault tolerance

MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on a cluster with a few commands
✦ RDDs are fault tolerant

Page 72: MLlib sparkmeetup_8_6_13_final_reduced

Vision | MLlib | MLI | ML Optimizer | Release Plan


Page 74: MLlib sparkmeetup_8_6_13_final_reduced

Lay of the Land

(Plot of Ease of use vs. Performance/Scalability, with Matlab, R; Mahout; GraphLab, VW; MLlib; and now MLI marked as points.)


Page 78: MLlib sparkmeetup_8_6_13_final_reduced

Current Options

+ Easy (resembles math, limited / no set-up cost)
+ Sufficient for prototyping / writing papers
− Ad-hoc, non-scalable scripts
− Loss of translation upon re-implementation

+ Scalable and (sometimes) fast
+ Existing open-source library of ML algorithms
− Difficult to set up and extend


Page 82: MLlib sparkmeetup_8_6_13_final_reduced

Examples

(Architecture diagram as before: the ML Developer contributes ML Contract + Code to the Master Server.)

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: a simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code


Page 87: MLlib sparkmeetup_8_6_13_final_reduced

Insight: Programming Abstractions

✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting
✦ ML Developer API (MLI)
  ✦ Table computation: MLTable
  ✦ Linear algebra: MLSubMatrix
  ✦ Optimization primitives: MLSolve
✦ MLI examples:
  ✦ DFC: ~50 lines of code
  ✦ ALS: early stopping in 3 lines; < 40 lines total (sketched below)
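As an illustration of those 3 lines, here is a sketch of early stopping dropped into an MLI-style ALS loop (like the BroadcastALS example on a later slide). trainError and epsilon are assumed helpers for this sketch, not shown in the deck:

var lastErr = Double.MaxValue
for (iter <- 0 until maxIter) {
  // Standard ALS alternation, as in the BroadcastALS example later in the deck.
  U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
  V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
  // The "3 lines": stop once the per-iteration improvement falls below epsilon.
  val err = trainError(trainData, U, V)   // assumed helper: reconstruction error
  if (lastErr - err < epsilon) return new ALSModel(U, V)
  lastErr = err
}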


Page 89: MLlib sparkmeetup_8_6_13_final_reduced

Lines of Code

Logistic Regression:
System          Lines of Code
Matlab          11
Vowpal Wabbit   721
MLI             55

Alternating Least Squares:
System          Lines of Code
Matlab          20
Mahout          865
GraphLab        383
MLI             32



Page 99: MLlib sparkmeetup_8_6_13_final_reduced

MLI Details

OLD:
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW:
val x: MLTable

✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends

Page 100: MLlib sparkmeetup_8_6_13_final_reduced

MLTable

✦ Flexibility when loading data
  ✦ e.g., CSV, JSON, XML
  ✦ Heterogeneous data across columns
  ✦ Missing data
  ✦ Feature extraction
✦ Common interface
✦ Supports MapReduce and relational operators
✦ Inspired by DataFrames (R) and Pandas (Python)

Page 101: MLlib sparkmeetup_8_6_13_final_reduced

Feature Extraction

Operation       | Arguments                        | Returns        | Semantics
project         | Seq[Index]                       | MLTable        | Select a subset of columns from a table.
union           | MLTable                          | MLTable        | Concatenate two tables with identical schemas.
filter          | MLRow => Bool                    | MLTable        | Select a subset of rows from a data table given a functional predicate.
join            | MLTable, Seq[Index]              | MLTable        | Inner join of two tables based on a sequence of shared columns.
map             | MLRow => MLRow                   | MLTable        | Apply a function to each row in the table.
flatMap         | MLRow => TraversableOnce[MLRow]  | MLTable        | Apply a function to each row, producing 0 or more rows as output.
reduce          | Seq[MLRow] => MLRow              | MLTable        | Combine all rows in a table using an associative, commutative reduce function.
reduceByKey     | Int, Seq[MLRow] => MLRow         | MLTable        | Combine all rows in a table using an associative, commutative reduce function on a key-by-key basis, where a key column is the first argument.
matrixBatchMap  | MLSubMatrix => MLSubMatrix       | MLNumericTable | Execute a batch function on a local partition of the data; output matrices are concatenated to form a new table.
numRows         | None                             | Long           | Returns the number of rows in the table.
numCols         | None                             | Long           | Returns the number of columns in the table.

Fig. 2: MLTable API Illustration. This table captures core operations of the API and is not exhaustive.

def main(args: Array[String]) {
  val mc = new MLContext("local")

  // Read in table from file on HDFS.
  val rawTextTable = mc.textFile(args(0))

  // Run feature extraction on the raw text - get the top 30000 bigrams.
  val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

  // Cluster the data using K-Means.
  val kMeansModel = KMeans(featurizedTable, k=50)
}

Fig. 3: Loading, featurizing, and learning clusters on a corpus of text data.

Family           | Example Uses                                            | Returns           | Semantics
Shape            | dims(mat), mat.numRows, mat.numCols                     | Int or (Int, Int) | Matrix dimensions.
Composition      | matA on matB, matA then matB                            | Matrix            | Combine two matrices row-wise or column-wise.
Indexing         | mat(0,??), mat(10,10), mat(Seq(2,4), 1)                 | Matrix or Scalar  | Select elements or sub-matrices.
Reverse Indexing | mat(0,??).nonZeros, mat.parentRow(n), mat.parentCol(n)  | Index             | Look up matrix indexes in either the base matrix or the parent table.
Updating         | mat(1,2) = 5, mat(1, Seq(3,10)) = matB                  | None              | Assign values to elements or sub-matrices.
Arithmetic       | matA + matB, matA - 5, matA / matB                      | Matrix            | Element-wise arithmetic between matrices and scalars, or matrices.
Linear Algebra   | matA times matB, matA dot matB, matA.transpose, matA.solve(v), matA.svd, matA.eigen, matA.rank | Matrix or Scalar | Basic and extended linear algebra support.

Fig. 4: MLSubMatrix API Illustration

… with data size, as in the case with basic linear regression. As a result, various optimization techniques are used to converge to an approximate solution while iterating over the data. We treat optimization as a first-class citizen in our API. Furthermore, we designed the Optimizer interface to allow developers to contribute new optimization algorithms to the system. Figure 13 (bottom) shows as an example how to implement Stochastic Gradient Descent using MLSubMatrix.

Finally, we encourage developers to implement their algorithms using the Algorithm interface, which should return a model as specified by the Model interface. An algorithm implementing the Algorithm interface is a class with a train() method that accepts data and hyperparameters as input and produces a Model. A Model is a class which makes predictions. In the case of a classification model, this would be a predicted class given a new example point. In the case of a collaborative …

Page 102: MLlib sparkmeetup_8_6_13_final_reduced

MLSubMatrix

✦ Linear algebra on local partitions
  ✦ e.g., matrix-vector operations for mini-batch logistic regression (formula below)
  ✦ e.g., solving linear systems of equations for Alternating Least Squares
✦ Sparse and dense matrix support
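For concreteness, the mini-batch logistic regression step alluded to above is plain local matrix-vector algebra on a partition's rows $X_b$ with labels $y_b$ (the standard gradient step, matching the Fig. 13 code on a later slide):

$$w \leftarrow w - \eta \, X_b^T \left( \sigma(X_b w) - y_b \right),$$

where $\sigma$ is the element-wise sigmoid and $\eta$ the learning rate.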

Page 103: MLlib sparkmeetup_8_6_13_final_reduced

Alternating Least Squares

MATLAB:

function [U, V] = ALS_matlab(M, U, V, k, lambda, maxiter)

  % Initialize variables
  [m, n] = size(M);
  lambI = lambda * eye(k);
  for q = 1:m
    Vinds{q} = find(M(q,:) ~= 0);
  end
  for q = 1:n
    Uinds{q} = find(M(:,q) ~= 0);
  end

  % ALS main loop
  for iter = 1:maxiter
    parfor q = 1:m
      Vq = V(Vinds{q},:);
      U(q,:) = (Vq'*Vq + lambI) \ (Vq' * M(q,Vinds{q})');
    end
    parfor q = 1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
  end
end

MLI:

object BroadcastALS extends Algorithm {
  def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
      m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
    val lambI = MLSubMatrix.eye(k).mul(lambda)
    var U = MLSubMatrix.rand(m, k)
    var V = MLSubMatrix.rand(n, k)
    var U_b = trainData.context.broadcast(U)
    var V_b = trainData.context.broadcast(V)
    for (iter <- 0 until maxIter) {
      // Solve U's rows against the current V, then V's rows against the new U.
      U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
      U_b = trainData.context.broadcast(U)
      V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
      V_b = trainData.context.broadcast(V)
    }
    new ALSModel(U, V)
  }

  def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix,
      lambI: MLSubMatrix, k: Int): MLSubMatrix = {
    var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
    for (i <- 0 until trainDataPart.numRows) {
      val q = trainDataPart.rowID(i)
      val nz_inds = trainDataPart.nzCols(q)
      val Yq = Y(nz_inds, ??)
      localX(i, ??) = ((Yq.transpose times Yq) + lambI)
        .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
    }
    localX
  }
}

Fig. 14: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).

Page 104: MLlib sparkmeetup_8_6_13_final_reduced

MLSolve

✦ Distributed implementations of common optimization patterns
  ✦ e.g., Stochastic Gradient Descent: applicable to summable ML losses
  ✦ e.g., L-BFGS: an approximate 2nd-order optimization method
  ✦ e.g., ADMM: a decomposition / coordination procedure

Page 105: MLlib sparkmeetup_8_6_13_final_reduced

Logistic Regression

MATLAB:

function w = log_reg(X, y, maxiter, learning_rate)
  [n, d] = size(X);
  w = zeros(d, 1);
  for iter = 1:maxiter
    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
  end
end

% applies sigmoid function component-wise on the vector x
function s = sigmoid(x)
  s = 1 ./ (1 + exp(-1 .* x));
end

MLI:

object LogisticRegression extends Algorithm {
  def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0 * z))

  def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
    x.transpose * (sigmoid(x dot w) - y)
  }

  def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
    val d = data.numCols
    val params = SGDParams(initWeights = MLSubMatrix.zeros(d, 1),
      maxIterations = p.maxIter, learningRate = p.learningRate,
      gradientFunction = gradientFunction)
    val weights = SGD(data, params)
    new LogRegModel(weights)
  }
}

object StochasticGradientDescent extends Optimizer {

  def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
      gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatrix = {
    var localWeights = weights
    for (i <- 0 until x.numRows) {
      val label = x(i, 0)
      val data = x(i, 1 to x.numCols)
      val grad = gradientFunction(localWeights, data, label)
      localWeights = localWeights - (grad * (lambda / n))
    }
    localWeights
  }

  def compute(data: MLNumericTable, params: StochasticGradientDescentParameters): MLSubMatrix = {
    // Initialize parameters
    var weights = params.initWeights.value
    var converged = false
    val n = data.count
    var i = 0
    while (!converged && i < params.maxIterations) {
      i += 1
      // Run SGD locally on each partition, and average the resulting parameters after each iteration
      val newWeights = data
        .matrixBatchMap(localSGD(_, weights, n, params.learningRate, params.gradientFunction))
        .reduce(_ + _) / data.partitions.length
      if ((newWeights - weights).mean < params.epsilon) converged = true
      weights = newWeights
    }
    weights
  }
}

Fig. 13: Logistic Regression code in MATLAB (top) and MLI (middle, bottom).

Page 106: MLlib sparkmeetup_8_6_13_final_reduced

MLI Functionality

Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, [DFC]
Clustering: K-Means, [DP-Means]
Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools: Cross Validation, Evaluation Metrics


Page 108: MLlib sparkmeetup_8_6_13_final_reduced

Vision | MLlib | MLI | ML Optimizer | Release Plan


Page 111: MLlib sparkmeetup_8_6_13_final_reduced

Build a Classifier for X

What you want to do vs. what you have to do:
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, …
✦ Potentially learn Spark/Hadoop/…
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling sizes, algorithms, features
✦ …

… and in the end:

Ask For Help


Page 113: MLlib sparkmeetup_8_6_13_final_reduced

Insight: A Declarative Approach

SQL → Result        MQL → Model

✦ End Users tell the system what they want, not how to get it


Page 115: MLlib sparkmeetup_8_6_13_final_reduced

Insight: A Declarative Approach

Example: Supervised Classification

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)

Algorithm Independent

✦ End Users tell the system what they want, not how to get it

Page 116: MLlib sparkmeetup_8_6_13_final_reduced

ML Optimizer: A Search Problem

(Figure: candidate models, e.g., SVM and Boosting, explored under a time budget of 5 min.)

✦ System is responsible for searching through model space
✦ Opportunities for physical optimization

Page 117: MLlib sparkmeetup_8_6_13_final_reduced

Systems  Op2miza2on  of  Model  Search

35

Observation: We tend to be I/O bound during model training.

A B C

1 a Dog

1 b Cat

2 c Cat

2 d Cat

3 e Dog

3 f Horse

4 g Monkey

Page 118: MLlib sparkmeetup_8_6_13_final_reduced

Systems  Op2miza2on  of  Model  Search

✦ Idea  from  databases  –  shared  cursor!


Observation: We tend to be I/O bound during model training.

A B C

1 a Dog

1 b Cat

2 c Cat

2 d Cat

3 e Dog

3 f Horse

4 g Monkey

Page 119: MLlib sparkmeetup_8_6_13_final_reduced

Systems  Op2miza2on  of  Model  Search

✦ Idea  from  databases  –  shared  cursor!


Observation: We tend to be I/O bound during model training.

A B C

1 a Dog

1 b Cat

2 c Cat

2 d Cat

3 e Dog

3 f Horse

4 g Monkey

Query A

Page 120: MLlib sparkmeetup_8_6_13_final_reduced

Systems  Op2miza2on  of  Model  Search

✦ Idea  from  databases  –  shared  cursor!


Observation: We tend to be I/O bound during model training.

A B C

1 a Dog

1 b Cat

2 c Cat

2 d Cat

3 e Dog

3 f Horse

4 g Monkey

Query A

Query B

Page 121: MLlib sparkmeetup_8_6_13_final_reduced

Systems  Op2miza2on  of  Model  Search

✦ Idea  from  databases  –  shared  cursor!


Observation: We tend to be I/O bound during model training.

A B C

1 a Dog

1 b Cat

2 c Cat

2 d Cat

3 e Dog

3 f Horse

4 g Monkey

Query A

Query B

✦ Single  pass  over  the  data,  many  models  trained

Page 122: MLlib sparkmeetup_8_6_13_final_reduced

Systems  Op2miza2on  of  Model  Search

✦ Idea  from  databases  –  shared  cursor!


Observation: We tend to be I/O bound during model training.

A B C

1 a Dog

1 b Cat

2 c Cat

2 d Cat

3 e Dog

3 f Horse

4 g Monkey

Query A

Query B

✦ Single  pass  over  the  data,  many  models  trained

✦ Example – Logistic Regression via SGD
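A minimal sketch of the idea, in plain Scala with in-memory data standing in for a distributed scan (the hyperparameter grid, loss, and update rule are illustrative assumptions, not MLbase code): each record read from the shared cursor updates every candidate model, so several step sizes are trained in a single pass instead of one pass per model.

object SharedCursorSGD {
  def sigmoid(z: Double) = 1.0 / (1.0 + math.exp(-z))

  def main(args: Array[String]) {
    // Toy dataset: (features, label); stands in for one scan of an RDD.
    val data = Seq(
      (Array(1.0, 2.0), 1.0),
      (Array(-1.0, -2.0), 0.0),
      (Array(0.5, 1.5), 1.0)
    )
    val stepSizes = Array(0.01, 0.1, 1.0)               // one model per step size
    val weights = Array.fill(stepSizes.length)(new Array[Double](2))

    // Single pass over the data: every record updates all candidate models,
    // so the hyperparameter grid shares one cursor instead of one scan each.
    for ((x, y) <- data; m <- stepSizes.indices) {
      val w = weights(m)
      val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = sigmoid(margin) - y                     // gradient of the log-loss
      for (j <- w.indices) w(j) -= stepSizes(m) * err * x(j)
    }

    for ((w, s) <- weights.zip(stepSizes))
      println("stepSize=" + s + " -> weights=" + w.mkString(","))
  }
}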

Page 123: MLlib sparkmeetup_8_6_13_final_reduced

Relationship with MLI

[MLbase architecture diagram: ML Optimizer / MLI / MLlib / Spark stack; End User submits a Declarative ML Task in MQL to the Master Server (Parser, Optimizer, Executor/Monitoring, ML Library) and receives a result (e.g., fn-model & summary); ML Developer contributes ML Contract + Code; Slaves run the DMX Runtime]

Page 124: MLlib sparkmeetup_8_6_13_final_reduced

Relationship with MLI

[MLbase architecture diagram, as above]

✦ MLI provides common interface for all algorithms

Page 125: MLlib sparkmeetup_8_6_13_final_reduced

Relationship with MLI

[MLbase architecture diagram, as above]

✦ MLI provides common interface for all algorithms
✦ Contracts: Meta-data for algorithms written against MLI

Page 126: MLlib sparkmeetup_8_6_13_final_reduced

Relationship with MLI

[MLbase architecture diagram, as above]

✦ MLI provides common interface for all algorithms
✦ Contracts: Meta-data for algorithms written against MLI
  ✦ Type (e.g., classification)
  ✦ Parameters
  ✦ Runtime (e.g., O(n))
  ✦ Input-Specification
  ✦ Output-Specification
  ✦ …
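A contract of this shape is just metadata the optimizer can search over. The following Scala case class is only an illustration of the idea; the field names and values are assumptions, not the MLbase contract format:

// Illustrative sketch of an algorithm "contract"; field names are assumed.
case class Contract(
  algorithmType: String,                  // e.g., "classification"
  parameters: Map[String, Seq[Double]],   // tunable knobs and candidate values
  runtime: String,                        // e.g., "O(n)"
  inputSpec: String,                      // e.g., "RDD[LabeledPoint]"
  outputSpec: String                      // e.g., "fn-model & summary"
)

val logRegContract = Contract(
  algorithmType = "classification",
  parameters = Map("stepSize" -> Seq(0.01, 0.1, 1.0), "regParam" -> Seq(0.0, 0.1)),
  runtime = "O(n)",
  inputSpec = "RDD[LabeledPoint]",
  outputSpec = "fn-model & summary"
)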

Page 127: MLlib sparkmeetup_8_6_13_final_reduced

Vision
MLlib
MLI
ML Optimizer
Release Plan

Page 128: MLlib sparkmeetup_8_6_13_final_reduced

Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia

Page 129: MLlib sparkmeetup_8_6_13_final_reduced

Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia

Page 130: MLlib sparkmeetup_8_6_13_final_reduced

First  Release  (Summer)

[MLbase architecture diagram, as above; the first release covers the MLI and MLlib layers]

Page 131: MLlib sparkmeetup_8_6_13_final_reduced

First  Release  (Summer)

✦ MLlib: low-level ML library and underlying kernels
  ✦ Callable from Scala, Java
  ✦ Included as part of Spark

✦ MLI: API for feature extraction and ML algorithms
  ✦ Platform for ML development
  ✦ Includes a more extensive library, with a faster dev cycle than MLlib

[MLbase architecture diagram, as above; the first release covers the MLI and MLlib layers]
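For instance, calling MLlib from Scala might look like the following sketch; package and method names follow the initial MLlib release as best recalled, so treat the exact signatures as assumptions:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

object MLlibFromScala {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "mllib-sketch")

    // Toy two-class dataset; early MLlib used Array[Double] features.
    val data = sc.parallelize(Seq(
      LabeledPoint(1.0, Array(1.0, 2.0)),
      LabeledPoint(0.0, Array(-1.0, -2.0))
    ))

    // Train logistic regression via SGD for a fixed number of iterations.
    val model = LogisticRegressionWithSGD.train(data, 100)

    println(model.predict(Array(0.5, 1.0)))   // predicted class label
    sc.stop()
  }
}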

Page 132: MLlib sparkmeetup_8_6_13_final_reduced

Second  Release  (Winter)

[MLbase architecture diagram, as above; the second release adds the ML Optimizer layer]

Page 133: MLlib sparkmeetup_8_6_13_final_reduced

Second  Release  (Winter)

✦ ML Optimizer: automated model selection
  ✦ Search problem over feature extractors and algorithms in MLI
  ✦ Contracts
  ✦ Restricted query language (MQL)

[MLbase architecture diagram, as above; the second release adds the ML Optimizer layer]

Page 134: MLlib sparkmeetup_8_6_13_final_reduced

Second  Release  (Winter)

✦ ML Optimizer: automated model selection
  ✦ Search problem over feature extractors and algorithms in MLI
  ✦ Contracts
  ✦ Restricted query language (MQL)

✦ Feature extraction for image data

[MLbase architecture diagram, as above; the second release adds the ML Optimizer layer]

Page 135: MLlib sparkmeetup_8_6_13_final_reduced

Future Directions

Page 136: MLlib sparkmeetup_8_6_13_final_reduced

Future Directions
✦ Identify minimal set of ML operators

✦ Expose internals of ML algorithms to optimizer

Page 137: MLlib sparkmeetup_8_6_13_final_reduced

Future Directions
✦ Identify minimal set of ML operators

✦ Expose internals of ML algorithms to optimizer

✦ Unified  language  for  End  Users  and  ML  Developers

Page 138: MLlib sparkmeetup_8_6_13_final_reduced

Future Directions
✦ Identify minimal set of ML operators

✦ Expose internals of ML algorithms to optimizer

✦ Unified  language  for  End  Users  and  ML  Developers

✦ Plug-­‐ins  to  Python,  R

Page 139: MLlib sparkmeetup_8_6_13_final_reduced

Future Directions
✦ Identify minimal set of ML operators

✦ Expose internals of ML algorithms to optimizer

✦ Unified language for End Users and ML Developers

✦ Plug-ins to Python, R

✦ Visualization for unsupervised learning and exploration

Page 140: MLlib sparkmeetup_8_6_13_final_reduced

Future Directions
✦ Identify minimal set of ML operators

✦ Expose internals of ML algorithms to optimizer

✦ Unified language for End Users and ML Developers

✦ Plug-ins to Python, R

✦ Visualization for unsupervised learning and exploration

✦ Advanced ML capabilities
  ✦ Time-series algorithms
  ✦ Graphical models
  ✦ Advanced Optimization (e.g., asynchronous computation)
  ✦ Online updates
  ✦ Sampling for efficiency

Page 141: MLlib sparkmeetup_8_6_13_final_reduced

Contributions encouraged!

Berkeley, CA
August 29-30
www.mlbase.org
