MLlib sparkmeetup_8_6_13_final_reduced

Post on 27-Jan-2015


Transcript of MLlib sparkmeetup_8_6_13_final_reduced

Evan Sparks and Ameet Talwalkar, UC Berkeley

MLbase

Three Converging Trends

✦ Big Data
✦ Distributed Computing
✦ Machine Learning

MLbase sits at the intersection of these trends.

Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram: a User submits a Declarative ML Task and an ML Developer contributes ML Contract + Code to a Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics); the Master returns a result (e.g., fn-model & summary) and distributes work to Slaves running the DMX Runtime (LLP, PLP).]


Problem: ML is difficult for End Users…

✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn't scale…

Desired properties: Reliable, Fast, Accurate, Provable

MLbase: ML Experts + Systems Experts

1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)

Along the way, we gain insight into data-intensive computing.

Matlab Stack (Single Machine)

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
    ✦ Higher-level abstractions for data access / processing
    ✦ More extensive functionality than Lapack
    ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python

MLbase Stack

✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
    ✦ Callable from Scala, Java
✦ MLI: API / platform for feature extraction and algorithm development
    ✦ Includes higher-level functionality with a faster dev cycle than MLlib
✦ ML Optimizer: automates model selection
    ✦ Solves a search problem over feature extractors and algorithms in MLI

MLbase Stack Status

Stack (top to bottom): ML Optimizer (End User), MLI, MLlib (ML Developer), Spark

✦ Goal 1: Summer Release
✦ Goal 2: Winter Release

Example: MLlib

✦ Goal: Classification of a text file
✦ Featurize data manually
✦ Calls MLlib's LR function

    def main(args: Array[String]) {
      val sc = new SparkContext("local", "SparkLR")

      // Load data from HDFS
      val data = sc.textFile(args(0)) // RDD[String]

      // User is responsible for formatting/featurizing/normalizing their RDD!
      val featurizedData: RDD[(Double, Array[Double])] = processData(data)

      // Train the model using MLlib.
      val model = new LogisticRegressionLocalRandomSGD()
        .setStepSize(0.1)
        .setNumIterations(50)
        .train(featurizedData)
    }

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
    ✦ Embed in a cross-validation routine
    ✦ Use different feature extractors / algorithms or write new ones

    def main(args: Array[String]) {
      val mc = new MLContext("local", "MLILR")

      // Read in file from HDFS
      val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

      // Run feature extraction
      val classes = rawTextTable(??, "class")
      val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
      val featurizedTable = classes.zip(ngrams)

      // Classify the data using Logistic Regression.
      val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
    }
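One of the extensions above, embedding a learner in a cross-validation routine, can be sketched generically. This is a toy illustration only: `trainFn` and `errorFn` are hypothetical stand-ins for a learner and a loss, not MLI APIs.

```scala
object CrossVal {
  type Point = (Double, Array[Double]) // (label, features)

  // Average held-out error over k folds; point i is assigned to fold i % k.
  def kFold[M](data: Vector[Point], k: Int,
               trainFn: Vector[Point] => M,
               errorFn: (M, Vector[Point]) => Double): Double = {
    val folds = (0 until k).map { f =>
      data.zipWithIndex.collect { case (p, i) if i % k == f => p }.toVector
    }
    val errs = (0 until k).map { f =>
      val test  = folds(f)
      val train = (0 until k).filter(_ != f).flatMap(folds).toVector
      errorFn(trainFn(train), test)
    }
    errs.sum / k
  }
}
```

In MLI the learner plugged in here would be the `LogisticRegression` call from the listing above; any feature extractor / algorithm combination fits the same shape.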

Example: ML Optimizer

    var X = load("text_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
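The search behind a call like `doClassify` can be pictured as a loop over candidate pipelines that keeps the one with the lowest validation error. A toy sketch; all names here are illustrative, not the MLbase API:

```scala
object OptimizerSketch {
  // fit() trains one candidate (feature extractor + algorithm + hyperparameters)
  // and returns its validation error.
  case class Candidate(description: String, fit: () => Double)

  // Evaluate every candidate and return the best-performing one.
  def search(candidates: Seq[Candidate]): (String, Double) = {
    val scored = candidates.map(c => (c.description, c.fit()))
    scored.minBy(_._2)
  }
}
```

A real optimizer would prune this search rather than evaluate exhaustively; the sketch only shows the contract: declarative task in, best-found model out.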

Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan

Lay of the Land

[Chart: systems plotted by Ease of use (y-axis) vs. Performance/Scalability (x-axis): Matlab, R; Mahout; GraphLab, VW; MLlib]

MLlib

✦ Classification: Logistic Regression, Linear SVM (+L1, L2)
✦ Regression: Linear Regression (+Lasso, Ridge)
✦ Collaborative Filtering: Alternating Least Squares
✦ Clustering: K-Means
✦ Optimization Primitives: SGD, Parallel Gradient

Included within the Spark codebase
✦ Unlike Mahout/Hadoop
✦ Part of the Spark 0.8 release
✦ Continued support via the Spark project
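To make the optimization primitives concrete, here is a toy, pure-Scala illustration of gradient-based logistic regression. It uses full-batch gradient descent rather than MLlib's SGD, and `train`/`predict` are hypothetical helpers, not MLlib APIs:

```scala
object LrSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // data: (label in {0,1}, dense features); returns learned weights.
  def train(data: Seq[(Double, Array[Double])],
            stepSize: Double, numIters: Int): Array[Double] = {
    val d = data.head._2.length
    val w = Array.fill(d)(0.0)
    for (_ <- 0 until numIters) {
      // Full-batch gradient of the logistic loss (SGD would sample points).
      val grad = Array.fill(d)(0.0)
      for ((y, x) <- data) {
        val err = sigmoid(x.zip(w).map { case (a, b) => a * b }.sum) - y
        for (j <- 0 until d) grad(j) += err * x(j)
      }
      for (j <- 0 until d) w(j) -= stepSize * grad(j) / data.size
    }
    w
  }

  def predict(w: Array[Double], x: Array[Double]): Double =
    sigmoid(x.zip(w).map { case (a, b) => a * b }.sum)
}
```

MLlib's actual implementations distribute the gradient computation across the cluster (the "Parallel Gradient" primitive); the update rule is the same shape.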

MLlib Performance

✦ Walltime: elapsed time to execute a task
✦ Weak scaling
    ✦ fix problem size per processor
    ✦ ideally: constant walltime as we grow the cluster
✦ Strong scaling
    ✦ fix total problem size
    ✦ ideally: linear speedup as we grow the cluster
✦ EC2 experiments
    ✦ m2.4xlarge instances, up to 32-machine clusters
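As a concrete reading of these two definitions, the metrics plotted in the figures below can be computed from measured walltimes like so (the numbers in the usage example are made up for illustration):

```scala
object ScalingMetrics {
  // Strong scaling: fixed TOTAL problem size.
  // speedup(n) = t(1 machine) / t(n machines); ideal value is n.
  def speedup(t1: Double, tN: Double): Double = t1 / tN

  // Weak scaling: fixed PER-MACHINE problem size.
  // relative walltime = t(n machines) / t(1 machine); ideal value is 1.0.
  def relativeWalltime(t1: Double, tN: Double): Double = tN / t1
}
```

For example, a job taking 1600 s on one machine and 100 s on 16 machines has a strong-scaling speedup of 16 (ideal); a relative walltime of 1.2 at some cluster size means a 20% deviation from ideal weak scaling.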

Logistic Regression - Weak Scaling

✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW's walltime

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase, VW, Matlab; dataset sizes up to n = 200K, d = 160K).]
[Fig. 6: Weak scaling for logistic regression (relative walltime vs. # machines; MLbase, VW, Ideal).]

Logistic Regression - Strong Scaling

✦ Fixed dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines

[Fig. 7: Walltime for strong scaling for logistic regression (MLbase, VW, Matlab; 1 to 32 machines).]
[Fig. 8: Strong scaling for logistic regression (speedup vs. # machines; MLbase, VW, Ideal).]

From the accompanying paper:

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), we can see that our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K-point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, but never twice as fast. These times do not include the time spent preparing input data for VW, which was significant, but we expect that this would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS

System         Lines of Code
MLbase         32
GraphLab       383
Mahout         865
MATLAB-Mex     124
MATLAB         20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right). \tag{2}$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = \left( V_{\Omega_i}^T V_{\Omega_i} + \lambda I_k \right)^{-1} V_{\Omega_i}^T M_{i\Omega_i},$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of observed entries in the $i$-th row of $M$.
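The closed-form row update described above can be sketched in plain Scala. This is a toy illustration, not the MLlib implementation: `updateRow` is a hypothetical helper, and the small k × k system is solved with naive Gaussian elimination rather than a tuned linear-algebra routine.

```scala
object AlsStep {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Solve A x = b for a small symmetric positive-definite A
  // via Gaussian elimination on the augmented matrix [A | b].
  def solve(a: Mat, b: Vec): Vec = {
    val n = b.length
    val m = Array.tabulate(n, n + 1)((i, j) => if (j < n) a(i)(j) else b(i))
    for (p <- 0 until n) {
      val piv = m(p)(p)
      for (j <- p to n) m(p)(j) /= piv
      for (i <- 0 until n if i != p) {
        val f = m(i)(p)
        for (j <- p to n) m(i)(j) -= f * m(p)(j)
      }
    }
    Array.tabulate(n)(i => m(i)(n))
  }

  // One ALS row update: u_i = (V^T V + lambda I_k)^(-1) V^T m, where
  // vRows are the rows v_j with j in Omega_i and ratings is M_{i,Omega_i}.
  def updateRow(vRows: Mat, ratings: Vec, lambda: Double): Vec = {
    val k = vRows(0).length
    val gram = Array.tabulate(k, k) { (a, b) =>
      vRows.map(v => v(a) * v(b)).sum + (if (a == b) lambda else 0.0)
    }
    val rhs = Array.tabulate(k) { a =>
      vRows.zip(ratings).map { case (v, r) => v(a) * r }.sum
    }
    solve(gram, rhs)
  }
}
```

Because each row update touches only $V_{\Omega_i}$ and the observed entries of row $i$, all rows of $U$ can be updated in parallel, which is what makes ALS a natural fit for Spark.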

MLbase VW Matlab0

1000

2000

3000

4000

wal

ltim

e (s

)

n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K

Fig. 5: Walltime for weak scaling for logistic regression.

0 5 10 15 20 25 300

2

4

6

8

10

rela

tive

wal

ltim

e# machines

MLbaseVWIdeal

Fig. 6: Weak scaling for logistic regression

MLbase VW Matlab0

200

400

600

800

1000

1200

1400

wal

ltim

e (s

)

1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines

Fig. 7: Walltime for strong scaling for logistic regression.

0 5 10 15 20 25 300

5

10

15

20

25

30

35

# machines

spee

dup

MLbaseVWIdeal

Fig. 8: Strong scaling for logistic regression

with respect to computation. In practice, we see comparablescaling results as more machines are added.

In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.

Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.

System Lines of CodeMLbase 32

GraphLab 383Mahout 865

MATLAB-Mex 124MATLAB 20

TABLE II: Lines of code for various implementations of ALS

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - U_i^T V_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)\,. \quad (2)$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = (V_{\Omega_i}^T V_{\Omega_i} + \lambda I)^{-1} V_{\Omega_i}^T M_{i\Omega_i}^T\,,$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of the observed entries in the $i$-th row of $M$.
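The closed-form row update can be checked numerically: for fixed $V$, it exactly minimizes the regularized least-squares subproblem for one row. A small pure-Python sketch with $k = 2$ (Cramer's rule stands in for a library solver; all names and data here are illustrative):

```python
def solve2(A, b):
    """Solve a 2x2 linear system A x = b via Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def als_row_update(V, m, lam):
    """Closed-form ALS update for one row of U with k = 2:
    u_i = (V^T V + lam*I)^{-1} V^T m, where V holds the factor
    rows for the observed entries m of row i."""
    k = 2
    VtV = [[sum(v[a] * v[b] for v in V) + (lam if a == b else 0.0)
            for b in range(k)] for a in range(k)]
    Vtm = [sum(v[a] * mi for v, mi in zip(V, m)) for a in range(k)]
    return solve2(VtV, Vtm)

def objective(u, V, m, lam):
    """Regularized least-squares objective for a single row u_i."""
    return (sum((mi - sum(ua * va for ua, va in zip(u, v))) ** 2
                for v, mi in zip(V, m))
            + lam * sum(ua ** 2 for ua in u))

V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # factor rows for observed entries
m = [2.0, 3.0, 5.0]                       # observed entries of row i
u_star = als_row_update(V, m, lam=0.1)
base = objective(u_star, V, m, 0.1)       # minimal by construction
```

Perturbing `u_star` in any direction can only increase `base`, since the subproblem is a strictly convex quadratic.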

MLlib

ALS - Walltime
✦ Dataset: Scaled version of Netflix data (9X in size)
✦ Cluster: 9 machines
✦ MLlib an order of magnitude faster than Mahout
✦ MLlib within factor of 2 of GraphLab

System     Walltime (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481

Deployment Considerations

Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on cluster
✦ No fault tolerance

MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on cluster with a few commands
✦ RDDs are fault tolerant

Vision / MLlib / MLI / ML Optimizer / Release Plan

Lay of the Land

(Chart plotting systems by ease of use against performance/scalability: Matlab and R score high on ease of use but low on scalability, Mahout low on both axes, GraphLab and VW high on performance/scalability but harder to use, with MLlib and MLI positioned to combine both.)

Current Options

 +  Easy (resembles math, limited / no set-up cost)
 +  Sufficient for prototyping / writing papers
 -  Ad-hoc, non-scalable scripts
 -  Loss of translation upon re-implementation

 +  Scalable and (sometimes) fast
 +  Existing open-source library of ML algorithms
 -  Difficult to set up, extend

Examples

(MLbase architecture diagram: the End User issues a declarative ML task to the master server, whose parser, optimizer, and executor/monitoring run over the ML library and distributed runtimes on master and slaves; the ML Developer contributes ML contracts and code.)

✦ 'Distributed' Divide-Factor-Combine (DFC)
  ✦ Initial studies in MATLAB (not distributed)
  ✦ Distributed prototype involving compiled MATLAB
✦ Mahout ALS with Early Stopping
  ✦ Theory: simple if-statement (3 lines of code)
  ✦ Practice: sift through 7 files, nearly 1K lines of code

Insight: Programming Abstractions
✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting
✦ ML Developer API (MLI)
  ✦ Table Computation: MLTable
  ✦ Linear Algebra: MLSubMatrix
  ✦ Optimization Primitives: MLSolve
✦ MLI Examples:
  ✦ DFC: ~50 lines of code
  ✦ ALS: early stopping in 3 lines; < 40 lines total

Lines of Code

Logistic Regression
System          Lines of Code
Matlab          11
Vowpal Wabbit   721
MLI             55

Alternating Least Squares
System     Lines of Code
Matlab     20
Mahout     865
GraphLab   383
MLI        32

MLI Details

OLD
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW
val x: MLTable

✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends

MLTable
✦ Flexibility when loading data
  ✦ e.g., CSV, JSON, XML
  ✦ Heterogeneous data across columns
  ✦ Missing data
  ✦ Feature extraction
✦ Common interface
✦ Supports MapReduce and relational operators
✦ Inspired by DataFrames (R) and Pandas (Python)

Feature Extraction

Operation        Arguments                          Returns         Semantics
project          Seq[Index]                         MLTable         Select a subset of columns from a table.
union            MLTable                            MLTable         Concatenate two tables with identical schemas.
filter           MLRow => Bool                      MLTable         Select a subset of rows from a data table given a functional predicate.
join             MLTable, Seq[Index]                MLTable         Inner join of two tables based on a sequence of shared columns.
map              MLRow => MLRow                     MLTable         Apply a function to each row in the table.
flatMap          MLRow => TraversableOnce[MLRow]    MLTable         Apply a function to each row, producing 0 or more rows as output.
reduce           Seq[MLRow] => MLRow                MLTable         Combine all rows in a table using an associative, commutative reduce function.
reduceByKey      Int, Seq[MLRow] => MLRow           MLTable         Combine all rows in a table using an associative, commutative reduce function on a key-by-key basis, where a key column is the first argument.
matrixBatchMap   MLSubMatrix => MLSubMatrix         MLNumericTable  Execute a batch function on a local partition of the data. Output matrices are concatenated to form a new table.
numRows          None                               Long            Returns the number of rows in the table.
numCols          None                               Long            Returns the number of columns in the table.

Fig. 2: MLTable API Illustration. This table captures core operations of the API and is not exhaustive.
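The operators above follow standard relational/MapReduce semantics. A minimal sketch over Python lists of tuples (a hypothetical miniature for intuition, not the MLTable implementation):

```python
def project(table, idxs):
    """Select a subset of columns (cf. project)."""
    return [tuple(row[i] for i in idxs) for row in table]

def filter_rows(table, pred):
    """Select rows satisfying a functional predicate (cf. filter)."""
    return [row for row in table if pred(row)]

def reduce_by_key(table, key_idx, combine):
    """Combine rows sharing a key column with an associative,
    commutative function (cf. reduceByKey)."""
    groups = {}
    for row in table:
        k = row[key_idx]
        groups[k] = combine(groups[k], row) if k in groups else row
    return sorted(groups.values())

table = [(1, 10), (2, 5), (1, 7), (2, 1)]
# Sum the second column per key in the first column.
sums = reduce_by_key(table, 0, lambda a, b: (a[0], a[1] + b[1]))
# sums == [(1, 17), (2, 6)]
```

Associativity and commutativity of the combine function are what let the real system apply it partition-by-partition in any order.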

def main(args: Array[String]) {
  val mc = new MLContext("local")

  // Read in table from file on HDFS.
  val rawTextTable = mc.textFile(args(0))

  // Run feature extraction on the raw text - get the top 30000 bigrams.
  val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

  // Cluster the data using K-Means.
  val kMeansModel = KMeans(featurizedTable, k=50)
}

Fig. 3: Loading, featurizing, and learning clusters on a corpus of Text Data.
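The featurization step in Fig. 3 (bigram extraction followed by tf-idf) can be sketched in Python; the helper names mirror the figure, but the log-idf weighting is a common convention assumed here rather than taken from the MLI implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list (cf. nGrams in Fig. 3)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Map each document to {bigram: tf * idf} feature weights."""
    grams = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    df = Counter(g for counts in grams for g in counts)  # document frequency
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in counts.items()}
            for counts in grams]

docs = ["the quick fox", "the quick dog", "lazy dog sleeps"]
feats = tf_idf(docs)
# Bigrams appearing in fewer documents receive higher weights.
```

Keeping only the `top` most frequent n-grams, as in the figure, would be one extra ranking pass over `df`.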

Family            Example Uses                                             Returns           Semantics
Shape             dims(mat), mat.numRows, mat.numCols                      Int or (Int,Int)  Matrix dimensions.
Composition       matA on matB, matA then matB                             Matrix            Combine two matrices row-wise or column-wise.
Indexing          mat(0,??), mat(10,10), mat(Seq(2,4), 1)                  Matrix or Scalar  Select elements or sub-matrices.
Reverse Indexing  mat(0,??).nonZeros, mat.parentRow(n), mat.parentCol(n)   Index             Look up matrix indexes in either base matrix or parent table.
Updating          mat(1,2) = 5, mat(1, Seq(3,10)) = matB                   None              Assign values to elements or sub-matrices.
Arithmetic        matA + matB, matA - 5, matA / matB                       Matrix            Element-wise arithmetic between matrices and scalars or matrices.
Linear Algebra    matA times matB, matA dot matB, matA.transpose,          Matrix or Scalar  Basic and extended linear algebra support.
                  matA.solve(v), matA.svd, matA.eigen, matA.rank

Fig. 4: MLSubMatrix API Illustration

with data size, as in the case with basic linear regression. As a result, various optimization techniques are used to converge to an approximate solution while iterating over the data. We treat optimization as a first-class citizen in our API. Furthermore, we designed the Optimizer interface to allow developers to contribute new optimization algorithms to the system. Figure 13 (bottom) shows as an example how to implement Stochastic Gradient Descent using MLSubMatrix.

Finally, we encourage developers to implement their algorithms using the Algorithm interface, which should return a model as specified by the Model interface. An algorithm implementing the Algorithm interface is a class with a train() method that accepts data and hyperparameters as input, and produces a Model. A Model is a class which makes predictions. In the case of a classification model, this would be a predicted class given a new example point. In the case of a collaborative

MLSubMatrix
✦ Linear algebra on local partitions
  ✦ e.g., matrix-vector operations for mini-batch logistic regression
  ✦ e.g., solving linear systems of equations for Alternating Least Squares
✦ Sparse and Dense Matrix Support

Alternating Least Squares

MATLAB:

function [U, V] = ALS_matlab(M, U, V, k, lambda, maxiter)
  % Initialize variables
  [m,n] = size(M);
  lambI = lambda * eye(k);
  for q = 1:m
    Vinds{q} = find(M(q,:) ~= 0);
  end
  for q = 1:n
    Uinds{q} = find(M(:,q) ~= 0);
  end
  % ALS main loop
  for iter = 1:maxiter
    parfor q = 1:m
      Vq = V(Vinds{q},:);
      U(q,:) = (Vq'*Vq + lambI) \ (Vq' * M(q,Vinds{q})');
    end
    parfor q = 1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
  end
end

MLI:

object BroadcastALS extends Algorithm {
  def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
      m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
    val lambI = MLSubMatrix.eye(k).mul(lambda)
    var U = MLSubMatrix.rand(m, k)
    var V = MLSubMatrix.rand(n, k)
    var U_b = trainData.context.broadcast(U)
    var V_b = trainData.context.broadcast(V)
    for (iter <- 0 until maxIter) {
      // Each factor is updated from the other, broadcast factor.
      U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
      U_b = trainData.context.broadcast(U)
      V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
      V_b = trainData.context.broadcast(V)
    }
    new ALSModel(U, V)
  }

  def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int) = {
    var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
    for (i <- 0 until trainDataPart.numRows) {
      val q = trainDataPart.rowID(i)
      val nz_inds = trainDataPart.nzCols(q)
      val Yq = Y(trainDataPart.nzCols(q), ??)
      localX(i, ??) = ((Yq.transpose times Yq) + lambI)
        .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
    }
    localX
  }
}

Fig. 14: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).

MLSolve
✦ Distributed implementations of common optimization patterns
  ✦ e.g., Stochastic Gradient Descent: applicable to summable ML losses
  ✦ e.g., L-BFGS: an approximate 2nd-order optimization method
  ✦ e.g., ADMM: decomposition / coordination procedure

Logistic Regression

MATLAB:

function w = log_reg(X, y, maxiter, learning_rate)
  [n, d] = size(X);
  w = zeros(d,1);
  for iter = 1:maxiter
    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
  end
end

% applies sigmoid function component-wise on the vector x
function s = sigmoid(x)
  s = 1 ./ (1 + exp(-1 .* x));
end

MLI:

object LogisticRegression extends Algorithm {
  def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

  def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
    x.transpose * (sigmoid(x dot w) - y)
  }

  def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
    val d = data.numCols
    val params = SGDParams(initWeights = MLSubMatrix.zeros(d, 1),
      maxIterations = p.maxIter, learningRate = p.learningRate,
      gradientFunction = gradientFunction)
    val weights = SGD(data, params)
    new LogRegModel(weights)
  }
}

object StochasticGradientDescent extends Optimizer {

  def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
      gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatrix = {
    var localWeights = weights
    for (i <- 0 to x.numRows) {
      val label = x(i, 0)
      val data = x(i, 1 to x.numCols)
      val grad = gradientFunction(localWeights, data, label)
      localWeights = localWeights - (grad * (lambda/n))
    }
    localWeights
  }

  def compute(data: MLNumericTable, params: StochasticGradientDescentParameters): MLSubMatrix = {
    // Initialize parameters
    var weights = params.initWeights.value
    var converged = false
    var i = 0
    val n = data.count
    while (!converged && i < params.maxIterations) {
      i += 1
      // Run SGD locally on each partition, and average resulting parameters after each iteration
      val newWeights = data
        .matrixBatchMap(localSGD(_, weights, n, params.learningRate, params.gradientFunction))
        .reduce(_ + _) / data.partitions.length
      if ((newWeights - weights).mean < params.epsilon) converged = true
      weights = newWeights
    }
    weights
  }
}

Fig. 13: Logistic Regression Code in MATLAB (top) and MLI (middle, bottom).
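The pattern in the bottom listing (run SGD locally on each partition, then average the resulting weights after every iteration) can be sketched in plain Python; this toy mirrors the structure of compute/localSGD above on synthetic data, not the MLI code itself:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_sgd(part, w, lr):
    """One SGD pass over a single partition, starting from the
    current global weights (cf. localSGD in the listing above)."""
    w = list(w)
    for x, y in part:
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
        w = [wj - lr * err * xj for wj, xj in zip(w, x)]
    return w

def parallel_sgd(partitions, d, iters, lr):
    """Run SGD independently on each partition, then average the
    resulting weights after every iteration (cf. compute above)."""
    w = [0.0] * d
    for _ in range(iters):
        local_ws = [local_sgd(p, w, lr) for p in partitions]
        w = [sum(ws[j] for ws in local_ws) / len(local_ws) for j in range(d)]
    return w

# Two toy 'partitions' of labeled points: label 1 iff second feature > 0.
parts = [[([1.0, 2.0], 1), ([1.0, -2.0], 0)],
         [([1.0, 1.5], 1), ([1.0, -1.0], 0)]]
w = parallel_sgd(parts, d=2, iters=200, lr=0.2)
correct = sum((sigmoid(sum(wj * xj for wj, xj in zip(w, x))) > 0.5) == (y == 1)
              for p in parts for x, y in p)
```

In the real system the per-partition passes run in parallel via matrixBatchMap, and the average is a single reduce.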

MLI Functionality

Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, [DFC]
Clustering: K-Means, [DP-Means]
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools: Cross Validation, Evaluation Metrics

Vision / MLlib / MLI / ML Optimizer / Release Plan

Build a Classifier for X

What you want to do: Build a Classifier for X

What you have to do:
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, ...
✦ Potentially learn Spark/Hadoop/...
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling-sizes, algorithms, features
✦ ...

and in the end

Ask For Help

Insight: A Declarative Approach

✦ End Users tell the system what they want, not how to get it
✦ SQL → Result; MQL → Model

Example: Supervised Classification (Algorithm Independent)

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)

ML Optimizer: A Search Problem

(Chart: candidate model families, e.g. SVM and Boosting, explored within a time budget such as 5 min.)

✦ System is responsible for searching through model space
✦ Opportunities for physical optimization

Systems Optimization of Model Search

Observation: We tend to be I/O bound during model training.

A  B  C
1  a  Dog
1  b  Cat
2  c  Cat
2  d  Cat
3  e  Dog
3  f  Horse
4  g  Monkey

(Illustration: Query A and Query B scan the same example table in a single shared pass.)

✦ Idea from databases – shared cursor!
✦ Single pass over the data, many models trained
✦ Example – Logistic Regression via SGD
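The shared-cursor idea can be sketched in Python: amortize one scan of the data across several candidate models (here, logistic-regression SGD at different learning rates) instead of scanning once per model. Names and data are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_scan_sgd(data, learning_rates, epochs):
    """Train one SGD logistic-regression model per learning rate,
    sharing a single cursor over the data: each row fetched from
    storage updates every candidate model before moving on."""
    models = [[0.0, 0.0] for _ in learning_rates]
    scans = 0
    for _ in range(epochs):
        scans += 1                       # one pass over the data...
        for x, y in data:
            for w, lr in zip(models, learning_rates):
                err = sigmoid(w[0] * x[0] + w[1] * x[1]) - y
                w[0] -= lr * err * x[0]  # ...updates every model in flight
                w[1] -= lr * err * x[1]
    return models, scans

data = [([1.0, 2.0], 1), ([1.0, -2.0], 0), ([1.0, 1.5], 1), ([1.0, -1.0], 0)]
models, scans = shared_scan_sgd(data, learning_rates=[0.01, 0.1, 0.5], epochs=50)
# 50 shared scans trained 3 models; a per-model loop would need 150 scans
```

When training is I/O bound, the saved scans, not the extra arithmetic, dominate the speedup.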

Relationship with MLI

(Stack diagram: ML Optimizer over MLI over MLlib over Spark; the End User issues MQL to the ML Optimizer, while ML Developers write against MLI.)

✦ MLI provides common interface for all algorithms
✦ Contracts: Meta-data for algorithms written against MLI
  ✦ Type (e.g., classification)
  ✦ Parameters
  ✦ Runtime (e.g., O(n))
  ✦ Input-Specification
  ✦ Output-Specification
  ✦ ...

Vision / MLlib / MLI / ML Optimizer / Release Plan

Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia

First Release (Summer)

✦ MLlib: low-level ML library and underlying kernels
  ✦ Callable from Scala, Java
  ✦ Included as part of Spark
✦ MLI: API for feature extraction and ML algorithms
  ✦ Platform for ML development
  ✦ Includes more extensive library with a faster dev-cycle than MLlib

Second Release (Winter)

✦ ML Optimizer: automated model selection
  ✦ Search problem over feature extractors and algorithms in MLI
  ✦ Contracts
  ✦ Restricted query language (MQL)
✦ Feature extraction for image data

Future Directions
✦ Identify minimal set of ML operators
✦ Expose internals of ML algorithms to optimizer
✦ Unified language for End Users and ML Developers
✦ Plug-ins to Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities
  ✦ Time-series algorithms
  ✦ Graphical models
  ✦ Advanced optimization (e.g., asynchronous computation)
  ✦ Online updates
  ✦ Sampling for efficiency

Contributions encouraged!

Berkeley, CA, August 29-30, www.mlbase.org
