MLbase
Evan Sparks and Ameet Talwalkar
UC Berkeley
Three Converging Trends
✦ Big Data
✦ Distributed Computing
✦ Machine Learning
MLbase
Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan
Problem: Scalable implementations difficult for ML Developers…
[Architecture diagram: a User submits a Declarative ML Task and an ML Developer contributes ML Contract + Code to a Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics), which returns a result (e.g., fn-model & summary); the Master coordinates DMX Runtimes running on Slaves.]
Problem: ML is difficult for End Users…
✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn’t scale…
Reliable, Fast, Accurate, Provable
MLbase: ML Experts + Systems Experts
1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)
Along the way, we gain insight into data intensive computing.
Matlab Stack (Single Machine)
✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python
MLbase Stack
✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
  ✦ Callable from Scala, Java
✦ MLI: API / platform for feature extraction and algorithm development
  ✦ Includes higher-level functionality with faster dev cycle than MLlib
✦ ML Optimizer: automates model selection
  ✦ Solves a search problem over feature extractors and algorithms in MLI
MLbase Stack Status
✦ Goal 1: Summer Release
✦ Goal 2: Winter Release
[Stack diagram: ML Optimizer (End User) on top of MLI (ML Developer), on top of MLlib, on top of Spark.]
Example: MLlib
✦ Goal: Classification of text file
✦ Featurize data manually
✦ Calls MLlib’s LR function

def main(args: Array[String]) {
  val sc = new SparkContext("local", "SparkLR")

  // Load data from HDFS
  val data = sc.textFile(args(0)) // RDD[String]

  // User is responsible for formatting/featurizing/normalizing their RDD!
  val featurizedData: RDD[(Double, Array[Double])] = processData(data)

  // Train the model using MLlib.
  val model = new LogisticRegressionLocalRandomSGD()
    .setStepSize(0.1)
    .setNumIterations(50)
    .train(featurizedData)
}
Example: MLI
✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
  ✦ Embed in cross-validation routine
  ✦ Use different feature extractors / algorithms or write new ones

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
}
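The tfIdf(nGrams(...)) featurization above can be approximated in a few lines. This is a hypothetical pure-Python sketch of bigram TF-IDF, purely for intuition; it is not MLI's actual implementation, and all names here are illustrative:

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Return the list of word n-grams in one document."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(docs, n=2, top=5):
    """Map each document to a dict of TF-IDF weights over the `top`
    most frequent n-grams in the corpus (a sketch of tfIdf(nGrams(...)))."""
    grams = [Counter(ngrams(d, n)) for d in docs]
    # Vocabulary: the most common n-grams across the whole corpus.
    corpus = Counter()
    for g in grams:
        corpus.update(g)
    vocab = [w for w, _ in corpus.most_common(top)]
    # Document frequency for each retained n-gram.
    df = {w: sum(1 for g in grams if w in g) for w in vocab}
    N = len(docs)
    return [{w: g[w] * math.log(N / df[w]) for w in vocab if w in g}
            for g in grams]

features = tf_idf(["the cat sat", "the cat ran", "a dog ran"], n=2)
```

In MLI the same idea runs distributed over an RDD-backed table rather than a Python list.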
Example: ML Optimizer

var X = load("text_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
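Conceptually, doClassify hides a search over (feature extractor, algorithm) combinations built from MLI. The following is a toy stand-in for that search loop, with hypothetical component names and a trivial scorer; it is not the actual ML Optimizer:

```python
from itertools import product

def do_classify(X, y, featurizers, algorithms, scorer):
    """Try every (featurizer, algorithm) pair and keep the model with
    the best score -- a toy sketch of the ML Optimizer's search."""
    best = None
    for featurize, train in product(featurizers, algorithms):
        feats = featurize(X)
        model = train(feats, y)
        score = scorer(model, feats, y)
        if best is None or score > best[0]:
            best = (score, model)
    score, model = best
    return model, {"score": score}

# Toy components: identity features and a majority-class "classifier".
identity = lambda X: X
def majority(feats, y):
    label = max(set(y), key=y.count)
    return lambda row: label
accuracy = lambda m, feats, y: sum(m(r) == lab for r, lab in zip(feats, y)) / len(y)

model, summary = do_classify([[0], [1], [1]], [0, 1, 1],
                             [identity], [majority], accuracy)
```

A real optimizer would score on held-out data and prune the search rather than enumerate it exhaustively.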
Lay of the Land
[Chart plotting systems on two axes, ease of use vs. performance/scalability: Matlab, R; Mahout; GraphLab, VW; MLlib.]
MLlib
✦ Classification: Logistic Regression, Linear SVM (+L1, L2)
✦ Regression: Linear Regression (+Lasso, Ridge)
✦ Collaborative Filtering: Alternating Least Squares
✦ Clustering: K-Means
✦ Optimization Primitives: SGD, Parallel Gradient

Included within Spark codebase
✦ Unlike Mahout/Hadoop
✦ Part of Spark 0.8 release
✦ Continued support via Spark project
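To make the SGD primitive concrete, here is a toy single-machine sketch of logistic regression trained by stochastic gradient descent. This is for intuition only, not MLlib's code; the data layout and function name are made up:

```python
import math, random

def train_logistic_sgd(data, step_size=0.1, num_iters=50, seed=0):
    """Stochastic gradient descent on the logistic loss.
    `data` is a list of (label, features) pairs with label in {0, 1}."""
    rng = random.Random(seed)
    d = len(data[0][1])
    w = [0.0] * d
    for _ in range(num_iters):
        # One pass over the data in shuffled order per iteration.
        for label, x in rng.sample(data, len(data)):
            margin = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1.0 / (1.0 + math.exp(-margin))
            err = pred - label  # gradient scale of the logistic loss
            w = [wi - step_size * err * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: positive iff the single feature is > 0.
data = [(1, [2.0]), (1, [1.0]), (0, [-1.0]), (0, [-2.0])]
w = train_logistic_sgd(data)
```

The distributed "Parallel Gradient" primitive computes these per-example gradients on partitions of an RDD and aggregates them at the master.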
MLlib Performance
✦ Walltime: elapsed time to execute task
✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster
✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speed up as we grow cluster
✦ EC2 Experiments
  ✦ m2.4xlarge instances, up to 32 machine clusters
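These two notions reduce to simple ratios of measured walltimes. A quick sketch with made-up example numbers (not measurements from this deck):

```python
def weak_scaling_efficiency(t1, tn):
    """Weak scaling: problem size grows with the cluster, so the ideal
    walltime is constant; efficiency = t(1 machine) / t(n machines)."""
    return t1 / tn

def strong_scaling_speedup(t1, tn):
    """Strong scaling: fixed total problem; measured speedup on n
    machines is t(1 machine) / t(n machines), ideally equal to n."""
    return t1 / tn

def strong_scaling_efficiency(t1, tn, n):
    """Fraction of the ideal linear speedup actually achieved."""
    return (t1 / tn) / n

print(weak_scaling_efficiency(100.0, 125.0))       # 0.8 -> 80% weak-scaling efficiency
print(strong_scaling_speedup(1000.0, 40.0))        # 25.0x speedup
print(strong_scaling_efficiency(1000.0, 40.0, 32)) # 25x of an ideal 32x
```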
Logistic Regression - Weak Scaling
✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW’s walltime

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase, VW, Matlab; n = 12K–200K, d = 160K).]
[Fig. 6: Weak scaling for logistic regression (relative walltime vs. # machines; MLbase, VW, Ideal).]
[Fig. 7: Walltime for strong scaling for logistic regression (MLbase, VW, Matlab; 1–32 machines).]
[Fig. 8: Strong scaling for logistic regression (speedup vs. # machines; MLbase, VW, Ideal).]

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, and never twice as fast. These times do not include the time spent preparing data for input to VW, which was significant, but we expect it would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS
System        Lines of Code
MLbase        32
GraphLab      383
Mahout        865
MATLAB-Mex    124
MATLAB        20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let M ∈ R^{m×n} be some underlying matrix and suppose that only a small subset, Ω(M), of its entries are revealed. The goal of matrix factorization is to find low-rank matrices U ∈ R^{m×k} and V ∈ R^{n×k}, where k ≪ n, m, such that M ≈ UV^T. Commonly, U and V are estimated using the following bi-convex objective:

  min_{U,V}  Σ_{(i,j)∈Ω(M)} (M_ij − u_i^T v_j)^2 + λ(||U||_F^2 + ||V||_F^2) .   (2)

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing U with V fixed, and V with U fixed. ALS is well-suited for parallelism, as each row of U can be solved independently with V fixed, and vice-versa. With V fixed, the minimization problem for each row u_i is solved with the closed-form solution

  u_i^* = (V_{Ω_i}^T V_{Ω_i} + λ I_k)^{-1} V_{Ω_i}^T M_{iΩ_i}^T ,

where u_i^* ∈ R^k is the optimal solution for the i-th row vector of U, V_{Ω_i} is a sub-matrix of rows v_j such that j ∈ Ω_i, and M_{iΩ_i} is a sub-vector of observed entries in the i-th row of M.
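The alternating closed-form updates can be sketched with NumPy. This is a minimal dense version for intuition, assuming every entry of M is observed (so the Ω mask drops out) and that NumPy is available; it is not the distributed MLlib implementation:

```python
import numpy as np

def als(M, k=2, lam=0.01, iters=50, seed=0):
    """Alternate the ridge-regression closed-form updates for U and V.
    Dense sketch: all entries of M are treated as observed."""
    m, n = M.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        # u_i* = (V^T V + lam I)^-1 V^T M_i, solved for all rows at once.
        U = np.linalg.solve(V.T @ V + I, V.T @ M.T).T
        V = np.linalg.solve(U.T @ U + I, U.T @ M).T
    return U, V

# Recover a small rank-2 matrix.
M = np.outer([1.0, 2.0, 3.0], [1.0, 0.5]) + np.outer([0.5, 0.1, 0.2], [0.2, 1.0])
U, V = als(M, k=2)
err = np.linalg.norm(M - U @ V.T) / np.linalg.norm(M)
```

The parallel version exploits exactly the row-independence noted above: each worker solves the k×k systems for its partition of rows after the current factor is broadcast.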
Logis2c Regression -‐ Weak Scaling
✦ Full dataset: 200K images, 160K dense features✦ Similar weak scaling✦ MLlib within a factor of 2 of VW’s wall=me
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=6K, d=160Kn=12.5K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
MLlib
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e
# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLlib
Logistic Regression - Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines
ALS - Walltime
✦ Dataset: Scaled version of Netflix data (9X in size)
✦ Cluster: 9 machines
✦ MLlib an order of magnitude faster than Mahout
✦ MLlib within factor of 2 of GraphLab

System     Walltime (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481
Deployment Considera2ons
Deployment Considera2ons
Vowpal Wabbit, GraphLab✦ Data prepara=on specific to each program✦ Non-‐trivial setup on cluster✦ No fault tolerance
Deployment Considerations

Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on cluster
✦ No fault tolerance

MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on cluster with a few commands
✦ RDDs are fault tolerant
Vision | MLlib | MLI | ML Optimizer | Release Plan
Lay of the Land
[Chart: ease of use vs. performance/scalability, plotting Matlab and R (easy, not scalable), GraphLab and VW (scalable, harder to use), Mahout, MLlib, and MLI]
Current Options

+ Easy (resembles math, limited / no setup cost)
+ Sufficient for prototyping / writing papers
- Ad-hoc, non-scalable scripts
- Loss of translation upon re-implementation

+ Scalable and (sometimes) fast
+ Existing open-source library of ML algorithms
- Difficult to set up, extend
Examples
[MLbase architecture diagram: ML Developers contribute ML contracts + code and Users issue declarative ML tasks to a Master Server (Parser, Optimizer, Executor/Monitoring, ML Library), which executes on master and slave nodes via the DMX Runtime]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code
Insight: Programming Abstractions
✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting
✦ ML Developer API (MLI)
✦ Table Computation: MLTable
✦ Linear Algebra: MLSubMatrix
✦ Optimization Primitives: MLSolve
✦ MLI Examples:
✦ DFC: ~50 lines of code
✦ ALS: early stopping in 3 lines; < 40 lines total
Lines of Code

Logistic Regression
System | Lines of Code
Matlab | 11
Vowpal Wabbit | 721
MLI | 55

Alternating Least Squares
System | Lines of Code
Matlab | 20
Mahout | 865
GraphLab | 383
MLI | 32
MLI Details

OLD
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW
val x: MLTable

✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends
MLTable
✦ Flexibility when loading data✦ e.g., CSV, JSON, XML✦ Heterogenous data across
columns✦ Missing Data✦ Feature extrac2on
✦ Common Interface
✦ Supports MapReduce and Rela2onal Operators
✦ Inspired by DataFrames (R) and Pandas (Python)
Feature Extraction
Operation | Arguments | Returns | Semantics
project | Seq[Index] | MLTable | Select a subset of columns from a table.
union | MLTable | MLTable | Concatenate two tables with identical schemas.
filter | MLRow => Bool | MLTable | Select a subset of rows from a data table given a functional predicate.
join | MLTable, Seq[Index] | MLTable | Inner join of two tables based on a sequence of shared columns.
map | MLRow => MLRow | MLTable | Apply a function to each row in the table.
flatMap | MLRow => TraversableOnce[MLRow] | MLTable | Apply a function to each row, producing 0 or more rows as output.
reduce | Seq[MLRow] => MLRow | MLTable | Combine all rows in a table using an associative, commutative reduce function.
reduceByKey | Int, Seq[MLRow] => MLRow | MLTable | Combine all rows in a table using an associative, commutative reduce function on a key-by-key basis, where a key column is the first argument.
matrixBatchMap | MLSubMatrix => MLSubMatrix | MLNumericTable | Execute a batch function on a local partition of the data. Output matrices are concatenated to form a new table.
numRows | None | Long | Returns the number of rows in the table.
numCols | None | Long | Returns the number of columns in the table.

Fig. 2: MLTable API Illustration. This table captures core operations of the API and is not exhaustive.
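To make the operator semantics above concrete outside the distributed setting, here is a hypothetical single-node mock in Python (class and method names are invented for illustration; the real MLTable is a distributed Scala interface):

```python
class MockTable:
    """Toy, in-memory stand-in for a few MLTable-style operators."""

    def __init__(self, rows):
        self.rows = list(rows)  # each row is a tuple

    def project(self, indices):
        # Select a subset of columns from the table.
        return MockTable([tuple(r[i] for i in indices) for r in self.rows])

    def filter(self, pred):
        # Select a subset of rows given a functional predicate.
        return MockTable([r for r in self.rows if pred(r)])

    def map(self, f):
        # Apply a function to each row in the table.
        return MockTable([f(r) for r in self.rows])

    def reduce_by_key(self, key_col, f):
        # Combine rows sharing a key with an associative, commutative function.
        groups = {}
        for r in self.rows:
            k = r[key_col]
            groups[k] = f(groups[k], r) if k in groups else r
        return MockTable(groups.values())

    @property
    def num_rows(self):
        return len(self.rows)

t = MockTable([(1, "a", 4.0), (1, "b", 2.0), (2, "c", 1.0)])
sums = t.project([0, 2]).reduce_by_key(0, lambda x, y: (x[0], x[1] + y[1]))
# sums.rows -> [(1, 6.0), (2, 1.0)]
```

In the distributed setting each operator runs per-partition, but the row-level semantics are the same.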
def main(args: Array[String]) {
  val mc = new MLContext("local")

  // Read in table from file on HDFS.
  val rawTextTable = mc.textFile(args(0))

  // Run feature extraction on the raw text - get the top 30000 bigrams.
  val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

  // Cluster the data using K-Means.
  val kMeansModel = KMeans(featurizedTable, k=50)
}

Fig. 3: Loading, featurizing, and learning clusters on a corpus of Text Data.
Family | Example Uses | Returns | Semantics
Shape | dims(mat), mat.numRows, mat.numCols | Int or (Int,Int) | Matrix dimensions.
Composition | matA on matB, matA then matB | Matrix | Combine two matrices row-wise or column-wise.
Indexing | mat(0,??), mat(10,10), mat(Seq(2,4), 1) | Matrix or Scalar | Select elements or sub-matrices.
Reverse Indexing | mat(0,??).nonZeros, mat.parentRow(n), mat.parentCol(n) | Index | Look up matrix indexes in either base matrix or parent table.
Updating | mat(1,2) = 5, mat(1, Seq(3,10)) = matB | None | Assign values to elements or sub-matrices.
Arithmetic | matA + matB, matA - 5, matA / matB | Matrix | Element-wise arithmetic between matrices and scalars or matrices.
Linear Algebra | matA times matB, matA dot matB, matA.transpose, matA.solve(v), matA.svd, matA.eigen, matA.rank | Matrix or Scalar | Basic and extended linear algebra support.

Fig. 4: MLSubMatrix API Illustration
with data size, as in the case with basic linear regression. As a result, various optimization techniques are used to converge to an approximate solution while iterating over the data. We treat optimization as a first-class citizen in our API. Furthermore, we designed the Optimizer interface to allow developers to contribute new optimization algorithms to the system. Figure 13 (bottom) shows, as an example, how to implement Stochastic Gradient Descent using MLSubMatrix.
Finally, we encourage developers to implement their algorithms using the Algorithm interface, which should return a model as specified by the Model interface. An algorithm implementing the Algorithm interface is a class with a train() method that accepts data and hyperparameters as input, and produces a Model. A Model is a class which makes predictions. In the case of a classification model, this would be a predicted class given a new example point. In the case of a collaborative
MLSubMatrix
✦ Linear algebra on local partitions
✦ E.g., matrix-vector operations for mini-batch logistic regression
✦ E.g., solving linear systems of equations for Alternating Least Squares
✦ Sparse and dense matrix support
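Each ALS row update is exactly such a local solve of (YᵀY + λI)x = Yᵀm. As an illustrative pure-Python stand-in for an MLSubMatrix-style `solve` (no library assumed; the real interface is distributed Scala):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A small stand-in for a matrix .solve() method, sized for the tiny
    k x k systems that appear in each ALS row update.
    """
    n = len(A)
    # Build the augmented matrix [A | b].
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def als_row_update(Y, m, lam):
    """Closed-form update for one row: (Y^T Y + lam*I)^-1 Y^T m."""
    k = len(Y[0])
    G = [[sum(Y[r][i] * Y[r][j] for r in range(len(Y))) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    rhs = [sum(Y[r][i] * m[r] for r in range(len(Y))) for i in range(k)]
    return solve(G, rhs)
```

With Y the identity and λ = 0 the update simply returns the observed vector, which is a convenient sanity check.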
Alternating Least Squares
function [U, V] = ALS_matlab(M, U, V, k, lambda, maxiter)

  % Initialize variables
  [m,n] = size(M);
  lambI = lambda * eye(k);
  for q = 1:m
    Vinds{q} = find(M(q,:) ~= 0);
  end
  for q = 1:n
    Uinds{q} = find(M(:,q) ~= 0);
  end

  % ALS main loop
  for iter = 1:maxiter
    parfor q = 1:m
      Vq = V(Vinds{q},:);
      U(q,:) = (Vq'*Vq + lambI) \ (Vq' * M(q,Vinds{q})');
    end
    parfor q = 1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
  end
end
object BroadcastALS extends Algorithm {
  def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
      m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
    val lambI = MLSubMatrix.eye(k).mul(lambda)
    var U = MLSubMatrix.rand(m, k)
    var V = MLSubMatrix.rand(n, k)
    var U_b = trainData.context.broadcast(U)
    var V_b = trainData.context.broadcast(V)
    for (iter <- 0 until maxIter) {
      U = trainData.matrixBatchMap(localALS(_, U_b.value, lambI, k))
      U_b = trainData.context.broadcast(U)
      V = trainDataTrans.matrixBatchMap(localALS(_, V_b.value, lambI, k))
      V_b = trainData.context.broadcast(V)
    }
    new ALSModel(U, V)
  }

  def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix,
      lambI: MLSubMatrix, k: Int): MLSubMatrix = {
    var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
    for (i <- 0 until trainDataPart.numRows) {
      val q = trainDataPart.rowID(i)
      val nz_inds = trainDataPart.nzCols(q)
      val Yq = Y(trainDataPart.nzCols(q), ??)
      localX(i, ??) = ((Yq.transpose times Yq) + lambI)
        .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
    }
    localX
  }
}
Fig. 14: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
MLSolve
✦ Distributed implementations of common optimization patterns
✦ E.g., Stochastic Gradient Descent: applicable to summable ML losses
✦ E.g., L-BFGS: an approximate 2nd-order optimization method
✦ E.g., ADMM: decomposition / coordination procedure
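The "Local SGD" pattern (run SGD independently on each data partition, then average the per-partition parameters) can be sketched in a few lines of pure Python. This is an illustrative toy, not MLSolve itself; the partition layout, step size, and least-squares loss below are chosen only for the example:

```python
def local_sgd(partition, w, lr, grad):
    # Run sequential SGD over one partition, starting from the shared weights.
    for x, y in partition:
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]
    return w

def parallel_sgd(partitions, dim, lr, grad, epochs):
    """Parameter-averaging SGD: each partition trains locally, then the
    per-partition weight vectors are averaged after every epoch."""
    w = [0.0] * dim
    for _ in range(epochs):
        locals_ = [local_sgd(p, w, lr, grad) for p in partitions]
        w = [sum(ws[i] for ws in locals_) / len(locals_) for i in range(dim)]
    return w

# Per-example least-squares gradient: d/dw (w.x - y)^2 / 2
def lsq_grad(w, x, y):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

# Two "partitions" of data generated by the true weights w* = [2, -1].
parts = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)],
]
w = parallel_sgd(parts, dim=2, lr=0.1, grad=lsq_grad, epochs=200)
# w approaches [2.0, -1.0]
```

In MLI the per-partition step corresponds to `matrixBatchMap` and the averaging to a `reduce`, as in the `StochasticGradientDescent` listing below.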
Logistic Regression
function w = log_reg(X, y, maxiter, learning_rate)
  [n, d] = size(X);
  w = zeros(d,1);
  for iter = 1:maxiter
    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
  end
end

% applies sigmoid function component-wise on the vector x
function s = sigmoid(x)
  s = 1 ./ (1 + exp(-1 .* x));
end
object LogisticRegression extends Algorithm {
  def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

  def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
    x.transpose * (sigmoid(x dot w) - y)
  }

  def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
    val d = data.numCols
    val params = SGDParams(initWeights = MLSubMatrix.zeros(d, 1),
      maxIterations = p.maxIter, learningRate = p.learningRate,
      gradientFunction = gradientFunction)
    val weights = SGD(data, params)
    new LogRegModel(weights)
  }
}
object StochasticGradientDescent extends Optimizer {

  def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
      gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatrix = {
    var localWeights = weights
    for (i <- 0 until x.numRows) {
      val label = x(i, 0)
      val data = x(i, 1 to x.numCols)
      val grad = gradientFunction(localWeights, data, label)
      localWeights = localWeights - (grad * (lambda/n))
    }
    localWeights
  }

  def compute(data: MLNumericTable,
      params: StochasticGradientDescentParameters): MLSubMatrix = {
    // Initialize parameters
    var weights = params.initWeights.value
    var converged = false
    var i = 0
    val n = data.count
    while (!converged && i < params.maxIterations) {
      i += 1
      // Run SGD locally on each partition, and average resulting parameters after each iteration
      val newWeights = data
        .matrixBatchMap(localSGD(_, weights, n, params.learningRate, params.gradientFunction))
        .reduce(_ + _) / data.partitions.length
      if ((newWeights - weights).mean < params.epsilon) converged = true
      weights = newWeights
    }
    weights
  }
}
Fig. 13: Logistic Regression Code in MATLAB (top) and MLI (middle, bottom).
MLI Functionality
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, [DFC]
Clustering: K-Means, [DP-Means]
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools: Cross Validation, Evaluation Metrics
Vision | MLlib | MLI | ML Optimizer | Release Plan
Build a Classifier for X

What you want to do
What you have to do:
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, ...
✦ Potentially learn Spark/Hadoop/...
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling sizes, algorithms, features
✦ ...
and in the end: Ask For Help
Insight: A Declarative Approach
✦ End Users tell the system what they want, not how to get it
✦ SQL → Result; MQL → Model

Example: Supervised Classification (algorithm-independent)

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)
ML Optimizer: A Search Problem

[Figure: model search over time, e.g., trying SVM and Boosting within a 5min budget]

✦ System is responsible for searching through model space
✦ Opportunities for physical optimization
Systems Optimization of Model Search

Observation: We tend to be I/O bound during model training.

✦ Idea from databases: a shared cursor
✦ Single pass over the data, many models trained
✦ Example: Logistic Regression via SGD

[Figure: Query A and Query B share a single cursor over an example table]

A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
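A minimal sketch of the shared-cursor idea (hypothetical code, not MLbase's implementation): each record is read once and its gradient is applied to every candidate model in the search, so k models cost roughly one scan instead of k scans.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_scan_sgd(data, learning_rates, dim, epochs=1):
    """Train one logistic-regression model per learning rate, all within
    the same pass(es) over the data: I/O is shared across candidates."""
    models = {lr: [0.0] * dim for lr in learning_rates}
    for _ in range(epochs):
        for x, y in data:                   # single cursor over the data...
            for lr, w in models.items():    # ...updates every candidate model
                p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
                for i in range(dim):
                    w[i] -= lr * (p - y) * x[i]
    return models

# Toy linearly separable data: label = 1 iff x0 > 0.
data = [([1.0, 1.0], 1), ([2.0, -1.0], 1), ([-1.0, 1.0], 0), ([-2.0, -1.0], 0)]
models = shared_scan_sgd(data, learning_rates=[0.01, 0.1, 1.0], dim=2, epochs=50)
```

When training is I/O bound, the extra per-record compute for additional models is nearly free, which is what makes model search a systems optimization target.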
Relationship with MLI

[Stack diagram: MQL → ML Optimizer → MLI → MLlib → Spark, serving the End User and the ML Developer]

✦ MLI provides common interface for all algorithms
✦ Contracts: meta-data for algorithms written against MLI
✦ Type (e.g., classification)
✦ Parameters
✦ Runtime (e.g., O(n))
✦ Input-Specification
✦ Output-Specification
✦ ...
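As an illustration only (the contract format is not specified here, and the field names below are invented), such contract metadata might be represented and filtered like this:

```python
# Hypothetical sketch of an ML contract: metadata the optimizer could use
# to decide whether (and how) to run an algorithm written against MLI.
logistic_regression_contract = {
    "type": "classification",
    "parameters": {"learningRate": float, "maxIterations": int},
    "runtime": "O(n)",  # cost in the number of training examples
    "input_spec": "MLNumericTable with label in column 0",
    "output_spec": "Model mapping feature vector -> class",
}

def admissible(contract, task_type):
    # The optimizer keeps only algorithms whose contract matches the task.
    return contract["type"] == task_type

candidates = [c for c in [logistic_regression_contract]
              if admissible(c, "classification")]
```

The point of the contract is that the optimizer never needs to inspect algorithm code: the declared type, runtime, and I/O specifications are enough to prune and schedule the search.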
Vision | MLlib | MLI | ML Optimizer | Release Plan
Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia
First Release (Summer)
✦ MLlib: low-level ML library and underlying kernels
✦ Callable from Scala, Java
✦ Included as part of Spark
✦ MLI: API for feature extraction and ML algorithms
✦ Platform for ML development
✦ Includes a more extensive library and a faster dev-cycle than MLlib
Second Release (Winter)
✦ ML Optimizer: automated model selection
✦ Search problem over feature extractors and algorithms in MLI
✦ Contracts
✦ Restricted query language (MQL)
✦ Feature extraction for image data
Future Directions
✦ Identify minimal set of ML operators
✦ Expose internals of ML algorithms to optimizer
✦ Unified language for End Users and ML Developers
✦ Plug-ins to Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities
✦ Time-series algorithms
✦ Graphical models
✦ Advanced optimization (e.g., asynchronous computation)
✦ Online updates
✦ Sampling for efficiency
Contributions encouraged!
Berkeley, CA, August 29-30, www.mlbase.org