MLbase
Evan Sparks and Ameet Talwalkar
UC Berkeley
Three Converging Trends
✦ Big Data
✦ Distributed Computing
✦ Machine Learning
MLbase
Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan
Problem: Scalable implementations difficult for ML Developers…
[Architecture diagram: a User submits a Declarative ML Task and an ML Developer contributes ML Contract + Code to a Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics), which returns a result (e.g., fn-model & summary); the Master coordinates DMX Runtimes running on Slaves.]
Problem: ML is difficult for End Users…
✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn’t scale…
Reliable, Fast, Accurate, Provable
MLbase: ML Experts + Systems Experts
1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)
Along the way, we gain insight into data intensive computing.
Matlab Stack (Single Machine)
✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
  ✦ Higher-level abstractions for data access / processing
  ✦ More extensive functionality than Lapack
  ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python
MLbase Stack
✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
  ✦ Callable from Scala, Java
✦ MLI: API / platform for feature extraction and algorithm development
  ✦ Includes higher-level functionality with faster dev cycle than MLlib
✦ ML Optimizer: automates model selection
  ✦ Solves a search problem over feature extractors and algorithms in MLI
MLbase Stack Status
✦ Goal 1: Summer Release
✦ Goal 2: Winter Release
[Stack diagram: ML Optimizer (End User) on top of MLI (ML Developer), on top of MLlib, on top of Spark.]
Example: MLlib
✦ Goal: Classification of text file
✦ Featurize data manually
✦ Calls MLlib’s LR function

def main(args: Array[String]) {
  val sc = new SparkContext("local", "SparkLR")

  // Load data from HDFS
  val data = sc.textFile(args(0)) // RDD[String]

  // User is responsible for formatting/featurizing/normalizing their RDD!
  val featurizedData: RDD[(Double, Array[Double])] = processData(data)

  // Train the model using MLlib.
  val model = new LogisticRegressionLocalRandomSGD()
    .setStepSize(0.1)
    .setNumIterations(50)
    .train(featurizedData)
}
Example: MLI
✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
  ✦ Embed in cross-validation routine
  ✦ Use different feature extractors / algorithms or write new ones

def main(args: Array[String]) {
  val mc = new MLContext("local", "MLILR")

  // Read in file from HDFS
  val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

  // Run feature extraction
  val classes = rawTextTable(??, "class")
  val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
  val featurizedTable = classes.zip(ngrams)

  // Classify the data using Logistic Regression.
  val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
}
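The tfIdf(nGrams(...)) featurization above can be approximated in a few lines. This is a hypothetical pure-Python sketch of bigram TF-IDF, purely for intuition; it is not MLI's actual implementation, and all names here are illustrative:

```python
import math
from collections import Counter

def ngrams(text, n=2):
    """Return the list of word n-grams in one document."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(docs, n=2, top=5):
    """Map each document to a dict of TF-IDF weights over the `top`
    most frequent n-grams in the corpus (a sketch of tfIdf(nGrams(...)))."""
    grams = [Counter(ngrams(d, n)) for d in docs]
    # Vocabulary: the most common n-grams across the whole corpus.
    corpus = Counter()
    for g in grams:
        corpus.update(g)
    vocab = [w for w, _ in corpus.most_common(top)]
    # Document frequency for each retained n-gram.
    df = {w: sum(1 for g in grams if w in g) for w in vocab}
    N = len(docs)
    return [{w: g[w] * math.log(N / df[w]) for w in vocab if w in g}
            for g in grams]

features = tf_idf(["the cat sat", "the cat ran", "a dog ran"], n=2)
```

In MLI the same idea runs distributed over an RDD-backed table rather than a Python list.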
Example: ML Optimizer

var X = load("text_file", 2 to 10)
var y = load("text_file", 1)
var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
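Conceptually, doClassify hides a search over (feature extractor, algorithm) combinations built from MLI. The following is a toy stand-in for that search loop, with hypothetical component names and a trivial scorer; it is not the actual ML Optimizer:

```python
from itertools import product

def do_classify(X, y, featurizers, algorithms, scorer):
    """Try every (featurizer, algorithm) pair and keep the model with
    the best score -- a toy sketch of the ML Optimizer's search."""
    best = None
    for featurize, train in product(featurizers, algorithms):
        feats = featurize(X)
        model = train(feats, y)
        score = scorer(model, feats, y)
        if best is None or score > best[0]:
            best = (score, model)
    score, model = best
    return model, {"score": score}

# Toy components: identity features and a majority-class "classifier".
identity = lambda X: X
def majority(feats, y):
    label = max(set(y), key=y.count)
    return lambda row: label
accuracy = lambda m, feats, y: sum(m(r) == lab for r, lab in zip(feats, y)) / len(y)

model, summary = do_classify([[0], [1], [1]], [0, 1, 1],
                             [identity], [majority], accuracy)
```

A real optimizer would score on held-out data and prune the search rather than enumerate it exhaustively.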
Lay of the Land
[Chart plotting systems on two axes, ease of use vs. performance/scalability: Matlab, R; Mahout; GraphLab, VW; MLlib.]
MLlib
✦ Classification: Logistic Regression, Linear SVM (+L1, L2)
✦ Regression: Linear Regression (+Lasso, Ridge)
✦ Collaborative Filtering: Alternating Least Squares
✦ Clustering: K-Means
✦ Optimization Primitives: SGD, Parallel Gradient

Included within Spark codebase
✦ Unlike Mahout/Hadoop
✦ Part of Spark 0.8 release
✦ Continued support via Spark project
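To make the SGD primitive concrete, here is a toy single-machine sketch of logistic regression trained by stochastic gradient descent. This is for intuition only, not MLlib's code; the data layout and function name are made up:

```python
import math, random

def train_logistic_sgd(data, step_size=0.1, num_iters=50, seed=0):
    """Stochastic gradient descent on the logistic loss.
    `data` is a list of (label, features) pairs with label in {0, 1}."""
    rng = random.Random(seed)
    d = len(data[0][1])
    w = [0.0] * d
    for _ in range(num_iters):
        # One pass over the data in shuffled order per iteration.
        for label, x in rng.sample(data, len(data)):
            margin = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1.0 / (1.0 + math.exp(-margin))
            err = pred - label  # gradient scale of the logistic loss
            w = [wi - step_size * err * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable toy data: positive iff the single feature is > 0.
data = [(1, [2.0]), (1, [1.0]), (0, [-1.0]), (0, [-2.0])]
w = train_logistic_sgd(data)
```

The distributed "Parallel Gradient" primitive computes these per-example gradients on partitions of an RDD and aggregates them at the master.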
MLlib Performance
✦ Walltime: elapsed time to execute task
✦ Weak scaling
  ✦ fix problem size per processor
  ✦ ideally: constant walltime as we grow cluster
✦ Strong scaling
  ✦ fix total problem size
  ✦ ideally: linear speed up as we grow cluster
✦ EC2 Experiments
  ✦ m2.4xlarge instances, up to 32 machine clusters
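These two notions reduce to simple ratios of measured walltimes. A quick sketch with made-up example numbers (not measurements from this deck):

```python
def weak_scaling_efficiency(t1, tn):
    """Weak scaling: problem size grows with the cluster, so the ideal
    walltime is constant; efficiency = t(1 machine) / t(n machines)."""
    return t1 / tn

def strong_scaling_speedup(t1, tn):
    """Strong scaling: fixed total problem; measured speedup on n
    machines is t(1 machine) / t(n machines), ideally equal to n."""
    return t1 / tn

def strong_scaling_efficiency(t1, tn, n):
    """Fraction of the ideal linear speedup actually achieved."""
    return (t1 / tn) / n

print(weak_scaling_efficiency(100.0, 125.0))       # 0.8 -> 80% weak-scaling efficiency
print(strong_scaling_speedup(1000.0, 40.0))        # 25.0x speedup
print(strong_scaling_efficiency(1000.0, 40.0, 32)) # 25x of an ideal 32x
```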
Logistic Regression - Weak Scaling
✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW’s walltime

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase, VW, Matlab; n = 12K–200K, d = 160K).]
[Fig. 6: Weak scaling for logistic regression (relative walltime vs. # machines; MLbase, VW, Ideal).]
[Fig. 7: Walltime for strong scaling for logistic regression (MLbase, VW, Matlab; 1–32 machines).]
[Fig. 8: Strong scaling for logistic regression (speedup vs. # machines; MLbase, VW, Ideal).]

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, and never twice as fast. These times do not include the time spent preparing data for input to VW, which was significant, but we expect it would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS
System        Lines of Code
MLbase        32
GraphLab      383
Mahout        865
MATLAB-Mex    124
MATLAB        20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let M ∈ R^{m×n} be some underlying matrix and suppose that only a small subset, Ω(M), of its entries are revealed. The goal of matrix factorization is to find low-rank matrices U ∈ R^{m×k} and V ∈ R^{n×k}, where k ≪ n, m, such that M ≈ UV^T. Commonly, U and V are estimated using the following bi-convex objective:

  min_{U,V}  Σ_{(i,j)∈Ω(M)} (M_ij − u_i^T v_j)^2 + λ(||U||_F^2 + ||V||_F^2) .   (2)

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing U with V fixed, and V with U fixed. ALS is well-suited for parallelism, as each row of U can be solved independently with V fixed, and vice-versa. With V fixed, the minimization problem for each row u_i is solved with the closed-form solution

  u_i^* = (V_{Ω_i}^T V_{Ω_i} + λ I_k)^{-1} V_{Ω_i}^T M_{iΩ_i}^T ,

where u_i^* ∈ R^k is the optimal solution for the i-th row vector of U, V_{Ω_i} is a sub-matrix of rows v_j such that j ∈ Ω_i, and M_{iΩ_i} is a sub-vector of observed entries in the i-th row of M.
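The alternating closed-form updates can be sketched with NumPy. This is a minimal dense version for intuition, assuming every entry of M is observed (so the Ω mask drops out) and that NumPy is available; it is not the distributed MLlib implementation:

```python
import numpy as np

def als(M, k=2, lam=0.01, iters=50, seed=0):
    """Alternate the ridge-regression closed-form updates for U and V.
    Dense sketch: all entries of M are treated as observed."""
    m, n = M.shape
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((m, k))
    V = rng.standard_normal((n, k))
    I = lam * np.eye(k)
    for _ in range(iters):
        # u_i* = (V^T V + lam I)^-1 V^T M_i, solved for all rows at once.
        U = np.linalg.solve(V.T @ V + I, V.T @ M.T).T
        V = np.linalg.solve(U.T @ U + I, U.T @ M).T
    return U, V

# Recover a small rank-2 matrix.
M = np.outer([1.0, 2.0, 3.0], [1.0, 0.5]) + np.outer([0.5, 0.1, 0.2], [0.2, 1.0])
U, V = als(M, k=2)
err = np.linalg.norm(M - U @ V.T) / np.linalg.norm(M)
```

The parallel version exploits exactly the row-independence noted above: each worker solves the k×k systems for its partition of rows after the current factor is broadcast.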
Logis2c Regression -‐ Weak Scaling
✦ Full dataset: 200K images, 160K dense features✦ Similar weak scaling✦ MLlib within a factor of 2 of VW’s wall=me
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=6K, d=160Kn=12.5K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
MLlib
MLbase VW Matlab0
1000
2000
3000
4000
wal
ltim
e (s
)
n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K
Fig. 5: Walltime for weak scaling for logistic regression.
0 5 10 15 20 25 300
2
4
6
8
10
rela
tive
wal
ltim
e
# machines
MLbaseVWIdeal
Fig. 6: Weak scaling for logistic regression
MLbase VW Matlab0
200
400
600
800
1000
1200
1400
wal
ltim
e (s
)
1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines
Fig. 7: Walltime for strong scaling for logistic regression.
0 5 10 15 20 25 300
5
10
15
20
25
30
35
# machines
spee
dup
MLbaseVWIdeal
Fig. 8: Strong scaling for logistic regression
with respect to computation. In practice, we see comparablescaling results as more machines are added.
In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.
Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.
From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.
System Lines of CodeMLbase 32
GraphLab 383Mahout 865
MATLAB-Mex 124MATLAB 20
TABLE II: Lines of code for various implementations of ALS
B. Collaborative Filtering: Alternating Least Squares
Matrix factorization is a technique used in recommendersystems to predict user-product associations. Let M 2 Rm⇥n
be some underlying matrix and suppose that only a smallsubset, ⌦(M), of its entries are revealed. The goal of matrixfactorization is to find low-rank matrices U 2 Rm⇥k andV 2 Rn⇥k, where k ⌧ n,m, such that M ⇡ UV
T .Commonly, U and V are estimated using the following bi-convex objective:
min
U,V
X
(i,j)2⌦(M)
(Mij � U
Ti Vj)
2+ �(||U ||2F + ||V ||2F ) . (2)
Alternating least squares (ALS) is a widely used method formatrix factorization that solves (2) by alternating betweenoptimizing U with V fixed, and V with U fixed. ALS iswell-suited for parallelism, as each row of U can be solvedindependently with V fixed, and vice-versa. With V fixed, theminimization problem for each row ui is solved with the closedform solution. where u
⇤i 2 Rk is the optimal solution for the
i
th row vector of U , V⌦i is a sub-matrix of rows vj such thatj 2 ⌦i, and Mi⌦i is a sub-vector of observed entries in the
MLlib
Logistic Regression - Strong Scaling
✦ Fixed Dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines
ALS - Walltime
✦ Dataset: Scaled version of Netflix data (9X in size)
✦ Cluster: 9 machines
✦ MLlib an order of magnitude faster than Mahout
✦ MLlib within factor of 2 of GraphLab

System     Walltime (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481
Deployment Considera2ons
Deployment Considera2ons
Vowpal Wabbit, GraphLab✦ Data prepara=on specific to each program✦ Non-‐trivial setup on cluster✦ No fault tolerance
Deployment Considerations

Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on cluster
✦ No fault tolerance

MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on cluster with a few commands
✦ RDDs are fault tolerant
Vision | MLlib | MLI | ML Optimizer | Release Plan
Lay of the Land
[Chart: ease of use vs. performance/scalability, plotting Matlab and R (easy, not scalable), GraphLab and VW (scalable, harder to use), Mahout, MLlib, and MLI]
Current Options

+ Easy (resembles math, limited / no setup cost)
+ Sufficient for prototyping / writing papers
- Ad-hoc, non-scalable scripts
- Loss of translation upon re-implementation

+ Scalable and (sometimes) fast
+ Existing open-source library of ML algorithms
- Difficult to set up, extend
Examples
[MLbase architecture diagram: ML Developers contribute ML contracts + code and Users issue declarative ML tasks to a Master Server (Parser, Optimizer, Executor/Monitoring, ML Library), which executes on master and slave nodes via the DMX Runtime]

'Distributed' Divide-Factor-Combine (DFC)
✦ Initial studies in MATLAB (not distributed)
✦ Distributed prototype involving compiled MATLAB

Mahout ALS with Early Stopping
✦ Theory: simple if-statement (3 lines of code)
✦ Practice: sift through 7 files, nearly 1K lines of code
Insight: Programming Abstractions
✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting
✦ ML Developer API (MLI)
✦ Table Computation: MLTable
✦ Linear Algebra: MLSubMatrix
✦ Optimization Primitives: MLSolve
✦ MLI Examples:
✦ DFC: ~50 lines of code
✦ ALS: early stopping in 3 lines; < 40 lines total
Lines of Code

Logistic Regression
System | Lines of Code
Matlab | 11
Vowpal Wabbit | 721
MLI | 55

Alternating Least Squares
System | Lines of Code
Matlab | 20
Mahout | 865
GraphLab | 383
MLI | 32
MLI Details

OLD
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW
val x: MLTable

✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends
MLTable
✦ Flexibility when loading data✦ e.g., CSV, JSON, XML✦ Heterogenous data across
columns✦ Missing Data✦ Feature extrac2on
✦ Common Interface
✦ Supports MapReduce and Rela2onal Operators
✦ Inspired by DataFrames (R) and Pandas (Python)
Feature Extraction
Operation | Arguments | Returns | Semantics
project | Seq[Index] | MLTable | Select a subset of columns from a table.
union | MLTable | MLTable | Concatenate two tables with identical schemas.
filter | MLRow => Bool | MLTable | Select a subset of rows from a data table given a functional predicate.
join | MLTable, Seq[Index] | MLTable | Inner join of two tables based on a sequence of shared columns.
map | MLRow => MLRow | MLTable | Apply a function to each row in the table.
flatMap | MLRow => TraversableOnce[MLRow] | MLTable | Apply a function to each row, producing 0 or more rows as output.
reduce | Seq[MLRow] => MLRow | MLTable | Combine all rows in a table using an associative, commutative reduce function.
reduceByKey | Int, Seq[MLRow] => MLRow | MLTable | Combine all rows in a table using an associative, commutative reduce function on a key-by-key basis, where a key column is the first argument.
matrixBatchMap | MLSubMatrix => MLSubMatrix | MLNumericTable | Execute a batch function on a local partition of the data. Output matrices are concatenated to form a new table.
numRows | None | Long | Returns the number of rows in the table.
numCols | None | Long | Returns the number of columns in the table.

Fig. 2: MLTable API Illustration. This table captures core operations of the API and is not exhaustive.
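To make the operator semantics above concrete outside the distributed setting, here is a hypothetical single-node mock in Python (class and method names are invented for illustration; the real MLTable is a distributed Scala interface):

```python
class MockTable:
    """Toy, in-memory stand-in for a few MLTable-style operators."""

    def __init__(self, rows):
        self.rows = list(rows)  # each row is a tuple

    def project(self, indices):
        # Select a subset of columns from the table.
        return MockTable([tuple(r[i] for i in indices) for r in self.rows])

    def filter(self, pred):
        # Select a subset of rows given a functional predicate.
        return MockTable([r for r in self.rows if pred(r)])

    def map(self, f):
        # Apply a function to each row in the table.
        return MockTable([f(r) for r in self.rows])

    def reduce_by_key(self, key_col, f):
        # Combine rows sharing a key with an associative, commutative function.
        groups = {}
        for r in self.rows:
            k = r[key_col]
            groups[k] = f(groups[k], r) if k in groups else r
        return MockTable(groups.values())

    @property
    def num_rows(self):
        return len(self.rows)

t = MockTable([(1, "a", 4.0), (1, "b", 2.0), (2, "c", 1.0)])
sums = t.project([0, 2]).reduce_by_key(0, lambda x, y: (x[0], x[1] + y[1]))
# sums.rows -> [(1, 6.0), (2, 1.0)]
```

In the distributed setting each operator runs per-partition, but the row-level semantics are the same.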
def main(args: Array[String]) {
  val mc = new MLContext("local")

  // Read in table from file on HDFS.
  val rawTextTable = mc.textFile(args(0))

  // Run feature extraction on the raw text - get the top 30000 bigrams.
  val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

  // Cluster the data using K-Means.
  val kMeansModel = KMeans(featurizedTable, k=50)
}

Fig. 3: Loading, featurizing, and learning clusters on a corpus of Text Data.
Family | Example Uses | Returns | Semantics
Shape | dims(mat), mat.numRows, mat.numCols | Int or (Int,Int) | Matrix dimensions.
Composition | matA on matB, matA then matB | Matrix | Combine two matrices row-wise or column-wise.
Indexing | mat(0,??), mat(10,10), mat(Seq(2,4), 1) | Matrix or Scalar | Select elements or sub-matrices.
Reverse Indexing | mat(0,??).nonZeros, mat.parentRow(n), mat.parentCol(n) | Index | Look up matrix indexes in either base matrix or parent table.
Updating | mat(1,2) = 5, mat(1, Seq(3,10)) = matB | None | Assign values to elements or sub-matrices.
Arithmetic | matA + matB, matA - 5, matA / matB | Matrix | Element-wise arithmetic between matrices and scalars or matrices.
Linear Algebra | matA times matB, matA dot matB, matA.transpose, matA.solve(v), matA.svd, matA.eigen, matA.rank | Matrix or Scalar | Basic and extended linear algebra support.

Fig. 4: MLSubMatrix API Illustration
with data size, as in the case with basic linear regression. As a result, various optimization techniques are used to converge to an approximate solution while iterating over the data. We treat optimization as a first-class citizen in our API. Furthermore, we designed the Optimizer interface to allow developers to contribute new optimization algorithms to the system. Figure 13 (bottom) shows, as an example, how to implement Stochastic Gradient Descent using MLSubMatrix.
Finally, we encourage developers to implement their algorithms using the Algorithm interface, which should return a model as specified by the Model interface. An algorithm implementing the Algorithm interface is a class with a train() method that accepts data and hyperparameters as input, and produces a Model. A Model is a class which makes predictions. In the case of a classification model, this would be a predicted class given a new example point. In the case of a collaborative
MLSubMatrix
✦ Linear algebra on local partitions
✦ E.g., matrix-vector operations for mini-batch logistic regression
✦ E.g., solving linear systems of equations for Alternating Least Squares
✦ Sparse and dense matrix support
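Each ALS row update is exactly such a local solve of (YᵀY + λI)x = Yᵀm. As an illustrative pure-Python stand-in for an MLSubMatrix-style `solve` (no library assumed; the real interface is distributed Scala):

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    A small stand-in for a matrix .solve() method, sized for the tiny
    k x k systems that appear in each ALS row update.
    """
    n = len(A)
    # Build the augmented matrix [A | b].
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def als_row_update(Y, m, lam):
    """Closed-form update for one row: (Y^T Y + lam*I)^-1 Y^T m."""
    k = len(Y[0])
    G = [[sum(Y[r][i] * Y[r][j] for r in range(len(Y))) + (lam if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    rhs = [sum(Y[r][i] * m[r] for r in range(len(Y))) for i in range(k)]
    return solve(G, rhs)
```

With Y the identity and λ = 0 the update simply returns the observed vector, which is a convenient sanity check.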
Alternating Least Squares
function [U, V] = ALS_matlab(M, U, V, k, lambda, maxiter)

  % Initialize variables
  [m,n] = size(M);
  lambI = lambda * eye(k);
  for q = 1:m
    Vinds{q} = find(M(q,:) ~= 0);
  end
  for q = 1:n
    Uinds{q} = find(M(:,q) ~= 0);
  end

  % ALS main loop
  for iter = 1:maxiter
    parfor q = 1:m
      Vq = V(Vinds{q},:);
      U(q,:) = (Vq'*Vq + lambI) \ (Vq' * M(q,Vinds{q})');
    end
    parfor q = 1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
  end
end
object BroadcastALS extends Algorithm {
  def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
      m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
    val lambI = MLSubMatrix.eye(k).mul(lambda)
    var U = MLSubMatrix.rand(m, k)
    var V = MLSubMatrix.rand(n, k)
    var U_b = trainData.context.broadcast(U)
    var V_b = trainData.context.broadcast(V)
    for (iter <- 0 until maxIter) {
      U = trainData.matrixBatchMap(localALS(_, U_b.value, lambI, k))
      U_b = trainData.context.broadcast(U)
      V = trainDataTrans.matrixBatchMap(localALS(_, V_b.value, lambI, k))
      V_b = trainData.context.broadcast(V)
    }
    new ALSModel(U, V)
  }

  def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix,
      lambI: MLSubMatrix, k: Int): MLSubMatrix = {
    var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
    for (i <- 0 until trainDataPart.numRows) {
      val q = trainDataPart.rowID(i)
      val nz_inds = trainDataPart.nzCols(q)
      val Yq = Y(trainDataPart.nzCols(q), ??)
      localX(i, ??) = ((Yq.transpose times Yq) + lambI)
        .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
    }
    localX
  }
}
Fig. 14: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).
MLSolve
✦ Distributed implementations of common optimization patterns
✦ E.g., Stochastic Gradient Descent: applicable to summable ML losses
✦ E.g., L-BFGS: an approximate 2nd-order optimization method
✦ E.g., ADMM: decomposition / coordination procedure
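The "Local SGD" pattern (run SGD independently on each data partition, then average the per-partition parameters) can be sketched in a few lines of pure Python. This is an illustrative toy, not MLSolve itself; the partition layout, step size, and least-squares loss below are chosen only for the example:

```python
def local_sgd(partition, w, lr, grad):
    # Run sequential SGD over one partition, starting from the shared weights.
    for x, y in partition:
        w = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]
    return w

def parallel_sgd(partitions, dim, lr, grad, epochs):
    """Parameter-averaging SGD: each partition trains locally, then the
    per-partition weight vectors are averaged after every epoch."""
    w = [0.0] * dim
    for _ in range(epochs):
        locals_ = [local_sgd(p, w, lr, grad) for p in partitions]
        w = [sum(ws[i] for ws in locals_) / len(locals_) for i in range(dim)]
    return w

# Per-example least-squares gradient: d/dw (w.x - y)^2 / 2
def lsq_grad(w, x, y):
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [err * xi for xi in x]

# Two "partitions" of data generated by the true weights w* = [2, -1].
parts = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([1.0, -1.0], 3.0)],
]
w = parallel_sgd(parts, dim=2, lr=0.1, grad=lsq_grad, epochs=200)
# w approaches [2.0, -1.0]
```

In MLI the per-partition step corresponds to `matrixBatchMap` and the averaging to a `reduce`, as in the `StochasticGradientDescent` listing below.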
Logistic Regression
function w = log_reg(X, y, maxiter, learning_rate)
  [n, d] = size(X);
  w = zeros(d,1);
  for iter = 1:maxiter
    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
  end
end

% applies sigmoid function component-wise on the vector x
function s = sigmoid(x)
  s = 1 ./ (1 + exp(-1 .* x));
end
object LogisticRegression extends Algorithm {
  def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

  def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
    x.transpose * (sigmoid(x dot w) - y)
  }

  def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
    val d = data.numCols
    val params = SGDParams(initWeights = MLSubMatrix.zeros(d, 1),
      maxIterations = p.maxIter, learningRate = p.learningRate,
      gradientFunction = gradientFunction)
    val weights = SGD(data, params)
    new LogRegModel(weights)
  }
}
object StochasticGradientDescent extends Optimizer {

  def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
      gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatrix = {
    var localWeights = weights
    for (i <- 0 until x.numRows) {
      val label = x(i, 0)
      val data = x(i, 1 to x.numCols)
      val grad = gradientFunction(localWeights, data, label)
      localWeights = localWeights - (grad * (lambda/n))
    }
    localWeights
  }

  def compute(data: MLNumericTable,
      params: StochasticGradientDescentParameters): MLSubMatrix = {
    // Initialize parameters
    var weights = params.initWeights.value
    var converged = false
    var i = 0
    val n = data.count
    while (!converged && i < params.maxIterations) {
      i += 1
      // Run SGD locally on each partition, and average resulting parameters after each iteration
      val newWeights = data
        .matrixBatchMap(localSGD(_, weights, n, params.learningRate, params.gradientFunction))
        .reduce(_ + _) / data.partitions.length
      if ((newWeights - weights).mean < params.epsilon) converged = true
      weights = newWeights
    }
    weights
  }
}
Fig. 13: Logistic Regression Code in MATLAB (top) and MLI (middle, bottom).
MLI Functionality
Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, [DFC]
Clustering: K-Means, [DP-Means]
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools: Cross Validation, Evaluation Metrics
Vision | MLlib | MLI | ML Optimizer | Release Plan
Build a Classifier for X

What you want to do
What you have to do:
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, ...
✦ Potentially learn Spark/Hadoop/...
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling sizes, algorithms, features
✦ ...
and in the end: Ask For Help
Insight: A Declarative Approach
✦ End Users tell the system what they want, not how to get it
✦ SQL → Result; MQL → Model

Example: Supervised Classification (algorithm-independent)

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)
ML Optimizer: A Search Problem

[Figure: model search over time, e.g., trying SVM and Boosting within a 5min budget]

✦ System is responsible for searching through model space
✦ Opportunities for physical optimization
Systems Optimization of Model Search

Observation: We tend to be I/O bound during model training.

✦ Idea from databases: a shared cursor
✦ Single pass over the data, many models trained
✦ Example: Logistic Regression via SGD

[Figure: Query A and Query B share a single cursor over an example table]

A B C
1 a Dog
1 b Cat
2 c Cat
2 d Cat
3 e Dog
3 f Horse
4 g Monkey
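A minimal sketch of the shared-cursor idea (hypothetical code, not MLbase's implementation): each record is read once and its gradient is applied to every candidate model in the search, so k models cost roughly one scan instead of k scans.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_scan_sgd(data, learning_rates, dim, epochs=1):
    """Train one logistic-regression model per learning rate, all within
    the same pass(es) over the data: I/O is shared across candidates."""
    models = {lr: [0.0] * dim for lr in learning_rates}
    for _ in range(epochs):
        for x, y in data:                   # single cursor over the data...
            for lr, w in models.items():    # ...updates every candidate model
                p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
                for i in range(dim):
                    w[i] -= lr * (p - y) * x[i]
    return models

# Toy linearly separable data: label = 1 iff x0 > 0.
data = [([1.0, 1.0], 1), ([2.0, -1.0], 1), ([-1.0, 1.0], 0), ([-2.0, -1.0], 0)]
models = shared_scan_sgd(data, learning_rates=[0.01, 0.1, 1.0], dim=2, epochs=50)
```

When training is I/O bound, the extra per-record compute for additional models is nearly free, which is what makes model search a systems optimization target.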
Relationship with MLI

[Stack diagram: MQL → ML Optimizer → MLI → MLlib → Spark, serving the End User and the ML Developer]

✦ MLI provides common interface for all algorithms
✦ Contracts: meta-data for algorithms written against MLI
✦ Type (e.g., classification)
✦ Parameters
✦ Runtime (e.g., O(n))
✦ Input-Specification
✦ Output-Specification
✦ ...
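As an illustration only (the contract format is not specified here, and the field names below are invented), such contract metadata might be represented and filtered like this:

```python
# Hypothetical sketch of an ML contract: metadata the optimizer could use
# to decide whether (and how) to run an algorithm written against MLI.
logistic_regression_contract = {
    "type": "classification",
    "parameters": {"learningRate": float, "maxIterations": int},
    "runtime": "O(n)",  # cost in the number of training examples
    "input_spec": "MLNumericTable with label in column 0",
    "output_spec": "Model mapping feature vector -> class",
}

def admissible(contract, task_type):
    # The optimizer keeps only algorithms whose contract matches the task.
    return contract["type"] == task_type

candidates = [c for c in [logistic_regression_contract]
              if admissible(c, "classification")]
```

The point of the contract is that the optimizer never needs to inspect algorithm code: the declared type, runtime, and I/O specifications are enough to prune and schedule the search.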
Vision | MLlib | MLI | ML Optimizer | Release Plan
Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia
First Release (Summer)
✦ MLlib: low-level ML library and underlying kernels
✦ Callable from Scala, Java
✦ Included as part of Spark
✦ MLI: API for feature extraction and ML algorithms
✦ Platform for ML development
✦ Includes a more extensive library and a faster dev-cycle than MLlib
Second Release (Winter)
✦ ML Optimizer: automated model selection
✦ Search problem over feature extractors and algorithms in MLI
✦ Contracts
✦ Restricted query language (MQL)
✦ Feature extraction for image data
Future Directions
✦ Identify minimal set of ML operators
✦ Expose internals of ML algorithms to optimizer
✦ Unified language for End Users and ML Developers
✦ Plug-ins to Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities
✦ Time-series algorithms
✦ Graphical models
✦ Advanced optimization (e.g., asynchronous computation)
✦ Online updates
✦ Sampling for efficiency
Contributions encouraged!
Berkeley, CA, August 29-30, www.mlbase.org