MLlib sparkmeetup_8_6_13_final_reduced

Post on 27-Jan-2015


Transcript of MLlib sparkmeetup_8_6_13_final_reduced

Evan Sparks and Ameet Talwalkar, UC Berkeley

MLbase

Three Converging Trends

✦ Big Data
✦ Distributed Computing
✦ Machine Learning

MLbase sits at the intersection of these trends.

Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan

Problem: Scalable implementations are difficult for ML Developers…

[Architecture diagram: a User submits a Declarative ML Task and an ML Developer contributes ML Contract + Code to a Master Server (Parser, Optimizer, Executor/Monitoring, ML Library, Meta-Data, Statistics); the Master returns a result (e.g., fn-model & summary) and distributes work to Slaves running the DMX Runtime (LLP, PLP).]


Problem: ML is difficult for End Users…

✦ Too many algorithms…
✦ Too many knobs…
✦ Difficult to debug…
✦ Doesn't scale…

Desired properties: Reliable, Fast, Accurate, Provable

MLbase: ML Experts + Systems Experts

1. Easy scalable ML development (ML Developers)
2. User-friendly ML at scale (End Users)

Along the way, we gain insight into data-intensive computing.

Matlab Stack (Single Machine)

✦ Lapack: low-level Fortran linear algebra library
✦ Matlab Interface
    ✦ Higher-level abstractions for data access / processing
    ✦ More extensive functionality than Lapack
    ✦ Leverages Lapack whenever possible
✦ Similar stories for R and Python

MLbase Stack

✦ Spark: cluster computing system designed for iterative computation
✦ MLlib: low-level ML library in Spark
    ✦ Callable from Scala, Java
✦ MLI: API / platform for feature extraction and algorithm development
    ✦ Includes higher-level functionality with a faster dev cycle than MLlib
✦ ML Optimizer: automates model selection
    ✦ Solves a search problem over feature extractors and algorithms in MLI

MLbase Stack Status

Stack (top to bottom): ML Optimizer (End User), MLI, MLlib (ML Developer), Spark

✦ Goal 1: Summer Release
✦ Goal 2: Winter Release

Example: MLlib

✦ Goal: Classification of a text file
✦ Featurize data manually
✦ Calls MLlib's LR function

    def main(args: Array[String]) {
      val sc = new SparkContext("local", "SparkLR")

      // Load data from HDFS
      val data = sc.textFile(args(0)) // RDD[String]

      // User is responsible for formatting/featurizing/normalizing their RDD!
      val featurizedData: RDD[(Double, Array[Double])] = processData(data)

      // Train the model using MLlib.
      val model = new LogisticRegressionLocalRandomSGD()
        .setStepSize(0.1)
        .setNumIterations(50)
        .train(featurizedData)
    }

Example: MLI

✦ Use built-in feature extraction functionality
✦ MLI Logistic Regression leverages MLlib
✦ Extensions:
    ✦ Embed in a cross-validation routine
    ✦ Use different feature extractors / algorithms or write new ones

    def main(args: Array[String]) {
      val mc = new MLContext("local", "MLILR")

      // Read in file from HDFS
      val rawTextTable = mc.csvFile(args(0), Seq("class", "text"))

      // Run feature extraction
      val classes = rawTextTable(??, "class")
      val ngrams = tfIdf(nGrams(rawTextTable(??, "text"), n = 2, top = 30000))
      val featurizedTable = classes.zip(ngrams)

      // Classify the data using Logistic Regression.
      val lrModel = LogisticRegression(featurizedTable, stepSize = 0.1, numIter = 12)
    }
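One of the extensions above, embedding a learner in a cross-validation routine, can be sketched generically. This is a toy illustration only: `trainFn` and `errorFn` are hypothetical stand-ins for a learner and a loss, not MLI APIs.

```scala
object CrossVal {
  type Point = (Double, Array[Double]) // (label, features)

  // Average held-out error over k folds; point i is assigned to fold i % k.
  def kFold[M](data: Vector[Point], k: Int,
               trainFn: Vector[Point] => M,
               errorFn: (M, Vector[Point]) => Double): Double = {
    val folds = (0 until k).map { f =>
      data.zipWithIndex.collect { case (p, i) if i % k == f => p }.toVector
    }
    val errs = (0 until k).map { f =>
      val test  = folds(f)
      val train = (0 until k).filter(_ != f).flatMap(folds).toVector
      errorFn(trainFn(train), test)
    }
    errs.sum / k
  }
}
```

In MLI the learner plugged in here would be the `LogisticRegression` call from the listing above; any feature extractor / algorithm combination fits the same shape.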

Example: ML Optimizer

    var X = load("text_file", 2 to 10)
    var y = load("text_file", 1)
    var (fn-model, summary) = doClassify(X, y)

✦ User declaratively specifies task
✦ ML Optimizer searches through MLI
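The search behind a call like `doClassify` can be pictured as a loop over candidate pipelines that keeps the one with the lowest validation error. A toy sketch; all names here are illustrative, not the MLbase API:

```scala
object OptimizerSketch {
  // fit() trains one candidate (feature extractor + algorithm + hyperparameters)
  // and returns its validation error.
  case class Candidate(description: String, fit: () => Double)

  // Evaluate every candidate and return the best-performing one.
  def search(candidates: Seq[Candidate]): (String, Double) = {
    val scored = candidates.map(c => (c.description, c.fit()))
    scored.minBy(_._2)
  }
}
```

A real optimizer would prune this search rather than evaluate exhaustively; the sketch only shows the contract: declarative task in, best-found model out.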

Outline: Vision, MLlib, MLI, ML Optimizer, Release Plan

Lay of the Land

[Chart: systems plotted by Ease of use (y-axis) vs. Performance/Scalability (x-axis): Matlab, R; Mahout; GraphLab, VW; MLlib]

MLlib

✦ Classification: Logistic Regression, Linear SVM (+L1, L2)
✦ Regression: Linear Regression (+Lasso, Ridge)
✦ Collaborative Filtering: Alternating Least Squares
✦ Clustering: K-Means
✦ Optimization Primitives: SGD, Parallel Gradient

Included within the Spark codebase
✦ Unlike Mahout/Hadoop
✦ Part of the Spark 0.8 release
✦ Continued support via the Spark project
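To make the optimization primitives concrete, here is a toy, pure-Scala illustration of gradient-based logistic regression. It uses full-batch gradient descent rather than MLlib's SGD, and `train`/`predict` are hypothetical helpers, not MLlib APIs:

```scala
object LrSketch {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // data: (label in {0,1}, dense features); returns learned weights.
  def train(data: Seq[(Double, Array[Double])],
            stepSize: Double, numIters: Int): Array[Double] = {
    val d = data.head._2.length
    val w = Array.fill(d)(0.0)
    for (_ <- 0 until numIters) {
      // Full-batch gradient of the logistic loss (SGD would sample points).
      val grad = Array.fill(d)(0.0)
      for ((y, x) <- data) {
        val err = sigmoid(x.zip(w).map { case (a, b) => a * b }.sum) - y
        for (j <- 0 until d) grad(j) += err * x(j)
      }
      for (j <- 0 until d) w(j) -= stepSize * grad(j) / data.size
    }
    w
  }

  def predict(w: Array[Double], x: Array[Double]): Double =
    sigmoid(x.zip(w).map { case (a, b) => a * b }.sum)
}
```

MLlib's actual implementations distribute the gradient computation across the cluster (the "Parallel Gradient" primitive); the update rule is the same shape.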

MLlib Performance

✦ Walltime: elapsed time to execute a task
✦ Weak scaling
    ✦ fix problem size per processor
    ✦ ideally: constant walltime as we grow the cluster
✦ Strong scaling
    ✦ fix total problem size
    ✦ ideally: linear speedup as we grow the cluster
✦ EC2 experiments
    ✦ m2.4xlarge instances, up to 32-machine clusters
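As a concrete reading of these two definitions, the metrics plotted in the figures below can be computed from measured walltimes like so (the numbers in the usage example are made up for illustration):

```scala
object ScalingMetrics {
  // Strong scaling: fixed TOTAL problem size.
  // speedup(n) = t(1 machine) / t(n machines); ideal value is n.
  def speedup(t1: Double, tN: Double): Double = t1 / tN

  // Weak scaling: fixed PER-MACHINE problem size.
  // relative walltime = t(n machines) / t(1 machine); ideal value is 1.0.
  def relativeWalltime(t1: Double, tN: Double): Double = tN / t1
}
```

For example, a job taking 1600 s on one machine and 100 s on 16 machines has a strong-scaling speedup of 16 (ideal); a relative walltime of 1.2 at some cluster size means a 20% deviation from ideal weak scaling.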

Logistic Regression - Weak Scaling

✦ Full dataset: 200K images, 160K dense features
✦ Similar weak scaling
✦ MLlib within a factor of 2 of VW's walltime

[Fig. 5: Walltime for weak scaling for logistic regression (MLbase, VW, Matlab; dataset sizes up to n = 200K, d = 160K).]
[Fig. 6: Weak scaling for logistic regression (relative walltime vs. # machines; MLbase, VW, Ideal).]

Logistic Regression - Strong Scaling

✦ Fixed dataset: 50K images, 160K dense features
✦ MLlib exhibits better scaling properties
✦ MLlib faster than VW with 16 and 32 machines

[Fig. 7: Walltime for strong scaling for logistic regression (MLbase, VW, Matlab; 1 to 32 machines).]
[Fig. 8: Strong scaling for logistic regression (speedup vs. # machines; MLbase, VW, Ideal).]

From the accompanying paper:

In MATLAB, we implement gradient descent instead of SGD, as gradient descent requires roughly the same number of numeric operations as SGD but does not require an inner loop to pass over the data. It can thus be implemented in a 'vectorized' fashion, which leads to a significantly more favorable runtime. Moreover, while we are not adding additional processing units to MATLAB as we scale the dataset size, we show MATLAB's performance here as a reference for training a model on a similarly sized dataset on a single multicore machine.

Results: In our weak scaling experiments (Figures 5 and 6), we can see that our clustered system begins to outperform MATLAB at even moderate levels of data, and while MATLAB runs out of memory and cannot complete the experiment on the 200K-point dataset, our system finishes in less than 10 minutes. Moreover, the highly specialized VW is on average 35% faster than our system, but never twice as fast. These times do not include the time spent preparing input data for VW, which was significant, but we expect that this would be a one-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8), our solution actually outperforms VW in raw time to train a model on a fixed dataset size when using 16 and 32 machines, and exhibits stronger scaling properties, much closer to the gold standard of linear scaling for these algorithms. We are unsure whether this is due to our simpler (broadcast/gather) communication paradigm or some other property of the system.

TABLE II: Lines of code for various implementations of ALS

System         Lines of Code
MLbase         32
GraphLab       383
Mahout         865
MATLAB-Mex     124
MATLAB         20

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - u_i^T v_j)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right). \tag{2}$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = \left( V_{\Omega_i}^T V_{\Omega_i} + \lambda I_k \right)^{-1} V_{\Omega_i}^T M_{i\Omega_i},$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of observed entries in the $i$-th row of $M$.
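The closed-form row update described above can be sketched in plain Scala. This is a toy illustration, not the MLlib implementation: `updateRow` is a hypothetical helper, and the small k × k system is solved with naive Gaussian elimination rather than a tuned linear-algebra routine.

```scala
object AlsStep {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Solve A x = b for a small symmetric positive-definite A
  // via Gaussian elimination on the augmented matrix [A | b].
  def solve(a: Mat, b: Vec): Vec = {
    val n = b.length
    val m = Array.tabulate(n, n + 1)((i, j) => if (j < n) a(i)(j) else b(i))
    for (p <- 0 until n) {
      val piv = m(p)(p)
      for (j <- p to n) m(p)(j) /= piv
      for (i <- 0 until n if i != p) {
        val f = m(i)(p)
        for (j <- p to n) m(i)(j) -= f * m(p)(j)
      }
    }
    Array.tabulate(n)(i => m(i)(n))
  }

  // One ALS row update: u_i = (V^T V + lambda I_k)^(-1) V^T m, where
  // vRows are the rows v_j with j in Omega_i and ratings is M_{i,Omega_i}.
  def updateRow(vRows: Mat, ratings: Vec, lambda: Double): Vec = {
    val k = vRows(0).length
    val gram = Array.tabulate(k, k) { (a, b) =>
      vRows.map(v => v(a) * v(b)).sum + (if (a == b) lambda else 0.0)
    }
    val rhs = Array.tabulate(k) { a =>
      vRows.zip(ratings).map { case (v, r) => v(a) * r }.sum
    }
    solve(gram, rhs)
  }
}
```

Because each row update touches only $V_{\Omega_i}$ and the observed entries of row $i$, all rows of $U$ can be updated in parallel, which is what makes ALS a natural fit for Spark.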

MLbase VW Matlab0

1000

2000

3000

4000

wal

ltim

e (s

)

n=12K, d=160Kn=25K, d=160Kn=50K, d=160Kn=100K, d=160Kn=200K, d=160K

Fig. 5: Walltime for weak scaling for logistic regression.

0 5 10 15 20 25 300

2

4

6

8

10

rela

tive

wal

ltim

e# machines

MLbaseVWIdeal

Fig. 6: Weak scaling for logistic regression

MLbase VW Matlab0

200

400

600

800

1000

1200

1400

wal

ltim

e (s

)

1 Machine2 Machines4 Machines8 Machines16 Machines32 Machines

Fig. 7: Walltime for strong scaling for logistic regression.

0 5 10 15 20 25 300

5

10

15

20

25

30

35

# machines

spee

dup

MLbaseVWIdeal

Fig. 8: Strong scaling for logistic regression

with respect to computation. In practice, we see comparablescaling results as more machines are added.

In MATLAB, we implement gradient descent instead ofSGD, as gradient descent requires roughly the same numberof numeric operations as SGD but does not require an innerloop to pass over the data. It can thus be implemented in a’vectorized’ fashion, which leads to a significantly more favor-able runtime. Moreover, while we are not adding additionalprocessing units to MATLAB as we scale the dataset size, weshow MATLAB’s performance here as a reference for traininga model on a similarly sized dataset on a single multicoremachine.

Results: In our weak scaling experiments (Figures 5 and6), we can see that our clustered system begins to outperformMATLAB at even moderate levels of data, and while MATLABruns out of memory and cannot complete the experiment onthe 200K point dataset, our system finishes in less than 10minutes. Moreover, the highly specialized VW is on average35% faster than our system, and never twice as fast. Thesetimes do not include time spent preparing data for input inputfor VW, which was significant, but expect that they’d be aone-time cost in a fully deployed environment.

From the perspective of strong scaling (Figures 7 and 8),our solution actually outperforms VW in raw time to train amodel on a fixed dataset size when using 16 and 32 machines,and exhibits stronger scaling properties, much closer to thegold standard of linear scaling for these algorithms. We areunsure whether this is due to our simpler (broadcast/gather)communication paradigm, or some other property of the sys-tem.

System Lines of CodeMLbase 32

GraphLab 383Mahout 865

MATLAB-Mex 124MATLAB 20

TABLE II: Lines of code for various implementations of ALS

B. Collaborative Filtering: Alternating Least Squares

Matrix factorization is a technique used in recommender systems to predict user-product associations. Let $M \in \mathbb{R}^{m \times n}$ be some underlying matrix and suppose that only a small subset, $\Omega(M)$, of its entries are revealed. The goal of matrix factorization is to find low-rank matrices $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$, where $k \ll n, m$, such that $M \approx UV^T$. Commonly, $U$ and $V$ are estimated using the following bi-convex objective:

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} (M_{ij} - U_i^T V_j)^2 + \lambda(\|U\|_F^2 + \|V\|_F^2)\,. \quad (2)$$

Alternating least squares (ALS) is a widely used method for matrix factorization that solves (2) by alternating between optimizing $U$ with $V$ fixed, and $V$ with $U$ fixed. ALS is well-suited for parallelism, as each row of $U$ can be solved independently with $V$ fixed, and vice-versa. With $V$ fixed, the minimization problem for each row $u_i$ is solved with the closed-form solution

$$u_i^* = (V_{\Omega_i}^T V_{\Omega_i} + \lambda I)^{-1} V_{\Omega_i}^T M_{i\Omega_i}^T\,,$$

where $u_i^* \in \mathbb{R}^k$ is the optimal solution for the $i$-th row vector of $U$, $V_{\Omega_i}$ is a sub-matrix of rows $v_j$ such that $j \in \Omega_i$, and $M_{i\Omega_i}$ is a sub-vector of the observed entries in the $i$-th row of $M$.
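The closed-form row update can be checked numerically: for fixed $V$, it exactly minimizes the regularized least-squares subproblem for one row. A small pure-Python sketch with $k = 2$ (Cramer's rule stands in for a library solver; all names and data here are illustrative):

```python
def solve2(A, b):
    """Solve a 2x2 linear system A x = b via Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(b[0] * A[1][1] - b[1] * A[0][1]) / det,
            (A[0][0] * b[1] - A[1][0] * b[0]) / det]

def als_row_update(V, m, lam):
    """Closed-form ALS update for one row of U with k = 2:
    u_i = (V^T V + lam*I)^{-1} V^T m, where V holds the factor
    rows for the observed entries m of row i."""
    k = 2
    VtV = [[sum(v[a] * v[b] for v in V) + (lam if a == b else 0.0)
            for b in range(k)] for a in range(k)]
    Vtm = [sum(v[a] * mi for v, mi in zip(V, m)) for a in range(k)]
    return solve2(VtV, Vtm)

def objective(u, V, m, lam):
    """Regularized least-squares objective for a single row u_i."""
    return (sum((mi - sum(ua * va for ua, va in zip(u, v))) ** 2
                for v, mi in zip(V, m))
            + lam * sum(ua ** 2 for ua in u))

V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # factor rows for observed entries
m = [2.0, 3.0, 5.0]                       # observed entries of row i
u_star = als_row_update(V, m, lam=0.1)
base = objective(u_star, V, m, 0.1)       # minimal by construction
```

Perturbing `u_star` in any direction can only increase `base`, since the subproblem is a strictly convex quadratic.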

MLlib

ALS - Walltime
✦ Dataset: Scaled version of Netflix data (9X in size)
✦ Cluster: 9 machines
✦ MLlib an order of magnitude faster than Mahout
✦ MLlib within factor of 2 of GraphLab

System     Walltime (seconds)
Matlab     15443
Mahout     4206
GraphLab   291
MLlib      481

Deployment Considerations

Vowpal Wabbit, GraphLab
✦ Data preparation specific to each program
✦ Non-trivial setup on cluster
✦ No fault tolerance

MLlib
✦ Reads files from HDFS
✦ Launch/compile/run on cluster with a few commands
✦ RDDs are fault tolerant

Vision / MLlib / MLI / ML Optimizer / Release Plan

Lay of the Land

(Chart plotting systems by ease of use against performance/scalability: Matlab and R score high on ease of use but low on scalability, Mahout low on both axes, GraphLab and VW high on performance/scalability but harder to use, with MLlib and MLI positioned to combine both.)

Current Options

 +  Easy (resembles math, limited / no set-up cost)
 +  Sufficient for prototyping / writing papers
 -  Ad-hoc, non-scalable scripts
 -  Loss of translation upon re-implementation

 +  Scalable and (sometimes) fast
 +  Existing open-source library of ML algorithms
 -  Difficult to set up, extend

Examples

(MLbase architecture diagram: the End User issues a declarative ML task to the master server, whose parser, optimizer, and executor/monitoring run over the ML library and distributed runtimes on master and slaves; the ML Developer contributes ML contracts and code.)

✦ 'Distributed' Divide-Factor-Combine (DFC)
  ✦ Initial studies in MATLAB (not distributed)
  ✦ Distributed prototype involving compiled MATLAB
✦ Mahout ALS with Early Stopping
  ✦ Theory: simple if-statement (3 lines of code)
  ✦ Practice: sift through 7 files, nearly 1K lines of code

Insight: Programming Abstractions
✦ Shield ML Developers from low-level details: provide familiar mathematical operators in a distributed setting
✦ ML Developer API (MLI)
  ✦ Table Computation: MLTable
  ✦ Linear Algebra: MLSubMatrix
  ✦ Optimization Primitives: MLSolve
✦ MLI Examples:
  ✦ DFC: ~50 lines of code
  ✦ ALS: early stopping in 3 lines; < 40 lines total

Lines of Code

Logistic Regression
System          Lines of Code
Matlab          11
Vowpal Wabbit   721
MLI             55

Alternating Least Squares
System     Lines of Code
Matlab     20
Mahout     865
GraphLab   383
MLI        32

MLI Details

OLD
val x: RDD[Array[Double]]
val x: RDD[spark.util.Vector]
val x: RDD[breeze.linalg.Vector]
val x: RDD[BIDMat.SMat]

NEW
val x: MLTable

✦ Generic interface for feature extraction
✦ Common interface to support an optimizer
✦ Abstract interface for arbitrary backends

MLTable
✦ Flexibility when loading data
  ✦ e.g., CSV, JSON, XML
  ✦ Heterogeneous data across columns
  ✦ Missing data
  ✦ Feature extraction
✦ Common interface
✦ Supports MapReduce and relational operators
✦ Inspired by DataFrames (R) and Pandas (Python)

Feature Extraction

Operation        Arguments                          Returns         Semantics
project          Seq[Index]                         MLTable         Select a subset of columns from a table.
union            MLTable                            MLTable         Concatenate two tables with identical schemas.
filter           MLRow => Bool                      MLTable         Select a subset of rows from a data table given a functional predicate.
join             MLTable, Seq[Index]                MLTable         Inner join of two tables based on a sequence of shared columns.
map              MLRow => MLRow                     MLTable         Apply a function to each row in the table.
flatMap          MLRow => TraversableOnce[MLRow]    MLTable         Apply a function to each row, producing 0 or more rows as output.
reduce           Seq[MLRow] => MLRow                MLTable         Combine all rows in a table using an associative, commutative reduce function.
reduceByKey      Int, Seq[MLRow] => MLRow           MLTable         Combine all rows in a table using an associative, commutative reduce function on a key-by-key basis, where a key column is the first argument.
matrixBatchMap   MLSubMatrix => MLSubMatrix         MLNumericTable  Execute a batch function on a local partition of the data. Output matrices are concatenated to form a new table.
numRows          None                               Long            Returns the number of rows in the table.
numCols          None                               Long            Returns the number of columns in the table.

Fig. 2: MLTable API Illustration. This table captures core operations of the API and is not exhaustive.
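The operators above follow standard relational/MapReduce semantics. A minimal sketch over Python lists of tuples (a hypothetical miniature for intuition, not the MLTable implementation):

```python
def project(table, idxs):
    """Select a subset of columns (cf. project)."""
    return [tuple(row[i] for i in idxs) for row in table]

def filter_rows(table, pred):
    """Select rows satisfying a functional predicate (cf. filter)."""
    return [row for row in table if pred(row)]

def reduce_by_key(table, key_idx, combine):
    """Combine rows sharing a key column with an associative,
    commutative function (cf. reduceByKey)."""
    groups = {}
    for row in table:
        k = row[key_idx]
        groups[k] = combine(groups[k], row) if k in groups else row
    return sorted(groups.values())

table = [(1, 10), (2, 5), (1, 7), (2, 1)]
# Sum the second column per key in the first column.
sums = reduce_by_key(table, 0, lambda a, b: (a[0], a[1] + b[1]))
# sums == [(1, 17), (2, 6)]
```

Associativity and commutativity of the combine function are what let the real system apply it partition-by-partition in any order.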

def main(args: Array[String]) {
  val mc = new MLContext("local")

  // Read in table from file on HDFS.
  val rawTextTable = mc.textFile(args(0))

  // Run feature extraction on the raw text - get the top 30000 bigrams.
  val featurizedTable = tfIdf(nGrams(rawTextTable, n=2, top=30000))

  // Cluster the data using K-Means.
  val kMeansModel = KMeans(featurizedTable, k=50)
}

Fig. 3: Loading, featurizing, and learning clusters on a corpus of Text Data.
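The featurization step in Fig. 3 (bigram extraction followed by tf-idf) can be sketched in Python; the helper names mirror the figure, but the log-idf weighting is a common convention assumed here rather than taken from the MLI implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list (cf. nGrams in Fig. 3)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Map each document to {bigram: tf * idf} feature weights."""
    grams = [Counter(ngrams(doc.split())) for doc in docs]
    n_docs = len(docs)
    df = Counter(g for counts in grams for g in counts)  # document frequency
    return [{g: tf * math.log(n_docs / df[g]) for g, tf in counts.items()}
            for counts in grams]

docs = ["the quick fox", "the quick dog", "lazy dog sleeps"]
feats = tf_idf(docs)
# Bigrams appearing in fewer documents receive higher weights.
```

Keeping only the `top` most frequent n-grams, as in the figure, would be one extra ranking pass over `df`.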

Family            Example Uses                                             Returns           Semantics
Shape             dims(mat), mat.numRows, mat.numCols                      Int or (Int,Int)  Matrix dimensions.
Composition       matA on matB, matA then matB                             Matrix            Combine two matrices row-wise or column-wise.
Indexing          mat(0,??), mat(10,10), mat(Seq(2,4), 1)                  Matrix or Scalar  Select elements or sub-matrices.
Reverse Indexing  mat(0,??).nonZeros, mat.parentRow(n), mat.parentCol(n)   Index             Look up matrix indexes in either base matrix or parent table.
Updating          mat(1,2) = 5, mat(1, Seq(3,10)) = matB                   None              Assign values to elements or sub-matrices.
Arithmetic        matA + matB, matA - 5, matA / matB                       Matrix            Element-wise arithmetic between matrices and scalars or matrices.
Linear Algebra    matA times matB, matA dot matB, matA.transpose,          Matrix or Scalar  Basic and extended linear algebra support.
                  matA.solve(v), matA.svd, matA.eigen, matA.rank

Fig. 4: MLSubMatrix API Illustration

with data size, as in the case with basic linear regression. As a result, various optimization techniques are used to converge to an approximate solution while iterating over the data. We treat optimization as a first-class citizen in our API. Furthermore, we designed the Optimizer interface to allow developers to contribute new optimization algorithms to the system. Figure 13 (bottom) shows as an example how to implement Stochastic Gradient Descent using MLSubMatrix.

Finally, we encourage developers to implement their algorithms using the Algorithm interface, which should return a model as specified by the Model interface. An algorithm implementing the Algorithm interface is a class with a train() method that accepts data and hyperparameters as input, and produces a Model. A Model is a class which makes predictions. In the case of a classification model, this would be a predicted class given a new example point. In the case of a collaborative

MLSubMatrix
✦ Linear algebra on local partitions
  ✦ e.g., matrix-vector operations for mini-batch logistic regression
  ✦ e.g., solving linear systems of equations for Alternating Least Squares
✦ Sparse and Dense Matrix Support

Alternating Least Squares

MATLAB:

function [U, V] = ALS_matlab(M, U, V, k, lambda, maxiter)
  % Initialize variables
  [m,n] = size(M);
  lambI = lambda * eye(k);
  for q = 1:m
    Vinds{q} = find(M(q,:) ~= 0);
  end
  for q = 1:n
    Uinds{q} = find(M(:,q) ~= 0);
  end
  % ALS main loop
  for iter = 1:maxiter
    parfor q = 1:m
      Vq = V(Vinds{q},:);
      U(q,:) = (Vq'*Vq + lambI) \ (Vq' * M(q,Vinds{q})');
    end
    parfor q = 1:n
      Uq = U(Uinds{q},:);
      V(q,:) = (Uq'*Uq + lambI) \ (Uq' * M(Uinds{q},q));
    end
  end
end

MLI:

object BroadcastALS extends Algorithm {
  def train(trainData: MLNumericTable, trainDataTrans: MLNumericTable,
      m: Int, n: Int, k: Int, lambda: Int, maxIter: Int): ALSModel = {
    val lambI = MLSubMatrix.eye(k).mul(lambda)
    var U = MLSubMatrix.rand(m, k)
    var V = MLSubMatrix.rand(n, k)
    var U_b = trainData.context.broadcast(U)
    var V_b = trainData.context.broadcast(V)
    for (iter <- 0 until maxIter) {
      // Each factor is updated from the other, broadcast factor.
      U = trainData.matrixBatchMap(localALS(_, V_b.value, lambI, k))
      U_b = trainData.context.broadcast(U)
      V = trainDataTrans.matrixBatchMap(localALS(_, U_b.value, lambI, k))
      V_b = trainData.context.broadcast(V)
    }
    new ALSModel(U, V)
  }

  def localALS(trainDataPart: MLSubMatrix, Y: MLSubMatrix, lambI: MLSubMatrix, k: Int) = {
    var localX = MLSubMatrix.zeros(trainDataPart.numRows, k)
    for (i <- 0 until trainDataPart.numRows) {
      val q = trainDataPart.rowID(i)
      val nz_inds = trainDataPart.nzCols(q)
      val Yq = Y(trainDataPart.nzCols(q), ??)
      localX(i, ??) = ((Yq.transpose times Yq) + lambI)
        .solve(Yq.transpose times trainDataPart(q, nz_inds).transpose)
    }
    localX
  }
}

Fig. 14: Matrix Factorization via ALS code in MATLAB (top) and MLI (bottom).

MLSolve
✦ Distributed implementations of common optimization patterns
  ✦ e.g., Stochastic Gradient Descent: applicable to summable ML losses
  ✦ e.g., L-BFGS: an approximate 2nd-order optimization method
  ✦ e.g., ADMM: decomposition / coordination procedure

Logistic Regression

MATLAB:

function w = log_reg(X, y, maxiter, learning_rate)
  [n, d] = size(X);
  w = zeros(d,1);
  for iter = 1:maxiter
    grad = X' * (sigmoid(X * w) - y);
    w = w - learning_rate * grad;
  end
end

% applies sigmoid function component-wise on the vector x
function s = sigmoid(x)
  s = 1 ./ (1 + exp(-1 .* x));
end

MLI:

object LogisticRegression extends Algorithm {
  def sigmoid(z: Scalar) = 1.0 / (1.0 + exp(-1.0*z))

  def gradientFunction(w: MLSubMatrix, x: MLSubMatrix, y: Scalar): MLSubMatrix = {
    x.transpose * (sigmoid(x dot w) - y)
  }

  def train(data: MLNumericTable, p: LogRegParams): LogRegModel = {
    val d = data.numCols
    val params = SGDParams(initWeights = MLSubMatrix.zeros(d, 1),
      maxIterations = p.maxIter, learningRate = p.learningRate,
      gradientFunction = gradientFunction)
    val weights = SGD(data, params)
    new LogRegModel(weights)
  }
}

object StochasticGradientDescent extends Optimizer {

  def localSGD(x: MLSubMatrix, weights: MLSubMatrix, n: Index, lambda: Scalar,
      gradientFunction: (MLSubMatrix, MLSubMatrix, Scalar) => MLSubMatrix): MLSubMatrix = {
    var localWeights = weights
    for (i <- 0 to x.numRows) {
      val label = x(i, 0)
      val data = x(i, 1 to x.numCols)
      val grad = gradientFunction(localWeights, data, label)
      localWeights = localWeights - (grad * (lambda/n))
    }
    localWeights
  }

  def compute(data: MLNumericTable, params: StochasticGradientDescentParameters): MLSubMatrix = {
    // Initialize parameters
    var weights = params.initWeights.value
    var converged = false
    var i = 0
    val n = data.count
    while (!converged && i < params.maxIterations) {
      i += 1
      // Run SGD locally on each partition, and average resulting parameters after each iteration
      val newWeights = data
        .matrixBatchMap(localSGD(_, weights, n, params.learningRate, params.gradientFunction))
        .reduce(_ + _) / data.partitions.length
      if ((newWeights - weights).mean < params.epsilon) converged = true
      weights = newWeights
    }
    weights
  }
}

Fig. 13: Logistic Regression Code in MATLAB (top) and MLI (middle, bottom).
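The pattern in the bottom listing (run SGD locally on each partition, then average the resulting weights after every iteration) can be sketched in plain Python; this toy mirrors the structure of compute/localSGD above on synthetic data, not the MLI code itself:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_sgd(part, w, lr):
    """One SGD pass over a single partition, starting from the
    current global weights (cf. localSGD in the listing above)."""
    w = list(w)
    for x, y in part:
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x))) - y
        w = [wj - lr * err * xj for wj, xj in zip(w, x)]
    return w

def parallel_sgd(partitions, d, iters, lr):
    """Run SGD independently on each partition, then average the
    resulting weights after every iteration (cf. compute above)."""
    w = [0.0] * d
    for _ in range(iters):
        local_ws = [local_sgd(p, w, lr) for p in partitions]
        w = [sum(ws[j] for ws in local_ws) / len(local_ws) for j in range(d)]
    return w

# Two toy 'partitions' of labeled points: label 1 iff second feature > 0.
parts = [[([1.0, 2.0], 1), ([1.0, -2.0], 0)],
         [([1.0, 1.5], 1), ([1.0, -1.0], 0)]]
w = parallel_sgd(parts, d=2, iters=200, lr=0.2)
correct = sum((sigmoid(sum(wj * xj for wj, xj in zip(w, x))) > 0.5) == (y == 1)
              for p in parts for x, y in p)
```

In the real system the per-partition passes run in parallel via matrixBatchMap, and the average is a single reduce.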

MLI Functionality

Regression: Linear Regression (+Lasso, Ridge)
Collaborative Filtering: Alternating Least Squares, [DFC]
Clustering: K-Means, [DP-Means]
Classification: Logistic Regression, Linear SVM (+L1, L2), Multinomial Regression, [Naive Bayes, Decision Trees]
Optimization Primitives: SGD, Parallel Gradient, Local SGD, [L-BFGS, ADMM, Adagrad]
Feature Extraction: Principal Component Analysis (PCA), N-grams, feature cleaning / normalization
ML Tools: Cross Validation, Evaluation Metrics

Vision / MLlib / MLI / ML Optimizer / Release Plan

Build a Classifier for X

What you want to do: Build a Classifier for X

What you have to do:
✦ Learn the internals of ML classification algorithms, sampling, feature selection, X-validation, ...
✦ Potentially learn Spark/Hadoop/...
✦ Implement 3-4 algorithms
✦ Implement grid-search to find the right algorithm parameters
✦ Implement validation algorithms
✦ Experiment with different sampling-sizes, algorithms, features
✦ ...

and in the end

Ask For Help

Insight: A Declarative Approach

✦ End Users tell the system what they want, not how to get it
✦ SQL → Result; MQL → Model

Example: Supervised Classification (Algorithm Independent)

var X = load("als_clinical", 2 to 10)
var y = load("als_clinical", 1)
var (fn-model, summary) = doClassify(X, y)

ML Optimizer: A Search Problem

(Chart: candidate model families, e.g. SVM and Boosting, explored within a time budget such as 5 min.)

✦ System is responsible for searching through model space
✦ Opportunities for physical optimization

Systems Optimization of Model Search

Observation: We tend to be I/O bound during model training.

A  B  C
1  a  Dog
1  b  Cat
2  c  Cat
2  d  Cat
3  e  Dog
3  f  Horse
4  g  Monkey

(Illustration: Query A and Query B scan the same example table in a single shared pass.)

✦ Idea from databases – shared cursor!
✦ Single pass over the data, many models trained
✦ Example – Logistic Regression via SGD
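The shared-cursor idea can be sketched in Python: amortize one scan of the data across several candidate models (here, logistic-regression SGD at different learning rates) instead of scanning once per model. Names and data are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def shared_scan_sgd(data, learning_rates, epochs):
    """Train one SGD logistic-regression model per learning rate,
    sharing a single cursor over the data: each row fetched from
    storage updates every candidate model before moving on."""
    models = [[0.0, 0.0] for _ in learning_rates]
    scans = 0
    for _ in range(epochs):
        scans += 1                       # one pass over the data...
        for x, y in data:
            for w, lr in zip(models, learning_rates):
                err = sigmoid(w[0] * x[0] + w[1] * x[1]) - y
                w[0] -= lr * err * x[0]  # ...updates every model in flight
                w[1] -= lr * err * x[1]
    return models, scans

data = [([1.0, 2.0], 1), ([1.0, -2.0], 0), ([1.0, 1.5], 1), ([1.0, -1.0], 0)]
models, scans = shared_scan_sgd(data, learning_rates=[0.01, 0.1, 0.5], epochs=50)
# 50 shared scans trained 3 models; a per-model loop would need 150 scans
```

When training is I/O bound, the saved scans, not the extra arithmetic, dominate the speedup.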

Relationship with MLI

(Stack diagram: ML Optimizer over MLI over MLlib over Spark; the End User issues MQL to the ML Optimizer, while ML Developers write against MLI.)

✦ MLI provides common interface for all algorithms
✦ Contracts: Meta-data for algorithms written against MLI
  ✦ Type (e.g., classification)
  ✦ Parameters
  ✦ Runtime (e.g., O(n))
  ✦ Input-Specification
  ✦ Output-Specification
  ✦ ...

Vision / MLlib / MLI / ML Optimizer / Release Plan

Contributors
✦ John Duchi
✦ Michael Franklin
✦ Joseph Gonzalez
✦ Rean Griffith
✦ Michael Jordan
✦ Tim Kraska
✦ Xinghao Pan
✦ Virginia Smith
✦ Shivaram Venkataraman
✦ Matei Zaharia

First Release (Summer)

✦ MLlib: low-level ML library and underlying kernels
  ✦ Callable from Scala, Java
  ✦ Included as part of Spark
✦ MLI: API for feature extraction and ML algorithms
  ✦ Platform for ML development
  ✦ Includes more extensive library with a faster dev-cycle than MLlib

Second Release (Winter)

✦ ML Optimizer: automated model selection
  ✦ Search problem over feature extractors and algorithms in MLI
  ✦ Contracts
  ✦ Restricted query language (MQL)
✦ Feature extraction for image data

Future Directions
✦ Identify minimal set of ML operators
✦ Expose internals of ML algorithms to optimizer
✦ Unified language for End Users and ML Developers
✦ Plug-ins to Python, R
✦ Visualization for unsupervised learning and exploration
✦ Advanced ML capabilities
  ✦ Time-series algorithms
  ✦ Graphical models
  ✦ Advanced optimization (e.g., asynchronous computation)
  ✦ Online updates
  ✦ Sampling for efficiency

Contributions encouraged!

Berkeley, CA, August 29-30, www.mlbase.org
