Download - Distributing Matrix Computations with Spark MLlibrezab/slides/reza_mllib_maryland.pdf · MLlib History MLlib is a Spark subproject providing machine learning primitives Initial contribution

Transcript
Reza Zadeh

Distributing Matrix Computations with Spark MLlib

A General Platform

Spark Core

Spark Streaming (real-time)

Spark SQL (structured)

GraphX (graph)

MLlib (machine learning)

Standard libraries included with Spark

Outline

- Introduction to MLlib
- Example Invocations
- Benefits of Iterations: Optimization
- Singular Value Decomposition
- All-pairs Similarity Computation
- MLlib + {Streaming, GraphX, SQL}

Introduction

MLlib History

MLlib is a Spark subproject providing machine learning primitives

Initial contribution from AMPLab, UC Berkeley

Shipped with Spark since Sept 2013

MLlib: Available algorithms

- classification: logistic regression, linear SVM, naïve Bayes, least squares, classification tree
- regression: generalized linear models (GLMs), regression tree
- collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF)
- clustering: k-means||
- decomposition: SVD, PCA
- optimization: stochastic gradient descent, L-BFGS

Example Invocations

Example: K-means

Example: PCA

Example: ALS

Benefits of fast iterations

Optimization

At least two large classes of optimization problems humans can solve:

- Convex Programs
- Spectral Problems (SVD)

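Optimization: Logistic Regression (LR)

    import numpy
    from math import exp

    # readPoint, D and iterations are assumed to be defined elsewhere;
    # each parsed point p has a feature vector p.x and a label p.y in {-1, +1}.
    data = spark.textFile(...).map(readPoint).cache()
    w = numpy.random.rand(D)
    for i in range(iterations):
        # Gradient of the logistic loss, summed over all points.
        gradient = data.map(lambda p:
            (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
        ).reduce(lambda a, b: a + b)
        w -= gradient
    print("Final w: %s" % w)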

Spark PageRank

Using cache(), keep neighbor lists in RAM

Using partitioning, avoid repeated hashing

[Diagram: Neighbors (id, edges) and Ranks (id, rank); partitionBy, then join, join, join, …]
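The join-based loop can be sketched on a single machine in plain Python, with dictionaries standing in for the cached Neighbors and Ranks datasets. This is only an illustration of the data flow: the function name, the damping factor of 0.85, the starting rank of 1.0 and the toy graph are assumptions, not taken from the slides, and the sketch assumes every page has at least one outgoing edge and appears as a key.

```python
def pagerank(neighbors, iterations=20, damping=0.85):
    """Iterate PageRank over a dict {page_id: [outgoing neighbor ids]}."""
    # Ranks (id, rank): every page starts at 1.0 (an assumed starting value).
    ranks = {page: 1.0 for page in neighbors}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in neighbors}
        # "join" Neighbors with Ranks on the page id, then spread each
        # page's rank evenly over its outgoing edges.
        for page, edges in neighbors.items():
            for dest in edges:
                contribs[dest] += ranks[page] / len(edges)
        # Sum the contributions per page and apply the damping factor.
        ranks = {page: (1 - damping) + damping * c
                 for page, c in contribs.items()}
    return ranks


# Hypothetical three-page graph in which every page has an out-link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In the distributed version the neighbor lists never change between iterations, which is why caching them and fixing their partitioning once pays off on every later join.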

PageRank Results

[Bar chart: time per iteration (s), axis from 0 to 200. Hadoop: 171; Basic Spark: 72; Spark + Controlled Partitioning: 23]

Spark PageRank

Generalizes to Matrix Multiplication, opening many algorithms from Numerical Linear Algebra

Deep Dive: Singular Value Decomposition

Singular Value Decomposition

Two cases: Tall and Skinny vs. roughly Square

The computeSVD function decides which one to call, so you don't have to.

SVD selection

Tall and Skinny SVD

Tall and Skinny SVD

Gets us V and the singular values

Gets us U by one matrix multiplication
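The formulas for these two steps did not survive in the transcript. A standard derivation consistent with the captions, assuming a thin SVD of an m × n matrix A with m much larger than n and all singular values nonzero, would read as follows. It is a reconstruction, not a transcription of the slide.

```latex
\begin{aligned}
A &= U \Sigma V^{\top}, \qquad A \in \mathbb{R}^{m \times n},\ m \gg n \\
A^{\top} A &= V \Sigma U^{\top} U \Sigma V^{\top} = V \Sigma^{2} V^{\top} \\
U &= A V \Sigma^{-1}
\end{aligned}
```

The second line is an eigendecomposition of the small n × n matrix A^T A, which yields V and the squared singular values; the third line recovers U with a single multiplication of the tall matrix A by the small matrix V Σ^{-1}.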

Square SVD via ARPACK

Very mature Fortran77 package for computing eigenvalue decompositions

JNI interface available via netlib-java

Distributed using Spark

Square SVD via ARPACK

Only needs to compute matrix-vector multiplies to build Krylov subspaces

The result of a matrix-vector multiply is small

The multiplication can be distributed
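To make the matrix-vector idea concrete, here is a minimal single-machine sketch. It uses plain power iteration, a much simpler relative of the Krylov methods in ARPACK, because it needs exactly the same primitive: repeated products with A^T A. The function names and the toy matrix are hypothetical, and the loop over rows stands in for work that Spark would spread across partitions.

```python
import math


def gram_matvec(rows, v):
    """Compute (A^T A) v as a sum of per-row terms (a_i . v) * a_i."""
    result = [0.0] * len(v)
    for row in rows:  # In a cluster, each partition would handle its own rows.
        dot = sum(a * x for a, x in zip(row, v))
        for j, a in enumerate(row):
            result[j] += dot * a
    return result  # Only this length-n vector needs to be collected.


def top_singular_value(rows, iterations=100):
    """Estimate the largest singular value of A by power iteration on A^T A."""
    n = len(rows[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = gram_matvec(rows, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # The Rayleigh quotient v^T (A^T A) v estimates the top eigenvalue sigma^2.
    eigenvalue = sum(x * y for x, y in zip(v, gram_matvec(rows, v)))
    return math.sqrt(eigenvalue), v


# Toy 3 x 2 matrix whose singular values are 3 and 1.
A = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma, v = top_singular_value(A)
```

Each row contributes independently to the product, so the rows can live on different machines while only short vectors are exchanged per iteration.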

Deep Dive: All pairs Similarity

Deep Dive: All pairs Similarity

Compute via DIMSUM: “Dimension Independent Similarity Computation using MapReduce”

Will be in Spark 1.2 as a method in RowMatrix
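The sampling slides did not survive extraction, so the following is a hedged single-machine sketch of the mapper/reducer scheme as commonly described for DIMSUM, not the Spark implementation. It estimates cosine similarities between columns: each co-occurring pair of entries in a row is kept with probability min(1, γ / (‖c_j‖ ‖c_k‖)), and the sums are rescaled afterwards. The function name, the `gamma` and `rng` parameters and the toy matrix are assumptions for illustration.

```python
import math
import random
from collections import defaultdict


def dimsum(rows, gamma, rng=None):
    """Estimate cosine similarities between columns of a row-major matrix."""
    rng = rng or random.Random(0)
    n = len(rows[0])
    # Column norms ||c_j||, computed in one pass over the data.
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) for j in range(n)]
    sums = defaultdict(float)
    for row in rows:  # "Map" phase: every row is processed independently.
        for j in range(n):
            for k in range(j + 1, n):
                if row[j] == 0.0 or row[k] == 0.0:
                    continue
                # Keep the pair with probability min(1, gamma / (|c_j| |c_k|)).
                p = min(1.0, gamma / (norms[j] * norms[k]))
                if rng.random() < p:
                    sums[(j, k)] += row[j] * row[k]
    sims = {}
    for (j, k), total in sums.items():  # "Reduce" phase: rescale each pair.
        scale = norms[j] * norms[k]
        sims[(j, k)] = total / scale if gamma > scale else total / gamma
    return sims


# With a very large gamma nothing is dropped and the result is exact.
matrix = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
similarities = dimsum(matrix, gamma=1e9)
```

Smaller values of `gamma` drop more pairs from high-norm columns, trading accuracy for less data shuffled between the map and reduce phases.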

All-pairs similarity computation

Naïve Approach

Naïve approach: analysis

DIMSUM Sampling

DIMSUM Analysis

Spark implementation

Ongoing Work in MLlib

- stats library (e.g. stratified sampling, ScaRSR)
- ADMM
- LDA

General Convex Optimization

MLlib + {Streaming, GraphX, SQL}

MLlib + Streaming

As of Spark 1.1, you can train linear models in a streaming fashion

Model weights are updated via SGD, so training is amenable to streaming

More work needed for decision trees
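Why SGD suits streaming can be seen in a small plain-Python sketch: each arriving mini-batch triggers one gradient step on the current weights, and no earlier batch has to be revisited. This is not MLlib's streaming API; the function name, the step size, the least-squares loss and the synthetic stream are assumptions chosen for illustration.

```python
def sgd_update(weights, batch, step_size=0.1):
    """Apply one mini-batch gradient step for least-squares linear regression."""
    grad = [0.0] * len(weights)
    for features, label in batch:
        error = sum(w * x for w, x in zip(weights, features)) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    # Average the gradient over the batch and step against it.
    return [w - step_size * g / len(batch) for w, g in zip(weights, grad)]


# Hypothetical stream: batches of (features, label) drawn from y = 2*x1 - x2.
stream = [
    [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0)],
    [([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)],
] * 200

weights = [0.0, 0.0]
for batch in stream:  # Each arriving batch refines the current model.
    weights = sgd_update(weights, batch)
```

The model state is just the weight vector, so it can be carried from one batch to the next at negligible cost. A decision tree has no comparably small state that a single additive step can update.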

MLlib + SQL

    points = context.sql("select latitude, longitude from tweets")

    model = KMeans.train(points, 10)

MLlib + GraphX

Future of MLlib

General Linear Algebra

- CoordinateMatrix
- RowMatrix
- BlockMatrix

Local and distributed versions. Operations in-between.

Goal: version 1.2

Goal: version 1.3
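As a rough picture of how the three distributed types differ, the sketch below stores one small matrix in three layouts using plain Python containers. It only mirrors the general idea behind each name (entries, rows, blocks); the variable names and the 2 × 2 block size are assumptions, and none of this is MLlib code.

```python
dense = [
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0],
    [5.0, 0.0, 0.0, 6.0],
]

# CoordinateMatrix-style: one (row, col, value) entry per nonzero element.
coordinate = [
    (i, j, value)
    for i, row in enumerate(dense)
    for j, value in enumerate(row)
    if value != 0.0
]

# RowMatrix-style: a collection of row vectors.
row_matrix = [list(row) for row in dense]

# BlockMatrix-style: ((block_row, block_col), sub-matrix) pairs.
block_size = 2
blocks = {
    (bi, bj): [
        row[bj * block_size:(bj + 1) * block_size]
        for row in dense[bi * block_size:(bi + 1) * block_size]
    ]
    for bi in range(2)
    for bj in range(2)
}
```

The entry layout suits very sparse data, the row layout suits tall matrices with short rows, and the block layout keeps both dimensions splittable.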

Research Goal: General Convex Optimization

Distribute CVX by backing CVXPY with PySpark

Easy-to-express distributable convex programs

Need to know less math to optimize complicated objectives

Spark and ML

Spark has all its roots in research, so we hope to keep incorporating new ideas!