MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark...

42
Reza Zadeh MLlib and All-pairs Similarity

Transcript of MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark...

Page 1: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Reza Zadeh

MLlib and All-pairs Similarity

Page 2: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Outline Example Invocations Benefits of Iterations Singular Value Decomposition

All-pairs Similarity Computation MLlib + {Streaming, GraphX, SQL} Future Directions

Page 3: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Introduction

Page 4: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

A General Platform

Spark Core

Spark Streaming"

real-time

Spark SQL structured

GraphX graph

MLlib machine learning

Standard libraries included with Spark

Page 5: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MLlib History MLlib is a Spark subproject providing machine learning primitives

Initial contribution from AMPLab, UC Berkeley

Shipped with Spark since Sept 2013

Page 6: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MLlib: Available algorithms classification: logistic regression, linear SVM,"naïve Bayes, least squares, classification tree regression: generalized linear models (GLMs), regression tree collaborative filtering: alternating least squares (ALS), non-negative matrix factorization (NMF) clustering: k-means|| decomposition: SVD, PCA optimization: stochastic gradient descent, L-BFGS

Page 7: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Example Invocations

Page 8: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Example: K-means

Page 9: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Example: PCA

Page 10: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Example: ALS

Page 11: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Benefits of fast iterations

Page 12: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Optimization At least two large classes of optimization problems humans can solve:

-  Convex Programs -  Singular Value Decomposition

Page 13: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Optimization - LR data  =  spark.textFile(...).map(readPoint).cache()    w  =  numpy.random.rand(D)    for  i  in  range(iterations):          gradient  =  data.map(lambda  p:                  (1  /  (1  +  exp(-­‐p.y  *  w.dot(p.x))))  *  p.y  *  p.x          ).reduce(lambda  a,  b:  a  +  b)          w  -­‐=  gradient    print  “Final  w:  %s”  %  w  

Page 14: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MR PageRank Repeatedly multiply sparse matrix and vector Requires repeatedly hashing together page adjacency lists and rank vector

Neighbors (id, edges)

Ranks (id, rank) …

Same file grouped over and over

iteration 1 iteration 2 iteration 3

Page 15: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Neighbors (id, edges)

Ranks (id, rank)

Spark PageRank Using cache(), keep neighbor lists in RAM Using partitioning, avoid repeated hashing

join

join

join …

partitionBy

Page 16: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Spark PageRank Using cache(), keep neighbor lists in RAM Using partitioning, avoid repeated hashing

Neighbors (id, edges)

Ranks (id, rank)

join join join …

same"node

partitionBy

Page 17: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Spark PageRank Using cache(), keep neighbor lists in RAM Using partitioning, avoid repeated hashing

Neighbors (id, edges)

Ranks (id, rank)

join

partitionBy

join join …

Page 18: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

PageRank Code #  RDD  of  (id,  neighbors)  pairs  links  =  spark.textFile(...).map(parsePage)                            .partitionBy(128).cache()    ranks  =  links.mapValues(lambda  v:  1.0)    #  RDD  of  (id,  rank)    for  i  in  range(ITERATIONS):          ranks  =  links.join(ranks).flatMap(                  lambda  (id,  (links,  rank)):                          [(d,  rank/links.size)  for  d  in  links]          ).reduceByKey(lambda  a,  b:  a  +  b)  

Generalizes  to  Matrix  Multiplication,  opening  many  algorithms  from  Numerical  Linear  Algebra  

Page 19: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

PageRank Results

171

72

23

0

50

100

150

200

Tim

e pe

r ite

ratio

n (s)

Hadoop

Basic Spark

Spark + Controlled Partitioning

Page 20: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Deep Dive: Singular Value Decomposition

Page 21: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Singular Value Decomposition Two cases: Tall and Skinny vs roughly Square

computeSVD function takes care of which one to call, so you don’t have to.

Page 22: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

SVD selection

Page 23: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Tall and Skinny SVD

Page 24: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Square SVD via ARPACK Very mature Fortran77 package for computing eigenvalue decompositions

JNI interface available via netlib-java Distributed using Spark distributed matrix-vector multiplies!

Page 25: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Deep Dive: All pairs Similarity

Page 26: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Deep Dive: All pairs Similarity Compute via DIMSUM: “Dimension Independent Similarity Computation using MapReduce”

Will be in Spark 1.2 as a method in RowMatrix

Page 27: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

All-pairs similarity computation

Page 28: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Naïve Approach

Page 29: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Naïve approach: analysis

Page 30: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

DIMSUM Sampling

Page 31: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

DIMSUM Analysis

Page 32: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

DIMSUM Proof

Page 33: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Spark implementation Magnitudes shipped with every task Makes life much easier than e.g. MapReduce

Page 34: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Ongoing Work in MLlib multiclass decision trees stats library (e.g. stratified sampling, ScaRSR) ADMM

LDA All-pairs similarity (DIMSUM) General Convex Optimization

Page 35: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MLlib + {Streaming, GraphX, SQL}

Page 36: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MLlib + Streaming As of Spark 1.1, you can train linear models in a streaming fashion

Model weights are updated via SGD, thus amenable to streaming

More work needed for decision trees

Page 37: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MLlib + SQL

points = context.sql(“select latitude, longitude from tweets”) !

model = KMeans.train(points, 10) !!

Page 38: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

MLlib + GraphX

Page 39: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Future of MLlib

Page 40: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

General Convex Optimization

Distribute  CVX  by  backing  CVXPY  with  

PySpark    

Easy-­‐to-­‐express  distributable  convex  

programs    

Need  to  know  less  math  to  optimize  

complicated  objectives  

Page 41: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Spark and ML Spark has all its roots in research, so we hope to keep incorporating new ideas!

Page 42: MLlib and All-pairs Similarity - Stanford Universityrezab/slides/maryland_mllib.pdfSpark Core Spark Streaming" real-time Spark SQL structured GraphX graph MLlib machine learning …

Next Speakers Ameet: History of MLlib and the research on it at Berkeley

Ankur: Graph processing with GraphX

TD: Spark Streaming