New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6....

10
June 01, 2016 Large-Scale Matrix Factorization: DSGD in Spark

Transcript of New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6....

Page 1: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

June 01, 2016

Large-Scale Matrix Factorization: DSGD in Spark

Page 2: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Collaborative Filtering (CF) •  Most prominent approach to generate recommendations •  Idea: Use “wisdom of crowd” to recommend items •  Input: Implicit or explicit user ratings •  Algorithms: Baseline, Memory-based, Matrix Factorization, etc.

Matrix Factorization: •  Latent factor model •  Dimension Reduction •  Handles large datasets, sparsity •  Can be regularized •  PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc.

•  Spark MLlib: ALS •  Gemulla paper: DSGD

Page 3: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

q  Low-rank (k<<m, n) q Minimize errors of observed ratings: q  NP-hard non convex optimization:

u  ALS: fix U & solve for V, fix V & solve for U

u  Turns in convex least squares problem

q  Communication Costs: u  Two Copies for ratings: partitioned by users

& partitioned by items u  Block to block instead of all to all

q  Fully Parallel: Models shared via join between workers

MLlib Alternating Least Squares (ALS)

Page 4: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Stochastic Gradient Descent SGD

Page 5: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Stochastic Gradient Descent SGD

q  SGD: iterative, steps dependent

q  How to distribute?

Page 6: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Distributed Stochastic Gradient Descent DSGD q  Divide into interchangeable strata q  d independent map tasks:

u  Each takes a block: Zb, Wb, Hb

u  Local SGD on each stratum q  Local losses sum q  Representation allows parallelism

Cover the whole matrix Cover the whole matrix

Page 7: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Distributed SGD

Page 8: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Results •  DSGD scales well with input matrix dimensions & number of available ratings •  Scales well with # of Cores until too many è data per mapper is too small •  d-monomial strata representation: reduced communication costs •  DSGD shuffles only strata and blocks (SGD shuffles all data)

Rank Time per epoch (s)

50 120

100 125

200 135

Cores Time per epoch

8 1x

16 0.52x

32 0.27x

64 0.24x

Page 9: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

Notes •  ALS speedy convergence, works well in practice •  Spark: very helpful for implementing DSGD compared to a MapReduce •  Operate on data in memory, no write to disk after every iteration •  DSGD can be enhanced further in Spark: smart data dependent blocking

Page 10: New Large-Scale Matrix Factorization: DSGD in Sparkrezab/classes/cme323/S16/projects... · 2016. 6. 9. · • PCA, SVD, NMF, TF, ALS, SGD, SSGD, DSGD, etc. • Spark MLlib: ALS •

THANK YOU!