Large-Scale Machine Learning with Apache Spark
Transcript of Large-Scale Machine Learning with Apache Spark
MLlib: Scalable Machine Learning on Spark
Xiangrui Meng
What is MLlib?
What is MLlib?
MLlib is a Spark subproject providing machine learning primitives:
• initial contribution from AMPLab, UC Berkeley
• shipped with Spark since version 0.8
• 35 contributors
What is MLlib?

Algorithms (a usage sketch follows the list):
• classification: logistic regression, linear support vector machine (SVM), naive Bayes, classification tree
• regression: generalized linear models (GLMs), regression tree
• collaborative filtering: alternating least squares (ALS)
• clustering: k-means
• decomposition: singular value decomposition (SVD), principal component analysis (PCA)
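All of these share Spark's RDD-based API. As a minimal sketch of what using one of the listed classifiers looks like (the tiny training set is illustrative; `sc` is an existing SparkContext):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Tiny illustrative training set.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(-1.0, -2.0))
))
val model = LogisticRegressionWithSGD.train(training, numIterations = 100)
val prediction = model.predict(Vectors.dense(1.5, 0.5)) // => 0.0 or 1.0
```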
Why MLlib?
scikit-learn?
Algorithms:
• classification: SVM, nearest neighbors, random forest, …
• regression: support vector regression (SVR), ridge regression, Lasso, logistic regression, …
• clustering: k-means, spectral clustering, …
• decomposition: PCA, non-negative matrix factorization (NMF), independent component analysis (ICA), …
Mahout?
Algorithms:
• classification: logistic regression, naive Bayes, random forest, …
• collaborative filtering: ALS, …
• clustering: k-means, fuzzy k-means, …
• decomposition: SVD, randomized SVD, …
Vowpal Wabbit? H2O? R? MATLAB? Mahout? Weka? scikit-learn? LIBLINEAR? GraphLab?
Why MLlib?
• It is built on Apache Spark, a fast and general engine for large-scale data processing.
• Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
Why MLlib?
Spark philosophy
Make life easy and productive for data scientists:
• Well-documented, expressive APIs
• Powerful domain specific libraries
• Easy integration with storage systems
• … and caching to avoid data movement
Word count (Scala)
```scala
val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split(" "))
  .map(word => (word, 1L))
  .reduceByKey(_ + _)
```
Word count (SparkR)

```r
lines <- textFile(sc, "hdfs://...")
words <- flatMap(lines, function(line) {
  unlist(strsplit(line, " "))
})
wordCount <- lapply(words, function(word) {
  list(word, 1L)
})
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
```
Gradient descent
```scala
val points = spark.textFile(...).map(parsePoint).cache()
var w = Vector.zeros(d)
for (i <- 1 to numIterations) {
  val gradient = points.map { p =>
    (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
  }.reduce(_ + _)
  w -= alpha * gradient
}
```
$$w \leftarrow w - \alpha \cdot \sum_{i=1}^{n} g(w;\, x_i,\, y_i)$$
k-means (Scala)

```scala
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble)).cache()

// Cluster the data into two classes using KMeans.
val clusters = KMeans.train(parsedData, 2, maxIterations = 20)

// Compute the sum of squared errors.
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)
```
k-means (Python)

```python
from math import sqrt
from numpy import array
from pyspark.mllib.clustering import KMeans

# Load and parse the data
data = sc.textFile("kmeans_data.txt")
parsedData = data.map(
    lambda line: array([float(x) for x in line.split(' ')])).cache()

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=1,
                        initializationMode="k-means||")

# Evaluate clustering by computing the sum of squared errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

cost = parsedData.map(lambda point: error(point)) \
                 .reduce(lambda x, y: x + y)
print("Sum of squared error = " + str(cost))
```
Dimension reduction + k-means

```scala
// Compute principal components.
val points: RDD[Vector] = ...
val mat = new RowMatrix(points)
val pc = mat.computePrincipalComponents(20)

// Project points to a low-dimensional space.
val projected = mat.multiply(pc).rows

// Train a k-means model on the projected data.
val model = KMeans.train(projected, 10)
```
Collaborative filtering

```scala
// Load and parse the data
val data = sc.textFile("mllib/data/als/test.data")
val ratings = data.map(_.split(',') match {
  case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
})

// Build the recommendation model using ALS
val numIterations = 20
val model = ALS.train(ratings, 1, 20, 0.01)

// Evaluate the model on rating data
val usersProducts = ratings.map {
  case Rating(user, product, rate) => (user, product)
}
val predictions = model.predict(usersProducts)
```
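A natural continuation, sketched after the standard MLlib example, joins the predictions back with the original ratings to compute the mean squared error:

```scala
val ratesAndPreds = ratings
  .map { case Rating(user, product, rate) => ((user, product), rate) }
  .join(predictions.map { case Rating(user, product, rate) =>
    ((user, product), rate)
  })
val MSE = ratesAndPreds.map { case (_, (r, p)) => (r - p) * (r - p) }.mean()
println("Mean Squared Error = " + MSE)
```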
Why MLlib?

• It ships with Spark as a standard component.
Spark community
One of the largest open source projects in big data:
• 170+ developers contributing
• 30+ companies contributing
• 400+ discussions per month on the mailing list
30-day commit activity
[Bar charts: number of patches, lines added, and lines removed over the last 30 days for MapReduce, Storm, YARN, and Spark.]
Out for dinner?
• Search for a restaurant and make a reservation.
• Start navigation.
• Food looks good? Take a photo and share.
Why smartphone?
Out for dinner?
• Search for a restaurant and make a reservation. (Yellow Pages?)
• Start navigation. (GPS?)
• Food looks good? Take a photo and share. (Camera?)
Why MLlib?
A special-purpose device may be better at one aspect than a general-purpose device. But the cost of context switching is high:
• different languages or APIs
• different data formats
• different tuning tricks
Spark SQL + MLlib

```scala
// Data can easily be extracted from existing sources,
// such as Apache Hive.
val trainingTable = sql("""
  SELECT e.action, u.age, u.latitude, u.longitude
  FROM Users u
  JOIN Events e ON u.userId = e.userId""")

// Since `sql` returns an RDD, the results of the above
// query can be easily used in MLlib.
val training = trainingTable.map { row =>
  val features = Vectors.dense(row(1), row(2), row(3))
  LabeledPoint(row(0), features)
}

val model = SVMWithSGD.train(training)
```
Streaming + MLlib
```scala
// Collect tweets using streaming.

// Train a k-means model.
val model: KMeansModel = ...

// Apply the model to filter tweets.
// (featurize converts tweet text to a feature vector; defined elsewhere)
val tweets = TwitterUtils.createStream(ssc, Some(authorizations(0)))
val statuses = tweets.map(_.getText)
val filteredTweets = statuses.filter(t =>
  model.predict(featurize(t)) == clusterNumber)

// Print tweets within this particular cluster.
filteredTweets.print()
```
GraphX + MLlib

```scala
// Assemble the link graph.
val graph = Graph(pages, links)
val pageRank: RDD[(Long, Double)] = graph.staticPageRank(10).vertices

// Load page labels (spam or not) and content features.
val labelAndFeatures: RDD[(Long, (Double, Seq[(Int, Double)]))] = ...
val training: RDD[LabeledPoint] =
  labelAndFeatures.join(pageRank).map {
    case (id, ((label, features), pageRank)) =>
      // assuming 1000 content features; PageRank appended at index 1000
      LabeledPoint(label, Vectors.sparse(1001, features ++ Seq((1000, pageRank))))
  }

// Train a spam detector using logistic regression.
val model = LogisticRegressionWithSGD.train(training)
```
Why MLlib?

• Built on Spark's lightweight yet powerful APIs.
• Backed by Spark's open source community.
• Seamless integration with Spark's other components.
• Comparable to, or even better than, other libraries specialized in large-scale machine learning.
Why MLlib?
• Scalability
• Performance
• User-friendly APIs
• Integration with Spark and its other components
Logistic regression
Logistic regression - weak scaling
• Full dataset: 200K images, 160K dense features.
• Similar weak scaling.
• MLlib is within a factor of 2 of VW's wall-clock time.
[Figures from the MLbase paper (systems labeled MLbase correspond to MLlib): Fig. 5, walltime in seconds for weak scaling of logistic regression on MLlib, VW, and MATLAB; Fig. 6, relative walltime vs. number of machines for MLlib, VW, and the ideal scaling line.]
Logistic regression - strong scaling
• Fixed dataset: 50K images, 160K dense features.
• MLlib exhibits better scaling properties.
• MLlib is faster than VW with 16 and 32 machines.
[Figures from the MLbase paper (systems labeled MLbase correspond to MLlib): Fig. 7, walltime in seconds for strong scaling of logistic regression on 1 to 32 machines for MLlib, VW, and MATLAB; Fig. 8, speedup vs. number of machines for MLlib, VW, and the ideal scaling line.]
Collaborative filtering
Collaborative filtering
• Recover a rating matrix from a subset of its entries.

[Graphic: a partially observed rating matrix, with unknown entries shown as "?".]
Alternating least squares (ALS)
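ALS models the rating matrix as a low-rank product, $M \approx UV^T$, and given the set of observed entries $\Omega(M)$ minimizes the regularized bi-convex objective

$$\min_{U,V} \sum_{(i,j) \in \Omega(M)} \left( M_{ij} - u_i^T v_j \right)^2 + \lambda \left( \|U\|_F^2 + \|V\|_F^2 \right)$$

by alternating between solving for $U$ with $V$ fixed and for $V$ with $U$ fixed. With one factor fixed, each row of the other has an independent closed-form solution, which is what makes ALS well suited to parallelism.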
ALS - wall-clock time
• Dataset: scaled version of the Netflix data (9x in size).
• Cluster: 9 machines.
• MLlib is an order of magnitude faster than Mahout.
• MLlib is within a factor of 2 of GraphLab.
| System   | Wall-clock time (seconds) |
|----------|---------------------------|
| Matlab   | 15443                     |
| Mahout   | 4206                      |
| GraphLab | 291                       |
| MLlib    | 481                       |
Implementation
Implementation of k-means
Initialization:
• random
• k-means++
• k-means|| (see the sketch below)
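A minimal sketch of selecting the initialization mode through MLlib's builder API (`points` is an assumed, already-loaded `RDD[Vector]`):

```scala
import org.apache.spark.mllib.clustering.KMeans

val kmeans = new KMeans()
  .setK(10)
  .setMaxIterations(20)
  .setInitializationMode(KMeans.K_MEANS_PARALLEL) // or KMeans.RANDOM

// val model = kmeans.run(points) // points: RDD[Vector], assumed loaded
```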
Implementation of k-means

Iterations:
• For each point, find its closest center.
• Update cluster centers.
$$l_i = \arg\min_j \|x_i - c_j\|_2^2$$

$$c_j = \frac{\sum_{i \,:\, l_i = j} x_i}{\sum_{i \,:\, l_i = j} 1}$$
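A minimal sketch of one such iteration expressed with plain RDD operations over dense points (MLlib's actual implementation is considerably more optimized):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)
import org.apache.spark.rdd.RDD

// One Lloyd iteration: assign each point to its closest center,
// then recompute each center as the mean of its assigned points.
def lloydStep(points: RDD[Array[Double]],
              centers: Array[Array[Double]]): Array[Array[Double]] = {
  val dist2 = (a: Array[Double], b: Array[Double]) =>
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  points
    .map { p =>
      val j = centers.indices.minBy(i => dist2(p, centers(i)))
      (j, (p, 1L))                       // (cluster id, (point, count))
    }
    .reduceByKey { case ((s1, n1), (s2, n2)) =>
      (s1.zip(s2).map { case (a, b) => a + b }, n1 + n2)
    }
    .mapValues { case (sum, n) => sum.map(_ / n) }
    .values
    .collect()
}
```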
Implementation of k-means

The points are usually sparse, but the centers are most likely dense. Computing a distance takes O(d) time, so the time complexity is O(n d k) per iteration, and we take no advantage of sparsity. However, we have

$$\|x - c\|_2^2 = \|x\|_2^2 + \|c\|_2^2 - 2\langle x, c \rangle$$

Computing the inner product only needs the non-zero elements. So we can cache the norms of the points and of the centers, and then only need the inner products to obtain the distances. This reduces the running time to O(nnz k + d k) per iteration.

However, is it accurate?
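A minimal sketch of the trick on a sparse point stored as parallel index/value arrays (plain arrays here, not MLlib's internal vector types; the norms are assumed precomputed):

```scala
// Squared distance from cached norms: only the O(nnz) inner product
// over the point's non-zero entries is computed per (point, center) pair.
def fastSquaredDistance(indices: Array[Int], values: Array[Double],
                        xNormSq: Double,                 // cached ||x||^2
                        center: Array[Double],
                        cNormSq: Double): Double = {     // cached ||c||^2
  var dot = 0.0
  var k = 0
  while (k < indices.length) {
    dot += values(k) * center(indices(k))
    k += 1
  }
  xNormSq + cNormSq - 2.0 * dot   // ||x||^2 + ||c||^2 - 2<x, c>
}
```

The accuracy question arises because the final subtraction can cancel catastrophically when a point is very close to a center; a careful implementation falls back to the direct O(d) computation when the estimated precision gets poor.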
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
Broadcast everything

• Master loads the (small) data file and initializes models.
• Master broadcasts data and initial models.
• At each iteration, updated models are broadcast again.
• Works OK for small data.
• Lots of communication overhead; doesn't scale well.
• Ships with Spark examples.

[Diagram: master holds the ratings plus movie and user factors and broadcasts them to the workers.]
Data parallel

• Workers load data.
• Master broadcasts initial models.
• At each iteration, updated models are broadcast again.
• Much better scaling.
• Works on large datasets.
• Works well for smaller models (low k).

[Diagram: each worker holds a partition of the ratings; the master holds the movie and user factors.]
Fully parallel

• Workers load data.
• Models are instantiated at workers.
• At each iteration, models are shared via join between workers.
• Much better scalability.
• Works on large datasets.

[Diagram: each worker holds a partition of the ratings together with the corresponding movie and user factors.]
Implementation of ALS
• broadcast everything
• data parallel
• fully parallel
• block-wise parallel: users/products are partitioned into blocks, and the join is based on blocks instead of individual users/products.
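A rough sketch of the blocking idea (the `numBlocks` value and hash-based blocking are illustrative; MLlib's real implementation also precomputes which factors each block needs):

```scala
import org.apache.spark.SparkContext._   // pair-RDD implicits (Spark 1.x)
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val numBlocks = 8   // illustrative

// Key ratings and product factors by block id, so the per-iteration join
// shuffles one group per block instead of one record per user/product.
def blockedJoin(
    ratings: RDD[Rating],
    productFactors: RDD[(Int, Array[Double])]   // productId -> factor vector
): RDD[(Int, (Iterable[Rating], Iterable[(Int, Array[Double])]))] = {
  val ratingsByBlock = ratings.groupBy(r => r.product % numBlocks)
  val factorsByBlock = productFactors.groupBy { case (id, _) => id % numBlocks }
  ratingsByBlock.join(factorsByBlock)           // block-granularity join
}
```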
New features for v1.0
• Sparse data support (see the sketch after this list)
• Classification and regression tree (CART)
• Tall-and-skinny SVD and PCA
• L-BFGS
• Model evaluation
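Of these, sparse data support is the most visible API addition; a minimal sketch of building a sparse vector:

```scala
import org.apache.spark.mllib.linalg.Vectors

// Size 5, non-zeros at indices 1 and 3;
// dense equivalent: [0.0, 2.5, 0.0, 4.0, 0.0]
val sv = Vectors.sparse(5, Array(1, 3), Array(2.5, 4.0))
```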
MLlib v1.1?

• Model selection
  • training multiple models in parallel
  • separating problem/algorithm/parameters/model
• Learning algorithms
  • Latent Dirichlet allocation (LDA)
  • Random forests
  • Online updates with Spark Streaming
• Optimization algorithms
  • Alternating direction method of multipliers (ADMM)
  • Accelerated gradient descent
And?
Contributors
Ameet Talwalkar, Andrew Tulloch, Chen Chao, Nan Zhu, DB Tsai, Evan Sparks, Frank Dai, Ginger Smith, Henry Saputra, Holden Karau, Hossein Falaki, Jey Kottalam, Cheng Lian, Marek Kolodziej, Mark Hamstra, Martin Jaggi, Martin Weindel, Matei Zaharia, Nick Pentreath, Patrick Wendell, Prashant Sharma, Reynold Xin, Reza Zadeh, Sandy Ryza, Sean Owen, Shivaram Venkataraman, Tor Myklebust, Xiangrui Meng, Xinghao Pan, Xusen Yin, Jerry Shao, Sandeep Singh, Ryan LeCompte
Interested?

• Website: http://spark.apache.org
• Tutorials: http://ampcamp.berkeley.edu
• Spark Summit: http://spark-summit.org
• Github: https://github.com/apache/spark
• Mailing lists: [email protected] [email protected]