Spark Meetup @ Netflix, 05/19/2015
Transcript of Spark Meetup @ Netflix, 05/19/2015
![Page 1: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/1.jpg)
![Page 2: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/2.jpg)
Spark and GraphX in the Netflix Recommender System
Ehtsham Elahi (@EhtshamElahi) and Yves Raimond (@moustaki)
Algorithms Engineering, Netflix
![Page 3: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/3.jpg)
Machine Learning @ Netflix
![Page 4: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/4.jpg)
Recommendations @ Netflix
● Goal: Help members find content that they’ll enjoy to maximize satisfaction and retention
● Core part of product
○ Every impression is a recommendation
![Page 5: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/5.jpg)
Models & Algorithms
▪ Regression (Linear, logistic, elastic net)
▪ SVD and other Matrix Factorizations
▪ Factorization Machines
▪ Restricted Boltzmann Machines
▪ Deep Neural Networks
▪ Markov Models and Graph Algorithms
▪ Clustering
▪ Latent Dirichlet Allocation
▪ Gradient Boosted Decision Trees/Random Forests
▪ Gaussian Processes
▪ …
![Page 6: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/6.jpg)
Main Challenge - Scale
● Algorithms @ Netflix Scale
○ > 62 M Members
○ > 50 Countries
○ > 1000 device types
○ > 100M Hours / day
● Can distributed Machine Learning algorithms help with Scale?
![Page 7: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/7.jpg)
Spark and GraphX
![Page 8: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/8.jpg)
Spark and GraphX
● Spark - Distributed in-memory computational engine using Resilient Distributed Datasets (RDDs)
● GraphX - extends RDDs to Multigraphs and provides graph analytics
● Convenient and fast, all the way from prototyping (spark-notebook, iSpark, Zeppelin) to production
![Page 9: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/9.jpg)
Two Machine Learning Problems
● Generate ranking of items with respect to a given item from an interaction graph
○ Graph Diffusion algorithms (e.g. Topic Sensitive Pagerank)
● Find Clusters of related items using co-occurrence data
○ Probabilistic Graphical Models (Latent Dirichlet Allocation)
![Page 10: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/10.jpg)
Iterative Algorithms in GraphX
[Figure: example graph with vertices v1, v2, v3, v4, v6 and v7; vertices carry a vertex attribute and edges carry an edge attribute]
![Page 11: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/11.jpg)
Iterative Algorithms in GraphX
GraphX represents the graph as RDDs. e.g. VertexRDD, EdgeRDD
![Page 12: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/12.jpg)
Iterative Algorithms in GraphX
GraphX provides APIs to propagate and update attributes
![Page 13: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/13.jpg)
Iterative Algorithms in GraphX
Iterative Algorithm proceeds by creating updated graphs
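The pattern on the last few slides (attributes on vertices and edges, a propagation step, a new graph per iteration) can be sketched on a single machine in plain Python. This is not GraphX code: the function name `superstep`, the toy graph and the "multiply by the edge attribute, then sum" message rule are all assumptions made for illustration.

```python
from collections import defaultdict

def superstep(vertex_attrs, edges):
    """Run one iteration and return a NEW attribute map; the input is not modified."""
    inbox = defaultdict(float)
    for src, dst, edge_attr in edges:
        # Message sent along the edge: source vertex attribute times edge attribute.
        inbox[dst] += vertex_attrs[src] * edge_attr
    # Vertices that received no message keep their previous attribute.
    return {v: inbox.get(v, old) for v, old in vertex_attrs.items()}

# Hypothetical toy graph: (source, destination, edge attribute).
edges = [("v1", "v2", 0.5), ("v1", "v3", 0.5), ("v2", "v3", 1.0)]
g0 = {"v1": 1.0, "v2": 0.0, "v3": 0.0}

g1 = superstep(g0, edges)   # updated graph after iteration 1
g2 = superstep(g1, edges)   # updated graph after iteration 2
```

Each call leaves the previous attribute map untouched, which mirrors the idea of an iterative algorithm that proceeds by creating updated graphs rather than mutating one in place.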
![Page 14: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/14.jpg)
Graph Diffusion algorithms
![Page 15: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/15.jpg)
Topic Sensitive Pagerank @ Netflix
● Popular graph diffusion algorithm
● Captures vertex importance with regard to a particular vertex
● e.g. for the topic “Seattle”
![Page 16: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/16.jpg)
Iteration 0
We start by activating a single node
[Figure: graph around the activated “Seattle” node, with edges labeled “related to”, “shot in”, “featured in” and “cast”]
![Page 17: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/17.jpg)
Iteration 1
With some probability, we follow outbound edges, otherwise we go back to the origin.
![Page 18: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/18.jpg)
Iteration 2
Vertex accumulates higher mass
![Page 19: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/19.jpg)
Iteration 2
And again, until convergence
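The walk described in iterations 0 to 2 can be written as a small single-source power iteration. The sketch below is plain Python, not the Netflix implementation; the vertex names other than “Seattle”, the value `follow_prob=0.85` and the rule that dead-end vertices send all their mass back to the origin are assumptions.

```python
def topic_sensitive_pagerank(out_edges, origin, follow_prob=0.85,
                             tol=1e-10, max_iter=1000):
    """Random walk with restart: follow an outbound edge with probability
    follow_prob, otherwise jump back to the origin vertex."""
    vertices = list(out_edges)
    mass = {v: 0.0 for v in vertices}
    mass[origin] = 1.0                      # iteration 0: activate a single node
    for _ in range(max_iter):
        new_mass = {v: 0.0 for v in vertices}
        for v, targets in out_edges.items():
            if targets:
                share = follow_prob * mass[v] / len(targets)
                for t in targets:
                    new_mass[t] += share
                new_mass[origin] += (1.0 - follow_prob) * mass[v]
            else:
                new_mass[origin] += mass[v]  # assumed handling of dead ends
        delta = sum(abs(new_mass[v] - mass[v]) for v in vertices)
        mass = new_mass
        if delta < tol:                      # "until convergence"
            break
    return mass

# Hypothetical graph: every target must also appear as a key.
graph = {
    "Seattle": ["title_a", "actor_b"],
    "title_a": ["actor_b"],
    "actor_b": ["Seattle"],
}
ranks = topic_sensitive_pagerank(graph, origin="Seattle")
```

The total mass stays at 1 in every iteration, so the result can be read as a probability distribution over vertices for the chosen topic.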
![Page 20: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/20.jpg)
GraphX implementation
● Running one propagation for each possible starting node would be slow
● Keep a vector of activation probabilities at each vertex
● Use GraphX to run all propagations in parallel
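The idea of keeping one vector per vertex can be illustrated without Spark. In the sketch below, entry `i` of a vertex's vector is its activation probability for the walk that started at vertex `i`, and a single pass over the edges advances every walk at once. The function name, the toy graph, `follow_prob=0.85`, the fixed iteration count and the dead-end rule are assumptions, and this single-machine loop only mimics what a distributed implementation would do in parallel.

```python
def all_sources_pagerank(out_edges, follow_prob=0.85, num_iter=100):
    vertices = sorted(out_edges)
    n = len(vertices)
    # attrs[v][i]: activation probability of v for the walk started at vertices[i].
    attrs = {v: [1.0 if vertices[i] == v else 0.0 for i in range(n)]
             for v in vertices}
    for _ in range(num_iter):
        new_attrs = {v: [0.0] * n for v in vertices}
        for v, targets in out_edges.items():
            vec = attrs[v]
            for i in range(n):
                origin = vertices[i]
                if targets:
                    share = follow_prob * vec[i] / len(targets)
                    for t in targets:
                        new_attrs[t][i] += share
                    new_attrs[origin][i] += (1.0 - follow_prob) * vec[i]
                else:
                    new_attrs[origin][i] += vec[i]  # assumed handling of dead ends
        attrs = new_attrs
    return vertices, attrs

# Hypothetical graph: every target must also appear as a key.
graph = {
    "Seattle": ["title_a", "actor_b"],
    "title_a": ["actor_b"],
    "actor_b": ["Seattle"],
}
vertices, attrs = all_sources_pagerank(graph)
```

The vector length equals the number of starting vertices, so the data carried per vertex and per message grows with the number of propagations run together.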
![Page 21: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/21.jpg)
Topic Sensitive Pagerank in GraphX
Activation probabilities as vertex attributes
[Figure: each vertex holds a vector whose entries are the activation probability starting from vertex 1, the activation probability starting from vertex 2, the activation probability starting from vertex 3, ...]
![Page 22: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/22.jpg)
Example graph diffusion results
“Matrix”
“Zombies”
“Seattle”
![Page 23: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/23.jpg)
Distributed Clustering algorithms
![Page 24: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/24.jpg)
LDA @ Netflix
● A popular clustering/latent factors model
● Discovers clusters/topics of related videos from Netflix data
● e.g., a topic of Animal Documentaries
![Page 25: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/25.jpg)
LDA - Graphical Model
Per-topic word distributions
Per-document topic distributions
Topic label for document d and word w
![Page 26: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/26.jpg)
LDA - Graphical Model
Question: How to parallelize inference?
![Page 27: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/27.jpg)
LDA - Graphical Model
Question: How to parallelize inference?
Answer: Read conditional independencies in the model
![Page 28: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/28.jpg)
Gibbs Sampler 1 (Semi Collapsed)
![Page 29: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/29.jpg)
Gibbs Sampler 1 (Semi Collapsed)
Sample Topic Labels in a given document: sequentially
Sample Topic Labels in different documents: in parallel
![Page 30: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/30.jpg)
Gibbs Sampler 2 (UnCollapsed)
![Page 31: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/31.jpg)
Gibbs Sampler 2 (UnCollapsed)
Sample Topic Labels in a given document: in parallel
Sample Topic Labels in different documents: in parallel
![Page 32: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/32.jpg)
Gibbs Sampler 2 (UnCollapsed)
Suitable For GraphX
Sample Topic Labels in a given document: in parallel
Sample Topic Labels in different documents: in parallel
![Page 33: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/33.jpg)
Distributed Gibbs Sampler
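[Figure: bipartite graph linking word vertices w1, w2, w3 to document vertices d1, d2; each vertex carries one value per topic]
w1: 0.3 0.4 0.1
w2: 0.3 0.2 0.8
w3: 0.4 0.4 0.1
d1: 0.3 0.6 0.1
d2: 0.2 0.5 0.3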
A distributed parameterized graph for LDA with 3 Topics
![Page 34: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/34.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph with 3 topics, highlighting the document vertices d1 and d2]
![Page 35: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/35.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph with 3 topics, highlighting the word vertices w1, w2 and w3]
![Page 36: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/36.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph with 3 topics, highlighting the edges]
Edge: present if the word appeared in the document
![Page 37: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/37.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph with 3 topics, highlighting the document vertex attributes]
Per-document topic distribution
![Page 38: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/38.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph with 3 topics, highlighting the word vertex attributes]
Per-topic word distributions
![Page 39: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/39.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph, highlighting one word vertex, one document vertex and the edge between them]
(vertex, edge, vertex) = triplet
![Page 40: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/40.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph, highlighting one triplet]
Categorical distribution for the triplet using vertex attributes
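A minimal sketch of how a triplet's categorical distribution could be formed from its two vertex attributes. It assumes the usual uncollapsed-Gibbs conditional, in which the probability of topic k is proportional to the document's weight for k times the word's weight for k; the slides show this only as a figure. The vectors for w1 and d1 are read from the flattened figure, and their grouping into per-vertex triples is an assumption.

```python
def triplet_topic_distribution(word_attr, doc_attr):
    """Normalize the element-wise product of the two vertex attributes."""
    weights = [w * d for w, d in zip(word_attr, doc_attr)]
    total = sum(weights)
    return [x / total for x in weights]

# Values assumed to belong to word vertex w1 and document vertex d1.
w1 = [0.3, 0.4, 0.1]
d1 = [0.3, 0.6, 0.1]
probs = triplet_topic_distribution(w1, d1)
```

Only the attributes of the two endpoint vertices are needed, which is why every triplet can be handled independently of the others.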
![Page 41: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/41.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph, highlighting every triplet]
Categorical distributions for all triplets
![Page 42: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/42.jpg)
Distributed Gibbs Sampler
[Figure: the same LDA graph, with a sampled topic label on each of the four edges: 1, 1, 2, 0]
Sample topics for all edges
![Page 43: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/43.jpg)
Distributed Gibbs Sampler
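[Figure: the same graph, with each vertex attribute replaced by a histogram of the topic labels on its edges]
w1: 0 1 0
w2: 0 1 1
w3: 1 0 0
d1: 0 2 0
d2: 1 0 1
Topic labels on the four edges: 1, 1, 2, 0
Neighborhood aggregation for topic histograms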
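The two steps "sample topics for all edges" and "neighborhood aggregation" can be sketched in plain Python as follows. This is an illustration, not GraphX code. The edge list (w1-d1, w2-d1, w2-d2, w3-d2) is an assumption inferred from the histograms in the figure, the function names are hypothetical, and storing one label per (word, document) pair ignores repeated words, which a multigraph would represent as parallel edges.

```python
import random

def sample_edge_topics(edges, word_attrs, doc_attrs, rng):
    """Draw one topic label per edge from its triplet's categorical distribution."""
    topics = {}
    for word, doc in edges:
        weights = [w * d for w, d in zip(word_attrs[word], doc_attrs[doc])]
        topics[(word, doc)] = rng.choices(range(len(weights)), weights=weights)[0]
    return topics

def topic_histograms(edge_topics, vertices, num_topics):
    """Each vertex counts the topic labels found on its incident edges."""
    hist = {v: [0] * num_topics for v in vertices}
    for (word, doc), k in edge_topics.items():
        hist[word][k] += 1
        hist[doc][k] += 1
    return hist

# Vertex attributes as read from the figure (grouping is an assumption).
word_attrs = {"w1": [0.3, 0.4, 0.1], "w2": [0.3, 0.2, 0.8], "w3": [0.4, 0.4, 0.1]}
doc_attrs = {"d1": [0.3, 0.6, 0.1], "d2": [0.2, 0.5, 0.3]}
edges = [("w1", "d1"), ("w2", "d1"), ("w2", "d2"), ("w3", "d2")]
vertices = list(word_attrs) + list(doc_attrs)

# A fresh random draw of labels for every edge.
sampled = sample_edge_topics(edges, word_attrs, doc_attrs, random.Random(0))

# The labels shown on the slide (1, 1, 2, 0), placed on the assumed edges.
slide_labels = dict(zip(edges, [1, 1, 2, 0]))
hist = topic_histograms(slide_labels, vertices, num_topics=3)
```

With the slide's labels on the assumed edges, the aggregation gives the same per-vertex histograms as the figure.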
![Page 44: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/44.jpg)
Distributed Gibbs Sampler
[Figure: the same graph with resampled vertex attributes]
w1: 0.1 0.4 0.3
w2: 0.1 0.4 0.4
w3: 0.8 0.2 0.3
d1: 0.1 0.8 0.1
d2: 0.45 0.1 0.45
Realize samples from Dirichlet to update the graph
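One way the final step could look, as a sketch under stated assumptions: each document vertex draws a new topic distribution from a Dirichlet whose parameters are a prior plus its topic histogram, and for each topic a distribution over words is drawn from a Dirichlet built from the word histograms. The symmetric priors `alpha` and `beta`, their values and the function names are hypothetical; the Dirichlet draw uses the standard normalized-gamma construction from Python's `random` module.

```python
import random

def sample_dirichlet(concentration, rng):
    """Draw from a Dirichlet by normalizing independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in concentration]
    total = sum(gammas)
    return [g / total for g in gammas]

def resample_parameters(word_hist, doc_hist, alpha, beta, rng):
    num_topics = len(next(iter(doc_hist.values())))
    # Per-document topic distribution: Dirichlet(alpha + topic counts of the document).
    doc_attrs = {d: sample_dirichlet([alpha + c for c in counts], rng)
                 for d, counts in doc_hist.items()}
    # Per-topic word distribution: one Dirichlet draw over all words for each topic,
    # then scattered back so each word vertex holds one value per topic.
    words = sorted(word_hist)
    word_attrs = {w: [0.0] * num_topics for w in words}
    for k in range(num_topics):
        phi_k = sample_dirichlet([beta + word_hist[w][k] for w in words], rng)
        for w, p in zip(words, phi_k):
            word_attrs[w][k] = p
    return word_attrs, doc_attrs

# Topic histograms as read from the previous figure.
word_hist = {"w1": [0, 1, 0], "w2": [0, 1, 1], "w3": [1, 0, 0]}
doc_hist = {"d1": [0, 2, 0], "d2": [1, 0, 1]}
rng = random.Random(0)
word_attrs, doc_attrs = resample_parameters(word_hist, doc_hist,
                                            alpha=0.5, beta=0.5, rng=rng)
```

As in the figure, each document vector sums to 1, and for each topic the values across the word vertices sum to 1.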
![Page 45: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/45.jpg)
Example LDA Results
Cluster of Bollywood Movies
Cluster of Kids shows
Cluster of Western movies
![Page 46: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/46.jpg)
GraphX performance comparison
![Page 47: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/47.jpg)
Algorithm Implementations
● Topic Sensitive Pagerank
○ Distributed GraphX implementation
○ Alternative Implementation: Broadcast graph adjacency matrix, Scala/Breeze code, triggered by Spark
● LDA
○ Distributed GraphX implementation
○ Alternative Implementation: Single machine, Multi-threaded Java code
● All implementations are Netflix internal code
![Page 48: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/48.jpg)
Performance Comparison
![Page 49: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/49.jpg)
Performance Comparison
Open Source DBPedia dataset
![Page 50: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/50.jpg)
Performance Comparison
Sublinear rise in time with GraphX Vs Linear rise in the Alternative
![Page 51: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/51.jpg)
Performance Comparison
Doubling the size of the cluster: 2.0x speedup in the Alternative Impl vs. 1.2x in GraphX
![Page 52: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/52.jpg)
Performance Comparison
Propagating a large number of vertices in parallel leads to large shuffle data, causing failures in GraphX on small clusters
![Page 53: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/53.jpg)
Performance Comparison
Netflix dataset
Number of Topics = 100
![Page 54: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/54.jpg)
Performance Comparison
GraphX setup: 8x the resources of the Multi-Core setup
![Page 55: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/55.jpg)
Performance Comparison
Wikipedia dataset, 100 Topic LDA
Cluster: (16 x r3.2xl)
(source: Databricks)
![Page 56: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/56.jpg)
Performance Comparison
For very large datasets, GraphX outperforms the multi-core Uncollapsed implementation
![Page 57: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/57.jpg)
Lessons Learned
![Page 58: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/58.jpg)
What we learned so far...
● Where is the cross-over point for your iterative ML algorithm?
○ GraphX brings performance benefits if you’re on the right side of that point
○ GraphX lets you easily throw more hardware at a problem
● GraphX very useful (and fast) for other graph processing tasks
○ Data pre-processing
○ Efficient joins
![Page 59: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/59.jpg)
What we learned so far ...
● Regularly save the state
○ With a 99.9% success rate, what’s the probability of successfully running 1,000 iterations?
● Multi-Core Machine learning (r3.8xl, 32 threads, 220 GB) is very efficient
○ if your data fits in the memory of a single machine!
![Page 60: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/60.jpg)
What we learned so far ...
● Regularly save the state
○ With a 99.9% success rate per iteration, what’s the probability of successfully running 1,000 iterations?
○ ~37%
● Multi-Core Machine learning (r3.8xl, 32 threads, 220 GB) is very efficient
○ if your data fits in the memory of a single machine!
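The figure for 1,000 iterations follows from a one-line calculation, shown here as a small sketch. It assumes the 99.9% success rate applies to each iteration and that failures are independent across iterations. The checkpoint interval of 100 iterations is a hypothetical value, used only to show why saving state helps.

```python
per_iteration_success = 0.999
iterations = 1000

# Probability that all 1,000 iterations succeed with no saved state.
p_full_run = per_iteration_success ** iterations        # about 0.37

# With state saved every 100 iterations (hypothetical interval), a failure only
# costs the current segment, so the relevant figure is the per-segment probability.
checkpoint_interval = 100
p_segment = per_iteration_success ** checkpoint_interval  # about 0.90
```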
![Page 61: Spark Meetup @ Netflix, 05/19/2015](https://reader031.fdocuments.net/reader031/viewer/2022020218/55b6d3afbb61eb93418b48b1/html5/thumbnails/61.jpg)
We’re hiring! (come talk to us)
https://jobs.netflix.com/