Machine Learning in the Cloud with GraphLab
Transcript of Machine Learning in the Cloud with GraphLab
Danny Bickson
Machine Learning in the Cloud with GraphLab
Applied Machine Learning Day, January 20, 2014, MS
Needless to Say, We Need Machine Learning for Big Data
72 Hours a Minute: YouTube
28 Million Wikipedia Pages
1 Billion Facebook Users
6 Billion Flickr Photos
“… data a new class of economic asset, like currency or gold.”
How will we design and implement
parallel learning systems?
Big Learning
A Shift Towards Parallelism
GPUs Multicore Clusters Clouds Supercomputers
! ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication…
! The resulting code is difficult to maintain, extend, debug…
Graduate students
Avoid these problems by using high-level abstractions
MapReduce for Data-Parallel ML
! Excellent for large data-parallel tasks!
Data-Parallel:
Cross Validation
Feature Extraction
MapReduce
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Graph Analysis: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
Is there more to Machine Learning?
Carnegie Mellon University
The Power of Dependencies
where the value is!
Label a Face and Propagate
Pairwise similarity not enough…
Not similar enough to be sure
Propagate Similarities & Co-occurrences for Accurate Predictions
similarity edges
co-occurring faces
further evidence
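The propagation idea on these slides can be sketched as a small label-propagation loop. This is a plain-Python illustration, not GraphLab code; the toy graph, the weights, and the `label_propagation` helper are all made up for the example:

```python
# Minimal sketch: iterative label propagation over a weighted similarity graph.
# Known labels stay fixed; unknown vertices take the weighted average of their
# neighbors' label scores, so evidence flows along similarity edges.

def label_propagation(edges, labels, n, iters=50):
    """edges: {(i, j): weight}, undirected; labels: {vertex: 0.0 or 1.0}."""
    score = [labels.get(v, 0.5) for v in range(n)]      # 0.5 = unknown
    neighbors = {v: [] for v in range(n)}
    for (i, j), w in edges.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    for _ in range(iters):
        for v in range(n):
            if v in labels:                              # keep known labels fixed
                continue
            nbrs = neighbors[v]
            total = sum(w for _, w in nbrs)
            if total > 0:
                score[v] = sum(w * score[u] for u, w in nbrs) / total
    return score

# Toy example: vertex 0 is a labeled face; vertex 2 is only connected to it
# indirectly, via the similarity/co-occurrence edge through vertex 1.
edges = {(0, 1): 1.0, (1, 2): 1.0}
scores = label_propagation(edges, {0: 1.0}, n=3)
```

Vertex 2 is never directly compared to the labeled face, yet it ends up confidently labeled because evidence propagates through vertex 1.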
Collaborative Filtering: Independent Case
Lord of the Rings
Star Wars IV
Star Wars I
Harry Potter
Pirates of the Caribbean
Collaborative Filtering: Exploiting Dependencies
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend???
Data
Machine Learning Pipeline
images
docs
movie ratings
Extract Features
faces
important words
side info
Graph Formation
similar faces
shared words
rated movies
Structured Machine Learning Algorithm
belief propagation
LDA
collaborative filtering
Value from Data
face labels
doc topics
movie recommend.
Data
Parallelizing Machine Learning
Extract Features
Graph Formation
Structured Machine Learning Algorithm
Value from Data
Graph Ingress: mostly data-parallel
Graph-Structured Computation: graph-parallel
ML Tasks Beyond Data-Parallelism
Data-Parallel:
Cross Validation
Feature Extraction
MapReduce
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Graph Analysis: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
Example of a Graph-Parallel Algorithm
PageRank
What’s the rank of this user?
Rank?
Depends on rank of who follows her
Depends on rank of who follows them…
Loops in graph ⇒ must iterate!
PageRank Iteration
! α is the random reset probability
! w_ji is the probability of transitioning (similarity) from j to i
Iterate until convergence:
R[i] = α + (1 − α) Σ_{(j,i)∈E} w_ji R[j]
“My rank is the weighted average of my friends’ ranks”
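As a concrete illustration, the iteration above can be written in a few lines of plain Python. This is a sequential sketch, not the GraphLab implementation; the edge dictionary is a made-up toy graph:

```python
# PageRank iteration: R[i] = alpha + (1 - alpha) * sum over in-edges (j, i)
# of w_ji * R[j], repeated until convergence.

def pagerank(edges, n, alpha=0.15, iters=100):
    """edges: {(j, i): w_ji}, with the out-weights of each vertex j summing to 1."""
    R = [1.0] * n
    for _ in range(iters):
        new_R = [alpha] * n
        for (j, i), w in edges.items():
            new_R[i] += (1 - alpha) * w * R[j]
        R = new_R
    return R

# Tiny 3-vertex cycle: each vertex has a single out-edge with weight 1, so by
# symmetry every rank converges to the same value.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0}
R = pagerank(edges, n=3)
```

On this symmetric cycle the fixed point satisfies r = α + (1 − α)r, i.e. every rank converges to 1.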
Properties of Graph-Parallel Algorithms
Dependency Graph
Iterative Computation
My Rank
Friends Rank
Local Updates
Addressing Graph-Parallel ML
Data-Parallel:
Cross Validation
Feature Extraction
MapReduce
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Data-Mining: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
MapReduce? Graph-Parallel Abstraction
Data Graph: data associated with vertices and edges
Vertex Data: • User profile text • Current interests estimates
Edge Data: • Similarity weights
Graph: • Social network
How do we program graph computation?
“Think like a Vertex.” – Malewicz et al. [SIGMOD’10]
pagerank(i, scope) {
  // Get neighborhood data
  (R[i], w_ji, R[j]) ← scope;

  // Update the vertex data
  R[i] ← α + (1 − α) Σ_{j∈N[i]} w_ji × R[j];

  // Reschedule neighbors if needed
  if R[i] changed then reschedule_neighbors_of(i);
}
Update Functions
User-defined program: applied to a vertex, transforms data in the scope of the vertex
Dynamic computation
Update function applied (asynchronously) in parallel until convergence
Many schedulers available to prioritize computation
The GraphLab Framework
Scheduler; Consistency Model
Graph-Based Data Representation
Update Functions (User Computation)
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
K-Means
SVD
LDA
Linear Solvers
Splash Sampler
Alternating Least Squares
…Many others…
Never Ending Learner Project (CoEM)
Hadoop: 95 cores, 7.5 hrs
Distributed GraphLab: 32 EC2 machines, 80 secs
0.3% of Hadoop time
2 orders of magnitude faster ⇒ 2 orders of magnitude cheaper
GraphLab 1 provided exciting scaling performance
But… thus far…
We couldn’t scale up to the AltaVista Webgraph (2002): 1.4B vertices, 6.7B edges
Natural Graphs
[Image from WikiCommons]
Problem: Existing distributed graph computation systems perform poorly on natural graphs
Achilles Heel: the Idealized Graph Assumption
Assumed: small degree ⇒ easy to partition
But natural graphs: many high-degree vertices (power-law degree distribution) ⇒ very hard to partition
Power-Law Degree Distribution
[Log-log plot: number of vertices vs. degree, AltaVista WebGraph: 1.4B vertices, 6.6B edges]
High-Degree Vertices: 1% of vertices are adjacent to 50% of edges
High-Degree Vertices are Common
Netflix: Users and Movies
“Social” people; popular movies
[LDA plate diagram: hyperparameters α and B; per-document topic mixtures θ; topic assignments Z and observed words w across docs and words]
LDA: common words (e.g., “Obama”) are high-degree vertices
Power-Law Graphs are Difficult to Partition
! Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
! Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]
CPU 1 CPU 2
Machine 1 Machine 2
! Split high-degree vertices
! New abstraction → leads to this split-vertex strategy
Program For This
Run on This
GraphLab 2 Solution
GAS Decomposition
Gather (Reduce): accumulate information about the neighborhood, a parallel “sum” (Σ)
Apply: apply the accumulated value to the center vertex
Scatter: update adjacent edges and vertices
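A rough sketch of the Gather-Apply-Scatter decomposition, using PageRank as in the earlier slides. It is sequential plain Python, not the GraphLab 2 API; the data layout (`in_edges`, `out_degree`) is assumed for illustration:

```python
# GAS-style PageRank: the per-vertex work is split into a gather (an
# associative "sum" over in-neighbors, parallelizable even for a single
# high-degree vertex), an apply step, and a scatter step.

def gas_pagerank(in_edges, out_degree, n, alpha=0.15, iters=50):
    """in_edges: {i: [j, ...]} in-neighbors; out_degree: {j: d_j}."""
    R = [1.0] * n
    for _ in range(iters):
        new_R = []
        for i in range(n):
            # Gather: accumulate R[j] / d_j over in-neighbors (parallel "sum")
            acc = sum(R[j] / out_degree[j] for j in in_edges.get(i, []))
            # Apply: combine the accumulator with the vertex data
            new_R.append(alpha + (1 - alpha) * acc)
            # Scatter would go here: signal neighbors along changed edges
        R = new_R
    return R

# Toy 3-cycle, each vertex with out-degree 1: all ranks converge to 1.
in_edges = {0: [2], 1: [0], 2: [1]}
out_degree = {0: 1, 1: 1, 2: 1}
R = gas_pagerank(in_edges, out_degree, n=3)
```

Because the gather is just an associative accumulation, the runtime can split a high-degree vertex's neighborhood across machines and combine partial sums, which is the point of the decomposition.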
GraphChi: Going small with GraphLab
Solve huge problems on small or embedded devices?
Key: exploit non-volatile memory (starting with SSDs and HDs)
GraphChi – disk-based GraphLab
Challenge: random accesses
Novel GraphChi solution: the Parallel Sliding Windows method ⇒ minimizes the number of random accesses
GraphChi – disk-based GraphLab
Novel Parallel Sliding Windows algorithm
! Fast! Solves tasks as large as current distributed systems
! Minimizes non-sequential disk accesses; efficient on both SSD and hard drive
! Parallel, asynchronous execution
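A toy illustration of the sharding idea behind parallel sliding windows. This is heavily simplified plain Python: real GraphChi stores shards on disk and reads each window sequentially, whereas here in-memory lists stand in for disk files, and the helper names are made up:

```python
# Vertices are split into P intervals; edges are stored in P shards, where
# shard p holds all edges whose destination lies in interval p, sorted by
# source. Processing interval p then needs its own shard (in-edges) plus one
# contiguous "window" per other shard (out-edges), so each shard is read
# sequentially rather than via random access.

def build_shards(edges, intervals):
    """edges: [(src, dst)]; intervals: [(lo, hi)] covering the vertex ids."""
    shards = []
    for lo, hi in intervals:
        shard = sorted(e for e in edges if lo <= e[1] <= hi)  # sorted by source
        shards.append(shard)
    return shards

def edges_for_interval(shards, intervals, p):
    """All edges touching interval p."""
    lo, hi = intervals[p]
    result = list(shards[p])                # in-edges of interval p
    for q, shard in enumerate(shards):
        if q != p:
            # the edges with source in [lo, hi] form one contiguous run in
            # the source-sorted shard: the "sliding window"
            result += [e for e in shard if lo <= e[0] <= hi]
    return result

edges = [(0, 2), (1, 3), (2, 0), (3, 1), (0, 1)]
intervals = [(0, 1), (2, 3)]
shards = build_shards(edges, intervals)
sub = edges_for_interval(shards, intervals, 0)
```

As execution advances from one interval to the next, the window into each shard slides forward, which is where the method gets its name.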
Sample Results: Triangle Counting and Belief Propagation
Triangle counting, Twitter graph (1.5B edges): Hadoop on 1600 nodes [1] vs. GraphChi on 1 Mac Mini (minutes)
Belief propagation, AltaVista graph (6.7B edges): Hadoop on 100 machines [2] vs. GraphChi on 1 Mac Mini (minutes)
[1] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. WWW 2011.
[2] U. Kang, D. H. Chau, and C. Faloutsos. Inference of beliefs on billion-scale graphs. KDD-LDMTA’10, pages 1–7, June 2010.
Triangle Counting on the Twitter Graph: 40M users, 1.2B edges
Total: 34.8 billion triangles
Hadoop: 1636 machines, 423 minutes (results from [Suri & Vassilvitskii, WWW ’11])
GraphLab 2: 64 machines (1024 cores), 1.5 minutes
GraphChi: 1 Mac Mini, 59 minutes!
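For reference, the task being benchmarked can be stated in a few lines of plain Python. This is a single-machine sketch of the definition, nothing like the distributed implementations compared above:

```python
# Triangle counting on an undirected graph: for every edge (u, v), the common
# neighbors of u and v each close a triangle. Summing over all edges counts
# every triangle three times (once per edge), hence the division by 3.

def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

# Toy graph: vertices 0-1-2 form one triangle; edge (2, 3) closes none.
tri = count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])
```

The set intersections over each edge's endpoints are exactly the per-edge work that the "curse of the last reducer" paper shows is dominated by high-degree vertices.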
Efficient Multicore Collaborative Filtering
LeBuSiShu team – 5th place in Track 1, ACM KDD CUP 2011
Institute of Automation, Chinese Academy of Sciences
Machine Learning Dept., Carnegie Mellon University
ACM KDD CUP Workshop 2011
Yao Wu Qiang Yan Danny Bickson Yucheng Low Qing Yang
Netflix Collaborative Filtering
! Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges
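A minimal rank-1 sketch of Alternating Least Squares in plain Python, just to show the alternating structure; the GraphLab version factorizes the sparse user-movie graph at higher rank, and `als_rank1` and the toy ratings here are made up:

```python
# Rank-1 ALS: model each rating as r_ij ~ u[i] * v[j]. Fixing v makes each
# u[i] a one-dimensional regularized least-squares problem (and vice versa),
# so the two closed-form sweeps are alternated until convergence.

def als_rank1(ratings, n_users, n_items, lam=0.01, iters=50):
    """ratings: {(user, item): value}. Returns factors u, v."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        for i in range(n_users):                  # fix v, solve for u[i]
            num = sum(v[j] * r for (a, j), r in ratings.items() if a == i)
            den = sum(v[j] ** 2 for (a, j), r in ratings.items() if a == i)
            u[i] = num / (den + lam)
        for j in range(n_items):                  # fix u, solve for v[j]
            num = sum(u[i] * r for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), r in ratings.items() if b == j)
            v[j] = num / (den + lam)
    return u, v

# Toy ratings generated from an exact rank-1 model (r_ij = a_i * b_j),
# so the predictions u[i] * v[j] should recover the observed values.
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 6.0}
u, v = als_rank1(ratings, n_users=2, n_items=2)
```

Each per-user solve only touches that user's rated movies, which is why ALS maps naturally onto a vertex update function over the bipartite graph.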
[Plot: runtime (s, log scale) vs. number of nodes (4–64), comparing Hadoop, MPI, and GraphLab]
Intel Labs Report on GraphLab
Data source: Nezih Yigitbasi, Intel Labs
ACM KDD CUP 2012
GraphLab team @ WSDM 13
Future Plans
Learn: GraphLab Notebook
Prototype: pip install graphlab ⇒ local prototyping
Production: the same code scales – execute on an EC2 cluster
Future Plans
GraphLab Internship Plan
GraphLab Conferences
2012 ⇒ 2013