Machine Learning in the Cloud with GraphLab
Transcript of Machine Learning in the Cloud with GraphLab
Danny Bickson
Machine Learning in the Cloud with GraphLab
Applied Machine Learning Day, January 20, 2014, MS
Needless to Say, We Need Machine Learning for Big Data
72 Hours a Minute: YouTube
28 Million Wikipedia Pages
1 Billion Facebook Users
6 Billion Flickr Photos
“… data a new class of economic asset, like currency or gold.”
How will we design and implement
parallel learning systems?
Big Learning
A Shift Towards Parallelism
GPUs Multicore Clusters Clouds Supercomputers
! ML experts repeatedly solve the same parallel design challenges: race conditions, distributed state, communication…
! The resulting code is difficult to maintain, extend, debug…
Graduate students
Avoid these problems by using high-level abstractions
MapReduce for Data-Parallel ML
! Excellent for large data-parallel tasks!
Data-Parallel:
Cross Validation
Feature Extraction
MapReduce
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Graph Analysis: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
Is there more to Machine Learning?
Carnegie Mellon University
The Power of Dependencies
where the value is!
Label a Face and Propagate
Pairwise similarity not enough…
Not similar enough to be sure
Propagate Similarities & Co-occurrences for Accurate Predictions
similarity edges
co-occurring faces
further evidence
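The propagation idea on these slides can be sketched as a small label-propagation loop. This is a plain-Python illustration, not GraphLab code; the toy graph, the weights, and the `label_propagation` helper are all made up for the example:

```python
# Minimal sketch: iterative label propagation over a weighted similarity graph.
# Known labels stay fixed; unknown vertices take the weighted average of their
# neighbors' label scores, so evidence flows along similarity edges.

def label_propagation(edges, labels, n, iters=50):
    """edges: {(i, j): weight}, undirected; labels: {vertex: 0.0 or 1.0}."""
    score = [labels.get(v, 0.5) for v in range(n)]      # 0.5 = unknown
    neighbors = {v: [] for v in range(n)}
    for (i, j), w in edges.items():
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))
    for _ in range(iters):
        for v in range(n):
            if v in labels:                              # keep known labels fixed
                continue
            nbrs = neighbors[v]
            total = sum(w for _, w in nbrs)
            if total > 0:
                score[v] = sum(w * score[u] for u, w in nbrs) / total
    return score

# Toy example: vertex 0 is a labeled face; vertex 2 is only connected to it
# indirectly, via the similarity/co-occurrence edge through vertex 1.
edges = {(0, 1): 1.0, (1, 2): 1.0}
scores = label_propagation(edges, {0: 1.0}, n=3)
```

Vertex 2 is never directly compared to the labeled face, yet it ends up confidently labeled because evidence propagates through vertex 1.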
Collaborative Filtering: Independent Case
Lord of the Rings
Star Wars IV
Star Wars I
Harry Potter
Pirates of the Caribbean
Collaborative Filtering: Exploiting Dependencies
City of God
Wild Strawberries
The Celebration
La Dolce Vita
Women on the Verge of a Nervous Breakdown
What do I recommend???
Data
Machine Learning Pipeline
images
docs
movie ratings
Extract Features
faces
important words
side info
Graph Formation
similar faces
shared words
rated movies
Structured Machine Learning Algorithm
belief propagation
LDA
collaborative filtering
Value from Data
face labels
doc topics
movie recommend.
Data
Parallelizing Machine Learning
Extract Features
Graph Formation
Structured Machine Learning Algorithm
Value from Data
Graph Ingress: mostly data-parallel
Graph-Structured Computation: graph-parallel
ML Tasks Beyond Data-Parallelism
Data-Parallel:
Cross Validation
Feature Extraction
MapReduce
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Graph Analysis: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
Example of a Graph-Parallel Algorithm
PageRank
What’s the rank of this user?
Rank?
Depends on rank of who follows her
Depends on rank of who follows them…
Loops in graph ⇒ must iterate!
PageRank Iteration
! α is the random reset probability
! w_ji is the probability of transitioning (similarity) from j to i
Iterate until convergence:
R[i] = α + (1 − α) Σ_{(j,i)∈E} w_ji R[j]
“My rank is the weighted average of my friends’ ranks”
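As a concrete illustration, the iteration above can be written in a few lines of plain Python. This is a sequential sketch, not the GraphLab implementation; the edge dictionary is a made-up toy graph:

```python
# PageRank iteration: R[i] = alpha + (1 - alpha) * sum over in-edges (j, i)
# of w_ji * R[j], repeated until convergence.

def pagerank(edges, n, alpha=0.15, iters=100):
    """edges: {(j, i): w_ji}, with the out-weights of each vertex j summing to 1."""
    R = [1.0] * n
    for _ in range(iters):
        new_R = [alpha] * n
        for (j, i), w in edges.items():
            new_R[i] += (1 - alpha) * w * R[j]
        R = new_R
    return R

# Tiny 3-vertex cycle: each vertex has a single out-edge with weight 1, so by
# symmetry every rank converges to the same value.
edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 0): 1.0}
R = pagerank(edges, n=3)
```

On this symmetric cycle the fixed point satisfies r = α + (1 − α)r, i.e. every rank converges to 1.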
Properties of Graph-Parallel Algorithms
Dependency Graph
Iterative Computation
My Rank
Friends Rank
Local Updates
Addressing Graph-Parallel ML
Data-Parallel:
Cross Validation
Feature Extraction
MapReduce
Computing Sufficient Statistics
Graph-Parallel:
Graphical Models: Gibbs Sampling, Belief Propagation, Variational Opt.
Semi-Supervised Learning: Label Propagation, CoEM
Data-Mining: PageRank, Triangle Counting
Collaborative Filtering: Tensor Factorization
MapReduce? Graph-Parallel Abstraction
Data Graph: data associated with vertices and edges
Vertex Data: • User profile text • Current interests estimates
Edge Data: • Similarity weights
Graph: • Social network
How do we program graph computation?
“Think like a Vertex.” – Malewicz et al. [SIGMOD’10]
pagerank(i, scope) {
  // Get neighborhood data
  (R[i], w_ji, R[j]) ← scope;

  // Update the vertex data
  R[i] ← α + (1 − α) Σ_{j∈N[i]} w_ji × R[j];

  // Reschedule neighbors if needed
  if R[i] changed then reschedule_neighbors_of(i);
}
Update Functions
User-defined program: applied to a vertex, transforms data in the scope of the vertex
Dynamic computation
Update function applied (asynchronously) in parallel until convergence
Many schedulers available to prioritize computation
The GraphLab Framework
Scheduler; Consistency Model
Graph-Based Data Representation
Update Functions (User Computation)
Bayesian Tensor Factorization
Gibbs Sampling
Dynamic Block Gibbs Sampling
Matrix Factorization
Lasso
SVM
Belief Propagation
PageRank
CoEM
K-Means
SVD
LDA
Linear Solvers
Splash Sampler
Alternating Least Squares
…Many others…
Never Ending Learner Project (CoEM)
Hadoop: 95 cores, 7.5 hrs
Distributed GraphLab: 32 EC2 machines, 80 secs
0.3% of Hadoop time
2 orders of magnitude faster ⇒ 2 orders of magnitude cheaper
GraphLab 1 provided exciting scaling performance
But… thus far…
We couldn’t scale up to the AltaVista Webgraph (2002): 1.4B vertices, 6.7B edges
Natural Graphs
[Image from WikiCommons]
Problem: Existing distributed graph computation systems perform poorly on natural graphs
Achilles Heel: the Idealized Graph Assumption
Assumed: small degree ⇒ easy to partition
But natural graphs: many high-degree vertices (power-law degree distribution) ⇒ very hard to partition
Power-Law Degree Distribution
[Log-log plot: number of vertices vs. degree, AltaVista WebGraph: 1.4B vertices, 6.6B edges]
High-Degree Vertices: 1% of vertices are adjacent to 50% of edges
High-Degree Vertices are Common
Netflix: Users and Movies
“Social” people; popular movies
[LDA plate diagram: hyperparameters α and B; per-document topic mixtures θ; topic assignments Z and observed words w across docs and words]
LDA: common words (e.g., “Obama”) are high-degree vertices
Power-Law Graphs are Difficult to Partition
! Power-law graphs do not have low-cost balanced cuts [Leskovec et al. 08, Lang 04]
! Traditional graph-partitioning algorithms perform poorly on power-law graphs [Abou-Rjeili et al. 06]
CPU 1 CPU 2
Machine 1 Machine 2
! Split high-degree vertices
! New abstraction → leads to this split-vertex strategy
Program For This
Run on This
GraphLab 2 Solution
GAS Decomposition
Gather (Reduce): accumulate information about the neighborhood, a parallel “sum” (Σ)
Apply: apply the accumulated value to the center vertex
Scatter: update adjacent edges and vertices
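A rough sketch of the Gather-Apply-Scatter decomposition, using PageRank as in the earlier slides. It is sequential plain Python, not the GraphLab 2 API; the data layout (`in_edges`, `out_degree`) is assumed for illustration:

```python
# GAS-style PageRank: the per-vertex work is split into a gather (an
# associative "sum" over in-neighbors, parallelizable even for a single
# high-degree vertex), an apply step, and a scatter step.

def gas_pagerank(in_edges, out_degree, n, alpha=0.15, iters=50):
    """in_edges: {i: [j, ...]} in-neighbors; out_degree: {j: d_j}."""
    R = [1.0] * n
    for _ in range(iters):
        new_R = []
        for i in range(n):
            # Gather: accumulate R[j] / d_j over in-neighbors (parallel "sum")
            acc = sum(R[j] / out_degree[j] for j in in_edges.get(i, []))
            # Apply: combine the accumulator with the vertex data
            new_R.append(alpha + (1 - alpha) * acc)
            # Scatter would go here: signal neighbors along changed edges
        R = new_R
    return R

# Toy 3-cycle, each vertex with out-degree 1: all ranks converge to 1.
in_edges = {0: [2], 1: [0], 2: [1]}
out_degree = {0: 1, 1: 1, 2: 1}
R = gas_pagerank(in_edges, out_degree, n=3)
```

Because the gather is just an associative accumulation, the runtime can split a high-degree vertex's neighborhood across machines and combine partial sums, which is the point of the decomposition.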
GraphChi: Going small with GraphLab
Solve huge problems on small or embedded devices?
Key: exploit non-volatile memory (starting with SSDs and HDs)
GraphChi – disk-based GraphLab
Challenge: random accesses
Novel GraphChi solution: the Parallel Sliding Windows method ⇒ minimizes the number of random accesses
GraphChi – disk-based GraphLab
Novel Parallel Sliding Windows algorithm
! Fast! Solves tasks as large as current distributed systems
! Minimizes non-sequential disk accesses; efficient on both SSD and hard drive
! Parallel, asynchronous execution
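A toy illustration of the sharding idea behind parallel sliding windows. This is heavily simplified plain Python: real GraphChi stores shards on disk and reads each window sequentially, whereas here in-memory lists stand in for disk files, and the helper names are made up:

```python
# Vertices are split into P intervals; edges are stored in P shards, where
# shard p holds all edges whose destination lies in interval p, sorted by
# source. Processing interval p then needs its own shard (in-edges) plus one
# contiguous "window" per other shard (out-edges), so each shard is read
# sequentially rather than via random access.

def build_shards(edges, intervals):
    """edges: [(src, dst)]; intervals: [(lo, hi)] covering the vertex ids."""
    shards = []
    for lo, hi in intervals:
        shard = sorted(e for e in edges if lo <= e[1] <= hi)  # sorted by source
        shards.append(shard)
    return shards

def edges_for_interval(shards, intervals, p):
    """All edges touching interval p."""
    lo, hi = intervals[p]
    result = list(shards[p])                # in-edges of interval p
    for q, shard in enumerate(shards):
        if q != p:
            # the edges with source in [lo, hi] form one contiguous run in
            # the source-sorted shard: the "sliding window"
            result += [e for e in shard if lo <= e[0] <= hi]
    return result

edges = [(0, 2), (1, 3), (2, 0), (3, 1), (0, 1)]
intervals = [(0, 1), (2, 3)]
shards = build_shards(edges, intervals)
sub = edges_for_interval(shards, intervals, 0)
```

As execution advances from one interval to the next, the window into each shard slides forward, which is where the method gets its name.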
Sample Results: Triangle Counting and Belief Propagation
Triangle counting, Twitter graph (1.5B edges): Hadoop on 1600 nodes [1] vs. GraphChi on 1 Mac Mini (minutes)
Belief propagation, AltaVista graph (6.7B edges): Hadoop on 100 machines [2] vs. GraphChi on 1 Mac Mini (minutes)
[1] S. Suri and S. Vassilvitskii. Counting triangles and the curse of the last reducer. WWW 2011.
[2] U. Kang, D. H. Chau, and C. Faloutsos. Inference of beliefs on billion-scale graphs. KDD-LDMTA’10, pages 1–7, June 2010.
Triangle Counting on the Twitter Graph: 40M users, 1.2B edges
Total: 34.8 billion triangles
Hadoop: 1636 machines, 423 minutes (results from [Suri & Vassilvitskii, WWW ’11])
GraphLab 2: 64 machines (1024 cores), 1.5 minutes
GraphChi: 1 Mac Mini, 59 minutes!
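For reference, the task being benchmarked can be stated in a few lines of plain Python. This is a single-machine sketch of the definition, nothing like the distributed implementations compared above:

```python
# Triangle counting on an undirected graph: for every edge (u, v), the common
# neighbors of u and v each close a triangle. Summing over all edges counts
# every triangle three times (once per edge), hence the division by 3.

def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3

# Toy graph: vertices 0-1-2 form one triangle; edge (2, 3) closes none.
tri = count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])
```

The set intersections over each edge's endpoints are exactly the per-edge work that the "curse of the last reducer" paper shows is dominated by high-degree vertices.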
Efficient Multicore Collaborative Filtering
LeBuSiShu team – 5th place in Track 1, ACM KDD CUP 2011
Institute of Automation, Chinese Academy of Sciences
Machine Learning Dept., Carnegie Mellon University
ACM KDD CUP Workshop 2011
Yao Wu Qiang Yan Danny Bickson Yucheng Low Qing Yang
Netflix Collaborative Filtering
! Alternating Least Squares Matrix Factorization
Model: 0.5 million nodes, 99 million edges
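A minimal rank-1 sketch of Alternating Least Squares in plain Python, just to show the alternating structure; the GraphLab version factorizes the sparse user-movie graph at higher rank, and `als_rank1` and the toy ratings here are made up:

```python
# Rank-1 ALS: model each rating as r_ij ~ u[i] * v[j]. Fixing v makes each
# u[i] a one-dimensional regularized least-squares problem (and vice versa),
# so the two closed-form sweeps are alternated until convergence.

def als_rank1(ratings, n_users, n_items, lam=0.01, iters=50):
    """ratings: {(user, item): value}. Returns factors u, v."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iters):
        for i in range(n_users):                  # fix v, solve for u[i]
            num = sum(v[j] * r for (a, j), r in ratings.items() if a == i)
            den = sum(v[j] ** 2 for (a, j), r in ratings.items() if a == i)
            u[i] = num / (den + lam)
        for j in range(n_items):                  # fix u, solve for v[j]
            num = sum(u[i] * r for (i, b), r in ratings.items() if b == j)
            den = sum(u[i] ** 2 for (i, b), r in ratings.items() if b == j)
            v[j] = num / (den + lam)
    return u, v

# Toy ratings generated from an exact rank-1 model (r_ij = a_i * b_j),
# so the predictions u[i] * v[j] should recover the observed values.
ratings = {(0, 0): 2.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 6.0}
u, v = als_rank1(ratings, n_users=2, n_items=2)
```

Each per-user solve only touches that user's rated movies, which is why ALS maps naturally onto a vertex update function over the bipartite graph.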
[Plot: runtime (s, log scale) vs. number of nodes (4–64), comparing Hadoop, MPI, and GraphLab]
Intel Labs Report on GraphLab
Data source: Nezih Yigitbasi, Intel Labs
ACM KDD CUP 2012
GraphLab team @ WSDM 13
Future Plans
Learn: GraphLab Notebook
Prototype: pip install graphlab ⇒ local prototyping
Production: the same code scales – execute on an EC2 cluster
Future Plans
GraphLab Internship Plan
GraphLab Conferences
2012 ⇒ 2013