GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized...

43
GraphX: Unifying Table and Graph Analytics Presented by Joseph Gonzalez Joint work with Reynold Xin, Daniel Crankshaw, Ankur Dave, Michael Franklin, and Ion Stoica IPDPS 2014 *These slides are best viewed in PowerPoint with anima

Transcript of GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized...

Page 1: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

GraphX: Unifying Table and Graph Analytics

Presented by Joseph Gonzalez

Joint work with Reynold Xin, Daniel Crankshaw, AnkurDave, Michael Franklin, and Ion Stoica

IPDPS 2014

*These slides are best viewed in PowerPoint with anima

Page 2: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Graphs are Central to Analytics

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PageTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word Topic

Editor GraphCommunityDetection

User Community

User Com.

Term-DocGraph

DiscussionTable

User Disc.

CommunityTopic

Topic Com.

Page 3: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Update ranks in parallel

Iterate until convergence

Rank of user i Weighted sum of

neighbors’ ranks

3

PageRank: Identifying Leaders

Page 4: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Low-Rank Matrix Factorization:

5

r13

r14

r24

r25

f(1)

f(2)

f(3)

f(4)

f(5)Use

r Fac

tors

(U) M

ovie Factors (M)

Use

rs

MoviesNetflix

Use

rs≈x

Movies

f(i)

f(j)

Iterate:

Recommending Products

Page 5: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

The Graph-Parallel Pattern

6

Model / Alg. State

Computation depends only on the neighbors

Page 6: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Many Graph-Parallel Algorithms• Collaborative Filtering

– Alternating Least Squares– Stochastic Gradient

Descent– Tensor Factorization

• Structured Prediction– Loopy Belief Propagation– Max-Product Linear

Programs– Gibbs Sampling

• Semi-supervised ML– Graph SSL

– CoEM• Community Detection

– Triangle-Counting– K-core Decomposition– K-Truss

• Graph Analytics– PageRank– Personalized PageRank– Shortest Path– Graph Coloring

• Classification– Neural Networks 7

MACHINE LEARNING

SOCIAL NETWORK ANALYSIS

GRAPH ALGORITHMS

Page 7: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Graph-Parallel Systems

8

oogle

Expose specialized APIs to simplify graph programming.

Page 8: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

“Think like a Vertex.”

- Pregel [SIGMOD’10]

9

Page 9: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

The Pregel (Push) AbstractionVertex-Programs interact by sending messages.

iiPregel_PageRank(i, messages) : // Receive all the messagestotal = 0foreach( msg in messages) :

total = total + msg

// Update the rank of this vertexR[i] = 0.15 + total

// Send new messages to neighborsforeach(j in out_neighbors[i]) :

Send  msg(R[i]) to vertex j

10Malewicz et al. [PODC’09, SIGMOD’10]

Page 10: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

The GraphLab (Pull) AbstractionVertex Programs directly access adjacent vertices and edgesGraphLab_PageRank(i) // Compute sum over neighborstotal = 0foreach( j in neighbors(i)): 

total = total + R[j] * wji

// Update the PageRankR[i] = 0.15 + total 

11

R[4] * w41

++

44 11

33 22

Data movement is managed by the system and not the user.

Page 11: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

BarrierIterative Bulk Synchronous

ExecutionCompute Communicate

Page 12: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Graph-Parallel Systems

13

oogle

Expose specialized APIs to simplify graph programming.

Exploit graph structure to achieve orders-of-magnitude performance gains

over more general data-parallel systems

Page 13: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

PageRank on the Live-Journal Graph

22

354

1340

0 200 400 600 800 1000 1200 1400 1600

GraphLab

Naïve Spark

Mahout/Hadoop

Runtime (in seconds, PageRank for 10 iterations)

Spark is 4x faster than HadoopGraphLab is 16x faster than Spark

Page 14: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Counted: 34.8 Billion Triangles

15

Triangle Counting on Twitter

64 Machines15 Seconds

1536 Machines423 Minutes

Hadoop[WWW’11]

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

1000 x Faster

40M Users, 1.4 Billion Links

Page 15: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Graph Analytics Pipeline

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PageTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word Topic

Editor GraphCommunityDetection

User Community

User Com.

Term-DocGraph

DiscussionTable

User Disc.

CommunityTopic

Topic Com.

Page 16: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Tables

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PageTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word Topic

Editor GraphCommunityDetection

User Community

User Com.

Term-DocGraph

DiscussionTable

User Disc.

CommunityTopic

Topic Com.

Page 17: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Graphs

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 PageTitle PR

TextTableTitle Body

Topic Model(LDA) Word Topics

Word Topic

Editor GraphCommunityDetection

User Community

User Com.

Term-DocGraph

DiscussionTable

User Disc.

CommunityTopic

Topic Com.

Page 18: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Separate Systems to Support Each View

Table View Graph View

Dependency Graph

Table

R lt

Result

RowRow

RowRow

RowRow

RowRow

Page 19: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Having separate systems for each view is

difficult to use and inefficient

20

Page 20: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Difficult to Program and Use

Users must Learn, Deploy, and Manage multiple systems

Leads to brittle and often complex interfaces

21

Page 21: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Inefficient

22

Extensive data movement and duplication across the network and file system

< / >< / >< / >XML

HDFSHDFS HDFSHDFS HDFSHDFS HDFSHDFS

Limited reuse internal data-structures across stages

Page 22: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

GraphX Solution: Tables and Graphs are

views of the same physical data

GraphX UnifiedRepresentation

Graph ViewTable View

Each view has its own operators that exploit the semantics of the view

to achieve efficient execution

Page 23: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Graphs Relational Algebra

1. Encode graphs as distributed tables

2. Express graph computation in relational algebra

3. Recast graph systems optimizations as:1. Distributed join optimization2. Incremental materialized maintenance

Integrate Graph and Table data

processing systems.

Achieve performance parity with

specialized systems.

Page 24: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Part. 2

Part. 1

Vertex Table

B C

AA DD

F E

AA DD

Distributed Graphs as Distributed Tables

D

Property Graph

B C

D

E

AA

F

Edge Table

A B

A C

C D

B C

A E

A F

E F

E D

B

C

D

E

A

F

RoutingTable

B

C

D

E

A

F

1

2

1 2

1 2

1

2

2D Vertex Cut Heuristic

Page 25: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Table OperatorsTable operators are inherited from Spark:

26

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

sample

take

first

partitionBy

mapWith

pipe

save

...

Page 26: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

class Graph [ V, E ] {def Graph(vertices: Table[ (Id, V) ],

edges: Table[ (Id, Id, E) ])// Table Views -----------------def vertices: Table[ (Id, V) ]def edges: Table[ (Id, Id, E) ]def triplets: Table [ ((Id, V), (Id, V), E) ]// Transformations ------------------------------def reverse: Graph[V, E]def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]// Joins ----------------------------------------def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ]def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]// Computation ----------------------------------def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E]}

Graph Operators

27

Page 27: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Triplets Join Vertices and Edges

The triplets operator joins vertices and edges:

The mrTriplets operator sums adjacent triplets.SELECT t.dstId, reduce( map(t) ) AS sum FROM triplets AS t GROUPBY t.dstId

TripletsVertices

B

A

C

D

Edges

A BA CB CC D

A BA

B A CB CC D

SELECT s.Id, d.Id, s.P, e.P, d.PFROM edges AS eJOIN vertices AS s, vertices AS dON e.srcId = s.Id AND e.dstId = d.Id

Page 28: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

F

E

Example: Oldest Follower

D

B

A

CCalculate the number of older followers for each user?val olderFollowerAge = graph.mrTriplets(

e => // Mapif(e.src.age < e.dst.age) {(e.srcId, 1)

else { Empty },(a,b) => a + b // Reduce

).vertices

23 42

30

19 75

1629

Page 29: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

We express enhanced Pregel and GraphLab

abstractions using the GraphX operatorsin less than 50 lines of code!

30

Page 30: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Enhanced Pregel in GraphX

31Malewicz et al. [PODC’09, SIGMOD’10]

pregelPR(i, messageList ):

// Receive all the messagestotal = 0foreach( msg in messageList) :total = total + msg

// Update the rank of this vertexR[i] = 0.15 + total

// Send new messages to neighborsforeach(j in out_neighbors[i]) :Send msg(R[i]/E[i,j]) to vertex

Require Message CombinersmessageSum

messageSum

Remove Message Computation

from theVertex Program

sendMsg(ij, R[i], R[j], E[i,j]):// Compute single messagereturn msg(R[i]/E[i,j])

combineMsg(a, b):// Compute sum of two messagesreturn a + b

Page 31: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

PageRank in GraphX

32

// Load and initialize the graph

val graph = GraphBuilder.text(“hdfs://web.txt”)

val prGraph = graph.joinVertices(graph.outDegrees)

// Implement and Run PageRank

val pageRank =

prGraph.pregel(initialMessage = 0.0, iter = 10)(

(oldV, msgSum) => 0.15 + 0.85 * msgSum,

triplet => triplet.src.pr / triplet.src.deg,

(msgA, msgB) => msgA + msgB)

Page 32: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Join EliminationIdentify and bypass joins for unused triplet fields

33

02000400060008000

100001200014000

0 5 10 15 20

Com

mun

icat

ion

(MB

)

Iteration

PageRank on Twitter Three Way Join

Join Elimination

Factor of 2 reduction in communication

sendMsg(ij, R[i], R[j], E[i,j]):// Compute single messagereturn msg(R[i]/E[i,j])

Page 33: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

We express the Pregel and GraphLab likeabstractions using the GraphX operators

in less than 50 lines of code!

34

By composing these operators we canconstruct entire graph-analytics pipelines.

Page 34: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Example Analytics Pipeline// Load raw data tables

val verts = sc.textFile(“hdfs://users.txt”).map(parserV)

val edges = sc.textFile(“hdfs://follow.txt”).map(parserE)

// Build the graph from tables and restrict to recent links

val graph = new Graph(verts, edges)

val recent = graph.subgraph(edge => edge.date > LAST_MONTH)

// Run PageRank Algorithm

val pr = graph.PageRank(tol = 1.0e-5)

// Extract and print the top 25 users

val topUsers = verts.join(pr).top(25).collect

topUsers.foreach(u => println(u.name + ‘\t’ + u.pr))

Page 35: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

The GraphX Stack(Lines of Code)

GraphX (3575)

Spark

Pregel (28) + GraphLab (50)

PageRank (5)

Connected Comp. (10)

Shortest Path (10)

ALS(40) LDA

(120)

K-core(51)

Triangle

Count(45)

SVD(40)

Page 36: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Performance Comparisons

22

68207

3541340

0 200 400 600 800 1000 1200 1400 1600

GraphLabGraphXGiraph

Naïve SparkMahout/Hadoop

Runtime (in seconds, PageRank for 10 iterations)

GraphX is roughly 3x slower than GraphLab

Live-Journal: 69 Million Edges

Page 37: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

GraphX scales to larger graphs

203

451

749

0 200 400 600 800

GraphLab

GraphX

Giraph

Runtime (in seconds, PageRank for 10 iterations)

GraphX is roughly 2x slower than GraphLab»Scala + Java overhead: Lambdas, GC time, …»No shared memory parallelism: 2x increase in comm.

Twitter Graph: 1.5 Billion Edges

Page 38: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

PageRank is just one stage….

What about a pipeline?

Page 39: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

HDFSHDFSHDFSHDFS

ComputeComputeSpark PreprocessSpark Preprocess Spark Post.Spark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is faster than GraphLab

Raw Wikipedia

< / >< / >< / >XML

Hyperlinks PageRank Top 20 Pages

342

1492

0 200 400 600 800 1000 1200 1400 1600

GraphLab + SparkGraphX

Giraph + SparkSpark

Total Runtime (in Seconds)

605

375

Page 40: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

StatusPart of Apache Spark

In production at several large technology companies

Page 41: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

GraphX: Unified Analytics

Enabling users to easily and efficientlyexpress the entire graph analytics

pipeline

New APIBlurs the distinction between Tables and

Graphs

New SystemCombines Data-Parallel Graph-Parallel Systems

Page 42: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

A Case for Algebra in Graphs

A standard algebra is essential for graph systems:

• e.g.: SQL proliferation of relational system

By embedding graphs in relational algebra:

• Integration with tables and preprocessing

• Leverage advances in relational systems

• Graph opt. recast to relational systems

Page 43: GraphX: Unifying Table and Graph AnalyticsDistributed join optimization 2. Incremental materialized maintenance Integrate Graph and Table data processing systems. Achieve performance

Thanks!

[email protected]@eecs.berkeley.edu

[email protected]@eecs.berkeley.edu

http://amplab.cs.berkeley.edu/projects/graphx/