UC BERKELEY
GraphX: Graph Analytics in Spark
Ankur Dave, Graduate Student, UC Berkeley AMPLab
Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
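Collaborative Filtering
[Figure: graph with vertices f(1) to f(5), labeled User Factors and Product Factors, connected by rating edges r13, r14, r24, r25]
$$f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right)^2 + \lambda \|w\|_2^2$$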
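A worked step that is not on the slide: the per-vertex objective above is a ridge regression in w, so its minimizer has a closed form. The derivation below uses only the symbols from the slide's formula; I_d (the d-by-d identity matrix) is the one symbol added here.

```latex
\begin{aligned}
\nabla_w \Big[ \sum_{j \in \mathrm{Nbrs}(i)} \big( r_{ij} - w^T f[j] \big)^2 + \lambda \|w\|_2^2 \Big]
  &= -2 \sum_{j \in \mathrm{Nbrs}(i)} \big( r_{ij} - w^T f[j] \big) f[j] + 2 \lambda w = 0 \\
\Longrightarrow \quad
f[i] &= \Big( \sum_{j \in \mathrm{Nbrs}(i)} f[j]\, f[j]^T + \lambda I_d \Big)^{-1}
        \sum_{j \in \mathrm{Nbrs}(i)} r_{ij}\, f[j]
\end{aligned}
```

Each update reads only the factors of the neighbors of vertex i, which is why the computation fits a graph-parallel formulation.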
Many Graph-Parallel Algorithms
Collaborative Filtering
» Alternating Least Squares
» Stochastic Gradient Descent
» Tensor Factorization
Structured Prediction
» Loopy Belief Propagation
» Max-Product Linear Programs
» Gibbs Sampling
Semi-supervised ML
» Graph SSL
» CoEM
Community Detection
» Triangle-Counting
» K-core Decomposition
» K-Truss
Graph Analytics
» PageRank
» Personalized PageRank
» Shortest Path
» Graph Coloring
Classification
» Neural Networks
High-Degree Vertices
Challenges:
1. Storage: How to store a graph where one vertex’s edges don’t fit on a machine?
2. API: How to expose parallelism within vertex neighborhoods?
Complex Pipelines
[Figure: pipeline starting from Raw Wikipedia (XML). One branch: Link Table (Title, Link), Hyperlinks, PageRank, Top 20 Pages (Title, PR). Other branch: Editor Table (Editor, Title), Editor Graph, Community Detection, User Community (User, Com.), Top Communities (Com., PR)]
Complex Pipelines
Solution: Embed graph processing within a table-oriented system (Spark)
Challenges:
1. Storage: How to store graphs as tables?
2. Computation: How to express graph ops as table ops (map, reduce, join, etc.)?
3. API: How to present the two views to the user?
Property Graphs
Vertex Property:
• User Profile
• Current PageRank Value
Edge Property:
• Weights
• Relationships
• Timestamps
Creating a Graph (Scala)
  import org.apache.spark.graphx._
  import org.apache.spark.rdd.RDD

  // Defined by GraphX: type VertexId = Long
  val vertices: RDD[(VertexId, String)] = sc.parallelize(List(
    (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))

  // Defined by GraphX:
  //   class Edge[ED](val srcId: VertexId, val dstId: VertexId, val attr: ED)
  val edges: RDD[Edge[String]] = sc.parallelize(List(
    Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))

  val graph = Graph(vertices, edges)
[Figure: Graph with vertices 1 (Alice), 2 (Bob), 3 (Charlie) and edges 1 to 2 "coworker", 2 to 3 "friend"]
Graph Operations (Scala)
  class Graph[VD, ED] {
    // Table Views
    def vertices: RDD[(VertexId, VD)]
    def edges: RDD[Edge[ED]]
    def triplets: RDD[EdgeTriplet[VD, ED]]

    // Transformations
    def mapVertices[VD2](f: (VertexId, VD) => VD2): Graph[VD2, ED]
    def mapEdges[ED2](f: Edge[ED] => ED2): Graph[VD, ED2]
    def reverse: Graph[VD, ED]
    def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]

    // Joins
    def outerJoinVertices[U, VD2]
        (tbl: RDD[(VertexId, U)])
        (f: (VertexId, VD, Option[U]) => VD2): Graph[VD2, ED]

    // Computation
    def aggregateMessages[A](
        sendMsg: EdgeContext[VD, ED, A] => Unit,
        mergeMsg: (A, A) => A): RDD[(VertexId, A)]

Built-in Algorithms (Scala)
    // Continued from previous slide
    def pageRank(tol: Double): Graph[Double, Double]
    def triangleCount(): Graph[Int, ED]
    def connectedComponents(): Graph[VertexId, ED]
    // ...and more: org.apache.spark.graphx.lib
  }
[Figure: illustrations of PageRank, Triangle Count, and Connected Components]
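To show how the operators listed above compose, here is a minimal sketch that applies one transformation, one join, and one built-in algorithm to the three-vertex example graph. It assumes Spark with the spark-graphx module on the classpath and a local master; the application name and the value names (`upper`, `withOutDeg`, `ranks`) are illustrative, not from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: local mode; reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-operators-sketch"))

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))
val graph: Graph[String, String] = Graph(vertices, edges)

// Transformation: rewrite vertex properties, keep the structure.
val upper: Graph[String, String] =
  graph.mapVertices((id, name) => name.toUpperCase)
val upperNames = upper.vertices.map(_._2).collect().toSet

// Join: attach each vertex's out-degree; vertices without out-edges
// are missing from outDegrees, so they arrive as None.
val withOutDeg: Graph[(String, Int), String] =
  graph.outerJoinVertices(graph.outDegrees) {
    (id, name, deg) => (name, deg.getOrElse(0))
  }
val outDegByName = withOutDeg.vertices.map(_._2).collect().toMap

// Built-in algorithm: PageRank until the given tolerance is reached.
val ranks = graph.pageRank(0.001).vertices.collect().toMap

println(upperNames)
println(outDegByName)
println(ranks)
```

Each call returns a new graph or table; the input graph is left unchanged.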
The triplets view
  class Graph[VD, ED] {
    def triplets: RDD[EdgeTriplet[VD, ED]]
  }
  class EdgeTriplet[VD, ED](
      val srcId: VertexId, val dstId: VertexId, val attr: ED,
      val srcAttr: VD, val dstAttr: VD)
[Figure: triplets maps the Graph (1 Alice, 2 Bob, 3 Charlie; edges "coworker", "friend") to an RDD]
  srcAttr   attr       dstAttr
  Alice     coworker   Bob
  Bob       friend     Charlie
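As an illustration of the triplets view, the sketch below turns every edge of the example graph into a sentence that uses the edge property together with both endpoint properties. It assumes Spark with spark-graphx on the classpath and a local master; the application name and `sentences` are illustrative names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: local mode; reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-triplets-sketch"))

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend")))
val graph: Graph[String, String] = Graph(vertices, edges)

// Each triplet carries the edge attribute plus both vertex attributes.
val sentences: Seq[String] = graph.triplets
  .map(t => s"${t.srcAttr} is the ${t.attr} of ${t.dstAttr}")
  .collect()
  .sorted
  .toSeq

sentences.foreach(println)
```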
The subgraph transformation
  class Graph[VD, ED] {
    def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  }
  graph.subgraph(epred = (edge) => edge.attr != "relative")
[Figure: subgraph maps a Graph on Alice, Bob, Charlie, and David with edges labeled relative, friend, coworker, relative to a Graph on the same four vertices with only the friend and coworker edges]
The subgraph transformation
  class Graph[VD, ED] {
    def subgraph(epred: EdgeTriplet[VD, ED] => Boolean,
                 vpred: (VertexId, VD) => Boolean): Graph[VD, ED]
  }
  graph.subgraph(vpred = (id, name) => name != "Bob")
[Figure: subgraph maps a Graph on Alice, Bob, Charlie, and David with edges labeled relative, friend, coworker, relative to a Graph on Alice, Charlie, and David with only the two relative edges]
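The two subgraph slides can be tried together in a small sketch. The slides do not say which vertices each edge connects or in which direction, so the edge list below is a hypothetical layout chosen to be consistent with the figures. It assumes Spark with spark-graphx on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: local mode; reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-subgraph-sketch"))

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
// Hypothetical endpoints and directions; only the labels come from the slides.
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
  Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))
val graph: Graph[String, String] = Graph(vertices, edges)

// Edge predicate: drop "relative" edges, keep every vertex.
val noRelatives = graph.subgraph(epred = t => t.attr != "relative")
val noRelativesVertexCount = noRelatives.numVertices
val noRelativesLabels = noRelatives.edges.map(_.attr).collect().toSet

// Vertex predicate: drop Bob; edges touching Bob disappear with him.
val noBob = graph.subgraph(vpred = (id, name) => name != "Bob")
val noBobNames = noBob.vertices.map(_._2).collect().toSet
val noBobLabels = noBob.edges.map(_.attr).collect().toSeq

println(s"$noRelativesVertexCount vertices, labels $noRelativesLabels")
println(s"$noBobNames, labels $noBobLabels")
```

An omitted predicate defaults to accepting everything, which is why each call names only the predicate it needs.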
Computation with aggregateMessages
  class Graph[VD, ED] {
    def aggregateMessages[A](
        sendMsg: EdgeContext[VD, ED, A] => Unit,
        mergeMsg: (A, A) => A): RDD[(VertexId, A)]
  }
  class EdgeContext[VD, ED, A](
      val srcId: VertexId, val dstId: VertexId, val attr: ED,
      val srcAttr: VD, val dstAttr: VD) {
    def sendToSrc(msg: A)
    def sendToDst(msg: A)
  }
  // Compute vertex degrees: every edge sends 1 to both endpoints,
  // and messages arriving at the same vertex are summed.
  graph.aggregateMessages[Int](
    ctx => {
      ctx.sendToSrc(1)
      ctx.sendToDst(1)
    },
    _ + _)
[Figure: aggregateMessages maps the Graph on Alice, Bob, Charlie, and David (edges relative, friend, coworker, relative) to an RDD]
  vertex id   degree
  Alice       2
  Bob         2
  Charlie     3
  David       1
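The degree computation above can be run end to end as follows. As the slides do not give edge endpoints, the edge list is a hypothetical layout chosen so that the degrees match the slide's result table. It assumes Spark with spark-graphx on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: local mode; reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-aggregate-sketch"))

val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "Alice"), (2L, "Bob"), (3L, "Charlie"), (4L, "David")))
// Hypothetical endpoints and directions; only the labels come from the slides.
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 2L, "coworker"), Edge(2L, 3L, "friend"),
  Edge(1L, 3L, "relative"), Edge(3L, 4L, "relative")))
val graph: Graph[String, String] = Graph(vertices, edges)

// sendMsg runs once per edge; mergeMsg combines messages per vertex.
val degrees = graph.aggregateMessages[Int](
  ctx => {
    ctx.sendToSrc(1)
    ctx.sendToDst(1)
  },
  _ + _)

// Join the result table back to the names for display.
val degreeByName: Map[String, Int] = graph.vertices
  .innerJoin(degrees)((id, name, degree) => (name, degree))
  .map(_._2)
  .collect()
  .toMap

// Cross-check against the built-in degree table.
val matchesBuiltIn = degrees.collect().toMap == graph.degrees.collect().toMap

println(degreeByName)
```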
Example: Graph Coarsening
[Figure: coarsening pipeline with stages labeled Web Pages, subgraph, Intra-Domain Links, Connected Components, Pages by Domain, Domains, Graph constructor, Domain Graph]
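One way the coarsening pipeline could be written with the operators shown earlier is sketched below. This is a reading of the figure's stage labels, not the speaker's code: the four-page input graph, the use of the domain name as the vertex property, and all value names are assumptions made for illustration. It assumes Spark with spark-graphx on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, PartitionStrategy, VertexId}
import org.apache.spark.rdd.RDD

// Assumption: local mode; reuses an existing context (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("graphx-coarsening-sketch"))

// Hypothetical web graph: vertex property = domain, edge property = link count.
val pageVertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "a.com"), (2L, "a.com"), (3L, "b.com"), (4L, "b.com")))
val pageLinks: RDD[Edge[Int]] = sc.parallelize(Seq(
  Edge(1L, 2L, 1), Edge(3L, 4L, 1), Edge(2L, 3L, 1), Edge(1L, 4L, 1)))
val pages: Graph[String, Int] = Graph(pageVertices, pageLinks)

// 1. subgraph: keep only links whose endpoints share a domain.
val intraDomain = pages.subgraph(epred = t => t.srcAttr == t.dstAttr)

// 2. Connected components: pages linked within a domain get one component id.
val component = intraDomain.connectedComponents().vertices

// 3. Label every page in the original graph with its component id.
val labeled: Graph[(String, VertexId), Int] =
  pages.outerJoinVertices(component) {
    (id, domain, comp) => (domain, comp.getOrElse(id))
  }

// 4. Graph constructor: one vertex per component, one edge per cross link.
val domainVertices: RDD[(VertexId, String)] =
  labeled.vertices.map { case (_, (domain, comp)) => (comp, domain) }.distinct()
val domainEdges: RDD[Edge[Int]] = labeled.triplets
  .filter(t => t.srcAttr._2 != t.dstAttr._2)
  .map(t => Edge(t.srcAttr._2, t.dstAttr._2, t.attr))

// groupEdges merges parallel edges only within a partition, so the graph
// is repartitioned first to colocate edges with the same endpoints.
val domainGraph = Graph(domainVertices, domainEdges)
  .partitionBy(PartitionStrategy.RandomVertexCut)
  .groupEdges(_ + _)

val coarseVertices = domainGraph.vertices.collect().toMap
val coarseEdges = domainGraph.edges.collect().toSeq
println(coarseVertices)
println(coarseEdges)
```

In this toy input the two cross-domain links collapse into a single edge of weight 2 between the two domain vertices.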
Storing Graphs as Tables
[Figure: a Property Graph on vertices A to F is stored as two tables, each split across Machine 1 and Machine 2]
Vertex Table (RDD): B, C, D, E, A, F
Edge Table (RDD):
  A B
  A C
  C D
  B C
  A E
  A F
  E F
  E D
Reuse vertices or edges across multiple graphs
Simple Operations
[Figure: Transform Vertex Properties maps the Input Graph (Vertex Table, Edge Table) to the Transformed Graph (Vertex Table, Edge Table)]
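Implementing triplets
[Figure: the Vertex Table (RDD) holding B, C, D, E, A, F and the Edge Table (RDD) holding A B, A C, C D, B C, A E, A F, E F, E D, each split across Machine 1 and Machine 2, plus a Routing Table (RDD) that lists edge partition numbers (1, 2, or both) next to each vertex]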
Implementing triplets
[Figure: the Vertex Table (RDD) and Routing Table (RDD) feed a Mirror Cache on each edge partition. The partition holding A B, A C, C D, B C caches B, C, D, A; the partition holding A E, A F, E F, E D caches D, E, F, A]
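The join that the two slides above depict can be sketched with plain Scala collections standing in for RDD partitions. This is an illustration of the idea only, not GraphX's implementation: the case classes, the integer vertex properties, and all value names are made up here, while the edge partitioning is the one from the slide.

```scala
// Illustrative stand-ins for the edge and triplet records.
case class Edge(src: Char, dst: Char)
case class Triplet(src: Char, srcAttr: Int, dst: Char, dstAttr: Int)

// Edge table, split into the two partitions shown on the slide.
val edgePartitions: Map[Int, Seq[Edge]] = Map(
  1 -> Seq(Edge('A', 'B'), Edge('A', 'C'), Edge('C', 'D'), Edge('B', 'C')),
  2 -> Seq(Edge('A', 'E'), Edge('A', 'F'), Edge('E', 'F'), Edge('E', 'D')))

// Vertex table: vertex id -> property (hypothetical values).
val vertexTable: Map[Char, Int] =
  Map('A' -> 10, 'B' -> 20, 'C' -> 30, 'D' -> 40, 'E' -> 50, 'F' -> 60)

// Routing table: for each vertex, the edge partitions that reference it.
val routingTable: Map[Char, Set[Int]] = edgePartitions.toSeq
  .flatMap { case (pid, es) => es.flatMap(e => Seq(e.src -> pid, e.dst -> pid)) }
  .groupBy(_._1)
  .map { case (v, pairs) => v -> pairs.map(_._2).toSet }

// Mirror caches: ship each vertex property only where the routing table says.
val mirrorCaches: Map[Int, Map[Char, Int]] = edgePartitions.keys.map { pid =>
  pid -> vertexTable.filter { case (v, _) =>
    routingTable.getOrElse(v, Set.empty[Int]).contains(pid)
  }
}.toMap

// Local join: each partition builds its triplets from its own cache.
val triplets: Seq[Triplet] = edgePartitions.toSeq.sortBy(_._1).flatMap {
  case (pid, es) =>
    val cache = mirrorCaches(pid)
    es.map(e => Triplet(e.src, cache(e.src), e.dst, cache(e.dst)))
}

triplets.foreach(println)
```

Only A and D are referenced by both partitions, so only their properties are shipped twice; every other vertex is sent to a single partition.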
Reduction in Communication Due to Cached Updates
[Figure: network communication (MB, log scale from 0.1 to 10000) per iteration (0 to 16) for Connected Components on Twitter Graph (1.4B edges)]
Most vertices are within 8 hops of all vertices in their component
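Implementing aggregateMessages
[Figure: each Edge Table (RDD) partition (A B, A C, C D, B C with Mirror Cache B, C, D, A; A E, A F, E F, E D with Mirror Cache D, E, F, A) is scanned, and the messages produced, including those for the shared vertices A and D, are combined into the Result Vertex Table for B, C, D, E, A, F]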
Speedup Due To Access Method Selection
[Figure: runtime (seconds, 0 to 30) per iteration (0 to 16) for Connected Components on Twitter (1.4B edges), comparing two series: Scan and Indexed]
Scan: Scan All Edges
Indexed: Index of “Active” Edges
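The choice between the two access methods can be illustrated with plain Scala collections. This is a sketch of the general idea, not GraphX's code: the `threshold` parameter and its default value, the index layout, and all names are assumptions made for illustration.

```scala
// Illustrative edge record; the edges are the ones from the earlier slides.
case class Edge(src: Char, dst: Char)

val edges: Vector[Edge] = Vector(
  Edge('A', 'B'), Edge('A', 'C'), Edge('C', 'D'), Edge('B', 'C'),
  Edge('A', 'E'), Edge('A', 'F'), Edge('E', 'F'), Edge('E', 'D'))

// Index from source vertex to the positions of its edges.
val srcIndex: Map[Char, Vector[Int]] =
  edges.indices.toVector.groupBy(i => edges(i).src)

// Returns the chosen method and the edges whose source is active.
// `threshold` is a hypothetical tuning knob for this sketch.
def activeEdges(active: Set[Char], numVertices: Int,
                threshold: Double = 0.8): (String, Vector[Edge]) = {
  val activeFraction = active.size.toDouble / numVertices
  if (activeFraction < threshold) {
    // Few active vertices: look up only their edges through the index.
    val positions = active.toVector.sorted
      .flatMap(v => srcIndex.getOrElse(v, Vector.empty[Int]))
    ("index", positions.map(i => edges(i)))
  } else {
    // Most vertices active: a sequential scan of all edges is cheaper.
    ("scan", edges.filter(e => active.contains(e.src)))
  }
}

val late = activeEdges(Set('E'), numVertices = 6)
val early = activeEdges(Set('A', 'B', 'C', 'D', 'E', 'F'), numVertices = 6)
println(late)
println(early._1)
```

Iterative algorithms such as connected components tend to start with every vertex active and end with very few, which is where the indexed path would pay off.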
PageRank Benchmark
[Figure: two bar charts of runtime (seconds) for GraphX, GraphLab, Giraph, and Naïve Spark, one for the Twitter Graph (42M Vertices, 1.5B Edges) and one for the UK-Graph (106M Vertices, 3.7B Edges), with annotations 7x and 18x]
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE
GraphX performs comparably to state-of-the-art graph processing systems.
Open Source
• Alpha release with Spark 0.9.0 in Feb 2014
• Stable release with Spark 1.2.0 in Dec 2014
Future of GraphX
1. Language support
   a) Java API: PR #3234
   b) Python API: collaborating with Intel, SPARK-3789
2. More algorithms
   a) LDA (topic modeling): PR #2388
   b) Correlation clustering
3. Research
   a) Local graphs
   b) Streaming/time-varying graphs
   c) Graph database–like queries
IndexedRDD
Support efficient updates to immutable RDDs using purely functional data structures
https://github.com/amplab/spark-indexedrdd