Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8:...
Transcript of Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8:...
![Page 1: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/1.jpg)
Data-Intensive Distributed Computing
Part 8: Analyzing Graphs, Redux (1/2)
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
CS 431/631 451/651 (Winter 2019)
Adam RoegiestKira Systems
March 21, 2019
These slides are available at http://roegiest.com/bigdata-2019w/
![Page 2: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/2.jpg)
Graph Algorithms, again?(srsly?)
![Page 3: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/3.jpg)
What makes graphs hard?
Irregular structureFun with data structures!
Irregular data access patternsFun with architectures!
IterationsFun with optimizations!
✗
![Page 4: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/4.jpg)
Characteristics of Graph Algorithms
Parallel graph traversalsLocal computations
Message passing along graph edges
Iterations
![Page 5: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/5.jpg)
n0
n3n2
n1
n7
n6
n5
n4
n9
n8
Visualizing Parallel BFS
![Page 6: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/6.jpg)
Given page x with inlinks t1…tn, where
C(t) is the out-degree of t
is probability of random jump
N is the total number of nodes in the graph
X
t1
t2
tn
…
PageRank: Defined
![Page 7: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/7.jpg)
n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]
n2 n4 n3 n5 n1 n2 n3n4 n5
n2 n4n3 n5n1 n2 n3 n4 n5
n5 [n1, n2, n3]n1 [n2, n4] n2 [n3, n5] n3 [n4] n4 [n5]
Map
Reduce
PageRank in MapReduce
![Page 8: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/8.jpg)
Map
Reduce
PageRank BFS
PR/N d+1
sum min
PageRank vs. BFS
![Page 9: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/9.jpg)
Characteristics of Graph Algorithms
Parallel graph traversalsLocal computations
Message passing along graph edges
Iterations
![Page 10: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/10.jpg)
reduce
map
HDFS
HDFS
Convergence?
BFS
![Page 11: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/11.jpg)
Convergence?reduce
map
HDFS
HDFS
map
HDFS
PageRank
![Page 12: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/12.jpg)
MapReduce Sucks
Hadoop task startup time
Stragglers
Needless graph shuffling
Checkpointing at each iteration
![Page 13: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/13.jpg)
reduce
HDFS
…
map
HDFS
reduce
map
HDFS
reduce
map
HDFS
Let’s Spark!
![Page 14: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/14.jpg)
reduce
HDFS
…
map
reduce
map
reduce
map
![Page 15: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/15.jpg)
reduce
HDFS
map
reduce
map
reduce
map
Adjacency Lists PageRank Mass
Adjacency Lists PageRank Mass
Adjacency Lists PageRank Mass
…
![Page 16: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/16.jpg)
join
HDFS
map
join
map
join
map
Adjacency Lists PageRank Mass
Adjacency Lists PageRank Mass
Adjacency Lists PageRank Mass
…
![Page 17: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/17.jpg)
join
join
join
…
HDFS HDFS
Adjacency Lists PageRank vector
PageRank vector
flatMap
reduceByKey
PageRank vector
flatMap
reduceByKey
![Page 18: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/18.jpg)
join
join
join
…
HDFS HDFS
Adjacency Lists PageRank vector
PageRank vector
flatMap
reduceByKey
PageRank vector
flatMap
reduceByKey
Cache!
![Page 19: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/19.jpg)
PageRankPerformance
171
80
72
28
020406080100120140160180
30 60
Tim
eperIteration(s)
Numberofmachines
Hadoop
Spark
Source: http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-part-2-amp-camp-2012-standalone-programs.pdf
MapReduce vs. Spark
![Page 20: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/20.jpg)
Characteristics of Graph Algorithms
Parallel graph traversalsLocal computations
Message passing along graph edges
Iterations
Even faster?
![Page 21: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/21.jpg)
Big Data Processing in a Nutshell
Partition
Replicate
Reduce cross-partition communication
![Page 22: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/22.jpg)
Simple Partitioning Techniques
Hash partitioning
Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLs
![Page 23: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/23.jpg)
“Best Practices”
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
PageRank over webgraph(40m vertices, 1.4b edges)
How much difference does it make?
![Page 24: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/24.jpg)
+18%1.4b
674m
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
PageRank over webgraph(40m vertices, 1.4b edges)
How much difference does it make?
![Page 25: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/25.jpg)
+18%
-15%
1.4b
674m
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
PageRank over webgraph(40m vertices, 1.4b edges)
How much difference does it make?
![Page 26: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/26.jpg)
+18%
-15%
-60%
1.4b
674m
86m
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
PageRank over webgraph(40m vertices, 1.4b edges)
How much difference does it make?
![Page 27: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/27.jpg)
Schimmy Design Pattern
Basic implementation contains two dataflows:Messages (actual computations)Graph structure (“bookkeeping”)
Schimmy: separate the two dataflows, shuffle only the messagesBasic idea: merge join between graph structure and messages
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
S T
both relations sorted by join key
S1 T1 S2 T2 S3 T3
both relations consistently partitioned and sorted by join key
![Page 28: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/28.jpg)
join
join
join
…
HDFS HDFS
Adjacency Lists PageRank vector
PageRank vector
flatMap
reduceByKey
PageRank vector
flatMap
reduceByKey
![Page 29: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/29.jpg)
+18%
-15%
-60%
1.4b
674m
86m
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
PageRank over webgraph(40m vertices, 1.4b edges)
How much difference does it make?
![Page 30: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/30.jpg)
+18%
-15%
-60%-69%
1.4b
674m
86m
Lin and Schatz. (2010) Design Patterns for Efficient Graph Algorithms in MapReduce.
PageRank over webgraph(40m vertices, 1.4b edges)
How much difference does it make?
![Page 31: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/31.jpg)
Simple Partitioning Techniques
Hash partitioning
Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLsWeb pages: lexicographic sort of domain-reversed URLs
Social networks: sort by demographic characteristics
![Page 32: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/32.jpg)
Ugander et al. (2011) The Anatomy of the Facebook Social Graph.
Analysis of 721 million active users (May 2011)
54 countries w/ >1m active users, >50% penetration
Country Structure in Facebook
![Page 33: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/33.jpg)
Simple Partitioning Techniques
Hash partitioning
Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLs
Social networks: sort by demographic characteristicsWeb pages: lexicographic sort of domain-reversed URLs
Social networks: sort by demographic characteristicsGeo data: space-filling curves
![Page 34: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/34.jpg)
Aside: Partitioning Geo-data
![Page 35: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/35.jpg)
Geo-data = regular graph
![Page 36: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/36.jpg)
Space-filling curves: Z-Order Curves
![Page 37: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/37.jpg)
Space-filling curves: Hilbert Curves
![Page 38: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/38.jpg)
Simple Partitioning Techniques
Hash partitioning
Range partitioning on some underlying linearizationWeb pages: lexicographic sort of domain-reversed URLs
Social networks: sort by demographic characteristicsGeo data: space-filling curves
But what about graphs in general?
![Page 39: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/39.jpg)
Source: http://www.flickr.com/photos/fusedforces/4324320625/
![Page 40: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/40.jpg)
General-Purpose Graph Partitioning
Graph coarsening
Recursive bisection
![Page 41: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/41.jpg)
Karypis and Kumar. (1998) A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.
General-Purpose Graph Partitioning
![Page 42: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/42.jpg)
Karypis and Kumar. (1998) A Fast and High Quality Multilevel Scheme for Partitioning Irregular Graphs.
Graph Coarsening
![Page 43: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/43.jpg)
Chicken-and-Egg
To coarsen the graph you need to identify dense local regions
To identify dense local regions quickly you to need traverse local edgesBut to traverse local edges efficiently you need the local structure!
To efficiently partition the graph, you need to already know what the partitions are!
Industry solution?
![Page 44: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/44.jpg)
Big Data Processing in a Nutshell
Partition
Replicate
Reduce cross-partition communication
![Page 45: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/45.jpg)
Partition
![Page 46: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/46.jpg)
Partition
What’s the fundamental issue?
![Page 47: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/47.jpg)
Characteristics of Graph Algorithms
Parallel graph traversalsLocal computations
Message passing along graph edges
Iterations
![Page 48: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/48.jpg)
Partition
FastFast
Slow
![Page 49: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/49.jpg)
State-of-the-Art Distributed Graph Algorithms
Fast asynchronous iterations
Fast asynchronous iterations
Periodic synchronization
![Page 50: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/50.jpg)
Source: Wikipedia (Waste container)
Graph Processing Frameworks
![Page 51: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/51.jpg)
join
join
join
…
HDFS HDFS
Adjacency Lists PageRank vector
PageRank vector
flatMap
reduceByKey
PageRank vector
flatMap
reduceByKey
Cache!
![Page 52: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/52.jpg)
Pregel: Computational Model
Based on Bulk Synchronous Parallel (BSP)Computational units encoded in a directed graphComputation proceeds in a series of supersteps
Message passing architecture
Each vertex, at each superstep:Receives messages directed at it from previous superstep
Executes a user-defined function (modifying state)Emits messages to other vertices (for the next superstep)
Termination:A vertex can choose to deactivate itselfIs “woken up” if new messages received
Computation halts when all vertices are inactive
![Page 53: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/53.jpg)
superstep t
superstep t+1
superstep t+2
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
![Page 54: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/54.jpg)
Pregel: Implementation
Master-Worker architectureVertices are hash partitioned (by default) and assigned to workers
Everything happens in memory
Processing cycle:Master tells all workers to advance a single superstep
Worker delivers messages from previous superstep, executing vertex computationMessages sent asynchronously (in batches)
Worker notifies master of number of active vertices
Fault toleranceCheckpointing
Heartbeat/revert
![Page 55: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/55.jpg)
class ShortestPathVertex : public Vertex<int, int, int> {void Compute(MessageIterator* msgs) {int mindist = IsSource(vertex_id()) ? 0 : INF;for (; !msgs->Done(); msgs->Next())
mindist = min(mindist, msgs->Value());if (mindist < GetValue()) {
*MutableValue() = mindist;OutEdgeIterator iter = GetOutEdgeIterator();for (; !iter.Done(); iter.Next())SendMessageTo(iter.Target(),
mindist + iter.GetValue());}VoteToHalt();
}};
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
Pregel: SSSP
![Page 56: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/56.jpg)
class PageRankVertex : public Vertex<double, void, double> {public:virtual void Compute(MessageIterator* msgs) {if (superstep() >= 1) {
double sum = 0;for (; !msgs->Done(); msgs->Next())sum += msgs->Value();
*MutableValue() = 0.15 / NumVertices() + 0.85 * sum;}
if (superstep() < 30) {const int64 n = GetOutEdgeIterator().size();SendMessageToAllNeighbors(GetValue() / n);
} else {VoteToHalt();
}}
};
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
Pregel: PageRank
![Page 57: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/57.jpg)
class MinIntCombiner : public Combiner<int> {virtual void Combine(MessageIterator* msgs) {
int mindist = INF;for (; !msgs->Done(); msgs->Next())mindist = min(mindist, msgs->Value());Output("combined_source", mindist);
}
};
Source: Malewicz et al. (2010) Pregel: A System for Large-Scale Graph Processing. SIGMOD.
Pregel: Combiners
![Page 58: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/58.jpg)
![Page 59: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/59.jpg)
Giraph Architecture
Master – Application coordinatorSynchronizes supersteps
Assigns partitions to workers before superstep begins
Workers – Computation & messagingHandle I/O – reading and writing the graph
Computation/messaging of assigned partitions
ZooKeeperMaintains global application state
![Page 60: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/60.jpg)
Part 0
Part 1
Part 2
Part 3
Compute / Send
Messages
Wo
rker
1
Compute / Send
Messages
Mas
ter
Wo
rker
0
In-memory graph
Send stats / iterate!
Compute/Iterate
2
Wo
rke
r 1
Wo
rker
0 Part 0
Part 1
Part 2
Part 3
Output format
Part 0
Part 1
Part 2
Part 3
Storing the graph
3
Split 0
Split 1
Split 2
Split 3
Wo
rker
1
Mas
ter
Wo
rker
0
Input format
Load / Send
Graph
Load / Send
Graph
Loading the graph
1
Split 4
Split
Giraph Dataflow
![Page 61: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/61.jpg)
Active Inactive
Vote to Halt
Received Message
Vertex Lifecycle
Giraph Lifecycle
![Page 62: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/62.jpg)
Output
All Vertices Halted?
Input
Compute Superstep
No
Master halted?
No
Yes
Yes
Giraph Lifecycle
![Page 63: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/63.jpg)
Giraph Example
![Page 64: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/64.jpg)
5
15
2
5
5
25
5
5
5
5
1
2
Processor 1
Processor 2
Time
Execution Trace
![Page 65: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/65.jpg)
join
join
join
…
HDFS HDFS
Adjacency Lists PageRank vector
PageRank vector
flatMap
reduceByKey
PageRank vector
flatMap
reduceByKey
Cache!
![Page 66: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/66.jpg)
State-of-the-Art Distributed Graph Algorithms
Fast asynchronous iterations
Fast asynchronous iterations
Periodic synchronization
![Page 67: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/67.jpg)
Source: Wikipedia (Waste container)
Graph Processing Frameworks
![Page 68: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/68.jpg)
GraphX: Motivation
![Page 69: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/69.jpg)
GraphX = Spark for Graphs
Integration of record-oriented and graph-oriented processing
Extends RDDs to Resilient Distributed Property Graphs
class Graph[VD, ED] {val vertices: VertexRDD[VD]val edges: EdgeRDD[ED]
}
![Page 70: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/70.jpg)
Property Graph: Example
![Page 71: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/71.jpg)
Underneath the Covers
![Page 72: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/72.jpg)
GraphX Operators
val vertices: VertexRDD[VD] val edges: EdgeRDD[ED] val triplets: RDD[EdgeTriplet[VD, ED]]
“collection” view
Transform vertices and edgesmapVerticesmapEdgesmapTriplets
Join vertices with external table
Aggregate messages within local neighborhood
Pregel programs
![Page 73: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/73.jpg)
join
join
join
…
HDFS HDFS
Adjacency Lists PageRank vector
PageRank vector
flatMap
reduceByKey
PageRank vector
flatMap
reduceByKey
Cache!
![Page 74: Data-Intensive Distributed Computing - RoegiestData-Intensive Distributed Computing Part 8: Analyzing Graphs, Redux (1/2) This work is licensed under a Creative Commons Attribution-Noncommercial-Share](https://reader030.fdocuments.net/reader030/viewer/2022041017/5ecb18d2175edb27d35fd520/html5/thumbnails/74.jpg)
Source: Wikipedia (Japanese rock garden)