Apache Giraph - Centrum Wiskunde & Informatica · Apache Giraph Large-scale Graph Processing on...
Apache Giraph Large-scale Graph Processing on Hadoop
Claudio Martella
@claudiomartella
Graphs are simple
A computer network
A social network
A semantic network
A map
Predicting break ups
Aggregation approach vs. graph approach
Graphs are nasty.
Each vertex depends on its neighbours, recursively.
Recursive problems are nicely solved iteratively.
PageRank in MapReduce
• Record: < v_i, pr, [ v_j, ..., v_k ] >
• Mapper: emits < v_j, pr / #neighbours >
• Reducer: sums the partial values
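
The record, mapper and reducer above can be sketched as a single simulated job. This is a minimal in-process illustration, not Hadoop code: `mapper`, `reducer`, `run_iteration` and the three-vertex graph are hypothetical names and data chosen for the example, and damping is left out because the slide does not mention it.

```python
from collections import defaultdict

def mapper(vertex, pr, neighbours):
    """Emit the adjacency list plus one PageRank share per neighbour."""
    # The structure has to be re-emitted so the next job can rebuild the record.
    yield vertex, ("structure", neighbours)
    for target in neighbours:
        yield target, ("share", pr / len(neighbours))

def reducer(vertex, values):
    """Sum the partial values and rebuild < v_i, pr, [neighbours] >."""
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    return vertex, total, neighbours

def run_iteration(records):
    """Simulate one job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for vertex, pr, neighbours in records:
        for key, value in mapper(vertex, pr, neighbours):
            shuffled[key].append(value)
    return [reducer(v, vals) for v, vals in sorted(shuffled.items())]

# Hypothetical toy graph: a -> b, c; b -> c; c -> a
records = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(20):  # every iteration would be a separate MapReduce job
    records = run_iteration(records)
ranks = {v: pr for v, pr, _ in records}
```

Note that the mapper ships the adjacency list alongside the rank shares on every pass, which is the cost the next slides call out.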
MapReduce dataflow
Drawbacks
• Each job is executed N times
• Job bootstrap
• Mappers send PR values and structure
• Extensive IO at input, shuffle & sort, output
Timeline
• Inspired by Google Pregel (2010)
• Donated to ASF by Yahoo! in 2011
• Top-level project in 2012
• 1.0 release in January 2013
• 1.1 release in November 2014
Plays well with Hadoop
Vertex-centric API
Shortest Paths
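Code

    def compute(vertex, messages):
        # Smallest distance offered by incoming messages
        # (a source vertex would start from 0 instead of infinity)
        minValue = float('inf')
        for m in messages:
            minValue = min(minValue, m)
        # Propagate only when this vertex's distance improves
        if minValue < vertex.getValue():
            vertex.setValue(minValue)
            for edge in vertex.getEdges():
                message = minValue + edge.getValue()
                sendMessage(edge.getTargetId(), message)
        vertex.voteToHalt()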
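
To see the compute function run end to end, here is a small single-process simulation of supersteps. It is a sketch under assumptions, not Giraph's API: `shortest_paths`, the dictionary-based graph and the inbox/outbox handling are hypothetical stand-ins for the framework, and halting is modelled by running only vertices that received messages.

```python
import math

def compute(vertex_id, state, edges, messages, is_source, outbox):
    """One vertex's work in one superstep (mirrors the slide's pseudocode)."""
    min_value = 0.0 if is_source else math.inf
    for m in messages:
        min_value = min(min_value, m)
    if min_value < state[vertex_id]:
        state[vertex_id] = min_value
        for target, weight in edges.get(vertex_id, []):
            outbox.setdefault(target, []).append(min_value + weight)
    # Voting to halt is implicit: the vertex runs again only if it gets a message.

def shortest_paths(edges, vertices, source):
    state = {v: math.inf for v in vertices}
    inbox = {}
    superstep = 0
    while superstep == 0 or inbox:
        outbox = {}
        # Superstep 0 runs every vertex; later ones only vertices with messages.
        active = vertices if superstep == 0 else list(inbox)
        for v in active:
            compute(v, state, edges, inbox.get(v, []), v == source, outbox)
        inbox = outbox  # barrier: messages become visible in the next superstep
        superstep += 1
    return state

# Hypothetical weighted graph: a -> b (1), a -> c (4), b -> c (2)
edges = {"a": [("b", 1.0), ("c", 4.0)], "b": [("c", 2.0)], "c": []}
dist = shortest_paths(edges, ["a", "b", "c"], "a")
```

The loop ends once a superstep produces no messages, which corresponds to every vertex having voted to halt.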
BSP & Giraph
Advantages
• No locks: message-based communication
• No semaphores: global synchronization
• Iteration isolation: massively parallelizable
Designed for iterations
• Stateful (in-memory)
• Only intermediate values (messages) sent
• Hits the disk at input, output, checkpoint
• Can go out-of-core
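
A rough way to picture these points is PageRank written vertex-centrically. The sketch below is an assumption-laden simulation, not Giraph code: `pagerank_bsp` and the toy graph are hypothetical, damping is omitted, and the inbox/outbox dictionaries stand in for the framework's messaging. The graph and the vertex values stay in memory for the whole run; only rank shares move between supersteps.

```python
def pagerank_bsp(graph, supersteps=20):
    """graph: dict of vertex -> out-neighbours, held in memory throughout."""
    n = len(graph)
    value = {v: 1.0 / n for v in graph}  # vertex state survives across supersteps
    inbox = {v: [] for v in graph}
    messages_sent = 0
    for step in range(supersteps):
        outbox = {v: [] for v in graph}
        for v, neighbours in graph.items():
            if step > 0:
                value[v] = sum(inbox[v])  # combine the incoming shares
            for target in neighbours:
                # Only the intermediate value is sent, never the structure.
                outbox[target].append(value[v] / len(neighbours))
                messages_sent += 1
        inbox = outbox  # global barrier between supersteps
    return value, messages_sent

# Hypothetical toy graph: a -> b, c; b -> c; c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks, messages_sent = pagerank_bsp(graph)
```

Each superstep sends exactly one message per edge, so the traffic is proportional to the edge count rather than to the size of the full records.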
Giraph job lifetime
Architecture
Composable API
Checkpointing
No single points of failure (SPoFs)
Giraph scales
ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920
Giraph is fast
• 100x over MapReduce (PageRank)
• Jobs run within minutes
• Given you have the resources ;-)
Serialised objects
Primitive types
• Autoboxing is expensive
• Object overhead (JVM)
• Use primitive types in your own code
• Use primitive-type-based libraries (e.g. fastutil)
Sharded aggregators
Okapi
• Apache Mahout for graphs
• Graph-based recommenders: ALS, SGD, SVD++, etc.
• Graph analytics: graph partitioning, community detection, K-Core, etc.