Posted on 17-Aug-2020
Indian Institute of Science Bangalore, India
One Trillion Edges:
Graph Processing at Facebook Scale
Avery Ching, Sergey Edunov, Maja Kabiljo, Dionysios Logothetis, Sambhavi Muthukrishnan
Facebook
Presented by: Swapnil Gandhi
21st November 2018
Königsberg* Bridge Problem
Its negative resolution by Leonhard Euler in 1736 laid the foundations of graph theory.
* Located in the Kingdom of Prussia (now Kaliningrad, Russia)
Graphs are Common
Web & Social Networks ‣ Web graph, citation networks, Twitter, Facebook, Internet
Knowledge networks & relationships ‣ Google’s Knowledge Graph, CMU’s NELL
Cybersecurity ‣ Telecom call logs, financial transactions, malware
Internet of Things ‣ Transport, power, water networks
Bioinformatics ‣ Gene sequencing, gene expression networks
Graph Algorithms
Traversals: Paths & flows between different parts of the graph ‣ Breadth-First Search, Shortest Path, Minimum Spanning Tree, Eulerian paths, Max-Cut
Clustering: Closeness between sets of vertices ‣ Community detection & evolution, Connected Components, K-means clustering, Max Independent Set
Centrality: Relative importance of vertices ‣ PageRank, Betweenness Centrality
Graphs are Central to Analytics
[Pipeline diagram: raw Wikipedia XML is parsed into tables; the hyperlink graph feeds PageRank to produce the top 20 pages; article text feeds a term-document graph and a topic model (LDA) to produce word topics; the editor graph feeds community detection to map users to communities, which join with discussion tables and topics to yield topic communities.]
But Graphs can be Challenging
Shared-memory algorithms don’t scale!
Do not fit naturally into Hadoop/MapReduce ‣ Multiple MR jobs (iterative MR) ‣ Topology & data written to HDFS on every iteration ‣ Tuple-centric, rather than graph-centric, abstraction
Lots of work on parallel graph libraries for HPC ‣ Boost Graph Library, Graph500 ‣ Storage & compute are (loosely) coupled, not fault tolerant ‣ But not everyone has a supercomputer! • If you do own one, stick with HPC algorithms
PageRank using MR
MapReduce: https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf
PageRank using MR
MR will run for multiple iterations (typically ~30)
Mapper will ‣ Initially, load the adjacency list and initialize a default PR ‣ In subsequent iterations, load the adjacency list and the new PR ‣ Emit two types of messages from Map: the adjacency list itself and per-neighbour PR contributions
Reducer will ‣ Reconstruct the adjacency list for each vertex ‣ Update the vertex’s PageRank based on its neighbours’ PR messages ‣ Write the adjacency list and new PR values to HDFS, to be used by the next Map iteration
SQL vs. MapReduce: http://www.science.smith.edu/dftwiki/images/6/6a/ComparisonOfApproachesToLargeScaleDataAnalysis.pdf
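The map and reduce phases above can be sketched in miniature. The single-process C++ version below is illustrative only: all names are hypothetical, and a real MR job would additionally serialize the adjacency list and rank values to HDFS between iterations rather than hold them in memory.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

// vertex id -> out-neighbours (the adjacency list the mapper loads)
using Graph = std::map<int, std::vector<int>>;

// One PageRank iteration expressed as a map phase and a reduce phase.
std::map<int, double> PageRankIteration(const Graph& g,
                                        const std::map<int, double>& rank,
                                        double d = 0.85) {
  // Map phase: each vertex emits rank/out_degree to every out-neighbour
  // (in real MR it also re-emits its adjacency list so topology survives).
  std::map<int, double> contributions;
  for (const auto& [v, nbrs] : g) {
    if (nbrs.empty()) continue;
    double share = rank.at(v) / nbrs.size();
    for (int u : nbrs) contributions[u] += share;
  }
  // Reduce phase: sum the contributions per vertex and apply damping.
  std::map<int, double> next;
  for (const auto& kv : g)
    next[kv.first] = (1.0 - d) / g.size() + d * contributions[kv.first];
  return next;
}
```

On a 3-vertex cycle starting from the uniform distribution, one iteration leaves every rank at 1/3, which is a quick sanity check of the phase logic.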
Two Birds
Half-caf Double Espresso ‣ Less data movement over the network ‣ Fault tolerance
Credits: Google Images
One Stone
Pregel
It’s wordplay on the English idiom “kill two birds with one stone”.
Pregel
To overcome these challenges, Google came up with Pregel.
Valiant’s BSP
[Diagram: processors P1–P4 executing Superstep 1 and Superstep 2, each superstep consisting of a Computation phase, a Communication phase, and Barrier Synchronization.]
Barrier synchronization is “often expensive and should be used as sparingly as possible”.
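The superstep structure can be made concrete with a tiny sequential simulation of the BSP model. This is an illustrative sketch (the struct and method names are invented, not Pregel’s API): local computation runs on every processor, outgoing messages are buffered, and they become visible only after the barrier, at the start of the next superstep.

```cpp
#include <cassert>
#include <functional>
#include <vector>

struct BSP {
  int procs;
  std::vector<std::vector<int>> inbox;   // messages visible this superstep
  std::vector<std::vector<int>> outbox;  // buffered until the barrier

  explicit BSP(int p) : procs(p), inbox(p), outbox(p) {}

  void Send(int to, int msg) { outbox[to].push_back(msg); }

  // One superstep: compute on every processor, then barrier-exchange.
  void Superstep(const std::function<void(BSP&, int)>& compute) {
    for (int p = 0; p < procs; ++p) compute(*this, p);  // computation phase
    inbox = outbox;                                     // communication phase
    for (auto& o : outbox) o.clear();                   // barrier complete
  }
};
```

The key property the sketch demonstrates: a message sent during superstep *k* is never readable in superstep *k*, only in superstep *k+1*.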
Vertex State Machine
In superstep 0, every vertex is in the active state.
A vertex deactivates itself by voting to halt.
It can be reactivated by receiving an (external) message.
The algorithm terminates once every vertex has voted to halt and no messages are in flight.
Vertex-Centric Programming
Vertex-centric programming model ‣ Logic written from the perspective of a single vertex ‣ Executed on all vertices
Vertices know ‣ Their own value(s) ‣ Their outgoing edges
Finding the Largest Value in a Graph using Pregel
[Diagram: four vertices with initial values 3, 6, 2, 1. Superstep 0: 3 6 2 1; Superstep 1: 6 6 2 6; Superstep 2: 6 6 6 6; Superstep 3: 6 6 6 6. Legend: active vertex, voted to halt, message. By superstep 3 every vertex holds the maximum value, 6, and has voted to halt.]
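The max-value example can be reproduced with a toy superstep loop: every active vertex sends its value to its out-neighbours, adopts any larger value it receives, and votes to halt when its value stops changing; a halted vertex is reactivated by an incoming message. The engine below is an illustrative sketch with invented names, not Giraph or Pregel code, and the graph shape (a 4-vertex ring) is an assumption since the slide’s exact edges are not recoverable.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct MaxValueGraph {
  std::vector<int> value;
  std::vector<bool> active;
  std::vector<std::vector<int>> out;  // out-edges per vertex

  // Runs supersteps until every vertex has halted and no messages remain.
  int Run() {
    int supersteps = 0;
    std::vector<std::vector<int>> msgs(value.size());
    bool any_active = true;
    while (any_active) {
      std::vector<std::vector<int>> next(value.size());
      for (size_t v = 0; v < value.size(); ++v) {
        // A halted vertex wakes up only when a message arrives.
        if (!active[v] && msgs[v].empty()) continue;
        int before = value[v];
        for (int m : msgs[v]) value[v] = std::max(value[v], m);
        if (supersteps == 0 || value[v] != before) {
          for (int u : out[v]) next[u].push_back(value[v]);  // notify
          active[v] = true;
        } else {
          active[v] = false;  // vote to halt
        }
      }
      msgs = std::move(next);
      ++supersteps;
      any_active = false;
      for (size_t v = 0; v < value.size(); ++v)
        if (active[v] || !msgs[v].empty()) any_active = true;
    }
    return supersteps;
  }
};
```

With values 3, 6, 2, 1 on a directed ring, every vertex converges to 6, exercising all three states: active, voted-to-halt, and reactivated-by-message.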
Advantages
Makes distributed programming easy ‣ No locks, semaphores, or race conditions ‣ Separates the computation phase from the communication phase
Vertex-level parallelization ‣ Bulk message passing for efficiency
Stateful (in-memory) ‣ Only messages & checkpoints hit disk
Lifecycle of a Pregel Program
Apache Giraph, Claudio Martella, Hadoop Summit, Amsterdam, April 2014
Applications
SSSP

class ShortestPathVertex : public Vertex<int, int, int> {
  void Compute(MessageIterator* msgs) {
    int mindist = IsSource(vertex_id()) ? 0 : INF;
    for (; !msgs->Done(); msgs->Next())
      mindist = min(mindist, msgs->Value());
    if (mindist < GetValue()) {
      *MutableValue() = mindist;
      OutEdgeIterator iter = GetOutEdgeIterator();
      for (; !iter.Done(); iter.Next())
        SendMessageTo(iter.Target(), mindist + iter.GetValue());
    }
    VoteToHalt();
  }
};

In the 0th superstep, only the source vertex will update its value.
SSSP Walkthrough (1–6)
[Diagrams: an input graph of 8 vertices A–H with weighted edges, partitioned across Workers 1–4.]
Superstep 0: the source vertex takes distance 0; all other vertices remain at ∞.
Superstep 1: the source’s neighbours update to distances 1 and 2; the rest remain at ∞.
Superstep 2: distances propagate further (0, 1, 2, 3, 4, 6, 3, ∞).
Superstep 3: the remaining tentative distances are corrected (0, 1, 2, 3, 4, 4, 3, 6).
No vertex changes thereafter: the algorithm has converged.
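The walkthrough can be reproduced by driving the Compute() logic from the SSSP slide with a plain superstep loop instead of a Pregel runtime. This is a sketch: the function name is invented, and the test graph below is a small hypothetical example, not the exact 8-vertex graph from the slides.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

constexpr int INF = std::numeric_limits<int>::max();

// edges[v] = list of (target, weight); returns shortest distance per vertex.
std::vector<int> PregelSSSP(
    const std::vector<std::vector<std::pair<int, int>>>& edges, int source) {
  int n = edges.size();
  std::vector<int> dist(n, INF);
  std::vector<std::vector<int>> msgs(n);
  bool messages_in_flight = true;
  while (messages_in_flight) {
    std::vector<std::vector<int>> next(n);
    for (int v = 0; v < n; ++v) {
      // mindist = IsSource(v) ? 0 : INF, then min over incoming messages.
      int mindist = (v == source) ? 0 : INF;
      for (int m : msgs[v]) mindist = std::min(mindist, m);
      if (mindist < dist[v]) {  // value improved: update and notify neighbours
        dist[v] = mindist;
        for (auto [u, w] : edges[v]) next[u].push_back(mindist + w);
      }
      // VoteToHalt() is implicit: a vertex only acts again on a message.
    }
    msgs = std::move(next);
    messages_in_flight = false;
    for (auto& m : msgs)
      if (!m.empty()) messages_in_flight = true;
  }
  return dist;
}
```

As on the slides, only the source updates in superstep 0, tentative distances are later corrected when shorter paths arrive, and the loop stops once no messages are in flight.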
Apache Giraph
Platform Improvements (1/2)
Efficient Memory Management ‣ Vertex and edge data stored as serialized byte arrays ‣ Better memory management → less GC pressure
Support for Multi-Threading ‣ Maximizes resource utilization ‣ Linear speed-up for CPU-bound applications like K-means clustering
Platform Improvements (2/2)
Flexible IO Format ‣ Reduces pre-processing ‣ Allows vertex and edge data to be loaded from different sources
Sharded Aggregators ‣ Aggregator responsibilities are balanced across workers ‣ Different aggregators can be assigned to different workers
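The sharded-aggregator idea can be sketched as follows: each aggregator is owned by one worker (chosen here by hashing its name), every worker sends its partial value to that owner, and the owner combines them and broadcasts the result, so the master is no longer the bottleneck. Names and structure are hypothetical, not Giraph’s actual classes.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Deterministically picks the worker responsible for an aggregator.
int OwnerOf(const std::string& aggregator, int num_workers) {
  return static_cast<int>(std::hash<std::string>{}(aggregator) % num_workers);
}

// The owning worker combines the per-worker partial values (a sum here;
// any commutative, associative combine works) and broadcasts the result.
long Aggregate(const std::string& name,
               const std::vector<std::map<std::string, long>>& partials) {
  long total = 0;
  for (const auto& worker_partials : partials) {
    auto it = worker_partials.find(name);
    if (it != worker_partials.end()) total += it->second;
  }
  return total;
}
```

With many aggregators, the hash spreads ownership across workers, which is the load-balancing property the slide describes.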
Refer to classroom discussion.
Beyond Pregel
Master Compute ‣ Allows centralized execution of computation ‣ Refer to classroom discussion
Worker Phases ‣ Special methods which bypass the Pregel model but add ease of use ‣ Applicability is application-specific
Computation Composability ‣ Decouples the Vertex from the Computation ‣ Existing Computation implementations can be reused across multiple applications
Superstep Splitting
The master runs the same “message-heavy” superstep for a fixed number of iterations.
In each iteration: ‣ Vertex computation is invoked only if the vertex passes hash function H ‣ A message is sent to a destination only if the destination passes hash function H’
Applicable to computations whose messages are not “aggregatable” ‣ If messages can be aggregated (commutative and associative), stick with Combiners
Example: friends-of-friends computation
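The gating by hash function H’ can be sketched in a few lines: a message-heavy superstep is re-run k times, and in sub-iteration i only messages whose destination hashes into bucket i are delivered, so roughly 1/k of the traffic is in memory at once. This is an illustrative sketch (using dst mod k as the hash), not Giraph’s implementation.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Messages delivered in sub-iteration `i` of `k`: only those whose
// destination vertex passes the hash gate H'(dst) = dst mod k == i.
std::vector<std::pair<int, int>> SubIteration(
    const std::vector<std::pair<int, int>>& all_msgs, int i, int k) {
  std::vector<std::pair<int, int>> delivered;
  for (const auto& [src, dst] : all_msgs)
    if (dst % k == i) delivered.emplace_back(src, dst);
  return delivered;
}
```

Because every destination falls in exactly one bucket, each message is delivered in exactly one of the k passes and none is duplicated or dropped.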
Key Take-aways
Usability, performance, and scalability improvements to Apache Giraph ‣ Code available as open source to try out!
A memoir detailing Facebook’s experience of using Giraph for production applications
Headline grabber: “Scales to a trillion-edge graph”