Introducing Apache Giraph for Large Scale Graph Processing


Sebastian Schelter

PhD student at the Database Systems and Information Management Group of TU Berlin

Committer and PMC member at Apache Mahout and Apache Giraph

mail ssc@apache.org blog http://ssc.io

Graph recap

graph: abstract representation of a set of objects (vertices), where some pairs of these objects are connected by links (edges), which can be directed or undirected

Graphs can be used to model arbitrary things like road networks, social networks, flows of goods, etc.

The majority of graph algorithms are iterative and traverse the graph in some way.

[Figure: small example graph with vertices A, B, C and D]

Real world graphs are really large!

• the World Wide Web has several billion pages with several billion links

• Facebook's social graph had more than 700 million users and more than 68 billion friendships in 2011

• Twitter's social graph has billions of follower relationships

Why not use MapReduce/Hadoop?

• Example: PageRank, Google's famous algorithm for measuring the authority of a webpage based on the underlying network of hyperlinks

• defined recursively: each vertex distributes its authority to its neighbors in equal proportions

$p_i = \sum_{(j,i) \in E} \frac{p_j}{d_j}$

where $E$ is the set of edges and $d_j$ is the out-degree (number of outgoing links) of vertex $j$

Textbook approach to PageRank in MapReduce

• PageRank p is the principal eigenvector of the Markov matrix M defined by the transition probabilities between web pages

• it can be obtained by iteratively multiplying an initial PageRank vector by M (power iteration)

$p^{(k)} = M^k \, p^{(0)}$

[Figure: power iteration — every row of M is multiplied with the current vector $p^{(i)}$ to produce the next vector $p^{(i+1)}$]
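To make the power iteration concrete, here is a small self-contained sketch in Java. The matrix is an assumption for illustration: it encodes the three-page toy graph used later in this talk (edges A→B, A→C, B→A, B→C, C→B), with column j holding 1/outdegree(j) for every link leaving page j.

import java.util.Arrays;

// A minimal power-iteration sketch (not Giraph code). The matrix below is
// an illustrative assumption: it encodes the toy graph A->B, A->C, B->A,
// B->C, C->B, where each page spreads its rank equally over its out-links.
public class PowerIteration {
    public static void main(String[] args) {
        double[][] M = {
            {0.0, 0.5, 0.0},   // A receives half of B's rank
            {0.5, 0.0, 1.0},   // B receives half of A's and all of C's rank
            {0.5, 0.5, 0.0}    // C receives half of A's and half of B's rank
        };
        double[] p = {1.0 / 3, 1.0 / 3, 1.0 / 3};  // initial PageRank vector

        for (int k = 0; k < 30; k++) {             // p <- M p (one power iteration step)
            double[] next = new double[p.length];
            for (int i = 0; i < M.length; i++) {
                for (int j = 0; j < M[i].length; j++) {
                    next[i] += M[i][j] * p[j];
                }
            }
            p = next;
        }
        // converges towards the principal eigenvector of M
        System.out.println(Arrays.toString(p));
    }
}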

Drawbacks

• Not intuitive: only crazy scientists think in matrices and eigenvectors

• Unnecessarily slow: each iteration is scheduled as a separate MapReduce job with lots of overhead
  – the graph structure is read from disk
  – the map output is spilled to disk
  – the intermediary result is written to HDFS

• Hard to implement: a join has to be implemented by hand, which is lots of work, and the best strategy is data dependent

Google Pregel

• distributed system especially developed for large scale graph processing

• intuitive API that lets you 'think like a vertex'

• Bulk Synchronous Parallel (BSP) as execution model

• fault tolerance by checkpointing

Bulk Synchronous Parallel (BSP)

[Figure: one BSP superstep — each processor performs local computation, then communication, followed by barrier synchronization]
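As a rough illustration of this execution model, here is a hypothetical sketch of the driver loop in Java; Worker, run() and the anonymous worker are illustrative names, not Pregel or Giraph API:

public class BspSketch {
    // Illustrative sketch of a BSP driver loop (hypothetical, not Giraph API).
    interface Worker {
        void computeLocalVertices();   // local computation on assigned vertices
        void exchangeMessages();       // communication with the other workers
        boolean hasActiveVertices();   // true while any assigned vertex is active
    }

    static void run(java.util.List<Worker> workers) {
        int superstep = 0;
        while (workers.stream().anyMatch(Worker::hasActiveVertices)) {
            for (Worker w : workers) w.computeLocalVertices(); // local computation
            for (Worker w : workers) w.exchangeMessages();     // communication
            // barrier synchronization: superstep i + 1 starts only after every
            // worker has finished computing and communicating in superstep i
            superstep++;
        }
        System.out.println("halted after " + superstep + " supersteps");
    }

    public static void main(String[] args) {
        // trivial worker that stays active for three supersteps
        Worker w = new Worker() {
            int remaining = 3;
            public void computeLocalVertices() { remaining--; }
            public void exchangeMessages() { }
            public boolean hasActiveVertices() { return remaining > 0; }
        };
        run(java.util.List.of(w));
    }
}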

Vertex-centric BSP

• each vertex has an id, a value, a list of its adjacent vertex ids and the corresponding edge values

• each vertex is invoked in each superstep, can recompute its value and send messages to other vertices, which are delivered over superstep barriers

• advanced features: termination votes, combiners, aggregators, topology mutations (a combiner sketch follows below)

[Figure: vertex1, vertex2 and vertex3 exchanging messages across superstep i, superstep i + 1 and superstep i + 2 — messages sent in one superstep are delivered in the next]
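To illustrate the combiner feature mentioned above, here is a hypothetical sketch in the spirit of Pregel's combiners; the class and method names are illustrative, not the exact Giraph interface. Since a PageRank vertex only needs the sum of its incoming messages, partial sums headed for the same vertex can already be combined on the sending worker, cutting network traffic.

// Hypothetical message combiner (illustrative, not the Giraph interface).
public class DoubleSumCombiner {
    public double combine(double message1, double message2) {
        return message1 + message2;
    }

    public static void main(String[] args) {
        DoubleSumCombiner combiner = new DoubleSumCombiner();
        // two PageRank shares headed for the same vertex collapse into one message
        System.out.println(combiner.combine(0.17, 0.33)); // 0.5
    }
}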

Master-slave architecture

• vertices are partitioned and assigned to workers
  – default: hash-partitioning (sketched below)
  – custom partitioning possible

• the master assigns and coordinates, while workers execute vertices and communicate with each other

[Figure: a Master coordinating Worker 1, Worker 2 and Worker 3]
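A minimal sketch of the default hash-partitioning idea, as a hypothetical helper (not Giraph API): all that matters is that the vertex-to-worker mapping is deterministic and spreads vertices evenly.

// Illustrative default assignment of vertices to workers by hashing.
public class HashPartitioner {
    public static int workerFor(long vertexId, int numWorkers) {
        return Math.floorMod(Long.hashCode(vertexId), numWorkers);
    }

    public static void main(String[] args) {
        System.out.println(workerFor(42L, 3)); // vertex 42 -> worker 0, 1 or 2
    }
}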

PageRank in Pregel

class PageRankVertex {

  void compute(Iterator<Double> messages) {
    if (getSuperstep() > 0) {
      // recompute own PageRank as the sum of the neighbors' messages
      double pageRank = 0;
      while (messages.hasNext()) {
        pageRank += messages.next();
      }
      setVertexValue(pageRank);
    }

    if (getSuperstep() < k) {
      // send each neighbor its share of the updated PageRank
      sendMessageToAllNeighbors(getVertexValue() / getNumOutEdges());
    } else {
      voteToHalt(); // no more messages: vertex votes to terminate
    }
  }
}
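Note how the division by getNumOutEdges() realizes the 'distributes its authority in equal proportions' rule from the recursive definition: every neighbor receives the same share of the vertex's current PageRank, and the k-superstep cutoff bounds the number of power-iteration steps.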

$p_i = \sum_{(j,i) \in E} \frac{p_j}{d_j}$

PageRank toy example

[Figure: three supersteps on an input graph with pages A, B and C and edges A→B, A→C, B→A, B→C, C→B]

Superstep   A     B     C    messages sent (current value / #out-edges)
    0      .33   .33   .33   A sends .17 to B and C; B sends .17 to A and C; C sends .33 to B
    1      .17   .50   .34   A sends .09 to B and C; B sends .25 to A and C; C sends .34 to B
    2      .25   .43   .34   A sends .13 to B and C; B sends .22 to A and C; C sends .34 to B
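Each value is simply the sum of the messages received in the previous superstep. For example, B's value in superstep 2 (B receives from A, which has two out-edges, and from C, which has one):

$p_B^{(2)} = \frac{p_A^{(1)}}{2} + \frac{p_C^{(1)}}{1} = \frac{0.17}{2} + 0.34 \approx 0.09 + 0.34 = 0.43$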

Cool, where can I download it?

• Pregel is proprietary, but:

– Apache Giraph is an open source implementation of Pregel

– runs on standard Hadoop infrastructure

– computation is executed in memory

– can be a job in a pipeline (MapReduce, Hive)

– uses Apache ZooKeeper for synchronization

Giraph's Hadoop usage

[Figure: workers run as tasks inside the TaskTrackers (one TaskTracker also hosts the master), with ZooKeeper, the JobTracker and the NameNode providing coordination and storage]

Anatomy of an execution

Setup
• load the graph from disk
• assign vertices to workers
• validate workers' health

Compute
• assign messages to workers
• iterate on active vertices
• call the vertices' compute()

Synchronize
• send messages to workers
• compute aggregators
• checkpoint

Teardown
• write back the result
• write back the aggregators

Who is doing what?

• ZooKeeper: responsible for computation state
  – partition/worker mapping
  – global state: #superstep
  – checkpoint paths, aggregator values, statistics

• Master: responsible for coordination
  – assigns partitions to workers
  – coordinates synchronization
  – requests checkpoints
  – aggregates aggregator values
  – collects health statuses

• Worker: responsible for vertices
  – invokes the compute() function of active vertices
  – sends, receives and assigns messages
  – computes local aggregation values

What do you have to implement?

• your algorithm as a Vertex

– Subclass one of the many existing implementations: BasicVertex, MutableVertex, EdgeListVertex, HashMapVertex, LongDoubleFloatDoubleVertex,...

• a VertexInputFormat to read your graph

– e.g. from a text file with adjacency lists like <vertex> <neighbor1> <neighbor2> ... (a parsing sketch follows after this list)

• a VertexOutputFormat to write back the result

– e.g. <vertex> <pageRank>
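To make the input side concrete, here is a minimal, hypothetical sketch of the parsing step such a VertexInputFormat would perform on adjacency-list lines; the class and method names are illustrative, not the Giraph API.

import java.util.ArrayList;
import java.util.List;

// Hypothetical parsing step for an adjacency-list line such as "1 4 7 9"
// (vertex 1 with out-edges to 4, 7 and 9); a real VertexInputFormat would
// wrap this logic in Giraph's reader classes.
public class AdjacencyLineParser {
    public static long vertexId(String line) {
        return Long.parseLong(line.trim().split("\\s+")[0]);
    }

    public static List<Long> neighbors(String line) {
        String[] tokens = line.trim().split("\\s+");
        List<Long> targets = new ArrayList<>();
        for (int i = 1; i < tokens.length; i++) {
            targets.add(Long.parseLong(tokens[i]));
        }
        return targets;
    }

    public static void main(String[] args) {
        String line = "1 4 7 9";
        System.out.println(vertexId(line) + " -> " + neighbors(line));
    }
}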

Starting a Giraph job

• no difference to starting a Hadoop job:

$ hadoop jar giraph-0.1-jar-with-dependencies.jar o.a.g.GiraphRunner \
    o.a.g.examples.ConnectedComponentsVertex \
    --inputFormat o.a.g.examples.IntIntNullIntTextInputFormat \
    --inputPath hdfs:///wikipedia/pagelinks.txt \
    --outputFormat o.a.g.examples.ComponentOutputFormat \
    --outputPath hdfs:///wikipedia/results/ \
    --workers 89 \
    --combiner o.a.g.examples.MinimumIntCombiner

What's to come?

• Current and future work in Giraph
  – graduation from the incubator
  – out-of-core messaging
  – algorithms library

• 2-day workshop after Berlin Buzzwords
  – topic: 'Parallel Processing beyond MapReduce'
  – meet the developers of Giraph and Stratosphere: http://berlinbuzzwords.de/content/workshops-berlin-buzzwords

Everything is a network!

Further resources

• Apache Giraph homepage: http://incubator.apache.org/giraph

• Claudio Martella: "Apache Giraph: Distributed Graph Processing in the Cloud", http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph-processing-in-the-cloud/

• Malewicz et al.: "Pregel – a system for large scale graph processing", PODC '09, http://dl.acm.org/citation.cfm?id=1582723

Thank you.

Questions?