Apache Giraph: Large-scale graph processing done better

Post on 15-Apr-2017

Data Mining Class

Sapienza, University of Rome

A. Y. 2016 - 2017

Basic concepts Let’s start Get our hands dirty

Hi!

Simone Santacroce
santacroce.1542338@studenti.uniroma1.it
https://it.linkedin.com/in/simone-santacroce-272739134

Manuel Coppotelli
coppotelli.1540732@studenti.uniroma1.it
https://it.linkedin.com/in/manuelcoppotelli

George Adrian Munteanu
munteanu.1540833@studenti.uniroma1.it
https://it.linkedin.com/in/george-adrian-munteanu-707744134

Lorenzo Marconi
marconi.1494505@studenti.uniroma1.it
https://www.linkedin.com/in/lorenzo-marconi-1a2580105

Antonio La Torre
alatorre182@hotmail.it
https://www.linkedin.com/in/antonio-la-torre-768738134

Lucio Burlini
burlini.1705432@studenti.uniroma1.it
https://www.linkedin.com/in/lucio-burlini-827739134

Apache Giraph

Agenda

1 Basic concepts
• Graphs in the real world
• Challenges on graphs
• MapReduce
• Giraph

2 Let’s start
• Out-Degree & In-Degree

3 Get our hands dirty
• Simple PageRank

Graphs 101

• Graph: representation of a set of objects, G = <V, E>

• Captures pairwise relationships between objects

• Can have directions, weights, ...
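
As a rough illustration of the definition above, a directed, weighted graph can be kept as an adjacency list. This is only a sketch: the vertex ids and weights are made up for the example and do not come from the slides.

```python
# Directed, weighted graph G = <V, E> stored as an adjacency list.
# Vertex ids and edge weights are hypothetical, chosen only for illustration.
graph = {
    1: {2: 0.5, 3: 2.0},  # edges 1 -> 2 (weight 0.5) and 1 -> 3 (weight 2.0)
    2: {3: 1.0},          # edge 2 -> 3 (weight 1.0)
    3: {},                # vertex with no outgoing edges
}

# V is the set of keys, E is the list of (source, target, weight) triples.
vertices = set(graph)
edges = [(u, v, w) for u, nbrs in graph.items() for v, w in nbrs.items()]

print(sorted(vertices))
print(edges)
```

Storing the neighbors next to each vertex is also the shape in which a vertex-centric framework hands the graph to user code, which is why it is used here.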

A computer network

A road map

The web

Social networks

• Both physical and Internet-mediated

• Users are vertices

• Any kind of interaction generates edges

Graphs are huge!

∼ 50B pages

∼ 1.1B users

∼ 570M users

∼ 530M users

Graphs are nasty

• Graphs need processing

• Each vertex depends on its neighbors, recursively

• Recursive problems are nicely solved iteratively

So what?

Why not MapReduce? [1]

MapReduce is the current standard for managing big data sets for intensive computing.

Repeat N times ...

[1] https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

MapReduce Drawbacks

• Each job is executed N times

• Job bootstrap

• Mappers send values and structure

• Extensive IO at input, shuffle & sort, output

Disk I/O and Job scheduling quickly dominate the algorithm

Google’s Pregel [2]

• Developed specifically for large-scale graph processing

• Intuitive API that lets you “think like a vertex”

• Bulk Synchronous Parallel (BSP) as execution model

• Fault tolerance by checkpointing

[2] https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf

Giraph

The Story

Think like a vertex

• Each vertex has an id, a value, a list of adjacent neighbors and the corresponding edge values

• Vertices implement algorithms by sending messages

• Messages are delivered at the start of each superstep
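
The vertex-centric model described above can be sketched as a small single-process simulation. This is not Giraph code: `compute` and `run` are hypothetical names, the graph is made up, and the algorithm (propagating the maximum value) is chosen only because it is short. Messages sent during one superstep are handed to their targets at the start of the next one.

```python
from collections import defaultdict


def compute(value, out_edges, messages, superstep):
    """Hypothetical vertex program: keep the largest value seen so far.

    Returns (new_value, outgoing_messages, halted).
    """
    new_value = max([value, *messages])
    if superstep == 0 or new_value != value:
        # Something new to share: tell every out-neighbor.
        return new_value, [(dst, new_value) for dst in out_edges], False
    return new_value, [], True  # nothing changed: vote to halt


def run(values, edges, max_supersteps=50):
    """Run supersteps until every vertex votes to halt."""
    values = dict(values)
    inbox = defaultdict(list)
    for superstep in range(max_supersteps):
        outbox = defaultdict(list)
        all_halted = True
        for v in values:
            values[v], sent, halted = compute(
                values[v], edges.get(v, []), inbox[v], superstep)
            for dst, msg in sent:
                outbox[dst].append(msg)
            all_halted = all_halted and halted
        # Barrier: messages sent now are delivered in the next superstep.
        inbox = outbox
        if all_halted:
            break
    return values


# Made-up ring graph 1 -> 2 -> 3 -> 4 -> 1 with arbitrary initial values.
final_values = run({1: 3, 2: 6, 3: 2, 4: 1}, {1: [2], 2: [3], 3: [4], 4: [1]})
print(final_values)
```

Each vertex only reads its own value and inbox and only writes to the outbox, so no vertex ever touches another vertex's state directly.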

Bulk Synchronous Parallel (BSP)

• Master-Slave architecture

• Batch oriented processing

• Computation happens in-memory

Advantages

• No locks: message-based communication

• No semaphores: global synchronization

• Iteration isolation: massively parallelizable

Architecture

Single Map-only Job

Jobs Schema

Other things

Aggregators

• Mechanism for global communication and global computation

• Global value calculated in superstep t available in t + 1

• Pre-defined (e.g. sum, max, min) or user-definable functions [3]

Combiners

• User-defined function [3] applied to messages before they are sent or delivered

• Similar to Hadoop combiners

• Saves network traffic or memory

Checkpointing

• Store work to disk at user-defined intervals (disk isn’t always evil)

• Restart on failure

[3] The function has to be both commutative and associative
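
A minimal sketch of what a combiner and an aggregator do, written as plain Python rather than against the Giraph API. The helper `combine` and all the values are hypothetical. Because the function is commutative and associative, the messages can be folded in any order and still give the same result.

```python
from collections import defaultdict
from functools import reduce


def combine(messages, combiner):
    """Fold all messages addressed to the same vertex into a single one."""
    by_target = defaultdict(list)
    for target, value in messages:
        by_target[target].append(value)
    return {t: reduce(combiner, vals) for t, vals in by_target.items()}


# Four (target, value) messages; three of them go to vertex 7.
outgoing = [(7, 0.5), (7, 1.0), (9, 0.25), (7, 0.5)]
combined = combine(outgoing, lambda a, b: a + b)  # sum combiner
print(combined)

# Aggregator sketch: every vertex contributes its value during superstep t,
# and the reduced result would become visible to all vertices in t + 1.
vertex_values = {1: 3.0, 2: 6.0, 3: 2.0}
max_aggregate = reduce(max, vertex_values.values())
print(max_aggregate)
```

With the sum combiner only one message per target has to be sent or stored, which is where the network and memory savings come from.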

2 Let’s start
• Out-Degree & In-Degree

LongLongNullTextInputFormat

org.apache.giraph.io.formats.LongLongNullTextInputFormat

If there is an edge from Node 1 to Node 2, then Node 2 appears in the neighbor list of Node 1

<NODE1 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...

<NODE2 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...

...
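
To make the line layout above concrete, here is a small parser sketch. It is not the Giraph class itself, only an illustration of the text format; `parse_adjacency` is a hypothetical helper and the sample lines are made up.

```python
def parse_adjacency(lines):
    """Parse '<node id> <neighbor id> <neighbor id> ...' lines into a dict."""
    graph = {}
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        node, *neighbors = (int(t) for t in tokens)
        graph[node] = neighbors
    return graph


# Made-up input: node 1 points to 2 and 3, node 2 to 3, node 3 to nothing.
sample = ["1 2 3", "2 3", "3"]
print(parse_adjacency(sample))
```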

IdWithValueTextOutputFormat

org.apache.giraph.io.formats.IdWithValueTextOutputFormat

For each node print the Node ID and the Node Value

<NODE1 ID> <TAB> <NODE1 VALUE>

<NODE2 ID> <TAB> <NODE2 VALUE>

...
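
The out-degree and in-degree computations named in the agenda can be sketched in the same vertex-centric spirit, ending with lines in the output layout shown above. This is a plain Python illustration, not the demo code; the graph and the helper `format_output` are assumptions.

```python
from collections import Counter

# Made-up directed graph: vertex id -> list of out-neighbors.
graph = {1: [2, 3], 2: [3], 3: [1]}

# Out-degree: each vertex just counts its own outgoing edges.
out_degree = {v: len(nbrs) for v, nbrs in graph.items()}

# In-degree takes two supersteps:
#   superstep 0: every vertex sends one message to each out-neighbor
#   superstep 1: every vertex counts the messages it received
messages = [dst for nbrs in graph.values() for dst in nbrs]
received = Counter(messages)
in_degree = {v: received[v] for v in graph}


def format_output(values):
    """Render '<node id><TAB><node value>' lines, one per vertex."""
    return [f"{v}\t{values[v]}" for v in sorted(values)]


print("\n".join(format_output(out_degree)))
print("\n".join(format_output(in_degree)))
```

Out-degree needs no communication at all, while in-degree needs exactly one round of messages, which is why the two make a gentle first exercise.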

Demo

Demo code

https://github.com/manuelcoppotelli/giraph-demo

3 Get our hands dirty
• Simple PageRank

Google’s PageRank [4]

• The success factor of Google’s search engine

• A graph algorithm computing the “importance” of webpages

◦ Important pages have a lot of links from other important pages

◦ Look at the structure of the underlying network

• Ability to conduct web-scale graph processing

[4] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

Simple PageRank

• Recursive definition

$$\mathrm{PageRank}_{i+1}(v) = \frac{1 - d}{N} + d \cdot \sum_{u \to v} \frac{\mathrm{PageRank}_i(u)}{O(u)}$$

• Where:

◦ d: damping factor, the fraction of the PageRank that must be transferred to the neighbors; usually 0.85

◦ N: total number of pages

◦ O(u): out-degree of page u, i.e. the total number of links on that page
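
The recursive definition above can be evaluated iteratively. The following is a minimal sketch, not the demo implementation: the function names, the number of iterations, the initial value 1/N and the sample graph are assumptions, and every page is assumed to have at least one outgoing link and to appear as a key in `out_links`.

```python
def pagerank_step(ranks, out_links, d=0.85):
    """Apply the update rule once: (1 - d) / N + d * sum of rank(u) / O(u)."""
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    for u, targets in out_links.items():
        for v in targets:
            # Page u splits its current rank evenly among its O(u) links.
            incoming[v] += ranks[u] / len(targets)
    return {v: (1 - d) / n + d * incoming[v] for v in ranks}


def pagerank(out_links, iterations=30, d=0.85):
    """Iterate the update rule a fixed number of times, starting from 1/N."""
    ranks = {v: 1.0 / len(out_links) for v in out_links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks, out_links, d)
    return ranks


# Made-up three-page web: a links to b and c, b links to c, c links to a.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
result = pagerank(links)
print(result)
```

A fixed iteration count keeps the sketch short; a convergence check on the change between two consecutive iterations would be the usual alternative.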

Simple PageRank Example

Initial values of the three vertices: 1.0, 1.0, 1.0

Messages sent along the edges: 0.5, 0.5, 1, 1

Updated values:

1 · 0.85 + 0.15/3

0.5 · 0.85 + 0.15/3

1.5 · 0.85 + 0.15/3

Values shown in the last step of the figure: 0.43, 0.21, 0.64
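
The first update of the example can be reproduced with explicit messages. The figure itself is not part of this transcript, so the topology below (A links to B and C, B links to C, C links to A) is an assumption, chosen because it yields exactly the messages 0.5, 0.5, 1, 1 and the received sums 1, 0.5 and 1.5 that appear in the example. The vertex names are hypothetical.

```python
D = 0.85  # damping factor used in the example

# Assumed topology: one vertex with two out-links, two vertices with one.
out_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {v: 1.0 for v in out_links}  # initial values from the example

# Superstep t: each vertex splits its value evenly among its out-neighbors.
messages = [(dst, ranks[src] / len(dsts))
            for src, dsts in out_links.items() for dst in dsts]

# Superstep t + 1: each vertex sums what it received and applies the formula.
received = {v: 0.0 for v in ranks}
for dst, value in messages:
    received[dst] += value
updated = {v: received[v] * D + (1 - D) / len(ranks) for v in ranks}

print(messages)
print(updated)
```

Under this assumed topology the three expressions evaluate to roughly 0.9, 0.475 and 1.325 for A, B and C respectively.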

JsonLongDoubleFloatDoubleVertexInputFormat

org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat

Express both nodes and edges information using JSON arrays

[<vertex id>, <vertex value>,

[

[<dest vertex id>, <edge value>],

...

]

]
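
As an illustration of the layout above, one input line can be read with a standard JSON parser. This is a sketch of the text format only, not of the Giraph class; `parse_vertex` is a hypothetical helper and the sample line is made up.

```python
import json


def parse_vertex(line):
    """Parse '[id, value, [[dest id, edge value], ...]]' into Python values."""
    vertex_id, value, edges = json.loads(line)
    out_edges = {int(dst): float(weight) for dst, weight in edges}
    return int(vertex_id), float(value), out_edges


# Made-up line: vertex 0 with value 1.0 and edges to vertices 1 and 3.
sample_line = "[0, 1.0, [[1, 1.0], [3, 3.0]]]"
print(parse_vertex(sample_line))
```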

Notice: for more input/output formats, visit https://github.com/apache/giraph/tree/trunk/giraph-core/src/main/java/org/apache/giraph/io/formats

Demo

Demo code

https://github.com/manuelcoppotelli/giraph-demo

Q? & A!

Thank you for your attention

Contact us with any questions or problems

Demo code

https://github.com/manuelcoppotelli/giraph-demo

Homework

https://github.com/manuelcoppotelli/giraph-homework
