Apache Giraph: Large-scale graph processing done better

Apache Giraph Large-scale graph processing done better Data Mining Class Sapienza, University of Rome A. Y. 2016 - 2017


Page 1: Apache Giraph: Large-scale graph processing done better

Apache Giraph: Large-scale graph processing done better

Data Mining Class

Sapienza, University of Rome

A. Y. 2016 - 2017

Page 2: Apache Giraph: Large-scale graph processing done better

Basic concepts Let’s start Get our hands dirty

Hi!

Simone Santacroce

https://it.linkedin.com/in/simone-santacroce-272739134

Manuel Coppotelli

https://it.linkedin.com/in/manuelcoppotelli

George Adrian Munteanu

https://it.linkedin.com/in/george-adrian-munteanu-707744134

Lorenzo Marconi

https://www.linkedin.com/in/lorenzo-marconi-1a2580105

Antonio La Torre

https://www.linkedin.com/in/antonio-la-torre-768738134

Lucio Burlini

https://www.linkedin.com/in/lucio-burlini-827739134

Apache Giraph

Page 3: Apache Giraph: Large-scale graph processing done better

Agenda

1 Basic concepts
  • Graphs in the real world
  • Challenges on graphs
  • MapReduce
  • Giraph

2 Let’s start
  • Out-Degree & In-Degree

3 Get our hands dirty
  • Simple PageRank

Page 5: Apache Giraph: Large-scale graph processing done better

Graphs 101

• Graph: representation of a set of objects, G = <V, E>

• Captures pairwise relationships between objects

• Can have directions, weights, ...
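The definition above can be made concrete with a small adjacency map. This is a minimal Python sketch, not Giraph code: the helper `build_graph` and the toy edge list are hypothetical names and data chosen only for illustration.

```python
from collections import defaultdict


def build_graph(edges, directed=True):
    """Return an adjacency map {vertex: {neighbor: weight}} for G = <V, E>."""
    adjacency = defaultdict(dict)
    for source, target, weight in edges:
        adjacency[source][target] = weight
        adjacency[target]  # touch the target so vertices without out-edges still appear in V
        if not directed:
            adjacency[target][source] = weight
    return dict(adjacency)


# Hypothetical toy data: three vertices connected by weighted, directed edges.
edges = [("a", "b", 1.0), ("a", "c", 2.5), ("b", "c", 0.5)]
graph = build_graph(edges)
print(sorted(graph))  # the vertex set V
print(graph["a"])     # outgoing edges of "a" with their weights
```

Passing `directed=False` stores every edge in both directions, which is one simple way to model an undirected graph.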

Page 6: Apache Giraph: Large-scale graph processing done better

A computer network

Page 7: Apache Giraph: Large-scale graph processing done better

A road map

Page 8: Apache Giraph: Large-scale graph processing done better

The web

Page 9: Apache Giraph: Large-scale graph processing done better

Social networks

• Both physical and Internet mediated

• Users are vertices

• Any kind of interaction generates edges

Page 10: Apache Giraph: Large-scale graph processing done better

Graphs are huge!

∼ 50B pages

∼ 1.1B users

∼ 570M users

∼ 530M users

Page 11: Apache Giraph: Large-scale graph processing done better

Graphs are nasty

• Graphs need processing

• Each vertex depends on its neighbors, recursively

• Recursive problems are nicely solved iteratively

So what?

Page 15: Apache Giraph: Large-scale graph processing done better

Why not MapReduce? [1]

MapReduce is the current standard for managing big data sets in intensive computing.

Repeat N times ...

[1] https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf

Page 16: Apache Giraph: Large-scale graph processing done better

MapReduce Drawbacks

• Each job is executed N times

• Job bootstrap

• Mappers send values and structure

• Extensive IO at input, shuffle & sort, output

Disk I/O and job scheduling quickly dominate the algorithm’s running time

Page 17: Apache Giraph: Large-scale graph processing done better

Google’s Pregel [2]

• Developed specifically for large-scale graph processing

• Intuitive API that lets you “think like a vertex”

• Bulk Synchronous Parallel (BSP) as execution model

• Fault tolerance by checkpointing

[2] https://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/p135-malewicz.pdf

Page 21: Apache Giraph: Large-scale graph processing done better

Giraph

Page 22: Apache Giraph: Large-scale graph processing done better

The Story

Page 23: Apache Giraph: Large-scale graph processing done better

Think like a vertex

• Each vertex has an id, a value, a list of adjacent neighbors and the corresponding edge values

• Vertices implement algorithms by sending messages

• Messages sent in one superstep are delivered at the start of the next superstep
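To illustrate the vertex-centric idea, here is a minimal single-process Python sketch, not Giraph's API. It assumes a toy algorithm (propagating the maximum value along directed edges) and hypothetical names (`propagate_max`, `inbox`, `outbox`). Each vertex only reads its own value and its incoming messages, and messages written in one superstep are read in the next.

```python
def propagate_max(values, neighbors, max_supersteps=50):
    """Spread the largest value along directed edges, one superstep at a time."""
    values = dict(values)
    # Superstep 0: every vertex announces its value to its out-neighbors.
    inbox = {v: [] for v in values}
    for v, value in values.items():
        for n in neighbors[v]:
            inbox[n].append(value)
    supersteps = 1
    # Keep going while any message is still in flight.
    while any(inbox.values()) and supersteps < max_supersteps:
        outbox = {v: [] for v in values}
        for v, messages in inbox.items():  # messages sent during the previous superstep
            if messages and max(messages) > values[v]:
                values[v] = max(messages)
                for n in neighbors[v]:     # only vertices that changed send again
                    outbox[n].append(values[v])
        inbox = outbox                     # delivered at the start of the next superstep
        supersteps += 1
    return values, supersteps


# Hypothetical ring of four vertices: 1 -> 2 -> 3 -> 4 -> 1.
ring_values = {1: 3, 2: 6, 3: 2, 4: 1}
ring_neighbors = {1: [2], 2: [3], 3: [4], 4: [1]}
final_values, used_supersteps = propagate_max(ring_values, ring_neighbors)
print(final_values)
```

The computation stops on its own once no vertex has anything new to say, which is the same intuition behind halting in vertex-centric frameworks.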

Page 24: Apache Giraph: Large-scale graph processing done better

Bulk Synchronous Parallel (BSP)

• Master-Slave architecture

• Batch oriented processing

• Computation happens in-memory

Page 25: Apache Giraph: Large-scale graph processing done better

Advantages

• No locks: message-based communication

• No semaphores: global synchronization

• Iteration isolation: massively parallelizable

Page 26: Apache Giraph: Large-scale graph processing done better

Architecture

Single Map-only Job

Page 27: Apache Giraph: Large-scale graph processing done better

Jobs Schema

Page 28: Apache Giraph: Large-scale graph processing done better

Other things

Aggregators

• Mechanism for global communication and global computation

• Global value calculated in superstep t available in t + 1

• Pre-defined (e.g. sum, max, min) or user-definable functions [3]

Combiners

• User-defined function [3] applied to messages before they are sent or delivered

• Similar to Hadoop combiners

• Save network traffic or memory

Checkpointing

• Store work to disk at user-defined intervals (disk isn’t always evil)

• Restart on failure

[3] The function has to be both commutative and associative
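A rough Python sketch of what combiners and aggregators do, under the assumption that both reduce many values to one with a commutative and associative function. The helpers `combine` and `aggregate` are hypothetical and are not Giraph classes.

```python
from functools import reduce


def combine(messages_by_target, combiner):
    """Collapse all messages addressed to the same vertex into a single message."""
    return {target: reduce(combiner, messages)
            for target, messages in messages_by_target.items() if messages}


def aggregate(vertex_values, aggregator, initial):
    """Fold one contribution per vertex into a single global value."""
    return reduce(aggregator, vertex_values, initial)


# Hypothetical pending messages, keyed by destination vertex.
pending = {"a": [0.5, 1.0], "b": [0.5], "c": []}
combined = combine(pending, lambda x, y: x + y)  # sum combiner
global_max = aggregate([1.0, 2.0, 3.0], max, float("-inf"))
print(combined, global_max)
```

Because addition and max are commutative and associative, the result does not depend on the order in which workers merge partial results, which is why the footnote's requirement matters.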

Page 32: Apache Giraph: Large-scale graph processing done better

LongLongNullTextInputFormat

org.apache.giraph.io.formats.LongLongNullTextInputFormat

If there is an edge from Node 1 to Node 2, then Node 2 appears in the neighbor list of Node 1

<NODE1 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...

<NODE2 ID> <SPACE> <NEIGHBOR1 ID> <SPACE> <NEIGHBOR2 ID> ...

...
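To see what such a file encodes, here is a minimal Python sketch that reads lines in the layout above into an adjacency map. It is not Giraph's parser; the function name `parse_adjacency_lines` and the sample lines are hypothetical, and splitting on any whitespace is an assumption made for convenience.

```python
def parse_adjacency_lines(lines):
    """Map each vertex id to the list of its out-neighbors."""
    graph = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        vertex, *neighbors = (int(field) for field in fields)
        graph[vertex] = neighbors
    return graph


# Hypothetical input: vertex 1 links to 2 and 3, vertex 2 links to 3, vertex 3 has no out-edges.
sample_lines = ["1 2 3", "2 3", "3"]
adjacency = parse_adjacency_lines(sample_lines)
print(adjacency)
```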

Page 33: Apache Giraph: Large-scale graph processing done better

IdWithValueTextOutputFormat

org.apache.giraph.io.formats.IdWithValueTextOutputFormat

For each node print the Node ID and the Node Value

<NODE1 ID> <TAB> <NODE1 VALUE>

<NODE2 ID> <TAB> <NODE2 VALUE>

...
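The agenda lists out-degree and in-degree as the first exercise. The sketch below is plain Python, not the demo code from the repository: it assumes a hypothetical three-vertex graph, computes both degrees, and prints them in the ID, TAB, VALUE layout described above. In-degree is computed the message-passing way: every vertex "sends" 1 to each out-neighbor and receivers add up what they get.

```python
def out_degrees(graph):
    """Out-degree is simply the length of each vertex's neighbor list."""
    return {vertex: len(neighbors) for vertex, neighbors in graph.items()}


def in_degrees(graph):
    """Each vertex sends 1 to every out-neighbor; receivers sum the messages."""
    counts = {vertex: 0 for vertex in graph}
    for neighbors in graph.values():
        for target in neighbors:
            counts[target] = counts.get(target, 0) + 1
    return counts


def format_id_with_value(values):
    """Render one '<id><TAB><value>' line per vertex, sorted by id."""
    return "\n".join(f"{vertex}\t{value}" for vertex, value in sorted(values.items()))


# Hypothetical graph: 1 -> 2, 1 -> 3, 2 -> 3.
toy_graph = {1: [2, 3], 2: [3], 3: []}
print(format_id_with_value(out_degrees(toy_graph)))
print(format_id_with_value(in_degrees(toy_graph)))
```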

Page 34: Apache Giraph: Large-scale graph processing done better

Demo

Demo code

https://github.com/manuelcoppotelli/giraph-demo

Page 36: Apache Giraph: Large-scale graph processing done better

Google’s PageRank [4]

• The success factor of Google’s search engine

• A graph algorithm computing the “importance” of webpages

  ◦ Important pages have a lot of links from other important pages

  ◦ Look at the structure of the underlying network

• Ability to conduct web-scale graph processing

[4] http://ilpubs.stanford.edu:8090/422/1/1999-66.pdf

Page 41: Apache Giraph: Large-scale graph processing done better

Simple PageRank

• Recursive definition

$$\mathrm{PageRank}_{i+1}(v) = \frac{1 - d}{N} + d \cdot \sum_{u \to v} \frac{\mathrm{PageRank}_i(u)}{O(u)}$$

• Where:

  ◦ d: damping factor; the fraction of a page’s PageRank that is transferred to its neighbors. Usually 0.85

  ◦ N: total number of pages

  ◦ O(u): out-degree of u; the total number of links within page u
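The recursive definition can be iterated directly. The following is a minimal Python sketch, not the Giraph implementation: it assumes every page has at least one outgoing link and that all link targets are pages in the graph, it starts every page at 1/N (an assumption, other starting values are possible), and the graph used is hypothetical.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Iterate PageRank_{i+1}(v) = (1 - d)/N + d * sum over u->v of PageRank_i(u)/O(u)."""
    n = len(graph)
    ranks = {page: 1.0 / n for page in graph}  # assumed uniform starting point
    for _ in range(iterations):
        incoming = {page: 0.0 for page in graph}
        for u, out_links in graph.items():
            share = ranks[u] / len(out_links)  # PageRank_i(u) / O(u)
            for v in out_links:
                incoming[v] += share
        ranks = {page: (1 - damping) / n + damping * incoming[page] for page in graph}
    return ranks


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(web)
for page, rank in sorted(ranks.items()):
    print(page, round(rank, 3))
```

In a vertex-centric setting the inner loop becomes message passing: each vertex sends `ranks[u] / len(out_links)` along its out-edges and sums what it receives in the next superstep.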

Page 43: Apache Giraph: Large-scale graph processing done better

Simple PageRank Example

1.0

1.0

1.0

Page 44: Apache Giraph: Large-scale graph processing done better

Simple PageRank Example

1.0

1.0

1.0

0.5

0.5

1

1

Page 45: Apache Giraph: Large-scale graph processing done better

Simple PageRank Example

1 · 0.85 + 0.15/3

0.5 · 0.85 + 0.15/3

1.5 · 0.85 + 0.15/3

0.5

0.5

1

1

Page 46: Apache Giraph: Large-scale graph processing done better

Simple PageRank Example

0.43

0.21

0.64

Page 47: Apache Giraph: Large-scale graph processing done better

JsonLongDoubleFloatDoubleVertexInputFormat

org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat

Express both nodes and edges information using JSON arrays

[<vertex id>, <vertex value>,
  [
    [<dest vertex id>, <edge value>],
    ...
  ]
]

Notice: For more input/output formats visit https://github.com/apache/giraph/tree/trunk/giraph-core/src/main/java/org/apache/giraph/io/formats
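As an illustration of the JSON layout, this short Python sketch decodes one such line with the standard `json` module. It is not Giraph's reader; the function name `parse_json_vertex` and the sample line are hypothetical.

```python
import json


def parse_json_vertex(line):
    """Decode '[id, value, [[dest, edge_value], ...]]' into (id, value, edges)."""
    vertex_id, vertex_value, edges = json.loads(line)
    edge_map = {int(dest): float(weight) for dest, weight in edges}
    return int(vertex_id), float(vertex_value), edge_map


# Hypothetical line: vertex 0 with value 1.0 and edges to vertices 1 and 2.
sample = "[0, 1.0, [[1, 1.0], [2, 0.5]]]"
vertex = parse_json_vertex(sample)
print(vertex)
```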

Page 48: Apache Giraph: Large-scale graph processing done better

Demo

Demo code

https://github.com/manuelcoppotelli/giraph-demo

Page 49: Apache Giraph: Large-scale graph processing done better

Q? & A!

Page 50: Apache Giraph: Large-scale graph processing done better

Thank you for your attention

Contact us with any questions or problems

Demo code

https://github.com/manuelcoppotelli/giraph-demo

Homework

https://github.com/manuelcoppotelli/giraph-homework
