Divide and conquer · 2014. 7. 6. · 1Hungarian Academy of Sciences { Institute for Computer...
Transcript of Divide and conquer · 2014. 7. 6. · 1Hungarian Academy of Sciences { Institute for Computer...
Families of distributed graph algorithmsDivide and conquer
Marton Balassi1
1Hungarian Academy of Sciences – Institute for Computer Science and ControlData Mining & Search Group
June 24, 2014
Families of distributed graph algorithms
2 / 61
Table of contents
Distributing data-intensive algorithmsMotivationMapReduce & PregelCounting the number of triangles in a graph
Families of distributed graph algorithmsLocal algorithmsGraph traversal based algorithmsMatrix multiplication based algorithms
ExperimentsRepresentative algorithmsResults
Families of distributed graph algorithms
Distributing data-intensive algorithms
3 / 61
Table of contents
Distributing data-intensive algorithmsMotivationMapReduce & PregelCounting the number of triangles in a graph
Families of distributed graph algorithmsLocal algorithmsGraph traversal based algorithmsMatrix multiplication based algorithms
ExperimentsRepresentative algorithmsResults
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
4 / 61
A bit about myself
My background
I BSc, MSc in Computer Science, Eotvos University Budapest
I BA in Economics, TU Budapest
I Distributed algorithms
I Big data architecture
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
5 / 61
A bit about myself
My background
I BSc, MSc in Computer Science, Eotvos University Budapest
I BA in Economics, TU Budapest
I Distributed algorithms
I Big data architecture
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
6 / 61
A bit about myself
My background
I BSc, MSc in Computer Science, Eotvos University Budapest
I BA in Economics, TU Budapest
I Distributed algorithms
I Big data architecture
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
7 / 61
A bit about myself
My background
I BSc, MSc in Computer Science, Eotvos University Budapest
I BA in Economics, TU Budapest
I Distributed algorithms
I Big data architecture
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
8 / 61
Motivation
Let’s do a PageRank on this graph. . .
I A large Portugese webcrawl1
I 3.1 · 109 nodes
I 1.1 · 1011 edges
I 80 GB of compressed data
I Divide and conquer is almost mandatory
1a large Portuguese crawl of the Portuguese Web Archive obtained fromDaniel Gomes
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
9 / 61
Motivation
Let’s do a PageRank on this graph. . .
I A large Portugese webcrawl1
I 3.1 · 109 nodes
I 1.1 · 1011 edges
I 80 GB of compressed data
I Divide and conquer is almost mandatory
1a large Portuguese crawl of the Portuguese Web Archive obtained fromDaniel Gomes
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
10 / 61
Motivation
Let’s do a PageRank on this graph. . .
I A large Portugese webcrawl1
I 3.1 · 109 nodes
I 1.1 · 1011 edges
I 80 GB of compressed data
I Divide and conquer is almost mandatory
1a large Portuguese crawl of the Portuguese Web Archive obtained fromDaniel Gomes
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
11 / 61
Motivation
Let’s do a PageRank on this graph. . .
I A large Portugese webcrawl1
I 3.1 · 109 nodes
I 1.1 · 1011 edges
I 80 GB of compressed data
I Divide and conquer is almost mandatory
1a large Portuguese crawl of the Portuguese Web Archive obtained fromDaniel Gomes
Families of distributed graph algorithms
Distributing data-intensive algorithms
Motivation
12 / 61
Motivation
Let’s do a PageRank on this graph. . .
I A large Portugese webcrawl1
I 3.1 · 109 nodes
I 1.1 · 1011 edges
I 80 GB of compressed data
I Divide and conquer is almost mandatory
1a large Portuguese crawl of the Portuguese Web Archive obtained fromDaniel Gomes
Families of distributed graph algorithms
Distributing data-intensive algorithms
MapReduce & Pregel
13 / 61
MapReduce [DG04]
Families of distributed graph algorithms
Distributing data-intensive algorithms
MapReduce & Pregel
14 / 61
Pregel [MAB+10]
Traits
I Bulk SynchronousParallel [Val90]
I ,,Think like a vertex”
I Graph kept in memory
Scheme of the BSP systemWikipedia, public domain
Families of distributed graph algorithms
Distributing data-intensive algorithms
MapReduce & Pregel
15 / 61
Pregel [MAB+10]
Traits
I Bulk SynchronousParallel [Val90]
I ,,Think like a vertex”
I Graph kept in memory
Scheme of the BSP systemWikipedia, public domain
Families of distributed graph algorithms
Distributing data-intensive algorithms
MapReduce & Pregel
16 / 61
Pregel [MAB+10]
Traits
I Bulk SynchronousParallel [Val90]
I ,,Think like a vertex”
I Graph kept in memory
Scheme of the BSP systemWikipedia, public domain
Families of distributed graph algorithms
Distributing data-intensive algorithms
MapReduce & Pregel
17 / 61
Pregel [MAB+10]
Vertex. . .
In1
Inn
. . .
Out1
Outm
t
t
t
t − 1
t − 1
t − 1
Pregel schema as perceived from a vertex
Families of distributed graph algorithms
Distributing data-intensive algorithms
Counting the number of triangles in a graph
18 / 61
Triangle Counter – Sequential algorithm
Sequential algorithm
Every vertex executes a search of itself bounded in depth of three.Thus every triangle is counted three times.
Families of distributed graph algorithms
Distributing data-intensive algorithms
Counting the number of triangles in a graph
19 / 61
Triangle Counter – Sequential algorithm
Sequential algorithm
Every vertex executes a search of itself bounded in depth of three.Thus every triangle is counted three times.You can do better by making use of the ordering on the vertices.
Families of distributed graph algorithms
Distributing data-intensive algorithms
Counting the number of triangles in a graph
20 / 61
Triangle Counter – distributed algorithm
Representation
0 1 21 22 03
0 1
23
Families of distributed graph algorithms
Distributing data-intensive algorithms
Counting the number of triangles in a graph
21 / 61
Triangle Counter – distributed algorithm
First Map
Let’s send our ID to all of ourneighbours possessing a higher IDthan ours. Let’s send our neighboursto ourselves.
First Reduce
Let’s write out the informationreceived.
0 1
2
0
1
Families of distributed graph algorithms
Distributing data-intensive algorithms
Counting the number of triangles in a graph
22 / 61
Triangle Counter – distributed algorithm
Second Map
If the ID received is smaller thenours let’s pass it on to ourneighbours.Let’s send our neighbours toourselves.
Second Reduce
If the ID received is our neighbourthen let’s increment a globalcounter.
0 [] 1 [0]
2 [1]
1
Families of distributed graph algorithms
Distributing data-intensive algorithms
Counting the number of triangles in a graph
23 / 61
Triangle Counter – distributed algorithm
Second Map
If the ID received is smaller thenours let’s pass it on to ourneighbours.Let’s send our neighbours toourselves.
Second Reduce
If the ID received is our neighbourthen let’s increment a globalcounter.
0 + + 1
2
Families of distributed graph algorithms
Families of distributed graph algorithms
24 / 61
Table of contents
Distributing data-intensive algorithmsMotivationMapReduce & PregelCounting the number of triangles in a graph
Families of distributed graph algorithmsLocal algorithmsGraph traversal based algorithmsMatrix multiplication based algorithms
ExperimentsRepresentative algorithmsResults
Families of distributed graph algorithms
Families of distributed graph algorithms
Local algorithms
25 / 61
Local algorithms
Traits
I Dependant on a small environment of the given vertex or edge.
I ,,Trivial” candidates for parallel computing.
I Examples are fingerprint computation, local clusteringcoefficient and the number of triangles.
Families of distributed graph algorithms
Families of distributed graph algorithms
Local algorithms
26 / 61
Local algorithms
Traits
I Dependant on a small environment of the given vertex or edge.
I ,,Trivial” candidates for parallel computing.
I Examples are fingerprint computation, local clusteringcoefficient and the number of triangles.
Families of distributed graph algorithms
Families of distributed graph algorithms
Local algorithms
27 / 61
Local algorithms
Traits
I Dependant on a small environment of the given vertex or edge.
I ,,Trivial” candidates for parallel computing.
I Examples are fingerprint computation, local clusteringcoefficient and the number of triangles.
Families of distributed graph algorithms
Families of distributed graph algorithms
Graph traversal based algorithms
28 / 61
Graph traversal based algorithms
Traits
I Dependant on taking long routes in the graph.
I Difficult to implement in a distributed environment.
I The distributed algorithm can be less effective than thesequential as the representation is less powerful.
I Examples could be accessibility, betweenness centrality andstrongly connected components.
Families of distributed graph algorithms
Families of distributed graph algorithms
Graph traversal based algorithms
29 / 61
Graph traversal based algorithms
Traits
I Dependant on taking long routes in the graph.
I Difficult to implement in a distributed environment.
I The distributed algorithm can be less effective than thesequential as the representation is less powerful.
I Examples could be accessibility, betweenness centrality andstrongly connected components.
Families of distributed graph algorithms
Families of distributed graph algorithms
Graph traversal based algorithms
30 / 61
Graph traversal based algorithms
Traits
I Dependant on taking long routes in the graph.
I Difficult to implement in a distributed environment.
I The distributed algorithm can be less effective than thesequential as the representation is less powerful.
I Examples could be accessibility, betweenness centrality andstrongly connected components.
Families of distributed graph algorithms
Families of distributed graph algorithms
Graph traversal based algorithms
31 / 61
Graph traversal based algorithms
Traits
I Dependant on taking long routes in the graph.
I Difficult to implement in a distributed environment.
I The distributed algorithm can be less effective than thesequential as the representation is less powerful.
I Examples could be accessibility, betweenness centrality andstrongly connected components.
Families of distributed graph algorithms
Families of distributed graph algorithms
Matrix multiplication based algorithms
32 / 61
Matrix multiplication based algorithms
Traits
I Are basically power iterations of matrix-vector multiplicationswith typically fast convergence traits.
I Suitable to implement in Pregel-based systems and not thatstraight-forward in plain MapReduce. [KBEK14]
I Representatives could be eigenvalue centrality, PageRank orLineRank.
Families of distributed graph algorithms
Families of distributed graph algorithms
Matrix multiplication based algorithms
33 / 61
Matrix multiplication based algorithms
Traits
I Are basically power iterations of matrix-vector multiplicationswith typically fast convergence traits.
I Suitable to implement in Pregel-based systems and not thatstraight-forward in plain MapReduce. [KBEK14]
I Representatives could be eigenvalue centrality, PageRank orLineRank.
Families of distributed graph algorithms
Families of distributed graph algorithms
Matrix multiplication based algorithms
34 / 61
Matrix multiplication based algorithms
Traits
I Are basically power iterations of matrix-vector multiplicationswith typically fast convergence traits.
I Suitable to implement in Pregel-based systems and not thatstraight-forward in plain MapReduce. [KBEK14]
I Representatives could be eigenvalue centrality, PageRank orLineRank.
Families of distributed graph algorithms
Experiments
35 / 61
Table of contents
Distributing data-intensive algorithmsMotivationMapReduce & PregelCounting the number of triangles in a graph
Families of distributed graph algorithmsLocal algorithmsGraph traversal based algorithmsMatrix multiplication based algorithms
ExperimentsRepresentative algorithmsResults
Families of distributed graph algorithms
Experiments
Representative algorithms
36 / 61
Representative algorithms
Local
Graph traversal based
Matrix multiplication based
Families of distributed graph algorithms
Experiments
Representative algorithms
37 / 61
Representative algorithms
Local
TriangleCounter already presented . . .
Graph traversal based
Matrix multiplication based
Families of distributed graph algorithms
Experiments
Representative algorithms
38 / 61
Representative algorithms
Local
Graph traversal based
LineRank Construct the linegraph, where vertices represent theedges of the original graph and a directed edge pointsfrom e1 to e2 if the target of e1 is the source of e2.
Matrix multiplication based
Families of distributed graph algorithms
Experiments
Representative algorithms
39 / 61
Representative algorithms
Local
Graph traversal based
LineRank Construct the linegraph, where vertices represent theedges of the original graph and a directed edge pointsfrom e1 to e2 if the target of e1 is the source of e2.Run a PageRank on this. [KPST11, PBMW98]
Matrix multiplication based
Families of distributed graph algorithms
Experiments
Representative algorithms
40 / 61
Representative algorithms
Local
Graph traversal based
Matrix multiplication based
SCC Run a label propagation to detect connected components.Remove edges with different labels, then do a reverse labelpropagation. [MIHPR05]
Families of distributed graph algorithms
Experiments
Representative algorithms
41 / 61
Representative algorithms
Local
Graph traversal based
Matrix multiplication based
SCC Run a label propagation to detect connected components.Remove edges with different labels, then do a reverse labelpropagation. [MIHPR05] Remove the sccs found and iterate.
Families of distributed graph algorithms
Experiments
Representative algorithms
42 / 61
Representative algorithms
Local
Graph traversal based
Matrix multiplication based
SCC Run a label propagation to detect connected components.Remove edges with different labels, then do a reverse labelpropagation. [MIHPR05] Remove the sccs found and iterate.The sequential algorithm is more efficient. [BCVDP11]
Families of distributed graph algorithms
Experiments
Results
43 / 61
Results [EBKK14]
Name Vertices Edges Source
web-Google 8, 7 · 105 5, 0 · 107 [LLDM08]wiki-Talk 2, 4 · 106 5, 1 · 107 [LHK10]soc-LiveJournal 4, 8 · 106 6, 9 · 108 [BHKL06]forest10M 107 2, 4 · 108 generated2
forest20M 2 · 107 4, 8 · 108 generated2
Summary of the graphs used
2A slightly modified version of the ForestFire model. [LF06, EBKK14]
Families of distributed graph algorithms
Experiments
Results
44 / 61
Results [EBKK14]
Implementations
sequential Plain Java.
MapReduce Apache Hadoop on 20 cores, 40 reducers.
Pregel Apache Giraph on 17 cores, 3 occupied by Zookeeper.
Families of distributed graph algorithms
Experiments
Results
45 / 61
Results [EBKK14]
Implementations
sequential Plain Java.
MapReduce Apache Hadoop on 20 cores, 40 reducers.
Pregel Apache Giraph on 17 cores, 3 occupied by Zookeeper.
Families of distributed graph algorithms
Experiments
Results
46 / 61
Results [EBKK14]
Implementations
sequential Plain Java.
MapReduce Apache Hadoop on 20 cores, 40 reducers.
Pregel Apache Giraph on 17 cores, 3 occupied by Zookeeper.
Families of distributed graph algorithms
Experiments
Results
47 / 61
Results [EBKK14]
Implementations
sequential Plain Java.
MapReduce Apache Hadoop on 20 cores, 40 reducers.
Pregel Apache Giraph on 17 cores, 3 occupied by Zookeeper.
Families of distributed graph algorithms
Experiments
Results
48 / 61
Results [EBKK14]
Families of distributed graph algorithms
Experiments
Results
49 / 61
Results [EBKK14]
Families of distributed graph algorithms
Experiments
Results
50 / 61
Results [EBKK14]
Families of distributed graph algorithms
Experiments
Results
51 / 61
Summary
Take home messages
I If your data is ,,big” you have to go multi-core/machine
I If multi-machine, then try MapReduce
I For distributed graph algorithms it is useful to distinguishthree families
I The families behave differently
I It is worth to distribute even for data not that ,,big”
Families of distributed graph algorithms
Experiments
Results
52 / 61
Summary
Take home messages
I If your data is ,,big” you have to go multi-core/machine
I If multi-machine, then try MapReduce
I For distributed graph algorithms it is useful to distinguishthree families
I The families behave differently
I It is worth to distribute even for data not that ,,big”
Families of distributed graph algorithms
Experiments
Results
53 / 61
Summary
Take home messages
I If your data is ,,big” you have to go multi-core/machine
I If multi-machine, then try MapReduce
I For distributed graph algorithms it is useful to distinguishthree families
I The families behave differently
I It is worth to distribute even for data not that ,,big”
Families of distributed graph algorithms
Experiments
Results
54 / 61
Summary
Take home messages
I If your data is ,,big” you have to go multi-core/machine
I If multi-machine, then try MapReduce
I For distributed graph algorithms it is useful to distinguishthree families
I The families behave differently
I It is worth to distribute even for data not that ,,big”
Families of distributed graph algorithms
Experiments
Results
55 / 61
Summary
Take home messages
I If your data is ,,big” you have to go multi-core/machine
I If multi-machine, then try MapReduce
I For distributed graph algorithms it is useful to distinguishthree families
I The families behave differently
I It is worth to distribute even for data not that ,,big”
Families of distributed graph algorithms
Experiments
Results
56 / 61
Marton [email protected]
Hungarian Academy of Sciences – Institute for Computer Science and ControlData Mining & Search Group
Families of distributed graph algorithms
Experiments
Results
57 / 61
Jirı Barnat, Jakub Chaloupka, and Jaco Van De Pol.Distributed algorithms for scc decomposition.J. Log. and Comput., 21(1):23–44, February 2011.
Lars Backstrom, Dan Huttenlocher, Jon Kleinberg, andXiangyang Lan.Group formation in large social networks: Membership,growth, and evolution.In Proceedings of the 12th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, pages44–54, 2006.
Jeffrey Dean and Sanjay Ghemawat.Mapreduce: Simplified data processing on large clusters.In OSDI’04: Sixth Symposium on Operating System Designand Implementation, 2004.
Families of distributed graph algorithms
Experiments
Results
58 / 61
Peter Englert, Marton Balassi, Balazs Kosa, and Attila Kiss.Efficiency issues of computing graph properties of socialnetworks.In Proceedings of the 9th International Conference on AppliedInformatics, page to appear, 2014.
Balazs Kosa, Marton Balassi, Peter Englert, and Attila Kiss.Betweenness versus linerank, 2014.sent to the 6th International Conference on ComputationalCollective Intelligence Technologies and Applications,download from: http://people.inf.elte.hu/balhal/
publications/BetweennessVsLinerank.pdf.
Families of distributed graph algorithms
Experiments
Results
59 / 61
U Kang, Spiros Papadimitriou, Jimeng Sun, and HanghangTong.Centralities in large networks: Algorithms and observations.In SDM, pages 119–130. SIAM / Omnipress, 2011.
Jure Leskovec and Christos Faloutsos.Sampling from large graphs.In Proceedings of the 12th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining, KDD’06, pages 631–636, 2006.
Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg.Signed networks in social media.In Proceedings of the SIGCHI Conference on Human Factors inComputing Systems, pages 1361–1370, 2010.
Families of distributed graph algorithms
Experiments
Results
60 / 61
Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, andMichael W. Mahoney.Community structure in large networks: Natural cluster sizesand the absence of large well-defined clusters, 2008.cite arxiv:0810.1355Comment: 66 pages, a much expandedversion of our WWW 2008 paper.
Grzegorz Malewicz, Matthew H. Austern, Aart J.C Bik,James C. Dehnert, Ilan Horn, Naty Leiser, and GrzegorzCzajkowski.Pregel: a system for large-scale graph processing.In Proceedings of the 2010 ACM SIGMOD InternationalConference on Management of data, SIGMOD ’10, pages135–146, 2010.
Families of distributed graph algorithms
Experiments
Results
61 / 61
William McLendon III, Bruce Hendrickson, Steven J.Plimpton, and Lawrence Rauchwerger.Finding strongly connected components in distributed graphs.Journal of Parallel and Distributed Computing, 65(8):901–910,August 2005.
L. Page, S. Brin, R. Motwani, and T. Winograd.The pagerank citation ranking: Bringing order to the web.In Proceedings of the 7th International World Wide WebConference, pages 161–172, 1998.
Leslie G. Valiant.A bridging model for parallel computation.Commun. ACM, 33(8):103–111, August 1990.