Rabbit Order: Just-in-time Reordering for Fast Graph Analysis

Copyright© 2016 NTT Corp. All Rights Reserved.

Rabbit Order: Just-in-time Parallel Reordering for Fast Graph Analysis

Junya Arai Nippon Telegraph and Telephone Corp. (NTT)

Hiroaki Shiokawa Univ. of Tsukuba

Takeshi Yamamuro NTT

Makoto Onizuka Osaka Univ.

Sotetsu Iwamura NTT

J. Arai+, "Rabbit Order: Just-in-time Reordering for Fast Graph Analysis," IPDPS'16. Copyright© 2016 NTT Corp. All Rights Reserved. 2

Summary

• Vertex reordering has been used to improve locality of graph processing

• However, overheads of reordering tend to increase end-to-end runtime (= reordering + analysis)

• Thus, we propose a fast reordering algorithm, Rabbit Order

• Exploit community structures in real-world graphs

• Up to 3.5x speedup for PageRank

• Including reordering overheads!

• Also effective for various other graph analyses

Graph analysis

To deal with large-scale graphs, the performance of various analysis algorithms needs to be improved

Real-world graphs are large-scale

• Web graphs: over 50B pages*1

• Social graphs: 1B users, 200B friendships*1

Analysis algorithms are various

• Community detection

• Ranking (e.g., PageRank)

• Shortest path

• Diameter

• Connected components

• k-core decomposition

• ...

*1: Andrew+, “Parallel Graph Analytics,” CACM, 59(5), ‘16

Poor locality

• Poor locality in memory accesses is a problem common to various analysis algorithms

Poor locality causes ...

• Frequent cache misses

• Frequent inter-core communications

• Memory bandwidth saturation

• Simultaneous memory access from cores

Poor scalability

Memory access example

• PageRank

until convergence do

    for each vertex v do

        $s[v] = \sum_{u \in \mathrm{Neighbor}(v)} s[u] / \mathrm{degree}(u)$

Each update accesses the PageRank score s of each neighbor of v.

[Figure: accessed elements in array s (indices 0-7) for an 8-vertex example graph. When v = 0, s[0], s[2], s[4], and s[7] are accessed.]
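
The loop above can be sketched in Python to make the access pattern visible. This is a minimal illustration, not the authors' implementation: it follows the slide's simplified update (no damping factor), and the adjacency lists are a hypothetical 8-vertex graph in which only the neighbors of vertex 0 (2, 4, and 7) are taken from the slide.

```python
def pagerank_pass(neighbors, s):
    """One pass of the slide's loop: s[v] = sum over neighbors u of s[u] / degree(u).

    Returns the new scores and, for each v, the indices of s that were read.
    """
    degree = [len(ns) for ns in neighbors]
    new_s = []
    reads = []
    for v, ns in enumerate(neighbors):
        new_s.append(sum(s[u] / degree[u] for u in ns))
        reads.append(sorted(ns))
    return new_s, reads


# Hypothetical undirected example graph (adjacency lists).
# Only the neighbors of vertex 0 are taken from the slide.
neighbors = [
    [2, 4, 7],  # 0
    [3, 6],     # 1
    [0, 4, 5],  # 2
    [1, 6],     # 3
    [0, 2, 6],  # 4
    [2, 7],     # 5
    [1, 3, 4],  # 6
    [0, 5],     # 7
]
scores = [1.0 / len(neighbors)] * len(neighbors)
scores, reads = pagerank_pass(neighbors, scores)
for v, idx in enumerate(reads):
    print(f"v = {v}: reads s{idx}")
```

Printing the read indices per vertex shows how far apart in the array the accessed scores are when vertex IDs are unrelated to the graph structure.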

Memory access example

• PageRank, over one full pass (v = 0, ..., 7)

[Figure: accessed elements in array s for every v = 0, ..., 7 on the randomly ordered example graph]

• Poor spatial locality

• Poor temporal locality

Reordering

• Preprocessing that optimizes the vertex ordering (ID numbering)

• No changes to the analysis algorithms or their implementations are required

• Improve locality by co-locating neighboring vertices in memory

• Existing algorithms: RCM, LLP, Nested Dissection, ...

[Figure: the same example graph under a random ordering and under a high-locality ordering]
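
As a rough illustration of what reordering changes, the sketch below relabels an edge list with a permutation and compares the average distance between the IDs of adjacent vertices. The edge list, the permutation `new_id`, and the `average_id_gap` metric are all assumptions made for this example; the metric is only a crude proxy for locality.

```python
def relabel(edges, new_id):
    """Apply a vertex permutation (old ID -> new ID) to an edge list."""
    return [(new_id[u], new_id[v]) for u, v in edges]


def average_id_gap(edges):
    """Average |u - v| over all edges; smaller means neighbors sit closer in memory."""
    return sum(abs(u - v) for u, v in edges) / len(edges)


# Hypothetical 8-vertex graph with arbitrary ("random") vertex IDs.
edges = [(0, 2), (0, 4), (0, 7), (2, 4), (2, 5),
         (5, 7), (1, 3), (1, 6), (3, 6), (4, 6)]

# Hypothetical high-locality permutation: old ID -> new ID.
new_id = {5: 0, 7: 1, 0: 2, 2: 3, 4: 4, 1: 5, 3: 6, 6: 7}

before = average_id_gap(edges)
after = average_id_gap(relabel(edges, new_id))
print(before, after)
```

For this particular input the average gap drops from 3.2 to 1.6. The analysis code itself is untouched; only the IDs change.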

On a reordered graph

• PageRank, over one full pass (v = 0, ..., 7)

[Figure: accessed elements in array s for every v = 0, ..., 7 on the reordered example graph]

• High spatial locality

• High temporal locality

Problem in reordering

• Reordering tends to increase end-to-end runtime

• end-to-end = reordering + analysis (e.g., PageRank)

• ‘Speedup’ by ahead-of-time reordering

• Reordering: slow

• Analysis: fast

• Reorder again when the graph is modified

[Figure: the randomly ordered example graph is reordered, then analyzed to obtain the result]

[Figure: timeline comparison. Without reordering: analysis only. With reordering: reordering followed by analysis. The total time with reordering is longer: slowdown!]

Our contribution: Rabbit Order

• Reordering algorithm to reduce end-to-end runtime

• Speedup by just-in-time reordering

[Figure: timeline comparison. Without reordering: analysis only. With reordering: reordering followed by analysis. With Rabbit Order: reordering followed by analysis, with a shorter total time.]

Rabbit Order: fast reordering & high locality

Rabbit Order

Two main techniques

1. Hierarchical community-based ordering

• For high locality

2. Parallel incremental aggregation

• For fast reordering

Community-based ordering

• Community: a group of densely connected vertices

• Common in real-world graphs (e.g., web, social, ...)

• Co-locate vertices within each community in memory (cf. [Prat-Perez ‘11] [Boldi+ ‘11])

[Figure: accessed elements in array s for v = 0, ..., 7 on the reordered example graph. Community 1 consists of vertices 0-4 and community 2 of vertices 5-7.]
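
A minimal sketch of community-based ordering, assuming the communities are already known: vertices are simply numbered community by community. The function name `community_order` and the `community_of` labels are hypothetical, and this is not how Rabbit Order itself derives the ordering.

```python
def community_order(community_of):
    """Return new_id, where new_id[v] is the new ID of vertex v.

    Vertices are numbered community by community, so the members of
    one community occupy a contiguous ID range.
    """
    order = sorted(range(len(community_of)), key=lambda v: (community_of[v], v))
    new_id = [0] * len(order)
    for rank, v in enumerate(order):
        new_id[v] = rank
    return new_id


# Hypothetical community labels for 8 vertices (index = old vertex ID).
community_of = [1, 2, 1, 2, 1, 1, 2, 1]
new_id = community_order(community_of)
print(new_id)
```

With these labels, the five members of community 1 receive IDs 0-4 and the three members of community 2 receive IDs 5-7.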

[Figure: a real social network (http://snap.stanford.edu/data/egonets-Facebook.html) and the accessed elements in array s for v = 0, 1, 2, ...]

Hierarchy of communities

• A community contains inner nested communities

• e.g., social network of students

• Hierarchy of schools, grades, and classes

[Figure: a real social network, http://snap.stanford.edu/data/egonets-Facebook.html]

Hierarchical community-based ordering

• Hierarchical community co-location for further locality

• Recursively co-locate vertices within each inner community

• Inner communities produce denser blocks and thus higher locality

[Figure: accessed elements in array s for v = 0, 1, 2, ... on a real social network (http://snap.stanford.edu/data/egonets-Facebook.html); an inner community is marked as a denser block]

How can we obtain hierarchical communities?

Reordering time must be short!

Two main techniques

1. Hierarchical community-based ordering

• For high locality

2. Parallel incremental aggregation

• For fast reordering

Incremental aggregation [Shiokawa+ '13]

• Extract hierarchical communities by merging vertex pairs

• Fast since it rapidly coarsens the graph, but sequential

• Each vertex is merged into the neighbor that most improves modularity Q [Newman+ ‘04]

Example for vertex 0, whose neighbors are vertices 2, 4, and 7:

ΔQ(v0, v2) = 0.052 (highlighted on the slide; the largest gain)

ΔQ(v0, v4) = 0.031

ΔQ(v0, v7) = 0.042

Gain of modularity for merging vertices u and v:

$\Delta Q(u, v) = 2 \left( \frac{w_{uv}}{2m} - \frac{\deg(u)\,\deg(v)}{(2m)^2} \right)$

w_uv: edge weight between vertices u and v

m: total number of edges in the graph

[Figure: the example graph is coarsened step by step by merging vertices, yielding two communities, {2, 0, 7, 5, 4} and {1, 3, 6}]
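
The merge rule can be sketched as a short sequential routine. This is an illustrative simplification rather than the algorithm of [Shiokawa+ '13]: the visiting order (increasing degree), the `parent` array used to record merges, and the two-triangle test graph are assumptions made for this example.

```python
def delta_q(w_uv, deg_u, deg_v, m):
    """Modularity gain of merging u and v (formula from the slide)."""
    return 2.0 * (w_uv / (2.0 * m) - (deg_u * deg_v) / (2.0 * m) ** 2)


def incremental_aggregation(n, edges):
    """Sequentially merge each vertex into its best neighbor.

    Returns parent, where parent[v] is the vertex that v was merged into
    (or v itself if v remains the root of its community).
    """
    m = len(edges)
    adj = [dict() for _ in range(n)]       # adj[v][u] = edge weight
    for u, v in edges:
        adj[u][v] = adj[u].get(v, 0) + 1
        adj[v][u] = adj[v].get(u, 0) + 1
    deg = [sum(a.values()) for a in adj]   # weighted degree
    parent = list(range(n))

    # Assumption: visit vertices in increasing order of initial degree.
    for v in sorted(range(n), key=lambda x: deg[x]):
        if not adj[v]:
            continue
        u, gain = max(
            ((x, delta_q(w, deg[v], deg[x], m)) for x, w in adj[v].items()),
            key=lambda p: p[1],
        )
        if gain <= 0:
            continue                        # v stays a community root
        # Merge v into u: reattach v's edges to u, then remove v.
        for x, w in adj[v].items():
            del adj[x][v]
            if x != u:
                adj[u][x] = adj[u].get(x, 0) + w
                adj[x][u] = adj[x].get(u, 0) + w
        adj[v] = {}
        deg[u] += deg[v]
        parent[v] = u
    return parent


def root(parent, v):
    """Follow merge records up to the community root."""
    while parent[v] != v:
        v = parent[v]
    return v


# Hypothetical input: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
parent = incremental_aggregation(6, edges)
roots = [root(parent, v) for v in range(6)]
print(roots)
```

On this input the two triangles end up as two communities. The `parent` records form a tree over the vertices, which is the kind of merge hierarchy a dendrogram captures.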

Parallelization issues

• Naive per-vertex parallelization causes conflicts

• Mutex: large overheads

• Fine-grained locking (per vertex) is required

• Atomic operations: the operands are too small (16 bytes on x86-64)

• Vertices cannot be merged atomically, because a merge reattaches edges and removes one of the vertices

[Figure: Thread 1 and Thread 2 merging vertices of a 7-vertex example graph (vertices 0-6) at the same time]

Solution: lazy aggregation (1/2)

• Lightweight concurrency control by atomic operations

• Delay merges until the merged vertex is required, to reduce the size of the data to be atomically modified

• Just register vertices as community members

• This can be performed using compare-and-swap by storing the members in a singly-linked list

• All the members are virtually treated as vertex 1

[Figure: Thread 1 and Thread 2 register vertices of the 7-vertex example graph as members of the community represented by vertex 1]

Solution: lazy aggregation (2/2)

Which vertex should vertex 1 be merged to?

• Actually merge the members

• Only one thread is assigned to each vertex, and so it can merge the members without conflicts

• Compute ΔQ

[Figure: the thread assigned to vertex 1 merges the registered members and then computes ΔQ for the neighbors of the merged vertex]
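
The two phases of lazy aggregation can be sketched as follows. Python has no hardware compare-and-swap, so the `AtomicRef` class below emulates one with a lock; it and the names `Member`, `register`, and `drain` are hypothetical helpers for this illustration, not part of the Rabbit Order implementation.

```python
import threading


class AtomicRef:
    """Emulated atomic reference (a real implementation would use a hardware CAS)."""

    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Set the value to new only if it is still expected; report success."""
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False


class Member:
    """Node of the singly-linked list of community members."""

    def __init__(self, vertex, next_member):
        self.vertex = vertex
        self.next = next_member


def register(head, vertex):
    """Phase 1: register vertex as a community member without merging it."""
    while True:
        old = head.load()
        if head.compare_and_swap(old, Member(vertex, old)):
            return


def drain(head):
    """Phase 2: detach the whole member list so its owner can merge the members."""
    while True:
        old = head.load()
        if head.compare_and_swap(old, None):
            break
    members = []
    while old is not None:
        members.append(old.vertex)
        old = old.next
    return members


# Three threads register vertices 0, 2, and 3 as members of vertex 1's community.
community_of_vertex_1 = AtomicRef()
threads = [
    threading.Thread(target=register, args=(community_of_vertex_1, v))
    for v in (0, 2, 3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
merged = sorted(drain(community_of_vertex_1))
print(merged)
```

Each registration modifies only a single pointer, which is why it fits in one atomic operation. The expensive edge reattachment happens later, in the single thread that owns the vertex.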

Communities to ordering in a hierarchical community-based manner

• Construct a dendrogram while extracting communities

• Reorder vertices in DFS visit order on the dendrogram

• Vertices in each inner community are recursively co-located

[Figure: dendrogram of the example graph with community 1, community 2, and an inner community. A DFS over the dendrogram assigns the new IDs.]

Original ID:   1 3 6 | 5 7 0 2 4

New ordering:  5 6 7 | 0 1 2 3 4

(left group: community 2; right group: community 1)
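
The DFS step can be sketched with a dendrogram written as nested lists. The leaf order and the resulting IDs follow the slide, but the exact nesting, including which vertices form the inner community, is an assumption made for this example.

```python
def dfs_order(dendrogram):
    """Return {original_id: new_id}, numbering leaves in DFS visit order."""
    new_id = {}
    stack = [dendrogram]
    while stack:
        node = stack.pop()
        if isinstance(node, int):
            new_id[node] = len(new_id)      # leaf: assign the next free ID
        else:
            stack.extend(reversed(node))    # visit children left to right
    return new_id


# Hypothetical nesting: community 1 = {5, 7, 0, 2, 4} with an inner
# community {0, 2, 4}; community 2 = {1, 3, 6}.
dendrogram = [[5, 7, [0, 2, 4]], [1, 3, 6]]
new_id = dfs_order(dendrogram)
print(new_id)
```

Because a DFS finishes one subtree before entering the next, every community and every inner community ends up with a contiguous range of new IDs.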

Evaluation

Setup

• Xeon E5-2697 v2, 12 cores x 2 sockets, 256 GB RAM

• Reordering methods for comparison

Name     Method                                                 Parallelism
Slash    SlashBurn [Lim+ TKDE’14]                               Sequential
BFS      Unordered parallel BFS [Karantasis+ SC’14]             Parallel
RCM      Unordered parallel RCM [Karantasis+ SC’14]             Parallel
ND       Multithreaded Nested Dissection [LaSalle+ IPDPS’13]    Parallel
LLP      Layered Label Propagation [Boldi+ WWW’11]              Parallel
Shingle  The shingle ordering [Chierichetti+ KDD’09]            Parallel
Degree   Ascending order of degree                              Parallel
Random   Random ordering (baseline)                             -

• Graphs

Graph     V       E
berkstan  0.7M    7.6M
enwiki    4.2M    101.4M
ljournal  4.8M    69.0M
uk-2002   18.5M   298.1M
road-usa  23.9M   57.7M
uk-2005   39.5M   936.4M
it-2004   41.3M   1.2B
twitter   41.7M   1.5B
sk-2005   50.6M   1.9B
webbase   118.1M  1.0B

End-to-end PageRank speedup

Rabbit Order yields up to 3.5x (avg. 2.2x) speedup

The other methods degrade performance in most cases

• $\text{End-to-end speedup} = \frac{\text{PageRank runtime with random ordering}}{\text{Reordering runtime} + \text{PageRank runtime}}$

• Reordering methods and PageRank are run with 48 threads using HyperThreading
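
A small worked example of the speedup metric defined above. The runtimes are hypothetical numbers chosen for illustration, not measurements from the evaluation.

```python
def end_to_end_speedup(t_random, t_reorder, t_analysis):
    """PageRank runtime on a random ordering divided by reordering + PageRank runtime."""
    return t_random / (t_reorder + t_analysis)


# Hypothetical runtimes in seconds.
fast_reordering = end_to_end_speedup(t_random=100.0, t_reorder=10.0, t_analysis=30.0)
slow_reordering = end_to_end_speedup(t_random=100.0, t_reorder=200.0, t_analysis=25.0)
print(fast_reordering, slow_reordering)
```

In the second case the analysis itself is faster than in the first, yet the end-to-end result is a slowdown (a value below 1) because the reordering time dominates.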

[Figure: end-to-end PageRank speedup (vertical axis from 0 to 3.5; values above 1 are speedups, values below 1 are slowdowns) on berkstan, enwiki, ljournal, uk-2002, road-usa, uk-2005, it-2004, twitter, sk-2005, and webbase for Rabbit, Slash, BFS, RCM, ND, LLP, Shingle, and Degree. The best speedup is 3.5x.]

Breakdown of PageRank runtime

• Rabbit Order achieves fast reordering and high locality at the same time

• It reorders a 1.2B-edge graph in about 12 sec.

[Figure: runtime in seconds (horizontal axis from 0 to 2500), broken down into reordering and PageRank, on it-2004 for Random, Degree, Shingle, LLP, ND, RCM, BFS, Slash, and Rabbit]

Cache misses during PageRank

• Competitive with the best state-of-the-art algorithms

[Figure: number of L1, L2, and L3 cache misses during PageRank on it-2004 (vertical axis from 0 to 1.8E+11) for Rabbit, Slash, BFS, RCM, ND, LLP, Shingle, Degree, and Random]

Effectiveness for other analyses

• Rabbit Order is effective for various analysis algorithms

• Efficiency is affected by computational cost of analyses

• It is difficult to amortize the reordering time over a short analysis time (e.g., that of DFS and BFS)

[Figure: average end-to-end speedup for the 10 graphs (vertical axis from 0 to 3.5; values above 1 are speedups, values below 1 are slowdowns) for DFS, BFS, connected components, graph diameter, and k-core decomposition, comparing Rabbit, Slash, BFS, RCM, ND, LLP, Shingle, and Degree. The analyses are grouped by analysis time: 1-10 sec and 10-100 sec.]

Scalability of reordering

• Highest scalability with respect to the number of threads

• Plenty of parallelism in incremental aggregation

• Lightweight concurrency control (lazy aggregation)

[Figure: reordering time scalability, shown as the average speedup vs. 1 thread (vertical axis from 0 to 20) with 12 threads, 24 threads, and 48 threads (HT), for Rabbit, BFS, RCM, ND, LLP, Shingle, and Degree]

Conclusion

• Reordering improves locality of graph analysis

• But existing algorithms tend to increase end-to-end runtime

• Rabbit Order reduces the end-to-end runtime by two main techniques:

1. Hierarchical community-based ordering for high locality

2. Parallel incremental aggregation for fast reordering

• Up to 3.5x speedup for PageRank

• Also effective for various analysis algorithms

Implementation available https://git.io/rabbit

(for evaluation purposes only)