Transcript of: Architecture-Aware Graph (Re)Partitioning (people.cs.pitt.edu/~anz28/papers/proposal.slides.pdf)
Architecture-Aware Graph (Re)Partitioning
Thesis Proposal Defense
Angen Zheng
Committee:
▪ Alexandros Labrinidis, Dept. of Computer Science, Pitt (Advisor)
▪ Panos K. Chrysanthis, Dept. of Computer Science, Pitt (Co-advisor)
▪ Jack Lange, Dept. of Computer Science, Pitt
▪ Peyman Givi, Swanson School of Engineering, Pitt
▪ Patrick Pisciuneri, Swanson School of Engineering, Pitt
1
Graph (Re)Partitioning
❖ Applications of Graph (Re)Partitioning
▪ Scientific Simulations
▪ Distributed Graph Computation
o e.g., Pregel and Giraph
▪ VLSI Design
▪ Task Scheduling
▪ Linear Programming
2
Vertex-Centric BSP Computing Model
★ Vertex
○ a unique identifier
○ a modifiable, user-defined value
★ Edge
○ a modifiable, user-defined value
○ a target vertex identifier

[Figure: vertices, each running a user-defined function (UDF)]

★ Vertex-Centric UDF
○ Change vertex/edge state
○ Send msgs to neighbors
○ Receive msgs from neighbors
○ Mutate the graph topology
○ Deactivate at the end of the superstep
○ Reactivate by external msgs
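The UDF bullets above can be sketched as a tiny BSP driver. This is an illustrative Python sketch, not Giraph's actual API; the dict-based vertex layout, the `send` callback, and the `run_bsp` driver are assumptions. It runs the classic maximum-value-propagation example: each vertex forwards the largest value it has seen, deactivates, and is reactivated by incoming messages.

```python
# Minimal vertex-centric BSP sketch (illustrative; not the Pregel/Giraph API).
def max_value_udf(vertex, messages, send):
    """Example UDF: propagate the maximum value seen so far."""
    new_value = max([vertex["value"]] + messages)
    if new_value > vertex["value"] or vertex["superstep"] == 0:
        vertex["value"] = new_value
        for nbr in vertex["edges"]:
            send(nbr, new_value)
    vertex["active"] = False  # vote to halt; incoming msgs reactivate

def run_bsp(graph, udf, max_supersteps=30):
    inbox = {v: [] for v in graph}
    for step in range(max_supersteps):
        outbox = {v: [] for v in graph}
        progressed = False
        for vid, vertex in graph.items():
            msgs = inbox[vid]
            if step == 0 or msgs:  # messages reactivate a halted vertex
                vertex["superstep"] = step
                udf(vertex, msgs, lambda dst, m: outbox[dst].append(m))
                progressed = True
        inbox = outbox
        if not progressed:  # every vertex halted with no pending messages
            break
    return graph

graph = {
    1: {"value": 3, "edges": [2], "active": True, "superstep": 0},
    2: {"value": 6, "edges": [1, 3], "active": True, "superstep": 0},
    3: {"value": 2, "edges": [2], "active": True, "superstep": 0},
}
run_bsp(graph, max_value_udf)
print([graph[v]["value"] for v in sorted(graph)])  # every vertex ends at 6
```

The superstep barrier is implicit here: `outbox` only becomes visible as `inbox` after all vertices of the step have run.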
3
A Balanced Partitioning = Even Load Distribution

[Figure: a graph split evenly across nodes N1, N2, N3]

Balanced: every partition carries roughly the same load.
4
Minimal Edge-Cut = Minimal Data Comm

[Figure: the same graph partitioned across N1, N2, N3 with few cut edges]

Minimizing Edge-Cut: minimize the number of edges crossing partitions.
5
Minimal Edge-Cut = Minimal Data Comm But Minimal Data Comm ≠ Minimal Comm Cost
Group neighboring vertices as close as possible
The (re)partitioner has to be Architecture-Aware
Figure 1. Pair-Wise Network Bandwidth (J. Xue, BigData'15); std. dev. of pair-wise bandwidth: 416.82 Mb/s, 358.34 Mb/s, 269.71 Mb/s.
6
Overview of the State-of-the-Art
Balanced Graph (Re)Partitioning
❖ Partitioners (static graphs)
▪ Offline Methods (High Quality, Poor Scalability): Metis'95, ICA3PP'08, SoCC'12, TKDE'15, BigData'15
▪ Online Methods (Moderate Quality, High Scalability): DG/LDG'12
❖ Repartitioners (dynamic graphs)
▪ Offline Methods (High Quality, Poor Scalability): Parmetis'97
▪ Online Methods (Moderate~High Quality, High Scalability): CatchW'13, xdgp'13, Hermes'15, Mizan'13
❖ Architecture-Aware: Aragon'14, Paragon'16, Planar'16
7
Architecture-Aware Graph Repartitioning

Given G=(V, E) and an initial partitioning P:
★ Minimizing Communication: weight each cut edge by the network cost between the nodes hosting its endpoints
★ Balancing Load
★ Minimizing Migration: account for the network cost of moving vertices
8
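The three objectives can be made concrete with a small sketch. The function names, the dict-based partitioning, and the example numbers are illustrative assumptions; only the idea of weighting cut edges and moves by a network cost matrix comes from the talk.

```python
def comm_cost(edges, part, cost):
    """Comm cost: each cut edge weighted by the network cost between the
    nodes hosting its two endpoints."""
    return sum(cost[part[u]][part[v]] for u, v in edges if part[u] != part[v])

def imbalance(part, num_parts):
    """Largest partition size relative to the ideal (perfectly even) size."""
    sizes = [sum(1 for p in part.values() if p == i) for i in range(num_parts)]
    return max(sizes) / (len(part) / num_parts)

def migration_cost(old_part, new_part, cost):
    """Cost of moving each vertex from its old node to its new node."""
    return sum(cost[old_part[v]][new_part[v]]
               for v in old_part if old_part[v] != new_part[v])

# Same edge-cut (2 cut edges), different comm cost once the network
# is non-uniform: the N1-N3 link is 6x more expensive in `skewed`.
edges = [(0, 1), (1, 2), (0, 3)]
part = {0: 0, 1: 0, 2: 1, 3: 2}
uniform = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
skewed = [[0, 1, 6], [1, 0, 1], [6, 1, 0]]
print(comm_cost(edges, part, uniform))  # 2
print(comm_cost(edges, part, skewed))   # 7
```

This is exactly why minimal edge-cut does not imply minimal comm cost: the cut is identical in both cases.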
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
9
Aragon: Sequential Architecture-Aware Graph Partition Refinement [BigGraphs’14]
❖ Goal: Group neighboring vertices as close as possible
❖ Input:
o A partitioned graph
o The relative network comm cost matrix
❖ Output:
o A partitioning with neighboring vertices grouped as close as possible.
10
Aragon: Illustrate Aragon through an example

[Figure: example graph partitioned across nodes N1, N2, N3]

Relative network comm cost matrix:
     N1  N2  N3
N1    -   1   6
N2    1   -   1
N3    6   1   -
11
Aragon: Send partitions to a centralized place
● Sending partitions to one place
12
Aragon: Refine each partition pair sequentially
● Sending partitions to one place
● Refining two partitions at a time
13
Aragon: Move vertices between N1 and N2?
● Sending partitions to one place
● Refining (N1, N2)
14
Aragon: Compute initial gain for vertices of N1 & N2
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
15
Aragon: Compute initial gain for vertex a
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
16
Aragon: How does the movement affect comm(N1, N2)?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
17
Aragon: How does the movement affect comm(N1, N3)?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
■ g_topo(a) = 1*(6-1) = 5
18
Aragon: What is the migration cost of vertex a?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
■ g_topo(a) = 1*(6-1) = 5
■ g_mig(a) = 1*1 = 1
19
Aragon: What's the initial gain of moving a?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
■ g_topo(a) = 1*(6-1) = 5
■ g_mig(a) = 1*1 = 1
■ g(a) = -1 + 5 - 1 = 3
20
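The worked example above (g(a) = g_std + g_topo - g_mig) can be sketched as follows. The `gain` helper, the adjacency layout, and the unit vertex size are assumptions for illustration; the cost matrix and neighbor counts are chosen to reproduce the slides' numbers for vertex a.

```python
def gain(v, src, dst, part, adj, cost, vsize=1):
    """Architecture-aware gain of moving v from partition src to dst."""
    internal = sum(w for u, w in adj[v] if part[u] == src)
    external = sum(w for u, w in adj[v] if part[u] == dst)
    # Standard edge-cut gain, weighted by the src-dst network cost.
    g_std = (external - internal) * cost[src][dst]
    # Edges to third partitions p change cost from c[src][p] to c[dst][p].
    g_topo = sum(w * (cost[src][part[u]] - cost[dst][part[u]])
                 for u, w in adj[v] if part[u] not in (src, dst))
    # Moving v itself costs vsize units over the src-dst link.
    g_mig = vsize * cost[src][dst]
    return g_std + g_topo - g_mig

# Vertex a on N1 (index 0): two neighbors in N1, one in N2, one in N3;
# N1-N2 is cheap (1) and N1-N3 is expensive (6), as in the slides.
cost = [[0, 1, 6], [1, 0, 1], [6, 1, 0]]
part = {"a": 0, "u1": 0, "u2": 0, "v1": 1, "w1": 2}
adj = {"a": [("u1", 1), ("u2", 1), ("v1", 1), ("w1", 1)]}
print(gain("a", 0, 1, part, adj, cost))  # -1 + 5 - 1 = 3
```

With a uniform cost matrix g_topo would vanish and this reduces to the classic edge-cut gain; the topology term is what makes the refinement architecture-aware.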
Aragon: What's the initial gain of other vertices?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain

[Figure: per-vertex gains; a: 3, the rest: -2, -3, -2, 0, 0, -2]
21
Aragon: Which vertex has the max gain?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Select max gain vertex, a
22
Aragon: Move the vertex with max gain
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Select max gain vertex, a
■ Move a to N2
23
Aragon: Update the gain of a's neighbors
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Select max gain vertex, a
■ Move a to N2
■ Update gain of a's neighbors
24
Aragon: Repeat the whole process
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex, a
  ■ Move a to N2
  ■ Update gain of a's neighbors
25
Aragon: Output for the refinement of N1 and N2
● 4 Units Comm Cost (4 Edge-Cuts)
● 1 Unit Migration Cost
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex, a
  ■ Move a to N2
  ■ Update gain of a's neighbors
26
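The refinement loop these slides walk through is a Kernighan-Lin/FM-style pass: pick the max-gain vertex, move it, update its neighbors' gains, and repeat. Below is a simplified sketch that only makes positive-gain moves (real FM variants also allow tentative negative moves with rollback); the gain values are precomputed example numbers, not output of a real gain function.

```python
def refine_pair(gains, adj, recompute):
    """gains: {boundary vertex: current gain}; adj: neighbor lists;
    recompute(u, moved): u's gain after the vertices in `moved` changed side."""
    moved = []
    while gains:
        v = max(gains, key=gains.get)
        if gains[v] <= 0:           # stop: no remaining move improves things
            break
        gains.pop(v)
        moved.append(v)             # move v to the other partition
        for u in adj.get(v, []):    # the move changes its neighbors' gains
            if u in gains:
                gains[u] = recompute(u, moved)
    return moved

# The slides' example: a has gain 3, everything else is <= 0.
gains = {"a": 3, "b": -2, "c": -3, "d": -2, "e": 0}
adj = {"a": ["b", "c"]}
def recompute(u, moved):
    return {"b": -1, "c": -2}[u]    # precomputed post-move gains (example)
order = refine_pair(gains, adj, recompute)
print(order)  # ['a']: after moving a, no positive-gain vertex remains
```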
Aragon: Refining N1 and N3
● Sending partitions to one place
● Refining (N1, N3)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex
  ■ Move it
  ■ Update gain of its neighbors
27
Aragon: Refining N2 and N3
● Sending partitions to one place
● Refining (N2, N3)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex
  ■ Move it
  ■ Update gain of its neighbors
28
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
29
Paragon: Parallel Architecture-Aware Graph Partitioning Refinement [EDBT'16]
❖ Goal: Group neighboring vertices as close as possible
❖ Paragon vs Aragon
○ lower overhead
○ scales to much larger graphs
30
Paragon: Illustrate Paragon via an example

[Figure: graph G partitioned into P1...P9, with partition Pi hosted on node Ni]
31
Paragon: Partition Grouping

[Figure: partitions P1...P9 grouped into {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}]
32
Paragon: Group Server Selection

[Figure: group servers N2, N8, and N9, one per group (e.g., N2 for {P1, P2, P3})]
33
Paragon: Sending "Partitions" to Group Servers

Only send boundary vertices.

[Figure: each node ships the boundary vertices of its partition to its group server]
34
Paragon: Parallel Refinement

[Figure: each group server runs Aragon over its group of partitions, all groups in parallel]

❖ Number of Groups
○ Degree of Parallelism
○ Parallelism vs Quality
35
Paragon: Shuffle Refinement

[Figure: groups {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8} swapped to {P1, P2, P4}, {P5, P7, P9}, {P3, P6, P8}, then refined in parallel with Aragon]

Repeat k times to increase the # of partition pairs being refined!
36
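The shuffle idea can be sketched as repeated regrouping: each round puts different partitions together so that, over k rounds, more partition pairs end up co-located on some group server and get refined. The round-robin rotation below is an illustrative regrouping rule, not necessarily Paragon's exact scheme.

```python
def shuffle_groups(partitions, group_size, rounds):
    """Yield (groups, #distinct pairs co-located so far) per round."""
    seen_pairs = set()
    for k in range(rounds):
        # Rotate the partition list so each round forms different groups.
        order = partitions[k:] + partitions[:k]
        groups = [order[i:i + group_size]
                  for i in range(0, len(order), group_size)]
        for g in groups:
            for i in range(len(g)):
                for j in range(i + 1, len(g)):
                    seen_pairs.add(frozenset((g[i], g[j])))
        yield groups, len(seen_pairs)

parts = [f"P{i}" for i in range(1, 10)]
for groups, covered in shuffle_groups(parts, 3, 3):
    print(groups, "pairs covered so far:", covered)
```

With 9 partitions in groups of 3, one round covers 9 of the 36 possible pairs; each extra round adds more, which is the parallelism-vs-quality trade-off the previous slide mentions.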
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
37
Planar: Parallel Lightweight Architecture-Aware Graph Repartitioning [ICDE'16]

★ Phase-1: Logical Vertex Migration (Migration Planning)
○ What vertices to move? Where to move them?
○ Phase-1a: Minimizing Comm Cost
○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration (Perform the Migration Plan)
★ Phase-3: Convergence Check (Still beneficial?)

[Figure: Planar runs between consecutive supersteps Sk, Sk+1, Sk+2, ...]
38
Phase-1a: Minimizing Comm Cost

Relative network comm cost matrix:
     N1  N2  N3
N1    -   6   1
N2    6   -   1
N3    1   1   -

[Figure: example graph across N1, N2, N3; the N1-N2 link costs 6, links to N3 cost 1]
39
Phase-1a: Minimizing Comm Cost

★ Run Planar on each partition in parallel
○ For each boundary vertex of my partition:
■ make a migration decision on my own
■ probabilistic vertex migration
40
Phase-1a: Minimizing Comm Cost (cont.)

★ Run Planar on each partition in parallel
○ For each boundary vertex of my partition:
■ make a migration decision on my own
■ probabilistic vertex migration
41
Phase-1a@N1: Use vertex a as an example

★ Run Planar on each partition in parallel
○ For each boundary vertex of my partition:
■ make a migration decision on my own
■ probabilistic vertex migration

g(a, N1, N1) = 0
Max Gain: 0; Optimal Dest: N1
42
Phase-1a@N1: Move vertex a to N2?

old_comm(a, N1) = 2*6 + 1*1 = 13
new_comm(a, N2) = 1*6 + 1*1 = 7
mig(a, N1, N2) = 1*6 = 6
g(a, N1, N2) = 13 - 7 - 6 = 0

Max Gain: 0; Optimal Dest: N1
43
Phase-1a@N1: Move vertex a to N3?

old_comm(a, N1) = 2*6 + 1*1 = 13
new_comm(a, N3) = 1*1 + 2*1 = 3
mig(a, N1, N3) = 1*1 = 1
g(a, N1, N3) = 13 - 3 - 1 = 9

Max Gain: 9; Optimal Dest: N3
44
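The per-destination gains on these slides follow directly from the old_comm/new_comm/mig definitions. A sketch, with an illustrative data layout wired to reproduce the slides' numbers (vertex a on N1 with one neighbor on N1, two on N2, one on N3; the N1-N2 link costs 6):

```python
def planar_gain(v, src, dst, part, adj, cost, vsize=1):
    """g(v, src, dst) = old_comm - new_comm - mig."""
    old_comm = sum(w * cost[src][part[u]] for u, w in adj[v] if part[u] != src)
    new_comm = sum(w * cost[dst][part[u]] for u, w in adj[v] if part[u] != dst)
    mig = 0 if src == dst else vsize * cost[src][dst]
    return old_comm - new_comm - mig

cost = [[0, 6, 1], [6, 0, 1], [1, 1, 0]]   # N1-N2 expensive, links to N3 cheap
part = {"a": 0, "u": 0, "v1": 1, "v2": 1, "w": 2}
adj = {"a": [("u", 1), ("v1", 1), ("v2", 1), ("w", 1)]}
best = max(range(3), key=lambda d: planar_gain("a", 0, d, part, adj, cost))
print(planar_gain("a", 0, 1, part, adj, cost))  # N1 -> N2: 13 - 7 - 6 = 0
print(planar_gain("a", 0, 2, part, adj, cost))  # N1 -> N3: 13 - 3 - 1 = 9
print("optimal dest:", best)                    # index 2, i.e., N3
```

Staying put (dst = src) has gain 0 by construction, so the optimal destination is whichever choice beats 0.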
Phase-1a: Probabilistic Vertex Migration

Migration Planning (migrate with a probability proportional to the gain):

Partition        N1      N2          N3
Boundary Vtx     a       b     d     e     g
Migration Dest   N3      N3    N3    N3    N3
Gain             9       2     3     0     0
Max Gain         9       3           0
Probability      9/9     2/3   3/3   0     0
45
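The table above can be sketched as follows: each boundary vertex migrates to its optimal destination with probability gain / (max gain of its partition), which throttles moves so partitions do not all dump vertices at once. The `migration_plan` helper and the injectable `rng` are illustrative assumptions.

```python
import random

def migration_plan(candidates, rng=random.random):
    """candidates: {partition: [(vertex, dest, gain), ...]}.
    Returns the (vertex, dest) moves chosen this round."""
    plan = []
    for part, cands in candidates.items():
        max_gain = max(g for _, _, g in cands)
        for v, dest, g in cands:
            # Migrate with probability g / max_gain; zero-gain never moves.
            if g > 0 and rng() < g / max_gain:
                plan.append((v, dest))
    return plan

# The slides' table: a (gain 9/9), b (2/3), d (3/3), e and g (gain 0).
candidates = {
    "N1": [("a", "N3", 9)],
    "N2": [("b", "N3", 2), ("d", "N3", 3)],
    "N3": [("e", "N3", 0), ("g", "N3", 0)],
}
print(migration_plan(candidates, rng=lambda: 0.5))
```

With a deterministic rng of 0.5, a, b, and d migrate (probabilities 1, 2/3, 1) while e and g stay.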
Phase-1b: Balancing Partitions

❖ Quota-Based Vertex Migration
Q1: How much work should each overloaded partition migrate to each underloaded partition?
■ Potential Gain Computation
● Similar to Phase-1a vertex gain computation
■ Iteratively allocate quota starting from the partition pair having the largest gain.
Q2: What vertices to migrate?
■ Phase-1a vertex migration, but limited by the quota.
46
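Q1's quota loop can be sketched as a greedy allocation, assuming surpluses, capacities, and pair gains have already been computed (all names and numbers below are illustrative, not from the proposal):

```python
def allocate_quota(excess, capacity, pair_gain):
    """excess: {overloaded partition: surplus load},
    capacity: {underloaded partition: spare room},
    pair_gain: {(over, under): potential gain of shifting load}.
    Returns the load quota assigned to each partition pair."""
    quota = {}
    # Visit pairs from the largest potential gain downwards.
    for over, under in sorted(pair_gain, key=pair_gain.get, reverse=True):
        if excess.get(over, 0) <= 0 or capacity.get(under, 0) <= 0:
            continue
        amount = min(excess[over], capacity[under])
        quota[(over, under)] = amount
        excess[over] -= amount
        capacity[under] -= amount
    return quota

# N1 is overloaded by 40; shifting load to N3 has the larger gain.
print(allocate_quota({"N1": 40}, {"N2": 30, "N3": 30},
                     {("N1", "N2"): 5, ("N1", "N3"): 9}))
# {('N1', 'N3'): 30, ('N1', 'N2'): 10}
```

Phase-1a's per-vertex migration then runs as usual, but stops sending to a destination once that pair's quota is used up.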
Planar: Physical Vertex Migration

★ Phase-1: Logical Vertex Migration (Migration Planning)
○ Phase-1a: Minimizing Comm Cost
○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration (Perform the Migration Plan)
★ Phase-3: Convergence Check (Still beneficial?)
47
Planar: Convergence Check

★ Phase-1: Logical Vertex Migration (Migration Planning)
○ Phase-1a: Minimizing Comm Cost
○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration (Perform the Migration Plan)
★ Phase-3: Convergence Check (Still beneficial?)
48
Phase-3: Convergence

[Figure: repartitioning epoch over supersteps Sk, Sk+1, ...; Planar converges, then restarts once there are enough changes (structure/load)]

★ Converge when:
○ the improvement achieved per adaptation superstep stays below a threshold
○ for a number of consecutive adaptation supersteps
(threshold = 1% and count = 10, chosen via sensitivity analysis on 12 datasets)
49
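The convergence rule can be sketched as a simple history check. The threshold (1%) and patience (10 supersteps) come from the slide; the function name and history layout are assumptions.

```python
def converged(improvements, threshold=0.01, patience=10):
    """improvements: fractional gain per adaptation superstep, newest last.
    Converged once `patience` consecutive supersteps each improve by less
    than `threshold`."""
    if len(improvements) < patience:
        return False
    return all(imp < threshold for imp in improvements[-patience:])

history = [0.30, 0.12, 0.05] + [0.004] * 10
print(converged(history))       # True: 10 consecutive supersteps under 1%
print(converged(history[:-1]))  # False: only 9 under the threshold so far
```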
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
50
Network may not always be the bottleneck!

★ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket
★ InfiniBand: 1.7 GB/s ~ 37.5 GB/s
★ DDR3: 6.25 GB/s ~ 16.6 GB/s

Network vs Memory Bandwidth (C. Bing, CoRR'15)

Revisit the impact of the memory subsystem carefully!
51
Contention on Memory Subsystems: intra-node data comm goes via shared memory!

[Figure: the sending core copies data from its send buffer into a shared buffer (load/write), and the receiving core copies it from the shared buffer into its receive buffer (load/write)]
52
Contention on Memory Subsystems: intra-node data comm leads to cache pollution!

With the send, shared, and receive buffers all cached, multiple copies of the same data sit in the LLC, contending for the LLC and the memory controller (MC).
53
Contention on Memory Subsystems: intra-node data comm leads to cache pollution!

With the send/shared buffer cached on one socket and the receive/shared buffer on the other, multiple copies of the same data sit in the LLC, contending for the LLC, the MC, and the QPI.
54
(P)aragon/Planar Contention Avoidance: penalize the intra-node network comm cost!

Intra-Node Network Comm Cost = λ × Maximal Inter-Node Network Comm Cost, where λ is the degree of contention.

System Bottleneck:
○ Memory (λ=1): clusters with high-speed networks
○ Network (λ=0): geo-distributed clusters or the Cloud
55
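The λ penalty can be sketched as a transformation of the comm cost matrix; `penalize_intra_node` and the two-partitions-per-node example are illustrative assumptions:

```python
def penalize_intra_node(cost, same_node, lam):
    """cost: network comm cost matrix over partitions; same_node(i, j): True
    if partitions i and j are hosted on the same physical node; lam: degree
    of contention. Intra-node entries become lam * max inter-node cost."""
    n = len(cost)
    max_inter = max(cost[i][j] for i in range(n)
                    for j in range(n) if not same_node(i, j))
    return [[lam * max_inter if same_node(i, j) and i != j else cost[i][j]
             for j in range(n)] for i in range(n)]

# Four partitions, two per node: {0, 1} on node A, {2, 3} on node B.
cost = [[0, 0, 6, 6], [0, 0, 6, 6], [6, 6, 0, 0], [6, 6, 0, 0]]
same = lambda i, j: (i < 2) == (j < 2)
print(penalize_intra_node(cost, same, lam=1))  # intra-node pairs now cost 6
print(penalize_intra_node(cost, same, lam=0))  # intra-node comm stays free
```

With λ=1 the repartitioner treats intra-node comm as being as expensive as the worst network link (memory-bound clusters); with λ=0 the matrix is unchanged (network-bound clusters).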
(P)aragon/Planar Contention Avoidance: RDMA allows inter-node data comm without polluting the cache!

[Figure: the sending core on Node#1 hands its send buffer to the IB HCA, which transfers the data directly into the receive buffer on Node#2 through its IB HCA, bypassing both cores' caches]
56
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
57
Evaluation
❖ Microbenchmarks
▪ Partitioning Quality
❖ Real-World Workloads
▪ Breadth First Search (BFS)
▪ Single Source Shortest Path (SSSP)
❖ Scalability Test
▪ Scalability vs Graph Size
▪ Scalability vs # of Partitions
▪ Scalability vs Graph Size and # of Partitions
58
Real-World Workload: Setup
Cluster Configuration    MPICluster (FDR InfiniBand)        Gordon (QDR InfiniBand)
# of Nodes               32                                 1024
Network Topology         single switch (32 nodes/switch)    4*4*4 3D torus of switches (16 nodes/switch)
Network Bandwidth        56 Gbps                            8 Gbps

Node Configuration       MPICluster (Intel Haswell)         Gordon (Intel Sandy Bridge)
# of Sockets             2 (10 cores/socket)                2 (8 cores/socket)
L3 Cache                 25 MB                              20 MB
Memory Bandwidth         65 GB/s                            85 GB/s
59
Real-World Workload: System Bottleneck

Intra-Node Network Comm Cost = λ × Maximal Inter-Node Network Comm Cost (λ: degree of contention)

System Bottleneck: MPICluster is memory-bound (λ=1); Gordon is network-bound (λ=0).
60
Real-World Workload: Baselines

Balanced Graph (Re)Partitioning
❖ Partitioners (static graphs)
▪ Offline Methods (High Quality, Poor Scalability): Metis'95, ICA3PP'08, SoCC'12, TKDE'15, BigData'15
▪ Online Methods (Moderate Quality, High Scalability): DG/LDG'12
❖ Repartitioners (dynamic graphs)
▪ Offline Methods (High Quality, Poor Scalability): Parmetis'97
▪ Online Methods (Moderate~High Quality, High Scalability): CatchW'13, xdgp'13, Hermes'15, Mizan'13
❖ Architecture-Aware: Aragon'14, Paragon'16, Planar, uniPlanar

Initial Partitioner: DG
61
BFS Exec. Time on MPICluster (λ=1): Planar achieved up to 9x speedups

[Figure: speedups of the compared methods: 9x, 7.5x, 5.8x, 4.1x, 1.48x, 1.37x, 1x]

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 Partitions: three 20-core machines
62
BFS Comm Volume on MPICluster (λ=1): Planar had the lowest intra-node comm volume

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 Partitions: three 20-core machines

Reduction     Intra-Socket   Inter-Socket
DG            51%            38%
METIS         51%            36%
PARMETIS      47%            34%
uniPLANAR     44%            28%
ARAGON        4.3%           0.8%
PARAGON       5.2%           2.6%
63
BFS Exec. Time on Gordon (λ=0): Planar achieved up to 3.2x speedups

[Figure: speedups of the compared methods: 3.2x, 1.21x, 1.16x, 1.05x, 1x]

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 Partitions: three 16-core machines
64
BFS Comm. Volume on Gordon (λ=0): Planar had the lowest inter-node comm volume

[Figure: inter-node comm volume reductions: 51%, 25%, 11%, 0.1%]

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 Partitions: three 16-core machines
65
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
66
Argo: Architecture-Aware Graph Partitioning (05/2016~09/2016)

❖ Goal: make initial partitioning architecture-aware
❑ To further confirm the contention issue (experimentally)
o By collecting a set of low-level metrics
✓ (e.g., cache misses, TLB misses)
❑ Architecture-aware static graph partitioner
o For the initial partitioning step
67
Sargon: Skew-Resistant Graph Partitioning (05/2016~09/2016)

❖ Goal: make initial partitioning skew-resistant
❑ Workload characteristics
➢ Traversal-style graph workloads (e.g., BFS/SSSP)
✓ Not all vertices are always active
✓ A balanced partitioning of the entire graph ≠ even active vertex distribution
68
Sargon: Skew-Resistant Graph Partitioning (05/2016~09/2016)

❖ Goal: make initial partitioning skew-resistant
❑ Workload characteristics
➢ Traversal-style graph workloads (e.g., BFS/SSSP)
✓ Not all vertices are always active
✓ A balanced partitioning of the entire graph ≠ even active vertex distribution
❑ Graph structure characteristics
➢ Skewed vertex degree distribution (scale-free)
➢ A balanced partitioning of the entire graph ≠ even high-degree vertex distribution
69
Skew-Resistant Graph Repartitioning (09/2016~12/2016)

❖ Goal: make repartitioning skew-resistant
❑ Workload characteristics
➢ Traversal-style graph workloads (e.g., BFS/SSSP)
✓ Not all vertices are always active
✓ A balanced partitioning of the entire graph ≠ even active vertex distribution
❑ Graph structure characteristics
➢ Skewed vertex degree distribution (scale-free)
➢ A balanced partitioning of the entire graph ≠ even high-degree vertex distribution
70
Research Overview & Timeline
Timeline           Work                                                                   Progress
                   Aragon: Heterogeneity-Aware Graph Partition Refinement                 Completed [BigGraphs'14]
                   Paragon: Parallel Architecture-Aware Graph Partition Refinement        Completed [EDBT'16]
                   Planar: Parallel Lightweight Architecture-Aware Graph Repartitioning   Completed [ICDE'16]
05/2016~09/2016    Argo: Architecture-Aware Graph Partitioning                            Evaluation & Writing
05/2016~09/2016    Sargon: Skew-Resistant Graph Partitioning                              Evaluation & Writing
09/2016~12/2016    Skew-Resistant Graph Repartitioning                                    Algorithm Design
07/2016~03/2017    Thesis writing                                                         Ongoing
04/2017            Thesis defense
71
Thanks!
72