Transcript of: Architecture-Aware Graph (Re)Partitioning (people.cs.pitt.edu/~anz28/papers/proposal.slides.pdf)
Architecture-Aware Graph (Re)Partitioning
Thesis Proposal Defense
Angen Zheng
Committee:
▪ Alexandros Labrinidis, Dept. of Computer Science, Pitt (Advisor)
▪ Panos K. Chrysanthis, Dept. of Computer Science, Pitt (Co-advisor)
▪ Jack Lange, Dept. of Computer Science, Pitt
▪ Peyman Givi, Swanson School of Engineering, Pitt
▪ Patrick Pisciuneri, Swanson School of Engineering, Pitt
1
Graph (Re)Partitioning
❖ Applications of Graph (Re)Partitioning
▪ Scientific Simulations
▪ Distributed Graph Computation
o e.g., Pregel and Giraph
▪ VLSI Design
▪ Task Scheduling
▪ Linear Programming
2
Vertex-Centric BSP Computing Model
★ Vertex
○ a unique identifier
○ a modifiable, user-defined value
★ Edge
○ a modifiable, user-defined value
○ a target vertex identifier

[Figure: vertices, each running a user-defined function (UDF)]

★ Vertex-Centric UDF
○ Change vertex/edge state
○ Send msgs to neighbors
○ Receive msgs from neighbors
○ Mutate the graph topology
○ Deactivate at the end of the superstep
○ Reactivate by external msgs
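The UDF bullets above can be sketched as a tiny BSP driver. This is an illustrative Python sketch, not Giraph's actual API; the dict-based vertex layout, the `send` callback, and the `run_bsp` driver are assumptions. It runs the classic maximum-value-propagation example: each vertex forwards the largest value it has seen, deactivates, and is reactivated by incoming messages.

```python
# Minimal vertex-centric BSP sketch (illustrative; not the Pregel/Giraph API).
def max_value_udf(vertex, messages, send):
    """Example UDF: propagate the maximum value seen so far."""
    new_value = max([vertex["value"]] + messages)
    if new_value > vertex["value"] or vertex["superstep"] == 0:
        vertex["value"] = new_value
        for nbr in vertex["edges"]:
            send(nbr, new_value)
    vertex["active"] = False  # vote to halt; incoming msgs reactivate

def run_bsp(graph, udf, max_supersteps=30):
    inbox = {v: [] for v in graph}
    for step in range(max_supersteps):
        outbox = {v: [] for v in graph}
        progressed = False
        for vid, vertex in graph.items():
            msgs = inbox[vid]
            if step == 0 or msgs:  # messages reactivate a halted vertex
                vertex["superstep"] = step
                udf(vertex, msgs, lambda dst, m: outbox[dst].append(m))
                progressed = True
        inbox = outbox
        if not progressed:  # every vertex halted with no pending messages
            break
    return graph

graph = {
    1: {"value": 3, "edges": [2], "active": True, "superstep": 0},
    2: {"value": 6, "edges": [1, 3], "active": True, "superstep": 0},
    3: {"value": 2, "edges": [2], "active": True, "superstep": 0},
}
run_bsp(graph, max_value_udf)
print([graph[v]["value"] for v in sorted(graph)])  # every vertex ends at 6
```

The superstep barrier is implicit here: `outbox` only becomes visible as `inbox` after all vertices of the step have run.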
3
A Balanced Partitioning = Even Load Distribution

[Figure: a graph split evenly across nodes N1, N2, N3]

Balanced: every partition carries roughly the same load.
4
Minimal Edge-Cut = Minimal Data Comm

[Figure: the same graph partitioned across N1, N2, N3 with few cut edges]

Minimizing Edge-Cut: minimize the number of edges crossing partitions.
5
Minimal Edge-Cut = Minimal Data Comm But Minimal Data Comm ≠ Minimal Comm Cost
Group neighboring vertices as close as possible
The (re)partitioner has to be Architecture-Aware
Figure 1. Pair-Wise Network Bandwidth (J. Xue, BigData'15); std. dev. of pair-wise bandwidth: 416.82 Mb/s, 358.34 Mb/s, 269.71 Mb/s.
6
Overview of the State-of-the-Art
Balanced Graph (Re)Partitioning
❖ Partitioners (static graphs)
▪ Offline Methods (High Quality, Poor Scalability): Metis'95, ICA3PP'08, SoCC'12, TKDE'15, BigData'15
▪ Online Methods (Moderate Quality, High Scalability): DG/LDG'12
❖ Repartitioners (dynamic graphs)
▪ Offline Methods (High Quality, Poor Scalability): Parmetis'97
▪ Online Methods (Moderate~High Quality, High Scalability): CatchW'13, xdgp'13, Hermes'15, Mizan'13
❖ Architecture-Aware: Aragon'14, Paragon'16, Planar'16
7
Architecture-Aware Graph Repartitioning

Given G=(V, E) and an initial partitioning P:
★ Minimizing Communication: weight each cut edge by the network cost between the nodes hosting its endpoints
★ Balancing Load
★ Minimizing Migration: account for the network cost of moving vertices
8
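The three objectives can be made concrete with a small sketch. The function names, the dict-based partitioning, and the example numbers are illustrative assumptions; only the idea of weighting cut edges and moves by a network cost matrix comes from the talk.

```python
def comm_cost(edges, part, cost):
    """Comm cost: each cut edge weighted by the network cost between the
    nodes hosting its two endpoints."""
    return sum(cost[part[u]][part[v]] for u, v in edges if part[u] != part[v])

def imbalance(part, num_parts):
    """Largest partition size relative to the ideal (perfectly even) size."""
    sizes = [sum(1 for p in part.values() if p == i) for i in range(num_parts)]
    return max(sizes) / (len(part) / num_parts)

def migration_cost(old_part, new_part, cost):
    """Cost of moving each vertex from its old node to its new node."""
    return sum(cost[old_part[v]][new_part[v]]
               for v in old_part if old_part[v] != new_part[v])

# Same edge-cut (2 cut edges), different comm cost once the network
# is non-uniform: the N1-N3 link is 6x more expensive in `skewed`.
edges = [(0, 1), (1, 2), (0, 3)]
part = {0: 0, 1: 0, 2: 1, 3: 2}
uniform = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
skewed = [[0, 1, 6], [1, 0, 1], [6, 1, 0]]
print(comm_cost(edges, part, uniform))  # 2
print(comm_cost(edges, part, skewed))   # 7
```

This is exactly why minimal edge-cut does not imply minimal comm cost: the cut is identical in both cases.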
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
9
Aragon: Sequential Architecture-Aware Graph Partition Refinement [BigGraphs’14]
❖ Goal: Group neighboring vertices as close as possible
❖ Input:
o A partitioned graph
o The relative network comm cost matrix
❖ Output:
o A partitioning with neighboring vertices grouped as close as possible.
10
Aragon: Illustrate Aragon through an example

[Figure: example graph partitioned across nodes N1, N2, N3]

Relative network comm cost matrix:
     N1  N2  N3
N1    -   1   6
N2    1   -   1
N3    6   1   -
11
Aragon: Send partitions to a centralized place
● Sending partitions to one place
12
Aragon: Refine each partition pair sequentially
● Sending partitions to one place
● Refining two partitions at a time
13
Aragon: Move vertices between N1 and N2?
● Sending partitions to one place
● Refining (N1, N2)
14
Aragon: Compute initial gain for vertices of N1 & N2
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
15
Aragon: Compute initial gain for vertex a
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
16
Aragon: How does the movement affect comm(N1, N2)?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
17
Aragon: How does the movement affect comm(N1, N3)?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
■ g_topo(a) = 1*(6-1) = 5
18
Aragon: What is the migration cost of vertex a?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
■ g_topo(a) = 1*(6-1) = 5
■ g_mig(a) = 1*1 = 1
19
Aragon: What's the initial gain of moving a?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ a: N1 -> N2
■ g_std(a) = (1-2)*1 = -1
■ g_topo(a) = 1*(6-1) = 5
■ g_mig(a) = 1*1 = 1
■ g(a) = -1 + 5 - 1 = 3
20
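The worked example above (g(a) = g_std + g_topo - g_mig) can be sketched as follows. The `gain` helper, the adjacency layout, and the unit vertex size are assumptions for illustration; the cost matrix and neighbor counts are chosen to reproduce the slides' numbers for vertex a.

```python
def gain(v, src, dst, part, adj, cost, vsize=1):
    """Architecture-aware gain of moving v from partition src to dst."""
    internal = sum(w for u, w in adj[v] if part[u] == src)
    external = sum(w for u, w in adj[v] if part[u] == dst)
    # Standard edge-cut gain, weighted by the src-dst network cost.
    g_std = (external - internal) * cost[src][dst]
    # Edges to third partitions p change cost from c[src][p] to c[dst][p].
    g_topo = sum(w * (cost[src][part[u]] - cost[dst][part[u]])
                 for u, w in adj[v] if part[u] not in (src, dst))
    # Moving v itself costs vsize units over the src-dst link.
    g_mig = vsize * cost[src][dst]
    return g_std + g_topo - g_mig

# Vertex a on N1 (index 0): two neighbors in N1, one in N2, one in N3;
# N1-N2 is cheap (1) and N1-N3 is expensive (6), as in the slides.
cost = [[0, 1, 6], [1, 0, 1], [6, 1, 0]]
part = {"a": 0, "u1": 0, "u2": 0, "v1": 1, "w1": 2}
adj = {"a": [("u1", 1), ("u2", 1), ("v1", 1), ("w1", 1)]}
print(gain("a", 0, 1, part, adj, cost))  # -1 + 5 - 1 = 3
```

With a uniform cost matrix g_topo would vanish and this reduces to the classic edge-cut gain; the topology term is what makes the refinement architecture-aware.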
Aragon: What's the initial gain of other vertices?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain

[Figure: per-vertex gains; a: 3, the rest: -2, -3, -2, 0, 0, -2]
21
Aragon: Which vertex has the max gain?
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Select max gain vertex, a
22
Aragon: Move the vertex with max gain
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Select max gain vertex, a
■ Move a to N2
23
Aragon: Update the gain of a's neighbors
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Select max gain vertex, a
■ Move a to N2
■ Update gain of a's neighbors
24
Aragon: Repeat the whole process
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex, a
  ■ Move a to N2
  ■ Update gain of a's neighbors
25
Aragon: Output for the refinement of N1 and N2
● 4 Units Comm Cost (4 Edge-Cuts)
● 1 Unit Migration Cost
● Sending partitions to one place
● Refining (N1, N2)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex, a
  ■ Move a to N2
  ■ Update gain of a's neighbors
26
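The refinement loop these slides walk through is a Kernighan-Lin/FM-style pass: pick the max-gain vertex, move it, update its neighbors' gains, and repeat. Below is a simplified sketch that only makes positive-gain moves (real FM variants also allow tentative negative moves with rollback); the gain values are precomputed example numbers, not output of a real gain function.

```python
def refine_pair(gains, adj, recompute):
    """gains: {boundary vertex: current gain}; adj: neighbor lists;
    recompute(u, moved): u's gain after the vertices in `moved` changed side."""
    moved = []
    while gains:
        v = max(gains, key=gains.get)
        if gains[v] <= 0:           # stop: no remaining move improves things
            break
        gains.pop(v)
        moved.append(v)             # move v to the other partition
        for u in adj.get(v, []):    # the move changes its neighbors' gains
            if u in gains:
                gains[u] = recompute(u, moved)
    return moved

# The slides' example: a has gain 3, everything else is <= 0.
gains = {"a": 3, "b": -2, "c": -3, "d": -2, "e": 0}
adj = {"a": ["b", "c"]}
def recompute(u, moved):
    return {"b": -1, "c": -2}[u]    # precomputed post-move gains (example)
order = refine_pair(gains, adj, recompute)
print(order)  # ['a']: after moving a, no positive-gain vertex remains
```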
Aragon: Refining N1 and N3
● Sending partitions to one place
● Refining (N1, N3)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex
  ■ Move it
  ■ Update gain of its neighbors
27
Aragon: Refining N2 and N3
● Sending partitions to one place
● Refining (N2, N3)
■ Compute initial gain
■ Repeat
  ■ Select max gain vertex
  ■ Move it
  ■ Update gain of its neighbors
28
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
29
Paragon: Parallel Architecture-Aware Graph Partitioning Refinement [EDBT'16]
❖ Goal: Group neighboring vertices as close as possible
❖ Paragon vs Aragon
○ lower overhead
○ scales to much larger graphs
30
Paragon: Illustrate Paragon via an example

[Figure: graph G partitioned into P1...P9, with partition Pi hosted on node Ni]
31
Paragon: Partition Grouping

[Figure: partitions P1...P9 grouped into {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8}]
32
Paragon: Group Server Selection

[Figure: group servers N2, N8, and N9, one per group (e.g., N2 for {P1, P2, P3})]
33
Paragon: Sending "Partitions" to Group Servers

Only send boundary vertices.

[Figure: each node ships the boundary vertices of its partition to its group server]
34
Paragon: Parallel Refinement

[Figure: each group server runs Aragon over its group of partitions, all groups in parallel]

❖ Number of Groups
○ Degree of Parallelism
○ Parallelism vs Quality
35
Paragon: Shuffle Refinement

[Figure: groups {P1, P2, P3}, {P4, P6, P9}, {P5, P7, P8} swapped to {P1, P2, P4}, {P5, P7, P9}, {P3, P6, P8}, then refined in parallel with Aragon]

Repeat k times to increase the # of partition pairs being refined!
36
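The shuffle idea can be sketched as repeated regrouping: each round puts different partitions together so that, over k rounds, more partition pairs end up co-located on some group server and get refined. The round-robin rotation below is an illustrative regrouping rule, not necessarily Paragon's exact scheme.

```python
def shuffle_groups(partitions, group_size, rounds):
    """Yield (groups, #distinct pairs co-located so far) per round."""
    seen_pairs = set()
    for k in range(rounds):
        # Rotate the partition list so each round forms different groups.
        order = partitions[k:] + partitions[:k]
        groups = [order[i:i + group_size]
                  for i in range(0, len(order), group_size)]
        for g in groups:
            for i in range(len(g)):
                for j in range(i + 1, len(g)):
                    seen_pairs.add(frozenset((g[i], g[j])))
        yield groups, len(seen_pairs)

parts = [f"P{i}" for i in range(1, 10)]
for groups, covered in shuffle_groups(parts, 3, 3):
    print(groups, "pairs covered so far:", covered)
```

With 9 partitions in groups of 3, one round covers 9 of the 36 possible pairs; each extra round adds more, which is the parallelism-vs-quality trade-off the previous slide mentions.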
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
37
Planar: Parallel Lightweight Architecture-Aware Graph Repartitioning [ICDE'16]

★ Phase-1: Logical Vertex Migration (Migration Planning)
○ What vertices to move? Where to move them?
○ Phase-1a: Minimizing Comm Cost
○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration (Perform the Migration Plan)
★ Phase-3: Convergence Check (Still beneficial?)

[Figure: Planar runs between consecutive supersteps Sk, Sk+1, Sk+2, ...]
38
Phase-1a: Minimizing Comm Cost

Relative network comm cost matrix:
     N1  N2  N3
N1    -   6   1
N2    6   -   1
N3    1   1   -

[Figure: example graph across N1, N2, N3; the N1-N2 link costs 6, links to N3 cost 1]
39
Phase-1a: Minimizing Comm Cost

★ Run Planar on each partition in parallel
○ For each boundary vertex of my partition:
■ make a migration decision on my own
■ probabilistic vertex migration
40
Phase-1a: Minimizing Comm Cost (cont.)

★ Run Planar on each partition in parallel
○ For each boundary vertex of my partition:
■ make a migration decision on my own
■ probabilistic vertex migration
41
Phase-1a@N1: Use vertex a as an example

★ Run Planar on each partition in parallel
○ For each boundary vertex of my partition:
■ make a migration decision on my own
■ probabilistic vertex migration

g(a, N1, N1) = 0
Max Gain: 0; Optimal Dest: N1
42
Phase-1a@N1: Move vertex a to N2?

old_comm(a, N1) = 2*6 + 1*1 = 13
new_comm(a, N2) = 1*6 + 1*1 = 7
mig(a, N1, N2) = 1*6 = 6
g(a, N1, N2) = 13 - 7 - 6 = 0

Max Gain: 0; Optimal Dest: N1
43
Phase-1a@N1: Move vertex a to N3?

old_comm(a, N1) = 2*6 + 1*1 = 13
new_comm(a, N3) = 1*1 + 2*1 = 3
mig(a, N1, N3) = 1*1 = 1
g(a, N1, N3) = 13 - 3 - 1 = 9

Max Gain: 9; Optimal Dest: N3
44
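The per-destination gains on these slides follow directly from the old_comm/new_comm/mig definitions. A sketch, with an illustrative data layout wired to reproduce the slides' numbers (vertex a on N1 with one neighbor on N1, two on N2, one on N3; the N1-N2 link costs 6):

```python
def planar_gain(v, src, dst, part, adj, cost, vsize=1):
    """g(v, src, dst) = old_comm - new_comm - mig."""
    old_comm = sum(w * cost[src][part[u]] for u, w in adj[v] if part[u] != src)
    new_comm = sum(w * cost[dst][part[u]] for u, w in adj[v] if part[u] != dst)
    mig = 0 if src == dst else vsize * cost[src][dst]
    return old_comm - new_comm - mig

cost = [[0, 6, 1], [6, 0, 1], [1, 1, 0]]   # N1-N2 expensive, links to N3 cheap
part = {"a": 0, "u": 0, "v1": 1, "v2": 1, "w": 2}
adj = {"a": [("u", 1), ("v1", 1), ("v2", 1), ("w", 1)]}
best = max(range(3), key=lambda d: planar_gain("a", 0, d, part, adj, cost))
print(planar_gain("a", 0, 1, part, adj, cost))  # N1 -> N2: 13 - 7 - 6 = 0
print(planar_gain("a", 0, 2, part, adj, cost))  # N1 -> N3: 13 - 3 - 1 = 9
print("optimal dest:", best)                    # index 2, i.e., N3
```

Staying put (dst = src) has gain 0 by construction, so the optimal destination is whichever choice beats 0.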
Phase-1a: Probabilistic Vertex Migration

Migration Planning (migrate with a probability proportional to the gain):

Partition        N1      N2          N3
Boundary Vtx     a       b     d     e     g
Migration Dest   N3      N3    N3    N3    N3
Gain             9       2     3     0     0
Max Gain         9       3           0
Probability      9/9     2/3   3/3   0     0
45
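The table above can be sketched as follows: each boundary vertex migrates to its optimal destination with probability gain / (max gain of its partition), which throttles moves so partitions do not all dump vertices at once. The `migration_plan` helper and the injectable `rng` are illustrative assumptions.

```python
import random

def migration_plan(candidates, rng=random.random):
    """candidates: {partition: [(vertex, dest, gain), ...]}.
    Returns the (vertex, dest) moves chosen this round."""
    plan = []
    for part, cands in candidates.items():
        max_gain = max(g for _, _, g in cands)
        for v, dest, g in cands:
            # Migrate with probability g / max_gain; zero-gain never moves.
            if g > 0 and rng() < g / max_gain:
                plan.append((v, dest))
    return plan

# The slides' table: a (gain 9/9), b (2/3), d (3/3), e and g (gain 0).
candidates = {
    "N1": [("a", "N3", 9)],
    "N2": [("b", "N3", 2), ("d", "N3", 3)],
    "N3": [("e", "N3", 0), ("g", "N3", 0)],
}
print(migration_plan(candidates, rng=lambda: 0.5))
```

With a deterministic rng of 0.5, a, b, and d migrate (probabilities 1, 2/3, 1) while e and g stay.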
Phase-1b: Balancing Partitions

❖ Quota-Based Vertex Migration
Q1: How much work should each overloaded partition migrate to each underloaded partition?
■ Potential Gain Computation
● Similar to Phase-1a vertex gain computation
■ Iteratively allocate quota starting from the partition pair having the largest gain.
Q2: What vertices to migrate?
■ Phase-1a vertex migration, but limited by the quota.
46
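Q1's quota loop can be sketched as a greedy allocation, assuming surpluses, capacities, and pair gains have already been computed (all names and numbers below are illustrative, not from the proposal):

```python
def allocate_quota(excess, capacity, pair_gain):
    """excess: {overloaded partition: surplus load},
    capacity: {underloaded partition: spare room},
    pair_gain: {(over, under): potential gain of shifting load}.
    Returns the load quota assigned to each partition pair."""
    quota = {}
    # Visit pairs from the largest potential gain downwards.
    for over, under in sorted(pair_gain, key=pair_gain.get, reverse=True):
        if excess.get(over, 0) <= 0 or capacity.get(under, 0) <= 0:
            continue
        amount = min(excess[over], capacity[under])
        quota[(over, under)] = amount
        excess[over] -= amount
        capacity[under] -= amount
    return quota

# N1 is overloaded by 40; shifting load to N3 has the larger gain.
print(allocate_quota({"N1": 40}, {"N2": 30, "N3": 30},
                     {("N1", "N2"): 5, ("N1", "N3"): 9}))
# {('N1', 'N3'): 30, ('N1', 'N2'): 10}
```

Phase-1a's per-vertex migration then runs as usual, but stops sending to a destination once that pair's quota is used up.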
Planar: Physical Vertex Migration

★ Phase-1: Logical Vertex Migration (Migration Planning)
○ Phase-1a: Minimizing Comm Cost
○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration (Perform the Migration Plan)
★ Phase-3: Convergence Check (Still beneficial?)
47
Planar: Convergence Check

★ Phase-1: Logical Vertex Migration (Migration Planning)
○ Phase-1a: Minimizing Comm Cost
○ Phase-1b: Ensuring Balanced Partitions
★ Phase-2: Physical Vertex Migration (Perform the Migration Plan)
★ Phase-3: Convergence Check (Still beneficial?)
48
Phase-3: Convergence

[Figure: repartitioning epoch over supersteps Sk, Sk+1, ...; Planar converges, then restarts once there are enough changes (structure/load)]

★ Converge when:
○ the improvement achieved per adaptation superstep stays below a threshold
○ for a number of consecutive adaptation supersteps
(threshold = 1% and count = 10, chosen via sensitivity analysis on 12 datasets)
49
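The convergence rule can be sketched as a simple history check. The threshold (1%) and patience (10 supersteps) come from the slide; the function name and history layout are assumptions.

```python
def converged(improvements, threshold=0.01, patience=10):
    """improvements: fractional gain per adaptation superstep, newest last.
    Converged once `patience` consecutive supersteps each improve by less
    than `threshold`."""
    if len(improvements) < patience:
        return False
    return all(imp < threshold for imp in improvements[-patience:])

history = [0.30, 0.12, 0.05] + [0.004] * 10
print(converged(history))       # True: 10 consecutive supersteps under 1%
print(converged(history[:-1]))  # False: only 9 under the threshold so far
```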
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
50
Network may not always be the bottleneck!

★ Dual-socket Xeon E5v2 server with
○ DDR3-1600
○ 2 FDR 4x NICs per socket
★ InfiniBand: 1.7 GB/s ~ 37.5 GB/s
★ DDR3: 6.25 GB/s ~ 16.6 GB/s

Network vs Memory Bandwidth (C. Bing, CoRR'15)

Revisit the impact of the memory subsystem carefully!
51
Contention on Memory Subsystems: intra-node data comm goes via shared memory!

[Figure: the sending core copies data from its send buffer into a shared buffer (load/write), and the receiving core copies it from the shared buffer into its receive buffer (load/write)]
52
Contention on Memory Subsystems: intra-node data comm leads to cache pollution!

With the send, shared, and receive buffers all cached, multiple copies of the same data sit in the LLC, contending for the LLC and the memory controller (MC).
53
Contention on Memory Subsystems: intra-node data comm leads to cache pollution!

With the send/shared buffer cached on one socket and the receive/shared buffer on the other, multiple copies of the same data sit in the LLC, contending for the LLC, the MC, and the QPI.
54
(P)aragon/Planar Contention Avoidance: penalize the intra-node network comm cost!

Intra-Node Network Comm Cost = λ × Maximal Inter-Node Network Comm Cost, where λ is the degree of contention.

System Bottleneck:
○ Memory (λ=1): clusters with high-speed networks
○ Network (λ=0): geo-distributed clusters or the Cloud
55
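The λ penalty can be sketched as a transformation of the comm cost matrix; `penalize_intra_node` and the two-partitions-per-node example are illustrative assumptions:

```python
def penalize_intra_node(cost, same_node, lam):
    """cost: network comm cost matrix over partitions; same_node(i, j): True
    if partitions i and j are hosted on the same physical node; lam: degree
    of contention. Intra-node entries become lam * max inter-node cost."""
    n = len(cost)
    max_inter = max(cost[i][j] for i in range(n)
                    for j in range(n) if not same_node(i, j))
    return [[lam * max_inter if same_node(i, j) and i != j else cost[i][j]
             for j in range(n)] for i in range(n)]

# Four partitions, two per node: {0, 1} on node A, {2, 3} on node B.
cost = [[0, 0, 6, 6], [0, 0, 6, 6], [6, 6, 0, 0], [6, 6, 0, 0]]
same = lambda i, j: (i < 2) == (j < 2)
print(penalize_intra_node(cost, same, lam=1))  # intra-node pairs now cost 6
print(penalize_intra_node(cost, same, lam=0))  # intra-node comm stays free
```

With λ=1 the repartitioner treats intra-node comm as being as expensive as the worst network link (memory-bound clusters); with λ=0 the matrix is unchanged (network-bound clusters).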
(P)aragon/Planar Contention Avoidance: RDMA allows inter-node data comm without polluting the cache!

[Figure: the sending core on Node#1 hands its send buffer to the IB HCA, which transfers the data directly into the receive buffer on Node#2 through its IB HCA, bypassing both cores' caches]
56
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
57
Evaluation
❖ Microbenchmarks
▪ Partitioning Quality
❖ Real-World Workloads
▪ Breadth First Search (BFS)
▪ Single Source Shortest Path (SSSP)
❖ Scalability Test
▪ Scalability vs Graph Size
▪ Scalability vs # of Partitions
▪ Scalability vs Graph Size and # of Partitions
58
Real-World Workload: Setup
Cluster Configuration    MPICluster (FDR InfiniBand)        Gordon (QDR InfiniBand)
# of Nodes               32                                 1024
Network Topology         single switch (32 nodes/switch)    4*4*4 3D torus of switches (16 nodes/switch)
Network Bandwidth        56 Gbps                            8 Gbps

Node Configuration       MPICluster (Intel Haswell)         Gordon (Intel Sandy Bridge)
# of Sockets             2 (10 cores/socket)                2 (8 cores/socket)
L3 Cache                 25 MB                              20 MB
Memory Bandwidth         65 GB/s                            85 GB/s
59
Real-World Workload: System Bottleneck

Intra-Node Network Comm Cost = λ × Maximal Inter-Node Network Comm Cost (λ: degree of contention)

System Bottleneck: MPICluster is memory-bound (λ=1); Gordon is network-bound (λ=0).
60
Real-World Workload: Baselines

Balanced Graph (Re)Partitioning
❖ Partitioners (static graphs)
▪ Offline Methods (High Quality, Poor Scalability): Metis'95, ICA3PP'08, SoCC'12, TKDE'15, BigData'15
▪ Online Methods (Moderate Quality, High Scalability): DG/LDG'12
❖ Repartitioners (dynamic graphs)
▪ Offline Methods (High Quality, Poor Scalability): Parmetis'97
▪ Online Methods (Moderate~High Quality, High Scalability): CatchW'13, xdgp'13, Hermes'15, Mizan'13
❖ Architecture-Aware: Aragon'14, Paragon'16, Planar, uniPlanar

Initial Partitioner: DG
61
BFS Exec. Time on MPICluster (λ=1): Planar achieved up to 9x speedups

[Figure: speedups of the compared methods: 9x, 7.5x, 5.8x, 4.1x, 1.48x, 1.37x, 1x]

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 Partitions: three 20-core machines
62
BFS Comm Volume on MPICluster (λ=1): Planar had the lowest intra-node comm volume

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 60 Partitions: three 20-core machines

Reduction     Intra-Socket   Inter-Socket
DG            51%            38%
METIS         51%            36%
PARMETIS      47%            34%
uniPLANAR     44%            28%
ARAGON        4.3%           0.8%
PARAGON       5.2%           2.6%
63
BFS Exec. Time on Gordon (λ=0): Planar achieved up to 3.2x speedups

[Figure: speedups of the compared methods: 3.2x, 1.21x, 1.16x, 1.05x, 1x]

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 Partitions: three 16-core machines
64
BFS Comm. Volume on Gordon (λ=0): Planar had the lowest inter-node comm volume

[Figure: inter-node comm volume reductions: 51%, 25%, 11%, 0.1%]

★ as-skitter: |V| = 1.6M, |E| = 22M
★ 48 Partitions: three 16-core machines
65
Road Map
Introduction
Aragon
Paragon
Planar
Contention
Evaluation
Future Work
66
Argo: Architecture-Aware Graph Partitioning (05/2016~09/2016)

❖ Goal: make initial partitioning architecture-aware
❑ To further confirm the contention issue (experimentally)
o By collecting a set of low-level metrics
✓ (e.g., cache misses, TLB misses)
❑ Architecture-aware static graph partitioner
o For the initial partitioning step
67
Sargon: Skew-Resistant Graph Partitioning (05/2016~09/2016)

❖ Goal: make initial partitioning skew-resistant
❑ Workload characteristics
➢ Traversal-style graph workloads (e.g., BFS/SSSP)
✓ Not all vertices are always active
✓ A balanced partitioning of the entire graph ≠ even active vertex distribution
68
Sargon: Skew-Resistant Graph Partitioning (05/2016~09/2016)

❖ Goal: make initial partitioning skew-resistant
❑ Workload characteristics
➢ Traversal-style graph workloads (e.g., BFS/SSSP)
✓ Not all vertices are always active
✓ A balanced partitioning of the entire graph ≠ even active vertex distribution
❑ Graph structure characteristics
➢ Skewed vertex degree distribution (scale-free)
➢ A balanced partitioning of the entire graph ≠ even high-degree vertex distribution
69
Skew-Resistant Graph Repartitioning (09/2016~12/2016)

❖ Goal: make repartitioning skew-resistant
❑ Workload characteristics
➢ Traversal-style graph workloads (e.g., BFS/SSSP)
✓ Not all vertices are always active
✓ A balanced partitioning of the entire graph ≠ even active vertex distribution
❑ Graph structure characteristics
➢ Skewed vertex degree distribution (scale-free)
➢ A balanced partitioning of the entire graph ≠ even high-degree vertex distribution
70
Research Overview & Timeline
Timeline           Work                                                                   Progress
                   Aragon: Heterogeneity-Aware Graph Partition Refinement                 Completed [BigGraphs'14]
                   Paragon: Parallel Architecture-Aware Graph Partition Refinement        Completed [EDBT'16]
                   Planar: Parallel Lightweight Architecture-Aware Graph Repartitioning   Completed [ICDE'16]
05/2016~09/2016    Argo: Architecture-Aware Graph Partitioning                            Evaluation & Writing
05/2016~09/2016    Sargon: Skew-Resistant Graph Partitioning                              Evaluation & Writing
09/2016~12/2016    Skew-Resistant Graph Repartitioning                                    Algorithm Design
07/2016~03/2017    Thesis writing                                                         Ongoing
04/2017            Thesis defense
71
Thanks!
72