A Coherence Protocol for Optimizing Global Shared Data Accesses
Jeeva Paudel, University of Alberta, Canada
J. Nelson Amaral, University of Alberta, Canada
Olivier Tardieu, IBM T. J. Watson, USA
Shared Variables are Fundamental Abstractions in Parallel and Distributed Programming
[Figure: Node 1 … Node N reading and writing shared variables. Example applications: Monte Carlo estimation of π, 5-point stencil operations, Turing ring simulation.]
Challenge: Minimize Communication Latency
[Figure: ghost cell pattern for data sharing between Node 1 and Node 2.]
Two-sided Message: message id, data payload
One-sided Message: address, data payload
Remote Direct Memory Access (RDMA)
[Figure: network interface, host CPU, memory.]
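The difference between the two message kinds can be sketched in a few lines. This is a toy single-process model, not a real network API: `Node`, `send`, `recv`, `put`, and `get` are hypothetical names, and a Python list stands in for remotely accessible memory. A two-sided message carries a message id and has to be matched by a receive on the target; a one-sided message carries an address and is applied directly to the target's memory, which is the access style RDMA supports.

```python
from collections import deque


class Node:
    """Toy node: a block of memory plus an inbox for two-sided messages."""

    def __init__(self, size):
        self.memory = [0] * size
        self.inbox = deque()

    # Two-sided: the sender enqueues (message id, payload); the target must
    # call recv() to match the message before the data is usable.
    def send(self, target, message_id, payload):
        target.inbox.append((message_id, payload))

    def recv(self, message_id):
        for entry in list(self.inbox):
            if entry[0] == message_id:
                self.inbox.remove(entry)
                return entry[1]
        return None  # no matching message has arrived

    # One-sided: the initiator names an address in the target's memory;
    # the target runs no matching code.
    def put(self, target, address, payload):
        target.memory[address] = payload

    def get(self, target, address):
        return target.memory[address]


node1, node2 = Node(4), Node(4)

node1.send(node2, message_id=7, payload=-1)
received = node2.recv(7)            # the target participates

node1.put(node2, address=2, payload=4)
fetched = node1.get(node2, 2)       # the target does not participate
```

In the one-sided case only the initiator executes code, which is why the host CPU of the target can stay out of the transfer.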
Communication Optimization Techniques
Transfer Referencing Task to SV Home: atomic at (p) sv();
Remote Task Creation for SV Access: atomic at (p) async sv();
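The control flow of the two idioms above can be imitated in plain Python. This is only an analogy under stated assumptions: `Place` is a hypothetical class backed by a single worker thread, `at` blocks until the task has run at the place that holds the data, and `at_async` returns a future instead. Running every access on the home place's single worker is what stands in for `atomic` here; X10's actual semantics are richer.

```python
from concurrent.futures import ThreadPoolExecutor


class Place:
    """Hypothetical stand-in for a place: one worker thread plus local data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        # A single worker serializes all tasks that run at this place.
        self._worker = ThreadPoolExecutor(max_workers=1)

    def at(self, task):
        """Run task at this place and wait for its result."""
        return self._worker.submit(task, self).result()

    def at_async(self, task):
        """Start task at this place and return a future immediately."""
        return self._worker.submit(task, self)

    def shutdown(self):
        self._worker.shutdown(wait=True)


def increment(place):
    # Executes where the shared variable lives, so the SV itself never moves.
    place.data["sv"] = place.data.get("sv", 0) + 1
    return place.data["sv"]


home = Place("p")
first = home.at(increment)                               # caller blocks
pending = [home.at_async(increment) for _ in range(3)]   # caller continues
results = [future.result() for future in pending]
home.shutdown()
```

In both variants the task travels to the data; the only difference sketched here is whether the referencing task waits for completion.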
Write-Once / Read-Mostly: Replication
Result Object
Collecting Sum Reducer
[Figures: each pattern shown across Node 1 … Node N.]
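A collecting sum reducer can be sketched as follows. The class and method names are hypothetical, and nodes are modeled as integer ids inside one process: each node accumulates into a node-local partial sum, and the partial sums are combined once at the end, so frequent remote writes to a single shared variable are avoided.

```python
class CollectingSumReducer:
    """Per-node partial sums combined on demand (illustrative names)."""

    def __init__(self, num_nodes):
        self.partial = [0] * num_nodes   # one local accumulator per node
        self.remote_messages = 0         # counts simulated cross-node transfers

    def add(self, node, value):
        # Purely local update: no communication.
        self.partial[node] += value

    def collect(self, home=0):
        # One transfer per non-home node, regardless of how many adds occurred.
        total = 0
        for node, value in enumerate(self.partial):
            if node != home:
                self.remote_messages += 1
            total += value
        return total


reducer = CollectingSumReducer(num_nodes=4)
for step in range(1000):
    reducer.add(step % 4, 1)
total = reducer.collect()
```

With 1000 updates spread over four nodes, the sketch performs three simulated transfers instead of one per remote update.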
General Read-Write
[Figure: SV state held at each node.]
Coordinate Multiple Protocols for Different Access Patterns
A single static data management scheme may not improve performance across varied data access patterns.
Composite Protocols
1. One-sided PUT/GET to SV home
2. Migrate Referencing Task to SV home
3. Directory-based Protocol
Directory Entries
〈Shared Variable (SV)〉 ↔ directory entry: a dirty bit followed by n bits, one per node
0 | 0 0 0 ⋅⋅⋅ 0
0 | 1 0 0 ⋅⋅⋅ 0   SV is only in its allocated node
1 | 0 0 1 ⋅⋅⋅ 0   Only one node can have a dirty SV
0 | 0 1 1 ⋅⋅⋅ 1   Multiple nodes may have clean SVs
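The directory entry layout can be written down as a small data structure. This is a minimal sketch, not the authors' implementation: the class and method names are hypothetical, and dropping all other presence bits when a copy becomes dirty is an assumption drawn from the rule that only one node can have a dirty SV.

```python
class DirectoryEntry:
    """One entry per shared variable: a dirty bit and n presence bits."""

    def __init__(self, num_nodes, home):
        self.dirty = False
        self.present = [False] * num_nodes
        self.present[home] = True        # SV starts only in its allocated node

    def sharers(self):
        return [node for node, bit in enumerate(self.present) if bit]

    def add_clean_copy(self, node):
        # Multiple nodes may hold clean copies, but never next to a dirty one.
        if self.dirty:
            raise ValueError("write back the dirty copy before sharing")
        self.present[node] = True

    def make_dirty(self, owner):
        # Assumption: every other presence bit is cleared when owner writes.
        self.present = [node == owner for node in range(len(self.present))]
        self.dirty = True

    def bits(self):
        # Dirty bit first, then one presence bit per node.
        return " ".join(str(int(b)) for b in [self.dirty] + self.present)


entry = DirectoryEntry(num_nodes=4, home=0)
initial = entry.bits()       # SV only in its allocated node
entry.add_clean_copy(2)
entry.add_clean_copy(3)
shared = entry.bits()        # several clean copies
entry.make_dirty(2)
exclusive = entry.bits()     # a single dirty copy
```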
[Figure: nodes, each holding SV state, connected through a network.]
Home node for SV: the node where SV is allocated
Remote node for SV: a node whose memory does not store SV
Example: Home node is j
Read/Write activity at node i
Read Miss at node i
Request: node i → home node j

Case 1: SV is in node j in clean state.
Directory entry: 0 0 1 0 ⋅⋅⋅ 0 0 (dirty bit clear; bit j set)
Data Copy: node j → node i
Directory entry: 0 0 1 1 ⋅⋅⋅ 0 0 (bits i and j set)

Case 2: SV is copied in node j and is in dirty state.
Directory entry: 1 0 1 0 ⋅⋅⋅ 0 0 (dirty bit set; bit j set)
Write back
Directory entry: 0 0 1 0 ⋅⋅⋅ 0 0 (dirty bit cleared)
SV Copy: node j → node i
Directory entry: 0 0 1 1 ⋅⋅⋅ 0 0 (bits i and j set)
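The two read-miss cases can be traced with a short simulation. It is a sketch under stated assumptions rather than the protocol's actual code: `Home`, `write_at_home`, and `read_miss` are hypothetical names, the whole system runs in one process, and the dirty copy is modeled as a separate value that has to be written back before the home node answers the request.

```python
class Home:
    """Home node j for one SV: directory bits plus the stored value."""

    def __init__(self, num_nodes, j, value):
        self.j = j
        self.dirty = False
        self.present = [node == j for node in range(num_nodes)]
        self.stored = value        # value held in home memory
        self.modified = value      # value of the dirty copy, when dirty
        self.log = []

    def write_at_home(self, value):
        # Hypothetical helper that sets up Case 2: the copy in node j is dirty.
        self.modified = value
        self.dirty = True
        self.present = [node == self.j for node in range(len(self.present))]

    def read_miss(self, i):
        self.log.append("request")
        if self.dirty:                     # Case 2: write back first
            self.stored = self.modified
            self.dirty = False
            self.log.append("write back")
        self.present[i] = True             # node i now holds a clean copy
        self.log.append("copy")
        return self.stored

    def bits(self):
        # Dirty bit first, then one presence bit per node.
        return " ".join(str(int(b)) for b in [self.dirty] + self.present)


# Case 1: clean at home node j = 1, read miss at node i = 2.
clean = Home(num_nodes=4, j=1, value=10)
clean_before = clean.bits()
clean_value = clean.read_miss(2)
clean_after = clean.bits()

# Case 2: dirty at home node j = 1, read miss at node i = 2.
dirty = Home(num_nodes=4, j=1, value=10)
dirty.write_at_home(42)
dirty_before = dirty.bits()
dirty_value = dirty.read_miss(2)
dirty_after = dirty.bits()
```

Both cases end in the same directory state: the dirty bit is clear and the presence bits for i and j are set.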
Performance Evaluation
What do We Want in Benchmarks?
Communication Patterns
Test Composite Protocols
Data Structures / Granularities
Performance Comparison
Best Hand Coded Versions
• X10’s Shared Memory Protocol (X10-Mem)
• Directory-based Protocol (GR-Mem)
• Combination (X10-Mem/GR-Mem)
Code- and Data-Layout Restructurings
Patterns of Shared Variable Accesses
A Read-mostly: replicate node-local copies to reduce remote accesses
B Write-mostly: leave intact; localize write accesses to the site of allocation
C Aggregate data: refactor into individual objects for element-wise access to reduce false sharing
D Write-following-read from each place: use a collecting sum reducer to reduce frequent remote writes
E Write-once: replicate node-local copies to reduce remote accesses
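Restructurings A and E both come down to replication, which can be sketched as below. The names are hypothetical and nodes are integer ids in a single process; the counter only illustrates that after one transfer per replica, every read is served from a node-local copy.

```python
class ReplicatedValue:
    """Write-once value with a node-local copy on every node (illustrative)."""

    def __init__(self, num_nodes):
        self.copies = [None] * num_nodes
        self.written = False
        self.remote_transfers = 0

    def write_once(self, writer, value):
        if self.written:
            raise RuntimeError("value was already written")
        for node in range(len(self.copies)):
            if node != writer:
                self.remote_transfers += 1   # one transfer per remote replica
            self.copies[node] = value
        self.written = True

    def read(self, node):
        # Served from the node-local copy; no remote access.
        return self.copies[node]


replicated = ReplicatedValue(num_nodes=4)
replicated.write_once(writer=0, value=3.5)
reads = [replicated.read(node) for node in range(4) for _ in range(100)]
transfers = replicated.remote_transfers
```

Without replication, the 300 reads issued by nodes other than the writer would each be a remote access; here the sketch counts three transfers in total.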
Code Restructurings in Hand-coded Versions
Benchmarks | Code Restructurings (A B C D E)
FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔ ✔ ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG) ✔ ✔ ✔
Delaunay Mesh Refinement (DMR) ✔ ✔ ✔
Platform
Heldar (Opteron)
No. nodes: 16 | Cores per node: 8 | Memory per node: 8 GB
• CentOS Linux 6.0
• 1 Node = 2 HyperTransport connected CPUs
• QuadCore Opteron Processors
[Figure: Speedup Over Sequential, using 128 workers, for FSSimpleDist, K-Means, MontePiDist, N-Body, Jacobi, RayTracer, UTS, LinearRegression, DMG, and DMR. Y-axis from 0 to 90. Series: X10-Mem, Manual, GR-Mem, and GR-Mem/X10-Mem.]
Write-Once / Read-Mostly: Replication
Result Object
Collecting Sum Reducer
1. One-sided PUT/GET to GR home
2. Migrate Referencing Task to GR home
3. Directory-based Protocol
[Figure: K-Means speedup with 128 workers; values shown: 68 and 79.]
Benchmarks | Code Restructurings (A B C D E)
FSSimpleDist ✔ ✔
K-Means ✔
MontePiDist ✔
N-Body ✔
Jacobi ✔
RayTracer ✔
Unbalanced Tree Search ✔ ✔ ✔
Linear Regression ✔
Delaunay Mesh Generation (DMG) ✔
Delaunay Mesh Refinement (DMR) ✔ ✔ ✔
Applicable to (A)PGAS Languages: Chapel, Fortress
Questions?