A Framework for Collective Personalized Communication
Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan
PPL@UIUC
Collective Communication
Collective communication is a performance impediment
Collective personalized communication
– All-to-all personalized communication (AAPC)
– Many-to-many personalized communication (MMPC)
Issues
Communication latencies are not scaling with bandwidth and processor speeds
– High software overhead (α)
– Motivates message combining
Synchronous operations (MPI_Alltoall) do not utilize the co-processor effectively
Performance metrics
– Completion time vs. compute overhead
AAPC
Each processor sends a distinct message to every other processor
High software overhead for small messages
Direct AAPC
– Cost = (P – 1) × (α + mβ)
– α is the total software overhead of sending a message
– β is the per-byte network overhead
– m is the size of the message
Optimizing AAPC
Direct AAPC is α-dominated
Message combining for small messages
– Reduces the total number of messages
– Messages sent along a virtual topology
– Multistage algorithm to send messages
– Groups of messages are combined and sent to intermediate processors, which then forward them to their final destinations
Virtual Topology: Mesh
Organize processors in a 2D (virtual) mesh
Phase 1: Processors send messages to row neighbors
– Message from (x1, y1) to (x2, y2) goes via (x1, y2)
Phase 2: Processors send messages to column neighbors
2 × (√P – 1) messages instead of P – 1
As compared to
T_direct = (P – 1) × (α + mβ)
T_mesh = 2 × (√P – 1) × (α + √P × mβ)
Virtual Topology: Hypercube
Dimensional exchange
T_hypercube = log(P) × (α + (P/2) × mβ)
Virtual Topology: 3d Grid
Messages exchanged along a 3d Grid
AAPC overhead:
T_3dgrid = 3 × (∛P – 1) × (α + P^(2/3) × mβ)
Imperfect Mesh
Holes are evenly distributed among the members of that column
T_mesh ≈ 2 × (√P – 1) × (α + √P × mβ)
(i, j) becomes (i % nrows(j), j)
Experiments
Benchmarks run on PSC Lemieux
– 750 quad Alpha nodes
– Connected by the QsNet Elan network
• The Elan NIC has a communication co-processor capable of asynchronous remote DMA
• The Elan network also performs better when fewer messages are sent and received
AAPC Times for a Small Message
[Figure: AAPC scalability — completion time (ms) vs. number of processors (16 to 2048) for a small message, comparing Lemieux Native MPI, Mesh, Direct, and 3d Grid]
AAPC on 1024 Processors of Lemieux
[Figure: AAPC completion time (ms, log scale) vs. message size (100 to 10000 bytes), comparing Mesh, Direct, Native MPI, Hypercube, and 3d Grid]
Case Study 1: Radix Sort
[Figure: Sort step time (s) on 1024 processors vs. message size (100B to 8KB), comparing Mesh and Direct]
AAPC Time (ms)
Size  Direct  Mesh
2KB   333     221
4KB   256     416
8KB   484     766
AAPC Processor Overhead
[Figure: Time (ms) vs. message size (0 to 10000 bytes), comparing Direct compute time, Mesh compute time, and Mesh completion time]
Compute Overhead: A New Metric
Strategies should also be evaluated on compute overhead
Asynchronous non-blocking primitives needed
– Compute overhead of the Mesh strategy is a small fraction of the total AAPC completion time
– A data-driven system like Charm++ will automatically support this
AMPI
Provides virtualization and other features of Charm++ to MPI programs
AMPI AAPC interface
– Split-phase interface

    MPI_Request req;
    int done = 0;
    MPI_Ialltoall(sndbuf, msg_size, MPI_CHAR,
                  recvbuf, msg_size, MPI_CHAR,
                  MPI_COMM_WORLD, &req);
    // User code
    while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        // Do computation
    }

Also recommended for other MPI implementations!
MMPC
Many (not all) processors send data to many other processors
– New metric δ to evaluate performance
• δ: number of messages a processor sends or receives
– Uniform MMPC: small variance in δ
– Non-uniform MMPC: large variance in δ
Uniform MMPC
T_direct = δ × (α + mβ)
T_hypercube = log(P) × α + (log(P)/2) × δ × mβ
T_grid = 3 × (∛P – 1) × α + 3 × δ × mβ
T_mesh = 2 × (√P – 1) × α + 2 × δ × mβ
δ is the degree of the communication graph
Case Study 2: Neighbor Send
Synthetic benchmark where processor i sends messages to processors {(i+1) % P, (i+2) % P, …, (i+δ) % P}
[Figure: Neighbor send finish time (log scale) on 2048 processors with small messages, vs. degree δ (100 to 1500), comparing Mesh, Direct, Native MPI, Hypercube, and 3d Grid]
Case Study 3: NAMD
[Figure: NAMD step time on 256, 512, and 1024 processors, comparing Mesh, Direct, and Native MPI]
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages
Related Work
The topologies described in this paper were also presented in Krishnan's Master's thesis, 1999
The Mesh and 3d Grid strategies are also presented in: C. Christara, X. Ding, and K. Jackson. An efficient transposition algorithm for distributed memory clusters. In 13th Annual International Symposium on High Performance Computing Systems and Applications, 1999
Summary
We present a non-blocking framework for collective personalized communication
– New performance metric
• AAPC compute time
MPI programs can make use of it through a split-phase interface
Future Work
Optimal strategy depends on (δ, P, m)
– Develop a learning framework using the principle of persistence
Physical topologies
– BlueGene!
Non-uniform MMPC
– Analysis and new strategies
Smart strategies for multiple simultaneous AAPCs over sections of processors
– Needed by ab initio molecular dynamics
Software available at http://charm.cs.uiuc.edu