A Framework for Collective Personalized Communication
Laxmikant V. Kale, Sameer Kumar, Krishnan Varadarajan
PPL@UIUC
Collective Communication
Collective communication is a performance impediment
Collective personalized communication
– All-to-all personalized communication (AAPC)
– Many-to-many personalized communication (MMPC)
Issues
Communication latencies are not scaling with bandwidth and processor speeds
– High software overhead (α)
– Motivates message combining
Synchronous operations (MPI_Alltoall) do not utilize the co-processor effectively
Performance metrics
– Completion time vs. compute overhead
AAPC
Each processor sends a distinct message to every other processor
High software overhead for small messages
Direct AAPC
– Cost = (P – 1) × (α + mβ)
– α is the total software overhead of sending a message
– β is the per-byte network overhead
– m is the size of the message
Optimizing AAPC
Direct AAPC is α-dominated
Message combining for small messages
– Reduces the total number of messages
– Messages sent along a virtual topology
– Multistage algorithm to send messages
– Groups of messages are combined and sent to intermediate processors, which then forward them to their final destinations
Virtual Topology: Mesh
Organize processors in a 2D (virtual) mesh
Phase 1: Processors send messages to row neighbors
– Message from (x1, y1) to (x2, y2) goes via (x1, y2)
Phase 2: Processors send messages to column neighbors
2 × (√P – 1) messages instead of P – 1
As compared to
T_direct = (P – 1) × (α + mβ)
T_mesh = 2 × (√P – 1) × (α + √P × mβ)
Virtual Topology: Hypercube
Dimensional exchange
T_hypercube = log(P) × (α + (P/2) × mβ)
Virtual Topology: 3d Grid
Messages exchanged along a 3d Grid
AAPC overhead:
T_3dgrid = 3 × (∛P – 1) × (α + P^(2/3) × mβ)
Imperfect Mesh
Holes are evenly distributed among the members of that column
T_mesh ≈ 2 × (√P – 1) × (α + √P × mβ)
(i, j) becomes (i % nrows(j), j)
Experiments
Benchmarks run on PSC Lemieux
– 750 quad Alpha nodes
– Connected by the QsNet Elan network
• The Elan NIC has a communication co-processor capable of asynchronous remote DMA
• The Elan network also performs better when fewer messages are sent and received
AAPC Times for a Small Message
[Figure: AAPC scalability — completion time (ms) vs. number of processors (16 to 2048) for a small message, comparing Lemieux Native MPI, Mesh, Direct, and 3d Grid]
AAPC on 1024 Processors of Lemieux
[Figure: AAPC completion time (ms, log scale) vs. message size (100 to 10000 bytes), comparing Mesh, Direct, Native MPI, Hypercube, and 3d Grid]
Case Study 1: Radix Sort
[Figure: Sort step time (s) on 1024 processors vs. message size (100B to 8KB), comparing Mesh and Direct]
AAPC Time (ms)
Size  Direct  Mesh
2KB   333     221
4KB   256     416
8KB   484     766
AAPC Processor Overhead
[Figure: Time (ms) vs. message size (0 to 10000 bytes), comparing Direct compute time, Mesh compute time, and Mesh completion time]
Compute Overhead: A New Metric
Strategies should also be evaluated on compute overhead
Asynchronous non-blocking primitives needed
– Compute overhead of the Mesh strategy is a small fraction of the total AAPC completion time
– A data-driven system like Charm++ will automatically support this
AMPI
Provides virtualization and other features of Charm++ to MPI programs
AMPI AAPC interface
– Split-phase interface

    MPI_Request req;
    int done = 0;
    MPI_Ialltoall(sndbuf, msg_size, MPI_CHAR,
                  recvbuf, msg_size, MPI_CHAR,
                  MPI_COMM_WORLD, &req);
    // User code
    while (!done) {
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
        // Do computation
    }

Also recommended for other MPI implementations!
MMPC
Many (not all) processors send data to many other processors
– New metric δ to evaluate performance
• δ: number of messages a processor sends or receives
– Uniform MMPC: small variance in δ
– Non-uniform MMPC: large variance in δ
Uniform MMPC
T_direct = δ × (α + mβ)
T_hypercube = log(P) × α + (log(P)/2) × δ × mβ
T_grid = 3 × (∛P – 1) × α + 3 × δ × mβ
T_mesh = 2 × (√P – 1) × α + 2 × δ × mβ
δ is the degree of the communication graph
Case Study 2: Neighbor Send
Synthetic benchmark where processor i sends messages to processors {(i+1) % P, (i+2) % P, …, (i+δ) % P}
[Figure: Neighbor send finish time (log scale) on 2048 processors with small messages, vs. degree δ (100 to 1500), comparing Mesh, Direct, Native MPI, Hypercube, and 3d Grid]
Case Study 3: NAMD
[Figure: NAMD step time on 256, 512, and 1024 processors, comparing Mesh, Direct, and Native MPI]
Performance of NAMD with the ATPase molecule. The PME step in NAMD involves a 192 × 144 processor collective operation with 900-byte messages
Related Work
The topologies described in this paper were also presented in Krishnan's Master's thesis, 1999
The Mesh and 3d Grid strategies are also presented in: C. Christara, X. Ding, and K. Jackson. An efficient transposition algorithm for distributed memory clusters. In 13th Annual International Symposium on High Performance Computing Systems and Applications, 1999
Summary
We present a non-blocking framework for collective personalized communication
– New performance metric
• AAPC compute time
MPI programs can make use of it through a split-phase interface
Future Work
Optimal strategy depends on (δ, P, m)
– Develop a learning framework using the principle of persistence
Physical topologies
– BlueGene!
Non-uniform MMPC
– Analysis and new strategies
Smart strategies for multiple simultaneous AAPCs over sections of processors
– Needed by ab initio molecular dynamics
Software available at http://charm.cs.uiuc.edu