Exploiting and modelling topology aware collectives -...

35
Introduction: Dense Linear Algebra 2.5D Performance Multicasts Modelling collectives Performance scaling Conclusions and future work Exploiting and modelling topology aware collectives Brian Van Straalen and Edgar Solomonik May 5, 2011 Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 1/ 28

Transcript of Exploiting and modelling topology aware collectives -...

Page 1: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Exploiting and modelling topology awarecollectives

Brian Van Straalen and Edgar Solomonik

May 5, 2011

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 1/ 28

Page 2: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Outline

1 Introduction: Dense Linear Algebra

2 2.5D Performance

3 MulticastsTopology-aware multicasts

4 Modelling collectivesAssumptions and terminologyModel derivationsVerification of multicast model

5 Performance scalingExascale architectureCollectives at exascale2.5D algoritms at exascale

6 Conclusions and future work

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 2/ 28

Page 3: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

”3D” Matrix Algorithms

Current standard parallel dense matrix algorithms are far fromoptimal for communication (Demmel)

PDGEMM and other BLAS 3 algorithms in ScaLAPACK useminimal memory (do not replicate the matrix elements).3D algorithms replicate the elements to achieve the maximumamount of reuse and minimize communication.

While optimal, there is an obvious limitation here...availablememory

2.5D algorithms achieve maximum reuse and minimizecommunication for any given amount of available memory.Possible to create optimal dense algorithms under machinemodels that have both data movement costs and local memorysize limitations.

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 3/ 28

Page 4: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

2.5D matrix multiplication

0

1

2

3

0 21 3

1

2

3

0

2 03 1

0

1

21 2 0

0

10 10

0

2

32 30

10 10

0

22

+

22

33

00

11

+3D (P=64, c=4)

2.5D (P=32, c=2)

2D (P=16, c=1)

B₁₂

B₂₂

B₃₂

B₄₂

A₃₂A₃₁ A₃₃ A₃₄=

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 4/ 28

Page 5: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Example: 2.5D LU factorization

2. Perform TRSMs to compute a panel of L and a panel of U.

1. Factorize A₀₀ andcommunicate L₀₀ and U₀₀among layers.

L₀₀

U₀₀

U₀₃

U₀₃

U₀₁

L₂₀L₃₀

L₁₀

3. Broadcast blocks so alllayers own the panelsof L and U.

(A)

(B)

4.Broadcast differentsubpanels within eachlayer.

5.Multiply subpanelson each layer.

6.Reduce (sum) thenext panels.*

U

L

7. Broadcast the panels and continue factorizing the Schur's complement...

* All layers always need to contribute to reductioneven if iteration done with subset of layers.

(C)(D)

U₀₀

U₀₀

L₀₀

L₀₀

U₀₀

L₀₀

2.5D algorithmspropogate dataalongdimensions...

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 5/ 28

Page 6: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Collectives in dense linear algebra

BLAS 3 dense algorithms perform line multicasts and linereductions

These arise naturally due to dimensional dependenciese.g. rank k updates, Schur complement updates, QR updates

ScaLAPACK algorithms map well to a 2D network torus

We show that novel 2.5D algorithms can map to a any torus

can exploit specialized line collectives

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 6/ 28

Page 7: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

2.5D MM with different collectives

0

0.5

1

1.5

2

P=2048, 2D

P=2048, 2.5D bnm

P=2048, 2.5D rect

P=16384, 2D

P=16384, 2.5D bnm

P=16384, 2.5D rect

Tim

e no

rmal

ized

by

2D

2.5D MM vs 2D MM on BG/P (log2(n3/P)=37)

compute timeidle time

communication time

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 7/ 28

Page 8: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

2.5D LU with different collectives

0

0.5

1

1.5

2

P=2048, 2D

P=2048, 2.5D bnm

P=2048, 2.5D rect

P=16384, 2D

P=16384, 2.5D bnm

P=16384, 2.5D rect

Tim

e no

rmal

ized

by

2D

2.5D LU vs 2D LU on BG/P (log2(n3/P)=37)

compute timeidle time

communication time

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 8/ 28

Page 9: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Topology-aware multicasts

Why the performance discrepancy in multicasts?

Cray machines use binomial multicastsForm spanning tree from a list of nodesRoute copies of message down each branchNetwork contention degrades utilization on a 3D torus

BG/P uses rectangular multicastsRequire network topology to be a k-ary n-cubeForm 2n edge-disjoint spanning trees

Route in different dimensional orderUse both directions of bidirectional network

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 9/ 28

Page 10: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Topology-aware multicasts

2D rectangular multicasts

root2D 4X4 Torus Spanning tree 1 Spanning tree 2

Spanning tree 3 Spanning tree 4 All 4 trees combined

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 10/ 28

Page 11: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Topology-aware multicasts

Another look at that first plot

Just how much better arerectangular algorithms onP = 4096 nodes?

Binomial collectives on XE6

1/30th of linkbandwidth

Rectangular collectives onBG/P

4.3X the link bandwidth

Over 120X improvementin efficiency!

128

256

512

1024

2048

4096

8192

8 64 512 4096B

andw

idth

(MB

/sec

)

#nodes

1 MB multicast on BG/P, Cray XT5, and Cray XE6

BG/PXE6XT5

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 11/ 28

Page 12: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

A basic model for topology-aware collectives

Aim to model scaling behavior of rectangular collectives

Compare with existing binomial collectives models

Assumptions

k-ary n-cube torus networkbroadcast root at corner of cubemessage is large (focus on bandwidth)LogP messaging model appliesDMA device and wormhole routing are used for messaging

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 12/ 28

Page 13: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Terminology

L physical network latencyo overhead of sending and receiving a messageg reciprocal of single-link bandwidthP number of processors.d dimension of network torusBn = 2d/g total node link bandwidthm message size in bytesγ flop rate of a nodeβ memory bandwidth of a node

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 13/ 28

Page 14: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Modelling the communication fabric

For large message size Rendezvous messaging is used

handshare is performed with an eager send (2o + L)receiver gets the data with a one-sided get (o + 2L)total cost tr = mr · g + 3o + 3L

The DMA engine can use all torus links simultaneously

L costs in each dimension can be overlappedSender overhead, o, cannot be overlapped

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 14/ 28

Page 15: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

A model for rectangular multicasts

tmcast = m/Bn + 2(d + 1) · o + 3L + d · P1/d · (2o + L)

Our multicast model consists of 3 terms

1 m/Bn, the bandwidth cost incurred at the root

2 2(d + 1) · o + 3L, the start-up overhead of setting up themulticasts in all dimensions

3 d · P1/d · (2o + L), the path overhead reflects the time for apacket to get from the root to the farthest destination node

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 15/ 28

Page 16: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

A model for rectangular reductions

tred = max[m/(8γ), 3m/β,m/Bn]+2(d+1)·o+3L+d ·P1/d ·(2o+L)

Any multicast tree can be inverted to produce a reduction tree

The reduction operator must be applied at each node

each node operates on 2m databoth the memory bandwidth and computation cost can beoverlapped

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 16/ 28

Page 17: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

A model for binomial multicasts

The root of the binomial tree sends the entire messagelog2(P) times

The setup overhead is overlapped with the path overhead

We assume no contention

tbnm = log2(P) · (m/Bn + 2o + L)

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 17/ 28

Page 18: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Model verification: one dimension

200

400

600

800

1000

1 8 64 512 4096 32768 262144

Ban

dwid

th (M

B/s

ec)

msg size (KB)

DCMF Broadcast on a ring of 8 nodes of BG/P

trect modelDCMF rectangle dput

tbnm modelDCMF binomial

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 18/ 28

Page 19: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Model verification: two dimensions

0

500

1000

1500

2000

1 8 64 512 4096 32768 262144

Ban

dwid

th (M

B/s

ec)

msg size (KB)

DCMF Broadcast on 64 (8x8) nodes of BG/P

trect modelDCMF rectangle dput

tbnm modelDCMF binomial

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 19/ 28

Page 20: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Model verification: three dimensions

0

500

1000

1500

2000

2500

3000

1 8 64 512 4096 32768 262144

Ban

dwid

th (M

B/s

ec)

msg size (KB)

DCMF Broadcast on 512 (8x8x8) nodes of BG/P

trect modelKumar et al data

DCMF rectangle dputtbnm model

DCMF binomial

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 20/ 28

Page 21: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Rectangular reduction performance on BG/P

50

100

150

200

250

300

350

400

8 64 512

Ban

dwid

th (M

B/s

ec)

#nodes

DCMF Reduce peak bandwidth (largest message size)

torus rectangle ringtorus binomial

torus rectangletorus short binomial

BG/P rectangular reduction performs significantly worse thanmulticast

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 21/ 28

Page 22: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

A custom 1D ring reduction implementation

Implemented with near-neighbor MPI Sends

high overhead due to re-establishment of handshakeslarge packet size required

Reduction must be software pipelined to achieve overlap

First version achieves overlap via interrupts

Second version achieves overlap via explictly managedOpenMP threads

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 22/ 28

Page 23: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Assumptions and terminologyModel derivationsVerification of multicast model

Performance of custom line reduction

100

200

300

400

500

600

700

800

512 4096 32768

Ban

dwid

th (M

B/s

ec)

msg size (KB)

Performance of custom Reduce/Multicast on 8 nodes

MPI BroadcastCustom Ring Multicast

Custom Ring Reduce 2Custom Ring Reduce 1

MPI Reduce

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 23/ 28

Page 24: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Exascale architectureCollectives at exascale2.5D algoritms at exascale

A hypothetical exascale architecture

Total flop rate (γ) 1018 flop/sTotal memory 32 PBNode count (P) 262,144Node interconnect bandwidth (β) 100 GB/sLatency overhead (o) 250 nsNetwork latency (L) 250 nsTopology 3D Torus

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 24/ 28

Page 25: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Exascale architectureCollectives at exascale2.5D algoritms at exascale

Performance of collectives at exascale

0

20

40

60

80

100

120

140

1 8 64 512 4096 32768 262144

Ban

dwid

th (G

B/s

ec)

msg size (MB)

Exascale broadcast performance

trect 3Dtrect 2Dtrect 1Dtbnm 3Dtbnm 2Dtbnm 1D

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 25/ 28

Page 26: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Exascale architectureCollectives at exascale2.5D algoritms at exascale

Performance of matrix multiplication at exascale

0.4

0.5

0.6

0.7

0.8

0.9

1

1 8 64

Par

alle

l effi

cien

cy

z dimension of partition

MM strong scaling at exascale (xy plane to full xyz torus)

2.5D with rectangular (c=z)2.5D with binomial (c=z)

2D with binomial

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 26/ 28

Page 27: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Exascale architectureCollectives at exascale2.5D algoritms at exascale

Performance of LU at exascale

0

0.2

0.4

0.6

0.8

1

1 8 64

Par

alle

l effi

cien

cy

z dimension of partition

LU strong scaling at exascale (xy plane to full xyz torus)

2.5D with rectangular (c=z)2.5D with binomial (c=z)

2D LU with binomial

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 27/ 28

Page 28: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Introduction: Dense Linear Algebra2.5D Performance

MulticastsModelling collectivesPerformance scaling

Conclusions and future work

Conclusions and future work

2.5D algorithms perform well by using rectangular collectives

implement pivoting into 2.5D LUextend 2.5D algorithms to SUMMA algorithm, QR, etc.incorporate 2.5D algorithms into applications (e.g. quantumchemistry)

Collectives models show rectangular algorithms scalingsuperiority

model contention within a multicast treemodel contention among concurrent multicast trees

Topology-awareness and rectangular collectives are importantat exascale

analyze different exascale network topologies

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 28/ 28

Page 29: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Backup slides

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 29/ 28

Page 30: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Insight from BLAS 3 communication lower bounds

Given local memory of size M, each processor communicates

W = Ω

(n3/P√

M

)words.

2D algorithms: M = O(n2/P)

Standard algorithms (ScaLAPACK)Use single copy of input matrices

Move O(

n2√P

)words

2.5D algorithms: M = O(c · n2/P)

Use c copies of input matrices

Move O(

n2√c·P

)words

Less communication via redundant memory usage

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 30/ 28

Page 31: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Virtual topology of 2.5D algorithms

(p/c)-1

i

j

k

0

00

(p/c) -11/2

1/2

c-1

2D algorithm mapping:(√

P)×(√

P)

grid

2.5D algorithm mapping:(√

P/c)×(√

P/c)× c grid for any c

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 31/ 28

Page 32: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

2.5D matrix multiplication performance

0

10

20

30

40

50

60

70

80

25 31 37

Per

cent

age

of p

eak

log(n3/P)

2.5D MM on BG/P

P=2048, 2D MMP=2048, 2.5D MM

P=16384, 2D MMP=16384, 2.5D MM

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 32/ 28

Page 33: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

2.5D LU performance

0

5

10

15

20

25

30

35

25 31 37

Per

cent

age

of p

eak

log(n3/P)

2.5D LU on BG/P

P=2048, 2D LUP=2048, 2.5D LU

P=16384, 2D LUP=16384, 2.5D LU

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 33/ 28

Page 34: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Reduction in total communication time

0

20

40

60

80

100

25 31 37

Red

uctio

n in

com

mun

icat

ion

time

(%)

log(n3/P)

2.5D vs 2D communication reduction on BG/P

P=2048, MMP=2048, LU

P=16384, MMP=16384, LU

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 34/ 28

Page 35: Exploiting and modelling topology aware collectives - Peoplekubitron/courses/cs252-S11/projects/... · k-ary n-cube torus network broadcast root at corner of cube message is large

Performance analysis

2.5D algorithms map precisely to BG/P partitions

allows usage of rectangular collectivesreduces network contention

Communication time reduced substantially (up to 92%)

mostly due to rectangular collectives

Idle time also reduced in LU

communication along critical path reduced

So rectangular collectives can indeed be used by higher levelalgorithms

Now lets take a closer look at rectangular collectives ondifferent/future architectures

Brian Van Straalen and Edgar Solomonik Exploiting and modelling topology aware collectives 35/ 28