Pipelined Broadcast on Ethernet Switched Clusters Pitch Patarasuk, Ahmad Faraj, Xin Yuan Department...

Post on 18-Jan-2018


Pipelined Broadcast on Ethernet Switched Clusters

Pitch Patarasuk, Ahmad Faraj, Xin Yuan

Department of Computer Science
Florida State University
Tallahassee, FL 32306

Broadcast communication (MPI_Bcast)

Before: n0 holds A B C D; n1, n2, and n3 hold nothing.

After: n0, n1, n2, and n3 each hold A B C D.

Let T(msize) = time to send a message of size msize.

Broadcast(msize) >= T(msize)

Ethernet Switched Cluster

[Figure: example cluster built from four interconnected switches]

Problem statement: How to efficiently realize the broadcast operation with large message sizes on Ethernet switched clusters.

Pipelined broadcast can achieve near-optimal results (T(msize) time for broadcasting a message of size msize). It requires:

• Finding a contention-free broadcast tree
• Finding a good segment size

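Traditional Broadcast algorithms

• Linear tree: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7

Time = (P-1) x T(msize)

• Flat tree: root 0 sends directly to 1, 2, 3, 4, 5, 6, 7

Time = (P-1) x T(msize)

• Binary tree

[Figure: 8-node binary tree with levels 0 | 1 2 | 3 4 5 6 | 7]

Time = 2 x (log2(P+1) - 1) x T(msize)

• k-ary tree

[Figure: 8-node 3-ary tree with levels 0 | 1 2 3 | 4 5 6 7]

• Binomial tree

[Figure: 8-node binomial tree rooted at 0]

Time = log2(P) x T(msize)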

• Scatter/Allgather

Before: n0 holds A B C D.

After scatter: n0 holds A, n1 holds B, n2 holds C, n3 holds D.

After allgather: n0, n1, n2, and n3 each hold A B C D.

Time = 2 x T(msize)
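The scatter/allgather scheme above can be sketched as a small simulation. This is only an illustration: the slides do not say which allgather algorithm is used, so a ring allgather is assumed here, and the function name `scatter_allgather_broadcast` is hypothetical.

```python
def scatter_allgather_broadcast(message, num_nodes):
    """Simulate broadcast as a scatter followed by a ring allgather.

    Assumption: the allgather is a ring; the slides do not specify one.
    """
    chunk = -(-len(message) // num_nodes)  # ceiling division
    pieces = [message[i * chunk:(i + 1) * chunk] for i in range(num_nodes)]

    # Scatter: the root (node 0) hands piece i to node i.
    held = [{i: pieces[i]} for i in range(num_nodes)]

    # Ring allgather: in each of the P-1 steps, every node forwards the piece
    # it obtained most recently to its right-hand neighbour.
    for step in range(num_nodes - 1):
        sends = []
        for node in range(num_nodes):
            idx = (node - step) % num_nodes
            sends.append(((node + 1) % num_nodes, idx, held[node][idx]))
        for dest, idx, data in sends:
            held[dest][idx] = data

    # Every node reassembles the full message from its pieces.
    return ["".join(h[i] for i in range(num_nodes)) for h in held]


result = scatter_allgather_broadcast("ABCD", 4)
print(result)
```

Each node sends and receives roughly msize x (P-1)/P bytes in each of the two phases, which is where the 2 x T(msize) estimate comes from.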

Time Complexity for large messages

Algorithm            Time
Linear tree          (P-1) x T(msize)
Flat tree            (P-1) x T(msize)
Binary tree          2 x (log2(P+1) - 1) x T(msize), approx. 2 x log2(P) x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)
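To get a feel for how these expressions compare, they can be evaluated under a simple cost model. The model below, T(msize) = latency + msize / bandwidth, and its default values are assumptions for illustration only; they are not measurements from the talk.

```python
import math


def t_send(msize, latency=1e-4, bandwidth=12.5e6):
    """Hypothetical cost model: seconds to send msize bytes point to point."""
    return latency + msize / bandwidth


def broadcast_times(p, msize):
    """Evaluate the large-message time expressions for P = p machines."""
    t = t_send(msize)
    return {
        "linear tree": (p - 1) * t,
        "flat tree": (p - 1) * t,
        "binary tree": 2 * (math.log2(p + 1) - 1) * t,
        "binomial tree": math.log2(p) * t,
        "scatter/allgather": 2 * t,
    }


times = broadcast_times(p=16, msize=1 << 20)
for name, seconds in times.items():
    print(f"{name:18s} {seconds:.3f} s")
```

Because every expression is a multiple of T(msize), the ranking depends only on P, not on the assumed latency or bandwidth.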

Pipelined Broadcast Algorithm

Linear pipeline: 0 -> 1 -> 2 -> 3

Performance of pipelined broadcast:

• Assume no network contention.
• A message of size msize is broken into X segments of size msize/X.
• H: tree height; D: the number of children per node.

Size of a pipeline stage: D x T(msize/X)

Total time T: (X + H - 1) x (D x T(msize/X))

When X is much larger than H and the per-segment overhead is small, X x T(msize/X) is approximately T(msize), so:

• Linear tree: H = P, D = 1, T approx. T(msize)
• Binary tree: H = log2(P), D = 2, T approx. 2 x T(msize)
• k-ary tree: H = log_k(P), D = k; in general not as efficient as a binary tree.
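The total-time expression above can be evaluated directly. The sketch below again assumes a hypothetical linear cost model, T(msize) = latency + msize / bandwidth, with made-up parameter values; the function names are illustrative.

```python
def t_send(msize, latency=1e-4, bandwidth=12.5e6):
    """Hypothetical cost model: seconds to send msize bytes point to point."""
    return latency + msize / bandwidth


def pipelined_time(msize, segments, height, children):
    """Total time (X + H - 1) x (D x T(msize / X)), assuming no contention."""
    stage = children * t_send(msize / segments)
    return (segments + height - 1) * stage


msize = 1 << 20   # 1 MiB message
p = 16            # number of machines; a linear tree has H = P, D = 1
baseline = t_send(msize)

ratios = []
for x in (1, 16, 256):
    linear = pipelined_time(msize, x, height=p, children=1)
    ratios.append(linear / baseline)
    print(f"X = {x:4d}: linear pipeline takes {linear / baseline:.2f} x T(msize)")
```

With X = 1 there is no pipelining and the expression reduces to H x T(msize); as X grows, the ratio moves toward 1 until per-segment latency starts to matter.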

Time Complexity for large messages

Algorithm            Time
Pipelined (linear)   T(msize)
Pipelined (binary)   2 x T(msize)
k-ary pipeline       k x T(msize)
Binomial tree        log2(P) x T(msize)
Scatter/allgather    2 x T(msize)

Pipelined broadcast

• How to find a contention-free broadcast tree?
• How to select the best segment size?

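Example of network contention

• Binary tree

[Figure: 8-node binary tree with levels 0 | 1 2 | 3 4 5 6 | 7]

Left switch: n0, n1, n2, n3
Right switch: n4, n5, n6, n7

There is link contention caused by communications (1 -> 4), (2 -> 5), (2 -> 6), and (3 -> 7): all four cross the inter-switch link in the same direction.

• Linear tree

Left switch: n0, n1, n4, n5
Right switch: n2, n3, n6, n7

The linear tree 0 -> 1 -> 2 -> 3 -> ... -> 7 will have contention caused by (1 -> 2) and (5 -> 6), which both cross the inter-switch link in the same direction.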
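The two examples above can be checked mechanically. The sketch below is an assumption-laden illustration, not the authors' tool: it assumes the switches form a tree, that links are full duplex (so only same-direction sharing counts as contention), and it ignores machine-to-switch links. The names `switch_path` and `contended_links` are hypothetical.

```python
from collections import deque


def switch_path(links, src, dst):
    """Return the directed switch-to-switch hops from src to dst (BFS)."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            break
        for nxt in links.get(cur, ()):
            if nxt not in parent:
                parent[nxt] = cur
                queue.append(nxt)
    hops = []
    cur = dst
    while parent[cur] is not None:
        hops.append((parent[cur], cur))
        cur = parent[cur]
    return hops[::-1]


def contended_links(links, switch_of, tree_edges):
    """Map each directed inter-switch link to the tree edges sharing it."""
    users = {}
    for sender, receiver in tree_edges:
        for hop in switch_path(links, switch_of[sender], switch_of[receiver]):
            users.setdefault(hop, []).append((sender, receiver))
    return {hop: edges for hop, edges in users.items() if len(edges) > 1}


links = {"left": ["right"], "right": ["left"]}

# Binary tree example: n0..n3 on the left switch, n4..n7 on the right.
binary_switch = {n: "left" if n < 4 else "right" for n in range(8)}
binary_edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6), (3, 7)]
binary_result = contended_links(links, binary_switch, binary_edges)

# Linear tree example: n0, n1, n4, n5 on the left; n2, n3, n6, n7 on the right.
linear_switch = {n: "left" if n in (0, 1, 4, 5) else "right" for n in range(8)}
linear_edges = [(i, i + 1) for i in range(7)]
linear_result = contended_links(links, linear_switch, linear_edges)

print(binary_result)
print(linear_result)
```

In the linear case the edge (3 -> 4) also crosses between switches, but in the opposite direction, so under the full-duplex assumption it does not contend with the other two.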

Algorithm for constructing contention free linear tree

Step 1: Traverse all switches using the depth-first search (DFS) algorithm, and name the switches S0, S1, S2, ... in the order in which DFS reaches them.

Step 2: The linear tree consists of all machines in switch S0, followed by all machines in S1, then S2, and so on.

Example of contention free linear tree

Switch S0: n0, n1, n4, n5
Switch S1: n2, n3, n6, n7
Switch S2: n8, n9, n10, n11
Switch S3: n12, n13, n14, n15

Linear tree: n0 -> n1 -> n4 -> n5 -> n2 -> n3 -> n6 -> n7 -> n8 -> n9 -> ... -> n15
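The two-step construction can be sketched in a few lines. The machine-to-switch assignment below is taken from the example; the way the four switches are wired together is not recoverable from the transcript, so the `links` dictionary is an assumed topology chosen only for illustration, and the function name is hypothetical.

```python
def contention_free_linear_order(links, machines, root_switch):
    """List machines switch by switch, visiting switches in DFS order."""
    order = []
    visited = set()

    def visit(switch):
        visited.add(switch)
        # Step 2: append every machine attached to this switch.
        order.extend(machines.get(switch, []))
        # Step 1: continue the depth-first traversal of the switch tree.
        for neighbour in links.get(switch, []):
            if neighbour not in visited:
                visit(neighbour)

    visit(root_switch)
    return order


# Assumed wiring (not given in the slides): S0 - S1, S1 - S2, S1 - S3.
links = {
    "S0": ["S1"],
    "S1": ["S0", "S2", "S3"],
    "S2": ["S1"],
    "S3": ["S1"],
}
machines = {
    "S0": [0, 1, 4, 5],
    "S1": [2, 3, 6, 7],
    "S2": [8, 9, 10, 11],
    "S3": [12, 13, 14, 15],
}

order = contention_free_linear_order(links, machines, "S0")
print(" -> ".join(f"n{m}" for m in order))
```

Because a DFS crosses each switch-to-switch link at most once in each direction, consecutive machines in this order never share a link in the same direction.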

Algorithm for constructing contention free binary tree

• Start with a contention-free linear tree.
• Recursively divide the tree into 2 sub-trees.
• Make sure that there cannot be contention.
• The sub-trees are chosen such that the height of the whole tree is minimal.

[Figure: nodes 0 through 15 of the linear tree]

Binary tree height

• Performance of binary pipelined broadcast depends on the height of the binary tree.
• Even though a contention-free binary tree may not be a complete binary tree, its height is not much more than that of a complete binary tree.

Average tree heights for 20 randomly generated topologies

Evaluation

Contention-free pipelined algorithms:

• Routines are generated from topology information.
• The generated routines are based on MPICH point-to-point primitives.
• Trees: linear tree, binary tree, 3-ary tree

Targets for comparison:

• MPICH: binomial tree, scatter/allgather
• LAM: flat tree, binomial tree
• Topology-unaware pipelined linear and binary algorithms

Evaluation

Performance of different pipelined trees (topology 1)

Comparing pipelined broadcast with other schemes

Topology unaware and contention-free pipelined broadcast

Segment size for pipelined broadcast
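The trade-off behind the segment size can be illustrated with the total-time expression (X + H - 1) x (D x T(msize/X)) from earlier in the talk. The sketch below assumes a hypothetical linear cost model, T(msize) = latency + msize / bandwidth, with made-up parameters, and simply tries power-of-two segment sizes; it is not the selection procedure used in the evaluation.

```python
def t_send(msize, latency=1e-4, bandwidth=12.5e6):
    """Hypothetical cost model: seconds to send msize bytes point to point."""
    return latency + msize / bandwidth


def pipelined_time(msize, seg_size, height, children):
    """Total pipelined time when the message is cut into seg_size-byte segments."""
    segments = -(-msize // seg_size)  # ceiling division
    return (segments + height - 1) * children * t_send(seg_size)


def best_segment_size(msize, height, children, candidates):
    """Pick the candidate segment size with the smallest modelled time."""
    return min(candidates,
               key=lambda s: pipelined_time(msize, s, height, children))


msize = 1 << 20                               # 1 MiB message
candidates = [2 ** k for k in range(8, 21)]   # 256 B up to 1 MiB
best = best_segment_size(msize, height=16, children=1, candidates=candidates)
print(f"best segment size under this model: {best} bytes")
```

Very small segments pay the per-message latency many times, while very large segments leave the pipeline mostly empty, so the modelled optimum lies between the two extremes.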

Conclusions

• Pipelined broadcast is faster than the current broadcast algorithms for medium and large messages.
• Linear pipeline has a completion time roughly equal to T(msize).
• Binary pipelined broadcast is best for medium messages.
• A contention-free broadcast tree is necessary for pipelined algorithms.
• A good segment size for pipelined broadcast is not difficult to find.

Questions?