Implementing Efficient and Scalable Flow Control...

34
Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand Jiuxing Liu and Dhabaleswar K. Panda Computer Science and Engineering The Ohio State University

Transcript of Implementing Efficient and Scalable Flow Control...

Page 1: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Implementing Efficient and Scalable Flow Control Schemes

in MPI over InfiniBand

Jiuxing Liu and Dhabaleswar K. Panda

Computer Science and EngineeringThe Ohio State University

Page 2: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Presentation Outline

• Introduction and overview• Flow control design alternatives• Performance Evaluation• Conclusion

Page 3: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Introduction

• InfiniBand is becoming popular for high performance computing

• Flow control is an important issue in implementing MPI over InfiniBand– Performance– Scalability

Page 4: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

InfiniBand Communication Model

• Different transport services– Focus on Reliable Connection (RC) in this paper

• Queue-pair based communication model• Communication requests posted to send or

receive queues• Communication memory must be registered• Completion detected through Completion

Queue (CQ)

Page 5: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

InfiniBand Send/Recv and RDMA Operations

User-level Buffer

Registered Buffer

HCA

IB Fabric

User-level Buffer

Registered Buffer

HCA

Send Descriptor Receive Descriptor

Send/Recv Model

User-level Buffer

Registered Buffer

HCA

IB Fabric

Registered User-level Buffer

HCA

Send Descriptor

RDMA Model

•For Send/Recv, a send operation must be matched with a pre-posted receive (which specifies a receive buffer)

Page 6: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

End-to-End Flow Control Mechanism in InfiniBand

• Implemented at the hardware level• When there is no recv buffer posted

for an incoming send packet– Receiver sends RNR NAK– Sender retries

Page 7: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

MPI Communication Protocols

Send

Receive

Eager Data

Eager Protocol Rendezvous Protocol

Rendezvous Start

Rendezvous Reply

Rendezvous Data

Rendezvous Finish

Send

Receive

Page 8: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Expected and Unexpected Messages in MPI Protocols

• Expected Messages:– Rendezvous Reply– Rendezvous Data– Rendezvous Finish

• Unexpected Message:– Eager Data– Rendezvous Start

Page 9: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Why Flow Control is Necessary

• Unexpected messages need resources (CPU time, buffer space, etc)

• MPI itself does not limit the number of unexpected messages– Receiver may not be able to keep up– Resources may not be enough

• Flow control (in the MPI implementation) is needed to avoid the above problems

Page 10: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

• Send/Recv operations used for Eager protocol and control messages in Rendezvous protocol– Can also exploit RDMA (not used in this

paper)• RDMA used for Rendezvous Data• Unexpected messages are from

Send/Recv

InfiniBand Operations used for Protocol Messages

Page 11: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Presentation Outline

• Introduction and overview • Flow control design alternatives• Performance Evaluation• Conclusion

Page 12: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Flow Control Design Outline

• Common issues• Hardware based• User-level static• User-level dynamic

Page 13: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Flow Control Design Objectives

• Need to be effective– Preventing the receiver from being

overwhelmed• Need to be efficient

– Very little run time overhead– No unnecessary stall of communication– Efficient buffer usage

• How many buffers for each connection

Page 14: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Classification of Flow Control Schemes

• Hardware based vs. User-level– Hardware based schemes exploit InfiniBand

end-to-end flow control– User-level schemes implements flow control in

MPI implementation• Static vs. Dynamic

– Static schemes use a fixed number of buffers for each connection

– Dynamic schemes can adjust the number of buffers during execution

Page 15: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Hardware-Based Flow Control

• No flow control in MPI• Rely on InfiniBand end-to-end flow

control• Implemented entirely in hardware and

transparent to MPI

Page 16: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Advantages and Disadvantages of the Hardware-Based Scheme

• Advantages– Almost no run-time overhead at the MPI layer during

normal communication– Flow control mechanism makes progress independent of

application • Disadvantages

– Very little flexibility– The hardware flow control scheme may not be the best

for all communication patterns– Separation of buffer management and flow control

• No information to MPI to adjust its behavior • Difficult to implement dynamic schemes

Page 17: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

User-Level Static Schemes

• Flow control handled in MPI implementation

• Fixed number of buffers for each connection

• Credit-based scheme– Piggybacking– Explicit credit messages

Page 18: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Problems of the User-Level Static Scheme

• More overhead at the MPI layer (for credit management)

• Flow control progress depends on application

• Buffer usage is not optimal, may result in:– Wasted buffer for some connections– Unnecessary communication stall for other

connections

Page 19: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

User-Level Dynamic Schemes

• Similar to the user-level static schemes

• Start with only a few buffers for each connection

• Use a feedback based control mechanism to adjust the number of buffers based on communication pattern

Page 20: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

User-Level Dynamic Scheme Design Issues

• How to provide information feedback– When no credit, sender will put a message into a

“backlog”– Message will be sent when more credits are available– Tag messages to indicate if they have gone through the

backlog• How to respond to feedback

– Increase the number of buffers for the connection if a receiver gets a message that has gone through the backlog

– Linear or exponential increase

Page 21: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Presentation Outline

• Introduction and overview • Flow control design alternatives• Performance Evaluation• Conclusion

Page 22: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Experimental Testbed

• 8 SuperMicro SUPER P4DL6 nodes (2.4 GHz Xeon, 400MHz FSB, 512K L2 cache)

• Mellanox InfiniHost MT23108 4X HCAs (A1 silicon), PCI-X 66bit 133MHz

• Mellanox InfiniScale MT43132 switch

Page 23: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Outline of Experiments

• Microbenchmarks– Latency– Bandwidth

• MPI Blocking and Non-blocking Functions• Small and large messages

• NAS Benchmarks– Running time– Communication characteristics related to flow

control

Page 24: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

0

2

4

6

8

10

12

14

16

4 8 16 32 64 128 256 512 1024

Size (Bytes)

Tim

e (M

icro

sec)

Hardware LevelUser Level StaticUser Level Dynamic

MPI Latency

In latency tests, flow control usually is not an issue because communication is symmetric

All schemes perform the same, which means that user level overhead is very small in this case

Page 25: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Blocking MPI Calls

00.20.40.60.8

11.2

2 6 10 14 18 22 26 30 34 38 42 46 50Window Size

Band

wid

th(M

B/s)

Hardware LevelUser Level StaticUser Level Dynamic

Non-Blocking MPI Calls

00.20.40.60.8

11.2

2 6 10 14 18 22 26 30 34 38 42 46 50

Window SizeB

andw

idth

(MB/

s)

Hardware LevelUser Level StaticUser Level Dynamic

MPI Bandwidth (Prepost = 100 and Size = 4 Bytes)

With enough buffers, all schemes perform comparably for small messages

Blocking and Non-blocking MPI calls performs comparably for small messages because message are copied and sent eagerly

Page 26: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Blocking MPI Calls

0

0.2

0.4

0.6

0.8

1

1.2

2 6 10 14 18 22 26 30 34 38 42 46 50Window Size

Band

wid

th (M

B/s

)

Hardware LevelUser Level StaticUser Level Dynamic

Non-Blocking MPI Calls

0

0.2

0.4

0.6

0.8

1

1.2

2 6 10 14 18 22 26 30 34 38 42 46 50

Window SizeBa

ndw

idth

(MB

/s)

Hardware LevelUser Level StaticUser Level Dynamic

MPI Bandwidth (Prepost = 10 and Size = 4 Bytes)

Buffers are not enough, which triggers flow control mechanisms user level dynamic performs the best, user level static performs

the worst

Page 27: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Blocking MPI Calls

380

390

400

410

420

430

440

450

460

2 6 10 14 18 22 26 30 34 38 42 46 50Window Size

Ban

dwid

th (M

B/s)

Hardware LevelUser Level StaticUser Level Dynamic

Non-Blocking MPI Calls

450

500

550

600

650

700

2 6 10 14 18 22 26 30 34 38 42 46 50

Window SizeBa

ndw

idth

(MB/

s)

Hardware LevelUser Level StaticUser Level Dynamic

MPI Bandwidth (Prepost = 10 and Size = 32 KB)

Large messages use Rendezvous protocol which has two-way traffic

All schemes perform comparablyNon-blocking calls give better performance

Page 28: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

NAS Benchmarks (Pre-post = 100)

0.1

1

10

100

IS FT LU CG MG SP BT

Run

ning

Tim

e (s

)

Hardware-LevelUser-Level StaticUser-Level Dynamic

All schemes perform comparably when given enough buffers

Page 29: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

NAS Benchmarks (Pre-post = 1)

2%0%

171%

2%

20%

0% 0%2% 1%

13%

6%2% 0% 0%0% 0% 0% 0% 0% 1% 0%

0%

20%

40%

60%

80%

100%

120%

140%

160%

180%

IS FT LU CG MG SP BT

Perf

orm

ance

Deg

rada

tion

Hardware-BaseUser-Level Static

User-Level Dynamic

Even with very few buffers, most applications still perform wellLU performs significant worse for hardware based and user-level

staticOverall, user-level dynamic gives best performance

Page 30: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Explicit Credit Messages for User-Level Static Schemes

145310SP289130BT15951MG42020CG488059002LU1930FT3830IS#Total Msg#ECMApp

Piggybacking is quite effectiveIn LU, the number of explicit credit messages is high

Page 31: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Maximum Number of Buffers for User-Level Dynamic Schemes

7SP7BT6MG3CG63LU4FT4IS#BufferApp

Almost all applications only need a few (less than 8) buffers per connection for optimal performance

LU requires more buffers

Page 32: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Conclusions• Three different flow control schemes for

MPI over InfiniBand• Evaluation in terms of overhead and buffer

efficiency• Many applications (like those in NAS)

require a small number of buffers for each connection

• User-Level Dynamic Scheme can achieve both good performance and buffer efficiency

Page 33: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Future Work

• More application level evaluation• Evaluate using larger scale systems• Integrate the schemes with our

RDMA based design for small messages

• Exploit the recently proposed Shared Receive Queue (SRQ) feature

Page 34: Implementing Efficient and Scalable Flow Control …mvapich.cse.ohio-state.edu/static/media/publications/slide/liuj... · Implementing Efficient and Scalable Flow Control Schemes

Web Pointers

http://www.cis.ohio-state.edu/~panda/http://nowlab.cis.ohio-state.edu/

http://nowlab.cis.ohio-state.edu/projects/mpi-iba/

NBC home page

MVAPICH home page