![Page 1: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/1.jpg)
SCTP versus TCP for MPI
Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science
University of British Columbia
![Page 2: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/2.jpg)
Outline
Self introduction
Research background
Research presentation
• SCTP & MPI background
• MPI over SCTP design
• Design features
• Results
• Conclusions
![Page 3: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/3.jpg)
Who am I?
Born and raised in Columbus area
OSU alumni
Europa alumni
Worked a few years
Grad student finishing my MSc at UBC
![Page 4: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/4.jpg)
UBC
![Page 5: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/5.jpg)
Who do I work with?
Alan Wagner (Prof, UBC)
Humaira Kamal (PhD, UBC)
Mike Yao Chen Tsai (MSc, UBC)
Edith Vong (BSc, UBC)
Randall Stewart (Cisco)
![Page 6: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/6.jpg)
What field do we work in?
Parallel computing
Concurrently utilize multiple resources
![Page 8: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/8.jpg)
What field do we work in?
Parallel computing
Concurrently utilize multiple resources
1 cook vs. 8 cooks
![Page 9: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/9.jpg)
What field do we work in?
Parallel computing
Concurrently utilize multiple resources
Time saved
![Page 10: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/10.jpg)
What field do we work in?
Message passing programming model
Message Passing Interface (MPI)
• Standardized API for applications

Process 0:
    ...
    result = compute();
    MPI_Send(proc1, result, …);
    ...

Process 1:
    ...
    local_answer = solve();
    MPI_Recv(proc0, otherResult, ...);
    result = local_answer - otherResult;
    ...

[Diagram: a message travels from Process 0 to Process 1.]
![Page 11: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/11.jpg)
What field do we work in?
Middleware for MPI
Glues the necessary components together for a parallel environment
[Diagram: Parallel Applications, MPI Middleware (Job Scheduler, Process Manager, MPI Parallel Library), and Resources.]
![Page 13: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/13.jpg)
What field do we work in?
Parallel library component
Implements the MPI API for various interconnects
• Shared memory
• Myrinet
• InfiniBand
• Specialized hardware (BlueGene/L, ASCI Red, etc.)
![Page 14: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/14.jpg)
What field do we work in?
TCP/IP protocol stack interconnect
Stream Control Transmission Protocol
Protocol stack layers:
Application
Transport: TCP, UDP, SCTP
Network: IP
Link: Ethernet (device driver and interface card)
![Page 15: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/15.jpg)
SCTP versus TCP for MPI
Brad Penoff, Humaira Kamal, Alan Wagner
Department of Computer Science
University of British Columbia
Supercomputing 2005, Seattle, Washington USA
![Page 16: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/16.jpg)
What are MPI and SCTP?
Message Passing Interface (MPI)
• Library that is widely used to parallelize scientific and compute-intensive programs
Stream Control Transmission Protocol (SCTP)
• General-purpose unicast transport protocol for IP network data communications
• Recently standardized by the IETF
• Can be used anywhere TCP is used
![Page 17: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/17.jpg)
What are MPI and SCTP?
Question: Can we take advantage of SCTP features to better support parallel applications using MPI?
![Page 18: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/18.jpg)
Communicating MPI Processes
TCP is often used as the transport protocol for MPI
[Diagram: two communicating MPI processes, each with the stack MPI Process, MPI API, TCP, IP; SCTP is shown as the alternative to TCP in both stacks.]
![Page 19: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/19.jpg)
SCTP Key Features
Reliable in-order delivery, flow control, full duplex transfer.
Selective ACK is built into the protocol
TCP-like congestion control
![Page 20: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/20.jpg)
SCTP Key Features
Message oriented
Use of associations
Multihoming
Multiple streams within an association
![Page 21: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/21.jpg)
Associations and Multihoming
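• Primary address
• Heartbeats
• Retransmissions
• Failover
• User-adjustable controls
• CMT (Concurrent Multipath Transfer)
[Diagram: Node 0 (NIC1, NIC2) and Node 1 (NIC3, NIC4) connected through two networks, 207.10.x.x and 168.1.x.x. Interface addresses shown: 207.10.40.1, 207.10.3.20, 168.1.140.10, 168.1.10.30.]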
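The failover behaviour listed on this slide can be illustrated with a toy model. This is only a sketch, not the logic of any SCTP stack: the `Path` class, the `choose_destination` helper and the threshold of 5 (modelled on the Path.Max.Retrans default in the SCTP specification) are assumptions made for illustration, and the two addresses are borrowed from the diagram without claiming which NIC owns them.

```python
from dataclasses import dataclass
from typing import Sequence

PATH_MAX_RETRANS = 5  # assumed threshold, modelled on the SCTP default


@dataclass
class Path:
    """One destination address of a multihomed peer (hypothetical model)."""
    address: str
    error_count: int = 0

    @property
    def active(self) -> bool:
        # A path counts as inactive once its error count exceeds the threshold.
        return self.error_count <= PATH_MAX_RETRANS

    def heartbeat_missed(self) -> None:
        self.error_count += 1

    def heartbeat_acked(self) -> None:
        self.error_count = 0


def choose_destination(primary: Path, alternates: Sequence[Path]) -> Path:
    """Prefer the primary address; fail over to the first active alternate."""
    if primary.active:
        return primary
    for path in alternates:
        if path.active:
            return path
    raise ConnectionError("no active path to peer")


primary = Path("207.10.3.20")
backup = Path("168.1.10.30")
for _ in range(PATH_MAX_RETRANS + 1):
    primary.heartbeat_missed()          # primary network goes silent
print(choose_destination(primary, [backup]).address)  # 168.1.10.30
```

Once a heartbeat on the primary address is acknowledged again, the error count resets and traffic would return to the primary in this model.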
![Page 22: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/22.jpg)
Logical View of Multiple Streams in an Association
Endpoint X Endpoint Y
Stream 1
Stream 2
Stream 3
Stream 1
Stream 2
Stream 3
SEND
SEND
RECEIVE
RECEIVE
Inbound Streams
Outbound Streams
![Page 23: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/23.jpg)
Partially Ordered User Messages Sent on Different Streams
Send order
[Diagram: Endpoint X sends five user messages (Msg A to Msg E) to Endpoint Y over Streams 1, 2 and 3 of one association.]
![Page 34: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/34.jpg)
Partially Ordered User Messages Sent on Different Streams
Receive order
[Diagram: Msg A to Msg E delivered at Endpoint Y over Streams 1, 2 and 3.]
Messages can be received in the same order as they were sent (required in TCP).
![Page 38: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/38.jpg)
Partially Ordered User Messages Sent on Different Streams
Alternative receive order
[Diagram: Msg A to Msg E delivered at Endpoint Y over Streams 1, 2 and 3 in an order that differs from the send order.]
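The partial ordering shown in these slides can be sketched as a check on receive orders: SCTP preserves order only within a stream, while TCP requires the full send order. The assignment of messages to streams below is hypothetical (the slides do not state it in text), and the helper names are illustrative.

```python
from collections import defaultdict


def per_stream(messages):
    """Group message names by stream number, keeping their list order."""
    streams = defaultdict(list)
    for stream, name in messages:
        streams[stream].append(name)
    return dict(streams)


def valid_sctp_delivery(sent, received):
    """True if every stream sees its own messages in the order they were sent."""
    return per_stream(sent) == per_stream(received)


def valid_tcp_delivery(sent, received):
    """A single byte stream allows only the exact send order."""
    return sent == received


# Hypothetical assignment of the five messages to three streams.
sent = [(1, "A"), (2, "B"), (2, "C"), (1, "D"), (3, "E")]
alternative = [(2, "B"), (1, "A"), (3, "E"), (2, "C"), (1, "D")]
violating = [(1, "D"), (1, "A"), (2, "B"), (2, "C"), (3, "E")]

print(valid_sctp_delivery(sent, sent), valid_tcp_delivery(sent, sent))                # True True
print(valid_sctp_delivery(sent, alternative), valid_tcp_delivery(sent, alternative))  # True False
print(valid_sctp_delivery(sent, violating))                                           # False
```

The `alternative` order interleaves streams differently from the send order yet keeps A before D and B before C, which is all that per-stream ordering demands.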
![Page 39: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/39.jpg)
MPI API Implementation
Message matching is done based on Tag, Rank and Context (TRC).
Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
Use of wildcards for receive
MPI_Send(msg,count,type,dest-rank,tag,context)
MPI_Recv(msg,count,type,source-rank,tag,context)
Format of MPI Message: Envelope (Context, Rank, Tag) followed by Payload
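A minimal sketch of TRC matching with wildcards, assuming the envelope is modelled as a simple tuple. The `Envelope` type, the `matches` function and the wildcard values are stand-ins chosen for illustration; they are not taken from LAM or from the MPI headers.

```python
from typing import NamedTuple

# Stand-ins for MPI_ANY_SOURCE and MPI_ANY_TAG; the numeric values are arbitrary.
ANY_SOURCE = -1
ANY_TAG = -1


class Envelope(NamedTuple):
    context: int
    rank: int
    tag: int


def matches(request: Envelope, incoming: Envelope) -> bool:
    """A receive matches when the context is equal and rank and tag are
    either equal or wildcarded. The context has no wildcard."""
    return (
        request.context == incoming.context
        and request.rank in (ANY_SOURCE, incoming.rank)
        and request.tag in (ANY_TAG, incoming.tag)
    )


message = Envelope(context=0, rank=1, tag=7)
print(matches(Envelope(0, 1, 7), message))           # True: exact match
print(matches(Envelope(0, ANY_SOURCE, 7), message))  # True: any sender
print(matches(Envelope(0, 1, ANY_TAG), message))     # True: any tag
print(matches(Envelope(1, 1, 7), message))           # False: different context
```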
![Page 40: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/40.jpg)
MPI Messages Using Same Context, Two Processes
Process X:
MPI_Send(Msg_1,Tag_A)
MPI_Send(Msg_2,Tag_B)
MPI_Send(Msg_3,Tag_A)
Process Y:
MPI_Irecv(..ANY_TAG..)
[Diagram, shown twice: Msg_1, Msg_2 and Msg_3 in flight from Process X to Process Y.]
![Page 41: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/41.jpg)
MPI Messages Using Same Context, Two Processes
Process X:
MPI_Send(Msg_1,Tag_A)
MPI_Send(Msg_2,Tag_B)
MPI_Send(Msg_3,Tag_A)
Process Y:
MPI_Irecv(..ANY_TAG..)
[Diagram: Msg_1, Msg_2 and Msg_3 in flight from Process X to Process Y.]
Out-of-order messages with the same tags violate MPI semantics
![Page 42: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/42.jpg)
MPI API Implementation
Request Progression Layer
Short Messages vs. Long Messages
[Diagram: the MPI implementation runtime sits between the application layer, where a receive request is issued, and the SCTP layer, where an incoming message is received on the socket. It holds the Receive Request Queue and the Unexpected Message Queue.]
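The interplay of the two queues can be sketched as follows. This is a generic illustration of the receive request queue and unexpected message queue pattern, not LAM's code: the `Progression` class, its method names and the dictionary layout are assumptions.

```python
from collections import deque

ANY = None  # hypothetical wildcard for rank or tag in a receive request


class Progression:
    """Toy request progression with the two queues named on the slide."""

    def __init__(self):
        self.receive_requests = deque()  # posted receives still waiting
        self.unexpected = deque()        # messages that arrived before a receive
        self.completed = []              # payloads handed to the application

    @staticmethod
    def _matches(request, message):
        return (request["context"] == message["context"]
                and request["rank"] in (ANY, message["rank"])
                and request["tag"] in (ANY, message["tag"]))

    def post_receive(self, context, rank=ANY, tag=ANY):
        """Application issues a receive: check the unexpected queue first."""
        request = {"context": context, "rank": rank, "tag": tag}
        for message in self.unexpected:
            if self._matches(request, message):
                self.unexpected.remove(message)
                self.completed.append(message["payload"])
                return message["payload"]
        self.receive_requests.append(request)
        return None

    def incoming(self, context, rank, tag, payload):
        """Transport delivers a message: check posted receives first."""
        message = {"context": context, "rank": rank, "tag": tag, "payload": payload}
        for request in self.receive_requests:
            if self._matches(request, message):
                self.receive_requests.remove(request)
                self.completed.append(payload)
                return True
        self.unexpected.append(message)
        return False


p = Progression()
p.incoming(0, 1, 7, "early")      # no receive posted yet: goes to unexpected queue
print(p.post_receive(0, rank=1))  # early
```

Scanning each queue from the front means the oldest matching entry wins, which is one simple way to respect the ordering MPI expects between matching messages.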
![Page 43: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/43.jpg)
MPI over SCTP: Design and Implementation
LAM (Local Area Multicomputer) is an open source implementation of the MPI library.
• Origins at Ohio Supercomputer Center
We redesigned the LAM TCP RPI module to use SCTP.
The RPI module is responsible for maintaining state information of all requests.
![Page 44: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/44.jpg)
MPI over SCTP: Design and Implementation
Challenges:
Lack of documentation
Code examination
• Our document is linked from the LAM/MPI website
Extensive instrumentation
• Diagnostic traces
Identification of problems in the SCTP protocol
![Page 45: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/45.jpg)
Using SCTP for MPI
Striking similarities between SCTP and MPI

MPI              SCTP
Context          One-to-many socket
Rank / Source    Association
Message tags     Streams
![Page 46: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/46.jpg)
Implementation Issues
Maintaining State Information
• Maintain state appropriately for each request function to work with the one-to-many style.
Message Demultiplexing
• Extend RPI initialization to map associations to rank.
• Demultiplex each incoming message to direct it to the proper receive function.
Concurrency and SCTP Streams
• Consistently map MPI tag-rank-context to SCTP streams, maintaining proper MPI semantics.
Resource Management
• Make RPI more message-driven.
• Eliminate the use of the select() system call, making the implementation more scalable.
• Eliminate the need to maintain a large number of socket descriptors.
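One way to read "consistently map" is that the same tag and context must always land on the same stream, so messages that MPI requires to stay in order share a stream, while different tags may use different streams. The sketch below assumes the rank selects the association, so only context and tag choose the stream. The function name, the pool of 10 streams and the modulo rule are illustrative assumptions, not the mapping used in the LAM SCTP module.

```python
NUM_STREAMS = 10  # assumed size of the stream pool in each association


def stream_for(context: int, tag: int, num_streams: int = NUM_STREAMS) -> int:
    """Deterministically map (context, tag) to a stream number in
    [0, num_streams). Equal inputs always give the same stream, so
    messages with the same tag and context keep their relative order."""
    return (context * 31 + tag) % num_streams


print(stream_for(0, 7))  # 7
print(stream_for(0, 7))  # 7 again: same tag, same stream
print(stream_for(1, 7))  # 8: another context may use another stream
```

Two different tags can still collide on one stream under such a rule. That costs some concurrency but never violates ordering, since a shared stream only adds ordering constraints.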
![Page 47: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/47.jpg)
Implementation Issues
Eliminating Race Conditions
• Finding solutions for race conditions due to added concurrency.
• Use of a barrier after the association setup phase.
Reliability
• Modify the out-of-band daemons and the request progression interface (RPI) to use a common transport layer protocol, allowing all components of LAM to multihome successfully.
Support for Large Messages
• Devised a long-message protocol to handle messages larger than the socket send buffer.
Experiments with different SCTP stacks
![Page 48: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/48.jpg)
Features of Design
Scalability
Head-of-Line Blocking
![Page 49: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/49.jpg)
Scalability
TCP: N - 1 sockets
[Diagram: four MPI processes; with TCP, a process needs N - 1 sockets to reach its peers.]
![Page 50: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/50.jpg)
Scalability
SCTP: 1 socket
[Diagram: four MPI processes; with SCTP, a process needs 1 socket to reach its peers.]
![Page 51: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/51.jpg)
Head-of-Line Blocking
[Diagram, SCTP: Process X issues two MPI_Send calls (Tag_A, Tag_B) and Process Y posts two MPI_Irecv calls; Msg_A and Msg_B are in flight. Result: Delivered.]
[Diagram, TCP: the same calls with Msg_A and Msg_B in flight. Result: Blocked.]
![Page 52: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/52.jpg)
P0:
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Irecv(P1, MPI_ANY_TAG)
MPI_Waitany()
Compute()
MPI_Waitall()
- - -
P1:
MPI_Send(Msg-A, P0, tag-A)
MPI_Send(Msg-B, P0, tag-B)
- - -
![Page 58: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/58.jpg)
[Figure: execution time on P0 for the P0/P1 program from the preceding slide, TCP versus SCTP. TCP timeline: MPI_Irecv, MPI_Irecv, MPI_Waitany, Compute, MPI_Waitall; Msg-B arrives first and sits in the socket buffer until Msg-A arrives. SCTP timeline: MPI_Irecv, MPI_Irecv, MPI_Waitany, Compute, MPI_Waitall.]
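The timeline can be approximated with a small model of head-of-line blocking. In this sketch a message is handed to the receiver only when everything sent before it in the same ordering domain has arrived: the whole connection for TCP, a single stream for SCTP. All times, the compute duration and the helper names are invented for illustration; they are not measurements from the talk.

```python
def delivery_times(messages, ordering_domain):
    """messages: list of (name, stream, arrival_time) in send order.
    ordering_domain: maps a stream number to the unit that must stay ordered.
    Returns {name: time at which the message is handed to the receiver}."""
    latest = {}
    delivered = {}
    for name, stream, arrival in messages:
        key = ordering_domain(stream)
        # A message cannot be delivered before its predecessors in the same domain.
        latest[key] = max(latest.get(key, 0.0), arrival)
        delivered[name] = latest[key]
    return delivered


def total_time(delivered, compute=4.0):
    """MPI_Waitany returns at the first delivery, Compute() then runs,
    and MPI_Waitall returns once the last message has been delivered."""
    start = min(delivered.values())
    return max(start + compute, max(delivered.values()))


# Msg-A is sent first but arrives late (for example after a retransmission).
sends = [("Msg-A", 0, 5.0), ("Msg-B", 1, 1.0)]

tcp = delivery_times(sends, lambda stream: "connection")  # one ordered byte stream
sctp = delivery_times(sends, lambda stream: stream)       # one stream per tag

print(tcp, total_time(tcp))    # {'Msg-A': 5.0, 'Msg-B': 5.0} 9.0
print(sctp, total_time(sctp))  # {'Msg-A': 5.0, 'Msg-B': 1.0} 5.0
```

With separate streams, Msg-B completes MPI_Waitany as soon as it arrives, so the computation overlaps the wait for Msg-A instead of following it.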
![Page 59: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/59.jpg)
Limitations
Comprehensive CRC32c checksum – offload to NIC not yet commonly available
SCTP bundles messages together so it might not always be able to pack a full MTU
SCTP stack is in early stages and will improve over time
Performance is stack dependent (Linux lksctp stack << FreeBSD KAME stack)
![Page 60: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/60.jpg)
Experiments
Controlled environment
• Eight nodes
• Dummynet
Used standard benchmarks as well as real-world programs
Fair comparison
• Buffer sizes, Nagle disabled, SACK ON, no multihoming, CRC32c OFF
![Page 61: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/61.jpg)
Experiments: Benchmarks
[Figure: MPBench Ping Pong Test under No Loss. Throughput normalized to LAM_TCP values (axis 0 to 1.4) versus message size in bytes (1 to 131069); series: LAM_SCTP and LAM_TCP.]
![Page 62: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/62.jpg)
NAS Benchmarks
The NAS benchmarks approximate real world parallel scientific applications
We experimented with a suite of 7 benchmarks, 4 data set sizes
SCTP performance comparable to TCP for large datasets.
![Page 63: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/63.jpg)
Latency Tolerant Programs
Bulk Farm Processor program
• Real-world application
• Non-blocking communication
• Overlap computation with communication
• Use of multiple tags
![Page 64: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/64.jpg)
Farm Program - Short Messages
LAM_SCTP versus LAM_TCP for Farm Program
Message Size: Short, Fanout: 10
Total run time (seconds)

Loss rate    LAM_SCTP    LAM_TCP
0%           8.7         6.2
1%           11.7        88.1
2%           16.0        154.7
![Page 65: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/65.jpg)
Head-of-line blocking – Short messages
LAM_SCTP 10-Streams versus LAM_SCTP 1-Stream for Farm Program
Message Size: Short, Fanout: 10
Total run time (seconds)

Loss rate    10 Streams    1 Stream
0%           8.7           9.3
1%           11.7          11.0
2%           16.0          21.6
![Page 66: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/66.jpg)
Conclusions
SCTP is better suited for MPI
• Avoids unnecessary head-of-line blocking due to use of streams
• Increased fault tolerance in the presence of multihomed hosts
• Built-in security features
• Robust under loss
SCTP might be key to moving MPI programs from LANs to WANs.
![Page 67: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/67.jpg)
Future Work
Release LAM SCTP RPI module at SC|05
Incorporate our work into Open MPI and/or MPICH2
Modify real applications to use tags as streams
![Page 68: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/68.jpg)
More information about our work is at:
http://www.cs.ubc.ca/labs/dsg/mpi-sctp/
Thank you!
![Page 69: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/69.jpg)
Extra Slides
![Page 70: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/70.jpg)
Partially Ordered User Messages Sent on Different Streams
[Diagram: inside the SCTP layer, user messages carry a message stream number (SNo), pass through fragmentation into a data chunk queue, and are bundled together with the control chunk queue into SCTP packets handed to the IP layer.]
![Page 71: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/71.jpg)
Added Security
P0 → P1: INIT
P1 → P0: INIT-ACK
P0 → P1: COOKIE-ECHO
P1 → P0: COOKIE-ACK
User data can be piggy-backed on the third and fourth legs
SCTP's Use of Signed Cookie
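The idea behind a signed cookie can be sketched with an HMAC: the responder packs the would-be association state into the cookie sent with INIT-ACK, signs it, and keeps nothing in memory until a valid COOKIE-ECHO returns. The cookie layout, the field choice, the 60-second lifetime and the helper names below are assumptions for illustration; they do not reproduce the SCTP wire format.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(32)  # responder's private key, never sent on the wire
MAC_LEN = 32             # SHA-256 digest size in bytes


def make_cookie(peer_addr: str, peer_tag: int, timestamp: int) -> bytes:
    """Build the cookie returned in INIT-ACK: signature followed by state."""
    state = f"{peer_addr}|{peer_tag}|{timestamp}".encode()
    mac = hmac.new(SECRET, state, hashlib.sha256).digest()
    return mac + state


def verify_cookie(cookie: bytes, now: int, lifetime: int = 60):
    """Check a cookie echoed in COOKIE-ECHO. Returns the recovered state,
    or None if the signature is wrong or the cookie has expired."""
    mac, state = cookie[:MAC_LEN], cookie[MAC_LEN:]
    expected = hmac.new(SECRET, state, hashlib.sha256).digest()
    if not hmac.compare_digest(mac, expected):
        return None
    addr, tag, timestamp = state.decode().split("|")
    if now - int(timestamp) > lifetime:
        return None
    return addr, int(tag), int(timestamp)


cookie = make_cookie("207.10.40.1", 1234, timestamp=100)
print(verify_cookie(cookie, now=110))  # ('207.10.40.1', 1234, 100)
```

Because the signature is checked before any state is allocated, a flood of forged INIT packets costs the responder only the work of computing cookies.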
![Page 72: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/72.jpg)
Added Security
• 32-bit Verification Tag (guards against reset attacks)
• Autoclose feature
• No half-closed state
![Page 73: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/73.jpg)
Farm Program - Long Messages
LAM_SCTP versus LAM_TCP for Farm Program
Message Size: Long, Fanout: 10
Total run time (seconds)

Loss rate    LAM_SCTP    LAM_TCP
0%           79          129
1%           786         3103
2%           1585        6414
![Page 74: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/74.jpg)
Head-of-line blocking – Long messages
LAM_SCTP 10-Streams versus LAM_SCTP 1-Stream for Farm Program
Message Size: Long, Fanout: 10
Total run time (seconds)

Loss rate    10 Streams    1 Stream
0%           79            79
1%           786           1000
2%           1585          1942
![Page 76: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/76.jpg)
Experiments: Benchmarks
SCTP outperformed TCP under loss for ping pong test.
[Figure: Throughput of Ping-pong w/ 30K messages. Bytes/second (axis 0 to 60000) versus loss rate (1, 2); series: SCTP and TCP.]
![Page 77: SCTP versus TCP for MPI Brad Penoff, Humaira Kamal, Alan Wagner Department of Computer Science University of British Columbia.](https://reader036.fdocuments.net/reader036/viewer/2022062407/56649e425503460f94b34a46/html5/thumbnails/77.jpg)
Experiments: Benchmarks
SCTP outperformed TCP under loss for ping pong test.
[Figure: Throughput of Ping-pong w/ 300K messages. Bytes/second (axis 0 to 6000) versus loss rate (1, 2); series: SCTP and TCP.]