May 2011
RDMA Capable iWARP over Datagrams
Ryan E. Grant¹, Mohammad J. Rashti¹, Pavan Balaji², Ahmad Afsahi¹
¹ Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, Canada K7L 3N6
² Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, USA
Introduction
• Motivation
• Background Information
• Design
• Experimental Framework and Results
– Microbenchmarks
– Applications
• Conclusions
– Future Work
• Questions
Motivation
• Existing RDMA designs do not provide support for RDMA write operations over unreliable datagram (UD) transports
• Popular applications use datagrams
– Video on demand streaming
– High-speed financial trading applications
• Desirable to leverage RDMA technology to improve application performance
• Improve performance of inter-node communication for Ethernet clusters
Motivation
• Sandvine Inc. report from Monday
– Netflix consumes 29.7% of peak time bandwidth in North America
– Real-time entertainment (RTE) consumes 49.2%
– Prediction: entertainment will consume 55-60% of peak time bandwidth by the end of 2011
– RTE and filesharing consume almost 70% of peak time bandwidth
Source: www.sandvine.com/news/pr_detail.asp?ID=312
Motivation
• Why use UD?
– Scalability: no need for connections
– Speed: no TCP congestion control
– Simplicity: less complex implementation for UD offloading than a TOE
• Drawbacks to UD?
– Unreliability
– Potential packet loss from congestion
Background Information
• iWARP
– Remote Direct Memory Access over Ethernet
– Standard built on TCP or SCTP lower layer
– Queue pair based network
– Untagged and tagged models
  • Untagged: sent data matched with a posted receive for local data placement
  • Tagged: sender aware of remote memory window and provides target memory location
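The difference between the two placement models can be illustrated with a toy receiver. This is a minimal Python sketch, not the verbs API: the `ToyTarget` class and its method names are hypothetical, and plain bytearrays stand in for posted buffers and registered memory.

```python
from collections import deque


class ToyTarget:
    """Hypothetical receiver model used only to contrast the two placement models."""

    def __init__(self):
        self.posted_recvs = deque()  # untagged model: buffers posted in advance
        self.regions = {}            # tagged model: STag -> registered memory

    def post_recv(self, size):
        buf = bytearray(size)
        self.posted_recvs.append(buf)
        return buf

    def register(self, stag, size):
        self.regions[stag] = bytearray(size)
        return self.regions[stag]

    def on_untagged(self, payload):
        # The receiver chose the location: data lands in the next posted buffer.
        # popleft() raises IndexError if no receive was posted.
        buf = self.posted_recvs.popleft()
        if len(payload) > len(buf):
            raise ValueError("payload larger than posted receive buffer")
        buf[:len(payload)] = payload
        return buf

    def on_tagged(self, stag, offset, payload):
        # The sender chose the location: (STag, offset) travels in the header.
        region = self.regions[stag]
        if offset < 0 or offset + len(payload) > len(region):
            raise ValueError("write outside registered region")
        region[offset:offset + len(payload)] = payload


target = ToyTarget()
recv_buf = target.post_recv(8)
target.on_untagged(b"abc")            # placed into the posted receive
window = target.register(0x10, 16)
target.on_tagged(0x10, 4, b"xyz")     # placed at offset 4 of the window
```

In the untagged case the receiver decides where data goes by posting a buffer; in the tagged case the sender decides through the (STag, offset) pair.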
Background Information
[Figure: iWARP (UD) Stack versus Kernel TCP/IP Stack]
Background Information
• Traditional iWARP RDMA Write
1. Verbs request
2. iWARP stack applies tagged header (STag and offset)
3. Data sent to target
4. Data received
5. Data written into memory based on STag and offset
6. Send request posted
7. Send request data sent to target
8. Incoming data matched to Recv request
9. Recv request handled
10. RDMA Write valid after Recv
11. Application can access data
Alternatively, the application can poll a bit in memory to determine when the write is complete (replacing steps 7 to 10):
7. Poll on memory until valid
Background
• Traditional RDMA Write relies on the lower layer (TCP) for reliability
• With a UD LLP:
– Target buffer may not have the complete message
– Final send/recv lost in transit means complete iWARP message loss
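Both failure cases can be made concrete with a small model. The sketch below is an illustration under stated assumptions, not the stack's behavior: `traditional_write` is a hypothetical helper, losses are chosen by hand rather than by a real network, and the trailing Send is modeled as one extra index after the data segments.

```python
def traditional_write(segments, dropped, region):
    """Model an RDMA Write followed by a Send over a lossy datagram channel.

    segments: list of (offset, data) tuples making up the write.
    dropped:  set of indices lost in transit; index len(segments) is the
              trailing Send that signals completion.
    Returns (notified, complete): whether the target was told the write is
    done, and whether every data segment actually arrived.
    """
    for i, (offset, data) in enumerate(segments):
        if i not in dropped:
            region[offset:offset + len(data)] = data
    notified = len(segments) not in dropped
    complete = all(i not in dropped for i in range(len(segments)))
    return notified, complete


segs = [(0, b"abcd"), (4, b"efgh")]

# Case 1: a data segment is lost but the Send arrives.
# The target is told the buffer is valid although part of it is missing.
case_data_lost = traditional_write(segs, {1}, bytearray(8))

# Case 2: all data arrives but the final Send is lost.
# The data is in memory, yet the application never learns about it.
region = bytearray(8)
case_send_lost = traditional_write(segs, {2}, region)
```

Neither outcome is acceptable, which is why the completion signal cannot be a separate message when the lower layer is unreliable.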
Design - Challenges with UD Transports
• UD transports provide additional challenges over TCP
– Unreliable!
– No order guarantees
– No connection information
• But they solve some problems as well
– No middlebox fragmentation issues
  • No need for iWARP markers
Challenges with UD
• RDMA functions like a local DMA, but remote
– For UD need to treat RDMA like an unreliable memory
– Indicate which areas of memory are “bad” due to message loss
• Ideally it should be compatible with socket semantics
– Done through an intermediate interface or protocol
Challenges with UD
• Allow for socket semantics compatibility
– Each incoming message can result in a completion notification
– Functions like traditional recvmsg but using user buffers
– Similar to send/recv without posted recvs
• Allow for DMA-like interface
– Produce a validity map for all valid areas of memory in a defined memory region
– Essentially an aggregate of many completion notifications, delivered at once
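One way to picture the validity map is as the merged set of byte ranges reported by individual completions. The following Python sketch assumes each completion is an (offset, length) pair inside one memory region; the function names and the list-of-ranges representation are illustrative choices, not the format used by the authors' implementation.

```python
def validity_map(completions, region_len):
    """Merge per-datagram completions into sorted, non-overlapping valid ranges.

    completions: iterable of (offset, length) pairs, one per placed datagram.
    Returns a list of (start, end) half-open ranges of valid bytes.
    """
    ranges = sorted((off, min(off + n, region_len)) for off, n in completions)
    merged = []
    for start, end in ranges:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


def invalid_ranges(valid, region_len):
    """Return the gaps ("bad" areas) left between the valid ranges."""
    gaps, cursor = [], 0
    for start, end in valid:
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < region_len:
        gaps.append((cursor, region_len))
    return gaps


# Three datagrams arrived (out of order); the ones covering 6..8 and 12..16 were lost.
valid = validity_map([(0, 4), (8, 4), (4, 2)], region_len=16)
bad = invalid_ranges(valid, region_len=16)
```

Here `valid` is the aggregate an application would fetch in one call, while the socket-style interface would instead hand over each (offset, length) completion as it occurs.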
Background Information
• iWARP RDMA Write-Record
1. Verbs request
2. iWARP stack applies tagged header (STag and offset)
3. Data sent to target
4. Data received
5. Data written into memory based on STag and offset
6. Location of valid data entered into CQ or validity map
7. Poll CQ for valid data
8. Application can access data
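The target side of steps 4 to 8 can be sketched as follows. This is a toy Python model under assumptions: `WriteRecordTarget` and its methods are hypothetical names, a deque stands in for the completion queue, and each datagram is taken to carry its own STag and offset so that it can be placed without any other message.

```python
from collections import deque


class WriteRecordTarget:
    """Hypothetical target that records every placed datagram in a completion queue."""

    def __init__(self, stag, size):
        self.stag = stag
        self.region = bytearray(size)
        self.cq = deque()

    def on_datagram(self, stag, offset, payload):
        # Steps 4-5: validate the tagged header and place the data directly.
        if stag != self.stag or offset < 0 or offset + len(payload) > len(self.region):
            return False
        self.region[offset:offset + len(payload)] = payload
        # Step 6: record where valid data now lives.
        self.cq.append((offset, len(payload)))
        return True

    def poll_cq(self):
        # Step 7: return the next (offset, length) record, or None if empty.
        return self.cq.popleft() if self.cq else None


target = WriteRecordTarget(stag=7, size=8)
# The datagram for bytes 0..4 is assumed lost; only the second one arrives.
accepted = target.on_datagram(7, 4, b"data")
rejected = target.on_datagram(9, 0, b"x")   # wrong STag, ignored
record = target.poll_cq()
# Step 8: the application reads only the range the record describes.
usable = bytes(target.region[record[0]:record[0] + record[1]])
```

Unlike the traditional flow, no separate Send is needed, so the loss of one datagram affects only the bytes that datagram carried.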
Solving the Challenges of UD
• Ordering
– Small messages are typical of UD (< 64K)
– Direct placement avoids ordering issues for small messages
– Large messages: need to keep a message sequence number counter for each user of a memory region
• No connection information
– Pass sender’s IP/port back to application upon application validity data fetch
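A possible shape for this bookkeeping is sketched below in Python. It is an assumption about how the two ideas could be combined, not the authors' data structure: the `LargeMessageTracker` name, the (sender, message sequence number) key, and the addresses used are all illustrative.

```python
from collections import defaultdict


class LargeMessageTracker:
    """Hypothetical per-region record of which segments of which message arrived."""

    def __init__(self):
        # (sender address, message sequence number) -> list of (offset, length)
        self.segments = defaultdict(list)

    def record(self, sender, msg_seq, offset, length):
        # Segments may arrive in any order; the sequence number keeps segments
        # of different messages written to the same region apart.
        self.segments[(sender, msg_seq)].append((offset, length))

    def fetch(self, sender, msg_seq):
        """Return the sender's (IP, port) with the placed segments sorted by
        offset, and forget them."""
        placed = sorted(self.segments.pop((sender, msg_seq), []))
        return sender, placed


tracker = LargeMessageTracker()
peer = ("10.0.0.1", 5000)                 # illustrative IP/port
tracker.record(peer, 1, 65536, 100)       # second segment of message 1 arrives first
tracker.record(peer, 2, 0, 10)            # a different message in the same region
tracker.record(peer, 1, 0, 65536)         # first segment of message 1
who, parts = tracker.fetch(peer, 1)
```

Returning the sender's address with the validity data gives the application the peer identity that a connection would otherwise have provided.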
Experimental Framework
OS: Fedora, kernel 2.6.31
Processors: 2 × 2.0 GHz Quad-Core AMD Opteron
NIC: NetEffect 10GigE
Switch: Fujitsu 10GigE Switch
• Network Performance data collected using custom microbenchmark suite for software iWARP
• Application results collected using a custom socket interface to software iWARP and the following software:
VideoLan’s VLC (http://www.videolan.org/vlc)
SIPp (http://sipp.sourceforge.net)
UD Send/Recv first proposed in: Mohammad J. Rashti, Ryan E. Grant, Pavan Balaji, and Ahmad Afsahi, "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet", 17th International Conference on High Performance Computing (HiPC 2010), Goa, India, December 19-22, 2010.
Microbenchmark Results
• UD RDMA Write-Record has the lowest small message latency, similar to UD Send/Recv
[Figure: Verbs Small Message Latency. Latency (µs) versus message size (1 byte to 1 KB); series: UD Send/Recv, RC Send/Recv, UD RDMA Write-Record, RC RDMA Write]
Baseline Multi-Stream Performance
• RDMA Write-Record also has higher bandwidth for larger message sizes, and outperforms at medium message sizes as well
[Figure: Unidirectional Bandwidth. Bandwidth (MB/s) versus message size (1 byte to 1 MB); series: UD Send/Recv, RC Send/Recv, UD RDMA Write-Record, RC RDMA Write]
Microbenchmark Results
• RDMA Write-Record is more loss tolerant for large messages than Send/Recv as well, as it delivers partial messages (messages may span multiple 64K UDP messages)
[Figure: UD Send/Recv Bandwidth under Packet Loss Conditions. Bandwidth (MB/s) versus message size (1 byte to 1 MB); series: 0.1%, 0.5%, 1%, and 5% loss]
[Figure: UD RDMA Write-Record Bandwidth under Packet Loss Conditions. Bandwidth (MB/s) versus message size (1 byte to 1 MB); series: 0.1%, 0.5%, 1%, and 5% loss]
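A back-of-the-envelope model shows why partial delivery matters for large messages. The Python sketch below is not the measured result from the charts; it assumes each 64K UDP message is lost independently with the same probability, ignores headers, and uses hypothetical function names.

```python
import math

SEGMENT = 64 * 1024  # assumed UDP message size, taken from the 64K figure above


def segments(message_bytes):
    """Number of 64K UDP messages needed to carry one application message."""
    return max(1, math.ceil(message_bytes / SEGMENT))


def send_recv_usable_fraction(message_bytes, loss):
    # Send/Recv: the message is usable only if every segment arrives.
    return (1 - loss) ** segments(message_bytes)


def write_record_usable_fraction(message_bytes, loss):
    # Write-Record: each surviving segment is placed and reported on its own,
    # so the expected usable fraction of the data is simply 1 - loss.
    return 1 - loss


one_mb = 1024 * 1024
n = segments(one_mb)                                  # 16 segments
sr = send_recv_usable_fraction(one_mb, 0.05)          # 0.95**16, roughly 0.44
wr = write_record_usable_fraction(one_mb, 0.05)       # 0.95
```

Under these assumptions, a 1 MB message at 5% loss arrives intact less than half the time with Send/Recv, while Write-Record still delivers about 95% of the bytes.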
Microbenchmark Summary
• RDMA Write-Record provides good performance
– Beats RC RDMA Write at the most important message sizes for latency and bandwidth
– Improves upon UD Send/Recv
• RDMA Write-Record fits well within existing socket semantics, enabling easy adoption
– Removes MPA layer complexity as well as TCP bottlenecks to enhance performance and reduce overall stack complexity
Application Performance Results
Application Performance
• Tested with media streaming and SIP phone applications for performance
– Developed a sockets-to-verbs interface to allow existing applications to use software iWARP stack (UD/RC iWARP)
– Lightweight interface to test functionality
  • Formally specified socket interface would be helpful in facilitating acceptance
  • Operates in one iWARP transport mode at a time only, RC or UD
  • Sockets Direct Protocol is available for RC mode hardware (not compatible with software iWARP)
VLC Performance
VLC requires significantly less buffering time with UD iWARP than with RC iWARP, a 74% average improvement.
[Figure: VLC Streaming Media Buffering Performance. Time (ms) by transport type (UD, RC); series: Send/Recv, RDMA Write (Record)]
SIP Performance
SIP shows a 43.1% improvement in response times using UD over RC (Send/Recv and RDMA Write (Record) are statistically tied in performance for this test).
[Figure: SIP Response Times. Time (ms) by transport (UD, RC)]
Application Performance Discussion
• Performance with UD is better than with RC
• Software solution is still using TCP/IP and UDP stacks
– OS related overhead in both cases is similar
– Performance benefits from simpler UDP transport
• Hardware solutions would show benefit from having no target CPU involvement required for data reception (no posted recvs)
• Target system can receive information without local machine work request
Application Memory Usage
The memory usage of a UD solution for a SIP application can be significantly less than that of an RC solution (24.1% @ 10000 clients)
[Figure: % Improvement in Memory Usage, UD vs RC. Percent improvement versus number of concurrent calls (100, 1000, 10000)]
Application Memory Usage
• Memory usage calculated using whole application memory usage as well as memory usage from the slab.
• Improvement of 24.1% at 10000 users compares with a theoretical improvement of 28.1%
– The difference comes from the SIP application’s requirement to store information on active UDP clients
• Scalability and offloaded networking for iWARP UD hardware are promising for increasing server capacity and throughput
Conclusions
• RDMA Write-Record is the first one-sided RDMA operation operable over UD on iWARP
• RDMA Write-Record allows for data transfer that can tolerate packet loss
• The UD solution is more scalable than a connection-based one
• Full specifications for a two-sided Send/Recv and one-sided RDMA Write-Record over iWARP are now available
• Real applications show performance improvements using UD based iWARP
Future Work
• Extend the work to include a reliable datagram transport, broadening the potential application space
• MPI-RDMA Write-Record interface for HPC applications
• Provide an SDP-like interface for UD iWARP
Thank You
Questions?
This work was supported in part by: Natural Sciences and Engineering Research Council of Canada Grant #RGPIN/238964-2005, Canada Foundation for Innovation and Ontario Innovation Trust Grant #7154, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and the National Science Foundation Grant #0702182