Design Guidelines for High Performance RDMA Systems
Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)
RDMA is cheap (and fast!)
Mellanox Connect-IB
• 2x 56 Gbps InfiniBand
• ~2 µs RTT
• RDMA
• $1300

Problem: Performance depends on complex low-level factors.
Background: RDMA read
[Figure: An RDMA read request arrives at the server's NIC, which fetches the data from the CPU's memory hierarchy (L3) with a PCIe DMA read and returns the RDMA read response, without involving the server's CPU cores.]
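For concreteness, here is roughly what issuing the one-sided read above looks like with libibverbs. A minimal sketch, assuming an already-connected RC queue pair and registered memory from prior setup; the helper name and parameters are illustrative:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post one RDMA read on an already-connected RC queue pair.
 * qp, lkey, remote_addr, and rkey come from earlier connection setup. */
static int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* where the NIC DMA-writes the data */
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* server-side address */
    wr.wr.rdma.rkey        = rkey;               /* server-side memory key */

    /* One MMIO rings the Doorbell; the remote CPU is never involved. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```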
How to design a sequencer?
[Figure: A sequencer: clients ask a server for the next number, and the server hands out consecutive values (e.g., 87, 88).]
Which RDMA ops to use?
Remote CPU bypass (one-sided)
• Read
• Write
• Fetch-and-add
• Compare-and-swap

Remote CPU involved (messaging, two-sided)
• Send
• Recv
Performance? 2.2 M/s with one-sided fetch-and-add (see the sketch below).
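If the counter is exposed this way, a client fetches the next sequence number with a verbs fetch-and-add. A minimal sketch, assuming the 8-byte counter's remote address and rkey came from connection setup; the function and variable names are illustrative, not from the talk:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: fetch the next sequence number with a one-sided fetch-and-add.
 * The old counter value is DMA-written into *result_buf on completion. */
static int post_fetch_and_add(struct ibv_qp *qp, uint64_t *result_buf,
                              uint32_t lkey, uint64_t counter_addr,
                              uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf,
        .length = sizeof(uint64_t),       /* verbs atomics operate on 8 bytes */
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = counter_addr;  /* 8B-aligned counter on the server */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 1;             /* increment by 1 */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```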
How we sped up the sequencer by 50x
Large RDMA design space
Operations: READ, WRITE, ATOMIC (remote bypass, one-sided); SEND, RECV (two-sided)
Transports: Reliable/Unreliable, Connected/Datagram
Optimizations: Inlined, Unsignaled, Doorbell batching, WQE shrinking, 0B-RECVs (Inlined and Unsignaled are sketched below)
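Two of the optimizations above, Inlined and Unsignaled, are plain per-request flags in verbs. A minimal sketch of how they might be combined for small messages; the signal-every-N policy and the UNSIG_BATCH name are illustrative assumptions, not taken from the talk:

```c
#include <infiniband/verbs.h>
#include <string.h>

#define UNSIG_BATCH 64  /* illustrative: signal every 64th send */

/* Sketch: send a small payload inlined, signaling only occasionally so most
 * sends skip the completion DMA. Requires a QP created with sq_sig_all = 0,
 * and the payload must fit within the QP's max_inline_data. */
static int post_small_send(struct ibv_qp *qp, struct ibv_sge *sge,
                           unsigned long send_count)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    /* Inlined: the CPU copies the payload into the WQE, so the NIC does not
     * issue a separate DMA read for the data. */
    wr.send_flags = IBV_SEND_INLINE;
    /* Unsignaled sends avoid a completion; signal periodically so the send
     * queue can be reclaimed by polling the CQ. */
    if (send_count % UNSIG_BATCH == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```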
Guidelines
NICs have multiple processing units (PUs): avoid contention; exploit parallelism.
PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs).
High contention w/ atomics
[Figure: NIC processing units (PUs) handle Fetch&Add(A, 1) operations on the sequence counter A in the server's L3, each requiring a PCIe DMA read and a DMA write. Latency is ~500 ns; throughput is ~2 M/s.]
Reduce contention: use CPU cores
[Figure: RPC-based sequencer in the style of HERD [SIGCOMM 14]. The client's request arrives as an RDMA write, which the NIC DMA-writes over PCIe (~500 ns); a server core reads it, updates counter A in L3 (core-to-L3: ~20 ns), and replies with a SEND.]
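A rough sketch of the server side of this RPC-based sequencer, assuming HERD-style requests that clients RDMA-write into per-client slots the core polls; NUM_CLIENTS, req_slots, and the reply-with-SEND details are illustrative assumptions, not the talk's exact code (see rdma_bench for the real implementation):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define NUM_CLIENTS 16  /* illustrative */

/* Sketch: one server core owns the counter; clients RDMA-write a request flag
 * into their slot, the core bumps the counter and replies with a SEND. */
static void sequencer_core(volatile uint64_t *req_slots,   /* polled request flags */
                           struct ibv_qp *client_qps[NUM_CLIENTS],
                           uint64_t *resp_buf, uint32_t resp_lkey)
{
    uint64_t counter = 0;

    for (;;) {
        for (int c = 0; c < NUM_CLIENTS; c++) {
            if (req_slots[c] == 0)
                continue;                /* no pending request from client c */
            req_slots[c] = 0;            /* consume the request */

            resp_buf[c] = counter++;     /* counter stays in the core's cache (~20 ns) */

            struct ibv_sge sge = {
                .addr   = (uintptr_t)&resp_buf[c],
                .length = sizeof(uint64_t),
                .lkey   = resp_lkey,
            };
            struct ibv_send_wr wr, *bad_wr = NULL;
            memset(&wr, 0, sizeof(wr));
            wr.opcode     = IBV_WR_SEND;
            wr.sg_list    = &sge;
            wr.num_sge    = 1;
            wr.send_flags = IBV_SEND_SIGNALED;

            ibv_post_send(client_qps[c], &wr, &bad_wr);
            /* A real server would also poll the send CQ to reclaim WQEs. */
        }
    }
}
```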
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2. The 50x goal is marked.]
Reduce MMIOs w/ Doorbell batching
[Figure: Push vs. pull. Push: the CPU writes each SEND descriptor to the NIC with its own MMIO, and MMIOs cost lots of CPU cycles. Pull (Doorbell batching): the CPU issues one MMIO (the Doorbell) and the NIC pulls the queued SENDs with a DMA read.]
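In verbs terms, the pull mode corresponds to chaining work requests through their next pointers and posting the whole chain with a single ibv_post_send, so one MMIO (the Doorbell) covers the batch. A minimal sketch; the 64-entry cap and the signal-only-the-last policy are illustrative:

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Sketch: post n SENDs with one Doorbell by linking the work requests.
 * sges[i] is assumed to describe the i-th message's payload; n <= 64. */
static int post_send_batch(struct ibv_qp *qp, struct ibv_sge *sges, int n)
{
    struct ibv_send_wr wrs[64], *bad_wr = NULL;

    for (int i = 0; i < n; i++) {
        memset(&wrs[i], 0, sizeof(wrs[i]));
        wrs[i].opcode  = IBV_WR_SEND;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
        /* Chain the WRs: the NIC pulls all of them after one Doorbell. */
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
        /* Signal only the last WR so the whole batch costs one completion. */
        wrs[i].send_flags = (i + 1 == n) ? IBV_SEND_SIGNALED : 0;
    }

    /* One ibv_post_send => one MMIO for the whole batch. */
    return ibv_post_send(qp, &wrs[0], &bad_wr);
}
```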
RPCs w/ Doorbell batching
[Figure: CPU-NIC interaction for RPC requests and responses, comparing Push (one MMIO per response) with Pull (Doorbell batching: the CPU rings one Doorbell and the NIC pulls the batch of responses).]
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +Doorbell batching: 16.6. The 50x goal is marked.]
Exploit NIC parallelism w/ multiQ
[Figure: With a single queue, one NIC processing unit carries every SEND (RPC resp) across PCIe and becomes the bottleneck while the other PUs sit idle.]
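One way to spread this work across NIC PUs is to give each worker several queue pairs and rotate over them. A minimal sketch; NUM_QPS and the round-robin policy are illustrative assumptions:

```c
#include <infiniband/verbs.h>
#include <string.h>

#define NUM_QPS 3  /* illustrative; the talk's charts use 3-4 queues */

/* Sketch: spread SENDs over several QPs so no single NIC PU is the bottleneck. */
struct multiqueue {
    struct ibv_qp *qps[NUM_QPS];
    unsigned next;              /* round-robin index */
};

static int post_send_multiq(struct multiqueue *mq, struct ibv_sge *sge)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    struct ibv_qp *qp = mq->qps[mq->next];
    mq->next = (mq->next + 1) % NUM_QPS;   /* rotate across queue pairs */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```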
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +Doorbell batching: 16.6; +3 queues: 27.4. The 50x goal is marked.]
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +Batching: 16.6; +3 queues: 27.4; +6 cores: 97.2. The 50x goal is marked.]
Bottleneck = PCIe DMA bandwidth (see paper).
Reduce DMA size: Header-only
[Figure: A SEND occupies 128 B: a 64 B header (with an immediate, Imm, field), a 4 B size field, 8 B of data, and 52 B unused. Header-only: move the payload into the header's Imm field so only the 64 B header needs to cross PCIe.]
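A header-only response can be sketched with SEND-with-immediate and a zero-length payload, so the value travels in the packet header's immediate field and no payload DMA is needed. Note the verbs immediate is only 32 bits, so this carries a 32-bit value purely for illustration:

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Sketch: a "header-only" response. No SGE and no payload DMA; the value
 * rides in the 32-bit immediate field of the packet header. */
static int post_header_only_resp(struct ibv_qp *qp, uint32_t value)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND_WITH_IMM;
    wr.num_sge    = 0;                 /* zero-byte payload */
    wr.imm_data   = htonl(value);      /* payload moved into the header */
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The receiver still consumes a posted RECV for each of these; since there is no payload, those RECVs can be the 0B-RECVs listed in the design space above.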
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +4 queues, Doorbell batching: 27.4; +6 cores: 97.2; +Header-only: 122, reaching the 50x goal.]
Evaluation (in the paper)
• Evaluation of optimizations on 3 RDMA generations
• PCIe models, bottlenecks
• More atomics experiments (example: atomic operations on multiple addresses)
[Figure: RPC-based key-value store (HERD [SIGCOMM 14]; 16B keys, 32B values, 5% PUTs). Throughput (M/s) vs. number of cores (0-14), Baseline vs. +Doorbell batching; the batched version coalesces roughly 9-14 responses per Doorbell.]
Conclusion
Code: https://github.com/anujkaliaiitd/rdma_bench
NICs have multiple processing units (PUs): avoid contention; exploit parallelism.
PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs).