Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7...

22
Design Guidelines for High Performance RDMA Systems Anuj Kalia (CMU) Michael Kaminsky (Intel Labs) David Andersen (CMU) 1

Transcript of Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7...

Page 1: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Design Guidelinesfor

High Performance RDMA Systems

Anuj Kalia (CMU) Michael Kaminsky (Intel Labs)

David Andersen (CMU)

1

Page 2: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

2

RDMA is cheap (and fast!)

Mellanox Connect-IB• 2x 56 Gbps InfiniBand• ~2 µs RTT• RDMA• $1300

ProblemPerformance depends on complex low-level factors

Page 3: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Background: RDMA read

3

NIC

CPUCore Core

L3

PCI ExpressDMA read

RDMA read request

RDMA read response

Page 4: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

How to design a sequencer?

4

Server

Client Client

87 88

Page 5: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Which RDMA ops to use?

5

Remote CPU bypass (one-sided)• Read• Write• Fetch-and-add• Compare-and-swap

Remote CPU involved (messaging, two-sided)• Send• Recv

2.2 M/sPerf?

Page 6: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

6

How we sped up the sequencer

by 50X

Page 7: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Large RDMA design space

7

Operations READ WRITE ATOMIC SEND, RECV

Optimizations Inlined Unsignaled Doorbell batching

0B-RECVsWQE shrinking

Transports Reliable Unreliable Connected Datagram

Remote bypass (one-sided) Two-sided

Page 8: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Guidelines

8

NICs have multiple processing units (PUs)Avoid contention

Exploit parallelism

PCI Express messages are expensiveReduce CPU-to-NIC messages (MMIOs)Reduce NIC-to-CPU messages (DMAs)

Page 9: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

High contention w/ atomics

9

PCI Express

Fetch&Add(A, 1)

PU PU

DMA read DMA write

Latency ~500nsThroughput ~2 M/s

CPUCore Core

L3 A Sequence counter

Page 10: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Reduce contention: use CPU cores

10

NIC

Core Core

L3

PCI Express (500 ns)DMA write

RDMA write (RPC req)

ACore to L3: 20 ns

SEND (RPC resp)[HERD, SIGCOMM 14]

Page 11: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

11

Thro

ughp

ut (M

/s)

0306090

120150

Sequencer throughput

72.2

Atomics RPC (1 core)

Sequencer throughput

50x

Page 12: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Reduce MMIOs w/ Doorbell batching

12

SEND

NIC

CPU

MMIOs ⇒ lots of CPU cycles

SEND

SEND

NIC

CPU SEND

DMA

Push

Pull

Page 13: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

RPCs w/ Doorbell batching

13

CPU NICRequests

Responses

CPU NICRequests

Responses

Push Pull (Doorbell batching)

Page 14: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

14

Sequencer throughputTh

roug

hput

(M/s

)

0306090

120150

Sequencer throughput

16.672.2Atomics RPC (1 C) +Dbell batching

50x

Page 15: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Exploit NIC parallelism w/ multiQ

15

CPUCore Core

L3

PCI Express

A

Idle

SEND (RPC resp)

Bottleneck

Page 16: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

16

Sequencer throughputTh

roug

hput

(M/s

)

0

30

60

90

120

150

Sequencer throughput

27.416.672.2Atomics RPC (1 C) +3 queues+Dbell batching

50x

Page 17: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Thro

ughp

ut (M

/s)

030

6090

120150

Sequencer throughput

97.2

27.416.672.2

Sequencer throughput

Atomics RPC (1 C) +6 cores+Batching

17

Bottleneck = PCIe DMA bandwidth (paper)

+3 queues

50x

Page 18: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Reduce DMA size: Header-only

18

SEND

NIC

CPU0 12864

Header Size Data Unused640 128

64B 4B 8B 52BImm

Move payload

Header Imm0 64

Page 19: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Thro

ughp

ut (M

/s)

0306090

120150

Sequencer throughput

12297.2

27.472.2

Sequencer throughput

Atomics RPC (1 C) +6 cores+4 Queues,Dbell batching

19

+Header-only

50x

Page 20: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Evaluation• Evaluation of optimizations on 3 RDMA generations

• PCIe models, bottlenecks

• More atomics experiments

• Example: atomic operations on multiple addresses

20

Page 21: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

21

RPC-based key-value storeTh

roug

hput

(M/s

)

0

25

50

75

100

Number of cores0 2 4 6 8 10 12 14

Baseline +Doorbell Batching

HERD[SIGCOMM 14]16B keys, 32B values, 5% PUTs

14 resps/doorbell

9 resps/doorbell

Page 22: Design Guidelines - USENIX · 2019. 2. 26. · 120 150 Sequencer throughput 97.2 27.4 16.6 2.2 7 Sequencer throughput Atomics RPC (1 C) +Batching +6 cores 17 Bottleneck = PCIe DMA

Conclusion

22

Code: https://github.com/anujkaliaiitd/rdma_bench

NICs have multiple processing units (PUs)

Avoid contentionExploit parallelism

PCI Express messages are expensive

Reduce CPU-to-NIC messages (MMIOs)Reduce NIC-to-CPU messages (DMAs)