Design Guidelines for High Performance RDMA Systems
Anuj Kalia (CMU), Michael Kaminsky (Intel Labs), David Andersen (CMU)
RDMA is cheap (and fast!)
Mellanox Connect-IB
• 2x 56 Gbps InfiniBand
• ~2 µs RTT
• RDMA
• $1300

Problem: Performance depends on complex low-level factors.
Background: RDMA read
[Figure: An RDMA read request arrives at the server's NIC, which fetches the data from the CPU's memory hierarchy (L3) with a PCIe DMA read and returns the RDMA read response, without involving the server's CPU cores.]
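For concreteness, here is roughly what issuing the one-sided read above looks like with libibverbs. A minimal sketch, assuming an already-connected RC queue pair and registered memory from prior setup; the helper name and parameters are illustrative:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: post one RDMA read on an already-connected RC queue pair.
 * qp, lkey, remote_addr, and rkey come from earlier connection setup. */
static int post_rdma_read(struct ibv_qp *qp, void *local_buf, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,  /* where the NIC DMA-writes the data */
        .length = len,
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* generate a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* server-side address */
    wr.wr.rdma.rkey        = rkey;               /* server-side memory key */

    /* One MMIO rings the Doorbell; the remote CPU is never involved. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```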
How to design a sequencer?
[Figure: A sequencer: clients ask a server for the next number, and the server hands out consecutive values (e.g., 87, 88).]
Which RDMA ops to use?
Remote CPU bypass (one-sided)
• Read
• Write
• Fetch-and-add
• Compare-and-swap

Remote CPU involved (messaging, two-sided)
• Send
• Recv
Performance? 2.2 M/s with one-sided fetch-and-add (see the sketch below).
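If the counter is exposed this way, a client fetches the next sequence number with a verbs fetch-and-add. A minimal sketch, assuming the 8-byte counter's remote address and rkey came from connection setup; the function and variable names are illustrative, not from the talk:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Sketch: fetch the next sequence number with a one-sided fetch-and-add.
 * The old counter value is DMA-written into *result_buf on completion. */
static int post_fetch_and_add(struct ibv_qp *qp, uint64_t *result_buf,
                              uint32_t lkey, uint64_t counter_addr,
                              uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)result_buf,
        .length = sizeof(uint64_t),       /* verbs atomics operate on 8 bytes */
        .lkey   = lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode                = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = counter_addr;  /* 8B-aligned counter on the server */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = 1;             /* increment by 1 */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```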
How we sped up the sequencer by 50x
Large RDMA design space
Operations: READ, WRITE, ATOMIC (remote bypass, one-sided); SEND, RECV (two-sided)
Transports: Reliable/Unreliable, Connected/Datagram
Optimizations: Inlined, Unsignaled, Doorbell batching, WQE shrinking, 0B-RECVs (Inlined and Unsignaled are sketched below)
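Two of the optimizations above, Inlined and Unsignaled, are plain per-request flags in verbs. A minimal sketch of how they might be combined for small messages; the signal-every-N policy and the UNSIG_BATCH name are illustrative assumptions, not taken from the talk:

```c
#include <infiniband/verbs.h>
#include <string.h>

#define UNSIG_BATCH 64  /* illustrative: signal every 64th send */

/* Sketch: send a small payload inlined, signaling only occasionally so most
 * sends skip the completion DMA. Requires a QP created with sq_sig_all = 0,
 * and the payload must fit within the QP's max_inline_data. */
static int post_small_send(struct ibv_qp *qp, struct ibv_sge *sge,
                           unsigned long send_count)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    /* Inlined: the CPU copies the payload into the WQE, so the NIC does not
     * issue a separate DMA read for the data. */
    wr.send_flags = IBV_SEND_INLINE;
    /* Unsignaled sends avoid a completion; signal periodically so the send
     * queue can be reclaimed by polling the CQ. */
    if (send_count % UNSIG_BATCH == 0)
        wr.send_flags |= IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```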
Guidelines
NICs have multiple processing units (PUs): avoid contention; exploit parallelism.
PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs).
High contention w/ atomics
[Figure: NIC processing units (PUs) handle Fetch&Add(A, 1) operations on the sequence counter A in the server's L3, each requiring a PCIe DMA read and a DMA write. Latency is ~500 ns; throughput is ~2 M/s.]
Reduce contention: use CPU cores
[Figure: RPC-based sequencer in the style of HERD [SIGCOMM 14]. The client's request arrives as an RDMA write, which the NIC DMA-writes over PCIe (~500 ns); a server core reads it, updates counter A in L3 (core-to-L3: ~20 ns), and replies with a SEND.]
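A rough sketch of the server side of this RPC-based sequencer, assuming HERD-style requests that clients RDMA-write into per-client slots the core polls; NUM_CLIENTS, req_slots, and the reply-with-SEND details are illustrative assumptions, not the talk's exact code (see rdma_bench for the real implementation):

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

#define NUM_CLIENTS 16  /* illustrative */

/* Sketch: one server core owns the counter; clients RDMA-write a request flag
 * into their slot, the core bumps the counter and replies with a SEND. */
static void sequencer_core(volatile uint64_t *req_slots,   /* polled request flags */
                           struct ibv_qp *client_qps[NUM_CLIENTS],
                           uint64_t *resp_buf, uint32_t resp_lkey)
{
    uint64_t counter = 0;

    for (;;) {
        for (int c = 0; c < NUM_CLIENTS; c++) {
            if (req_slots[c] == 0)
                continue;                /* no pending request from client c */
            req_slots[c] = 0;            /* consume the request */

            resp_buf[c] = counter++;     /* counter stays in the core's cache (~20 ns) */

            struct ibv_sge sge = {
                .addr   = (uintptr_t)&resp_buf[c],
                .length = sizeof(uint64_t),
                .lkey   = resp_lkey,
            };
            struct ibv_send_wr wr, *bad_wr = NULL;
            memset(&wr, 0, sizeof(wr));
            wr.opcode     = IBV_WR_SEND;
            wr.sg_list    = &sge;
            wr.num_sge    = 1;
            wr.send_flags = IBV_SEND_SIGNALED;

            ibv_post_send(client_qps[c], &wr, &bad_wr);
            /* A real server would also poll the send CQ to reclaim WQEs. */
        }
    }
}
```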
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2. The 50x goal is marked.]
Reduce MMIOs w/ Doorbell batching
[Figure: Push vs. pull. Push: the CPU writes each SEND descriptor to the NIC with its own MMIO, and MMIOs cost lots of CPU cycles. Pull (Doorbell batching): the CPU issues one MMIO (the Doorbell) and the NIC pulls the queued SENDs with a DMA read.]
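In verbs terms, the pull mode corresponds to chaining work requests through their next pointers and posting the whole chain with a single ibv_post_send, so one MMIO (the Doorbell) covers the batch. A minimal sketch; the 64-entry cap and the signal-only-the-last policy are illustrative:

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Sketch: post n SENDs with one Doorbell by linking the work requests.
 * sges[i] is assumed to describe the i-th message's payload; n <= 64. */
static int post_send_batch(struct ibv_qp *qp, struct ibv_sge *sges, int n)
{
    struct ibv_send_wr wrs[64], *bad_wr = NULL;

    for (int i = 0; i < n; i++) {
        memset(&wrs[i], 0, sizeof(wrs[i]));
        wrs[i].opcode  = IBV_WR_SEND;
        wrs[i].sg_list = &sges[i];
        wrs[i].num_sge = 1;
        /* Chain the WRs: the NIC pulls all of them after one Doorbell. */
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
        /* Signal only the last WR so the whole batch costs one completion. */
        wrs[i].send_flags = (i + 1 == n) ? IBV_SEND_SIGNALED : 0;
    }

    /* One ibv_post_send => one MMIO for the whole batch. */
    return ibv_post_send(qp, &wrs[0], &bad_wr);
}
```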
RPCs w/ Doorbell batching
[Figure: CPU-NIC interaction for RPC requests and responses, comparing Push (one MMIO per response) with Pull (Doorbell batching: the CPU rings one Doorbell and the NIC pulls the batch of responses).]
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +Doorbell batching: 16.6. The 50x goal is marked.]
Exploit NIC parallelism w/ multiQ
[Figure: With a single queue, one NIC processing unit carries every SEND (RPC resp) across PCIe and becomes the bottleneck while the other PUs sit idle.]
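One way to spread this work across NIC PUs is to give each worker several queue pairs and rotate over them. A minimal sketch; NUM_QPS and the round-robin policy are illustrative assumptions:

```c
#include <infiniband/verbs.h>
#include <string.h>

#define NUM_QPS 3  /* illustrative; the talk's charts use 3-4 queues */

/* Sketch: spread SENDs over several QPs so no single NIC PU is the bottleneck. */
struct multiqueue {
    struct ibv_qp *qps[NUM_QPS];
    unsigned next;              /* round-robin index */
};

static int post_send_multiq(struct multiqueue *mq, struct ibv_sge *sge)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND;
    wr.sg_list    = sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    struct ibv_qp *qp = mq->qps[mq->next];
    mq->next = (mq->next + 1) % NUM_QPS;   /* rotate across queue pairs */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```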
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +Doorbell batching: 16.6; +3 queues: 27.4. The 50x goal is marked.]
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +Batching: 16.6; +3 queues: 27.4; +6 cores: 97.2. The 50x goal is marked.]
Bottleneck = PCIe DMA bandwidth (see paper).
Reduce DMA size: Header-only
[Figure: A SEND occupies 128 B: a 64 B header (with an immediate, Imm, field), a 4 B size field, 8 B of data, and 52 B unused. Header-only: move the payload into the header's Imm field so only the 64 B header needs to cross PCIe.]
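A header-only response can be sketched with SEND-with-immediate and a zero-length payload, so the value travels in the packet header's immediate field and no payload DMA is needed. Note the verbs immediate is only 32 bits, so this carries a 32-bit value purely for illustration:

```c
#include <infiniband/verbs.h>
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Sketch: a "header-only" response. No SGE and no payload DMA; the value
 * rides in the 32-bit immediate field of the packet header. */
static int post_header_only_resp(struct ibv_qp *qp, uint32_t value)
{
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode     = IBV_WR_SEND_WITH_IMM;
    wr.num_sge    = 0;                 /* zero-byte payload */
    wr.imm_data   = htonl(value);      /* payload moved into the header */
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

The receiver still consumes a posted RECV for each of these; since there is no payload, those RECVs can be the 0B-RECVs listed in the design space above.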
[Figure: Sequencer throughput (M/s). Atomics: 2.2; RPC (1 core): 7.2; +4 queues, Doorbell batching: 27.4; +6 cores: 97.2; +Header-only: 122, reaching the 50x goal.]
Evaluation (in the paper)
• Evaluation of optimizations on 3 RDMA generations
• PCIe models, bottlenecks
• More atomics experiments (example: atomic operations on multiple addresses)
[Figure: RPC-based key-value store (HERD [SIGCOMM 14]; 16B keys, 32B values, 5% PUTs). Throughput (M/s) vs. number of cores (0-14), Baseline vs. +Doorbell batching; the batched version coalesces roughly 9-14 responses per Doorbell.]
Conclusion
Code: https://github.com/anujkaliaiitd/rdma_bench
NICs have multiple processing units (PUs): avoid contention; exploit parallelism.
PCI Express messages are expensive: reduce CPU-to-NIC messages (MMIOs); reduce NIC-to-CPU messages (DMAs).