Octopus: an RDMA-enabled Distributed Persistent Memory ... · PDF fileOctopus: an RDMA-enabled...

Octopus: an RDMA-enabled Distributed Persistent Memory File System

Youyou Lu1, Jiwu Shu1, Youmin Chen1, Tao Li2

1Tsinghua University2University of Florida

1

Outline

•Background and Motivation

•Octopus Design

•Evaluation

•Conclusion

2

NVMM & RDMA

•NVMM (PCM, ReRAM, etc)• Data persistency

• Byte-addressable

• Low latency

• RDMA• Remote direct access• Bypass remote kernel• Low latency and high throughput

Client ServerRegistered Memory Registered Memory

HCA HCACPU CPU

A

B

C

D

E

3

Modular-Designed Distributed File System

• DiskGluster• Disk for data storage

• GigE for communication

• MemGluster• Memory for data storage

• RDMA for communication

18 ms

Latency (1KB write+sync)

98 %

2 %

Overall

HDD

Software

Network

324 us

99.7 %

Overall

MEM

Software

RDMA

4

Modular-Designed Distributed File System

• DiskGluster• Disk for data storage

• GigE for communication

• MemGluster• Memory for data storage

• RDMA for communication

Bandwidth (1MB write)

88 MB/s

83 MB/s

HDD

File System

118 MB/sNetwork

6509 MB/s

1779MB/s

MEM

File System

6350 MB/sRDMA

94 %

27 %

5

RDMA-enabled Distributed File System•More than fast hardware

• It is suboptimal to simply replace the network/storagemodule

•Opportunities and Challenges• NVM

• Byte-addressability

• Significant overhead of data copies

• RDMA• Flexible programming verbs (message/memory semantics)

• Imbalanced CPU processing capacity vs. network I/Os

6

RDMA-enabled Distributed File System

• It is necessary to rethink the design of DFS over NVM & RDMA 7

Byte-addressability of NVM

One-sided RDMA verbsShared data managements

CPU is the new bottleneckNew data flow strategies

Flexible RDMA verbs Efficient RPC design

RDMA Atomics Concurrent control

Opportunity Approaches

Outline

•Background and Motivation•Octopus Design• Shared Persistent Memory Pool• Self-Identified Metadata RPC• Client-Active Data I/O• Collect-Dispatch Transaction

•Evaluation•Conclusion

8

-> Reduce data copies-> Reduce response latency-> Rebalance CPU/network overhead-> Efficient concurrent control

Octopus Architecture

N2

NVMM

N2

NVMM

N3

NVMM

HCA HCA HCA

...

Shared Persistent Memory Pool

Client 1 Client 2 Self-Identified RPC

RDMA-based Data IO

create(“/home/cym”) Read(“/home/lyy”)

It performs remote direct data access just like an Octopus uses its eight legs

9

1. Shared Persistent Memory Pool

• Existing DFSs• Redundant data copy

Client Server

FS Image

User Space Buffer

GlusterFS

PageCache

User Space Buffer

mbuf

NICNIC

mbuf

10

7 copy



Client Server

FS Image

User Space Buffer

GlusterFS + DAX

PageCache

User Space Buffer

mbuf

NICNIC

mbuf

11

6 copy

•Octopus with SPMP• Introduces the shared persistent memory

pool• Global view of data layout



Client Server

FS Image

User Space Buffer User Space Buffer

mbuf

NICNIC

mbuf

12

MessagePool

4 copy

2. Client-Active Data I/O

• Server-Active• Server threads process the

data I/O

• Works well for slow Ethernet

• CPUs can easily become the bottleneck with fast hardware

• Client-Active• Let clients read/write data

directly from/to the SPMP

C1 timeNIC MEMCPUC2

Lookup file data Send data

Lookup file data Send address13

SERVER

3. Self-Identified Metadata RPC

• Message-based RPC• easy to implement, lower throughput

• DaRPC[SoCC’14], FaSST[OSDI’16]

• Memory-based RPC• CPU cores scan the message buffer

• FaRM[NSDI’14]

• Using rdma_write_with_imm?• Scan by polling

• Imm data for self-identification

14

Message PoolHCA

Thead1 Theadn

HCA DATAID

Message PoolHCA

Thead1 Theadn

HCA DATA

4. Collect-Dispatch Distributed Transaction• mkdir, mknod operations need distributed transactions

• Two-Phase Locking & Commit• Distributed logging

• Two-phase coordination

2 * RPC15

Coordinator

Participant

BeginLog

BeginLocalLock wait

LogCommit/

Abort

Write Data End

EndBegin

LogBegin

TxExec

LogContext

LogCommit/Abort

LocalLock

LogBegin

TxExec

LogContext

LogCommit/Abort

wait

LocalUnlock

Write Data

LocalUnlock

waitLogEnd

prepare commit

• mkdir, mknod operations need distributed transactions

• Collect-Dispatch Transaction• Local logging with remote in-place update

• Distributed lock service

Coordinator

Participant

BeginLog

BeginLocalLock

wait Local ExecLog &

UpdateLocal

Unlock

waitLocalLock

Collect

Commit/Abort

End

Update

Unlock

End

1 * RPC + 2 * One-sidedcollect dispatch

Outline


•Octopus Design

•Evaluation

•Conclusion

16

Evaluation Setup• Evaluation Platform

• Connected with Mellanox SX1012 switch

• Evaluated Distributed File Systems• memGluster, runs on memory, with RDMA connection• NVFS[OSU], Crail[IBM], optimized to run on RDMA• memHDFS, Alluxio, for big data comparison

Cluster CPU Memory ConnectX-3 FDR Number

A E5-2680 * 2 384 GB Yes * 5

B E5-2620 16 GB Yes * 7

17

Overall Efficiency

75%

80%

85%

90%

95%

100%

getattr readdir

Latency Breakdown

software mem network

0

1000

2000

3000

4000

5000

6000

7000

Write Read

Bandwidth Utilization

software mem network

• Software latency is reduced rom 326 us to 6 us

• Achieves read/write bandwidth that approaches the raw storage and network bandwidth

18

Metadata Operation Performance

• Octopus provides metadata IOPS in the order of 105~106

• Octopus can scales linearly

6.E+03

6.E+04

6.E+05

1 2 3 4 5

MKNOD

glusterfs nvfscrail dmfscrail-poll

8.E+04

8.E+05

8.E+06

1 2 3 4 5

GETATTR


3.E+04

3.E+05

1 2 3 4 5

RMNOD


19

Read/Write Performance

• Octopus can easily reach the maximum bandwidth of hardware with a single client

• Octopus can achieve the same bandwidth as Crail even add an extra data copy [not shown]

6.E+00

6.E+01

6.E+02

6.E+03

1K 4K 16K 64K 256K 1MB

WRITE

glusterfs

nvfs

crail-np

crail-poll

dmfs

6.E+00

6.E+01

6.E+02

6.E+03

1K 4K 16K 64K 256K 1MB

READ

20

5629 MB/s 6088 MB/s

Big Data Evaluation

•Octopus can also provide better performance for big data applications than existing file systems.

0

500

1000

1500

2000

2500

3000

write read

TestDFSIO (MB/s)

memHDFS Alluxio NVFS Crail Octopus

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

1.05

Teragen Wordcount

Normalized Execution Time

memHDFS Alluxio NVFS Crail Octopus

21

Outline


•Octopus Design

•Evaluation

•Conclusion

22

Conclusion• It is necessary to rethink the DFS designs over emerging H/Ws

•Octopus’s internal mechanisms• Simplifies data management layer by reducing data copies• Rebalances network and server loads with Client-Active I/O• Redesigns the metadata RPC and distributed transaction with

RDMA primitives

• Evaluations show that Octopus significantly outperforms existing file systems

23

Q&A

Thanks

24

Octopus: an RDMA-enabled Distributed Persistent Memory ... · PDF fileOctopus: an RDMA-enabled...

Documents

Transcript of Octopus: an RDMA-enabled Distributed Persistent Memory ... · PDF fileOctopus: an RDMA-enabled...