Octopus: an RDMA-enabled Distributed Persistent Memory File System
Youyou Lu¹, Jiwu Shu¹, Youmin Chen¹, Tao Li²
¹Tsinghua University, ²University of Florida
NVMM & RDMA
• NVMM (PCM, ReRAM, etc.)
  • Data persistency
  • Byte-addressable
  • Low latency
• RDMA
  • Remote direct access
  • Bypass remote kernel
  • Low latency and high throughput
[Figure: RDMA between Client and Server, each with a CPU, an HCA, and registered memory; steps labeled A to E]
Modular-Designed Distributed File System
• DiskGluster
  • Disk for data storage
  • GigE for communication
• MemGluster
  • Memory for data storage
  • RDMA for communication
[Figure: Latency (1KB write+sync). DiskGluster: 18 ms overall, broken down into HDD, software, and network (labeled shares: 98 % and 2 %). MemGluster: 324 us overall, broken down into MEM, software, and RDMA (labeled share: 99.7 %).]
Bandwidth (1MB write)
System        Storage            File System   Network
DiskGluster   88 MB/s (HDD)      83 MB/s       118 MB/s (Network)
MemGluster    6509 MB/s (MEM)    1779 MB/s     6350 MB/s (RDMA)
File-system bandwidth relative to storage bandwidth: 94 % (DiskGluster), 27 % (MemGluster)
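The 94 % and 27 % labels on this slide are consistent with dividing the file-system bandwidth by the raw storage bandwidth. A minimal check, assuming 88 MB/s and 6509 MB/s are the raw HDD and memory figures and 83 MB/s and 1779 MB/s are the file-system results (the function name is illustrative, not from the talk):

```python
def utilization(fs_mb_s: float, raw_mb_s: float) -> float:
    """Fraction of the raw storage bandwidth that the file system delivers."""
    return fs_mb_s / raw_mb_s

# Values read off the slide; the pairing of numbers to bars is an assumption.
disk_gluster = utilization(83, 88)       # DiskGluster on HDD
mem_gluster = utilization(1779, 6509)    # MemGluster on memory

print(f"DiskGluster: {disk_gluster:.0%}")  # DiskGluster: 94%
print(f"MemGluster:  {mem_gluster:.0%}")   # MemGluster:  27%
```

Under this reading, swapping in memory and RDMA raises absolute bandwidth but leaves most of the hardware unused, which is the motivation the following slides build on.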
RDMA-enabled Distributed File System
• More than fast hardware
  • It is suboptimal to simply replace the network/storage module
• Opportunities and Challenges
  • NVM
    • Byte-addressability
    • Significant overhead of data copies
  • RDMA
    • Flexible programming verbs (message/memory semantics)
    • Imbalanced CPU processing capacity vs. network I/Os
RDMA-enabled Distributed File System
• It is necessary to rethink the design of DFS over NVM & RDMA
Opportunity                                        Approach
Byte-addressability of NVM, one-sided RDMA verbs   Shared data management
CPU is the new bottleneck                          New data flow strategies
Flexible RDMA verbs                                Efficient RPC design
RDMA atomics                                       Concurrency control
Outline
• Background and Motivation
• Octopus Design
  • Shared Persistent Memory Pool -> Reduce data copies
  • Self-Identified Metadata RPC -> Reduce response latency
  • Client-Active Data I/O -> Rebalance CPU/network overhead
  • Collect-Dispatch Transaction -> Efficient concurrent control
• Evaluation
• Conclusion
Octopus Architecture
[Figure: server nodes, each with NVMM and an HCA, form the Shared Persistent Memory Pool. Client 1 and Client 2 issue create(“/home/cym”) and Read(“/home/lyy”) through Self-Identified RPC and RDMA-based data I/O.]
It performs remote direct data access just like an octopus uses its eight legs.
1. Shared Persistent Memory Pool
• Existing DFSs
  • Redundant data copy
[Figure: GlusterFS data path between Client and Server through User Space Buffer, mbuf, NIC, PageCache, and FS Image; 7 copies]
[Figure: GlusterFS + DAX data path between Client and Server through User Space Buffer, mbuf, NIC, PageCache, and FS Image; 6 copies]
• Octopus with SPMP
  • Introduces the shared persistent memory pool
  • Global view of data layout
[Figure: data path with SPMP between Client and Server through User Space Buffer, mbuf, NIC, Message Pool, and FS Image; 4 copies]
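One way to picture a "global view of data layout" is an address that names both a server and an offset inside that server's memory region. The sketch below is a toy illustration under that assumption; `GlobalAddr`, `SharedPool`, and the use of in-process byte arrays in place of registered NVMM are all hypothetical and are not taken from Octopus.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GlobalAddr:
    node_id: int   # which server's region (assumed addressing scheme)
    offset: int    # byte offset inside that region

class SharedPool:
    """Toy stand-in for a shared pool: one bytearray per server node."""

    def __init__(self, num_nodes: int, region_size: int):
        self.regions = [bytearray(region_size) for _ in range(num_nodes)]

    def _check(self, addr: GlobalAddr, length: int) -> None:
        region = self.regions[addr.node_id]
        if addr.offset < 0 or addr.offset + length > len(region):
            raise ValueError("access outside the node's region")

    def write(self, addr: GlobalAddr, data: bytes) -> None:
        self._check(addr, len(data))
        region = self.regions[addr.node_id]
        region[addr.offset:addr.offset + len(data)] = data

    def read(self, addr: GlobalAddr, length: int) -> bytes:
        self._check(addr, length)
        region = self.regions[addr.node_id]
        return bytes(region[addr.offset:addr.offset + length])

pool = SharedPool(num_nodes=3, region_size=4096)
addr = GlobalAddr(node_id=1, offset=128)
pool.write(addr, b"hello")
print(pool.read(addr, 5))  # b'hello'
```

In a real deployment the read and write would be remote memory accesses rather than local slices; the point here is only that any holder of the address can name the data's location directly.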
2. Client-Active Data I/O
• Server-Active
  • Server threads process the data I/O
  • Works well for slow Ethernet
  • CPUs can easily become the bottleneck with fast hardware
• Client-Active
  • Let clients read/write data directly from/to the SPMP
[Figure: server-side timeline (NIC, CPU, MEM) serving clients C1 and C2. Server-active: look up file data, send data. Client-active: look up file data, send address.]
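The difference between the two modes can be sketched as a small simulation: in server-active mode a server thread moves the payload, while in client-active mode the server only returns an address and the client fetches the bytes itself. Everything below (the `Server` class, its counter, and `one_sided_read` as a stand-in for an RDMA read) is an illustrative assumption, not Octopus code.

```python
class Server:
    def __init__(self):
        self.pool = bytearray(1024)      # stand-in for registered memory
        self.files = {}                  # path -> (offset, length)
        self.bytes_moved_by_cpu = 0      # payload handled by server threads

    def put(self, path, offset, data):
        """Place file data in the pool (setup only, not counted)."""
        self.pool[offset:offset + len(data)] = data
        self.files[path] = (offset, len(data))

    def read_server_active(self, path):
        """Look up the file data, then the server thread sends the data."""
        offset, length = self.files[path]
        self.bytes_moved_by_cpu += length
        return bytes(self.pool[offset:offset + length])

    def lookup_client_active(self, path):
        """Look up the file data, then send only its address."""
        return self.files[path]

def one_sided_read(server, offset, length):
    """Stand-in for a client-issued RDMA read; no server thread runs."""
    return bytes(server.pool[offset:offset + length])

server = Server()
server.put("/home/lyy", 64, b"x" * 256)

via_server = server.read_server_active("/home/lyy")
offset, length = server.lookup_client_active("/home/lyy")
via_client = one_sided_read(server, offset, length)

print(via_server == via_client)       # True
print(server.bytes_moved_by_cpu)      # 256
```

Both paths return the same bytes, but only the server-active call adds to the server-side counter, which is the rebalancing the slide describes.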
3. Self-Identified Metadata RPC
• Message-based RPC
  • Easy to implement, lower throughput
  • DaRPC [SoCC’14], FaSST [OSDI’16]
• Memory-based RPC
  • CPU cores scan the message buffer
  • FaRM [NSDI’14]
• Using rdma_write_with_imm?
  • Scan by polling
  • Imm data for self-identification
[Figure: two variants of an HCA delivering requests into the message pool served by Thread 1 ... Thread n: one delivering DATA only, the other delivering DATA together with an ID]
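An RDMA write with immediate delivers a 32-bit immediate value to the receiver along with the written data. The sketch below shows how such a value could identify where a request landed, so a server thread can go straight to it instead of scanning the message pool. The 16/16 bit split, the `MessagePool` class, and the deque standing in for a completion queue are assumptions made for illustration.

```python
from collections import deque

SLOT_BITS = 16  # assumed layout: high 16 bits = client id, low 16 bits = slot

def pack_imm(client_id, slot):
    assert 0 <= client_id < (1 << 16) and 0 <= slot < (1 << SLOT_BITS)
    return (client_id << SLOT_BITS) | slot

def unpack_imm(imm):
    return imm >> SLOT_BITS, imm & ((1 << SLOT_BITS) - 1)

class MessagePool:
    def __init__(self, clients, slots):
        self.buf = [[None] * slots for _ in range(clients)]
        self.completions = deque()   # stand-in for a completion queue

    def write_with_imm(self, client_id, slot, msg):
        """Client side: place the request and attach its identity."""
        self.buf[client_id][slot] = msg
        self.completions.append(pack_imm(client_id, slot))

    def scan(self):
        """Memory-based style: inspect slots until a message is found."""
        inspected = 0
        for client_id, slots in enumerate(self.buf):
            for msg in slots:
                inspected += 1
                if msg is not None:
                    return client_id, msg, inspected
        return None

    def poll_self_identified(self):
        """Use the immediate value to locate the request directly."""
        if not self.completions:
            return None
        client_id, slot = unpack_imm(self.completions.popleft())
        msg = self.buf[client_id][slot]
        self.buf[client_id][slot] = None
        return client_id, msg

pool = MessagePool(clients=4, slots=8)
pool.write_with_imm(3, 5, b"getattr /home/cym")

print(pool.scan())                  # (3, b'getattr /home/cym', 30)
print(pool.poll_self_identified())  # (3, b'getattr /home/cym')
```

In this toy run the scan inspects 30 slots before finding the request, while the self-identified path needs a single lookup regardless of pool size.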
4. Collect-Dispatch Distributed Transaction
• mkdir, mknod operations need distributed transactions
• Two-Phase Locking & Commit
  • Distributed logging
  • Two-phase coordination
[Figure: two-phase commit timeline for Coordinator and Participant with prepare and commit phases. Steps shown on each side include LocalLock, LogBegin, TxExec, LogContext, LogCommit/Abort, Write Data, LocalUnlock, and LogEnd, with waits between phases. Cost: 2 * RPC.]
• Collect-Dispatch Transaction
  • Local logging with remote in-place update
  • Distributed lock service
[Figure: collect-dispatch timeline for Coordinator and Participant with collect and dispatch phases. Steps shown include Log Begin, Local Lock, Collect, Local Exec, Log & Update, Local Unlock, Commit/Abort, Update, Unlock, and End. Cost: 1 * RPC + 2 * one-sided.]
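The "1 * RPC + 2 * one-sided" cost can be illustrated with a toy model: one collect RPC that locks the participant and returns its data, local execution and logging on the coordinator, then two one-sided operations that write the update in place and release the lock. The class names, the log format, and the choice of what each one-sided operation does are assumptions for this sketch; the abort path and the coordinator's own lock are omitted.

```python
class Participant:
    def __init__(self, data):
        self.data = dict(data)    # stand-in for metadata in the shared pool
        self.locked = False
        self.rpcs_handled = 0     # requests that need a participant thread

    def collect(self, keys):
        """RPC: lock locally and return the values the coordinator needs."""
        self.rpcs_handled += 1
        assert not self.locked
        self.locked = True
        return {k: self.data.get(k) for k in keys}

class Coordinator:
    def __init__(self):
        self.log = []             # logging happens only on the coordinator
        self.one_sided_ops = 0

    def run(self, participant, key, update_fn):
        self.log.append("begin")
        collected = participant.collect([key])        # 1 * RPC (collect)
        new_value = update_fn(collected[key])         # local execution
        self.log.append(("update", key, new_value))   # local logging
        self.log.append("commit")
        participant.data[key] = new_value             # one-sided write (dispatch)
        self.one_sided_ops += 1
        participant.locked = False                    # one-sided unlock
        self.one_sided_ops += 1
        self.log.append("end")
        return new_value

p = Participant({"/home": ["lyy"]})
c = Coordinator()
c.run(p, "/home", lambda entries: entries + ["cym"])

print(p.data["/home"])                   # ['lyy', 'cym']
print(p.rpcs_handled, c.one_sided_ops)   # 1 2
```

Compared with the two-phase scheme on the previous slide, the participant's thread is involved once instead of twice, and it keeps no log of its own.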
Evaluation Setup
• Evaluation Platform
  Cluster   CPU           Memory   ConnectX-3 FDR   Number
  A         E5-2680 * 2   384 GB   Yes              * 5
  B         E5-2620       16 GB    Yes              * 7
  • Connected with Mellanox SX1012 switch
• Evaluated Distributed File Systems
  • memGluster: runs on memory, with RDMA connection
  • NVFS [OSU], Crail [IBM]: optimized to run on RDMA
  • memHDFS, Alluxio: for big data comparison
Overall Efficiency
[Figure: Latency Breakdown for getattr and readdir, split into software, mem, and network (axis from 75% to 100%)]
[Figure: Bandwidth Utilization for Write and Read, split into software, mem, and network (axis from 0 to 7000)]
• Software latency is reduced from 326 us to 6 us
• Achieves read/write bandwidth that approaches the raw storage and network bandwidth
Metadata Operation Performance
• Octopus provides metadata IOPS on the order of 10^5 to 10^6
• Octopus scales linearly
[Figure: MKNOD, GETATTR, and RMNOD throughput on a logarithmic axis, x-axis from 1 to 5; series: glusterfs, nvfs, crail, crail-poll, dmfs]
Read/Write Performance
• Octopus can easily reach the maximum bandwidth of the hardware with a single client
• Octopus can achieve the same bandwidth as Crail even with an extra data copy [not shown]
[Figure: WRITE and READ bandwidth on a logarithmic axis for I/O sizes from 1K to 1MB; series: glusterfs, nvfs, crail-np, crail-poll, dmfs; annotated values: 5629 MB/s and 6088 MB/s]
Big Data Evaluation
• Octopus can also provide better performance for big data applications than existing file systems.
[Figure: TestDFSIO (MB/s) for write and read, axis from 0 to 3000; series: memHDFS, Alluxio, NVFS, Crail, Octopus]
[Figure: Normalized Execution Time for Teragen and Wordcount, axis from 0.6 to 1.05; series: memHDFS, Alluxio, NVFS, Crail, Octopus]
Conclusion
• It is necessary to rethink DFS designs over emerging hardware
• Octopus’s internal mechanisms
  • Simplifies the data management layer by reducing data copies
  • Rebalances network and server loads with Client-Active I/O
  • Redesigns the metadata RPC and distributed transaction with RDMA primitives
• Evaluations show that Octopus significantly outperforms existing file systems