The Advanced Computing Systems Association - OpenEC ......Distributed Storage Systems Xiaolu Li1,...
Transcript of The Advanced Computing Systems Association - OpenEC ......Distributed Storage Systems Xiaolu Li1,...
OpenEC: Toward Unified and Configurable Erasure Coding Management in
Distributed Storage Systems
Xiaolu Li1, Runhui Li1, Patrick P. C. Lee1, Yuchong Hu2
The Chinese University of Hong Kong1
Huazhong University of Science and Technology2
USENIX FAST 2019
1
Introduction
ØFault tolerance for distributed storage is critical • Availability: data remains accessible under failures• Durability: no data loss even under failures
ØErasure coding is a promising redundancy technique • Minimum data redundancy via “data encoding” • Higher reliability with same storage redundancy than replication• Reportedly deployed in Google, Azure, Facebook
• e.g., Azure reduces redundancy from 3x (replication) to 1.33x (erasure coding) à PBs saving
2
Erasure CodingØDivide file data to k data blocksØEncode k data blocks to n-k parity blocksØDistribute the n erasure-coded blocks (coding group) to n nodes ØFault-tolerance: any k out of n blocks can recover file data
3
Nodes
(n, k) = (4, 2)
File encodedivideABCD
A+CB+DA+D
B+C+D
ABCD
A+CB+DA+D
B+C+D
ABCD
Erasure Coding
ØReed-Solomon (RS) codes are widely deployed• Storage-optimal• Generality for n and k• Drawback: high repair penalty
ØNew erasure coding solutions• Repair-optimal erasure codes
• e.g., regenerating codes [TIT’10]; locally repairable codes (LRCs) [ATC’12, PVLDB’13]; double regenerating codes (DRC) [TOS’17]
• Repair-efficient algorithms• e.g., Partial-parallel-repair (PPR) [Eurosys’16]; Repair pipelining [ATC’17]
4
Challenge
ØDeploying new erasure coding solutions in distributed storage systems (DSSs) is a daunting task• Re-engineering of DSS workflows (e.g., read/write paths)• Hard to generalize for different DSSs
ØOur past experience: • Over 4K lines-of-code change to HDFS-RAID for adding DRC [TOS’17]
ØReview of six DSSs with erasure coding support• HDFS-RAID, Hadoop 3.0 HDFS, QFS, Tahoe-LAFS, Ceph and Swift
5
Limitations of Current DSSs
ØHard to add advanced erasure codes• Existing DSSs only provide interfaces for basic encoding/decoding operations• Most DSSs do not support sub-packetization (e.g., regenerating codes)
ØHard to configure the workflows and placement of coding operations
6
Network
RN1 N2 N3 N4
Network
RN1 N2 N3 N4
Repair in fetch-and-compute manner Repair pipelining [ATC’17] cannot be readily realized
ü û
Our Contributions
ØPropose ECDAG, a directed-acyclic-graph abstraction for realizing general erasure coding solutions• Decoupling erasure coding management from DSS workflows
ØPrototype OpenEC on HDFS-RAID, Hadoop 3 HDFS, and QFS• Minimal code changes
ØExtensive experiments on local and Amazon EC2 clusters7
OpenEC: a unified and configurable framework for erasure coding management
ECDAGØ (n, k) code
• Data blocks: b0, …, bk-1
• Parity blocks: bk, …, bn-1
• Virtual blocks: bi for i ≥ n
ØAn ECDAG is a directed acyclic graph that defines either an encoding or a decoding operation• Vertex vi: block bi in a coding group• Edge ei,j: block bi is an input to the linear combination of bj
• Each edge is associated with a coding coefficient
8
ECDAG
ØECDAGs for a (5,4) code:
9
Encoding
0 1 2 3
4Decoding
1 2 3 4
0Partial-parallel repair (PPR) [Mitra, EuroSys’16]
1 2 3 4
5 6
0
ECDAG
ØECDAGs for regenerating codes [Dimakis, TIT’10] with sub-packetization• w: sub-packetization level (number of sub-blocks per block)• e.g., n=4, k=2, w=2
10
Layout
0 2 4 6
1 3 5 7
b0 b1 b2 b3Encoding
0 1 2 3
4 5 6 7Decoding
63 4 52 7
8 9 10
0 1
ECDAG Primitives
Construction of an ECDAG:
Ø Join: describes linear combination
ØBindX: co-locates coding operations at same level (i.e., x-direction)
ØBindY: co-locates coding operations across levels (i.e., y-direction)
11
ECDAG Primitives
ØEncoding of (6,4) RS code
12
ECDAG* ecdag = new ECDAG();ecdag->Join(4, {0,1,2,3}, {1,1,1,1});ecdag->Join(5, {0,1,2,3}, {1,2,4,8});int vidx = ecdag->BindX({4,5});ecdag->BindY(vidx, 0);
0 1 2 3
4 5 BindX
0 1 2 3
4 5
6BindY
0 1 2 3
4 5
6
ECDAG Primitives
ØDecoding via repair pipelining [Li, ATC’17]:• e.g., recovering the missing block 0 for (6, 4) RS code
13
ECDAG* ecdag = new ECDAG();ecdag->Join(7, {1,2}, {1,1});ecdag->BindY(7, 2);ecdag->Join(8, {7,3}, {1,1});ecdag->BindY(8, 3);ecdag->Join(9, {8,4}, {1,1});ecdag->BindY(9, 4);ecdag->Join(0, {9}, {1});
1 2
7
3
8
4
9 0
Erasure Coding Interfaces
class ECBase {int n, k, w;vector<int> ecoefs;
public:// constructing encoding ECDAGsECDAG* Encode();
// constructing decoding ECDAGsECDAG* Decode(vector<int> from, vector<int> to);
// organizing blocks in groups (e.g., racks)vector<vector<int>> Place();
}
14
OpenEC Design
15
ØController:• Manages EC metadata• Parses ECDAGs and
assigns tasks to agents• Controls block placement• Coordinates repair
ØAgent:• Performs coding operations
ØOECClient:• Interfaces between
applications and storage
OpenEC deployment on HDFS
OpenEC Design
Basic operations:ØWrites
• Online encoding• Offline encoding
ØNormal reads
ØDegraded reads
ØFull-node recovery
16
Tasks:Ø Load
• Loads an input block
ØFetch• Retrieves blocks from other agents
ØCompute• Computes a new block
ØPersist• Returns a block
Parsing an ECDAG
ØOnline encoding for (6,4) RS code• On the write path• Performed by client C
Vertices Nodes Tasksv0 C Load b0
v1 C Load b1
v2 C Load b2
v3 C Load b3
v6 C Compute b4 from {b0, b1, b2, b3} with coding coefficients {1,1,1,1};Compute b5 from {b0, b1, b2, b3} with coding coefficients {1,2,4,8};
v4 C -v5 C -- C Persist b0; Persist b1; Persist b2;
Persist b3; Persist b4; Persist b5;17
0 1 2 3
4 5
6
Parsing an ECDAG
18
ØOffline encoding for (6,4) RS code• Blocks 0-3 are in
nodes 0-3• Performed by different
nodes
Vertices Nodes Tasksv0 N0 Load b0
v1 N1 Load b1
v2 N2 Load b2
v3 N3 Load b3
v6 N0 Fetch b1 from N1Fetch b2 from N2Fetch b3 from N3Compute b4 from {b0, b1, b2, b3} with coding coefficients {1,1,1,1};Compute b5 from {b0, b1, b2, b3} with coding coefficients {1,2,4,8};
v4 N4 Fetch b4 from N0; Persist b4
v5 N5 Fetch b5 from N0; Persist b5
0 1 2 3
4 5
6
Automated Optimizations
ØAutomated BindX and BindY• Examines subgraph structures and calls BindX and BindY automatically
ØHierarchy awareness
19
Pipelining
1 2 3 4
0
5 6 1 23 45 6
7 8 9 10 11 0
OpenEC Implementation
ØMiddleware layer (7000+ lines-of-code)• Coding operations in units of packets• Intel ISA-L for erasure coding• Redis for communications
Ø Integration with existing distributed storage systems• HDFS-RAID• Hadoop 3.0 HDFS• QFS (see technical report)
ØEach integration only makes ≤ 450 lines-of-code changes• Changes include: (1) interfacing with systems, (2) block placement
20
Experiments
Ø Local cluster• 16 machines• Quad-core 3.4 GHz Intel CPU• 16 GiB RAM• 10 Gb/s network
ØAmazon EC2• Up to 30 instances• m5.xlarge instances• 10 Gb/s network
21
Basic Operations in Local Cluster
ØOpenEC preserves original HDFS performance
ØOpenEC achieves much faster offline encoding than HDFS-RAID with a simpler workflow
22
827.
584
4 1185
.410
50.4
1132
.210
39
0
500
1000
1500
2000
Write NormalRead
DegradedRead
Thpt
(MiB
/s)
HDFS-3 OpenEC
1484
.5
624.
9
308.
3
285.
6
0
500
1000
1500
2000
OfflineEncoding
Full-nodeRecovery
Thpt
(MiB
/s)
HDFS-RAIDOpenEC
Comparisons with HDFS-3 Comparisons with HDFS-RAID
Comparisons with Native Coding (without I/O)
ØECDAG coding computations are slower than ISA-L • 29-38% lower in encoding; 0.6-3.15% lower in decoding
ØRemains much faster than I/O; limited overhead overall23
5176
.47296
.5
3659
.659
72.1
2746
.74373
.1
2617
.64092
.2
0
2000
4000
6000
8000
(6,4) (9,6) (12,8)(14,10)
Thpt
(MiB
/s)
NativeOpenEC
2751
2840
.6
1876
.619
13.5
1424
1443
.9
1085
.110
91.6
0
1000
2000
3000
(6,4) (9,6) (12,8)(14,10)
Thpt
(MiB
/s)
NativeOpenEC
Encoding Decoding
Support of Erasure Coding Designs
24
36.5
18.4
216.
9
123.
9
0
50
100
150
200
250
1 Gbps 10 Gbps
Thpt
(MiB
/s)
RS (9, 6)LRC(10, 6)
58.4
39.1
27.4
201.
321
6.9
160.
6
0
100
200
300
400
500
1 Gbps 10 Gbps
Thpt
(MiB
/s)
RS (6, 4)Butterfly (6, 4)MISER (8, 4)
101.
135
.418
.4
322.
720
0.2
123.
9
0
100
200
300
400
500
1 Gbps 10 Gbps
Thpt
(MiB
/s)
RS (9, 6)PPR (9, 6)Pipeline (9, 6) 54
.7
36.8
55.6
27.8
0
20
40
60
80
(6, 4) (9, 6)
Thpt
(MiB
/s)
RS DRC
LRC Regenerating codes
Repair algorithms
DRC
ØComparisons with six state-of-the-art erasure coding designs
ØOpenEC’s performance conforms to the theoretical gains in network-bound environments
Automated Optimizations
ØAutomated ECDAG customization for a hierarchical topology
ØUp to 82% repair throughput gain
25
100.
855
.1
96.6
42.3
74.1
36.7
0
40
80
120
(8,6) (10,8) (12,10)
Thpt
(MiB
/s)
DefaultPipeline
Scalability in Amazon EC2
ØOpenEC scales well with number of instances
26
3221
.121
52.1
976.
5
7270
4907
.225
38.3
7235
.450
78.9
2529
.80
2000
4000
6000
8000
Write NormalRead
DegradedRead
Thpt
(MiB
/s)
N=10N=20N=30
2195
.518
25.7
1188
.9
541
425.
916
7.2
0
500
1000
1500
2000
2500
O ineEncoding
Full-nodeRecovery
Thpt
(MiB
/s)
N=10N=20N=30
Online Encoding Offline Encoding
Conclusions
ØOpenEC is a unified and configurable framework for flexible erasure coding management
ØFuture work:• Integration with more systems (e.g., Ceph, Swift)• Combined with software-defined storage for better configurability
ØSource code:• http://adslab.cse.cuhk.edu.hk/software/openec
27
Backup
28
Question
ØHow to construct decoding ECDAGs for different combinations of lost blocks?
ØThe Decode() function should construct different decoding ECDAGs for two cases:• Decoding one lost block: uses any repair-efficient approach• Decoding multiple lost blocks: picks the first k available blocks
29
Question
ØWhat happens if there is a failure during repair?
ØWe assume that OpenEC restarts the repair process by connecting to the new set of available nodes.
30
Question
ØWhat types of codes are supported or not supported?
ØSupported: • Linear codes (e.g., RS codes, regenerating codes, LRC)
ØNot supported:• Non-linear codes• Sector-disk codes
31
Question
ØPerformance of automated BindX and BindY?
32
290.
528
6.8
179.
5 316.
729
8.4
183
336
323.
618
7
0
200
400
600
(8,6) (10,8) (12,10)
Thpt
(MiB
/s)
No OptimizationBindXBindX & BindY (default)
Question
ØPerformance in QFS
33
349.
934
6.7 53
3.2
515.
8
517.
949
2.3
0
200
400
600
800
Write NormalRead
DegradedRead
Thpt
(MiB
/s)
QFS OpenEC
594.
260
0.4
505.
651
0.9
506
510
0
200
400
600
800
Write NormalRead
DegradedRead
Thpt
(MiB
/s)
QFS OpenEC
Single Client Multiple Clients