The Advanced Computing Systems Association - OpenEC ......Distributed Storage Systems Xiaolu Li1,...


OpenEC: Toward Unified and Configurable Erasure Coding Management in Distributed Storage Systems

Xiaolu Li1, Runhui Li1, Patrick P. C. Lee1, Yuchong Hu2

The Chinese University of Hong Kong1

Huazhong University of Science and Technology2

USENIX FAST 2019


Introduction

➢ Fault tolerance for distributed storage is critical
  • Availability: data remains accessible under failures
  • Durability: no data loss even under failures

➢ Erasure coding is a promising redundancy technique
  • Minimum data redundancy via “data encoding”
  • Higher reliability than replication at the same storage redundancy
  • Reportedly deployed at Google, Azure, and Facebook
    • e.g., Azure reduces redundancy from 3x (replication) to 1.33x (erasure coding) → PBs of savings


Erasure Coding

➢ Divide file data into k data blocks
➢ Encode the k data blocks into n-k parity blocks
➢ Distribute the n erasure-coded blocks (a coding group) to n nodes
➢ Fault tolerance: any k out of n blocks can recover the file data

[Figure: erasure coding with (n, k) = (4, 2). A file is divided into A, B, C, D and encoded into A+C, B+D, A+D, and B+C+D; the resulting blocks are distributed across four nodes.]
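The (4, 2) example can be checked with a small sketch. It assumes that "+" denotes XOR and that the four nodes hold (A, B), (C, D), (A+C, B+D), and (A+D, B+C+D); this pairing of symbols to nodes is one reading of the figure, and the names `File`, `encode`, and `decode` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <utility>

// Four file symbols, one byte each for brevity.
struct File { uint8_t A, B, C, D; };

// Each node stores two symbols. Assumed layout (not confirmed by the slides):
// N0 = (A, B), N1 = (C, D), N2 = (A+C, B+D), N3 = (A+D, B+C+D), with + as XOR.
using Node = std::pair<uint8_t, uint8_t>;

std::array<Node, 4> encode(const File& f) {
    return {{ {f.A, f.B},
              {f.C, f.D},
              {uint8_t(f.A ^ f.C), uint8_t(f.B ^ f.D)},
              {uint8_t(f.A ^ f.D), uint8_t(f.B ^ f.C ^ f.D)} }};
}

// Recovers the file from any two surviving nodes i < j.
File decode(const std::array<Node, 4>& n, int i, int j) {
    uint8_t A = 0, B = 0, C = 0, D = 0;
    if (i == 0) { A = n[0].first; B = n[0].second; }
    if (i == 1 || j == 1) { C = n[1].first; D = n[1].second; }
    if (i == 0 && j == 2) { C = n[2].first ^ A; D = n[2].second ^ B; }
    else if (i == 0 && j == 3) { D = n[3].first ^ A; C = n[3].second ^ B ^ D; }
    else if (i == 1 && j == 2) { A = n[2].first ^ C; B = n[2].second ^ D; }
    else if (i == 1 && j == 3) { A = n[3].first ^ D; B = n[3].second ^ C ^ D; }
    else if (i == 2 && j == 3) {
        uint8_t cd = n[2].first ^ n[3].first;  // (A+C)+(A+D) = C+D
        B = n[3].second ^ cd;                  // (B+C+D)+(C+D) = B
        D = n[2].second ^ B;                   // (B+D)+B = D
        A = n[3].first ^ D;                    // (A+D)+D = A
        C = n[2].first ^ A;                    // (A+C)+A = C
    }
    return File{A, B, C, D};
}
```

Calling `decode(encode(f), i, j)` for each of the six node pairs returns the original symbols, which is the "any k out of n" property for k = 2 and n = 4.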

Erasure Coding

➢ Reed-Solomon (RS) codes are widely deployed
  • Storage-optimal
  • Generality for n and k
  • Drawback: high repair penalty

➢ New erasure coding solutions
  • Repair-optimal erasure codes
    • e.g., regenerating codes [TIT’10]; locally repairable codes (LRCs) [ATC’12, PVLDB’13]; double regenerating codes (DRC) [TOS’17]
  • Repair-efficient algorithms
    • e.g., partial-parallel repair (PPR) [EuroSys’16]; repair pipelining [ATC’17]


Challenge

➢ Deploying new erasure coding solutions in distributed storage systems (DSSs) is a daunting task
  • Re-engineering of DSS workflows (e.g., read/write paths)
  • Hard to generalize for different DSSs

➢ Our past experience:
  • Over 4K lines-of-code changes to HDFS-RAID for adding DRC [TOS’17]

➢ Review of six DSSs with erasure coding support
  • HDFS-RAID, Hadoop 3.0 HDFS, QFS, Tahoe-LAFS, Ceph, and Swift


Limitations of Current DSSs

➢ Hard to add advanced erasure codes
  • Existing DSSs only provide interfaces for basic encoding/decoding operations
  • Most DSSs do not support sub-packetization (e.g., regenerating codes)

➢ Hard to configure the workflows and placement of coding operations

[Figure: two repair workflows over a network connecting R and nodes N1-N4. ✓ Repair in a fetch-and-compute manner. ✗ Repair pipelining [ATC’17] cannot be readily realized.]

Our Contributions

➢ Propose ECDAG, a directed-acyclic-graph abstraction for realizing general erasure coding solutions
  • Decoupling erasure coding management from DSS workflows

➢ Prototype OpenEC on HDFS-RAID, Hadoop 3 HDFS, and QFS
  • Minimal code changes

➢ Extensive experiments on local and Amazon EC2 clusters

OpenEC: a unified and configurable framework for erasure coding management


ECDAG

➢ (n, k) code
  • Data blocks: b_0, …, b_{k-1}
  • Parity blocks: b_k, …, b_{n-1}
  • Virtual blocks: b_i for i ≥ n

➢ An ECDAG is a directed acyclic graph that defines either an encoding or a decoding operation
  • Vertex v_i: block b_i in a coding group
  • Edge e_{i,j}: block b_i is an input to the linear combination that forms b_j
    • Each edge is associated with a coding coefficient
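As an illustration of this definition, the following is a minimal sketch of a graph structure that stores, for each vertex, its input vertices and the coefficient on each edge. `ToyECDAG` and its members are hypothetical names for this sketch and are not OpenEC's actual ECDAG class.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal ECDAG representation (illustration only).
struct ToyECDAG {
    // For each vertex j: the list of (input vertex i, coefficient on edge e_{i,j}).
    std::map<int, std::vector<std::pair<int, int>>> inputs;

    // Adds edge e_{from,to} with the given coding coefficient.
    void addEdge(int from, int to, int coef) {
        inputs[to].push_back({from, coef});
    }

    // A vertex with no incoming edges is a block that is read rather than computed.
    bool isLeaf(int v) const { return inputs.find(v) == inputs.end(); }

    // Classifies block index idx for an (n, k) code, following the slide's convention.
    static std::string kind(int idx, int n, int k) {
        if (idx < k) return "data";
        if (idx < n) return "parity";
        return "virtual";
    }
};
```

For instance, an encoding ECDAG for a (5, 4) code with a single parity block would add the four edges 0→4, 1→4, 2→4, and 3→4; the coefficient values depend on the code and are left to the caller here.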


ECDAG

➢ ECDAGs for a (5,4) code:

[Figure: three ECDAGs. Encoding: vertices 0, 1, 2, 3 → vertex 4. Decoding: vertices 1, 2, 3, 4 → vertex 0. Partial-parallel repair (PPR) [Mitra, EuroSys’16]: vertices 1, 2, 3, 4 → vertices 5, 6 → vertex 0.]

ECDAG

➢ ECDAGs for regenerating codes [Dimakis, TIT’10] with sub-packetization
  • w: sub-packetization level (number of sub-blocks per block)
  • e.g., n=4, k=2, w=2

[Figure: three panels. Layout: blocks b0, b1, b2, b3 hold sub-blocks (0, 1), (2, 3), (4, 5), (6, 7). Encoding: vertices 0-3 → vertices 4-7. Decoding: vertices 2-7 → vertices 8, 9, 10 → vertices 0, 1.]
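In the n=4, k=2, w=2 layout, block b0 holds sub-blocks 0 and 1, b1 holds 2 and 3, and so on. A minimal sketch of that index mapping follows, assuming sub-blocks are numbered consecutively per block; the helper names `subBlocksOf` and `blockOf` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Sub-block vertex indices held by a block when each block has w sub-blocks.
// Assumes consecutive numbering: block b holds b*w, ..., b*w + w - 1.
std::vector<int> subBlocksOf(int block, int w) {
    std::vector<int> ids;
    for (int j = 0; j < w; ++j) ids.push_back(block * w + j);
    return ids;
}

// Inverse mapping: the block that sub-block vertex v belongs to.
// Only meaningful for v < n*w; larger indices denote virtual vertices.
int blockOf(int v, int w) { return v / w; }
```

With w = 2, `subBlocksOf(3, 2)` gives {6, 7}, matching the b3 column of the layout.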

ECDAG Primitives

Construction of an ECDAG:

➢ Join: describes a linear combination
➢ BindX: co-locates coding operations at the same level (i.e., x-direction)
➢ BindY: co-locates coding operations across levels (i.e., y-direction)


ECDAG Primitives

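➢ Encoding of (6,4) RS code

ECDAG* ecdag = new ECDAG();
// b4 and b5 are linear combinations of b0..b3 with the listed coefficients
ecdag->Join(4, {0,1,2,3}, {1,1,1,1});
ecdag->Join(5, {0,1,2,3}, {1,2,4,8});
// compute vertices 4 and 5 together; BindX returns the index of the binding vertex
int vidx = ecdag->BindX({4,5});
// co-locate the bound computation with vertex 0
ecdag->BindY(vidx, 0);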

[Figure: the encoding ECDAG (vertices 0-3 → vertices 4, 5) shown before binding, after BindX (vertex 6 added), and after BindY.]
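Each Join in the (6,4) encoding example describes a linear combination of blocks. The sketch below shows what such a combination could compute byte by byte, assuming arithmetic in GF(2^8) with the reduction polynomial 0x11d; the field and polynomial are assumptions for illustration, and `gfMul` and `linearCombine` are hypothetical helpers rather than OpenEC functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Multiplication in GF(2^8), reducing by x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
// The polynomial is an assumption made for this sketch.
uint8_t gfMul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;        // add a if the lowest bit of b is set
        bool carry = a & 0x80;    // will a overflow 8 bits when doubled?
        a <<= 1;
        if (carry) a ^= 0x1d;     // reduce modulo 0x11d
        b >>= 1;
    }
    return p;
}

using Block = std::vector<uint8_t>;

// Computes sum_i coefs[i] * parents[i], with addition as XOR.
// All parent blocks are assumed to have the same length.
Block linearCombine(const std::vector<Block>& parents,
                    const std::vector<uint8_t>& coefs) {
    Block out(parents.at(0).size(), 0);
    for (std::size_t i = 0; i < parents.size(); ++i)
        for (std::size_t x = 0; x < out.size(); ++x)
            out[x] ^= gfMul(coefs[i], parents[i][x]);
    return out;
}
```

With one-byte blocks b0..b3 = 1, 2, 3, 4, the coefficients {1,1,1,1} give b4 = 4 and the coefficients {1,2,4,8} give b5 = 41 under these assumptions.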

ECDAG Primitives

➢ Decoding via repair pipelining [Li, ATC’17]:
  • e.g., recovering the missing block 0 for (6, 4) RS code

ECDAG* ecdag = new ECDAG();
// each partial result is computed at the node of the next block in the chain
ecdag->Join(7, {1,2}, {1,1});
ecdag->BindY(7, 2);
ecdag->Join(8, {7,3}, {1,1});
ecdag->BindY(8, 3);
ecdag->Join(9, {8,4}, {1,1});
ecdag->BindY(9, 4);
// the final partial result is the repaired block 0
ecdag->Join(0, {9}, {1});

[Figure: the pipelined decoding ECDAG: vertices 1, 2 → 7; vertices 7, 3 → 8; vertices 8, 4 → 9; vertex 9 → 0.]
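The chain of Joins in the repair pipelining example can be simulated locally. The sketch below assumes the parity b4 is the XOR of b0..b3 (coefficients {1,1,1,1}), so that b0 = b1 + b2 + b3 + b4; the function names are hypothetical and no networking is modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Block = std::vector<uint8_t>;

// Byte-wise XOR of two equally sized blocks.
Block xorBlocks(const Block& a, const Block& b) {
    Block out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] ^ b[i];
    return out;
}

// Repairs b0 by accumulating partial results along the chain
// (1, 2) -> 7, (7, 3) -> 8, (8, 4) -> 9, 9 -> 0.
Block pipelinedRepair(const Block& b1, const Block& b2,
                      const Block& b3, const Block& b4) {
    Block v7 = xorBlocks(b1, b2);  // would run at the node holding b2
    Block v8 = xorBlocks(v7, b3);  // would run at the node holding b3
    Block v9 = xorBlocks(v8, b4);  // would run at the node holding b4
    return v9;                     // vertex 0 takes v9 with coefficient 1
}
```

Each step forwards one partial block to the next node, so no single node has to fetch all surviving blocks at once.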

Erasure Coding Interfaces

#include <vector>
using std::vector;

class ECDAG;  // defined elsewhere

class ECBase {
  int n, k, w;
  vector<int> ecoefs;

public:
  // constructing encoding ECDAGs
  ECDAG* Encode();

  // constructing decoding ECDAGs
  ECDAG* Decode(vector<int> from, vector<int> to);

  // organizing blocks in groups (e.g., racks)
  vector<vector<int>> Place();
};
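To make the Encode/Decode/Place shape concrete, here is a sketch of a single-parity (k+1, k) code written against a stand-in graph type. `ToyECDAG` and `ToySingleParity` are hypothetical; the sketch returns graphs by value and places each block in its own group, both of which are simplifying assumptions rather than OpenEC behavior.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-in for ECDAG that only records Join calls.
struct ToyECDAG {
    struct JoinOp { int child; std::vector<int> parents; std::vector<int> coefs; };
    std::vector<JoinOp> joins;
    void Join(int child, std::vector<int> parents, std::vector<int> coefs) {
        joins.push_back({child, std::move(parents), std::move(coefs)});
    }
};

// Single-parity (k+1, k) code with the same three entry points as the interface.
class ToySingleParity {
    int n, k;
public:
    explicit ToySingleParity(int k_) : n(k_ + 1), k(k_) {}

    // Parity block k is the XOR (all coefficients 1) of data blocks 0..k-1.
    ToyECDAG Encode() const {
        ToyECDAG dag;
        std::vector<int> data;
        for (int i = 0; i < k; ++i) data.push_back(i);
        dag.Join(k, data, std::vector<int>(k, 1));
        return dag;
    }

    // from: available block indices; to: lost block indices.
    // A single lost block is the XOR of the k remaining blocks;
    // anything else is not repairable and yields an empty graph.
    ToyECDAG Decode(const std::vector<int>& from, const std::vector<int>& to) const {
        ToyECDAG dag;
        if (to.size() == 1 && static_cast<int>(from.size()) == k)
            dag.Join(to[0], from, std::vector<int>(from.size(), 1));
        return dag;
    }

    // Assumed placement: every block in its own group.
    std::vector<std::vector<int>> Place() const {
        std::vector<std::vector<int>> groups;
        for (int i = 0; i < n; ++i) groups.push_back({i});
        return groups;
    }
};
```

A code with sub-packetization or rack-aware placement would differ mainly in how many vertices Encode creates and how Place groups them.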


OpenEC Design

➢ Controller:
  • Manages EC metadata
  • Parses ECDAGs and assigns tasks to agents
  • Controls block placement
  • Coordinates repair

➢ Agent:
  • Performs coding operations

➢ OECClient:
  • Interfaces between applications and storage

[Figure: OpenEC deployment on HDFS]

OpenEC Design

Basic operations:

➢ Writes
  • Online encoding
  • Offline encoding
➢ Normal reads
➢ Degraded reads
➢ Full-node recovery

Tasks:

➢ Load
  • Loads an input block
➢ Fetch
  • Retrieves blocks from other agents
➢ Compute
  • Computes a new block
➢ Persist
  • Returns a block

Parsing an ECDAG

➢ Online encoding for (6,4) RS code
  • On the write path
  • Performed by client C

Vertices  Nodes  Tasks
v0        C      Load b0
v1        C      Load b1
v2        C      Load b2
v3        C      Load b3
v6        C      Compute b4 from {b0, b1, b2, b3} with coding coefficients {1,1,1,1};
                 Compute b5 from {b0, b1, b2, b3} with coding coefficients {1,2,4,8}
v4        C      -
v5        C      -
-         C      Persist b0; Persist b1; Persist b2; Persist b3; Persist b4; Persist b5

[Figure: the (6,4) RS encoding ECDAG with vertices 0-3, 4-5, and 6.]

Parsing an ECDAG

➢ Offline encoding for (6,4) RS code
  • Blocks 0-3 are in nodes 0-3
  • Performed by different nodes

Vertices  Nodes  Tasks
v0        N0     Load b0
v1        N1     Load b1
v2        N2     Load b2
v3        N3     Load b3
v6        N0     Fetch b1 from N1; Fetch b2 from N2; Fetch b3 from N3;
                 Compute b4 from {b0, b1, b2, b3} with coding coefficients {1,1,1,1};
                 Compute b5 from {b0, b1, b2, b3} with coding coefficients {1,2,4,8}
v4        N4     Fetch b4 from N0; Persist b4
v5        N5     Fetch b5 from N0; Persist b5

[Figure: the (6,4) RS encoding ECDAG with vertices 0-3, 4-5, and 6.]
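The offline encoding table for the (6,4) RS code can be reproduced by a small task generator. The sketch below assumes block i resides on node i and that the bound compute vertex is co-located with node 0; `Task` and `offlineEncodingTasks` are hypothetical names, and real task assignment in OpenEC is performed by the controller.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One task assigned to a node, described as text for readability.
struct Task { int node; std::string action; };

// Emits Load / Fetch / Compute / Persist tasks for offline encoding of an
// (n, k) code, assuming block i is on node i and computation runs on node 0.
std::vector<Task> offlineEncodingTasks(int n, int k) {
    std::vector<Task> tasks;
    // Every data block is loaded where it is stored.
    for (int i = 0; i < k; ++i)
        tasks.push_back({i, "Load b" + std::to_string(i)});
    // Node 0 fetches the data blocks it does not hold.
    for (int i = 1; i < k; ++i)
        tasks.push_back({0, "Fetch b" + std::to_string(i) + " from N" + std::to_string(i)});
    // Node 0 computes all parity blocks.
    for (int p = k; p < n; ++p)
        tasks.push_back({0, "Compute b" + std::to_string(p)});
    // Each parity node fetches its block from node 0 and persists it.
    for (int p = k; p < n; ++p) {
        tasks.push_back({p, "Fetch b" + std::to_string(p) + " from N0"});
        tasks.push_back({p, "Persist b" + std::to_string(p)});
    }
    return tasks;
}
```

For n = 6 and k = 4 this yields 13 tasks: four loads, three fetches and two computes on N0, and a fetch plus a persist on each of N4 and N5.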

Automated Optimizations

➢ Automated BindX and BindY
  • Examines subgraph structures and calls BindX and BindY automatically

➢ Hierarchy awareness

[Figure: example ECDAGs, including a panel labeled "Pipelining".]

OpenEC Implementation

➢ Middleware layer (7000+ lines-of-code)
  • Coding operations in units of packets
  • Intel ISA-L for erasure coding
  • Redis for communications

➢ Integration with existing distributed storage systems
  • HDFS-RAID
  • Hadoop 3.0 HDFS
  • QFS (see technical report)

➢ Each integration only makes ≤ 450 lines-of-code changes
  • Changes include: (1) interfacing with systems, (2) block placement


Experiments

➢ Local cluster
  • 16 machines
  • Quad-core 3.4 GHz Intel CPU
  • 16 GiB RAM
  • 10 Gb/s network

➢ Amazon EC2
  • Up to 30 instances
  • m5.xlarge instances
  • 10 Gb/s network


Basic Operations in Local Cluster

➢ OpenEC preserves original HDFS performance

➢ OpenEC achieves much faster offline encoding than HDFS-RAID with a simpler workflow

[Figure: two bar charts of throughput (MiB/s). Comparisons with HDFS-3: write, normal read, and degraded read for HDFS-3 vs. OpenEC. Comparisons with HDFS-RAID: offline encoding and full-node recovery for HDFS-RAID vs. OpenEC.]

Comparisons with Native Coding (without I/O)

➢ ECDAG coding computations are slower than ISA-L
  • 29-38% lower throughput in encoding; 0.6-3.15% lower in decoding

➢ Remains much faster than I/O; limited overhead overall

[Figure: two bar charts of throughput (MiB/s) for (6,4), (9,6), (12,8), and (14,10), Native vs. OpenEC. Panels: Encoding, Decoding.]

Support of Erasure Coding Designs

➢ Comparisons with six state-of-the-art erasure coding designs

➢ OpenEC’s performance conforms to the theoretical gains in network-bound environments

[Figure: four bar charts of throughput (MiB/s). LRC: RS (9, 6) vs. LRC (10, 6) at 1 Gbps and 10 Gbps. Regenerating codes: RS (6, 4), Butterfly (6, 4), and MISER (8, 4) at 1 Gbps and 10 Gbps. Repair algorithms: RS (9, 6), PPR (9, 6), and Pipeline (9, 6) at 1 Gbps and 10 Gbps. DRC: RS vs. DRC for (6, 4) and (9, 6).]

Automated Optimizations

➢ Automated ECDAG customization for a hierarchical topology

➢ Up to 82% repair throughput gain

[Figure: bar chart of throughput (MiB/s) for (8,6), (10,8), and (12,10), Default vs. Pipeline.]

Scalability in Amazon EC2

➢ OpenEC scales well with the number of instances

[Figure: two bar charts of throughput (MiB/s) for N=10, N=20, and N=30 instances. Online Encoding: write, normal read, and degraded read. Offline Encoding: offline encoding and full-node recovery.]

Conclusions

➢ OpenEC is a unified and configurable framework for flexible erasure coding management

➢ Future work:
  • Integration with more systems (e.g., Ceph, Swift)
  • Combination with software-defined storage for better configurability

➢ Source code:
  • http://adslab.cse.cuhk.edu.hk/software/openec


Backup


Question

➢ How to construct decoding ECDAGs for different combinations of lost blocks?

➢ The Decode() function should construct different decoding ECDAGs for two cases:
  • Decoding one lost block: uses any repair-efficient approach
  • Decoding multiple lost blocks: picks the first k available blocks
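The multiple-lost-blocks case can be sketched as a simple selection over block indices. `firstKAvailable` is a hypothetical helper for illustration; the single-lost-block case would instead dispatch to whichever repair-efficient construction the code provides.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the first k available block indices of an (n, k) coding group,
// or an empty vector if fewer than k blocks survive.
std::vector<int> firstKAvailable(int n, int k, const std::vector<int>& lost) {
    std::vector<int> chosen;
    for (int i = 0; i < n && static_cast<int>(chosen.size()) < k; ++i) {
        // Keep block i only if it is not in the lost list.
        if (std::find(lost.begin(), lost.end(), i) == lost.end())
            chosen.push_back(i);
    }
    if (static_cast<int>(chosen.size()) < k) chosen.clear();
    return chosen;
}
```

For a (6, 4) group that has lost blocks 0 and 2, the selection is blocks 1, 3, 4, and 5.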


Question

➢ What happens if there is a failure during repair?

➢ We assume that OpenEC restarts the repair process by connecting to the new set of available nodes.

Question

➢ What types of codes are supported or not supported?

➢ Supported:
  • Linear codes (e.g., RS codes, regenerating codes, LRC)

➢ Not supported:
  • Non-linear codes
  • Sector-disk codes

Question

➢ Performance of automated BindX and BindY?

[Figure: bar chart of throughput (MiB/s) for (8,6), (10,8), and (12,10): No Optimization, BindX, and BindX & BindY (default).]

Question

➢ Performance in QFS

[Figure: two bar charts of throughput (MiB/s) for write, normal read, and degraded read, QFS vs. OpenEC. Panels: Single Client, Multiple Clients.]