2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD...

36
TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State University {lu.932, shi.876}@osu.edu 2020 OFA Virtual Workshop

Transcript of 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD...

Page 1: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD

PARADIGM BASED ON TRIPARTITE GRAPH MODEL

Xiaoyi Lu and Haiyang Shi

The Ohio State University

{lu.932, shi.876}@osu.edu

2020 OFA Virtual Workshop

Page 2: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

CHALLENGES OF DATA EXPLOSION

2 © OpenFabrics Alliance

0

20

40

60

80

100

120

140

160

180

200

0.00

0.20

0.40

0.60

0.80

1.00

Year

Disk Drive Cost and Annual Data Size with Time

Hard Disk Drives

Solid State Drives

Annual Data Size

Pri

ce (

$/G

B)

Dat

a Si

ze (

ZB)

Disk Drive Prices (1955-2019), https://jcmit.net/diskprice.htmData Age 2025, sponsored by Seagate with data from IDC Global DataSphere, Nov 2018

Page 3: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

RETHINKING OF DISTRIBUTED STORAGE SYSTEMS

3 © OpenFabrics Alliance

▪ No Fault Tolerance

❖High performance

❖Not reliable

▪ N-way Replication

• Tolerate up to 𝑛 − 1 node failures

• Performance degradation

• Large storage overhead Perf

orm

ance

Storage Overhead

No FT

Replication

1 𝑛

Erasure Coding (EC)

Page 4: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

EC BASICS - REED-SOLOMON CODE

4 © OpenFabrics Alliance

Page 5: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

REED-SOLOMON (RS) CODE

5 © OpenFabrics Alliance

▪ 𝑅𝑆(𝑘,𝑚)

• Widely used (RAID, Microsoft Azure, HDFS 3.x)

• Encodes on 𝑘 data chunks to generate 𝑚parity chunks

• Decodes on any 𝑘 data/parity chunks to recover the original data

▪ Storage Overhead of Common Codes

• RS(4,2) => 1.5x overhead

• RS(6,3) => 1.5x overhead

Reed-Solomon EC for k=4 and m=2

01 00 00 00

00 01 00 00

00 00 01 00

00 00 00 01

1B 1C 12 14

1C 1B 14 12

A B C D

E F G H

I J K L

M N O P

A B C D

E F G H

I J K L

M N O P

51 52 53 49

55 56 57 25Original Data

Generator

Matrix

Encoded Data

× =Data Chunks

Parity Chunks

Page 6: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

REED-SOLOMON (RS) CODE

6 © OpenFabrics Alliance

Reed-Solomon EC for k=4 and m=2

01 00 00 00

00 01 00 00

00 00 01 00

00 00 00 01

1B 1C 12 14

1C 1B 14 12

A B C D

E F G H

I J K L

M N O P

A B C D

E F G H

I J K L

M N O P

51 52 53 49

55 56 57 25Original Data

Generator

Matrix

Encoded Data

× =

▪ 𝑅𝑆(𝑘,𝑚)

• Widely used (RAID, Microsoft Azure, HDFS 3.x)

• Encodes on 𝑘 data chunks to generate 𝑚parity chunks

• Decodes on any 𝑘 data/parity chunks to recover the original data

▪ Storage Overhead of Common Codes

• RS(4,2) => 1.5x overhead vs. 3x in 3-way replication

• RS(6,3) => 1.5x overhead vs. 4x in 4-way replication

Page 7: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

REED-SOLOMON (RS) CODE

7 © OpenFabrics Alliance

Reed-Solomon EC for k=4 and m=2

01 00 00 00

00 01 00 00

1B 1C 12 14

1C 1B 14 12

A B C D

E F G H

I J K L

M N O P

A B C D

E F G H

51 52 53 49

55 56 57 25

Original Data Survived Data

× =

01 00 00 00

00 01 00 00

8D F6 7B 01

F6 8D 01 7B

×

01 00 00 00

00 01 00 00

8D F6 7B 01

F6 8D 01 7B

×

Inverse Matrix

Inverse Matrix

▪ 𝑅𝑆(𝑘,𝑚)

• Widely used (RAID, Microsoft Azure, HDFS 3.x)

• Encodes on 𝑘 data chunks to generate 𝑚parity chunks

• Decodes on any 𝑘 data/parity chunks to recover the original data

▪ Storage Overhead of Common Codes

• RS(4,2) => 1.5x overhead vs. 3x in 3-way replication

• RS(6,3) => 1.5x overhead vs. 4x in 4-way replication

Page 8: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

EC: THE ALTERNATIVE RESILIENCE TECHNIQUE

8 © OpenFabrics Alliance

▪ Exploiting EC will introduce both computation overhead and

communication overhead

Client

Write(Encoding) Read with Erasures (Decoding)

ClientComputation Computation

Communication Communication

Are there any good approaches to improve EC performance?

Page 9: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

EC OFFLOAD ON SMARTNICS

9 © OpenFabrics Alliance

Page 10: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

EC OFFLOAD ON SMARTNICS

10 © OpenFabrics Alliance

▪ Promising Capability on Modern SmartNICs

❖Mellanox InfiniBand ConnectX-4 and later

• About 100 supercomputers in latest TOP500 are equipped with

Mellanox InfiniBand NICs

Page 11: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

OVERVIEW EC CAPABILITIES ON MODERN SMARTNICS

11 © OpenFabrics Alliance

Initiator ReceiverGroup

MemoryAccessbyCPU (R)DMA

PostSend/EC

Send(D1)Send(D2)Send(D3)

EC(D1-3)

Send(P1)

Send(P2)

PostECSend

CPU MEM NIC NICs MEMs CPUs

Initiator ReceiverGroup

ECSend

(D1-3)

CPU MEM NIC NICs MEMs CPUs

HardwareACK

EC

Incoherent and Coherent EC Calculation and Networking

▪ Coherent EC Calculation

and Networking

❖Less CPU involvement

❖Less DMA operations

Page 12: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

RETHINKING TRADITIONAL EC PARADIGM ON SMARTNICS

12 © OpenFabrics Alliance

𝑁𝑦𝑥 Node y in layer x

Exclusive OR

Erasure Coder

Data Chunk

Parity Chunk

IntermediateChunk

EC𝑁11

𝑁𝑎1

𝑁𝑏1

𝑁𝑐1

𝑁𝑑1

𝑁𝑒1

𝑁𝑓1

𝑁𝑎2

𝑁𝑏2

𝑁𝑐2

𝑁𝑑2

𝑁𝐼𝐶

𝑁12

𝑁22

𝑁𝑘2

𝑁𝑘+12

𝑁𝑘+𝑚2

…k … EC…

k

m

• Encoding Procedure❖Collect 𝑘 data chunks

❖Post encode-and-send

• One set of nodes (e.g., clients) perform EC encoding calculations

• The other set of nodes (e.g., data nodes) receive encoded chunks

• We name it as Bipartite Graph Based EC (BiEC) Paradigm

Page 13: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

RETHINKING TRADITIONAL EC PARADIGM ON SMARTNICS

13 © OpenFabrics Alliance

▪ Decoding Procedure

❖Fetch 𝑘 survived chunks

❖Decode to reconstruct corrupt chunks

• One set of nodes (e.g., data nodes)

send requested chunks

• The other set of nodes (e.g., clients)

receive and decode to reconstruct

• Follows Bipartite Graph Based EC

(BiEC) Paradigm

𝑁𝑦𝑥 Node y in layer x

Exclusive OR

Erasure Coder

Data Chunk

Parity Chunk

IntermediateChunk

EC

Corrupt Node

RecoveredChunk

𝑁11

𝑁𝑎1

𝑁𝑏1

𝑁𝑐1

𝑁𝑑1

𝑁𝑒1

𝑁𝑓1

𝑁𝑎2

𝑁𝑏2

𝑁𝑐2

𝑁𝑑2

𝑁𝐼𝐶

𝑁12

𝑁22

𝑁𝑘2

𝑁𝑘+12

𝑁𝑘+𝑚2

……EC

Page 14: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

LIMITATIONS AND CHALLENGES

14 © OpenFabrics Alliance

Page 15: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

LIMITATIONS OF BIEC PARADIGM

15 © OpenFabrics Alliance

▪ Limitation 1: Lack of parallelism

▪ Limitation 2: Treat a NIC as a power processor, not

fully exploit networked resources

𝑁11

𝑁𝑎1

𝑁𝑏1

𝑁𝑐1

𝑁𝑑1

𝑁𝑒1

𝑁𝑓1

𝑁𝑎2

𝑁𝑏2

𝑁𝑐2

𝑁𝑑2

𝑁𝐼𝐶

𝑁12

𝑁22

𝑁𝑘2

𝑁𝑘+12

𝑁𝑘+𝑚2

… … EC…

Page 16: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

LIMITATIONS OF EXISTING APIS FOR EC ON SMARTNICS

16 © OpenFabrics Alliance

▪ Encode and Send Primitive

❖Offload encoding operation and data transmissions simultaneously

• Limitation 3: No receive and decode primitive support

Page 17: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

CHALLENGES

1. Can we propose a better EC paradigm, which is able to exploit the networked

resources in an optimized approach?

2. If such an EC paradigm exists, how can we achieve better overlapping?

3. Can such a new EC paradigm be applied uniformly to both encoding and decoding

procedures?

4. Are there any additional challenges to co-design applications with the proposed EC

paradigm?

17 © OpenFabrics Alliance

Page 18: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

TRIEC IDEAS

18 © OpenFabrics Alliance

Page 19: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

TRIPARTITE GRAPH BASED EC PARADIGM (TRIEC)

19 © OpenFabrics Alliance

Key Observation: An EC computation can be decomposed into sub EC calculations,

which are possible to be performed on a set of nodes in parallel

▪ Bipartite EC graph transforms to Tripartite EC graph

▪ We name it as Tripartite Graph Based EC Paradigm

Page 20: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

TRIEC FOR ENCODING

20 © OpenFabrics Alliance

▪ Encoding Procedure❖Client (e.g., 𝑁1

1) sends each data chunk

❖Data node in layer-2 (e.g., 𝑁12) posts

encode-and-send

❖Data node in layer-3 (e.g., 𝑁13) sums up

𝑘 intermediate chunks to generate a

parity chunk

𝑁𝑦𝑥 Node y in layer x

Exclusive OR

Erasure Coder

Data Chunk

Parity Chunk

IntermediateChunk

EC 𝑁11

𝑁𝑎1

𝑁𝑏1

𝑁𝑐1

𝑁𝑑1

𝑁𝑒1

𝑁𝑓1

𝑁𝑎2

𝑁𝑏2

𝑁𝑐2

𝑁𝑑2

𝑁12

𝑁22

𝑁𝑎3

𝑁𝑏3

𝑁𝑐3

𝑁13

𝑁𝑚3𝑁𝑘

2

𝑁𝑑3

𝑁𝐼𝐶

𝑁𝐼𝐶

𝑁𝐼𝐶

EC

EC

……

EC

m

Page 21: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

TRIEC FOR DECODING

21 © OpenFabrics Alliance

▪ Decoding Procedure❖Data node in layer-3 (e.g., 𝑁1

3) decodes

to generate 𝑝 intermediate chunks

❖Data node in layer-2 (e.g., 𝑁12) fetches 𝑘

intermediate chunks and sums them up

to reconstruct corrupt chunk

❖Client in layer-1 (e.g., 𝑁11) gets

recovered chunks

𝑁𝑦𝑥 Node y in layer x

Exclusive OR

Erasure Coder

Data Chunk

Parity Chunk

IntermediateChunk

EC

Corrupt Node

RecoveredChunk

...

...

...

𝑁11

𝑁𝑎1

𝑁𝑏1

𝑁𝑐1

𝑁𝑑1

𝑁𝑒1

𝑁𝑓1

𝑁𝑒1

𝑁𝑎2

𝑁𝑏2

𝑁𝑐2

𝑁𝑑2

𝑁𝑒2

𝑁𝑓2

𝑁12

𝑁𝑝2

𝑁𝑎3

𝑁𝑏3

𝑁𝑐3

𝑁13

𝑁𝑥3

𝑁𝑘3

𝑁𝐼𝐶

𝑁𝐼𝐶

𝑁𝐼𝐶

EC

EC

EC

TriEC can be uniformly applied to both encoding and decoding

Page 22: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

TRIEC-CACHE

22 © OpenFabrics Alliance

Page 23: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

ARCHITECTURE OVERVIEW OF TRIEC-CACHE

23 © OpenFabrics Alliance

▪ Based on memcached (v1.5.12)

▪ Interleaved Architecture

❖EC Group: Leader, data processes, and

parity processes

❖Interleave EC groups into the cluster such

that data processes and parity processes are

evenly distributed

❖Balance workload and resource utilizations

Node 1 Node 2 Node 3 Node 4 Node 5

Group 1

Group 2

Group 3

Group 4

Group 5

𝐿𝑖 Leader of 𝑖th Group 𝐷𝑖 𝑖th Data Chunk

𝑖th Parity Chunk𝑃𝑖

𝐿1𝐷1

𝐷2 𝐷3 𝑃1 𝑃2

𝐷1𝐷2 𝐷3 𝑃1𝑃2

𝐿2

𝐷1𝐷2 𝐷3𝑃1 𝑃2

𝐿3

𝐷1𝐷2

𝐿4

𝐷1

𝐿5𝐷2 𝐷3 𝑃1 𝑃2

𝐷3 𝑃1 𝑃2

Corrupt Chunk

Page 24: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

ARCHITECTURE OVERVIEW OF TRIEC-CACHE

24 © OpenFabrics Alliance

▪ set(key, value)

❖Implemented with encode-and-send

primitive

▪ get(key)

❖No EC operations

▪ get(key) with erasures

❖Involve data recoveries

❖Implemented with recv-and-decode

primitive

Node 1 Node 2 Node 3 Node 4 Node 5

Group 1

Group 2

Group 3

Group 4

Group 5

𝐿𝑖 Leader of 𝑖th Group 𝐷𝑖 𝑖th Data Chunk

𝑖th Parity Chunk𝑃𝑖

Client

𝐷1𝐷2 𝐷3 𝑃1𝑃2

𝐿2

𝐷1𝐷2

𝐿4𝐷3 𝑃1 𝑃2

𝐿1

𝐷1𝐷2 𝐷3

𝑃1𝑃2

𝐿3

𝐷1𝐷2

𝐷3

𝐷3

𝐷2 𝐷1𝐿5

𝑃1 𝑃2

𝑃1 𝑃2

Corrupt Chunk

Page 25: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

RECOVERY MECHANISM OF BIEC AND TRIEC

25 © OpenFabrics Alliance

Out-of-Band Recovery for BiEC In-Band Recovery for TriEC

• Out-of-Band Recovery❖Write back

❖Recover in the background

• In-Band Recovery❖No extra computations or

communications

Page 26: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

OPTIMIZATIONS WITH MLNX OFED

26 © OpenFabrics Alliance

▪ Avoidance of Sending Unrequested

Chunks

❖Nullify the work requests to avoid sending out

unrequested chunks

▪ EC Calculator Cache

❖Initializing EC calculators is very expensive

with Mellanox’s EC offload APIs

❖Static EC calculator cache vs. dynamic EC

calculator cache

0

1000

2000

3000

4000

5000

6000

7000

Lat

ency

(us)

No Cache

Dynamic Cache

Static Cache

Performance Impact of Calculator Cache (OSU RI2 Cluster)

Page 27: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

EVALUATION

27 © OpenFabrics Alliance

Page 28: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

MICROBENCHMARK – TRIEC ENCODING

28 © OpenFabrics Alliance

▪ TriEC reduces the encoding time with RS(12,4) by 23.5% - 45.1%

0500

10001500200025003000

BiEC TriEC

Ex

ecu

tion

Tim

e (m

s)

Encoding Performance Comparisons for RS(12,4) with Diverse Chunk Sizes (OSU RI2 Cluster)

Page 29: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

MICROBENCHMARK – TRIEC ENCODING

29 © OpenFabrics Alliance

0

1500

3000

4500

600051

2

2K

B

8K

B

32

KB

12

8K

B

51

2K

B

2M

B

8M

B

1K

B

4K

B

16

KB

64

KB

25

6K

B

1M

B

4M

B

51

2

2K

B

8K

B

32

KB

12

8K

B

51

2K

B

2M

B

8M

B

1K

B

4K

B

16

KB

64

KB

25

6K

B

1M

B

4M

B

(3,2) (6,3) (8,3) (10,4)

BiEC TriEC

Exec

uti

on T

ime

(ms)

Encoding Performance Comparisons with Varied Configurations and Chunk Sizes (OSU RI2 Cluster)

Up to 1.29x Up to 1.43x Up to 1.61x Up to 1.72x

▪ Evaluate TriEC with all widely-used EC configurations

Page 30: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

MICROBENCHMARK – TRIEC DECODING

30 © OpenFabrics Alliance

▪ TriEC reduces the decoding time with RS(12,4) by 18.6% - 64.6%

Ex

ecu

tion

Tim

e (m

s)

Performance Comparisons for Recovering Four Data Chunks for RS(12,4) with Diverse Chunk Sizes (OSU RI2 Cluster)

0100020003000400050006000

BiEC TriEC

Page 31: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

MICROBENCHMARK – TRIEC DECODING

31 © OpenFabrics Alliance

0

3000

6000

9000

1200051

2

2K

B

8K

B

32

KB

12

8K

B

51

2K

B

2M

B

8M

B

1K

B

4K

B

16

KB

64

KB

25

6K

B

1M

B

4M

B

51

2

2K

B

8K

B

32

KB

12

8K

B

51

2K

B

2M

B

8M

B

1K

B

4K

B

16

KB

64

KB

25

6K

B

1M

B

4M

B

(3,2) (6,3) (8,3) (10,4)

BiEC TriEC

Performance Comparisons for Recovering 𝒎 Data Chunks with Varied Configurations and Chunk Sizes (OSU RI2 Cluster)

Exec

uti

on T

ime

(ms)

m=2 m=3 m=3 m=4

Up to 2.33x Up to 2.26x Up to 2.17x Up to 2.16x

▪ Evaluate TriEC with all widely-used EC configurations

Page 32: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

THROUGHPUT OF TRIEC-CACHE

32 © OpenFabrics Alliance

▪ TriEC speeds up the overall throughput performance by up to 13.3% for

equal-shares, 14.8% for read-mostly, and 13.9% for read-only workloads.

Throughput Comparisons for RS(6, 3) (OSC Pitzer Cluster)

BiEC TriEC

1KB 4KB 16KB1KB 4KB 16KB1KB 4KB 16KB

Thro

ughput

(103

op

s/s)

1KB 4KB 16KB1KB 4KB 16KB1KB 4KB 16KB0

40

80

120

r:w=50:50 r:w=95:5 r:w=100:0 r:w=50:50 r:w=95:5 r:w=100:0

1% of reads with one erasure 1% of reads with three erasures

Page 33: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

CONCLUSION

33 © OpenFabrics Alliance

Page 34: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

Haiyang Shi and Xiaoyi Lu. TriEC: Tripartite Graph Based Erasure Coding NIC Offload. In Proceedings of the 32nd International Conference for High

Performance Computing, Networking, Storage and Analysis (SC), 2019. (Best Student Paper Finalist)

CONCLUSION AND FUTURE WORK

34 © OpenFabrics Alliance

▪ A new EC NIC offload paradigm based on tripartite graph model (i.e., TriEC)

▪ A new receive-and-decode primitive on top of Mellanox EC NIC offload APIs

▪ Co-design a TriEC-based key value store based on Memcached

▪ TriEC reduces the average write latency by up to 23.2% and the recovery time by up to 37.8%

even with only 1% failure occurrences

▪ In the future, we plan to apply TriEC to other storage systems and hardware platforms

Page 35: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

ACKNOWLEDGEMENT

35 © OpenFabrics Alliance

▪ We would like to thank Ohio Supercomputer Center (OSC) for providing the cluster access.

▪ We would also like to thank Mellanox for donating SmartNICs.

▪ We also thank the anonymous reviewers for their precious feedback to this work.

▪ This work is supported in part by National Science Foundation grant #CCF-1822987.

Page 36: 2020 OFA Virtual Workshop TRIEC: AN EFFICIENT ......TRIEC: AN EFFICIENT ERASURE CODING NIC OFFLOAD PARADIGM BASED ON TRIPARTITE GRAPH MODEL Xiaoyi Lu and Haiyang Shi The Ohio State

THANK YOUXiaoyi Lu, Research Assistant Professor

The Ohio State University

2020 OFA Virtual Workshop