Accelerate Reed-Solomon Coding for Fault Tolerance in RAID-like Systems
Accelerate Reed-Solomon Coding for Fault Tolerance in RAID-like Systems
Shuai YUAN
NTHU LSA Lab
Outline
▶ Background
▶ How to use CUDA to accelerate
▶ Experiment
▶ Q & A
Why Reed-Solomon codes?
Why fault-tolerance?
In a RAID-like system, storage is distributed among several devices, so the probability that one of these devices fails becomes significant. If the MTTF (mean time to failure) of one device is P, then the MTTF of a system of n devices is P/n.
Thus, in such systems, fault tolerance must be taken into account.
Why erasure codes?
Replication is expensive! E.g., to achieve double fault tolerance, we need 2 replicas.
Erasure Codes
Given a message/file of k equal-size blocks/chunks/fragments, encode it into n blocks/chunks/fragments (n > k).
▶ Optimal erasure codes: any k out of the n fragments are sufficient to recover the original message/file, i.e. they can tolerate arbitrary (n − k) failures. We call this the (n, k) MDS (maximum distance separable) property. E.g. Reed-Solomon codes.
▶ Near-optimal erasure codes: require (1 + ε)k code blocks/chunks to recover (ε > 0), i.e. they cannot tolerate arbitrary (n − k) failures. E.g. Local Reconstruction Codes (used in Windows Azure Storage).
Replication vs. Erasure Codes
Given a file of size M, use Reed-Solomon codes with (n = 4, k = 2) to achieve double fault tolerance:
▶ R-S codes: storage overhead = M
▶ Replication: storage overhead = 2M (2 replicas)
Reed-Solomon coding mechanism
Reed-Solomon encoding process
Given a file F, divide it into k equal-size native chunks: F = [F_i], i = 1, 2, ..., k. Encode them into (n − k) code chunks: C = [C_i], i = 1, 2, ..., n − k, using an encoding matrix EM of size (n − k) × k:

    C^T = EM × F^T

Each C_i is a linear combination of F_1, F_2, ..., F_k.
In our case,

    F = (A  B),   EM = | 1  1 |
                       | 1  2 |

    C^T = | 1  1 | × | A |  =  | A + B  |
          | 1  2 |   | B |     | A + 2B |
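The 2 × 2 example above can be checked numerically. A minimal sketch, assuming GF(2^8) with the primitive polynomial 0x11d (the slides leave q(x) unspecified); the byte values chosen for the chunks A and B are arbitrary:

```python
def gf_mul(a, b, w=8, prim=0x11d):
    """Multiply two elements of GF(2^w) by shift-and-XOR polynomial multiplication."""
    r = 0
    for _ in range(w):
        if b & 1:
            r ^= a          # addition in GF(2^w) is XOR
        b >>= 1
        a <<= 1
        if a & (1 << w):
            a ^= prim       # reduce modulo the primitive polynomial
    return r

# F = (A, B); EM = [[1, 1], [1, 2]]; C^T = EM x F^T
A, B = 0x41, 0x42
C1 = gf_mul(1, A) ^ gf_mul(1, B)   # A + B
C2 = gf_mul(1, A) ^ gf_mul(2, B)   # A + 2B
print(hex(C1), hex(C2))
```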
Let P = [P_i], i = 1, 2, ..., n = [F_1, F_2, ..., F_k, C_1, C_2, ..., C_{n−k}] be the n chunks in storage, and let

    EM' = | I  |
          | EM |

where I is the k × k identity matrix

    I = | 1  0  ...  0 |
        | 0  1  ...  0 |
        | ...          |
        | 0  0  ...  1 |

Then

    P^T = EM' × F^T = | F^T |
                      | C^T |
In our case,

    EM' = | I  |  =  | 1  0 |
          | EM |     | 0  1 |
                     | 1  1 |
                     | 1  2 |

    P^T = EM' × F^T = | 1  0 | × | A |  =  | A      |
                      | 0  1 |   | B |     | B      |
                      | 1  1 |             | A + B  |
                      | 1  2 |             | A + 2B |
EM' is the key to the MDS property!
Theorem (necessary and sufficient condition)
Every possible k × k submatrix obtained by removing (n − k) rows from EM' has full rank.
Equivalent expressions of full rank:
▶ rank = k
▶ non-singular
▶ ...
Alternative view: consider the linear space of P = [P_i], i = 1, 2, ..., n = [F_1, ..., F_k, C_1, ..., C_{n−k}]; its dimension is k, and any k out of the n vectors form a basis of this space.
Reed-Solomon codes use a Vandermonde matrix V as EM:

    V = | 1        1        1        ...  1        |
        | 1        2        3        ...  k        |
        | 1^2      2^2      3^2      ...  k^2      |
        | ...                                      |
        | 1^(n−k)  2^(n−k)  3^(n−k)  ...  k^(n−k)  |
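The Vandermonde rows can be generated directly with GF arithmetic. A sketch, assuming GF(2^8) with primitive polynomial 0x11d; entry (i, j) is (j+1)^i, matching the rows shown above:

```python
def gf_mul(a, b, w=8, prim=0x11d):
    # shift-and-XOR multiplication in GF(2^w), reducing modulo prim
    r = 0
    for _ in range(w):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << w):
            a ^= prim
    return r

def vandermonde(n, k):
    """Return the (n-k) x k encoding matrix with entry (i, j) = (j+1)^i in GF(2^8)."""
    V = []
    for i in range(n - k):
        row = []
        for j in range(k):
            v = 1
            for _ in range(i):
                v = gf_mul(v, j + 1)  # (j+1)^i by repeated multiplication
            row.append(v)
        V.append(row)
    return V

print(vandermonde(6, 4))
```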
    EM' = | 1        0        0        ...  0        |
          | 0        1        0        ...  0        |
          | 0        0        1        ...  0        |
          | ...                                      |
          | 0        0        0        ...  1        |
          | 1        1        1        ...  1        |
          | 1        2        3        ...  k        |
          | 1^2      2^2      3^2      ...  k^2      |
          | ...                                      |
          | 1^(n−k)  2^(n−k)  3^(n−k)  ...  k^(n−k)  |
Remark:
▶ All arithmetic operations are performed in the Galois Field GF(2^x), so every number is less than 2^x.
▶ EM' satisfies the MDS property.
Reed-Solomon decoding process
We randomly select k out of the n fragments, say X = [x_1, x_2, ..., x_k]^T. Then we use the coefficients of each fragment x_i to construct a k × k matrix V'. The original data can be regenerated by multiplying X by the inverse of V':

    F = V'^{−1} × X
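The decoding equation can be exercised end to end on the 2 × 2 example from the encoding slides. A sketch, assuming GF(2^8) with primitive polynomial 0x11d: encode with EM' = [I; EM], "lose" both native chunks, and recover them from the two code chunks by inverting the surviving 2 × 2 coefficient matrix V' (for a 2 × 2 matrix the adjugate formula applies, and subtraction equals addition in GF(2^w)):

```python
def gf_mul(a, b, w=8, prim=0x11d):
    # shift-and-XOR multiplication in GF(2^w)
    r = 0
    for _ in range(w):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << w):
            a ^= prim
    return r

def gf_inv(a):
    # a^(2^8 - 2) = a^254 is the multiplicative inverse in GF(2^8)
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

# encode: P^T = EM' x F^T with EM' = [I; EM]
A, B = 0x41, 0x42
EMp = [[1, 0], [0, 1], [1, 1], [1, 2]]
P = [gf_mul(row[0], A) ^ gf_mul(row[1], B) for row in EMp]

# suppose the native chunks P[0], P[1] are lost; survivors are the
# code chunks, whose coefficient rows form V'
V = [EMp[2], EMp[3]]
X = [P[2], P[3]]

# invert the 2x2 matrix V' over GF(2^8): adjugate / determinant
det = gf_mul(V[0][0], V[1][1]) ^ gf_mul(V[0][1], V[1][0])
d = gf_inv(det)
Vinv = [[gf_mul(d, V[1][1]), gf_mul(d, V[0][1])],
        [gf_mul(d, V[1][0]), gf_mul(d, V[0][0])]]

# F = V'^{-1} x X recovers the original chunks
F = [gf_mul(Vinv[i][0], X[0]) ^ gf_mul(Vinv[i][1], X[1]) for i in range(2)]
print([hex(f) for f in F])
```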
Optimization motivation
What discourages the use of Reed-Solomon coding in place of replication is its high computational cost:
▶ Galois Field arithmetic
▶ matrix operations
Accelerate operations in Galois Field (GF)
GF(2^w) is constructed by finding a primitive polynomial q(x) of degree w over GF(2) and then enumerating the elements (which are polynomials) with the generator x. Addition in this field is performed using polynomial addition; multiplication is performed using polynomial multiplication, taking the result modulo q(x).

In implementation, we map the elements of GF(2^w) to binary words of size w, so that computation in GF(2^w) becomes bitwise operations. Let r(x) be a polynomial in GF(2^w); we map r(x) to a binary word b of size w by setting the ith bit of b to the coefficient of x^i in r(x).

After this mapping, addition/subtraction of binary elements of GF(2^w) can be performed by bitwise exclusive-or.
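The polynomial-to-bits mapping and XOR addition can be illustrated concretely. A small sketch in GF(2^8); the particular polynomials are arbitrary examples:

```python
def poly_to_word(exponents):
    """Map a polynomial (given as the exponents with coefficient 1) to its binary word."""
    b = 0
    for e in exponents:
        b |= 1 << e   # set bit i for the coefficient of x^i
    return b

r = poly_to_word([4, 1, 0])        # x^4 + x + 1
s = poly_to_word([4, 2])           # x^4 + x^2
print(hex(r), hex(s), hex(r ^ s))  # addition = XOR: (x^4+x+1) + (x^4+x^2) = x^2+x+1
```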
However, multiplication is still costly. Using bitwise operations to emulate polynomial multiplication in GF(2^w) takes a loop of w iterations; e.g., in GF(2^8), multiplying 2 bytes takes 8 iterations.
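The w-iteration loop can be sketched as follows, assuming GF(2^8) with the primitive polynomial 0x11d (the slides leave q(x) unspecified): each iteration conditionally XORs in a shifted copy of the multiplicand and reduces on overflow.

```python
def gf_mul(a, b, w=8, prim=0x11d):
    """Polynomial multiplication modulo prim, emulated with w bitwise iterations."""
    r = 0
    for _ in range(w):
        if b & 1:
            r ^= a          # add a * x^i (GF addition is XOR)
        b >>= 1
        a <<= 1             # a = a * x
        if a & (1 << w):
            a ^= prim       # reduce modulo q(x) when degree reaches w
    return r

print(gf_mul(3, 7))  # (x+1)(x^2+x+1) = x^3+1 = 9
```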
CPU approach: store all the results in a multiplication table.
E.g., in GF(2^8), the full table for multiplying 2 bytes takes 256 × 256 = 64KB. This consumes a lot of space ⇝ not suitable for GPU.
Trade-off between computation and memory consumption ([1]): we build two smaller tables:
▶ logarithm table: maps a binary element b to the power j such that x^j is equivalent to b.
▶ exponential table: maps a power j to the binary element b such that x^j is equivalent to b.

Multiplication in GF(2^w):
▶ look up each binary number in the logarithm table for its logarithm (this is equivalent to finding the polynomial)
▶ add the logarithms modulo 2^w − 1 (this is equivalent to multiplying the polynomials modulo q(x))
▶ look up the exponential table to convert the result back to a binary number

In total: three table lookups and a modular addition.
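The two tables and the three-lookup multiply can be sketched as follows, assuming GF(2^8) with primitive polynomial 0x11d (so the element 2, i.e. x, generates all 255 nonzero elements); a slow shift-and-XOR multiply bootstraps the tables:

```python
W, PRIM = 8, 0x11d
SIZE = (1 << W) - 1          # 255 nonzero elements

def gf_mul_slow(a, b):
    # reference shift-and-XOR multiply, used only to build the tables
    r = 0
    for _ in range(W):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << W):
            a ^= PRIM
    return r

# exp[j] = x^j as a binary word; log[b] = j such that x^j == b
exp = [0] * SIZE
log = [0] * (1 << W)
e = 1
for j in range(SIZE):
    exp[j] = e
    log[e] = j
    e = gf_mul_slow(e, 2)    # multiply by the generator x

def gf_mul(a, b):
    """Multiplication via three table lookups and a modular addition."""
    if a == 0 or b == 0:
        return 0
    return exp[(log[a] + log[b]) % SIZE]

print(gf_mul(3, 7))
```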
Accelerate encoding process
▶ Generating the Vandermonde matrix
▶ Matrix multiplication
Accelerate decoding process
▶ Computing inverse matrix
▶ Matrix multiplication
Accelerate matrix inversion
Gaussian / Gauss-Jordan elimination in GF(2^8):

    [V' | I] = | 1 0 0 0 | 1 0 0 0 |
               | 0 1 0 0 | 0 1 0 0 |
               | 1 2 3 4 | 0 0 1 0 |
               | 1 1 1 1 | 0 0 0 1 |
Eliminate column 1 (add row 1 to rows 3 and 4):

    | 1 0 0 0 | 1 0 0 0 |
    | 0 1 0 0 | 0 1 0 0 |
    | 0 2 3 4 | 1 0 1 0 |
    | 0 1 1 1 | 1 0 0 1 |
Eliminate column 2 (add multiples of row 2 to rows 3 and 4):

    | 1 0 0 0 | 1 0 0 0 |
    | 0 1 0 0 | 0 1 0 0 |
    | 0 0 3 4 | 1 2 1 0 |
    | 0 0 1 1 | 1 1 0 1 |
Normalize row 3 and eliminate below it (the entries are now general field elements of GF(2^8)):

    | 1 0 0 0   | 1   0   0   0 |
    | 0 1 0 0   | 0   1   0   0 |
    | 0 0 1 247 | 244 245 244 0 |
    | 0 0 0 246 | 245 244 244 1 |
Normalize row 4 and eliminate column 4, leaving V'^{−1} on the right:

    | 1 0 0 0 | 1   0   0   0   |
    | 0 1 0 0 | 0   1   0   0   |
    | 0 0 1 0 | 104 187 186 210 |
    | 0 0 0 1 | 105 186 186 211 |
On the GPU, each pivot step of the Gauss-Jordan elimination in GF(2^8) is parallelized:
▶ normalize the pivot row
▶ make the pivot column into reduced echelon form
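A serial reference of the elimination above (the GPU version parallelizes the two per-pivot steps across columns). A sketch, assuming GF(2^8) with primitive polynomial 0x11d; element inverses come from Fermat-style exponentiation a^254, and the exact table values therefore depend on the chosen q(x):

```python
def gf_mul(a, b, w=8, prim=0x11d):
    # shift-and-XOR multiplication in GF(2^w)
    r = 0
    for _ in range(w):
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a & (1 << w):
            a ^= prim
    return r

def gf_inv(a):
    r = 1
    for _ in range(254):   # a^254 = a^(-1) in GF(2^8)
        r = gf_mul(r, a)
    return r

def invert(M):
    """Gauss-Jordan inversion of a k x k matrix over GF(2^8)."""
    k = len(M)
    # augment [M | I]
    A = [list(M[i]) + [1 if i == j else 0 for j in range(k)] for i in range(k)]
    for col in range(k):
        p = next(r for r in range(col, k) if A[r][col])   # find a nonzero pivot
        A[col], A[p] = A[p], A[col]
        d = gf_inv(A[col][col])
        A[col] = [gf_mul(d, x) for x in A[col]]           # normalize the pivot row
        for r in range(k):
            if r != col and A[r][col]:                    # reduce the pivot column
                f = A[r][col]
                A[r] = [x ^ gf_mul(f, y) for x, y in zip(A[r], A[col])]
    return [row[k:] for row in A]

V = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 2, 3, 4], [1, 1, 1, 1]]
Vinv = invert(V)
```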
Experiment Settings
▶ Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
▶ nVidia Corporation GF100 [Tesla C2050]
Experiment Result (GPU cost breakdown)
k = 4, n = 6
[Figure: GPU R-S encoding time vs. file size (752KB, 47MB, 161MB, 679MB, 1.1GB); total times 29.46, 173.70, 218.38, 429.16, 576.41 ms. Stacked cost components: copy data chunks from host memory to GPU memory; generate encoding matrix; copy encoding matrix from GPU memory to host memory; encode file (matrix multiplication); copy code chunks from GPU memory to host memory.]
Experiment Result (GPU cost breakdown)
k = 4, n = 6
[Figure: GPU R-S encoding, total computation time vs. total communication time as functions of file size (KB); y-axis 0–450 ms.]
Experiment Result (GPU cost breakdown)
k = 4, n = 6
[Figure: GPU R-S decoding time vs. file size (752KB, 47MB, 161MB, 679MB, 1.1GB); total times 31.09, 178.49, 262.11, 614.69, 751.38 ms. Stacked cost components: copy code chunks from host memory to GPU memory; copy encoding matrix from host memory to GPU memory; generate decoding matrix (inverse matrix); decode file (matrix multiplication); copy data chunks from GPU memory to host memory.]
Experiment Result (GPU cost breakdown)
k = 4, n = 6
[Figure: GPU R-S decoding, total computation time vs. total communication time as functions of file size (KB); y-axis 0–600 ms.]
Experiment Result (GPU vs CPU)
k = 4, n = 6
[Figure: RS encode, CPU vs. GPU bandwidth (MB/s) as a function of file size (KB); GPU speedups of 9.87×, 25.73×, 58.06×, 64.72× at increasing file sizes.]
Experiment Result (GPU vs CPU)
k = 4, n = 6
[Figure: RS decode, CPU vs. GPU bandwidth (MB/s) as a function of file size (KB); GPU speedups of 1.47×, 18.12×, 39.19×, 75.4×, 89.99× at increasing file sizes.]
Experiment Result (tuning k)
File size: 1.1GB, double fault tolerance (n − k = 2)
[Figure: total GPU encoding time vs. k ∈ {4, 8, 16, 32, 64, 128}: 576.41, 665.97, 947.59, 1495.11, 2702.54, 5060.78 ms.]
Experiment Result (tuning k)
File size: 1.1GB, double fault tolerance (n − k = 2)
[Figure: total GPU decoding time vs. k ∈ {4, 8, 16, 32, 64, 128}: 751.38, 1028.64, 1210.73, 1856.64, 3167.55, 5958.09 ms.]
Experiment Result (tuning k)
File size: 1.1GB, triple fault tolerance (n − k = 3)
[Figure: total GPU encoding time vs. k ∈ {4, 8, 16, 32, 64, 128}: 766.31, 695.00, 934.14, 1509.53, 2674.74, 5053.13 ms.]
Experiment Result (tuning k)
File size: 1.1GB, triple fault tolerance (n − k = 3)
[Figure: total GPU decoding time vs. k ∈ {4, 8, 16, 32, 64, 128}: 873.92, 1026.68, 1345.65, 1875.05, 3167.34, 5960.18 ms.]
Q & A
Thank You!
[1] J. S. Plank et al. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software Practice and Experience, 27(9):995–1012, 1997.