Algorithm-Based Fault Tolerance Matrix Multiplication

CS717 Algorithm-Based Fault Tolerance Matrix Multiplication, Greg Bronevetsky


Page 1: Algorithm-Based Fault Tolerance Matrix Multiplication

CS717

Algorithm-Based Fault Tolerance Matrix Multiplication

Greg Bronevetsky

Page 2: Algorithm-Based Fault Tolerance Matrix Multiplication

Problem at Hand

• Have matrices A and B
• Want to compute their product: AB
• Ask a matrix-matrix-multiply (MMM) implementation to compute the product
• Answer: C
• Question: Is C the correct answer? How could we know for sure?

Page 3: Algorithm-Based Fault Tolerance Matrix Multiplication

Algorithm-Based Fault Tolerance

• Encode input matrices via error-correcting code
• Run regular MMM algorithm on encoded matrices
  – Encoding invariant under MMM
• Naturally outputs encoded matrices
• Encoding guarantees:
  – If up to t errors in output, will detect error
  – If up to c<t errors in output, can decode correct output matrix

Page 4: Algorithm-Based Fault Tolerance Matrix Multiplication

Outline

Linear Error Correcting Codes

Algorithm-Based Fault Tolerance

ABFT = Linear Encoding of Matrices

Page 5: Algorithm-Based Fault Tolerance Matrix Multiplication

Error Correcting Codes

• Map $f: \Sigma^k \rightarrow \Sigma^n$
  – k-long data words → n-long codewords
  – We use $\Sigma = \{0, 1\}$
• Code C of length n is a “sparse” subset of $\Sigma^n$
  – Very few possible words are valid codewords
• Rate of code: amount of information communicated by each codeword

$$r = \frac{\log_2 |C|}{n} = \frac{k}{n}$$

Page 6: Algorithm-Based Fault Tolerance Matrix Multiplication

Minimum Distance

• Minimum Distance:

$$d_{min} = \min_{x, y \in C,\ x \neq y} d(x, y)$$

  – d() = Hamming distance
• Hamming distance: number of spots where words differ
• Measures difficulty of decoding/correcting corrupted codewords

Page 7: Algorithm-Based Fault Tolerance Matrix Multiplication

Detection and Correction

• Code may detect errors in up to dmin-1 spots
  – No error in fewer than dmin spots can morph one codeword into another
• May correct errors in up to ⌊(dmin-1)/2⌋ spots
  – Can still find “closest” codeword
• More details later…

Each codeword defines a circle around itself of radius dmin/2
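The definitions above can be sketched in a few lines. This is only an illustration: the 3-bit repetition code and the helper names `hamming_distance` and `minimum_distance` are assumptions chosen for the example, not part of the slides.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions where two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))

def minimum_distance(code):
    """Smallest Hamming distance between two distinct codewords."""
    return min(hamming_distance(x, y) for x, y in combinations(code, 2))

# Hypothetical example code (not from the slides): 3-bit repetition code.
code = ["000", "111"]
d_min = minimum_distance(code)

detectable = d_min - 1          # fewer than d_min flips never reach another codeword
correctable = (d_min - 1) // 2  # the closest codeword is still unique
print(d_min, detectable, correctable)  # prints 3 2 1
```

With d_min = 3, any one or two flipped bits leave a word that is not a codeword, and a single flipped bit is still closer to the original codeword than to the other one.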

Page 8: Algorithm-Based Fault Tolerance Matrix Multiplication

Linear Codes

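• Codewords form linear subspace inside $\Sigma^n$
• In rowspace of generator matrix G:

$$G = \begin{bmatrix} 1&0&1&1&1&0&0 \\ 1&1&0&1&0&1&0 \\ 1&1&1&0&0&0&1 \end{bmatrix}$$

  – G generates an (n=7, k=3) code

$$\mathrm{codewords} = \{\, m \cdot G \mid m \in \mathrm{messages} \,\}$$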
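As a small sketch of what "codewords are the rowspace of G" means, the snippet below multiplies every 3-bit message by the slide's generator matrix over GF(2). The function name `encode` is an assumption made for this example.

```python
from itertools import product

# Generator matrix of the (n=7, k=3) code shown on the slide.
G = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def encode(message, G):
    """Return the row vector message * G, with arithmetic over GF(2)."""
    n = len(G[0])
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(n)]

# Each of the 2^3 messages maps to one 7-bit codeword.
messages = list(product([0, 1], repeat=len(G)))
codewords = [encode(m, G) for m in messages]
for m, c in zip(messages, codewords):
    print(m, "->", c)
```

Only 8 of the 128 possible 7-bit words are codewords, which is the "sparse subset" the earlier slide refers to.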

Page 9: Algorithm-Based Fault Tolerance Matrix Multiplication

Property 1

• Linear combination of any codewords is also a codeword:
  – For any x,y∈C, (x+y)∈C
• Codeword*constant is a codeword
  – For any z∈C, k*z∈C
• <0,0…0> always a codeword
• Proof: basic properties of linear spaces

Page 10: Algorithm-Based Fault Tolerance Matrix Multiplication

Property 2

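• Minimum distance of linear code = minimum weight of a nonzero codeword:

$$d_{min} = \min_{x, y \in C,\ x \neq y} d(x, y) = \min_{z \in C,\ z \neq 0} \mathrm{weight}(z)$$

• Where weight() = number of 1's in codeword
• Proof:

$$\min_{x, y \in C,\ x \neq y} d(x, y) = \min_{x, y \in C,\ x \neq y} \mathrm{weight}(x - y)$$

  – Let z = x − y; z∈C since C is linear
  – Any z∈C can be expressed as x − y (z = z − 0)
  – Thus, $d_{min} = \min_{z \in C,\ z \neq 0} \mathrm{weight}(z)$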
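Property 2 can be checked by brute force on the (7, 3) example code. The sketch below is illustrative only; the helper names `encode`, `weight` and `distance` are assumptions, and G is the generator matrix from the Linear Codes slide.

```python
from itertools import combinations, product

G = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]

def encode(message):
    """message * G over GF(2)."""
    return tuple(sum(m * row[j] for m, row in zip(message, G)) % 2
                 for j in range(len(G[0])))

def weight(word):
    """Number of 1's in a word."""
    return sum(word)

def distance(x, y):
    """Hamming distance between two words."""
    return sum(a != b for a, b in zip(x, y))

codewords = [encode(m) for m in product([0, 1], repeat=len(G))]

# Left side of Property 2: minimum over all pairs of distinct codewords.
pairwise_min = min(distance(x, y) for x, y in combinations(codewords, 2))
# Right side of Property 2: minimum weight over nonzero codewords.
lightest_nonzero = min(weight(c) for c in codewords if any(c))
print(pairwise_min, lightest_nonzero)  # prints 4 4
```

The pairwise search looks at 28 pairs while the weight search looks at 7 words, which is the practical benefit of the property.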

Page 11: Algorithm-Based Fault Tolerance Matrix Multiplication

Parity Check Matrix

• H: dual matrix to G
  – Contains basis of space orthogonal to G’s row space
  – n-k dimensional space
• H is (n-k)×n
• Space defined as:

$$\forall c \in C.\ H \cdot c = 0, \qquad H = \begin{bmatrix} h_1 & h_2 & \dots & h_n \end{bmatrix}$$

• Note: H also defines a linear code
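A minimal sketch of one way to obtain H for the (7, 3) example, assuming the generator matrix has the form G = [P | I_k] (true for the G on the Linear Codes slide, whose last three columns are the identity). Under that assumption H = [I_(n-k) | P^T] over GF(2). The names `encode` and `syndrome` are hypothetical helpers.

```python
from itertools import product

G = [
    [1, 0, 1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 1, 1, 0, 0, 0, 1],
]
k, n = len(G), len(G[0])

# Assumption: G = [P | I_k], so H = [I_(n-k) | P^T] satisfies H * c = 0.
P = [row[: n - k] for row in G]
H = [
    [1 if c == r else 0 for c in range(n - k)] + [P[i][r] for i in range(k)]
    for r in range(n - k)
]

def encode(message):
    """message * G over GF(2)."""
    return [sum(m * row[j] for m, row in zip(message, G)) % 2 for j in range(n)]

def syndrome(word):
    """H * word over GF(2)."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

# Every codeword must be orthogonal to every row of H.
all_zero = all(not any(syndrome(encode(m))) for m in product([0, 1], repeat=k))
for row in H:
    print(row)
print(all_zero)  # prints True
```

H has n-k = 4 rows and n = 7 columns, matching the (n-k)×n shape stated above.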

Page 12: Algorithm-Based Fault Tolerance Matrix Multiplication

Property 3

• dmin = min # of columns of H that can sum to 0
• Proof:
  – dmin = weight of min weight codeword = # of 1’s in min weight codeword
  – H·(min weight codeword) = sum of dmin columns of H = 0
  – No lighter nonzero word x can satisfy H·x = 0 (otherwise dmin wouldn’t be minimal)
  – Thus, can’t get fewer columns to sum to 0
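Property 3 can be illustrated by searching for the smallest set of columns of H that sums to zero over GF(2). The matrix H below is an assumption: it is one parity check matrix for the (7, 3) example code, built as [I | P^T] from G = [P | I]. The helper names are hypothetical.

```python
from itertools import combinations

# Assumed parity check matrix for the (7, 3) example code.
H = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]
n = len(H[0])
columns = [[row[j] for row in H] for j in range(n)]

def sums_to_zero(cols):
    """True if the given columns add up to the zero vector over GF(2)."""
    return all(sum(bits) % 2 == 0 for bits in zip(*cols))

def min_dependent_columns(columns):
    """Size of the smallest nonempty set of columns that sums to zero."""
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            if sums_to_zero(subset):
                return size
    return None

d_min = min_dependent_columns(columns)
print(d_min)  # prints 4
```

The result agrees with the minimum codeword weight of this code, which is 4.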

Page 13: Algorithm-Based Fault Tolerance Matrix Multiplication

Property 4

• Minimum distance of linear code ≤ n-k+1
• Proof
  – Total n dimensions (since codewords are n-vectors)
  – G’s rowspace rank = k
  – Thus, H’s columnspace rank = n-k
  – Thus, any n-k+1 columns will be linearly dependent
    • Some of them add up to 0
  – By Property 3, dmin ≤ n-k+1

Page 14: Algorithm-Based Fault Tolerance Matrix Multiplication

Outline

Linear Error Correcting Codes

Algorithm-Based Fault Tolerance

ABFT = Linear Encoding of Matrices

Page 15: Algorithm-Based Fault Tolerance Matrix Multiplication

Encoding a Matrix

• Algorithm-Based Fault Tolerance introduced by Huang and Abraham in 1984

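• Encode each row of matrix via extra column
• Column entries = sums of matrix rows

$$A = \begin{bmatrix} \mathrm{row}_1 \\ \mathrm{row}_2 \\ \vdots \\ \mathrm{row}_m \end{bmatrix}, \qquad e^T = \begin{bmatrix} 1 & 1 & \dots & 1 \end{bmatrix}, \qquad \begin{bmatrix} A & Ae \end{bmatrix} = \begin{bmatrix} \mathrm{row}_1 & \sum \mathrm{row}_1 \\ \mathrm{row}_2 & \sum \mathrm{row}_2 \\ \vdots & \vdots \\ \mathrm{row}_m & \sum \mathrm{row}_m \end{bmatrix}$$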

Page 16: Algorithm-Based Fault Tolerance Matrix Multiplication

Encoding a Matrix

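• Encode each column of matrix via extra row
• Row entries = sums of matrix columns

$$A = \begin{bmatrix} \mathrm{col}_1 & \mathrm{col}_2 & \dots & \mathrm{col}_n \end{bmatrix}, \qquad e^T = \begin{bmatrix} 1 & 1 & \dots & 1 \end{bmatrix}, \qquad \begin{bmatrix} A \\ e^T A \end{bmatrix} = \begin{bmatrix} \mathrm{col}_1 & \mathrm{col}_2 & \dots & \mathrm{col}_n \\ \sum \mathrm{col}_1 & \sum \mathrm{col}_2 & \dots & \sum \mathrm{col}_n \end{bmatrix}$$

• Full Encoding:

$$\begin{bmatrix} A & Ae \\ e^T A & e^T A e \end{bmatrix}$$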
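The three encodings can be sketched with plain nested lists. The function names `row_checksum`, `column_checksum` and `full_checksum` are assumptions for this illustration, and the 2×3 matrix A is made-up example data.

```python
def row_checksum(A):
    """Append a column holding each row's sum: [A | Ae]."""
    return [row + [sum(row)] for row in A]

def column_checksum(A):
    """Append a row holding each column's sum: [A ; e^T A]."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def full_checksum(A):
    """Apply both encodings; the corner entry is the sum of all entries."""
    return column_checksum(row_checksum(A))

# Made-up example data.
A = [[1, 2, 3],
     [4, 5, 6]]

for row in full_checksum(A):
    print(row)
# prints:
# [1, 2, 3, 6]
# [4, 5, 6, 15]
# [5, 7, 9, 21]
```

The corner entry 21 is both the sum of the checksum column and the sum of the checksum row, which is e^T A e in the formula above.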

Page 17: Algorithm-Based Fault Tolerance Matrix Multiplication

Detecting Errors

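• Suppose matrix A is corrupted to matrix Â
  – entry $\hat{a}_{i,j}$ is wrong
• Can detect error’s exact position: <i,j>

Erroneous matrix (row checksum $\mathrm{sum}_i$ in the last column, column checksum $\mathrm{sum}_j$ in the last row):

$$\hat{A} = \begin{bmatrix} \dots & \vdots & \dots & \vdots \\ \dots & \hat{a}_{i,j} & \dots & \mathrm{sum}_i \\ \dots & \vdots & \dots & \vdots \\ \dots & \mathrm{sum}_j & \dots & \dots \end{bmatrix}$$

$$\text{unequal row sum: } \sum_{l=1}^{n} \hat{a}_{i,l} \neq \mathrm{sum}_i \qquad \text{unequal col sum: } \sum_{l=1}^{m} \hat{a}_{l,j} \neq \mathrm{sum}_j$$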
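A minimal sketch of the detection step, assuming a fully encoded matrix stored as nested lists with the row sums in the last column and the column sums in the last row. The function name `locate_single_error` and the sample values are assumptions for the example.

```python
def locate_single_error(F):
    """Return (i, j) of the single corrupted data entry, or None if all checks pass.

    F is a fully encoded matrix: last column = row sums, last row = column sums.
    Indices are 0-based.
    """
    data_rows, data_cols = len(F) - 1, len(F[0]) - 1
    bad_rows = [i for i in range(data_rows)
                if sum(F[i][:data_cols]) != F[i][data_cols]]
    bad_cols = [j for j in range(data_cols)
                if sum(F[i][j] for i in range(data_rows)) != F[data_rows][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    raise ValueError("checksum pattern is not a single data-entry error")

# Made-up fully encoded matrix.
F = [[1, 2, 3, 6],
     [4, 5, 6, 15],
     [5, 7, 9, 21]]
clean = locate_single_error(F)

F[1][2] = 60                       # corrupt one data entry
position = locate_single_error(F)
print(clean, position)             # prints None (1, 2)
```

Exactly one row check and one column check fail, and their intersection is the corrupted entry.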

Page 18: Algorithm-Based Fault Tolerance Matrix Multiplication

Correcting Errors

• Can correct error using row or col checksum

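Erroneous matrix:

$$\hat{A} = \begin{bmatrix} \vdots & & \vdots & & \vdots & \vdots \\ a_{i,1} & \dots & \hat{a}_{i,j} & \dots & a_{i,n} & \mathrm{sum}_i \\ \vdots & & \vdots & & \vdots & \vdots \\ \dots & \dots & \mathrm{sum}_j & \dots & \dots & \dots \end{bmatrix}$$

Row Correction:

$$\begin{aligned}
\mathrm{sum}_i &= (a_{i,1} + \dots + a_{i,j} + \dots + a_{i,n}) \\
\mathrm{sum}_i - (a_{i,1} + \dots + \hat{a}_{i,j} + \dots + a_{i,n}) &= (a_{i,1} + \dots + a_{i,n}) - (a_{i,1} + \dots + \hat{a}_{i,j} + \dots + a_{i,n}) \\
&= a_{i,j} - \hat{a}_{i,j}
\end{aligned}$$

Apply Correction:

$$\hat{a}_{i,j} \leftarrow \hat{a}_{i,j} + (a_{i,j} - \hat{a}_{i,j}) = a_{i,j}$$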
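The row correction can be sketched as follows, assuming the error position (i, j) has already been located and the row checksum itself is intact. The function name `correct_with_row_checksum` and the sample matrix are assumptions for the example.

```python
def correct_with_row_checksum(F, i, j):
    """Repair data entry F[i][j] in place using the row checksum in the last column.

    Returns the applied difference a[i][j] - a_hat[i][j].
    """
    data_cols = len(F[0]) - 1
    # sum_i minus the sum of the stored row equals a[i][j] - a_hat[i][j].
    difference = F[i][data_cols] - sum(F[i][:data_cols])
    F[i][j] += difference
    return difference

# Made-up fully encoded matrix whose entry (1, 1) was corrupted from 5 to 60.
F = [[1, 2, 3, 6],
     [4, 60, 6, 15],
     [5, 7, 9, 21]]
difference = correct_with_row_checksum(F, 1, 1)
print(difference, F[1])  # prints -55 [4, 5, 6, 15]
```

The same computation works with the column checksum by summing down column j instead of across row i.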

Page 19: Algorithm-Based Fault Tolerance Matrix Multiplication

Big Trick: Preservation of Encoding

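• Column-encoded mtx * Row-encoded mtx = Fully-encoded mtx

$$\begin{bmatrix} A \\ e^T A \end{bmatrix} \begin{bmatrix} B & Be \end{bmatrix} = \begin{bmatrix} AB & ABe \\ e^T AB & e^T AB e \end{bmatrix}$$

• Can check MMM computation by checking encoding of output
• If product matrix has an erroneous entry
  – Can detect
  – Can correct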
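The identity above can be checked numerically. This is a sketch with made-up 2×2 integer matrices and hypothetical helper names (`matmul`, `row_checksum`, `column_checksum`, `full_checksum`); integers are used so the comparison is exact.

```python
def matmul(X, Y):
    """Plain triple-loop matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def row_checksum(A):
    """[A | Ae]: append each row's sum as an extra column."""
    return [row + [sum(row)] for row in A]

def column_checksum(A):
    """[A ; e^T A]: append each column's sum as an extra row."""
    return [row[:] for row in A] + [[sum(col) for col in zip(*A)]]

def full_checksum(A):
    """Both encodings at once."""
    return column_checksum(row_checksum(A))

# Made-up example data.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Ordinary multiplication of the encoded inputs ...
product_of_encoded = matmul(column_checksum(A), row_checksum(B))
# ... equals the full encoding of the ordinary product.
encoded_product = full_checksum(matmul(A, B))
print(product_of_encoded == encoded_product)  # prints True
```

Note that `matmul` knows nothing about checksums; the encoding survives because the checksum row and column are linear functions of the data.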

Page 20: Algorithm-Based Fault Tolerance Matrix Multiplication

Applications

• Matrix Multiplication
  – Given encoded A and B,
  – Check whether MMM result C (?=AB) has valid encoding
• Matrix Factorization
  – Given a factorization A=WZ
  – Verify correctness by verifying encodings of factors
    • Factors row- OR column-encoded
    • Can only detect, not correct errors

Page 21: Algorithm-Based Fault Tolerance Matrix Multiplication

Weighted ABFT

• Oftentimes need to check row- or column-encoded matrices
  – Ex: factorization, data integrity check
• Can only detect errors in such matrices
• Can we also correct?
• Yes, by generalizing to weighted checking rows/columns

Page 22: Algorithm-Based Fault Tolerance Matrix Multiplication

Weighting

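• Suppose we have d n-vectors w1…wd
• Can column-encode matrix A:

$$\begin{bmatrix} A & Aw_1 & \dots & Aw_d \end{bmatrix}$$

• Let's try out:

$$w_1 = \begin{bmatrix} 1 & 1 & \dots & 1 \end{bmatrix}^T, \qquad w_2 = \begin{bmatrix} 1 & 2 & \dots & n \end{bmatrix}^T$$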

Page 23: Algorithm-Based Fault Tolerance Matrix Multiplication

Weighted Error Detection

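Erroneous matrix, with row i followed by its two weighted checksums:

$$\hat{A} = \begin{bmatrix} \vdots & & \vdots & & \vdots & \vdots & \vdots \\ a_{i,1} & \dots & \hat{a}_{i,j} & \dots & a_{i,n} & w_1 \cdot (a_{i,1} \dots a_{i,n}) & w_2 \cdot (a_{i,1} \dots a_{i,n}) \\ \vdots & & \vdots & & \vdots & \vdots & \vdots \end{bmatrix}$$

$$\begin{aligned}
w_1 \cdot (a_{i,1} \dots a_{i,n}) &= a_{i,1} + \dots + a_{i,j} + \dots + a_{i,n} \\
\text{Let } s_1 &= w_1 \cdot (a_{i,1} \dots a_{i,n}) - (a_{i,1} + \dots + \hat{a}_{i,j} + \dots + a_{i,n}) = a_{i,j} - \hat{a}_{i,j} \\
w_2 \cdot (a_{i,1} \dots a_{i,n}) &= 1 \cdot a_{i,1} + \dots + j \cdot a_{i,j} + \dots + n \cdot a_{i,n} \\
\text{Let } s_2 &= w_2 \cdot (a_{i,1} \dots a_{i,n}) - (1 \cdot a_{i,1} + \dots + j \cdot \hat{a}_{i,j} + \dots + n \cdot a_{i,n}) \\
\text{Clearly, } s_2 &= j \cdot a_{i,j} - j \cdot \hat{a}_{i,j} = j\,(a_{i,j} - \hat{a}_{i,j}) \\
\text{Thus, } \frac{s_2}{s_1} &= j: \text{ erroneous entry detected!}
\end{aligned}$$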

Page 24: Algorithm-Based Fault Tolerance Matrix Multiplication

Weighted Error Correction

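Erroneous matrix, with row i followed by its two weighted checksums:

$$\hat{A} = \begin{bmatrix} \vdots & & \vdots & & \vdots & \vdots & \vdots \\ a_{i,1} & \dots & \hat{a}_{i,j} & \dots & a_{i,n} & w_1 \cdot (a_{i,1} \dots a_{i,n}) & w_2 \cdot (a_{i,1} \dots a_{i,n}) \\ \vdots & & \vdots & & \vdots & \vdots & \vdots \end{bmatrix}$$

$$s_1 = (a_{i,j} - \hat{a}_{i,j})$$

Correction:

$$\hat{a}_{i,j} \leftarrow \hat{a}_{i,j} + s_1 = \hat{a}_{i,j} + (a_{i,j} - \hat{a}_{i,j}) = a_{i,j}$$

• Weighted encoding Detects and Corrects single errors
  – Even for non full-encoding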
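A sketch of weighted detection and correction on a single row, using the two weight vectors from the slides, w1 = (1, …, 1) and w2 = (1, 2, …, n). The function names `encode_row` and `correct_row` and the sample data are assumptions, and the sketch assumes the two checksum entries themselves are not corrupted.

```python
def encode_row(row):
    """Append w1 . row (plain sum) and w2 . row (weights 1..n) to the row."""
    plain = sum(row)
    weighted = sum((j + 1) * a for j, a in enumerate(row))
    return row + [plain, weighted]

def correct_row(encoded):
    """Fix at most one wrong data entry in place.

    Returns the 1-based column j of the repaired entry, or None if both checks pass.
    """
    n = len(encoded) - 2
    data = encoded[:n]
    s1 = encoded[n] - sum(data)
    s2 = encoded[n + 1] - sum((j + 1) * a for j, a in enumerate(data))
    if s1 == 0 and s2 == 0:
        return None
    if s1 == 0 or s2 % s1 != 0 or not 1 <= s2 // s1 <= n:
        raise ValueError("not a single data-entry error")
    j = s2 // s1            # s2 / s1 = j locates the error
    encoded[j - 1] += s1    # a_hat + s1 = a
    return j

row = encode_row([7, 1, 4, 9])   # [7, 1, 4, 9, 21, 57]
row[2] = 40                      # corrupt the third entry
column = correct_row(row)
print(column, row)               # prints 3 [7, 1, 4, 9, 21, 57]
```

No column checksums are needed here: the second weighted sum plays the role that the column checksum played in the full encoding.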

Page 25: Algorithm-Based Fault Tolerance Matrix Multiplication

Outline

Linear Error Correcting Codes

Algorithm-Based Fault Tolerance

ABFT = Linear Encoding of Matrices

Page 26: Algorithm-Based Fault Tolerance Matrix Multiplication

“Surprise”

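• But this is all just a linear code!
• Generator matrix for above scheme:

$$G = \begin{bmatrix} 1 & & & & 1 & 1 \\ & 1 & & & 1 & 2 \\ & & \ddots & & \vdots & \vdots \\ & & & 1 & 1 & k \end{bmatrix}$$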

Page 27: Algorithm-Based Fault Tolerance Matrix Multiplication

Generating Encodings

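• Given m=<ai,1, ai,2, …, ai,k> as message word (or matrix row/column)

$$m \cdot G = \begin{bmatrix} a_{i,1} & a_{i,2} & \dots & a_{i,k} \end{bmatrix} \begin{bmatrix} 1 & & & & 1 & 1 \\ & 1 & & & 1 & 2 \\ & & \ddots & & \vdots & \vdots \\ & & & 1 & 1 & k \end{bmatrix} = \begin{bmatrix} a_{i,1} & a_{i,2} & \dots & a_{i,k} & \sum_{l=1}^{k} a_{i,l} & \sum_{l=1}^{k} l \cdot a_{i,l} \end{bmatrix}$$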
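The product m·G can be sketched directly. The helper names `generator_matrix` and `encode` and the sample message are assumptions for this illustration; arithmetic is over the integers, as in the checksum scheme.

```python
def generator_matrix(k):
    """Build G = [I_k | column of ones | column 1..k] as nested lists."""
    return [[1 if c == r else 0 for c in range(k)] + [1, r + 1] for r in range(k)]

def encode(message, G):
    """Row vector message * G over the integers."""
    return [sum(m * row[j] for m, row in zip(message, G)) for j in range(len(G[0]))]

G = generator_matrix(4)
codeword = encode([7, 1, 4, 9], G)   # made-up message word
print(codeword)                      # prints [7, 1, 4, 9, 21, 57]
```

The identity block copies the message, and the last two columns of G produce the plain and the weighted checksum.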

Page 28: Algorithm-Based Fault Tolerance Matrix Multiplication

Surprise??

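• Not too surprising really
• Why else would MMM preserve encoding?
• Another possibility:

$$w_1 = \begin{bmatrix} 1 & 1 & \dots & 1 \end{bmatrix}, \qquad w_2 = \begin{bmatrix} 2^0 & 2^1 & \dots & 2^n \end{bmatrix}$$

  – Efficient: can be implemented via bit shifts
• Room open for using any linear code!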
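A sketch of the power-of-two weighting on integer data, where each weighted term is a left shift. The function names are assumptions, the weights are taken to start at 2^0 for the first entry, and the sketch assumes the checksums themselves are intact.

```python
def plain_checksum(row):
    """w1 . row with w1 = [1, 1, ..., 1]."""
    return sum(row)

def shift_checksum(row):
    """w2 . row with w2 = [2^0, 2^1, ...]; each product is a left shift."""
    return sum(a << j for j, a in enumerate(row))

def locate_error(row, plain, shifted):
    """Return the 0-based index of a single wrong entry, or None if both checks pass."""
    s1 = plain - plain_checksum(row)
    s2 = shifted - shift_checksum(row)
    if s1 == 0 and s2 == 0:
        return None
    if s1 == 0 or s2 % s1 != 0:
        raise ValueError("not a single data-entry error")
    ratio = s2 // s1                       # equals 2^j for an error at index j
    if ratio <= 0 or ratio & (ratio - 1):  # reject anything that is not a power of two
        raise ValueError("not a single data-entry error")
    return ratio.bit_length() - 1

row = [7, 1, 4, 9]                         # made-up data
plain, shifted = plain_checksum(row), shift_checksum(row)
row[3] = 8                                 # corrupt the last entry
index = locate_error(row, plain, shifted)
row[index] += plain - plain_checksum(row)  # add s1 to repair the entry
print(index, row)                          # prints 3 [7, 1, 4, 9]
```

The trade-off is that the weighted checksum grows by up to one bit per column, so fixed-width integers can overflow for wide rows.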

Page 29: Algorithm-Based Fault Tolerance Matrix Multiplication

Error Detection/Correction in General

• To show for linear codes:
  – Can detect fewer than dmin errors
  – Can correct ⌊(dmin-1)/2⌋ errors
• Let m be original codeword
• Let m̂ be the corrupted codeword
• m̂ = m + e
  – e: error vector

Page 30: Algorithm-Based Fault Tolerance Matrix Multiplication

Error Detection in General

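• m̂ = m + e
• Let s = Hm̂ = H(m + e) = Hm + He = He
  – s called the “syndrome vector”
  – Independent of original codeword
• Note: 0 < weight(e) < dmin since at least one and fewer than dmin errors
• Thus: He ≠ 0 (e is too light to be a nonzero codeword)
• Detection: if Hm̂ ≠ 0, then ERROR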
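The syndrome computation can be sketched on the (7, 3) example code. The matrix H below is an assumed parity check matrix for that code (built as [I | P^T] from G = [P | I]); the names `syndrome` and `add` and the chosen error vector are assumptions for the example.

```python
# Assumed parity check matrix for the (7, 3) example code.
H = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]

def syndrome(word):
    """s = H * word over GF(2)."""
    return [sum(h * w for h, w in zip(row, word)) % 2 for row in H]

def add(x, y):
    """Componentwise sum over GF(2)."""
    return [(a + b) % 2 for a, b in zip(x, y)]

c1 = [1, 0, 1, 1, 1, 0, 0]      # a codeword of the example code
c2 = [0, 1, 1, 0, 1, 1, 0]      # another codeword
error = [0, 0, 0, 0, 0, 1, 0]   # single error in position 6

s_a = syndrome(add(c1, error))
s_b = syndrome(add(c2, error))
print(syndrome(c1), s_a, s_b)
# prints [0, 0, 0, 0] [1, 1, 0, 1] [1, 1, 0, 1]
```

Both corrupted words give the same nonzero syndrome, equal to H·e, so the check depends only on the error and not on which codeword was sent.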

Page 31: Algorithm-Based Fault Tolerance Matrix Multiplication

Error Correction in General

• Clearly e is correction vector
  – corrects error in m̂: m = m̂ − e
• Sufficient to prove:
  – weight(e) ≤ (dmin-1)/2 ⇒ H is isomorphism: correction vectors ↔ syndrome vectors
  – i.e. for each correction vector (want to know) ∃ unique syndrome vector
• Thus, possible to correct any error of weight ≤ (dmin-1)/2
  – may not be efficient

Page 32: Algorithm-Based Fault Tolerance Matrix Multiplication

H is Onto

• weight(e) ≤ (dmin-1)/2 < dmin
• rank(H) = n-k ≥ (dmin-1)/2
• Thus, rank(H) ≥ weight(e) and He ≠ 0
  – Not enough 1’s in e to sum H’s columns to 0
• Hm̂ = He = s
• H maps onto its range
• Thus, ∀e ∃s. He = s

Page 33: Algorithm-Based Fault Tolerance Matrix Multiplication

H is 1-1

• Let e1 and e2 be correction vectors, e1 ≠ e2
• Suppose that:
  – weight(e1), weight(e2) ≤ (dmin-1)/2
  – He1 = He2 = s
• He1-He2 = H(e1-e2) = s-s = 0
• And so, (e1-e2) is a nonzero codeword
• Thus, weight(e1-e2) ≥ dmin
• But weight(e1), weight(e2) ≤ (dmin-1)/2 and so weight(e1-e2) ≤ dmin-1
• Contradiction! ⇒ e1 = e2
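The one-to-one argument can be illustrated with a syndrome lookup table for the (7, 3) example code, which has dmin = 4 and therefore corrects ⌊(4-1)/2⌋ = 1 error. H is an assumed parity check matrix for that code, and the names `syndrome`, `table` and `decode` are hypothetical.

```python
# Assumed parity check matrix for the (7, 3) example code.
H = [
    [1, 0, 0, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 1, 0],
]
n = len(H[0])

def syndrome(word):
    """s = H * word over GF(2), as a hashable tuple."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)

# All correction vectors of weight <= 1: the zero vector and the n unit vectors.
errors = [[0] * n] + [[1 if p == q else 0 for q in range(n)] for p in range(n)]

# Group correction vectors by syndrome; 1-1 means every group has one member.
table = {}
for e in errors:
    table.setdefault(syndrome(e), []).append(e)
one_to_one = all(len(group) == 1 for group in table.values())

def decode(received):
    """Look up the correction vector for the syndrome and remove it (over GF(2))."""
    e = table[syndrome(received)][0]
    return [(r + x) % 2 for r, x in zip(received, e)]

received = [1, 0, 1, 1, 1, 1, 0]   # codeword 1011100 with bit 6 flipped
decoded = decode(received)
print(one_to_one, decoded)         # prints True [1, 0, 1, 1, 1, 0, 0]
```

A received word with a heavier error may produce a syndrome that is missing from `table` (a `KeyError` in this sketch), which corresponds to an error that is detected but not correctable. The table has one entry per correction vector, which is why the slides note that the general method may not be efficient.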

Page 34: Algorithm-Based Fault Tolerance Matrix Multiplication

Other Encoding Schemes

• Linear codes preserved by matrix multiplication
• Presumably, fancier codes might be preserved by fancier computations
• Limit:
  – S. Winograd showed in 1962 that any code s.t. f(xy) = f(x) f(y) has rate (k/n) or minimum weight → 0 as k → ∞
• How general can we get?
• Do good solutions exist for small k?
  – k=64 bits should be good enough

Page 35: Algorithm-Based Fault Tolerance Matrix Multiplication

Summary

• For Matrix Multiplication can encode input via linear codes
• Solutions exist for more complex codes
  – Ex: Fourier Transforms
• On parallel systems must ensure:
  – No processor touches >1 element per row/column
  – Else, if one processor fails, encoding overwhelmed with errors
  – To ensure this must modify algorithm
• Separate check placement theory