Algorithm-Based Fault Tolerance Theory of Check Placement
CS717
Algorithm-Based Fault Tolerance: Theory of Check Placement
Greg Bronevetsky
So Far…
• Learned how certain computations could be checked using algorithm-specific checks.
• In any algorithm we can develop checks to verify any set of data items.
• How effective are these checks?
• How many faults can a given set of checks detect?
Abstract Checks
• Suppose we are given (g,h)-checks
– Check defined on g data elements
– If all elements correct, returns 0
– If > 0 and ≤ h elements erroneous, returns 1
– If > h elements erroneous, result undefined
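The definition above can be sketched as a small function. This is only an illustrative model: the name `gh_check` and the use of `None` for the undefined case are assumptions, not something the slides specify.

```python
from typing import Optional, Sequence


def gh_check(erroneous: Sequence[bool], h: int) -> Optional[int]:
    """Idealized (g,h)-check over g = len(erroneous) data elements.

    Returns 0 when no element is erroneous, 1 when between 1 and h are,
    and None when more than h are (the real outcome is undefined).
    """
    bad = sum(1 for flag in erroneous if flag)
    if bad == 0:
        return 0
    if bad <= h:
        return 1
    return None  # overwhelmed: the check may answer 0 or 1


# A (2,1)-check: two elements, detects a single error.
clean = gh_check([False, False], h=1)
single = gh_check([True, False], h=1)
double = gh_check([True, True], h=1)
```

Returning `None` keeps the "undefined" case explicit, so later analysis cannot silently treat an overwhelmed check as a reliable 0 or 1.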
Checking Example
• Assume (2,1) checks: 2 elements each, detects 1 failure
• Both sets of checks can detect single errors
• Neither can locate individual errors
[Figure: two check placements over data elements d1, d2, …, dn and their sum. Left, n checks: for each i, di and sum. Right, 1 check: sum.]
But with one more check…
• If also check sum
– can detect any pair of errors
– can locate single errors
• Need general theory of effective and efficient check placement
[Figure: data elements d1, d2, …, dn and their sum. n checks: for each i, di and sum. 1 more check: sum.]
Goals
• Need models for correlating processor faults to data errors
• Given fault model and set of checks need to derive fault detectability and locatability
Papers covered
• V.S.S. Nair, J.A. Abraham, P. Banerjee. "Efficient techniques for the analysis of algorithm-based fault tolerance (ABFT) schemes", 1996.
• Choon-Sik Park and Mineo Kaneko, "An Efficient Technique for Design of ABFT Systems Based on Modified PD Graph".
• Choon-Sik Park, "Algorithm-Based Fault Tolerant Systems Based on Graph-Theoretic Error Occurence+Propagation Models", 2000. (PhD Thesis)
• V.S.S. Nair, J.A. Abraham. "Hierarchical design and analysis of fault-tolerant multiprocessor systems using concurrent error detection", 1990.
Outline
• Matrix-based formalism of Nair et al
• Dependence graph-based formalism of Park et al
– Includes fault propagation models
• Framework for hierarchical fault tolerant systems by Nair et al
– Building fault tolerant systems out of fault tolerant components
Basic Framework
• Each processor and check associated with set of elements
[Figure: processors P1–P4 linked to data elements d1–d7, which are in turn linked to checks C1–C3]
Basic Framework
• Data(Pi) = set of data elements affected by processor i
– If Pi fails, any subset of Data(Pi) may be erroneous
– No notion of errors propagating based on data dependences
• Data() defines the Processor-Data (PD) Matrix
Associated PD Matrix
PD matrix (rows: processors; columns: data elements):
     d1 d2 d3 d4 d5 d6 d7
P1    1  1  0  0  0  0  0
P2    0  0  1  0  0  0  0
P3    0  0  0  1  0  1  0
P4    0  0  0  0  1  1  1
Basic Framework
• Check(di) = set of checks that check data element di
– Must be non-empty if we expect to detect errors
• Check defines the Data-Check (DC) Matrix
• Paper focuses on (g,1) checks
– g data elements
– can detect up to 1 fault
Associated DC Matrix
DC matrix (rows: data elements; columns: checks):
     C1 C2 C3
d1    1  0  0
d2    0  1  0
d3    0  0  1
d4    1  1  0
d5    0  0  1
d6    0  1  0
d7    0  0  1
• C1 and C2 are (3,1) checks
• C3 is a (2,1) check
The PC Matrix
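• Finally, associate processors and checks:
• Processor-Check (PC) matrix = PD × DC
– Each PC entry = # of the processor's data elements verified by that check
PD (rows: processors; columns: data elements):
1 1 0 0 0 0 0
0 0 1 0 0 0 0
0 0 0 1 0 1 0
0 0 0 0 1 1 1
DC (rows: data elements; columns: checks):
1 0 0
0 1 0
0 0 1
1 1 0
0 0 1
0 1 0
0 0 1
PC (rows: processors; columns: checks); the rows shown are those of P3 and P4:
1 2 0
0 1 2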
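The product PC = PD × DC can be checked with a few lines of Python. This is a sketch using the PD matrix from the "Associated PD Matrix" slide (first row 1100000) and the DC matrix as listed; the helper name `mat_mul` is an assumption made for illustration.

```python
# PD: rows are processors P1..P4, columns are data elements d1..d7.
PD = [
    [1, 1, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 1, 1, 1],
]

# DC: rows are data elements d1..d7, columns are checks C1..C3.
DC = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
]


def mat_mul(a, b):
    """Plain integer matrix product of nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# PC[p][c] = number of processor p's data elements verified by check c.
PC = mat_mul(PD, DC)
```

With these inputs the last two rows of `PC` come out as 1 2 0 and 0 1 2, which are the rows visible on the slide.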
Using the PC Matrix
• PC matrix shows if we can detect single-processor errors
• Assume all checks are (g,h) checks
• If a row of PC has all entries ≤ h, failure of that processor will be detected
– Regardless of which entries actually become erroneous
[Figure: PC matrix (rows: processors; columns: checks; entry = # elements verified by check); rows shown: 1 2 0 and 0 1 2]
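The row test can be sketched as below, reading the condition as "every entry is at most h". The function name `row_passes` is an assumption, and the sketch covers only the "not overwhelmed" part: it does not verify that every data element is covered by some check.

```python
def row_passes(pc_row, h):
    """Conservative test: no check sees more than h of the processor's elements."""
    return all(entry <= h for entry in pc_row)


# The two PC rows visible on the slide.
PC_ROWS = [[1, 2, 0], [0, 1, 2]]

with_h1 = [row_passes(row, 1) for row in PC_ROWS]  # (g,1) checks
with_h2 = [row_passes(row, 2) for row in PC_ROWS]  # (g,2) checks
```

Under (g,1) checks both rows fail the test because each contains a 2, which is the kind of case the later slides relax.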
Using the PC Matrix
• If a row of PC has all entries ≤ h, failure of that processor will be detected
[Figure: processors P1–P4, data elements d1–d7 and checks C1–C3, next to the PC matrix (rows shown: 1 2 0 and 0 1 2)]
Relaxing Detectability
• Condition is too conservative
• Suppose we have (3,2) checks
• Pi's PD row is: 1 1 0 1 1
• There are 2 checks. DC matrix (one row per data element; columns C1, C2):
1 1
1 0
1 0
0 1
0 1
• PC matrix: 2 3
[Figure: P1 with data elements d1–d5 and checks C1, C2]
Relaxing Detectability
• C1 may be overwhelmed by errors
– Will not notice error <d1, d2, d5>
• By above criterion system can’t detect failure in P1
Reaching New Detectability Definition
• But how could C1 be overwhelmed?
• When all 3 of its elements have errors
– Recall, these are (3,2) checks
Reaching New Detectability Definition
• But C1 and C2 overlap on d5
• Thus if C1 overwhelmed, C2 detects error
– It is not overwhelmed
• Thus, for any error pattern can see if any check will notice
Trivial Algorithm 2
• Try every possible error pattern
– Exponentially many of them
• For each pattern see if some check will detect it
– Before: ensured that no check overwhelmed
• Pro: Correct and not conservative
• Con: Expensive
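A minimal sketch of this brute-force test follows. The layout is an assumption chosen to agree with the preceding slides ((3,2) checks, C1 = {d1, d2, d5}, C2 = {d3, d4, d5}, P1 touching d1, d2, d4, d5); the function names are hypothetical.

```python
from itertools import combinations


def detects(check, pattern, h):
    """A (g,h)-check reliably fires iff it meets the pattern on 1..h elements."""
    return 1 <= len(check & pattern) <= h


def brute_force_detectable(data, checks, h):
    """Try every non-empty subset of the failed processor's data elements.

    Returns (True, None) if each pattern is caught by some check, otherwise
    (False, pattern) with the first undetected pattern found.
    """
    elems = sorted(data)
    for size in range(1, len(elems) + 1):
        for combo in combinations(elems, size):
            pattern = set(combo)
            if not any(detects(c, pattern, h) for c in checks):
                return False, pattern
    return True, None


# Assumed layout: data elements are numbered 1..5 for d1..d5.
C1, C2 = {1, 2, 5}, {3, 4, 5}
p1_result = brute_force_detectable({1, 2, 4, 5}, [C1, C2], h=2)
all_five = brute_force_detectable({1, 2, 3, 4, 5}, [C1, C2], h=2)
```

In this assumed layout the pattern <d1, d2, d5> overwhelms C1 but is still caught by C2 through d5, so the failure of P1 is detectable even though the conservative PC-row test rejects it. The loop visits 2^n − 1 patterns, which is the cost the slide warns about.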
New Definition of Detectability
• Work with error patterns
– Ex: <d1, d2, d5>, <d1, d3, d4>, <d3>, etc.
• If one check detects given error pattern, no problem if other checks overwhelmed
• Repeat until all error patterns detected:
– If some check not overwhelmed, eliminate all detectable error patterns from consideration
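The elimination loop can be sketched as follows. The check layout (C1 = {d1, d2}, C2 = {d2, d3}, C3 = {d3, d4}, C4 = {d4, d5}) is an assumption picked to be consistent with the walk-through in the example slides that follow; the function name `eliminate` is hypothetical.

```python
def eliminate(data, checks, h):
    """Repeatedly find a check that is not overwhelmed by the remaining
    suspect elements and drop the elements it covers.

    `checks` maps a check name to the set of data elements it verifies.
    Returns (all_detected, order_in_which_checks_were_used).
    """
    remaining = set(data)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        for name, members in checks.items():
            hit = members & remaining
            if 1 <= len(hit) <= h:  # meets remaining errors, not overwhelmed
                remaining -= hit    # any pattern touching `hit` is detected
                order.append(name)
                progress = True
                break
    return not remaining, order


# Assumed layout with (2,1) checks; data elements are numbered 1..5.
CHECKS = {"C1": {1, 2}, "C2": {2, 3}, "C3": {3, 4}, "C4": {4, 5}}
p1_detectable, used = eliminate({1, 2, 3}, CHECKS, h=1)
```

For this layout the checks are used in the order C3, C2, C1, mirroring the example. The loop is polynomial in the number of checks and elements, in contrast to enumerating every error pattern.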
Example of Detectability Algorithm
• Is failure of P1 detectable?
• P1 fails ⇒ d1, d2 and/or d3 may have errors
• C1, C2 overwhelmed
• C3 not overwhelmed
[Figure: processors P1, P2; data elements d1–d5; checks C1–C4, all (2,1) checks]
Example of Detectability Algorithm
• Look at errors C3 can detect: d3
• Remove them from consideration
– Since any error pattern involving d3 will be detected
Example of Detectability Algorithm
• Look at remaining error patterns: combinations of d1 and/or d2
• Now C2 not overwhelmed
• Remove any error patterns involving d2
Example of Detectability Algorithm
• Look at remaining error patterns: d1
• C1 not overwhelmed
• Remove any of its error patterns
Example of Detectability Algorithm
• All of P1’s error patterns detected
• We are done!
Failing Check Processors
• What if processor performing check fails?
• Add “pseudo” data elements to represent processors
• Each check will also check its processor's pseudo-data element
– New element has weight ∞, so error in it will overwhelm any check
Final System
• Check C3 is in P1
• Checks C1, C2 and C4 on P2
[Figure: processors P1, P2; data elements d1–d7; checks C1–C4, all (2,1) checks]
The Infinities
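[Figure: processors P1, P2; data elements d1–d7; checks C1–C4, all (2,1) checks]
PD (rows: processors; columns: data elements):
011000
000111
DC (rows: data elements; columns: checks):
1011
0100
1000
0100
0110
1011
0001
PC (rows: processors; columns: checks; entry = # elements verified by check):
11
1122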
The Infinities
• If P1 fails, C1 and C2 overwhelmed
• C3 also overwhelmed by ∞+1
– Because C3 runs on failed P1
• Only C4 not overwhelmed
The Infinities
• Remove all error patterns detected by C4
– Any that include d2
The Infinities
• C1 and C2 no longer overwhelmed
• Remove error patterns detected by C1 and C2
– Any that include d1 and d3
• Note: C4's entry must become 0; others may go lower
PC now: 11 / 0111
The Infinities
• Now P1's row is all 0's and ∞'s
• All real data elements successfully checked
• Only pseudo-elements remain
– Don't care
• Note: C1's and C2's entries must become 0; others may go lower
PC now: 11 / 000
The Infinities
• Note failure of P2 not detectable
• d5 only checked by C4, which runs on P2
• Thus, C4's entry never drops low enough for the failure to be detected
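The effect of checks running on a failed processor can be sketched by simply discarding those checks, which is a simplification of the ∞-weight pseudo-element. The layout below (C1 = {d1, d2}, C2 = {d2, d3}, C3 = {d3, d4}, C4 = {d2, d5}; P1 touches d1–d3, P2 touches d4–d5) is an assumption chosen to reproduce the walk-through, not the slide's actual matrices.

```python
def detectable_with_hosts(failed, data_of, checks, host_of, h):
    """Elimination test in which a check hosted on the failed processor is
    treated as permanently overwhelmed (stand-in for the infinite weight)."""
    remaining = set(data_of[failed])
    usable = {name: m for name, m in checks.items() if host_of[name] != failed}
    progress = True
    while remaining and progress:
        progress = False
        for members in usable.values():
            hit = members & remaining
            if 1 <= len(hit) <= h:
                remaining -= hit
                progress = True
    return not remaining


# Assumed layout; data elements are numbered 1..5 for d1..d5.
DATA_OF = {"P1": {1, 2, 3}, "P2": {4, 5}}
CHECKS = {"C1": {1, 2}, "C2": {2, 3}, "C3": {3, 4}, "C4": {2, 5}}
HOST_OF = {"C1": "P2", "C2": "P2", "C3": "P1", "C4": "P2"}

p1_ok = detectable_with_hosts("P1", DATA_OF, CHECKS, HOST_OF, h=1)
p2_ok = detectable_with_hosts("P2", DATA_OF, CHECKS, HOST_OF, h=1)
```

In this layout P1's failure is detectable (C4 first, then C1 and C2), while P2's is not: d5 is covered only by C4, which runs on P2 itself.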
Multi-Process Errors
• Want to know if system can detect failures of r processors
• For every subset of r processors
– Take union of all data elements they touched
– Pretend each r-set is single processor
• Use above algorithm to check if all resulting error patterns detectable
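The steps above can be sketched as a loop over r-subsets of processors. The elimination helper is a compact restatement of the single-processor test; the toy system (three processors, three (2,1) checks) and all names are assumptions for illustration, and failures of check-hosting processors are ignored here.

```python
from itertools import combinations


def eliminate(data, checks, h):
    """Single-processor test: peel off elements covered by a check that
    meets the remaining suspects on 1..h elements."""
    remaining = set(data)
    progress = True
    while remaining and progress:
        progress = False
        for members in checks:
            hit = members & remaining
            if 1 <= len(hit) <= h:
                remaining -= hit
                progress = True
    return not remaining


def r_fault_detectable(data_of, checks, h, r):
    """Treat every r-subset of processors as one processor whose data is
    the union of the members' data. Returns (ok, first_failing_group)."""
    for group in combinations(sorted(data_of), r):
        merged = set().union(*(data_of[p] for p in group))
        if not eliminate(merged, checks, h):
            return False, group
    return True, None


# Assumed toy system: one data element per processor, three (2,1) checks.
DATA_OF = {"P1": {1}, "P2": {2}, "P3": {3}}
CHECKS = [{1, 2}, {2, 3}, {1, 3}]
two_faults = r_fault_detectable(DATA_OF, CHECKS, h=1, r=2)
three_faults = r_fault_detectable(DATA_OF, CHECKS, h=1, r=3)
```

The number of groups grows as n choose r, so the per-group test needs to stay cheap for this to be practical.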
Fault Locatability
• We only see errors, not faults
• For each error pattern, want to know which fault caused it
• Given two fault patterns, are they distinguishable?
• Only if they have different patterns of failed checks
• Will give intuition for analysis
0-1 Disagreement
• Take rows Ri and Rj of rPC (faults Fi and Fj)
• For every possible error pattern in Ri and Rj look at what each check says on this pattern
• If check responses different on each pattern: Fi and Fj can be differentiated
1-0 Disagreement
• Want to differentiate faults $F_i$ and $F_i \cup F_j$
• Compare each error pattern of Fi and Fj: $E_{ik}$ and $E_{jl}$
• If some check meets $E_{ik}$ on ≥ 1 and ≤ h spots and meets $E_{jl}$ on 0 spots, then $E_{jl}$ and $E_{ik} \cup E_{jl}$ distinguishable (the check fires on the union but not on $E_{jl}$ alone)
• If this is true for all error patterns then $F_i$ and $F_i \cup F_j$ distinguishable
1-0 Disagreement Example
DC (rows: data elements; columns: checks):
0 0 1
0 1 1
1 0 1
Errors $E_{ik}$, $E_{jl}$: 1 0 1, 1 1 0
Checks on $E_{ik}$, $E_{jl}$: 1 0 2, 0 1 2
1-0 disagreement in both directions
1-0 Disagreement Example
• Clearly, $E_{ik}$ and $E_{jl}$ look different
• $E_{ik} \cup E_{jl}$ corresponds to fault pattern: 1 1 1
• Checks would say: 1 1 3 (row vector 1 1 1 times DC)
• Different from $E_{ik}$ or $E_{jl}$: Distinguishable!
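The check counts in this example can be recomputed directly from the DC matrix. The sketch below uses the DC matrix and the two error vectors from the slides; the function names and the choice h = 1 are assumptions.

```python
# DC from the example: rows are data elements, columns are checks.
DC = [
    [0, 0, 1],
    [0, 1, 1],
    [1, 0, 1],
]


def check_counts(error, dc):
    """Number of erroneous elements each check sees (error vector times DC)."""
    return [
        sum(e * row[c] for e, row in zip(error, dc))
        for c in range(len(dc[0]))
    ]


def one_zero_disagreements(err_a, err_b, dc, h):
    """Indices of checks that meet err_a on 1..h elements and err_b on none."""
    a, b = check_counts(err_a, dc), check_counts(err_b, dc)
    return [c for c in range(len(a)) if 1 <= a[c] <= h and b[c] == 0]


first = check_counts([1, 0, 1], DC)
second = check_counts([1, 1, 0], DC)
union = check_counts([1, 1, 1], DC)
forward = one_zero_disagreements([1, 0, 1], [1, 1, 0], DC, h=1)
backward = one_zero_disagreements([1, 1, 0], [1, 0, 1], DC, h=1)
```

The first check disagrees in one direction and the second check in the other, which is the "both directions" case noted on the slide.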
Fault Locatability
• If can show 1-0 disagreement between every single-process fault and every r-process fault: System is r-fault locatable
• Algorithm for locatability is obscure
• Read the paper
Summary
• Presented matrix-based framework for evaluating error detectability & locatability
• Framework deals with arbitrary errors
• More work by V.S.S. Nair with other coauthors
Outline
• Matrix-based formalism of Nair et al
• Dependence graph-based formalism of Park et al
– Includes fault propagation models
• Framework for hierarchical fault tolerant systems by Nair et al
– Building fault tolerant systems out of fault tolerant components
Graph-Based Framework
• Developed by Choon-Sik Park
• Does in graphs what Nair et al work does in matrices
• Assumes (g,1) checks
• Differences:
– Different definition of fault locatability
  • Unknown if equivalent
– Presents more limited fault→error models
  • As opposed to "anything and everything"
• Will first present general view, then specific error models
Basic Picture
[Figure: four linked columns: Faults (Fi, Fj), Errors (eiu, ejv), Data, Checks (c, c'). Processor→Data and Data→Data dependence info maintained.]
k-Faults
• Faults may cause number of possible errors
– For given fault, many errors possible
– If given error happens, all associated data elements definitely corrupted
• k-Faults: faults generating errors that corrupt k data elements
[Figure: fault Fi linked to error eiu and its data elements (columns: Faults, Errors, Data)]
Fault Detectability
• System is k-fault detectable if for every error pattern $e_{iu}$ there exists a check c s.t. $|c \cap e_{iu}| = 1$
– $\cap$ means intersection of affected data elements
• Proof:
– If there exists such a check then every error pattern induced by fault will be detected
– If k-fault detectable then there must be some check that reliably yells for any possible error pattern
• Can allow the check that yells to be the check in definition
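For (g,1) checks the condition is a direct set computation. A minimal sketch, with error patterns and checks represented as sets of data-element ids; the example data and function names are assumptions.

```python
def detected(pattern, checks):
    """True if some (g,1)-check meets the error pattern on exactly one element."""
    return any(len(c & pattern) == 1 for c in checks)


def undetected_patterns(error_patterns, checks):
    """Error patterns for which no check satisfies |c ∩ e| = 1."""
    return [e for e in error_patterns if not detected(e, checks)]


# Assumed toy system: two checks over three data elements.
CHECKS = [{1, 2}, {2, 3}]
PATTERNS = [{1}, {2}, {1, 2}, {1, 3}, {1, 2, 3}]
missed = undetected_patterns(PATTERNS, CHECKS)
```

Here only the pattern {1, 2, 3} escapes, since it meets both checks on two elements.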
Fault Management
• k-fault detectability: If a fault affects k data elements then checks will detect it
• k-fault locatability: For all faults that affect k data elements, can tell any pair of faults apart
• Will examine all fault patterns Fi that come from k data elements failing
Fault Locatability 1
• To locate faults, must ensure that different faults cause different errors
• Theorem 1: System k-fault locatable only if for all error patterns $e_{iu}$, $e_{jv}$ (from faults Fi and Fj), $e_{iu} \triangle e_{jv} \neq \emptyset$
– $\triangle$: symmetric difference
• Proof clear: If two faults can show up as same error, can't tell them apart
Fault Locatability 2
• Theorem 2: System k-fault locatable only if for all error patterns $e_{iu}$, $e_{jv}$ there exist checks c and c' s.t.
– $|c \cap (e_{iu} \triangle e_{jv})| = 1$ (recall: all checks are (g,1))
– $|c \cap (e_{iu} \cap e_{jv})| = 0$
– If $|c \cap (e_{iu} - e_{jv})| = 1$ then $|c' \cap e_{jv}| = 1$
– If $|c \cap (e_{jv} - e_{iu})| = 1$ then $|c' \cap e_{iu}| = 1$
• Intuition: Trying to make tuple <c,c'> be different and ≠ <0,0> on errors $e_{iu}$ and $e_{jv}$
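The four conditions can be tested for one pair of error patterns as sketched below, reading the garbled set operators as symmetric difference in the first condition and intersection in the second. The function name and the example sets are assumptions.

```python
def locating_pair(e1, e2, checks):
    """Search for checks (c, c2) satisfying the Theorem 2 conditions for
    error patterns e1 and e2, all given as sets of data-element ids.
    Returns the pair if found, otherwise None."""
    for c in checks:
        # c must meet the symmetric difference once and miss the overlap.
        if len(c & (e1 ^ e2)) != 1 or len(c & (e1 & e2)) != 0:
            continue
        # c fires on exactly one of the two errors; c2 must fire on the other.
        other = e2 if len(c & (e1 - e2)) == 1 else e1
        for c2 in checks:
            if len(c2 & other) == 1:
                return c, c2
    return None


# Assumed example: two overlapping error patterns and two (2,1) checks.
E1, E2 = {1, 2}, {2, 3}
found = locating_pair(E1, E2, [{1, 4}, {3, 5}])
not_found = locating_pair(E1, E2, [{1, 4}])
same_error = locating_pair(E1, E1, [{1, 4}, {3, 5}])
```

Because c meets none of the other error's elements, the second check found is never c itself, so the tuple <c, c'> differs between the two errors and is never <0,0>.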
Fault Locatability Illustration
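[Figure: Venn diagram of $e_{iu}$ and $e_{jv}$ marking the regions $(e_{iu} \triangle e_{jv})$, $(e_{iu} \cap e_{jv})$, $(e_{iu} - e_{jv})$ and $(e_{jv} - e_{iu})$]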
Fault Locatability Illustration
• $|c \cap (e_{iu} \triangle e_{jv})| = 1$
• i.e. c overlaps one element of $(e_{iu} \triangle e_{jv})$ (because of (g,1) checks)
Fault Locatability Illustration
• $|c \cap (e_{iu} \cap e_{jv})| = 0$
• i.e. c only touches on the part that is unique to $e_{jv}$
Fault Locatability Illustration
• If $|c \cap (e_{jv} - e_{iu})| = 1$ then $|c' \cap e_{iu}| = 1$
• If c notices $e_{jv}$ make sure that c' notices $e_{iu}$
Fault Locatability Illustration
• Error $e_{iu}$: <c,c'> = <0,1>
• Error $e_{jv}$: <c,c'> = <1,?>
• Patterns distinguishable
• Either error detected
Fault Locatability 2
• Theorem 2: System k-fault locatable only if for all error patterns $e_{iu}$, $e_{jv}$ there exist checks c and c' s.t.
– $|c \cap (e_{iu} \triangle e_{jv})| = 1$ (recall: all checks are (g,1))
– $|c \cap (e_{iu} \cap e_{jv})| = 0$
– If $|c \cap (e_{iu} - e_{jv})| = 1$ then $|c' \cap e_{jv}| = 1$
– If $|c \cap (e_{jv} - e_{iu})| = 1$ then $|c' \cap e_{iu}| = 1$
• Thus, if the above is true for every pair of error patterns, system is k-fault detectable
Extra Fault Detectability
• Theorem: if system is k-fault locatable then it is 2k-fault detectable
• Must show: for any fault Fl in 2k processors and all resulting errors $e_{lw}$, there exists a check c s.t. $|c \cap e_{lw}| = 1$
• Note: Failures of 2k processors result in 2 errors as failures of k data elements
• Thus, can break up $e_{lw} = (e_{iu} \cup e_{jv})$, coming from k-fault patterns Fi and Fj
Extra Fault Detectability
• Theorem: if system is k-fault locatable then it is 2k-fault detectable
• Must show: for all $e_{iu}$, $e_{jv}$ there exists a check c s.t. $|c \cap (e_{iu} \cup e_{jv})| = 1$
• If $(e_{iu} \cup e_{jv})$ happens, both c and c' will notice
Fault→Error Models
• So far trying to deal with arbitrary errors
• Actual model of how faults turn into errors not defined
– i.e. arbitrary
• This is unnecessarily general
• Should focus on realistic models of error generation and propagation
– Makes it easier to design reliable systems
Single-Input-Driven Model
• Output of computation erroneous if any input(s) are
– Even if processor is faulty
• If processor is faulty, its computations may or may not be erroneous (this is where we use data dependence information)
• Will focus on how model treats single-processor failures
SID Model Picture
• $D_{i1}, \ldots, D_{iW_i}$: data elements on Pi
– Synonymous with sets of data elements on Pi
• Focus on single-processor failures
[Figure: processor Pi linked to its data elements $D_{i1}, D_{i2}, \ldots, D_{iw}, \ldots, D_{iW_i}$]
Fault Model in Practice
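• If Pi fails, any subset of $D_{iw}$'s may have error
• If $D_{iw}$ has error, any data depending on it has error
– Bijection between $D_{iw}$ and errors $E_{iw}$
[Figure: processor Pi, its data elements $D_{i1}, \ldots, D_{iW_i}$, and the corresponding errors $E_{i2}, \ldots, E_{iw}, \ldots, E_{iW_i}$ over dependent data]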
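Under this model, the error associated with a data element can be computed as everything reachable from it in the data dependence graph. The sketch below assumes a hypothetical graph given as a mapping from each element to the elements computed from it; the names are illustrative only.

```python
from collections import deque


def error_set(source, consumers):
    """All data elements corrupted when `source` is erroneous: the element
    itself plus everything that transitively depends on it."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in consumers.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical dependence graph: x is computed from D1 and D2, y from D2,
# and z from x.
CONSUMERS = {"D1": ["x"], "D2": ["x", "y"], "x": ["z"]}
E1 = error_set("D1", CONSUMERS)
E2 = error_set("D2", CONSUMERS)
```

Each data element on the failed processor yields exactly one such set, which is the correspondence between the D's and the E's stated on the slide.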
Single-Fault Detectability in SID
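• Brute-force algorithm: for all sets of $E_{iw}$'s
– If there exists a check c s.t. $|c \cap \bigcup E_{iw}| = 1$ (union over the set) then this error pattern detectable
– If all patterns detectable, system is single-fault detectable
[Figure: processor Pi, its data elements $D_{i1}, \ldots, D_{iW_i}$, and a check c]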
Too Conservative
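• Like before, algorithm too conservative
• Examines exponentially many error patterns
• Suppose set of errors $E = \{E_1, E_2, \ldots, E_r\}$ detected via check c
– i.e. $|c \cap E| = 1$
• Look at $E' \subseteq \{E_1, E_2, \ldots, E_r\}$ s.t. $\forall E_j \in E'.\ |c \cap E_j| = 1$
[Figure: data elements D1, D2, D3, check c, and the error sets E and E']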
Too Conservative
• Clearly, all $E_j$'s overlap with c on one element
– Thus, each one detectable
– Similarly, all unions containing $E_j$'s detectable
• Therefore, if a set of errors detectable, all unions containing suberrors also detectable
– And thus, no need to check them
• Can ignore: $E_1$, $E_2$, $E_1 \cup E_2$, $E_1 \cup E_3$, $E_2 \cup E_3$, $E_1 \cup E_2 \cup E_3$
• Can't ignore: $E_3$
New Definition of Detectability
• $E_i^0 = E_i$ (start with all possible errors)
• For each check $c_s$:
– Check that $E_i^{s-1}$ detectable: $|c_s \cap \bigcup_{E_{iw} \in E_i^{s-1}} E_{iw}| = 1$
• Now ignore detectable subsets of $E_i^{s-1}$
• Remove detectable subsets: $E_i^s = E_i^{s-1} - \{E_{iw} : |c_s \cap E_{iw}| = 1\}$
• Repeat to ensure rest of $E_i^s$ also detectable
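One way to read this iteration is sketched below: a check is usable when it meets the union of the remaining errors on exactly one element, and every remaining error touching that element is then dropped. The data (six errors, four checks) is an assumption shaped like the example on the following slides, and the function name is hypothetical.

```python
def sid_undetected(errors, checks):
    """Iterative single-fault detectability test for the SID model.

    `errors` maps an error name to its set of corrupted data elements and
    `checks` is a list of sets. Returns the names of errors left uncovered
    (an empty list means every error pattern is detectable).
    """
    remaining = dict(errors)
    progress = True
    while remaining and progress:
        progress = False
        for c in checks:
            if not remaining:
                break
            union = set().union(*remaining.values())
            if len(c & union) != 1:
                continue  # check misses the remaining errors or is overwhelmed
            for name in [n for n, e in remaining.items() if c & e]:
                del remaining[name]
            progress = True
    return sorted(remaining)


# Assumed example: ids 1..6 are the processor's own data elements and
# 10..13 are dependent elements watched by checks c1..c4.
ERRORS = {
    "E1": {1, 10}, "E2": {2, 10, 11}, "E3": {3, 11},
    "E4": {4, 11}, "E5": {5, 12}, "E6": {6, 13},
}
CHECKS = [{10, 20}, {11, 21}, {12, 22}, {13, 23}]
leftover = sid_undetected(ERRORS, CHECKS)
without_c4 = sid_undetected(ERRORS, CHECKS[:3])
```

As in the example, the second check also meets E2, but E2 has already been removed by the first check by the time the second is considered.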
Detectability Example
• Check $E_i^0$ (= $E_i$)
• c1 meets E1 and E2
[Figure: data elements D1–D6 of the failed processor and check c1]
Detectability Example
• Check $E_i^0$ (= $E_i$)
• c1 meets E1 and E2
• Remove them to get $E_i^1$
Detectability Example
• Check $E_i^1 = \{E_3, E_4, E_5, E_6\}$
• c2 meets E3 and E4
– Also meets E2 but on error E2, c1 will ring
Detectability Example
• Check $E_i^1 = \{E_3, E_4, E_5, E_6\}$
• c2 meets E3 and E4
– Also meets E2 but on error E2, c1 will ring
• Remove them to get $E_i^2$
Detectability Example
• Check $E_i^2 = \{E_5, E_6\}$
• c3 meets E5
Detectability Example
• Check $E_i^2 = \{E_5, E_6\}$
• c3 meets E5
• Remove it to get $E_i^3$
Detectability Example
• Check $E_i^3 = \{E_6\}$
• c4 meets E6
– Recall: circles on left are data on processor i
Detectability Example
• Check $E_i^3 = \{E_6\}$
• c4 meets E6
– Recall: circles on left are data on processor i
• Remove it to get $E_i^4 = \emptyset$
Detectability Example
DONE!
Single-Fault Locatability in SID
• Basic definition: Must exist enough checks s.t. all error patterns produced by failure of Pi differentiable from error patterns of Pj
• Involves a lot of error patterns
• Start with brute-force definition
Brute-Force Definition
• For all error patterns $E_q = \{E_{i1}, E_{i5}, E_{iw}, \ldots\}$ from Pi there exist checks $c_q$ and $c_1 \ldots c_r$ s.t.
– $|c_q \cap E_q| = 1$
  • Detects error $E_q$
– $|c_q \cap E_j| = 0$
  • Ignores any error from Pj
– $c_1 \ldots c_r$ detect $E_j$ and all subsets via above algorithm
– And vice versa (since $c_k$'s may ring on Pi's errors)
• Result:
– Any error pattern in $E_i$, none in $E_j$ will ring some $c_q$
– Every pattern in $E_j$ detectable
Responses of Checks
• On error pattern Eq (due to failure of Pi): $c_q = 1$, while $c_1 \ldots c_r$ are unconstrained (?)
• On any error Ej due to failure of Pj: $c_q = 0$, and $c_1 \ldots c_r$ are each 1 or 0
– At least one must be = 1 (else Ej not detectable)
• Can brute-force evaluate this test on every possible Eq
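The brute-force test can be sketched in a few lines. This is only one reading of the slide, under an idealized model that is an assumption here: each check is a set of data elements and rings whenever at least one covered element is erroneous (the h bound of (g,h) checks is ignored). The names `patterns`, `rings`, `one_way` and `locatable` are hypothetical.

```python
from itertools import combinations


def patterns(data):
    """All non-empty error patterns (subsets) of one processor's data."""
    items = sorted(data)
    return [frozenset(c)
            for n in range(1, len(items) + 1)
            for c in combinations(items, n)]


def rings(check, pattern):
    """Idealized check (assumption): rings iff it covers an erroneous element."""
    return bool(check & pattern)


def one_way(checks, data_i, data_j):
    """Brute-force test of Pi's patterns against Pj's patterns."""
    pats_j = patterns(data_j)
    # Every Pj pattern must ring at least one check (Ej detectable).
    if not all(any(rings(c, e) for c in checks) for e in pats_j):
        return False
    # Every Pi pattern Eq needs some c_q that rings on Eq but on no Pj pattern.
    return all(
        any(rings(c, eq) and not any(rings(c, ej) for ej in pats_j)
            for c in checks)
        for eq in patterns(data_i)
    )


def locatable(checks, data_i, data_j):
    """Apply the test in both directions ("and vice versa")."""
    return one_way(checks, data_i, data_j) and one_way(checks, data_j, data_i)
```

With one private check per processor, e.g. `locatable([frozenset({"a"}), frozenset({"b"})], {"a"}, {"b"})`, the test passes; with a single shared check covering both elements it fails, because no check separates the two processors. The cost is exponential in the number of data elements per processor.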
![Page 80: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/80.jpg)
Brute Force Too Exhaustive
• Recall that if $c(\{E_1, E_2, \ldots, E_r\}) = 1$ then the same is true for all sets containing E1, …, Er
• Thus, can eliminate many of the steps above
![Page 81: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/81.jpg)
New Definition of Locatability
• $E_i^0 = E_i$ (start with all possible Pi errors)
• For each check cs:
– Check cs detects some $E_{iw} \in E_i^s$: $c_s(E_{iw}) = 1$
– But not Ej: $c_s(E_j) = 0$
• Ensure that Ej is detectable via above algorithm
![Page 82: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/82.jpg)
New Definition of Locatability
• Syndrome of Ei and detectable subsets: $c_q = 1$, while $c_1 \ldots c_r$ are unconstrained (?)
• Syndrome of Ej and all subsets: $c_q = 0$, and $c_1 \ldots c_r$ are each 1 or 0
– At least one must be = 1 (else Ej not detectable)
• Can now ignore detectable subsets of $E_i$
• Remove detectable subsets: if $c_s(E_{iw}) = 1$ then $E_i^{s+1} = E_i^s - E_{iw}$
• Repeat until all covered
• Do same for $E_j$
– In paper, steps for $E_i$ and $E_j$ interleaved
![Page 83: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/83.jpg)
Summary
• Presented graph-based framework for evaluating error detectability & locatability
• Framework deals with arbitrary errors
• Can be specialized to a simpler fault model: Single-Input Driven
• Choon-Sik Park’s thesis presents the Multiple-Input Driven model
– More realistic but complex
![Page 84: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/84.jpg)
Outline
• Matrix-based formalism of Nair et al
• Dependence graph-based formalism of Park et al
– Includes fault propagation models
• Framework for hierarchical fault tolerant systems by Nair et al
– Building fault tolerant systems out of fault tolerant components
![Page 85: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/85.jpg)
Building Larger Systems
• Now know how to analyze systems for detectability & locatability
• For large systems this can be very hard/expensive
• Large systems typically made up of smaller components
• Simplifies fault tolerance design
![Page 86: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/86.jpg)
Basic Idea
• Have component with known detectability (=t) & locatability (=l)
• Construct system S out of k components
• What is resulting fault tolerance?
![Page 87: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/87.jpg)
Basic Idea
• System fault tolerance no better than for individual component
• If >t data elements fail in same component, error not detected
• If >l elements fail in component, will not locate
• Detectability & locatability ratio tends to 0 as system size increases!
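One way to read the last bullet, assuming "ratio" means guaranteed detectable (or locatable) errors divided by the number of data elements, and assuming k components of size |B| are combined without any additional checks:

```latex
\lim_{k \to \infty} \frac{t}{k\,|B|} = 0,
\qquad
\lim_{k \to \infty} \frac{l}{k\,|B|} = 0 .
```

The guarantees t and l stay fixed while the system grows, which is what motivates adding new checks at each level.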
![Page 88: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/88.jpg)
Hierarchical Design
• To build fault tolerant systems must introduce checks with new components
• Will present hierarchical design scheme with specific detectability & locatability guarantees
• Assumptions:
– All (g,h) checks have same h
  • No restriction on g
– Every processor produces only one data element
  • Same true for blocks of processors
– Checks are fault tolerant
  • Claims that this doesn’t change problem
![Page 89: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/89.jpg)
Basic Component
• Start off with basic system:
• System has internal checks
• Fault detectability = t
• Fault locatability = l
[Figure: the basic component B]
![Page 90: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/90.jpg)
Basic Component
• Then replicate it k-fold
• Assumptions:
– Copies are independent
  • (i.e. do not affect each other’s data)
– Each system produces one data element
[Figure: k independent copies B1, B2, …, Bk of the basic component]
![Page 91: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/91.jpg)
Basic Component
• Then replicate it k-fold
• And add additional checks across all copies
• Process repeated d-1 times to get d-level hierarchical system
[Figure: copies B1, B2, …, Bk with additional checks c1, c2, …, cr across them]
![Page 92: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/92.jpg)
Detectability 1 ≤ k ≤ h
• Theorem 1:
– If $1 \le k \le h$ then hierarchical system can detect $|B|k^{d-1}$ errors
• Proof:
– Base case: d=2
– Suppose every element has error
– Each check must deal with $k \le h$ errors
– But they are (g,h) checks and will detect such errors
– Thus, system can detect |B|k errors
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
![Page 93: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/93.jpg)
Detectability 1 ≤ k ≤ h
• Theorem 1:
– If $1 \le k \le h$ then hierarchical system can detect $|B|k^{d-1}$ errors
• Proof:
– Inductive case: d levels, assuming the claim for d-1
– Components Bi each have $|B|k^{d-2}$ elements
– By argument above, system detects $(|B|k^{d-2})k = |B|k^{d-1}$ errors
  • Argument works because sub-systems at each level produce one data element
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
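The count behind Theorem 1 can be written as a one-line recurrence. Here $N_d$ is a symbol introduced only for this sketch, denoting the number of data elements in a d-level system:

```latex
N_1 = |B|, \qquad N_d = k \, N_{d-1}
\;\Longrightarrow\;
N_d = |B| \, k^{d-1} .
```

Since for k ≤ h the top-level checks still ring when every element is erroneous, the number of detectable errors equals the total element count $N_d$.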
![Page 94: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/94.jpg)
Detectability k>h
• Theorem 2:
– If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1}-1$ errors
• Proof:
– Base case: d=2
– Suppose (t+1)(h+1) errors, with h+1 copies of B having t+1 errors each
– Detectability of B = t, so internal checks may not notice the errors
– 2nd level checks will get h+1 errors each: may not notice either
– Thus, there is an error pattern of size (t+1)(h+1) that is not guaranteed to be detected
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
![Page 95: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/95.jpg)
Detectability k>h
• Theorem 2:
– If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1}-1$ errors
• Proof:
– Base case: d=2
– Suppose (t+1)(h+1)-1 errors
– By pigeonhole principle, some erroneous unit has ≤ t errors or some 2nd level check sees ≤ h errors
– Thus, some check at 1st or 2nd level will ring
– Thus, system detectability = (t+1)(h+1)-1
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
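The pigeonhole step can be checked exhaustively for small parameters. The sketch below assumes a simplified model that is not spelled out on the slide: a unit's internal checks ring whenever the unit holds between 1 and t errors, and a 2nd-level check rings whenever between 1 and h units are erroneous. The function name `always_detected` is hypothetical.

```python
from itertools import product


def always_detected(t, h, k, n):
    """True if every placement of n errors over k units makes some check ring.

    Assumed model: internal checks ring for 1..t errors in a unit;
    2nd-level checks ring when 1..h units are erroneous.
    """
    for counts in product(range(n + 1), repeat=k):
        if sum(counts) != n:
            continue  # keep only distributions with exactly n errors
        faulty = [c for c in counts if c > 0]
        unit_rings = any(c <= t for c in faulty)
        second_level_rings = 1 <= len(faulty) <= h
        if not (unit_rings or second_level_rings):
            return False  # an undetected pattern exists
    return True


t, h, k = 2, 2, 3                      # k > h, as Theorem 2 requires
bound = (t + 1) * (h + 1) - 1          # claimed detectability for d = 2
ok_up_to_bound = all(always_detected(t, h, k, n) for n in range(1, bound + 1))
one_more = always_detected(t, h, k, bound + 1)
```

For these parameters `ok_up_to_bound` is `True` and `one_more` is `False`: the pattern with t+1 errors in each of h+1 units escapes both levels, matching the counterexample used in the proof.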
![Page 96: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/96.jpg)
Detectability k>h
• Theorem 2:
– If k>h then hierarchical system can detect $(t+1)(h+1)^{d-1}-1$ errors
• Proof:
– Inductive case: d+1
– Components Bi detect $T_d$ errors
– By induction, $T_d = (t+1)(h+1)^{d-1}-1$
– By argument above, system detects $(T_d+1)(h+1)-1$ errors
– Thus, system detectability = $(t+1)(h+1)^{d}-1$
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
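The inductive step can be sanity-checked numerically by unrolling the recurrence $T_{d+1} = (T_d+1)(h+1)-1$ from $T_1 = t$ and comparing it with the closed form. This is a small sketch; the function names are made up for illustration.

```python
def detectability_closed_form(t, h, d):
    """Closed form from Theorem 2: (t+1)(h+1)^(d-1) - 1."""
    return (t + 1) * (h + 1) ** (d - 1) - 1


def detectability_recursive(t, h, d):
    """Unroll T_1 = t, T_{d+1} = (T_d + 1)(h + 1) - 1."""
    T = t
    for _ in range(d - 1):
        T = (T + 1) * (h + 1) - 1
    return T


# Example: t = 1, h = 2 gives T_1..T_4 = 1, 5, 17, 53 either way.
values = [detectability_recursive(1, 2, d) for d in range(1, 5)]
```

Detectability grows geometrically in the number of levels d, with base h+1.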
![Page 97: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/97.jpg)
Locatability
• Theorem 3:
– If k>1 then hierarchical system can locate $2^{d-1}(l+1)-1$ errors
• Proof:
– Base case: d=2
– Suppose fault pattern of 2(l+1) errors, with l+1 errors in each of two Bi’s
– Bi & Bj can’t locate the errors
– 2nd level checks may locate erroneous rows, not columns
– Thus, there is an unlocatable fault pattern of size 2(l+1)
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
![Page 98: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/98.jpg)
Locatability
• Theorem 3:
– If k>1 then hierarchical system can locate $2^{d-1}(l+1)-1$ errors
• Proof:
– Base case: d=2
– Suppose fault pattern of 2(l+1)-1 errors
– At most one Bi may have l+1 or more errors
  • If none do, we’re done
– Remaining ≤ l errors distributed among other Bj’s
[Figure: copies B1, B2, …, Bk with checks c1, c2, …, cr across them]
![Page 99: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/99.jpg)
Locatability
• Let Bi have l+r errors (r ≥ 1)
[Figure: units Bi, Bj, …, Bk with checks c1, c2, …, cr across them]
![Page 100: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/100.jpg)
Locatability
• Let Bi have l+r errors (r ≥ 1)
• Remaining Bj’s share remaining l-r+1 errors
• At least (l+r)-(l-r+1) = 2r-1 rows only have errors in Bi
– Exactly 2r-1 rows when all l-r+1 errors are in same Bj
[Figure: units Bi, Bj, …, Bk with checks c1, c2, …, cr across them]
![Page 101: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/101.jpg)
Finding Overwhelmed Unit
• First, find the Bi that have >l errors
• All but one sub-system detects and locates errors correctly
• Overwhelmed subsystem:
– Detects correctly
  • Locatability = l implies Detectability > 2*l
  • Citation of 1973 paper by Russell & Kime
– May make error location mistakes
![Page 102: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/102.jpg)
Finding Overwhelmed Unit
• In 2r-1 rows only Bi has error
– Thus, no other row will claim an error there
• 2nd-level checks will catch these errors
– Bi’s checks can’t lie about it
– Will definitely know these are errors
[Figure: grid of units Bi, Bj, …, Bk with row groups labelled l+1 and 2r-1; legend: known errors, unknown errors, no error]
![Page 103: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/103.jpg)
Finding Overwhelmed Unit
• Number of errors in Bi = l+r
• Number of known errors ≥ 2r-1
• Number of unknown errors in Bi ≤ (l+r)-(2r-1) = l-r+1
• Since r ≥ 1, l-r+1 ≤ l
• Bi’s checks can identify l errors
– Error patterns of size ≤ l produce unique check alert patterns
– This data enough to identify remaining unknown errors
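The counting argument is easy to tabulate. The sketch below assumes the worst case used in the proof, a fault pattern of exactly 2(l+1)-1 errors with l+r of them in the overwhelmed unit Bi; the function name is hypothetical.

```python
def unknown_errors_in_overwhelmed_unit(l, r):
    """Worst-case number of errors in Bi not exposed by 2nd-level checks."""
    total = 2 * (l + 1) - 1                # size of the fault pattern
    errors_in_bi = l + r                   # Bi is overwhelmed: more than l errors
    errors_elsewhere = total - errors_in_bi            # equals l - r + 1
    known = errors_in_bi - errors_elsewhere            # at least 2r - 1 rows
    return errors_in_bi - known                        # equals l - r + 1


# r ranges from 1 (barely overwhelmed) to l + 1 (all errors in Bi).
table = {(l, r): unknown_errors_in_overwhelmed_unit(l, r)
         for l in range(1, 5) for r in range(1, l + 2)}
```

Every entry of `table` lies between 0 and l, so the remaining unknown errors are within what Bi's own checks can locate.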
![Page 104: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/104.jpg)
Locatability
• Theorem 3:
– If k>1 then hierarchical system can locate $2^{d-1}(l+1)-1$ errors
• Proof:
– Base case: d=2
– Can locate error patterns of size 2(l+1)-1
– Inductive case: d+1
– Components Bi can locate $2^{d-1}(l+1)-1$ errors
– By argument above, system locates $2[(2^{d-1}(l+1)-1)+1]-1 = 2^{d}(l+1)-1$ errors
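Written out step by step, with $L_d$ denoting the locatability of a d-level system, the inductive step is:

```latex
\begin{aligned}
L_{d+1} &= 2\,(L_d + 1) - 1 \\
        &= 2\,\bigl[(2^{d-1}(l+1) - 1) + 1\bigr] - 1 \\
        &= 2 \cdot 2^{d-1}(l+1) - 1 \\
        &= 2^{d}(l+1) - 1 .
\end{aligned}
```

The base case $L_1 = l$ agrees with the formula, since $2^{0}(l+1) - 1 = l$.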
![Page 105: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/105.jpg)
Summary
• Presented systematic way to build hierarchical systems with good fault-detection properties
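• For d-level system composed of identical independent components
– Component detectability=t, locatability=l
$$T_d = \begin{cases} |B|\,k^{d-1} & \text{for } k \le h \\ (t+1)(h+1)^{d-1}-1 & \text{for } k > h \end{cases}$$
$$L_d = 2^{d-1}(l+1)-1 \quad \text{for } k > 1$$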
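The summary formulas translate directly into a small calculator. This is a sketch only: the function name and argument order are made up, and returning `None` for k = 1 is a choice made here because the slides give no locatability formula for that case.

```python
def hierarchical_bounds(block_size, k, h, t, l, d):
    """Return (T_d, L_d) for a d-level system built from k-fold replication.

    block_size is |B|; t and l are the component's detectability and
    locatability; h is the common h of the (g,h) checks.
    """
    if k <= h:
        detect = block_size * k ** (d - 1)           # Theorem 1
    else:
        detect = (t + 1) * (h + 1) ** (d - 1) - 1    # Theorem 2
    # Theorem 3 is stated only for k > 1; k == 1 is left undefined here.
    locate = 2 ** (d - 1) * (l + 1) - 1 if k > 1 else None
    return detect, locate


small_k = hierarchical_bounds(block_size=4, k=2, h=3, t=1, l=1, d=3)
large_k = hierarchical_bounds(block_size=4, k=4, h=1, t=1, l=1, d=2)
```

Here `small_k` is `(16, 7)` and `large_k` is `(3, 3)`, which shows how sharply detectability drops once the replication factor k exceeds h.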
![Page 106: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/106.jpg)
Conclusion
• Formalisms for analyzing fault detectability & locatability
– Matrix-based formalism of Nair et al
– Dependence graph-based formalism of Park et al
  • Includes fault propagation models
• Framework for hierarchical fault tolerant systems by Nair et al
– Building fault tolerant systems out of fault tolerant components
![Page 107: Algorithm-Based Fault Tolerance Theory of Check Placement](https://reader036.fdocuments.net/reader036/viewer/2022062409/568150bb550346895dbed7a6/html5/thumbnails/107.jpg)
Conclusion
• These schemes have complex rules for acceptable check placements
• Requires detailed analysis of system to place them manually
• More detailed analysis if checks are hand-designed
– Likely since few known automatic techniques
• Overall, approach can support automatic solutions but currently very manual