Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale...
Transcript of Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale...
![Page 1: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/1.jpg)
1/1
Which verification for soft error detection?
Leonardo Bautista-Gomez1, Anne Benoit2, Aurelien Cavelan2,Saurabh K. Raina3, Yves Robert2,4 and Hongyang Sun2
1. Argonne National Laboratory, USA2. ENS Lyon & INRIA, France
3. Jaypee Institute of Information Technology, India4. University of Tennessee Knoxville, USA
Dagstuhl Seminar #15281: Algorithms and Scheduling Techniques to ManageResilience and Power Consumption in Distributed Systems
July 6, 2015, Schloss Dagstuhl, Germany
![Page 2: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/2.jpg)
2/1
Computing at Exascale
Exascale platform:
105 or 106 nodes, each equipped with 102 or 103 cores
Shorter Mean Time Between Failures (MTBF) µ
Theorem: µp = µindp for arbitrary distributions
MTBF (individual node) 1 year 10 years 120 yearsMTBF (platform of 106 nodes) 30 sec 5 mn 1 h
Need more reliable components!!Need more resilient techniques!!!
![Page 3: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/3.jpg)
2/1
Computing at Exascale
Exascale platform:
105 or 106 nodes, each equipped with 102 or 103 cores
Shorter Mean Time Between Failures (MTBF) µ
Theorem: µp = µindp for arbitrary distributions
MTBF (individual node) 1 year 10 years 120 yearsMTBF (platform of 106 nodes) 30 sec 5 mn 1 h
Need more reliable components!!Need more resilient techniques!!!
![Page 4: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/4.jpg)
3/1
General-purpose approach
Periodic checkpoint, rollback and recovery:
TimeW W
Error
Corrupt Detect
C C C
Fail-stop errors: instantaneous error detection, e.g., resource crash
Silent errors (aka silent data corruptions): e.g., soft faults in L1cache, ALU, double bit flip
Silent error is detected only when corrupted data is activated,which could happen long after its occurrence
Detection latency is problematic ⇒ risk of saving corrupted checkpoint!
![Page 5: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/5.jpg)
3/1
General-purpose approach
Periodic checkpoint, rollback and recovery:
TimeW W
Error
Corrupt Detect
C C C
Fail-stop errors: instantaneous error detection, e.g., resource crash
Silent errors (aka silent data corruptions): e.g., soft faults in L1cache, ALU, double bit flip
Silent error is detected only when corrupted data is activated,which could happen long after its occurrence
Detection latency is problematic ⇒ risk of saving corrupted checkpoint!
![Page 6: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/6.jpg)
3/1
General-purpose approach
Periodic checkpoint, rollback and recovery:
TimeW W
Error Corrupt DetectC C C
Fail-stop errors: instantaneous error detection, e.g., resource crash
Silent errors (aka silent data corruptions): e.g., soft faults in L1cache, ALU, double bit flip
Silent error is detected only when corrupted data is activated,which could happen long after its occurrence
Detection latency is problematic ⇒ risk of saving corrupted checkpoint!
![Page 7: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/7.jpg)
4/1
Coping with silent errorsCouple checkpointing with verification:
TimeW W
Error Detect
V ∗ C V ∗ C V ∗ C
Before each checkpoint, run some verification mechanism(checksum, ECC, coherence tests, TMR, etc)
Silent error is detected by verification ⇒ checkpoint always valid ,
Optimal period (Young/Daly):
Fail-stop (classical) Silent errorsPattern T = W + C T = W + V ∗ + C
Optimal W ∗ =√
2Cµ W ∗ =√
(C + V ∗)µ
![Page 8: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/8.jpg)
4/1
Coping with silent errorsCouple checkpointing with verification:
TimeW W
Error Detect
V ∗ C V ∗ C V ∗ C
Before each checkpoint, run some verification mechanism(checksum, ECC, coherence tests, TMR, etc)
Silent error is detected by verification ⇒ checkpoint always valid ,Optimal period (Young/Daly):
Fail-stop (classical) Silent errorsPattern T = W + C T = W + V ∗ + C
Optimal W ∗ =√
2Cµ W ∗ =√
(C + V ∗)µ
![Page 9: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/9.jpg)
5/1
One step further
Perform several verifications before each checkpoint:
Time
Error Detect
V ∗ C V ∗ V ∗ V ∗ C V ∗ V ∗ V ∗ C
Pro: silent error is detected earlier in the pattern ,Con: additional overhead in error-free executions /
How many intermediate verifications to use and the positions?
![Page 10: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/10.jpg)
5/1
One step further
Perform several verifications before each checkpoint:
Time
Error Detect
V ∗ C V ∗ V ∗ V ∗ C V ∗ V ∗ V ∗ C
Pro: silent error is detected earlier in the pattern ,Con: additional overhead in error-free executions /
How many intermediate verifications to use and the positions?
![Page 11: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/11.jpg)
6/1
Partial verification
Guaranteed/perfect verifications (V ∗) can be very expensive!Partial verifications (V ) are available for many HPC applications!
Lower accuracy: recall r = #detected errors#total errors < 1 /
Much lower cost, i.e., V < V ∗ ,
Time
Error Detect? Detect!
V ∗ C V1 V2 V ∗ C V1 V2 V ∗ C
Which verification(s) to use? How many? Positions?
![Page 12: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/12.jpg)
6/1
Partial verification
Guaranteed/perfect verifications (V ∗) can be very expensive!Partial verifications (V ) are available for many HPC applications!
Lower accuracy: recall r = #detected errors#total errors < 1 /
Much lower cost, i.e., V < V ∗ ,
Time
Error Detect? Detect!
V ∗ C V1 V2 V ∗ C V1 V2 V ∗ C
Which verification(s) to use? How many? Positions?
![Page 13: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/13.jpg)
6/1
Partial verification
Guaranteed/perfect verifications (V ∗) can be very expensive!Partial verifications (V ) are available for many HPC applications!
Lower accuracy: recall r = #detected errors#total errors < 1 /
Much lower cost, i.e., V < V ∗ ,
Time
Error Detect? Detect!
V ∗ C V1 V2 V ∗ C V1 V2 V ∗ C
Which verification(s) to use? How many? Positions?
![Page 14: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/14.jpg)
7/1
Outline
![Page 15: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/15.jpg)
8/1
Model and objective
Silent errors
Poisson process: arrival rate λ = 1/µ, where µ is platform MTBF
Strike only computations; checkpointing, recovery, and verificationsare protected
Resilience parameters
Cost of checkpointing C , cost of recovery R
k types of partial detectors and a perfect detector(D(1),D(2), . . . ,D(k),D∗)
D(i): cost V (i) and recall r (i) < 1D∗: cost V ∗ and recall r ∗ = 1
Design an optimal periodic computing pattern that minimizesexecution time (or makespan) of the application
![Page 16: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/16.jpg)
9/1
Pattern
Formally, a pattern Pattern(W , n,α,D) is defined by
W : pattern work length (or period)
n: number of work segments, of lengths wi (with∑n
i=1 wi = W )
α = [α1, α2, . . . , αn]: work fraction of each segment (αi = wi/Wand
∑ni=1 αi = 1)
D = [D1,D2, . . . ,Dn−1,D∗]: detectors used at the end of eachsegment (Di = D(j) for some type j)
Timew1 w2 w3 wn
· · ·· · ·
D∗ C D1 D2 D3 Dn−1 D∗ C
- Last detector is perfect to avoid saving corrupted checkpoints- The same detector type D(j) could be used at the end of severalsegments
![Page 17: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/17.jpg)
9/1
Pattern
Formally, a pattern Pattern(W , n,α,D) is defined by
W : pattern work length (or period)
n: number of work segments, of lengths wi (with∑n
i=1 wi = W )
α = [α1, α2, . . . , αn]: work fraction of each segment (αi = wi/Wand
∑ni=1 αi = 1)
D = [D1,D2, . . . ,Dn−1,D∗]: detectors used at the end of eachsegment (Di = D(j) for some type j)
Timew1 w2 w3 wn
· · ·· · ·
D∗ C D1 D2 D3 Dn−1 D∗ C
- Last detector is perfect to avoid saving corrupted checkpoints- The same detector type D(j) could be used at the end of severalsegments
![Page 18: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/18.jpg)
10/1
Outline
![Page 19: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/19.jpg)
11/1
Summary of results
In a nutshell:
Given a pattern Pattern(W , n,α,D),We show how to compute the expected execution timeWe are able to characterize its optimal lengthWe can compute the optimal positions of the partialverifications
However, we prove that finding the optimal pattern is NP-hard
We design an FPTAS (Fully Polynomial-Time ApproximationScheme) that gives a makespan within (1 + ε) times the optimalwith running time polynomial in the input size and 1/ε
We show a simple greedy algorithm that works well in practice
![Page 20: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/20.jpg)
11/1
Summary of results
In a nutshell:
Given a pattern Pattern(W , n,α,D),We show how to compute the expected execution timeWe are able to characterize its optimal lengthWe can compute the optimal positions of the partialverifications
However, we prove that finding the optimal pattern is NP-hard
We design an FPTAS (Fully Polynomial-Time ApproximationScheme) that gives a makespan within (1 + ε) times the optimalwith running time polynomial in the input size and 1/ε
We show a simple greedy algorithm that works well in practice
![Page 21: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/21.jpg)
12/1
Summary of results
Algorithm to determine a pattern Pattern(W , n,α,D):
Use FPTAS or Greedy (or even brute force for small instances) tofind (optimal) number n of segments and set D of used detectors
Arrange the n − 1 partial detectors in any order
Compute W ∗ =√
offλfre
and α∗i = 1
Un· 1−gi−1gi
(1+gi−1)(1+gi ) for 1 ≤ i ≤ n,
where off =n−1∑i=1
Vi + V ∗ + C and fre =12
(1 +
1Un
)with gi = 1 − ri and Un = 1 +
n−1∑i=1
1 − gi1 + gi
![Page 22: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/22.jpg)
13/1
Expected execution time of a pattern
Proposition
The expected time to execute a pattern Pattern(W , n,α,D) is
E(W ) = W +n−1∑i=1
Vi + V ∗ + C + λW (R + W αT Aα + dT α) + o(λ),
where A is a symmetric matrix defined by Aij = 12
(1 +
∏j−1k=i gk
)for
i ≤ j and d is a vector defined by di =∑n
j=i
(∏j−1k=i gk
)Vi for 1 ≤ i ≤ n.
First-order approximation (as in Young/Daly’s classic formula)Matrix A is essential to analysis. For instance, when n = 4 we have:
A = 12
2 1 + g1 1 + g1g2 1 + g1g2g3
1 + g1 2 1 + g2 1 + g2g31 + g1g2 1 + g2 2 1 + g3
1 + g1g2g3 1 + g2g3 1 + g3 2
![Page 23: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/23.jpg)
14/1
Minimizing makespanFor an application with total work Wbase, the makespan is
Wfinal ≈E(W )
W ×Wbase
= Wbase + H(W )×Wbase,
where H(W ) = E(W )W − 1 is the execution overhead
For instance, if Wbase = 100,Wfinal = 120, we have H(W ) = 20%
Minimizing makespan is equivalent to minimizing overhead!
H(W ) = offW + λfreW + λ(R + dT α) + o(λ)
fault-free overhead: off =n−1∑i=1
Vi + V ∗ + C
re-execution fraction: fre = αT Aα
![Page 24: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/24.jpg)
14/1
Minimizing makespanFor an application with total work Wbase, the makespan is
Wfinal ≈E(W )
W ×Wbase
= Wbase + H(W )×Wbase,
where H(W ) = E(W )W − 1 is the execution overhead
For instance, if Wbase = 100,Wfinal = 120, we have H(W ) = 20%
Minimizing makespan is equivalent to minimizing overhead!
H(W ) = offW + λfreW + λ(R + dT α) + o(λ)
fault-free overhead: off =n−1∑i=1
Vi + V ∗ + C
re-execution fraction: fre = αT Aα
![Page 25: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/25.jpg)
15/1
Optimal pattern length to minimize overhead
Proposition
The execution overhead of a pattern Pattern(W , n,α,D) is minimizedwhen its length is
W ∗ =√
offλfre
.
The optimal overhead is
H(W ∗) = 2√λofffre + o(
√λ).
When the platform MTBF µ = 1/λ is large, o(√λ) is negligible
Minimizing overhead is reduced to minimizing the product offfre!Tradeoff between fault-free overhead and fault-inducedre-execution
![Page 26: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/26.jpg)
15/1
Optimal pattern length to minimize overhead
Proposition
The execution overhead of a pattern Pattern(W , n,α,D) is minimizedwhen its length is
W ∗ =√
offλfre
.
The optimal overhead is
H(W ∗) = 2√λofffre + o(
√λ).
When the platform MTBF µ = 1/λ is large, o(√λ) is negligible
Minimizing overhead is reduced to minimizing the product offfre!Tradeoff between fault-free overhead and fault-inducedre-execution
![Page 27: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/27.jpg)
16/1
Optimal positions of verifications to minimize fre
Theorem
The re-execution fraction fre of a pattern Pattern(W , n,α,D) isminimized when α = α∗, where
α∗k = 1
Un× 1− gk−1gk
(1 + gk−1)(1 + gk ) for 1 ≤ k ≤ n,
where g0 = gn = 0 and Un = 1 +∑n−1
i=11−gi1+gi
.
In this case, the optimal value of fre is
f ∗re = 1
2
(1 + 1
Un
).
Most technically involved result (lengthy proof of 3 pages!)Given a set of partial verifications, the minimal value of fre does notdepend upon their ordering within the pattern
![Page 28: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/28.jpg)
16/1
Optimal positions of verifications to minimize fre
Theorem
The re-execution fraction fre of a pattern Pattern(W , n,α,D) isminimized when α = α∗, where
α∗k = 1
Un× 1− gk−1gk
(1 + gk−1)(1 + gk ) for 1 ≤ k ≤ n,
where g0 = gn = 0 and Un = 1 +∑n−1
i=11−gi1+gi
.
In this case, the optimal value of fre is
f ∗re = 1
2
(1 + 1
Un
).
Most technically involved result (lengthy proof of 3 pages!)Given a set of partial verifications, the minimal value of fre does notdepend upon their ordering within the pattern
![Page 29: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/29.jpg)
17/1
Two special cases
When all verifications use the same partial detector (r), we get
α∗k =
{1
(n−2)r+2 for k = 1 and k = nr
(n−2)r+2 for 2 ≤ k ≤ n − 1
Time1 r r 1
· · ·· · ·
D∗ C D D D D D∗ C
When all verifications use the perfect detector, we get equal-lengthsegments, i.e., α∗
k = 1n for all 1 ≤ k ≤ n
Time1 1 1 1
· · ·· · ·
D∗ C D∗ D∗ D∗ D∗ D∗ C
![Page 30: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/30.jpg)
18/1
Optimal number and set of detectors
It remains to determine optimal n and D of a patternPattern(W , n,α,D).
Equivalent to the following optimization problem:
Minimize freoff = V ∗ + C2
(1 + 1
1 +∑k
j=1 mj a(j)
)(1 +
k∑j=1
mj b(j)
)subject to mj ∈ N0 ∀j = 1, 2, . . . , k
accuracy: a(j) = 1− g (j)
1 + g (j) relative cost: b(j) = V (j)
V ∗ + C
accuracy-to-cost ratio: φ(j) = a(j)
b(j)
NP-hard even when all detectors share the same accuracy-to-cost ratio(reduction from unbounded subset sum), but admits an FPTAS.
![Page 31: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/31.jpg)
18/1
Optimal number and set of detectors
It remains to determine optimal n and D of a patternPattern(W , n,α,D).
Equivalent to the following optimization problem:
Minimize freoff = V ∗ + C2
(1 + 1
1 +∑k
j=1 mj a(j)
)(1 +
k∑j=1
mj b(j)
)subject to mj ∈ N0 ∀j = 1, 2, . . . , k
accuracy: a(j) = 1− g (j)
1 + g (j) relative cost: b(j) = V (j)
V ∗ + C
accuracy-to-cost ratio: φ(j) = a(j)
b(j)
NP-hard even when all detectors share the same accuracy-to-cost ratio(reduction from unbounded subset sum), but admits an FPTAS.
![Page 32: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/32.jpg)
18/1
Optimal number and set of detectors
It remains to determine optimal n and D of a patternPattern(W , n,α,D).
Equivalent to the following optimization problem:
Minimize freoff = V ∗ + C2
(1 + 1
1 +∑k
j=1 mj a(j)
)(1 +
k∑j=1
mj b(j)
)subject to mj ∈ N0 ∀j = 1, 2, . . . , k
accuracy: a(j) = 1− g (j)
1 + g (j) relative cost: b(j) = V (j)
V ∗ + C
accuracy-to-cost ratio: φ(j) = a(j)
b(j)
NP-hard even when all detectors share the same accuracy-to-cost ratio(reduction from unbounded subset sum), but admits an FPTAS.
![Page 33: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/33.jpg)
19/1
Greedy algorithm
Practically, a greedy algorithm:
Employs only the detector with highest accuracy-to-cost ratioφmax = a
b
Optimal number of detectors: m∗ = −1a +
√1a
(1b −
1a
)
Optimal overhead: H∗ =
√2(C + V ∗)
µ
(√1
φmax +
√1− 1
φmax
)
Rounds up the optimal rational solution dm∗e
The greedy algorithm has an approximation ratio√
3/2 < 1.23
![Page 34: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/34.jpg)
19/1
Greedy algorithm
Practically, a greedy algorithm:
Employs only the detector with highest accuracy-to-cost ratioφmax = a
b
Optimal number of detectors: m∗ = −1a +
√1a
(1b −
1a
)
Optimal overhead: H∗ =
√2(C + V ∗)
µ
(√1
φmax +
√1− 1
φmax
)
Rounds up the optimal rational solution dm∗e
The greedy algorithm has an approximation ratio√
3/2 < 1.23
![Page 35: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/35.jpg)
20/1
Outline
![Page 36: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/36.jpg)
21/1
Simulation configuration
Exascale platform:
105 computing nodes with individual MTBF of 100 years⇒ platform MTBF µ ≈ 8.7 hours
Checkpoint sizes of 300GB with throughput of 0.5GB/s⇒ C = 600s
Realistic detectors (designed at ANL):
cost recall ACRTime series prediction D(1) V (1) = 3s r (1) = 0.5 φ(1) = 133Spatial interpolation D(2) V (2) = 30s r (2) = 0.95 φ(2) = 36Combination of the two D(3) V (3) = 6s r (3) = 0.8 φ(3) = 133Perfect detector D∗ V ∗ = 600s r ∗ = 1 φ∗ = 2
![Page 37: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/37.jpg)
21/1
Simulation configuration
Exascale platform:
105 computing nodes with individual MTBF of 100 years⇒ platform MTBF µ ≈ 8.7 hours
Checkpoint sizes of 300GB with throughput of 0.5GB/s⇒ C = 600s
Realistic detectors (designed at ANL):
cost recall ACRTime series prediction D(1) V (1) = 3s r (1) = 0.5 φ(1) = 133Spatial interpolation D(2) V (2) = 30s r (2) = 0.95 φ(2) = 36Combination of the two D(3) V (3) = 6s r (3) = 0.8 φ(3) = 133Perfect detector D∗ V ∗ = 600s r ∗ = 1 φ∗ = 2
![Page 38: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/38.jpg)
22/1
Evaluation results
Using individual detector (greedy algorithm)
Best partial detectors offer ∼9% improvement in overhead.Saving ∼55 minutes for every 10 hours of computation!
![Page 39: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/39.jpg)
23/1
Evaluation results
Mixing two detectors: depending on application or dataset, a detector’srecall may vary, but its cost stays the same
Realistic dataagain!r (1) = [0.5, 0.9]r (2) = [0.75, 0.95]r (3) = [0.8, 0.99]
φ(1) = [133, 327]φ(2) = [24, 36]φ(3) = [133, 196]
m overhead H diff. from opt.
Scenario 1: r (1) = 0.51, r (3) = 0.82, φ(1) ≈ 137, φ(3) ≈ 139Optimal solution (1, 15) 29.828% 0%Greedy with D(3) (0, 16) 29.829% 0.001%
Scenario 2: r (1) = 0.58, r (3) = 0.9, φ(1) ≈ 163, φ(3) ≈ 164Optimal solution (1, 14) 29.659% 0%Greedy with D(3) (0, 15) 29.661% 0.002%
Scenario 3: r (1) = 0.64, r (3) = 0.97, φ(1) ≈ 188, φ(3) ≈ 188Optimal solution (1, 13) 29.523% 0%Greedy with D(1) (27, 0) 29.524% 0.001%Greedy with D(3) (0, 14) 29.525% 0.002%
The greedy algorithm works very well in this practical scenario!
![Page 40: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/40.jpg)
24/1
Outline
![Page 41: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/41.jpg)
25/1
ConclusionA first comprehensive analysis of computing patterns with partialverifications to detect silent errors
Theoretically: assess the complexity of the problem and proposeefficient approximation schemes
Practically: present a greedy algorithm and demonstrate its goodperformance with realistic detectors
Future directions
Partial detectors with false positives/alarms
precision p = #true errors#detected errors < 1
Errors in checkpointing, recovery, and verifications
Coexistence of fail-stop and silent errors
Research report available at https://hal.inria.fr/hal-01164445v1
![Page 42: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/42.jpg)
25/1
ConclusionA first comprehensive analysis of computing patterns with partialverifications to detect silent errors
Theoretically: assess the complexity of the problem and proposeefficient approximation schemes
Practically: present a greedy algorithm and demonstrate its goodperformance with realistic detectors
Future directions
Partial detectors with false positives/alarms
precision p = #true errors#detected errors < 1
Errors in checkpointing, recovery, and verifications
Coexistence of fail-stop and silent errors
Research report available at https://hal.inria.fr/hal-01164445v1
![Page 43: Which verification for soft error detection?abenoit/slides/dagstuhl15.pdf2/1 Computing at Exascale Exascale platform: 105 or 106 nodes, each equipped with 102 or 103 cores Shorter](https://reader031.fdocuments.net/reader031/viewer/2022040401/5e779accbabc0a7e38427326/html5/thumbnails/43.jpg)
25/1
ConclusionA first comprehensive analysis of computing patterns with partialverifications to detect silent errors
Theoretically: assess the complexity of the problem and proposeefficient approximation schemes
Practically: present a greedy algorithm and demonstrate its goodperformance with realistic detectors
Future directions
Partial detectors with false positives/alarms
precision p = #true errors#detected errors < 1
Errors in checkpointing, recovery, and verifications
Coexistence of fail-stop and silent errors
Research report available at https://hal.inria.fr/hal-01164445v1