Fault Tolerance in Embedded Systems
Transcript of Fault Tolerance in Embedded Systems
![Page 2: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/2.jpg)
Fault Tolerance
• This presentation is based upon [1]
• Focus is on the basics as applied to embedded systems with processors
• This presentation does not rely on Wikipedia.
• See Byzantine fault tolerance on wiki
![Page 3: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/3.jpg)
Overview
1. Trends Problems
2. Fault Tolerance Definitions
3. Fault Hiding
4. Fault Avoidance
5. Error Models
6. # Simultaneous Errors
7. Fault Tolerance Metrics
8. Error Detection
9. Error Recovery
10. Fault Diagnosis
11. Self-Repair
![Page 4: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/4.jpg)
Trends Problems
• Fault Tolerance
• Goal = safety + liveness
• Safe: keep faults from hurting the user, even in failure
• Live: performs the desired task
• Better to fail than to do harm
Cosmic rays and alpha particles
![Page 5: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/5.jpg)
Trends Problems
• More devices/processor means more units can fail
  – Think CISC vs. RISC
• More complex designs mean more failure cases exist
  – Think AVX vs. MMX
• Cache faults and, more generally, memory faults
  – Recharging DRAM is “easier” than reloading a destroyed cache line
![Page 6: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/6.jpg)
Fault Tolerance Definitions
• Fault
  – Physical faults
  – Software faults
• May manifest as an error
• A masked fault does not show up as an error
• Errors may also be masked
• Otherwise the error results in a failure
• Logical mask – the error bit is ANDed with 0
• Architectural mask – the error hits the destination register field of a NOP
• Application mask – silent fault like writing garbage to an unused address … produces no failure
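Logical masking can be illustrated with a small sketch. The helper `flip_bit` and the sample values are made up for illustration: the same kind of single-bit fault is masked when the flipped bit is ANDed with 0 and becomes an error when it is ANDed with 1.

```python
def flip_bit(value: int, bit: int) -> int:
    """Model a single-bit fault by inverting one bit of value."""
    return value ^ (1 << bit)


data = 0b1011_0110
mask = 0b0000_1111          # upper four bits are ANDed with 0

golden = data & mask                      # fault-free result
masked_case = flip_bit(data, 6) & mask    # faulty bit meets a 0: logically masked
exposed_case = flip_bit(data, 2) & mask   # faulty bit meets a 1: shows up as an error

print(masked_case == golden)   # True
print(exposed_case == golden)  # False
```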
![Page 7: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/7.jpg)
Fault Hiding
• Some faults are already recovered automatically: branch misprediction recovery also covers a faulty branch prediction
• Dangerous cases are the faults that are NOT masked
• Goal: mask all faults
  – E.g. HDD faults are common but hidden
• Transient fault – signal glitch
• Permanent fault – wire burns
• Intermittent fault – cold soldered wire
• Fault tolerance scheme – design a system for masking the expected fault type (transient/permanent/intermittent)
![Page 8: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/8.jpg)
Fault Avoidance
• Fault avoidance is just as good as fault tolerance
• Error detection and correction is the alternative
• Permanent faults
  – Physical wear-out
  – Fabrication defects
  – Design bugs
![Page 9: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/9.jpg)
Error Models
• We only care about errors, since masked faults are innocuous
• Error models
  – For improving fault tolerance
  – E.g. stuck-at-0/1 model tells us that there is a potential error
  – Many many stuck-at-0 errors can mean that there is NO PROBLEM
  – Reduces the need to evaluate all sources of error. Design space size↓↓
• 3 main error model parameters
• Type of error – bridging/coupling error (e.g. short, cross-talk), stuck-at error, fail-stop error, delay error
• Error duration – transient, intermittent, permanent
• # simultaneous errors – errors are rare, how many wars can you fight at once?
![Page 10: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/10.jpg)
# Simultaneous Errors
• Maybe 1 error hides another error
• E.g. a 2-bit flip gets past a parity checker
• Reasons for resolving:
  – Mission critical
  – High error rate
  – Latent errors (undetected and lingering) may overlap with other errors. Think about an incorrectly stored word: the error occurs upon NEXT read of the word
• Better to detect the first error AND to have double error correction since the error rate trends are against us.
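The parity example above can be sketched as follows. This is a minimal illustration with hypothetical helper names (`parity_bit`, `store`, `passes_check`, `flip`), assuming a single even-parity bit per stored word: one flipped bit is caught, while a second simultaneous flip hides the first.

```python
def parity_bit(word: int) -> int:
    """Even parity: 1 if the word has an odd number of 1 bits."""
    return bin(word).count("1") % 2


def store(word: int) -> tuple[int, int]:
    """Store a word together with its parity bit."""
    return word, parity_bit(word)


def passes_check(word: int, stored_parity: int) -> bool:
    """Re-compute parity on read and compare with the stored bit."""
    return parity_bit(word) == stored_parity


def flip(word: int, *bits: int) -> int:
    """Inject bit flips at the given bit positions."""
    for bit in bits:
        word ^= 1 << bit
    return word


word, parity = store(0b1011_0010)
print(passes_check(flip(word, 3), parity))     # False: single flip detected
print(passes_check(flip(word, 3, 5), parity))  # True: double flip goes unnoticed
```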
![Page 11: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/11.jpg)
Fault Tolerance Metrics
• Availability
  – 99.999% = five nines of availability
• Reliability
  – P(no failure up to time t)
  – Most errors are not failures
• Mean != probability
• Variance (2 and 20 vs. 11 and 12)
• MTTF – Mean Time To Failure
• MTTR – Mean Time To Repair
• MTBF = MTTF + MTTR
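A minimal sketch of how these metrics relate, using the standard steady-state definition availability = MTTF / (MTTF + MTTR). The function names are illustrative, and the `reliability` helper additionally assumes a constant failure rate (exponential lifetime), which the slides do not state.

```python
import math


def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60


def reliability(t_hours: float, mttf_hours: float) -> float:
    """P(no failure up to time t), ASSUMING a constant failure rate."""
    return math.exp(-t_hours / mttf_hours)


five_nines = availability(mttf_hours=99_999, mttr_hours=1)
print(round(downtime_minutes_per_year(five_nines), 3))  # 5.256
```

Under the constant-rate assumption, only about 37% of units survive to t = MTTF, which is one way to read "Mean != probability".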
![Page 12: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/12.jpg)
Fault Tolerance Metrics
• Failures in Time (FIT)
  – Rate
  – # failures / 1 billion hours
  – Additive
  – ∝ 1/MTTF
  – Arbitrary
  – Raw rate includes masked failures
  – Effective rate excludes masked failures
• Effective FIT = FIT*AVF
  – Helps locate transient error vulnerability
  – Shown to be a good lower bound on reliability
• Architectural Vulnerability Factor (AVF)
  – Architecturally Correct Execution = ACE state
  – Otherwise = un-ACE state
  – E.g. PC state = ACE; branch pred = un-ACE
  – Fraction of time in ACE state
• Component AVF = avg # ACE bits per cycle / # state bits
• If many ACE bits reside in a structure for a long time, that structure is highly vulnerable. Large AVF
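The FIT and AVF formulas above can be sketched as plain arithmetic. All function names and numbers here are illustrative assumptions, not measurements; the MTTF conversion takes "proportional to 1/MTTF" literally as MTTF = 10^9 hours / FIT.

```python
def component_avf(ace_bits_per_cycle: list[int], state_bits: int) -> float:
    """AVF = average number of ACE bits per cycle / total state bits."""
    average_ace = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace / state_bits


def effective_fit(raw_fit: float, avf: float) -> float:
    """Derate the raw FIT by the fraction of faults that matter."""
    return raw_fit * avf


def mttf_hours(fit: float) -> float:
    """FIT counts failures per 1e9 device-hours."""
    return 1e9 / fit


# Hypothetical 64-bit structure observed over four cycles.
avf = component_avf([16, 32, 48, 32], state_bits=64)   # 0.5
fit = effective_fit(raw_fit=100.0, avf=avf)            # 50.0
system_fit = fit + effective_fit(200.0, 0.25)          # FIT is additive: 100.0
```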
![Page 13: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/13.jpg)
Error Detection
• Helps to provide safety
• Without redundancy we cannot detect errors
• What kind of redundancy do we need?
• Redundancy
  – Physical (majority gate = TMR, dual modular redundancy = DMR, NMR where N is odd > 3)
  – Temporal (run twice & compare results)
  – Information (extra bits like parity)
• Boeing 777 uses “triple-triple” modular redundancy, 2 levels of triple voting, where each vote is from a different architecture
DMR
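A minimal sketch of the two physical-redundancy checks, with hypothetical function names: DMR can only say that two replicas disagree, while a TMR majority gate outvotes one faulty replica bit by bit.

```python
def dmr_agree(a: int, b: int) -> bool:
    """DMR: detect a mismatch, but with no way to tell which copy is wrong."""
    return a == b


def tmr_vote(a: int, b: int, c: int) -> int:
    """TMR: bitwise majority gate over three replica outputs."""
    return (a & b) | (a & c) | (b & c)


good = 0b1010
faulty = good ^ 0b0100               # one replica suffers a bit flip

print(dmr_agree(good, faulty))       # False: error detected
print(tmr_vote(good, faulty, good) == good)  # True: error corrected
```

The voter is only as good as the single-fault assumption: if two replicas flip the same bit, the wrong value wins the vote.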
![Page 14: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/14.jpg)
Error Detection
• Physical Redundancy
• Heterogeneous hardware units can provide physical redundancy
  – E.g. Watchdog timer
  – E.g. Boeing 777 different architectures running same program and then voting on results.
  – Design Diversity
• Unit replication
  – Gate level
  – Register level
  – Core level
• Wastes lots of area & power
• NMR impractical for PCs
• False error reporting becomes more likely
• Using different hardware for the voting units guards against a design bug shared by every replica
![Page 15: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/15.jpg)
Error Detection
Temporal Redundancy
• Twice the active power but not twice the area
• Can find transient but not permanent errors
• Smart pipelining can have the votes arrive 1 cycle apart, but wastes pipeline slots
Information Redundancy
• Error-Detecting Code (EDC)
• Words mapped to code words like checksums and CRC
• Hamming Distance (HD)
• Single-Error Correcting (SEC) Double-Error Detecting (DED) with HD of 4
![Page 16: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/16.jpg)
Error Detection
![Page 17: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/17.jpg)
Error Detection
• For an ALU we can compare the bitcount of inputs and outputs, but this is not common
• Many other techniques exist like BIST or calculating a known quantity and comparing to a ROM with the answer in it.
• ReExecution with Shifted Operands (RESO) finds permanent errors.
• Redundant multithreading: use empty slots to run redundancy threads
• Checking invariant conditions
• Anomaly detection like behavioural antivirus (look at data and/or traces)
• Error Detection by Duplicated Instructions (EDDI) – let software check the hardware by duplicating instructions at compile time and comparing their results
• Way way more stuff about caches, CAMs, consistency, and more.
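A rough sketch of why RESO catches a permanent fault that plain re-execution misses. The adder model, the 8-bit `WIDTH`, and the stuck-at-0 output bit are assumptions made for the example: shifting the operands moves the computation onto different bit positions, so the same stuck bit corrupts the two runs differently.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def make_adder(stuck_at_zero_bit=None):
    """Model a (WIDTH+1)-bit adder; optionally one output bit is stuck at 0."""
    def add(x: int, y: int) -> int:
        result = (x + y) & ((1 << (WIDTH + 1)) - 1)
        if stuck_at_zero_bit is not None:
            result &= ~(1 << stuck_at_zero_bit)   # permanent fault
        return result
    return add


def reso_add(add, x: int, y: int) -> tuple[int, bool]:
    """Run once normally and once with operands shifted left by one bit.

    Returns (first result, whether the two runs agree).
    """
    first = add(x, y) & MASK
    second = (add(x << 1, y << 1) >> 1) & MASK
    return first, first == second


healthy = make_adder()
broken = make_adder(stuck_at_zero_bit=3)

print(reso_add(healthy, 5, 3))            # (8, True)
print(broken(5, 3) == broken(5, 3))       # True: naive re-execution is fooled
print(reso_add(broken, 5, 3)[1])          # False: RESO flags the fault
```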
![Page 18: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/18.jpg)
Error Recovery
• Safety from detection but what about liveness?
• Forward Error Recovery
  – FER
  – Once detected, the error is seamlessly corrected
• FER implemented using physical, information, or temporal redundancy
• More HW needed to correct than detect
  – E.g. DMR can detect but TMR or triple-triple can correct (spatial)
• HD=k (information redundancy)
  – k-1 bit errors detection
  – ⌊(k-1)/2⌋ bit errors correction
  – (HD, detect, correct)
  – E.g. (5,4,2)
• TMR by repetition (temporal)
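The last two bullets can be sketched together. `code_capability` applies the HD=k rule, and `run_with_temporal_tmr` is a hypothetical helper that repeats an operation three times and votes; `make_glitchy_square` is a made-up fault injector that corrupts exactly one call to stand in for a transient error.

```python
from collections import Counter


def code_capability(hd: int) -> tuple[int, int, int]:
    """(HD, detectable bit errors, correctable bit errors) for distance hd."""
    return hd, hd - 1, (hd - 1) // 2


def run_with_temporal_tmr(operation, *args):
    """Execute the same operation three times and return the majority result."""
    results = [operation(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one run was corrupted")
    return value


def make_glitchy_square(glitch_on_call: int):
    """Squaring unit whose result gets one bit flipped on the given call."""
    calls = {"count": 0}

    def square(x: int) -> int:
        calls["count"] += 1
        result = x * x
        if calls["count"] == glitch_on_call:
            result ^= 0b100     # transient bit flip
        return result
    return square


print(code_capability(5))                              # (5, 4, 2)
print(run_with_temporal_tmr(make_glitchy_square(2), 7))  # 49
```

Repetition on the same hardware only outvotes transient errors; a permanent fault corrupts all three runs the same way.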
![Page 19: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/19.jpg)
Error Recovery
• Backward Error Recovery
  – BER
  – Rollback / Safe point
  – Restore point
  – Recovery line for multicore (cool!)
  – How do we model communication in multiprocessors (MP) with caches?
  – Just log everything? Nope, save it distributed and in the caches. Possibly use software.
  – Way more crazy algorithm selection magic…
• The Output Commit Problem
  – Sphere of recoverability
  – Don’t let bad data out
  – Wait for error detection hardware to complete
  – Latency is usually hidden
  – Processor state is difficult to store/restore
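A software-level sketch of backward error recovery and the output commit problem. `RecoverableTask` is a hypothetical class, far simpler than real checkpointing hardware: state is checkpointed, outputs are held back until error detection clears the interval, and a detected error rolls everything back.

```python
import copy


class RecoverableTask:
    """Checkpoint/rollback with outputs buffered inside the sphere of recoverability."""

    def __init__(self, state: dict):
        self.state = state
        self._checkpoint = copy.deepcopy(state)
        self._pending_output = []      # not yet visible to the outside world
        self.committed_output = []     # released output, can never be undone

    def emit(self, value) -> None:
        """Buffer an output until the current interval is known to be clean."""
        self._pending_output.append(value)

    def commit(self) -> None:
        """Error detection found nothing: release outputs, take a new checkpoint."""
        self.committed_output.extend(self._pending_output)
        self._pending_output.clear()
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Error detected: restore the last checkpoint and drop unsafe outputs."""
        self.state = copy.deepcopy(self._checkpoint)
        self._pending_output.clear()


task = RecoverableTask({"acc": 0})
task.state["acc"] += 5
task.emit(5)
task.commit()                 # interval verified clean

task.state["acc"] = 999       # corrupted by a fault
task.emit(999)
task.rollback()               # bad data never left the sphere
```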
![Page 20: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/20.jpg)
Error Recovery
FER when DRAM module fails – RAID-M/chipkill
![Page 21: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/21.jpg)
Fault Diagnosis
• Diagnosis hardware
  – FER and BER do not solve livelock
  – E.g. mult fails, recover, mult again.. livelock
• Idea: be smart, figure out what components are toast
• Built-In Self-Test (BIST)
  – Compare boundary scan data or stored tests to a ROM with the right answers
• Run BIST at fixed intervals or at end of context switch
• Commit changes if error free, otherwise restore
• Try to test all components in system, ideally all gates in the system
• MPs/NoC typically have dedicated diagnosis hardware
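A toy model of the BIST idea: run stored test vectors through each unit and compare with known-good answers. `GOLDEN_ROM`, `run_bist`, and the fault model (multiplier output bit 0 stuck at 0) are all assumptions for the example; real BIST works on scan chains and hardware pattern generators rather than Python callables.

```python
# Stored test vectors and their known-good answers (the "ROM").
GOLDEN_ROM = {
    "mult": [((3, 5), 15), ((7, 7), 49), ((0, 9), 0)],
    "add": [((3, 5), 8), ((250, 5), 255)],
}


def run_bist(units: dict, rom: dict = GOLDEN_ROM) -> list[str]:
    """Return the sorted names of units whose outputs disagree with the ROM."""
    failed = []
    for name, vectors in rom.items():
        unit = units[name]
        if any(unit(*operands) != expected for operands, expected in vectors):
            failed.append(name)
    return sorted(failed)


healthy_units = {"mult": lambda a, b: a * b, "add": lambda a, b: a + b}
# Hypothetical permanent fault: multiplier output bit 0 stuck at 0.
faulty_units = dict(healthy_units, mult=lambda a, b: (a * b) & ~1)

print(run_bist(healthy_units))  # []
print(run_bist(faulty_units))   # ['mult']
```

Knowing that `mult` is the dead unit is what breaks the recover-and-retry livelock described above.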
![Page 22: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/22.jpg)
Self-Repair
• BIST can tell you what broke, but not how to fix it.
• i7 can respond to errors on the on-chip busses at runtime. Partial bus shorts do not kill the system. Data is transferred like a packet (NoC)
– Because of all the prediction, lanes, and issue logic, superscalar has much more redundancy than RISC
– For RISC just steal a core from the grid and mark the old core dead
– CISC has some very crazy metrics for triggering self-repair
• Remember the infinite loop mult we diagnosed?
• Alternative: notice that mult is dead, use shift-add (Booth) multiplication instead
• Another cool idea: if shift breaks, use the mult with power-of-2 inputs (hot spare)
• A cold spare would be a fully dedicated redundant unit
– CellBE only uses 7 cores and has an 8th cold spare SPE! So cool!
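The "mult is dead" repair can be sketched in software. Note the swap: this uses plain shift-and-add rather than Booth recoding, to keep the example short, and it only handles non-negative operands. The `Alu` class and its `mark_mult_dead` hook are hypothetical; diagnosis is assumed to call that hook.

```python
def shift_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    product = 0
    while b:
        if b & 1:
            product += a
        a <<= 1
        b >>= 1
    return product


class Alu:
    """Routes multiplies around a multiplier that diagnosis has marked dead."""

    def __init__(self, hw_multiplier):
        self._hw_multiplier = hw_multiplier
        self.mult_dead = False

    def mark_mult_dead(self) -> None:
        self.mult_dead = True

    def multiply(self, a: int, b: int) -> int:
        if self.mult_dead:
            return shift_add_multiply(a, b)   # slower, but still live
        return self._hw_multiplier(a, b)


alu = Alu(hw_multiplier=lambda a, b: 0)   # permanently broken multiplier
wrong = alu.multiply(6, 7)                # faulty result before repair
alu.mark_mult_dead()                      # diagnosis result applied
repaired = alu.multiply(6, 7)             # served by the shifter and adder
```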
![Page 23: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/23.jpg)
Conclusions
• Things are getting a bit crazy in error detection and correction
• Multicore and caches complicated everything
• These fault tolerance techniques have been known for a long time, but they are only now entering the PC market because the error rate is increasing with process technology
• Like the Byzantine generals problem, we start to worry about who to trust in the running but broken chip
• Voting works best for transient errors. It works for permanent errors too, but land the plane or you will end up crashing.
• You can prove that it is easier to detect a problem than fix it.
![Page 24: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/24.jpg)
References
[1] Daniel J. Sorin, “Fault Tolerant Computer Architecture (Synthesis Lectures on Computer Architecture),” 2010.
![Page 25: Fault Tolerance in Embedded Systems](https://reader035.fdocuments.net/reader035/viewer/2022062310/56816932550346895de085c1/html5/thumbnails/25.jpg)
Questions?