CMSC 5719 MSc Seminar Fault-Tolerant Computing
Transcript of CMSC 5719 MSc Seminar Fault-Tolerant Computing
Qiang Xu CUHK, Fall 2012 Part.1 .1
CMSC 5719 MSc Seminar
Fault-Tolerant Computing
XU, Qiang (Johnny) 徐強
[Partly adapted from Koren & Krishna, and B. Parhami Slides]
Why Learn This Stuff?
Outline
Motivation
Fault classification
Redundancy
Metrics for reliability
Case studies
Fault-Tolerance - Basic definition
Fault-tolerant systems: ideally, systems capable of executing their tasks correctly regardless of hardware failures or software errors
In practice, we can never guarantee flawless execution of tasks under all circumstances
So we limit ourselves to the types of failures and errors that are more likely to occur
Need For Fault-Tolerance
Critical applications require extreme fault tolerance (e.g., aircraft, nuclear reactors, medical equipment, and financial applications)
- A malfunction of a computer in such applications can lead to catastrophe
- Their probability of failure must be extremely low, possibly one in a billion per hour of operation
Systems operating in harsh environments face a high possibility of failures: electromagnetic disturbances, particle hits, and the like
Complex systems consist of millions of devices
Get to Know the Enemy: What Causes Faults?
- Manufacturing defects
- Aging (a.k.a. circuit wearout)
Get to Know the Enemy: What Causes Faults?
- Internal electronic noise
- Electromagnetic interference
Get to Know the Enemy: What Causes Faults?
- Bugs …
- Malicious attacks (beyond the scope of this course)
Fault Classification According to Duration
Permanent faults: never go away; the component has to be repaired or replaced
Transient faults: disappear after a relatively short time
- Example: a memory cell whose contents are changed by some electromagnetic interference
- Overwriting the memory cell with the right content makes the fault go away
Intermittent faults: cycle between active and benign states
- Example: a loose connection
- An increasing threat, largely due to temperature and voltage fluctuations
Failures during Lifetime
Three phases of system lifetime:
- Infant mortality (imperfect test, weak components)
- Normal lifetime (transient/intermittent faults)
- Wear-out period (circuit aging)
Seriously, Why Is Fault-Tolerance Coming Back?
Simply put, it is technology-driven
[Graph: transistor cost, reliability cost, and total cost plotted against time; with scaling, transistor cost falls while reliability cost rises]
With technology scaling, today's chips are extremely complex (a billion transistors running with smaller noise margins) and much hotter!
We cannot afford heavyweight, macro-scale redundancy for commodity computing systems.
The Impact of Technology Scaling
More leakage, more process variability, smaller critical charges, and weaker transistors and wires lead to:
- Less effective burn-in test
- Higher random failure rate
- Faster wear-out
What Can We Do when Confronting Enemies?
Surrender, but don't become a traitor
- Fail, but fail safe, i.e., don't corrupt anything (e.g., an ATM machine)
- Not as easy as you may think: you have to detect faults!
Weaken the enemies: fault-avoidance and fault-removal
» Process improvement to reduce threats
» Testing and DfT to remove defective circuits
» Careful design reviews to remove design bugs
» More training to reduce operator errors
- Some faults can never be avoided or removed completely
Make yourself stronger: fault-tolerance
» Add redundancy to detect, diagnose, confine, mask, compensate for, and recover from faults
» Mind the cost in terms of hardware, power, and performance
Fault-evasion (a.k.a. fault-prediction)
» Observe, learn, and take pre-emptive steps to stop faults from occurring
A Motivating Case Study: Data Availability and Integrity Concerns
Distributed DB system with 5 sites
- Full connectivity, dedicated links
- Only direct communication allowed
- Sites and links may malfunction
Redundancy improves availability
[Diagram: five sites S0–S4 fully connected by ten dedicated links L0–L9]
S: probability of a site being available
L: probability of a link being available

Data replication methods, and a challenge
- File duplication: home / mirror sites
- File triplication: home / backup 1 / backup 2
- Are there availability improvement methods with less redundancy?

Single-copy availability = SL
Unavailability = 1 – SL = 1 – 0.99 × 0.95 = 5.95%
(A user at one site accesses file Fi stored at another site.)
Data Duplication: Home and Mirror Sites
[Diagram: the same five-site network; file Fi is stored at a home site and a mirror site]
S: site availability, e.g., 99%
L: link availability, e.g., 95%

Duplicated availability A = SL + (1 – SL)SL = 2SL – (SL)²
(either the primary site can be reached, or the primary site is inaccessible and the mirror site can be reached)
Unavailability = 1 – 2SL + (SL)² = (1 – SL)² = 0.35%

Data unavailability reduced from 5.95% to 0.35%
Availability improved from 94% to 99.65%
Data Triplication: Home and Two Backups
[Diagram: the same five-site network; file Fi is stored at a home site and two backup sites]
S: site availability, e.g., 99%
L: link availability, e.g., 95%

Triplicated availability A = SL + (1 – SL)SL + (1 – SL)²SL = 3SL – 3(SL)² + (SL)³
(the primary site can be reached; or the primary is inaccessible and backup 1 can be reached; or both the primary and backup 1 are inaccessible and backup 2 can be reached)
Unavailability = 1 – 3SL + 3(SL)² – (SL)³ = (1 – SL)³ = 0.02%

Data unavailability reduced from 5.95% to 0.02%
Availability improved from 94% to 99.98%
Data Dispersion: Three of Five Pieces
[Diagram: the same five-site network; file Fi is dispersed into five pieces, 0–4, one per site]
S: site availability, e.g., 99%
L: link availability, e.g., 95%

The file is encoded into five pieces, one per site, such that any three pieces suffice to reconstruct it. The piece at the user's own site is always at hand, so the file is available when at least two of the four remote pieces can be reached:

Dispersed availability A = (SL)⁴ + 4(1 – SL)(SL)³ + 6(1 – SL)²(SL)²
(all four remote pieces can be reached; exactly three can be reached; only two can be reached)
= 6(SL)² – 8(SL)³ + 3(SL)⁴
Availability = 99.92%
Unavailability = 1 – availability = 0.08%

Scheme          Nonredundant   Duplication   Triplication   Dispersion
Unavailability  5.95%          0.35%         0.02%          0.08%
Redundancy      0%             100%          200%           67%
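The numbers in the comparison can be checked mechanically. This short Python sketch (not part of the original slides) plugs S = 0.99 and L = 0.95 into the four availability formulas above:

```python
# Availability of the four data-replication schemes from the case study,
# with S = 0.99 (site availability) and L = 0.95 (link availability).
S, L = 0.99, 0.95
p = S * L  # probability that one remote site can be reached

single      = p
duplicated  = 2*p - p**2                # = 1 - (1 - p)**2
triplicated = 3*p - 3*p**2 + p**3       # = 1 - (1 - p)**3
dispersed   = 6*p**2 - 8*p**3 + 3*p**4  # 2 of 4 remote pieces suffice

for name, a in [("single", single), ("duplicated", duplicated),
                ("triplicated", triplicated), ("dispersed", dispersed)]:
    print(f"{name:12s} availability = {a:.4%}  unavailability = {1-a:.2%}")
```

Running it reproduces the table: unavailability of 5.95%, 0.35%, 0.02%, and 0.08%; dispersion gets close to triplication's availability with only a third of its redundancy.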
Questions Ignored in Our Simple Example
1. How are redundant copies of the data kept consistent? When a user modifies the data, how do we update the redundant copies (pieces) quickly and prevent the use of stale data in the meantime?
2. How are malfunctioning sites and links identified? Malfunction diagnosis must be quick to avoid data contamination.
3. How is recovery accomplished when a malfunctioning site/link returns to service after repair? The returning site must be brought up to date with regard to changes.
4. How is data corrupted by the actions of an adversary detected? This is more difficult than detecting random malfunctions.

The example does demonstrate, however, that:
- Many alternatives are available for improving dependability
- Proposed methods must be assessed through modeling
- The most cost-effective solution may be far from obvious
Redundancy
Redundancy is at the heart of fault-tolerance: the incorporation of extra components in the design of a system to improve its reliability

Four forms of redundancy:
- Hardware redundancy (spatial redundancy)
» Static, dynamic, and hybrid redundancy
- Software redundancy
» N-version programming
- Information redundancy
» Error-detecting and error-correcting codes
» Usually requires extra hardware for processing
- Time redundancy
» Re-execution
Physical Redundancy
Physically replicate modules
- Effective for all sorts of faults
- Mind the area/energy overhead
Design issues
- How many copies?
- How to detect faults?
- How to recover from faults?
- How to organize redundancy (passive, active, or hybrid)?
Triple Modular Redundancy (TMR)
The best-known FT technique
- Tolerates a single error (soft or hard) in any module
- Low performance overhead
- Simple design
- Very high cost in terms of area and energy
To tolerate multiple simultaneous faults, we can resort to N-modular redundancy (NMR)
- N is an odd integer
- Tolerates up to (N–1)/2 faulty modules
Single point of failure at the voter
- The voter is typically small and hence often assumed to be very reliable
Reliability of TMR Systems
M-of-N system with M = 2, N = 3: the system is good if at least two modules are operational
A voter picks the majority output
The voter can fail; its reliability is Rvot(t)

R_tmr(t) = Rvot(t) · Σ_{i=2..3} C(3,i) R(t)^i (1 – R(t))^(3–i) = Rvot(t) ( 3R²(t) – 2R³(t) )
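The 2-of-3 expression is easy to evaluate numerically. A small Python sketch (function name is my own) that also illustrates a well-known caveat, that TMR only helps when the individual module is already fairly reliable:

```python
# TMR reliability: the system works if at least 2 of the 3 modules work
# and the voter works.  R is module reliability, Rvot voter reliability.
from math import comb

def r_tmr(r_module, r_voter=1.0):
    """R_tmr = Rvot * sum_{i=2..3} C(3,i) R^i (1-R)^(3-i) = Rvot*(3R^2 - 2R^3)."""
    total = sum(comb(3, i) * r_module**i * (1 - r_module)**(3 - i)
                for i in range(2, 4))
    return r_voter * total

print(r_tmr(0.9))   # 0.972 > 0.9: TMR improves a good module
print(r_tmr(0.4))   # 0.352 < 0.4: TMR hurts when modules are poor
```

The closed form 3R² – 2R³ crosses R at R = 0.5, so below that point the redundancy is counterproductive.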
Triplicated Processor/Memory System
All communications (in either direction) between the triplicated processors and triplicated memories go through majority voting
Triplicating the voters as well gives higher reliability than using a single majority voter for the triplicated processor/memory structure
Design Redundancy
Use diverse designs to furnish the same service
- Another kind of physical redundancy
Advantages
- Protection against design deficiencies
- Lower cost with a simple "back-up" unit
Watchdog Processor
Performs concurrent system-level error detection by monitoring the bus connecting the main processor and memory
Targets control-flow checking: are the correct program blocks executed in the right order?
Can detect hardware/software faults that cause erroneous instructions to be executed or wrong execution paths to be taken
The watchdog needs the program's control-flow information
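A hypothetical sketch of the idea (the program, its control-flow graph, and the class are illustrative assumptions, not a real watchdog design): the main program reports the basic block it enters, and the watchdog checks each transition against the precomputed control-flow graph.

```python
# Watchdog-style control-flow checking: every reported block transition
# must be an edge of the precomputed control-flow graph (CFG).
CFG = {                      # valid successor blocks of an assumed program
    "entry": {"loop"},
    "loop":  {"loop", "exit"},
    "exit":  set(),
}

class Watchdog:
    def __init__(self, cfg, start):
        self.cfg, self.current = cfg, start

    def report(self, block):
        if block not in self.cfg[self.current]:
            raise RuntimeError(f"control-flow error: {self.current} -> {block}")
        self.current = block

wd = Watchdog(CFG, "entry")
for block in ["loop", "loop", "exit"]:   # a legal execution trace
    wd.report(block)
print(wd.current)                        # exit
```

An illegal trace such as jumping straight from "entry" to "exit" would raise immediately, which is exactly the erroneous-execution-path case mentioned above.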
DIVA: Dynamic Implementation Verification Architecture
Core computation, communication, and control validated by checker
Checker relaxes the burden of correctness on the core processor
Key checker requirements: simple, fast, and reliable
N-Version Programming
N independent teams of programmers develop software to the same specifications; the N versions are run in parallel and their outputs are voted on
If the programs are developed independently, it is very unlikely that they will fail on the same inputs
Assumption: failures are statistically independent; the probability of failure of an individual version is q
Probability of no more than m failures out of N versions:
P = Σ_{i=0..m} C(N,i) q^i (1 – q)^(N–i)
What are the limitations?
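A minimal sketch of the voting arithmetic under the independence assumption stated above (the function name and example numbers are illustrative):

```python
from math import comb

def p_at_most_m_failures(N, m, q):
    """Probability that at most m of N independent versions fail,
    each failing with probability q (binomial tail)."""
    return sum(comb(N, i) * q**i * (1 - q)**(N - i) for i in range(m + 1))

# A 3-version system voted 2-of-3 survives at most 1 failure:
print(p_at_most_m_failures(3, 1, 0.01))   # 0.999702
```

With q = 1%, the 2-of-3 vote drops the failure probability from 1% to about 0.03%. The main limitation, of course, is that real versions built to the same specification tend to share failure modes, so the independence assumption is optimistic.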
Information Redundancy - Coding
A data word with d bits is encoded into a codeword with c bits, c > d
- Not all 2^c combinations are valid codewords
- To extract the original data, the c bits must be decoded
- If the c bits do not constitute a valid codeword, an error is detected
- For certain encoding schemes, some types of errors can also be corrected
Key parameters of a code:
- Number of erroneous bits that can be detected
- Number of erroneous bits that can be corrected
Overhead of a code:
- Additional bits required
- Additional hardware/latency for encoding and decoding
Hamming Distance
The Hamming distance between two codewords - the number of bit positions in which the two words differ
A Hamming distance of two between two codewords implies that a single bit error will not change one of the codewords into the other
Distance of a Code
The distance of a code: the minimum Hamming distance between any two valid codewords
Example: the code {001, 010, 100, 111} has a distance of 2; it can detect any single-bit error
Example: the code {000, 111} has a distance of 3; it can detect any single- or double-bit error; if double-bit errors are unlikely to happen, the code can correct any single-bit error
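Both examples can be verified mechanically. A small sketch (function names are my own) that computes pairwise Hamming distances and the code distance:

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(code):
    """Minimum Hamming distance over all pairs of valid codewords."""
    return min(hamming(a, b) for a, b in combinations(code, 2))

print(code_distance(["001", "010", "100", "111"]))  # 2 -> detects 1-bit errors
print(code_distance(["000", "111"]))                # 3 -> corrects 1-bit errors
```

The printed distances match the k+1 (detection) and 2k+1 (correction) rules discussed on the next slide.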
Coding vs. Redundancy
The code {000, 111} can be used to encode a single data bit: 0 is encoded as 000 and 1 as 111. This code is identical to TMR.
The code {00, 11} can also be used to encode a single data bit: 0 is encoded as 00 and 1 as 11. This code is identical to a duplex.
To detect up to k bit errors, the code distance must be at least k+1
To correct up to k bit errors, the code distance must be at least 2k+1
Separability of a Code
A code is separable if it has separate fields for the data and the code bits
- Decoding consists of disregarding the code bits
- The code bits can be processed separately to verify the correctness of the data
A non-separable code has the data and code bits integrated together; extracting the data from the encoded word requires some processing
The simplest separable codes are the parity codes
- A parity code has a distance of 2
- Can detect all odd-bit errors
- Even or odd parity code?
Error-Correcting Parity Codes
Simplest scheme: data is organized in a 2-dimensional array
- Bits at the end of each row: parity over the row
- Bits at the bottom of each column: parity over the column
A single-bit error anywhere will cause a row and a column to be erroneous; this identifies a unique erroneous bit
This is an example of overlapping parity: each bit is covered by more than one parity bit
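A minimal sketch of the scheme, assuming even parity and at most one bit in error (the grid layout and function name are illustrative):

```python
# 2-D (row/column) even parity: a single-bit error makes exactly one row
# parity and one column parity fail, which together pinpoint the bad bit.
def parity_check(grid):
    """Locate a single-bit error in a grid whose rows and columns all carry
    even parity.  Returns (row, col) of the bad bit, or None if parities hold.
    Assumes at most one bit is in error."""
    bad_rows = [r for r, row in enumerate(grid) if sum(row) % 2 != 0]
    bad_cols = [c for c in range(len(grid[0]))
                if sum(row[c] for row in grid) % 2 != 0]
    if not bad_rows and not bad_cols:
        return None
    return (bad_rows[0], bad_cols[0])

# 4x4 array whose last row and last column are parity bits (all even):
grid = [[1, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
grid[1][2] ^= 1            # inject a single-bit error
print(parity_check(grid))  # (1, 2): row 1 and column 2 identify the bit
```

Flipping the located bit back restores the array, which is exactly the correction step described above.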
Cyclic Redundancy Check (CRC)
Many applications need to detect burst errors
Why is CRC popular?
- Effectiveness: an n-bit CRC can detect all errors of fewer than n bits and a large portion of longer multi-bit errors
- Ease of hardware implementation: shift registers and XOR gates
How does it work?
- Consider the dataword and codeword as polynomials over GF(2)
- At the transmitter side, Codeword = Dataword × Generator
» The generator is a pre-defined CRC polynomial
» An example, the CRC-16 polynomial: G(X) = X¹⁶ + X¹⁵ + X² + 1
- At the receiver side, divide the codeword by the CRC polynomial and check whether the remainder is zero
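A bit-level sketch of the multiply-then-divide formulation above, with integer bit-vectors standing in for GF(2) polynomials (the helper names are my own; production CRCs usually use the systematic append-remainder form instead):

```python
def gf2_mul(a, b):
    """Carry-less (GF(2)) polynomial multiplication of bit-vectors as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def gf2_mod(a, g):
    """Remainder of GF(2) polynomial division of a by g (long division)."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a

G = (1 << 16) | (1 << 15) | (1 << 2) | 1   # X^16 + X^15 + X^2 + 1

data = 0b1011001
codeword = gf2_mul(data, G)               # transmit Dataword * Generator
print(gf2_mod(codeword, G))               # 0: clean codeword passes the check
print(gf2_mod(codeword ^ 0b100, G) != 0)  # True: a flipped bit is caught
```

Because division is linear over GF(2), the remainder of a corrupted codeword equals the remainder of the error polynomial alone, which is why every burst shorter than the generator's degree is detected.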
Time Redundancy
Perform the execution multiple times (typically twice), and then compare the results
- Effective for transient faults
- Does it work for permanent errors?
Cost of time redundancy
- Performance cost: can we mitigate it?
- Energy cost: can we mitigate it?
Reversible Computation
Many operations are reversible
- Addition/subtraction, shift left/shift right, etc.
If applying the inverse operation does not recover the original operand, we know there's a problem
- Which operations are non-reversible?
- The devil is in the details
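A tiny illustrative sketch of the idea for addition (the function and fault model are assumptions, not from the slides): instead of re-executing the add, apply its inverse and check that the operand comes back.

```python
# Time redundancy via a reversible operation: verify an addition by
# subtracting one operand back and comparing with the other.
def checked_add(a, b):
    s = a + b
    if s - b != a:          # inverse operation must recover the operand
        raise RuntimeError("transient fault detected in adder")
    return s

print(checked_add(5, 7))    # 12
```

The appeal is that the check uses a different operation (and possibly different hardware) than the original computation, so a fault is unlikely to corrupt both consistently.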
With Redundancy, What Can We Do?
Forward Error Correction (FEC)
- Also known as forward error recovery (FER), although it's actually not recovery
- Use redundancy to mask error effects
- The system continues to go forward in the presence of errors
- Example: triple modular redundancy (TMR)
With Redundancy, What Can We Do?
Backward Error Recovery (BER)
- Use redundancy to recover from errors
- The system goes backward to a saved good state
- Example: periodic checkpoint and replay
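A minimal checkpoint-and-replay sketch; the state, the step function, and the injected fault are illustrative assumptions:

```python
import copy

# Backward error recovery: checkpoint the state before each step; on a
# detected error, roll back to the checkpoint and replay the step.
state = {"counter": 0}

def step(state, inject_fault=False):
    state["counter"] += 1
    if inject_fault:
        state["counter"] += 999               # model a transient corruption...
        raise RuntimeError("error detected")  # ...caught by some checker

for i in range(5):
    checkpoint = copy.deepcopy(state)     # saved good state
    try:
        step(state, inject_fault=(i == 3))
    except RuntimeError:
        state = copy.deepcopy(checkpoint)  # roll back ...
        step(state)                        # ... and replay
print(state["counter"])  # 5: all steps completed despite the fault
```

Note the tradeoff previewed in the next slide: the no-fault path only pays the (cheap) checkpointing cost, but recovery takes extra time when a fault does occur.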
The Impact of FEC vs. BER
                                 FEC                BER
Performance (when no faults)     Some degradation   Little degradation
Performance (when faults occur)  No change          Takes time for recovery
Energy cost                      Usually high       Usually low
Hardware cost                    High               Low
Design cost                      Low                High

When the failure rate is very high, which one is preferred?
Fault-Tolerance is NOT Free!
Fault-tolerance can be achieved to any arbitrary degree if you're willing to throw resources at it
- Canonical solutions have been around for a long time
Many FT solutions hurt performance, e.g.,
- Checkpoint and replay
- Tightly lockstepped redundant cores
- Redundant multithreading
Many FT solutions increase cost, e.g.,
- TMR and NMR
- RAID
- N-version programming
Almost all FT techniques increase energy consumption
Fault-Tolerance for Designers
Fault-tolerance is essentially redundancy allocation and management, and design is about tradeoffs!
As designers, smarter FT solutions can be obtained by:
- Knowing your enemies better (what causes the failure, failure rate, failure distribution, etc.)
- Knowing your design better (specific properties, anything "free", when and what to sacrifice, etc.)
[Diagram: design tradeoffs among performance, reliability, cost, and energy]
Levels of Fault-Tolerance
[Diagram: the system stack — hardware system (processor(s) with ALU, register file, and cache; hardware accelerators; flip-flops; memory system; interconnection network), (host) operating system / virtual machine monitor with drivers and guest OSes, and application software — annotated with FT techniques at each level]
- Circuit hardening
- Block-level redundancy; ECC for memory
- Core-level redundancy; dynamic verification
- Virtualization; task migration
- Redundant multithreading; fault-tolerant scheduling
- Application software: software redundancy
Lots of FT Buzzwords over Time …
- Reliability: continuation of service while being used
- Availability: readiness for use whenever needed
- Serviceability: ease of service or repair
- Safety: absence of catastrophic consequences
- Maintainability: ability to undergo modifications and repairs
- Survivability, confidentiality, accessibility, …
- Security: the degree of protection against danger, loss, and criminals
- Dependability: the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers (defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance)
You're right, I don't know what it exactly means … Who cares?
Designers Need Measures
A measure is a mathematical abstraction which expresses only some subset of the object's nature, i.e., its FT capability here
- Reliability, R(t): probability that the system is up during the whole interval [0, t]; for non-repairable products
- Availability, A(t): fraction of time the system is up during the interval [0, t]; for repairable products
» Point availability, Ap(t): probability that the system is up at time t
» Long-term availability: A = lim_{t→∞} A(t) = lim_{t→∞} Ap(t)
» People usually talk about "the 9's"
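For intuition (not from the slides), "the 9's" translate into yearly downtime as follows:

```python
# "The 9's": long-term availability expressed as allowed yearly downtime.
HOURS_PER_YEAR = 24 * 365

for nines in range(2, 6):
    a = 1 - 10**(-nines)                     # e.g., 3 nines -> 0.999
    downtime_min = (1 - a) * HOURS_PER_YEAR * 60
    print(f"{nines} nines ({a}): {downtime_min:8.1f} min/year downtime")
```

Three nines allow roughly 8.8 hours of downtime per year; five nines, about five minutes.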
Designers Need Intuitive Measures
- Mean Time To Failure, MTTF: average time the system remains up before it goes down and has to be repaired or replaced
» MTTF is about the mean only, so there is also nTTF
- Mean Time To Repair, MTTR
- Mean Time Between Failures, MTBF = MTTF + MTTR
- Failures In Time, FIT: number of failures per 10⁹ hours
- A = MTTF / (MTTF + MTTR) = MTTF / MTBF
Be careful about the assumptions behind these measures!
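A quick sketch of these relations (steady-state assumption; the function names are my own):

```python
# Long-term availability from MTTF and MTTR, and the FIT conversion.
def availability(mttf_hours, mttr_hours):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf_hours / (mttf_hours + mttr_hours)

def fit_from_mttf(mttf_hours):
    """Failures In Time: failures per 10^9 device-hours."""
    return 1e9 / mttf_hours

print(availability(999, 1))   # 0.999: one hour of repair per ~1000 h of uptime
print(fit_from_mttf(1e6))     # 1000.0 FIT for a 10^6-hour MTTF
```

Note the caveat above: these formulas assume failures and repairs are steady-state averages, which hides the actual failure distribution.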
More Detailed (Complex) Measures
The assumption of the system being in state "up" or "down" is very limiting
Example: multicore processors
- Let Pi = Prob{i processors are operational}
- Let c = computational capacity of a processor (e.g., the number of fixed-size tasks it can execute)
- Computational capacity of i processors: Ci = i·c
- Average computational capacity of the system: Σ_i Ci·Pi
Performability: consider everything from the perspective of the application
- The application is used to define "accomplishment levels" L1, L2, …, Ln, each representing a QoS level
- The measure is the vector (P(L1), P(L2), …, P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach accomplishment level Li
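A small numerical sketch of the average-capacity formula; the probability distribution below is an illustrative assumption, not from the slides:

```python
# Average computational capacity of a 4-core processor: P[i] is the
# probability that exactly i cores are operational, and each operational
# core can execute c fixed-size tasks.
c = 2.0                          # tasks per operational core (assumed)
P = {4: 0.90, 3: 0.07, 2: 0.02, 1: 0.005, 0: 0.005}  # assumed distribution
assert abs(sum(P.values()) - 1.0) < 1e-9             # sanity check

avg_capacity = sum(i * c * p for i, p in P.items())  # Sigma_i Ci * Pi
print(avg_capacity)
```

Unlike a plain up/down availability figure, this measure credits the degraded states (3, 2, or 1 working cores) with the partial service they still deliver.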
Example: Tandem for Transaction Processing
Design objectives:
- "Nonstop" operation
- Modular system expansion
FT design features:
- Loosely-coupled multi-computer architecture
- Hardware/software module fast-fail
- Error-correcting memory
- Error-detecting messages
- Watchdog timers
- …
Example: AT&T Switching Systems
Design objectives:
- High availability: 2 hours of downtime in 40 years
- Differentiated user aggravation levels
» Extremely low disconnection rate for established calls
» Low failure rate for call establishment
FT design features:
- Redundant processors
- 30% of control logic devoted to self-checking (for the 1981 3B20D processor)
- Various forms of EDC and ECC
- Watchdog timers
- Multiple levels of fault recovery
- …
Example: Personal Computer
Design objectives:
- Fast and cheap
- Occasional corruption is tolerable
- Expected lifetime: a couple of years
FT design features:
- ECC for memory and hard disk
- …
More FT features will be in place for commodity ICs in the near future due to increasing reliability threats; the key is cost-effectiveness!