CMSC 5719 MSc Seminar Fault-Tolerant Computing
Transcript of CMSC 5719 MSc Seminar Fault-Tolerant Computing
Qiang Xu CUHK, Fall 2012 Part.1 .1
CMSC 5719 MSc Seminar
Fault-Tolerant Computing
XU, Qiang (Johnny) 徐強
[Partly adapted from Koren & Krishna, and B. Parhami Slides]
Why Learn This Stuff?
Outline
Motivation
Fault classification
Redundancy
Metrics for reliability
Case studies
Fault-Tolerance - Basic definition
Fault-tolerant systems: ideally, systems capable of executing their tasks correctly regardless of hardware failures or software errors
In practice, we can never guarantee flawless execution of tasks under all circumstances
So we limit ourselves to the types of failures and errors that are more likely to occur
Need For Fault-Tolerance
Critical applications require extreme fault tolerance (e.g., aircraft, nuclear reactors, medical equipment, and financial applications)
- A malfunction of a computer in such applications can lead to catastrophe
- Their probability of failure must be extremely low, possibly one in a billion per hour of operation
Systems operating in harsh environments face a high possibility of failures: electromagnetic disturbances, particle hits, and the like
Complex systems consist of millions of devices
Get to Know the Enemy: What Causes Faults?
- Manufacturing defects
- Aging (a.k.a. circuit wearout)
Get to Know the Enemy: What Causes Faults?
- Internal electronic noise
- Electromagnetic interference
Get to Know the Enemy: What Causes Faults?
- Bugs …
- Malicious attacks (beyond the scope of this course)
Fault Classification According to Duration
Permanent faults: never go away; the component has to be repaired or replaced
Transient faults: disappear after a relatively short time
- Example: a memory cell whose contents are changed by some electromagnetic interference
- Overwriting the memory cell with the right content makes the fault go away
Intermittent faults: cycle between active and benign states
- Example: a loose connection
- An increasing threat, largely due to temperature and voltage fluctuations
Failures during Lifetime
Three phases of system lifetime:
- Infant mortality (imperfect test, weak components)
- Normal lifetime (transient/intermittent faults)
- Wear-out period (circuit aging)
Seriously, Why Is Fault-Tolerance Coming Back?
Simply put, it is technology-driven
[Graph: transistor cost, reliability cost, and total cost plotted against time; with scaling, transistor cost falls while reliability cost rises]
With technology scaling, today's chips are extremely complex (a billion transistors running with smaller noise margins) and much hotter!
We cannot afford heavyweight, macro-scale redundancy for commodity computing systems.
The Impact of Technology Scaling
More leakage, more process variability, smaller critical charges, and weaker transistors and wires lead to:
- Less effective burn-in test
- Higher random failure rate
- Faster wear-out
What Can We Do when Confronting Enemies?
Surrender, but don't become a traitor
- Fail, but fail safe, i.e., don't corrupt anything (e.g., an ATM machine)
- Not as easy as you may think: you have to detect faults!
Weaken the enemies: fault-avoidance and fault-removal
» Process improvement to reduce threats
» Testing and DfT to remove defective circuits
» Careful design reviews to remove design bugs
» More training to reduce operator errors
- Some faults can never be avoided or removed completely
Make yourself stronger: fault-tolerance
» Add redundancy to detect, diagnose, confine, mask, compensate for, and recover from faults
» Mind the cost in terms of hardware, power, and performance
Fault-evasion (a.k.a. fault-prediction)
» Observe, learn, and take pre-emptive steps to stop faults from occurring
A Motivating Case Study: Data Availability and Integrity Concerns
Distributed DB system with 5 sites
- Full connectivity, dedicated links
- Only direct communication allowed
- Sites and links may malfunction
Redundancy improves availability
[Diagram: five sites S0–S4 fully connected by ten dedicated links L0–L9]
S: probability of a site being available
L: probability of a link being available

Data replication methods, and a challenge
- File duplication: home / mirror sites
- File triplication: home / backup 1 / backup 2
- Are there availability improvement methods with less redundancy?

Single-copy availability = SL
Unavailability = 1 – SL = 1 – 0.99 × 0.95 = 5.95%
(A user at one site accesses file Fi stored at another site.)
Data Duplication: Home and Mirror Sites
[Diagram: the same five-site network; file Fi is stored at a home site and a mirror site]
S: site availability, e.g., 99%
L: link availability, e.g., 95%

Duplicated availability A = SL + (1 – SL)SL = 2SL – (SL)²
(either the primary site can be reached, or the primary site is inaccessible and the mirror site can be reached)
Unavailability = 1 – 2SL + (SL)² = (1 – SL)² = 0.35%

Data unavailability reduced from 5.95% to 0.35%
Availability improved from 94% to 99.65%
Data Triplication: Home and Two Backups
[Diagram: the same five-site network; file Fi is stored at a home site and two backup sites]
S: site availability, e.g., 99%
L: link availability, e.g., 95%

Triplicated availability A = SL + (1 – SL)SL + (1 – SL)²SL = 3SL – 3(SL)² + (SL)³
(the primary site can be reached; or the primary is inaccessible and backup 1 can be reached; or both the primary and backup 1 are inaccessible and backup 2 can be reached)
Unavailability = 1 – 3SL + 3(SL)² – (SL)³ = (1 – SL)³ = 0.02%

Data unavailability reduced from 5.95% to 0.02%
Availability improved from 94% to 99.98%
Data Dispersion: Three of Five Pieces
[Diagram: the same five-site network; file Fi is dispersed into five pieces, 0–4, one per site]
S: site availability, e.g., 99%
L: link availability, e.g., 95%

The file is encoded into five pieces, one per site, such that any three pieces suffice to reconstruct it. The piece at the user's own site is always at hand, so the file is available when at least two of the four remote pieces can be reached:

Dispersed availability A = (SL)⁴ + 4(1 – SL)(SL)³ + 6(1 – SL)²(SL)²
(all four remote pieces can be reached; exactly three can be reached; only two can be reached)
= 6(SL)² – 8(SL)³ + 3(SL)⁴
Availability = 99.92%
Unavailability = 1 – availability = 0.08%

Scheme          Nonredundant   Duplication   Triplication   Dispersion
Unavailability  5.95%          0.35%         0.02%          0.08%
Redundancy      0%             100%          200%           67%
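The numbers in the comparison can be checked mechanically. This short Python sketch (not part of the original slides) plugs S = 0.99 and L = 0.95 into the four availability formulas above:

```python
# Availability of the four data-replication schemes from the case study,
# with S = 0.99 (site availability) and L = 0.95 (link availability).
S, L = 0.99, 0.95
p = S * L  # probability that one remote site can be reached

single      = p
duplicated  = 2*p - p**2                # = 1 - (1 - p)**2
triplicated = 3*p - 3*p**2 + p**3       # = 1 - (1 - p)**3
dispersed   = 6*p**2 - 8*p**3 + 3*p**4  # 2 of 4 remote pieces suffice

for name, a in [("single", single), ("duplicated", duplicated),
                ("triplicated", triplicated), ("dispersed", dispersed)]:
    print(f"{name:12s} availability = {a:.4%}  unavailability = {1-a:.2%}")
```

Running it reproduces the table: unavailability of 5.95%, 0.35%, 0.02%, and 0.08%; dispersion gets close to triplication's availability with only a third of its redundancy.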
Questions Ignored in Our Simple Example
1. How are redundant copies of the data kept consistent? When a user modifies the data, how do we update the redundant copies (pieces) quickly and prevent the use of stale data in the meantime?
2. How are malfunctioning sites and links identified? Malfunction diagnosis must be quick to avoid data contamination.
3. How is recovery accomplished when a malfunctioning site/link returns to service after repair? The returning site must be brought up to date with regard to changes.
4. How is data corrupted by the actions of an adversary detected? This is more difficult than detecting random malfunctions.

The example does demonstrate, however, that:
- Many alternatives are available for improving dependability
- Proposed methods must be assessed through modeling
- The most cost-effective solution may be far from obvious
Redundancy
Redundancy is at the heart of fault-tolerance: the incorporation of extra components in the design of a system to improve its reliability

Four forms of redundancy:
- Hardware redundancy (spatial redundancy)
» Static, dynamic, and hybrid redundancy
- Software redundancy
» N-version programming
- Information redundancy
» Error-detecting and error-correcting codes
» Usually requires extra hardware for processing
- Time redundancy
» Re-execution
Physical Redundancy
Physically replicate modules
- Effective for all sorts of faults
- Mind the area/energy overhead
Design issues
- How many copies?
- How to detect faults?
- How to recover from faults?
- How to organize redundancy (passive, active, or hybrid)?
Triple Modular Redundancy (TMR)
The best-known FT technique
- Tolerates a single error (soft or hard) in any module
- Low performance overhead
- Simple design
- Very high cost in terms of area and energy
To tolerate multiple simultaneous faults, we can resort to N-modular redundancy (NMR)
- N is an odd integer
- Tolerates up to (N–1)/2 faulty modules
Single point of failure at the voter
- The voter is typically small and hence often assumed to be very reliable
Reliability of TMR Systems
M-of-N system with M = 2, N = 3: the system is good if at least two modules are operational
A voter picks the majority output
The voter can fail; its reliability is Rvot(t)

R_tmr(t) = Rvot(t) · Σ_{i=2..3} C(3,i) R(t)^i (1 – R(t))^(3–i) = Rvot(t) ( 3R²(t) – 2R³(t) )
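The 2-of-3 expression is easy to evaluate numerically. A small Python sketch (function name is my own) that also illustrates a well-known caveat, that TMR only helps when the individual module is already fairly reliable:

```python
# TMR reliability: the system works if at least 2 of the 3 modules work
# and the voter works.  R is module reliability, Rvot voter reliability.
from math import comb

def r_tmr(r_module, r_voter=1.0):
    """R_tmr = Rvot * sum_{i=2..3} C(3,i) R^i (1-R)^(3-i) = Rvot*(3R^2 - 2R^3)."""
    total = sum(comb(3, i) * r_module**i * (1 - r_module)**(3 - i)
                for i in range(2, 4))
    return r_voter * total

print(r_tmr(0.9))   # 0.972 > 0.9: TMR improves a good module
print(r_tmr(0.4))   # 0.352 < 0.4: TMR hurts when modules are poor
```

The closed form 3R² – 2R³ crosses R at R = 0.5, so below that point the redundancy is counterproductive.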
Triplicated Processor/Memory System
All communications (in either direction) between the triplicated processors and triplicated memories go through majority voting
Triplicating the voters as well gives higher reliability than using a single majority voter for the triplicated processor/memory structure
Design Redundancy
Use diverse designs to furnish the same service
- Another kind of physical redundancy
Advantages
- Protection against design deficiencies
- Lower cost with a simple "back-up" unit
Watchdog Processor
Performs concurrent system-level error detection by monitoring the bus connecting the main processor and memory
Targets control-flow checking: are the correct program blocks executed in the right order?
Can detect hardware/software faults that cause erroneous instructions to be executed or wrong execution paths to be taken
The watchdog needs the program's control-flow information
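A hypothetical sketch of the idea (the program, its control-flow graph, and the class are illustrative assumptions, not a real watchdog design): the main program reports the basic block it enters, and the watchdog checks each transition against the precomputed control-flow graph.

```python
# Watchdog-style control-flow checking: every reported block transition
# must be an edge of the precomputed control-flow graph (CFG).
CFG = {                      # valid successor blocks of an assumed program
    "entry": {"loop"},
    "loop":  {"loop", "exit"},
    "exit":  set(),
}

class Watchdog:
    def __init__(self, cfg, start):
        self.cfg, self.current = cfg, start

    def report(self, block):
        if block not in self.cfg[self.current]:
            raise RuntimeError(f"control-flow error: {self.current} -> {block}")
        self.current = block

wd = Watchdog(CFG, "entry")
for block in ["loop", "loop", "exit"]:   # a legal execution trace
    wd.report(block)
print(wd.current)                        # exit
```

An illegal trace such as jumping straight from "entry" to "exit" would raise immediately, which is exactly the erroneous-execution-path case mentioned above.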
DIVA: Dynamic Implementation Verification Architecture
Core computation, communication, and control validated by checker
Checker relaxes the burden of correctness on the core processor
Key checker requirements: simple, fast, and reliable
N-Version Programming
N independent teams of programmers develop software to the same specifications; the N versions are run in parallel and their outputs are voted on
If the programs are developed independently, it is very unlikely that they will fail on the same inputs
Assumption: failures are statistically independent; the probability of failure of an individual version is q
Probability of no more than m failures out of N versions:
P = Σ_{i=0..m} C(N,i) q^i (1 – q)^(N–i)
What are the limitations?
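A minimal sketch of the voting arithmetic under the independence assumption stated above (the function name and example numbers are illustrative):

```python
from math import comb

def p_at_most_m_failures(N, m, q):
    """Probability that at most m of N independent versions fail,
    each failing with probability q (binomial tail)."""
    return sum(comb(N, i) * q**i * (1 - q)**(N - i) for i in range(m + 1))

# A 3-version system voted 2-of-3 survives at most 1 failure:
print(p_at_most_m_failures(3, 1, 0.01))   # 0.999702
```

With q = 1%, the 2-of-3 vote drops the failure probability from 1% to about 0.03%. The main limitation, of course, is that real versions built to the same specification tend to share failure modes, so the independence assumption is optimistic.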
Information Redundancy - Coding
A data word with d bits is encoded into a codeword with c bits, c > d
- Not all 2^c combinations are valid codewords
- To extract the original data, the c bits must be decoded
- If the c bits do not constitute a valid codeword, an error is detected
- For certain encoding schemes, some types of errors can also be corrected
Key parameters of a code:
- Number of erroneous bits that can be detected
- Number of erroneous bits that can be corrected
Overhead of a code:
- Additional bits required
- Additional hardware/latency for encoding and decoding
Hamming Distance
The Hamming distance between two codewords - the number of bit positions in which the two words differ
A Hamming distance of two between two codewords implies that a single bit error will not change one of the codewords into the other
Distance of a Code
The distance of a code: the minimum Hamming distance between any two valid codewords
Example: the code {001, 010, 100, 111} has a distance of 2; it can detect any single-bit error
Example: the code {000, 111} has a distance of 3; it can detect any single- or double-bit error; if double-bit errors are unlikely to happen, the code can correct any single-bit error
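Both examples can be verified mechanically. A small sketch (function names are my own) that computes pairwise Hamming distances and the code distance:

```python
from itertools import combinations

def hamming(a, b):
    """Number of bit positions in which two equal-length codewords differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(code):
    """Minimum Hamming distance over all pairs of valid codewords."""
    return min(hamming(a, b) for a, b in combinations(code, 2))

print(code_distance(["001", "010", "100", "111"]))  # 2 -> detects 1-bit errors
print(code_distance(["000", "111"]))                # 3 -> corrects 1-bit errors
```

The printed distances match the k+1 (detection) and 2k+1 (correction) rules discussed on the next slide.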
Coding vs. Redundancy
The code {000, 111} can be used to encode a single data bit: 0 is encoded as 000 and 1 as 111. This code is identical to TMR.
The code {00, 11} can also be used to encode a single data bit: 0 is encoded as 00 and 1 as 11. This code is identical to a duplex.
To detect up to k bit errors, the code distance must be at least k+1
To correct up to k bit errors, the code distance must be at least 2k+1
Separability of a Code
A code is separable if it has separate fields for the data and the code bits
- Decoding consists of disregarding the code bits
- The code bits can be processed separately to verify the correctness of the data
A non-separable code has the data and code bits integrated together; extracting the data from the encoded word requires some processing
The simplest separable codes are the parity codes
- A parity code has a distance of 2
- Can detect all odd-bit errors
- Even or odd parity code?
Error-Correcting Parity Codes
Simplest scheme: data is organized in a 2-dimensional array
- Bits at the end of each row: parity over the row
- Bits at the bottom of each column: parity over the column
A single-bit error anywhere will cause a row and a column to be erroneous; this identifies a unique erroneous bit
This is an example of overlapping parity: each bit is covered by more than one parity bit
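A minimal sketch of the scheme, assuming even parity and at most one bit in error (the grid layout and function name are illustrative):

```python
# 2-D (row/column) even parity: a single-bit error makes exactly one row
# parity and one column parity fail, which together pinpoint the bad bit.
def parity_check(grid):
    """Locate a single-bit error in a grid whose rows and columns all carry
    even parity.  Returns (row, col) of the bad bit, or None if parities hold.
    Assumes at most one bit is in error."""
    bad_rows = [r for r, row in enumerate(grid) if sum(row) % 2 != 0]
    bad_cols = [c for c in range(len(grid[0]))
                if sum(row[c] for row in grid) % 2 != 0]
    if not bad_rows and not bad_cols:
        return None
    return (bad_rows[0], bad_cols[0])

# 4x4 array whose last row and last column are parity bits (all even):
grid = [[1, 0, 1, 0],
        [0, 1, 1, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0]]
grid[1][2] ^= 1            # inject a single-bit error
print(parity_check(grid))  # (1, 2): row 1 and column 2 identify the bit
```

Flipping the located bit back restores the array, which is exactly the correction step described above.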
Cyclic Redundancy Check (CRC)
Many applications need to detect burst errors
Why is CRC popular?
- Effectiveness: an n-bit CRC can detect all errors of fewer than n bits and a large portion of longer multi-bit errors
- Ease of hardware implementation: shift registers and XOR gates
How does it work?
- Consider the dataword and codeword as polynomials over GF(2)
- At the transmitter side, Codeword = Dataword × Generator
» The generator is a pre-defined CRC polynomial
» An example, the CRC-16 polynomial: G(X) = X¹⁶ + X¹⁵ + X² + 1
- At the receiver side, divide the codeword by the CRC polynomial and check whether the remainder is zero
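A bit-level sketch of the multiply-then-divide formulation above, with integer bit-vectors standing in for GF(2) polynomials (the helper names are my own; production CRCs usually use the systematic append-remainder form instead):

```python
def gf2_mul(a, b):
    """Carry-less (GF(2)) polynomial multiplication of bit-vectors as ints."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def gf2_mod(a, g):
    """Remainder of GF(2) polynomial division of a by g (long division)."""
    while a.bit_length() >= g.bit_length():
        a ^= g << (a.bit_length() - g.bit_length())
    return a

G = (1 << 16) | (1 << 15) | (1 << 2) | 1   # X^16 + X^15 + X^2 + 1

data = 0b1011001
codeword = gf2_mul(data, G)               # transmit Dataword * Generator
print(gf2_mod(codeword, G))               # 0: clean codeword passes the check
print(gf2_mod(codeword ^ 0b100, G) != 0)  # True: a flipped bit is caught
```

Because division is linear over GF(2), the remainder of a corrupted codeword equals the remainder of the error polynomial alone, which is why every burst shorter than the generator's degree is detected.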
Time Redundancy
Perform the execution multiple times (typically twice), and then compare the results
- Effective for transient faults
- Does it work for permanent errors?
Cost of time redundancy
- Performance cost: can we mitigate it?
- Energy cost: can we mitigate it?
Reversible Computation
Many operations are reversible
- Addition/subtraction, shift left/shift right, etc.
If applying the inverse operation does not recover the original operand, we know there's a problem
- Which operations are non-reversible?
- The devil is in the details
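A tiny illustrative sketch of the idea for addition (the function and fault model are assumptions, not from the slides): instead of re-executing the add, apply its inverse and check that the operand comes back.

```python
# Time redundancy via a reversible operation: verify an addition by
# subtracting one operand back and comparing with the other.
def checked_add(a, b):
    s = a + b
    if s - b != a:          # inverse operation must recover the operand
        raise RuntimeError("transient fault detected in adder")
    return s

print(checked_add(5, 7))    # 12
```

The appeal is that the check uses a different operation (and possibly different hardware) than the original computation, so a fault is unlikely to corrupt both consistently.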
With Redundancy, What Can We Do?
Forward Error Correction (FEC)
- Also known as forward error recovery (FER), although it's actually not recovery
- Use redundancy to mask error effects
- The system continues to go forward in the presence of errors
- Example: triple modular redundancy (TMR)
With Redundancy, What Can We Do?
Backward Error Recovery (BER)
- Use redundancy to recover from errors
- The system goes backward to a saved good state
- Example: periodic checkpoint and replay
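A minimal checkpoint-and-replay sketch; the state, the step function, and the injected fault are illustrative assumptions:

```python
import copy

# Backward error recovery: checkpoint the state before each step; on a
# detected error, roll back to the checkpoint and replay the step.
state = {"counter": 0}

def step(state, inject_fault=False):
    state["counter"] += 1
    if inject_fault:
        state["counter"] += 999               # model a transient corruption...
        raise RuntimeError("error detected")  # ...caught by some checker

for i in range(5):
    checkpoint = copy.deepcopy(state)     # saved good state
    try:
        step(state, inject_fault=(i == 3))
    except RuntimeError:
        state = copy.deepcopy(checkpoint)  # roll back ...
        step(state)                        # ... and replay
print(state["counter"])  # 5: all steps completed despite the fault
```

Note the tradeoff previewed in the next slide: the no-fault path only pays the (cheap) checkpointing cost, but recovery takes extra time when a fault does occur.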
The Impact of FEC vs. BER
                                 FEC                BER
Performance (when no faults)     Some degradation   Little degradation
Performance (when faults occur)  No change          Takes time for recovery
Energy cost                      Usually high       Usually low
Hardware cost                    High               Low
Design cost                      Low                High

When the failure rate is very high, which one is preferred?
Fault-Tolerance is NOT Free!
Fault-tolerance can be achieved to any arbitrary degree if you're willing to throw resources at it
- Canonical solutions have been around for a long time
Many FT solutions hurt performance, e.g.,
- Checkpoint and replay
- Tightly lockstepped redundant cores
- Redundant multithreading
Many FT solutions increase cost, e.g.,
- TMR and NMR
- RAID
- N-version programming
Almost all FT techniques increase energy consumption
Fault-Tolerance for Designers
Fault-tolerance is essentially redundancy allocation and management, and design is about tradeoffs!
As designers, smarter FT solutions can be obtained by:
- Knowing your enemies better (what causes the failure, failure rate, failure distribution, etc.)
- Knowing your design better (specific properties, anything "free", when and what to sacrifice, etc.)
[Diagram: design tradeoffs among performance, reliability, cost, and energy]
Levels of Fault-Tolerance
[Diagram: the system stack — hardware system (processor(s) with ALU, register file, and cache; hardware accelerators; flip-flops; memory system; interconnection network), (host) operating system / virtual machine monitor with drivers and guest OSes, and application software — annotated with FT techniques at each level]
- Circuit hardening
- Block-level redundancy; ECC for memory
- Core-level redundancy; dynamic verification
- Virtualization; task migration
- Redundant multithreading; fault-tolerant scheduling
- Application software: software redundancy
Lots of FT Buzzwords over Time …
- Reliability: continuation of service while being used
- Availability: readiness for use whenever needed
- Serviceability: ease of service or repair
- Safety: absence of catastrophic consequences
- Maintainability: ability to undergo modifications and repairs
- Survivability, confidentiality, accessibility, …
- Security: the degree of protection against danger, loss, and criminals
- Dependability: the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers (defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance)
You're right, I don't know what it exactly means … Who cares?
Designers Need Measures
A measure is a mathematical abstraction which expresses only some subset of the object's nature, i.e., its FT capability here
- Reliability, R(t): probability that the system is up during the whole interval [0, t]; for non-repairable products
- Availability, A(t): fraction of time the system is up during the interval [0, t]; for repairable products
» Point availability, Ap(t): probability that the system is up at time t
» Long-term availability: A = lim_{t→∞} A(t) = lim_{t→∞} Ap(t)
» People usually talk about "the 9's"
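For intuition (not from the slides), "the 9's" translate into yearly downtime as follows:

```python
# "The 9's": long-term availability expressed as allowed yearly downtime.
HOURS_PER_YEAR = 24 * 365

for nines in range(2, 6):
    a = 1 - 10**(-nines)                     # e.g., 3 nines -> 0.999
    downtime_min = (1 - a) * HOURS_PER_YEAR * 60
    print(f"{nines} nines ({a}): {downtime_min:8.1f} min/year downtime")
```

Three nines allow roughly 8.8 hours of downtime per year; five nines, about five minutes.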
Designers Need Intuitive Measures
- Mean Time To Failure, MTTF: average time the system remains up before it goes down and has to be repaired or replaced
» MTTF is about the mean only, so there is also nTTF
- Mean Time To Repair, MTTR
- Mean Time Between Failures, MTBF = MTTF + MTTR
- Failures In Time, FIT: number of failures per 10⁹ hours
- A = MTTF / (MTTF + MTTR) = MTTF / MTBF
Be careful about the assumptions behind these measures!
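A quick sketch of these relations (steady-state assumption; the function names are my own):

```python
# Long-term availability from MTTF and MTTR, and the FIT conversion.
def availability(mttf_hours, mttr_hours):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf_hours / (mttf_hours + mttr_hours)

def fit_from_mttf(mttf_hours):
    """Failures In Time: failures per 10^9 device-hours."""
    return 1e9 / mttf_hours

print(availability(999, 1))   # 0.999: one hour of repair per ~1000 h of uptime
print(fit_from_mttf(1e6))     # 1000.0 FIT for a 10^6-hour MTTF
```

Note the caveat above: these formulas assume failures and repairs are steady-state averages, which hides the actual failure distribution.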
More Detailed (Complex) Measures
The assumption of the system being in state "up" or "down" is very limiting
Example: multicore processors
- Let Pi = Prob{i processors are operational}
- Let c = computational capacity of a processor (e.g., the number of fixed-size tasks it can execute)
- Computational capacity of i processors: Ci = i·c
- Average computational capacity of the system: Σ_i Ci·Pi
Performability: consider everything from the perspective of the application
- The application is used to define "accomplishment levels" L1, L2, …, Ln, each representing a QoS level
- The measure is the vector (P(L1), P(L2), …, P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach accomplishment level Li
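A small numerical sketch of the average-capacity formula; the probability distribution below is an illustrative assumption, not from the slides:

```python
# Average computational capacity of a 4-core processor: P[i] is the
# probability that exactly i cores are operational, and each operational
# core can execute c fixed-size tasks.
c = 2.0                          # tasks per operational core (assumed)
P = {4: 0.90, 3: 0.07, 2: 0.02, 1: 0.005, 0: 0.005}  # assumed distribution
assert abs(sum(P.values()) - 1.0) < 1e-9             # sanity check

avg_capacity = sum(i * c * p for i, p in P.items())  # Sigma_i Ci * Pi
print(avg_capacity)
```

Unlike a plain up/down availability figure, this measure credits the degraded states (3, 2, or 1 working cores) with the partial service they still deliver.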
Example: Tandem for Transaction Processing
Design objectives:
- "Nonstop" operation
- Modular system expansion
FT design features:
- Loosely-coupled multi-computer architecture
- Hardware/software module fast-fail
- Error-correcting memory
- Error-detecting messages
- Watchdog timers
- …
Example: AT&T Switching Systems
Design objectives:
- High availability: 2 hours of downtime in 40 years
- Differentiated user aggravation levels
» Extremely low disconnection rate for established calls
» Low failure rate for call establishment
FT design features:
- Redundant processors
- 30% of control logic devoted to self-checking (for the 1981 3B20D processor)
- Various forms of EDC and ECC
- Watchdog timers
- Multiple levels of fault recovery
- …
Example: Personal Computer
Design objectives:
- Fast and cheap
- Occasional corruption is tolerable
- Expected lifetime: a couple of years
FT design features:
- ECC for memory and hard disk
- …
More FT features will be in place for commodity ICs in the near future due to increasing reliability threats; the key is cost-effectiveness!