CMSC 5719 MSc Seminar Fault-Tolerant Computing


Transcript of CMSC 5719 MSc Seminar Fault-Tolerant Computing

Page 1: CMSC 5719  MSc Seminar Fault-Tolerant Computing

CMSC 5719 MSc Seminar
Fault-Tolerant Computing

XU, Qiang (Johnny) 徐強
CUHK, Fall 2012

[Partly adapted from Koren & Krishna, and B. Parhami slides]

Page 2: CMSC 5719  MSc Seminar Fault-Tolerant Computing


Why Learn This Stuff?

Page 3: CMSC 5719  MSc Seminar Fault-Tolerant Computing


Outline

Motivation
Fault classification
Redundancy
Metrics for reliability
Case studies

Page 4: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Fault-Tolerance - Basic Definition

Fault-tolerant systems - ideally, systems capable of executing their tasks correctly regardless of hardware failures or software errors

In practice - we can never guarantee flawless execution of tasks under all circumstances

We therefore limit ourselves to the types of failures and errors that are most likely to occur

Page 5: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Need for Fault-Tolerance

Critical applications require extreme fault tolerance (e.g., aircraft, nuclear reactors, medical equipment, and financial applications)
A malfunction of a computer in such applications can lead to catastrophe
The probability of failure must be extremely low, possibly one in a billion per hour of operation

Systems operating in harsh environments face high failure probabilities: electromagnetic disturbances, particle hits, and the like

Complex systems consist of millions of devices

Page 6: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Get to Know the Enemy: What Causes Faults?

Manufacturing defects

Aging (a.k.a. circuit wear-out)

Page 7: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Get to Know the Enemy: What Causes Faults?

Internal electronic noise
Electromagnetic interference

Page 8: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Get to Know the Enemy: What Causes Faults?

Bugs …
Malicious attacks (beyond the scope of this seminar)

Page 9: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Fault Classification According to Duration

Permanent faults - never go away; the component has to be repaired or replaced

Transient faults - disappear after a relatively short time
Example - a memory cell whose contents are changed by some electromagnetic interference
Overwriting the memory cell with the right content makes the fault go away

Intermittent faults - cycle between active and benign states
Example - a loose connection
An increasing threat, largely due to temperature and voltage fluctuations

Page 10: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Failures during Lifetime

Three phases of system lifetime:
Infant mortality (imperfect test, weak components)
Normal lifetime (transient/intermittent faults)
Wear-out period (circuit aging)

Page 11: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Seriously, Why Is Fault-Tolerance Coming Back?

Simply put, it is technology-driven

[Figure: transistor cost, reliability cost, and total cost plotted against time]

With technology scaling, today's chips are extremely complex (billions of transistors running with less noise margin) and are much hotter!

We cannot afford heavyweight, macro-scale redundancy for commodity computing systems.

Page 12: CMSC 5719  MSc Seminar Fault-Tolerant Computing

The Impact of Technology Scaling

More leakage, more process variability, smaller critical charges, weaker transistors and wires - these lead to:
Burn-in test less effective
Higher random failure rate
Faster wear-out

Page 13: CMSC 5719  MSc Seminar Fault-Tolerant Computing

What Can We Do When Confronting the Enemies?

Surrender, but don't become a traitor
Fail, but safely, i.e., don't corrupt anything (e.g., an ATM machine)
Not as easy as you may think - you have to detect faults first!

Weaken the enemies
Fault-avoidance and fault-removal
» Process improvement to reduce threats
» Testing and DfT to remove defective circuits
» Careful design reviews to remove design bugs
» More training to reduce operator errors
Some faults can never be avoided or removed completely

Make yourself stronger
Fault-tolerance
» Add redundancy to detect, diagnose, confine, mask, compensate for, and recover from faults
» Mind the cost in terms of hardware, power, and performance
Fault-evasion (a.k.a. fault-prediction)
» Observe, learn, and take pre-emptive steps to stop faults from occurring

Page 14: CMSC 5719  MSc Seminar Fault-Tolerant Computing

A Motivating Case Study: Data Availability and Integrity Concerns

Distributed DB system with 5 sites
Full connectivity, dedicated links
Only direct communication allowed
Sites and links may malfunction

Redundancy improves availability

[Figure: five sites S0-S4, fully connected by ten dedicated links L0-L9; the user accesses a file Fi stored in the system]

S: probability of a site being available
L: probability of a link being available

Data replication methods, and a challenge:
File duplication: home / mirror sites
File triplication: home / backup 1 / backup 2
Are there availability improvement methods with less redundancy?

Single-copy availability = SL
Unavailability = 1 – SL = 1 – 0.99 × 0.95 = 5.95%

Page 15: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Data Duplication: Home and Mirror Sites

[Figure: the same five-site network; file Fi is stored at a home site and a mirror site]

Data unavailability reduced from 5.95% to 0.35%
Availability improved from 94% to 99.65%

Duplicated availability = 2SL – (SL)²
Unavailability = 1 – 2SL + (SL)² = (1 – SL)² = 0.35%

A = SL + (1 – SL)SL
(first term: the primary site can be reached; second term: the primary site is inaccessible but the mirror site can be reached)

S: site availability, e.g., 99%
L: link availability, e.g., 95%

Page 16: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Data Triplication: Home and Two Backups

[Figure: the same five-site network; file Fi is stored at a home site and two backup sites]

Data unavailability reduced from 5.95% to 0.02%
Availability improved from 94% to 99.98%

Triplicated availability = 3SL – 3(SL)² + (SL)³
Unavailability = 1 – 3SL + 3(SL)² – (SL)³ = (1 – SL)³ = 0.02%

A = SL + (1 – SL)SL + (1 – SL)²SL
(terms: the primary site can be reached; the primary is inaccessible but backup 1 can be reached; the primary and backup 1 are inaccessible but backup 2 can be reached)

S: site availability, e.g., 99%
L: link availability, e.g., 95%

Page 17: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Data Dispersion: Three of Five Pieces

[Figure: the same five-site network; file Fi is encoded into five pieces (piece 0 through piece 4), any three of which suffice to reconstruct it]

Scheme          Nonredund.   Duplication   Triplication   Dispersion
Unavailability  5.95%        0.35%         0.02%          0.08%
Redundancy      0%           100%          200%           67%

Dispersed availability = 6(SL)² – 8(SL)³ + 3(SL)⁴
Availability = 99.92%
Unavailability = 1 – availability = 0.08%

A = (SL)⁴ + 4(1 – SL)(SL)³ + 6(1 – SL)²(SL)²
(terms: all 4 remote pieces can be reached; exactly 3 can be reached; only 2 can be reached - the user's own site holds the fifth piece, so any two remote pieces complete the three needed)

S: site availability, e.g., 99%
L: link availability, e.g., 95%
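The four schemes above are easy to sanity-check numerically. Below is a small sketch (my addition, not from the slides) that recomputes each unavailability from S = 0.99 and L = 0.95; the variable names are invented:

```python
S, L = 0.99, 0.95
x = S * L  # probability that one remote site is reachable over its link

single = 1 - x                                   # ~5.95%
duplicated = (1 - x) ** 2                        # ~0.35%
triplicated = (1 - x) ** 3                       # ~0.02%
# Dispersion: need at least 2 of the 4 remote pieces (the fifth is local).
dispersed = 1 - (x**4 + 4 * (1 - x) * x**3 + 6 * (1 - x)**2 * x**2)  # ~0.08%

for name, u in [("single", single), ("duplication", duplicated),
                ("triplication", triplicated), ("dispersion", dispersed)]:
    print(f"{name:>12}: unavailability = {u:.4%}")
```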

Page 18: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Questions Ignored in Our Simple Example

1. How redundant copies of data are kept consistent
When a user modifies the data, how do we update the redundant copies (pieces) quickly and prevent the use of stale data in the meantime?

2. How malfunctioning sites and links are identified
Malfunction diagnosis must be quick to avoid data contamination

3. How recovery is accomplished when a malfunctioning site/link returns to service after repair
The returning site must be brought up to date with regard to changes

4. How data corrupted by the actions of an adversary is detected
This is more difficult than detecting random malfunctions

The example does demonstrate, however, that:
Many alternatives are available for improving dependability
Proposed methods must be assessed through modeling
The most cost-effective solution may be far from obvious

Page 19: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Redundancy

Redundancy is at the heart of fault-tolerance
Incorporation of extra components in the design of a system to improve its reliability

Four forms of redundancy:
Hardware redundancy (spatial redundancy)
» Static, dynamic, and hybrid redundancy
Software redundancy
» N-version programming
Information redundancy
» Error detecting and correcting codes
» Usually requires extra hardware for processing
Time redundancy
» Re-execution

Page 20: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Physical Redundancy

Physically replicate modules
Effective for all sorts of faults
Mind the area/energy overhead

Design issues:
How many copies?
How to detect faults?
How to recover from faults?
How to organize redundancy (passive, active, or hybrid)?

Page 21: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Triple Modular Redundancy (TMR)

The best-known FT technique
Tolerates a single error (soft or hard) in any one module
Low performance overhead
Simple design
Very high cost in terms of area and energy

To tolerate multiple simultaneous faults, we can resort to N-modular redundancy (NMR)
N is an odd integer
Tolerates up to (N–1)/2 faulty modules

Single point of failure at the voter
The voter is typically small and hence often assumed to be very reliable
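The majority vote itself is tiny, which is why assuming a reliable voter is plausible. A minimal sketch (my addition, not from the slides; `vote` is an invented name) of a bitwise 2-of-3 voter:

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit matches at least two inputs."""
    return (a & b) | (b & c) | (a & c)

# A single faulty module output is simply outvoted:
assert vote(0b1010, 0b1010, 0b0110) == 0b1010
```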

Page 22: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Reliability of TMR Systems

M-of-N system with M = 2, N = 3 - the system is good if at least two modules are operational

A voter picks the majority output

The voter itself can fail - denote its reliability by $R_{vot}(t)$:

$R_{TMR}(t) = R_{vot}(t)\sum_{i=2}^{3}\binom{3}{i}R(t)^i\big(1-R(t)\big)^{3-i} = R_{vot}(t)\big(3R^2(t) - 2R^3(t)\big)$
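To see what the formula implies, here is a small numeric sketch (my addition; the failure rate and mission times are made-up illustrative values) for modules with exponentially distributed lifetimes:

```python
import math

def r_module(t: float, lam: float) -> float:
    """Module reliability with constant failure rate lam: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def r_tmr(t: float, lam: float, r_voter: float = 1.0) -> float:
    """R_TMR(t) = R_vot(t) * (3 R(t)^2 - 2 R(t)^3)."""
    r = r_module(t, lam)
    return r_voter * (3 * r**2 - 2 * r**3)

lam = 1e-4  # failures per hour (illustrative)
for t in (1_000, 5_000, 10_000):
    print(f"t={t:>6}h  simplex={r_module(t, lam):.4f}  TMR={r_tmr(t, lam):.4f}")
# TMR wins while R(t) > 0.5; beyond t = ln(2)/lam a single module is
# actually more reliable, since two of three modules must survive.
```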

Page 23: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Triplicated Processor/Memory System

All communications (in either direction) between the triplicated processors and triplicated memories go through majority voting

This gives higher reliability than voting only once on the outputs of the triplicated processor/memory structure

Page 24: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Design Redundancy

Use diverse designs to furnish the same service
Another kind of physical redundancy

Advantages:
Protection against design deficiencies
Lower cost with a simple "back-up" unit

Page 25: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Watchdog Processor

Performs concurrent system-level error detection
Monitors the bus connecting the main processor and memory
Targets control-flow checking: are the correct program blocks executed in the right order?
Can detect hardware/software faults that cause erroneous instructions to be executed or wrong execution paths to be taken
The watchdog needs the program's control-flow information
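As a toy model of the idea (my sketch, not the actual watchdog hardware), the watchdog can hold the program's control-flow graph and flag any observed block-to-block transition the graph does not allow; the CFG and block names below are invented:

```python
# Hypothetical control-flow graph: basic block -> set of legal successor blocks.
CFG = {"entry": {"A"}, "A": {"B", "C"}, "B": {"exit"}, "C": {"A", "exit"}}

def watchdog_check(trace: list[str]) -> bool:
    """Return True iff every observed transition is allowed by the CFG."""
    return all(nxt in CFG.get(cur, set()) for cur, nxt in zip(trace, trace[1:]))

assert watchdog_check(["entry", "A", "C", "A", "B", "exit"])  # legal path
assert not watchdog_check(["entry", "A", "exit"])             # illegal jump caught
```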

Page 26: CMSC 5719  MSc Seminar Fault-Tolerant Computing

DIVA: Dynamic Implementation Verification Architecture

The core's computation, communication, and control are validated by a checker

The checker relaxes the burden of correctness on the core processor

Key checker requirements: simple, fast, and reliable

Page 27: CMSC 5719  MSc Seminar Fault-Tolerant Computing

N-Version Programming

N independent teams of programmers develop software to the same specifications - the N versions are run in parallel and their outputs are voted on

If the programs are developed independently, it is very unlikely that they will fail on the same inputs

Assumption - failures are statistically independent; the probability of failure of an individual version is q

Probability of no more than m failures out of N versions:

$P = \sum_{i=0}^{m}\binom{N}{i}\,q^i\,(1-q)^{N-i}$

What are the limitations?
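A quick numeric check of the formula (my sketch; `p_at_most_m_fail` is an invented helper), for 3 versions under majority voting, which tolerates at most one failed version:

```python
from math import comb

def p_at_most_m_fail(n: int, m: int, q: float) -> float:
    """P(no more than m of n independent versions fail), each failing with prob q."""
    return sum(comb(n, i) * q**i * (1 - q)**(n - i) for i in range(m + 1))

print(p_at_most_m_fail(n=3, m=1, q=0.01))  # ~0.9997, vs. 0.99 for one version
```

The independence assumption is exactly what the "limitations" question is probing: in practice, separate teams tend to make correlated mistakes on the same hard inputs.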

Page 28: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Information Redundancy - Coding

A data word with d bits is encoded into a codeword with c bits, c > d
Not all 2^c combinations are valid codewords
To extract the original data, the c bits must be decoded
If the c bits do not constitute a valid codeword, an error is detected
For certain encoding schemes, some types of errors can also be corrected

Key parameters of a code:
Number of erroneous bits that can be detected
Number of erroneous bits that can be corrected

Overhead of a code:
Additional bits required
Additional hardware/latency for encoding and decoding

Page 29: CMSC 5719  MSc Seminar Fault-Tolerant Computing


Hamming Distance

The Hamming distance between two codewords - the number of bit positions in which the two words differ

A Hamming distance of two between two codewords implies that a single bit error will not change one of the codewords into the other

Page 30: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Distance of a Code

The distance of a code - the minimum Hamming distance between any two valid codewords

Example - a code with the four codewords {001, 010, 100, 111} has a distance of 2
It can detect any single-bit error

Example - a code with the two codewords {000, 111} has a distance of 3
It can detect any single- or double-bit error
If double-bit errors are unlikely to happen, the code can instead correct any single-bit error
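Both examples are easy to verify mechanically. A small sketch (my addition; the function names are invented):

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(codewords: list[str]) -> int:
    """Minimum Hamming distance over all pairs of valid codewords."""
    return min(hamming(a, b) for a, b in combinations(codewords, 2))

assert code_distance(["001", "010", "100", "111"]) == 2  # detects 1-bit errors
assert code_distance(["000", "111"]) == 3                # corrects 1-bit errors
```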

Page 31: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Coding vs. Redundancy

The code {000, 111} can be used to encode a single data bit
0 can be encoded as 000 and 1 as 111
This code is identical to TMR

The code {00, 11} can also be used to encode a single data bit
0 can be encoded as 00 and 1 as 11
This code is identical to a duplex

To detect up to k bit errors, the code distance must be at least k+1
To correct up to k bit errors, the code distance must be at least 2k+1

Page 32: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Separability of a Code

A code is separable if it has separate fields for the data and the code bits
Decoding consists of disregarding the code bits
The code bits can be processed separately to verify the correctness of the data

A non-separable code has the data and code bits integrated together - extracting the data from the encoded word requires some processing

The simplest separable codes are the parity codes
A parity code has a distance of 2
It can detect all odd-bit errors
Even or odd parity code?

Page 33: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Error-Correcting Parity Codes

Simplest scheme - the data is organized in a 2-dimensional array
A bit at the end of each row gives the parity over that row
A bit at the bottom of each column gives the parity over that column

A single-bit error anywhere will make one row parity and one column parity erroneous
This identifies a unique erroneous bit

This is an example of overlapping parity - each bit is covered by more than one parity bit
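A minimal sketch of the row/column locating trick (my addition; the data values are arbitrary):

```python
def parities(grid):
    """Even parity of each row and each column of a 2-D bit array."""
    rows = [sum(r) % 2 for r in grid]
    cols = [sum(c) % 2 for c in zip(*grid)]
    return rows, cols

data = [[1, 0, 1],
        [0, 1, 1],
        [1, 1, 0]]
stored_rows, stored_cols = parities(data)  # parity bits stored with the data

data[1][2] ^= 1                            # inject a single-bit error
rows, cols = parities(data)
bad_row = [i for i, (a, b) in enumerate(zip(stored_rows, rows)) if a != b]
bad_col = [j for j, (a, b) in enumerate(zip(stored_cols, cols)) if a != b]
print(bad_row, bad_col)                    # [1] [2]: flip data[1][2] to correct
```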

Page 34: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Cyclic Redundancy Check (CRC)

Many applications need to detect burst errors
Why is CRC popular?
Effectiveness: an n-bit CRC check can detect all burst errors of fewer than n bits and a large portion of longer multi-bit errors
Ease of hardware implementation: shift registers and XORs

How does it work?
Consider the dataword and codeword as polynomials with binary coefficients
At the transmitter side, Codeword = Dataword × Generator
» The generator is a pre-defined CRC polynomial
» An example CRC-16 polynomial: G(X) = X¹⁶ + X¹⁵ + X² + 1
At the receiver side, divide the Codeword by the generator polynomial and check whether the remainder is zero
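A minimal software sketch (my addition; it uses the common systematic variant, appending the 16-bit remainder to the data rather than literally multiplying, so the receiver's divide-and-check still yields zero):

```python
GEN = (1 << 16) | (1 << 15) | (1 << 2) | 1      # G(X) = X^16 + X^15 + X^2 + 1

def gf2_mod(value: int, gen: int = GEN) -> int:
    """Remainder of value(X) divided by gen(X), polynomial arithmetic mod 2."""
    while value.bit_length() >= gen.bit_length():
        value ^= gen << (value.bit_length() - gen.bit_length())
    return value

data = 0b1101011011
codeword = (data << 16) | gf2_mod(data << 16)   # data followed by its CRC field

assert gf2_mod(codeword) == 0                   # receiver: remainder zero, accept
assert gf2_mod(codeword ^ (1 << 5)) != 0        # a single-bit error is detected
```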

Page 35: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Time Redundancy

Perform the execution multiple times (typically twice), and then compare the results
Effective for transient faults
Does it work for permanent faults?

Cost of time redundancy:
Performance cost - can we mitigate it?
Energy cost - can we mitigate it?

Page 36: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Reversible Computation

Many operations are reversible
Addition/subtraction; shift left/shift right; etc.

If reversing an operation does not recover the original operand, we know there's a problem
Which operations are non-reversible?
The devil is in the details
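A toy sketch of the reverse-and-compare idea (my addition; in hardware the inverse operation would run on a separate or reused functional unit, so a fault in the forward path is unlikely to cancel out):

```python
def checked_add(a: int, b: int) -> int:
    """Compute a + b, then verify the result by applying the inverse operation."""
    s = a + b
    if s - b != a:  # the reverse (subtraction) should recover the operand
        raise RuntimeError("fault detected in addition")
    return s

print(checked_add(41, 1))  # 42; a transient fault in the adder would be flagged
```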

Page 37: CMSC 5719  MSc Seminar Fault-Tolerant Computing

With Redundancy, What Can We Do?

Forward Error Correction (FEC)
Also known as forward error recovery (FER), although it's actually not recovery

Use redundancy to mask error effects
The system continues to go forward in the presence of errors

Example: triple modular redundancy (TMR)

Page 38: CMSC 5719  MSc Seminar Fault-Tolerant Computing

With Redundancy, What Can We Do?

Backward Error Recovery (BER)
Use redundancy to recover from errors
The system goes backward to a saved good state

Example: periodic checkpoint and replay
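A toy sketch of the checkpoint/rollback/replay control flow (my addition; the "transactions" and the injected fault are invented for illustration):

```python
import copy

def apply_tx(state, tx, corrupt=False):
    state["balance"] += tx + (999 if corrupt else 0)  # corruption models a fault

state = {"balance": 100}
checkpoint = copy.deepcopy(state)                 # saved good state
for i, tx in enumerate((+10, -30, +5)):
    apply_tx(state, tx, corrupt=(i == 1))         # inject a fault on the 2nd step
    if state["balance"] != checkpoint["balance"] + tx:  # error detection check
        state = copy.deepcopy(checkpoint)         # backward: restore good state
        apply_tx(state, tx)                       # replay the step fault-free
    checkpoint = copy.deepcopy(state)             # commit a new checkpoint
print(state["balance"])                           # 85: the fault was recovered
```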

Page 39: CMSC 5719  MSc Seminar Fault-Tolerant Computing

The Impact of FEC vs. BER

Dimensions along which to compare FEC and BER:
Performance (when no faults occur)
Performance (when faults occur)
Energy cost
Hardware cost
Design cost

Page 40: CMSC 5719  MSc Seminar Fault-Tolerant Computing

The Impact of FEC vs. BER

                            FEC                 BER
Performance (no faults)     Some degradation    Little degradation
Performance (faults occur)  No change           Takes time for recovery
Energy cost                 Usually high        Usually low
Hardware cost               High                Low
Design cost                 Low                 High

When the failure rate is very high, which one is preferred?

Page 41: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Fault-Tolerance is NOT Free!

Fault-tolerance can be achieved to any arbitrary degree if you're willing to throw resources at it
Canonical solutions have been around for a long time

Many FT solutions hurt performance, e.g.,
Checkpoint and replay
Tightly lockstepped redundant cores
Redundant multithreading

Many FT solutions increase cost, e.g.,
TMR and NMR
RAID
N-version programming

Almost all FT techniques increase energy consumption

Page 42: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Fault-Tolerance for Designers

Fault-tolerance is essentially redundancy allocation and management, and design is about tradeoffs!

As designers, we can obtain smarter FT solutions by
Knowing our enemies better (what causes the failure, the failure rate, the failure distribution, etc.)
Knowing our design better (specific properties, anything "free", when and what to sacrifice, etc.)

[Figure: the design tradeoff space spanning performance, reliability, cost, and energy]

Page 43: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Levels of Fault-Tolerance

[Figure: system stack with fault-tolerance techniques at each level]

Hardware system - processor(s) (ALU, register file, cache), hardware accelerators, memory system, interconnection network:
Circuit hardening
Block-level redundancy
ECC for memory
Core-level redundancy
Dynamic verification

(Host) operating system, virtual machine monitor, drivers, guest OSes:
Virtualization
Task migration
Redundant multithreading
Fault-tolerant scheduling

Application software:
Software redundancy

Page 44: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Lots of FT Buzzwords over Time …

Reliability - continuation of service while being used
Availability - readiness for use whenever needed
Serviceability - ease of service or repair

Safety - absence of catastrophic consequences
Maintainability - ability to undergo modifications and repairs
Survivability, confidentiality, accessibility, …

Security - the degree of protection against danger, loss, and criminals

Dependability - "the trustworthiness of a computing system which allows reliance to be justifiably placed on the service it delivers" (as defined by the IFIP 10.4 Working Group on Dependable Computing and Fault Tolerance)
You're right, I don't know exactly what it means … Who cares?

Page 45: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Designers Need Measures

A measure is a mathematical abstraction that expresses only some subset of an object's nature - here, its FT capability

Reliability, R(t): the probability that the system is up during the whole interval [0, t] - for non-repairable products

Availability, A(t): the fraction of time the system is up during the interval [0, t] - for repairable products
Point availability, A_p(t): the probability that the system is up at time t
Long-term availability: $A = \lim_{t\to\infty} A(t) = \lim_{t\to\infty} A_p(t)$

People usually talk about "the 9's"

Page 46: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Designers Need Intuitive Measures

Mean Time To Failure, MTTF: the average time the system remains up before it goes down and has to be repaired or replaced
MTTF is about the mean only, so there is also nTTF

Mean Time To Repair, MTTR
Mean Time Between Failures, MTBF = MTTF + MTTR

Failures in Time, FIT: the number of failures per 10⁹ hours

$A = \dfrac{MTTF}{MTTF + MTTR} = \dfrac{MTTF}{MTBF}$

Be careful about the assumptions behind these measures!
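For intuition, a tiny sketch (my addition; the numbers are made up) turning MTTF/MTTR into "the 9's" and expected downtime:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Long-term availability A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf_hours / (mttf_hours + mttr_hours)

A = availability(mttf_hours=10_000, mttr_hours=1)
print(f"A = {A:.5f}")                             # 0.99990 -> roughly four 9's
print(f"downtime/year = {(1 - A) * 8760:.2f} h")  # ~0.88 hours per year
```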

Page 47: CMSC 5719  MSc Seminar Fault-Tolerant Computing

More Detailed (Complex) Measures

The assumption of the system being in state "up" or "down" is very limiting
Example: multicore processors
Let P_i = Prob{i processors are operational}
Let c = the computational capacity of a processor (e.g., the number of fixed-size tasks it can execute)
Computational capacity of i processors: C_i = i·c
Average computational capacity of the system: $\bar{C} = \sum_i C_i P_i$

Performability: consider everything from the perspective of the application
The application is used to define "accomplishment levels" L1, L2, ..., Ln, each representing a QoS level
Performability is the vector (P(L1), P(L2), ..., P(Ln)), where P(Li) is the probability that the computer functions well enough to permit the application to reach accomplishment level Li
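A small numeric sketch of the average-capacity measure (my addition; it assumes independent core failures so that P_i is binomial, which the slide itself does not require):

```python
from math import comb

def avg_capacity(n: int, p: float, c: float) -> float:
    """Sum over i of C_i * P_i, with C_i = i*c and binomial P_i."""
    return sum(i * c * comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1))

print(avg_capacity(n=4, p=0.95, c=10.0))  # 38.0 tasks: 4 cores * 0.95 * 10
```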

Page 48: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Example: Tandem for Transaction Processing

Design objectives:
"Nonstop" operation
Modular system expansion

FT design features:
Loosely-coupled multi-computer architecture
Fail-fast hardware/software modules
Error-correcting memory
Error-detecting messages
Watchdog timers
…

Page 49: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Example: AT&T Switching Systems

Design objectives:
High availability: 2 hours of downtime in 40 years
Differentiated user aggravation levels
» Extremely low disconnection rate for established calls
» Low failure rate for call establishment

FT design features:
Redundant processors
30% of control logic devoted to self-checking (for the 1981 3B20D processor)
Various forms of EDC and ECC
Watchdog timers
Multiple levels of fault recovery
…

Page 50: CMSC 5719  MSc Seminar Fault-Tolerant Computing

Example: Personal Computer

Design objectives:
Fast and cheap
Occasional corruption is tolerable
Expected lifetime: a couple of years

FT design features:
ECC for memory and hard disk
…

More FT features will be in place for commodity ICs in the near future due to increasing reliability threats - the key is cost-effectiveness!