Reliability and Fault Tolerance

Setha Pan-ngum


Transcript of Reliability and Fault Tolerance

Page 1: Reliability and Fault Tolerance

Reliability and Fault Tolerance

Setha Pan-ngum

Page 2: Reliability and Fault Tolerance

Introduction

From the survey by the American Society for Quality Control [1]: the ten most important product attributes.

Attribute                          Ave. score   Attribute           Ave. score
Performance                        9.5          Ease of use         8.3
Lasts a long time (reliability)    9.0          Appearance          7.7
Service                            8.9          Brand name          6.3
Easily repaired (maintainability)  8.8          Packaging/display   5.8
Warranty                           8.4          Latest model        5.4

Page 3: Reliability and Fault Tolerance

Introduction

Embedded system major requirements
  – Low failure rate
  – Leads to fault tolerance design
  – Gracefully degradable

Page 4: Reliability and Fault Tolerance

Failures, errors, faults

Fault – a defect that causes a malfunction
  – Hardware fault, e.g. broken wire, stuck logic
  – Software fault, e.g. bug
Error – an unintended state caused by a fault, e.g. a software bug leads to a wrong calculation and hence a wrong output
Failure – an error leads to system failure (the system operates differently from what was intended)

Page 5: Reliability and Fault Tolerance

Causes of Failures

  – Errors in specification or design
  – Component defects
  – Environmental effects

Page 6: Reliability and Fault Tolerance

Errors in specification or design

Probably the hardest to detect.
Embedded system development:
  – Specification
  – Design
  – Implementation
If the specification is wrong, all the following steps will be wrong, e.g. the unit compatibility problem in the rocket example.

Page 7: Reliability and Fault Tolerance

Component defects

Depends on the device.
Electronic components can have defects from manufacturing, and from wear and tear.

Page 8: Reliability and Fault Tolerance

Operating environment

  – Stresses
  – Temperatures
  – Moisture
  – Vibration

Page 9: Reliability and Fault Tolerance

Classification of failures

Nature
  – Value – incorrect output
  – Timing – correct output, but too late
Perception – as seen by users
  – Consistent – all users see the same result, e.g. a sensor reading stuck at ‘0’
  – Inconsistent – users see different results, e.g. a sensor reading floats (say between 1 and 3 V) and could be seen as ‘1’ or ‘0’. Called malicious or Byzantine failures.
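The floating-sensor case can be made concrete with a minimal sketch. It assumes two receivers on the same line whose input thresholds differ slightly; the threshold values and function names are hypothetical and chosen only for illustration.

```python
def read_as_bit(voltage, threshold):
    """Interpret an analogue level as a logic bit using one receiver's threshold."""
    return 1 if voltage >= threshold else 0

# Hypothetical input thresholds (in volts) of two receivers on the same line.
THRESHOLD_A = 1.5
THRESHOLD_B = 2.5

def views(voltage):
    """Return the bit seen by receiver A and the bit seen by receiver B."""
    return read_as_bit(voltage, THRESHOLD_A), read_as_bit(voltage, THRESHOLD_B)

print(views(0.0))  # level stuck low: both receivers see the same bit
print(views(2.0))  # floating level: the two receivers disagree
```

A stuck level is wrong but seen identically by everyone, whereas the floating level gives each user a different picture, which is what makes this class of failure hard to handle.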

Page 10: Reliability and Fault Tolerance

Classification of failures

Effects
  – Benign – not serious, e.g. broken TV
  – Malign – serious, e.g. plane crash
Oftenness
  – Permanent – broken equipment
  – Transient – loose wire, processors under stress (EMI, power supply, radiation)
  – Transient failures occur a lot more often!

Page 11: Reliability and Fault Tolerance

Example of transient failure

From a report on the fire control radar of F-16 fighters [3]
  – Pilots noticed malfunctions every 6 hrs
  – Pilots requested maintenance every 31 hrs
  – 1/3 of the requests can be reproduced in the workshop
  – Overall, less than 10% of transient failures can be reproduced!

Page 12: Reliability and Fault Tolerance

Types of errors

Transient
  – Occurs regularly, e.g. an electrical glitch causes a temporary value error
Permanent
  – An error caused by a transient fault can be stored in a database, making it permanent.

Page 13: Reliability and Fault Tolerance

Classifications of faults

Nature
  – By chance – broken wire
  – Intentional – virus
Perception
  – Physical
  – Design
Boundary
  – Internal – component breakdown
  – External – EMI causes faults

Page 14: Reliability and Fault Tolerance

Classifications of faults

Origin
  – Development, e.g. in program or device
  – Operation, e.g. user entering wrong input
Persistence
  – Transient – glitches caused by lightning
  – Permanent – faults that need repair

Page 15: Reliability and Fault Tolerance

Definitions

Reliability R(t)
  – Probability that a system will perform its intended function in the specified environment up to time t.
Maintainability M(t)
  – Probability that a system can be restored within t time units after a failure.
Availability A(t)
  – Probability that a system is available to perform the specified service at time t (within an interval dt), i.e. the percentage of time the system is working.

Page 16: Reliability and Fault Tolerance

Reliability [4]

> $R(0) = 1$, $R(\infty) = 0$
> Failure density $f(t) = -dR(t)/dt$
> Failure rate $\lambda(t) = f(t)/R(t)$
  $\lambda(t)\,dt$ is the conditional probability that a system will fail in the interval $dt$, provided it has been operational at the beginning of this interval
> When $\lambda(t) = \lambda$ = constant, then $R(t) = e^{-\lambda t}$
  $1/\lambda$ = MTTF (Mean Time to Failure)
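As a rough numerical illustration of the constant failure rate case, the sketch below evaluates the reliability R(t) = exp(-λt) and the MTTF = 1/λ, and recovers the failure rate numerically from f(t)/R(t). The value of λ (written LAM) is an assumed figure, not one taken from the slides.

```python
import math

def reliability(t, lam):
    """R(t) for a constant failure rate lam: R(t) = exp(-lam * t)."""
    return math.exp(-lam * t)

def mttf(lam):
    """Mean time to failure for a constant failure rate: 1 / lam."""
    return 1.0 / lam

def failure_rate(t, lam, dt=1.0):
    """Recover lambda(t) = f(t) / R(t), with f(t) = -dR/dt estimated
    by a central difference of width 2 * dt."""
    f = -(reliability(t + dt, lam) - reliability(t - dt, lam)) / (2 * dt)
    return f / reliability(t, lam)

LAM = 1e-4  # assumed failure rate, in failures per hour

print(reliability(0, LAM))       # a new system is certainly working
print(reliability(1000, LAM))    # probability of surviving 1000 hours
print(mttf(LAM))                 # mean time to failure, in hours
print(failure_rate(5000, LAM))   # stays close to LAM at any t
```

Note that R(MTTF) = exp(-1), about 0.37, so under this model most units have already failed by the time the MTTF is reached.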

Page 17: Reliability and Fault Tolerance

Failure rate

[Figure: failure rate $\lambda(t)$ versus real time (bathtub curve): early failures during burn-in, a period of constant failure rate, and late failures during wear-out.]

Page 18: Reliability and Fault Tolerance

Failure rate vs Costs [4]

[Figure: failure rate $\lambda(t)$ versus cost of system.]

US Air Force: the failure rate of electronic systems within a given technology increases with increasing system cost.

Page 19: Reliability and Fault Tolerance

Maintainability

> Measured by the repair rate $\mu$
> When $\mu(t) = \mu$ = constant, then $M(t) = 1 - e^{-\mu t}$
  $1/\mu$ = MTTR (Mean Time to Repair)
> Preventive maintenance:
  – If $\lambda$ increases in time, then it makes sense to replace the aging unit.
  – If $\lambda$ of different units evolves differently, preventive maintenance consists of replacing the “Smallest Replaceable Units” (SRUs) with growing $\lambda$.

Page 20: Reliability and Fault Tolerance

Reliability vs. Maintainability

> Reliability and maintainability are, to a certain extent, conflicting goals.
> Example: Connectors

                   Plug    Solder
  Reliability      bad     good
  Maintainability  good    bad

> Inside an SRU, reliability must be optimized
> Between SRUs, maintainability is important

Page 21: Reliability and Fault Tolerance

Availability

> A = MTTF / (MTTF + MTTR)
> Good availability can be achieved either
  – by a high MTTF
  – by a small MTTR

A high system MTTF can be achieved by means of fault tolerance: the system continues to operate properly even when some components have failed.

Fault tolerance also reduces the MTTR requirements.
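A small worked example of the availability formula, with made-up MTTF and MTTR figures in hours (assumptions for illustration only). It shows that a tenfold increase in MTTF and a tenfold reduction in MTTR give the same availability.

```python
def availability(mttf, mttr):
    """Steady-state availability A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Hypothetical figures, in hours.
baseline = availability(mttf=1000, mttr=10)
more_reliable = availability(mttf=10000, mttr=10)   # ten times higher MTTF
faster_repair = availability(mttf=1000, mttr=1)     # ten times smaller MTTR

print(baseline)
print(more_reliable)
print(faster_repair)
```

Only the ratio MTTR / MTTF matters for A, which is why either route leads to good availability.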

Page 22: Reliability and Fault Tolerance

Fault tolerance

Obtained through redundancy (more resources assigned to a task than strictly required).

Redundancy
  can be used for
    – fault detection
    – fault correction
  can be implemented at various levels
    – at component level
    – at processor level
    – at system level

Page 23: Reliability and Fault Tolerance

Redundancy at component level

Error detection/correction in memories
  – Error detection by parity bit.
  – Error correction by multiple parity bits.
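The difference between detection with one parity bit and correction with several can be sketched as follows. The slides do not name a specific code; the Hamming(7,4) code used here is one common choice and is an assumption of this example, as are all function names.

```python
def parity_bit(bits):
    """Even parity: the extra bit makes the total number of 1s even."""
    return sum(bits) % 2

def parity_ok(word):
    """True if the word (data plus parity bit) has even parity.
    Detects an odd number of flipped bits but cannot locate them."""
    return sum(word) % 2 == 0

def hamming74_encode(d):
    """Encode 4 data bits into 7 bits; parity bits sit at positions 1, 2 and 4."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(c):
    """Correct at most one flipped bit. Returns (data bits, error position),
    where position 0 means that no error was seen."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # syndrome = 1-based index of the bad bit
    fixed = list(c)
    if position:
        fixed[position - 1] ^= 1
    return [fixed[2], fixed[4], fixed[5], fixed[6]], position

data = [1, 0, 1, 1]

word = data + [parity_bit(data)]
word[2] ^= 1                    # simulate a single bit flip in memory
print(parity_ok(word))          # the flip is detected, but not located

code = hamming74_encode(data)
code[5] ^= 1                    # simulate a single bit flip at position 6
print(hamming74_decode(code))   # the data is restored and the position reported
```

A single parity bit costs one extra bit per word but only supports detection; the three parity bits of this code allow the faulty bit to be located and flipped back.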

Page 24: Reliability and Fault Tolerance

Redundancy at component level

Stripe Sets with Parity (RAID)

[Figure: Disk 1, Disk 2 and Disk 3; the parity block is the XOR of the two other disks.]
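The XOR parity idea can be tried out on two small byte strings standing in for disk blocks. This is a minimal sketch of the principle only, not of any real RAID implementation; the block contents and names are invented.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    if len(a) != len(b):
        raise ValueError("blocks must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))

# Two data blocks (hypothetical contents) and the parity block on a third disk.
disk1 = b"RELIABLE"
disk2 = b"TOLERANT"
parity = xor_blocks(disk1, disk2)

# Disk 1 is lost: its content is rebuilt from the two surviving disks.
rebuilt = xor_blocks(disk2, parity)
print(rebuilt)
```

Because XOR is its own inverse, any one of the three blocks can be rebuilt from the other two.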

Page 25: Reliability and Fault Tolerance

Redundancy at component level

Error detection in an ALU

[Figure: the ALU result is checked by a “proof by 9” unit, which raises “Error!” on a mismatch.]
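The “proof by 9” check (casting out nines) relies on the fact that the remainder modulo 9 of a sum equals the sum of the operands' remainders, taken modulo 9. The sketch below applies it to an addition; the faulty ALU model with a stuck bit is purely hypothetical.

```python
def check_by_nine(a, b, result):
    """True if result is consistent with a + b when everything is reduced modulo 9."""
    return (a % 9 + b % 9) % 9 == result % 9

def faulty_alu_add(a, b):
    """Hypothetical ALU fault: bit 2 of the sum is stuck at 1."""
    return (a + b) | 0b100

a, b = 1234, 5678
print(check_by_nine(a, b, a + b))                  # correct sum passes the check
print(check_by_nine(a, b, faulty_alu_add(a, b)))   # corrupted sum is flagged
print(check_by_nine(a, b, a + b + 9))              # an error of exactly 9 slips through
```

The check is cheap because it works on small residues, but it is blind to any error that is a multiple of 9, so it detects many faults rather than all of them.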

Page 26: Reliability and Fault Tolerance

Redundancy in components

Error detection
  – to correct transient errors by retry
  – to avoid using corrupted data
Error correction
  – to correct transient errors on the fly
  – to remain operational after catastrophic component failure
  – scheduled maintenance instead of urgent repair.
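Correcting a transient error by retry might look like the sketch below. It assumes data arrives in frames protected by a CRC-32 checksum; the frame layout, the function names and the retry limit are assumptions of this example.

```python
import zlib

def make_frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def frame_ok(frame: bytes) -> bool:
    """Error detection: recompute the checksum and compare it with the stored one."""
    payload, stored = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored

def read_with_retry(read_frame, max_attempts=3):
    """Call read_frame() until a frame passes the check.
    Returns (payload, attempts used); corrupted frames are never passed on."""
    for attempt in range(1, max_attempts + 1):
        frame = read_frame()
        if frame_ok(frame):
            return frame[:-4], attempt
    raise IOError("error persists after retries: probably not transient")

# Simulated source: the first read is hit by a glitch, the second is clean.
good = make_frame(b"sensor=42")
corrupted = bytes([good[0] ^ 0x01]) + good[1:]
responses = iter([corrupted, good])

print(read_with_retry(lambda: next(responses)))
```

If every attempt fails, the fault is probably permanent and retrying no longer helps, which is where error correction or a redundant unit has to take over.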

Page 27: Reliability and Fault Tolerance

Fault detection at Processor Level

[Figure: CPU 1 and CPU 2 run in parallel; a comparator (=) raises “Error” when their outputs differ.]
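The duplicated-processor arrangement can be sketched in software as two units computing the same function followed by a comparison. The two CPU functions and the fault injected into one of them are hypothetical.

```python
def duplex(cpu1, cpu2, x):
    """Run the same computation on two units and compare the outputs.
    A mismatch is reported as an error; a match is passed on."""
    r1, r2 = cpu1(x), cpu2(x)
    if r1 != r2:
        raise RuntimeError(f"mismatch detected: {r1!r} != {r2!r}")
    return r1

def healthy_cpu(x):
    return x * x

def faulty_cpu(x):
    """Hypothetical fault: the result is off by one for odd inputs."""
    return x * x + (x % 2)

print(duplex(healthy_cpu, healthy_cpu, 7))
try:
    duplex(healthy_cpu, faulty_cpu, 7)
except RuntimeError as err:
    print("Error:", err)
```

With only two units the comparator can tell that something is wrong but not which unit is wrong, so this arrangement gives detection only.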

Page 28: Reliability and Fault Tolerance

Fault correction at Processor Level

[Figure: CPU 1, CPU 2 and CPU 3 feed a voting logic.]
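A minimal sketch of the voting logic for three processors, assuming a strict majority rule on exact output values (real voters often have to compare within a tolerance). The function name is hypothetical.

```python
from collections import Counter

def vote(results):
    """Return the value produced by a strict majority of the units,
    or raise if no value has a majority."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 > len(results):
        return value
    raise RuntimeError("no majority among the units")

print(vote([42, 42, 42]))   # all units agree
print(vote([42, 41, 42]))   # one faulty unit is outvoted
```

With three units a single wrong output is masked on the fly; if all three disagree, the voter can only report the problem.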

Page 29: Reliability and Fault Tolerance

Replica Determinism

A set of replicated RT objects is “replica determinate” if all objects of this set visit the same state at about the same time.

“At about the same time” makes a concession to the finite precision of the clock synchronization.

Replica determinism is needed for
  – consistent distributed actions
  – fault tolerance by active redundancy

Page 30: Reliability and Fault Tolerance

Replica Determinism

Lack of replica determinism makes voting meaningless.
Example: Airplane on takeoff

  System 1:   Take off    Accelerate Engine
  System 2:   Abort       Stop Engine
  System 3:   Take off    Stop Engine (fault)
  Majority:   Take off    Stop Engine

Lack of replica determinism causes the faulty channel to win!
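The takeoff example can be replayed with a simple majority vote applied separately to each of the two outputs. The data layout and names below are assumptions made for this sketch.

```python
from collections import Counter

def majority(values):
    """Most frequent value among the channels (here 2 out of 3 is enough)."""
    return Counter(values).most_common(1)[0][0]

# Each channel reports (decision, engine command), as in the takeoff example.
channels = {
    "System 1": ("Take off", "Accelerate Engine"),
    "System 2": ("Abort", "Stop Engine"),    # correct, but decided differently
    "System 3": ("Take off", "Stop Engine"), # engine command is the faulty output
}

decision = majority([d for d, _ in channels.values()])
command = majority([c for _, c in channels.values()])
print(decision, "/", command)
```

Systems 1 and 2 are both fault-free yet disagree with each other, so the single faulty output of System 3 is enough to tip the vote: the majority decides to take off while the engine is commanded to stop.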

Page 31: Reliability and Fault Tolerance

Fault Correction at System Level: Hot Stand-By

[Figure: SYSTEM 1 and SYSTEM 2, linked by error detection.]

Page 32: Reliability and Fault Tolerance

Fault Correction at System Level: Cold Stand-By

[Figure: SYSTEM 1 and SYSTEM 2, with error detection and a common memory.]

Page 33: Reliability and Fault Tolerance

Fault Correction at System Level: Distributed Common Memory

[Figure: SYSTEM 1 and SYSTEM 2, with error detection and a distributed common memory.]

In fact, each processor has access to the memory of the other to keep a copy of the state of all critical processes.

Page 34: Reliability and Fault Tolerance

Fault Correction at System Level: Load Sharing

[Figure: four system boxes, each labelled SYSTEM 1 in the transcript, sharing a common memory.]

Page 35: Reliability and Fault Tolerance

Safety Critical Systems

[Figure: SYS 1, SYS 2, SYS 3 and SYS 4 feed a voting logic.]

Fail once, still operational; fail twice, still safe.
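One way to read “fail once, still operational; fail twice, still safe” is a four-channel voter that needs a strict majority and otherwise falls back to a safe state. The sketch below follows that reading; the rule, the SAFE_STATE marker and the function name are assumptions, not the design of any particular system.

```python
from collections import Counter

SAFE_STATE = "SAFE_STATE"  # hypothetical marker for the fail-safe output

def quad_vote(outputs):
    """Vote over four channels. A strict majority (at least 3 of 4) keeps the
    system operational; without one, the voter commands the safe state."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 > len(outputs):
        return value
    return SAFE_STATE

print(quad_vote([5, 5, 5, 5]))   # no failure
print(quad_vote([5, 5, 5, 9]))   # one failure: still operational
print(quad_vote([5, 5, 9, 9]))   # two failures: no majority, go to safe state
```

After two failures the voter can no longer tell which pair is right, so it stops trusting any output instead of guessing.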

Page 36: Reliability and Fault Tolerance

Safety Critical Systems

But what happens in case of a software bug?

Page 37: Reliability and Fault Tolerance

Space Shuttle Computer System

[Figure: SYS 1, SYS 2, SYS 3 and SYS 4 feed a voting logic; SYS 5 is shown as an additional system.]

Page 38: Reliability and Fault Tolerance

References

1. Ebeling C, An introduction to reliability and maintainability engineering, McGraw-Hill, 1997
2. Krishna C, Real-time systems, McGraw-Hill, 1997
3. Kopetz H, Real-time systems design principles for distributed embedded applications, Kluwer, 1997
4. Tiberghien J, Real-time system fault tolerance, Lecture slides