8. Fault Tolerance in Software

8.1 Introduction

Is it true that a program that has once performed a given task as specified will continue to do so?

Yes, provided that none of the following parameters change:

The inputs

The computing environment

The user requirements

Consistency of failure rates in time.

Federal Reserve Funds Transfer Program, active 12 hours/day, 5 days/week.

Failure rates of Command and Control Systems.

Data and Analysis Center for Software (DACS): fault density, the number of faults per 1000 lines of code, ranges from 10 to 50 for “good” SW and from 1 to 5 after intensive testing using automated tools.
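To make the fault density figures concrete, the sketch below converts a density range into an expected fault count for a program of a given size. The 20,000-line program and the function name `expected_faults` are assumptions chosen for illustration; they are not part of the DACS data.

```python
def expected_faults(lines_of_code, density_low, density_high):
    """Return the (low, high) expected fault count for a program.

    Densities are expressed in faults per 1000 lines of code.
    """
    kloc = lines_of_code / 1000
    return (density_low * kloc, density_high * kloc)


# Hypothetical 20,000-line program, using the ranges quoted above.
before_testing = expected_faults(20_000, 10, 50)
after_testing = expected_faults(20_000, 1, 5)
print("before intensive testing:", before_testing)
print("after intensive testing: ", after_testing)
```

Even at the low end of the tested range, a program of this size would still be expected to contain some tens of faults, which is the motivation for tolerating faults at run time.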

Consequences of SW failure:

Those in attendance have personal experience with incorrect billing and lost airline or hotel reservations.

More serious errors are reported in the media, such as the disruption of phone service to over 20 million customers during the summer of 1991 due to a coding error in a new-generation digital switch.

The most serious consequences are related to real-time applications, such as those involving spacecraft: the launch failure of Mariner I (1962), the destruction of a French meteorological satellite in 1968, several problems during the Apollo missions in the early 1970s, the NASA Space Shuttle, the fly-by-wire Airbus A320, the Russian satellite “Mars”, the satellite launcher Ariane.

Causes of SW failure:

Malfunction of a process, e.g., exception handling, timeout computation, design error (solution: check the outputs and timer);

Erroneous control sequence (solution: set an upper limit on loop iterations);

Data entry error (solution: use of error-detecting code and type checks in input data).

8.3 Dealing with Faulty Programs

8.3.1 Robustness

The minimum requirement is that the program will properly handle inputs that are out of range, or of a different type or format than defined, without degrading its performance of functions not dependent on the nonstandard input.

When these input data are found not to comply with the program specification:

a new input may be requested;

the last acceptable value of a variable can be used;

or a predefined default can be assigned.
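The last two fallback strategies can be sketched as follows. The function name `read_setpoint`, the 0 to 100 range, and the default of 0.0 are assumptions made for this example; requesting a new input is omitted because it depends on the input source.

```python
def read_setpoint(raw, last_good, default=0.0, low=0.0, high=100.0):
    """Return a usable value even when `raw` violates the specification.

    `last_good` is the last acceptable value seen so far, or None.
    """
    try:
        value = float(raw)              # type and format check
    except (TypeError, ValueError):
        value = None

    if value is not None and low <= value <= high:   # range check
        return value
    if last_good is not None:
        return last_good                # reuse the last acceptable value
    return default                      # fall back to the predefined default


print(read_setpoint("42.5", last_good=None))   # valid input is passed through
print(read_setpoint("abc", last_good=17.0))    # wrong format, last good value
print(read_setpoint(250, last_good=None))      # out of range, default
```

Functions that do not depend on the rejected input keep running with a plausible value instead of failing along with it.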

In general, robustness checks are used to test:

the function of a process (e.g., by checking the outputs);

the control sequence (e.g., by setting an upper limit on loop iterations);

the input data (e.g., by using error-detecting code and type checks).
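The three kinds of check can be combined in a single routine. The sketch below uses a Newton iteration for the square root purely as a stand-in for an arbitrary iterative computation; the names `bounded_sqrt` and `ControlSequenceError` and the tolerance values are assumptions for the example.

```python
class ControlSequenceError(Exception):
    """Raised when a loop exceeds its iteration limit."""


def bounded_sqrt(x, tolerance=1e-9, max_iterations=50):
    # Input data check: type and range.
    if isinstance(x, bool) or not isinstance(x, (int, float)) or x < 0:
        raise ValueError("x must be a non-negative number")
    if x == 0:
        return 0.0

    estimate = max(float(x), 1.0)
    # Control sequence check: the loop cannot run more than max_iterations times.
    for _ in range(max_iterations):
        next_estimate = 0.5 * (estimate + x / estimate)
        converged = abs(next_estimate - estimate) <= tolerance * next_estimate
        estimate = next_estimate
        if converged:
            break
    else:
        raise ControlSequenceError("iteration limit reached without convergence")

    # Function check: verify the output against the definition of the result.
    if abs(estimate * estimate - x) > 1e-6 * max(x, 1.0):
        raise ArithmeticError("output failed its plausibility check")
    return estimate


print(bounded_sqrt(2.0))
```

The `for ... else` form makes the iteration limit explicit: the `else` branch runs only when the loop ends without reaching the convergence `break`.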

8.3.2 Temporal Redundancy

Temporal Redundancy consists of the reexecution of a program when an error is encountered. The error may involve faulty data (as detected by Robustness), faulty execution (e.g., accessing protected memory), or incorrect output (as detected by Acceptance Tests).

Reexecution will clear errors that arose from temporary circumstances that are no longer present when a new pass through the program is taken, e.g., busy or noisy communication channels, full buffers, power supply transients.

When the error persists, Fault Containment Procedures must be triggered by the system.
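A minimal sketch of temporal redundancy, assuming that errors surface as exceptions: the task is reexecuted a bounded number of times, and a persistent error is escalated to the caller, which stands in for the fault containment procedure. The names `run_with_retry` and `PersistentFault` and the limit of three attempts are assumptions for the example.

```python
import time


class PersistentFault(Exception):
    """Signals that reexecution did not clear the error."""


def run_with_retry(task, max_attempts=3, delay_s=0.0):
    """Reexecute `task` until it succeeds or the attempt limit is reached."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:      # error detected during this pass
            last_error = error
            time.sleep(delay_s)         # give a transient condition time to clear
    # The error persisted: hand over to fault containment.
    raise PersistentFault(f"still failing after {max_attempts} attempts") from last_error


# Hypothetical task that fails twice (e.g., a busy channel) and then succeeds.
attempts = {"count": 0}

def send_message():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise IOError("channel busy")
    return "delivered"


print(run_with_retry(send_message))
```

Reexecution only helps when the cause is transient; a design fault reproduces on every pass, which is why the attempt limit matters.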

8.3.3 Software Diversity

SW Diversity permits uninterrupted system operation in the presence of program faults through multiple implementations of a given functional process, and it is therefore particularly applicable to real-time control systems.

It is divided into two categories:

Static SW Fault Tolerance: N-Version Programming

Dynamic SW Fault Tolerance: Recovery Block

Static SW Fault Tolerance: N-Version Programming

A given task is executed by several programs (consecutively on the same machine) and the result is accepted only if a specified number of programs agree within specified limits. The same computer performs comparison and selection of the results to be propagated to the external system.

In practice, the programs are executed concurrently, and therefore multiple computers are required to implement this technique.

Dynamic SW Fault Tolerance: Recovery Block

A single program is executed and the result (including intermediate results) is subjected to an Acceptance Test.

The term STATIC is used because the selection of the acceptable result does not affect the subsequent execution of the programs.

The term DYNAMIC is used because the selection between the original and alternate program is made during execution based on the outcome of the Acceptance Test.

8.4 Design of Fault Tolerant Software Using Diversity

8.4.1 N-Version Programming

Defined as the independent generation of N ≥ 2 functionally equivalent programs, called versions, from the same initial specification. For N = 2, fault masking is not provided and, upon disagreement among the versions, 3 alternatives are available:

Retry or restart (in this case fault containment rather than FT is provided);

Transition to a predefined “safe state”, possibly followed by later retries;

Reliance on one of the versions, either designated in advance as more reliable or selected by a diagnostic program (in the latter case the technique takes on some aspects of dynamic redundancy).

For N > 2, a majority voting logic can be implemented. For N = 3, the following are required:

I. Three independent programs, each furnishing identical output formats;

II. An acceptance program that evaluates the outputs of (I) and selects the result to be furnished as the N-version output;

III. A driver (process controller) that invokes (I) and (II) and furnishes the N-version output to other programs or the physical system.
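The three requirements can be sketched as a small voter and driver. This is an illustration under simplifying assumptions, not the implementation used in any of the experiments: the versions run one after another in a single process, outputs are single numbers, and agreement means "equal within `tolerance`". All names (`vote`, `n_version_driver`, `version_a` and so on) are hypothetical.

```python
class NoMajorityError(Exception):
    """Raised when fewer than a majority of versions agree."""


def vote(results, n_versions, tolerance=1e-6):
    """Acceptance program: select a result supported by a majority of versions."""
    needed = n_versions // 2 + 1            # 2 out of 3 for N = 3
    for candidate in results:
        agreeing = sum(1 for r in results if abs(r - candidate) <= tolerance)
        if agreeing >= needed:
            return candidate
    raise NoMajorityError(f"no {needed}-of-{n_versions} agreement in {results}")


def n_version_driver(versions, x, tolerance=1e-6):
    """Driver: invoke every version, then pass the outputs to the voter."""
    results = []
    for version in versions:
        try:
            results.append(float(version(x)))
        except Exception:
            # An aborting version casts no vote instead of stopping the driver.
            continue
    return vote(results, len(versions), tolerance)


# Three hypothetical versions of "square a number"; the third is faulty.
def version_a(x):
    return x * x

def version_b(x):
    return x ** 2

def version_c(x):
    return x * x + 1.0      # deliberately wrong


print(n_version_driver([version_a, version_b, version_c], 3.0))
```

The single faulty version is outvoted, so its error is masked. Catching exceptions inside the driver keeps an aborting version from terminating the whole run before the vote is taken.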

Experiment carried out at UCLA (1978):

7 separate versions of the application program;

From these, 12 3-version sets were constructed;

Each set was subjected to 32 test cases, yielding 384 total tests.

One of the conclusions:

In cases where a single faulty version resulted in incorrect execution, the OS of the computer intervened before the program reached the voting stage. Most later N-version experiments overcame this problem by incorporating acceptance tests for abort conditions and precluding the intervention of the OS under these conditions.

Figure: Results of an Early N-Version Programming Experiment.

8.4.2 Recovery Block

Represents the Dynamic Redundancy Approach to SW fault tolerance.

Consists of 3 SW elements:

a primary routine, which executes critical SW functions;

an acceptance test, which tests the output of the primary routine after every execution;

at least one alternate routine, which performs the same function as the primary routine (but may be less capable or slower) and is invoked by the acceptance test upon detection of a failure.

The basic structure is:

Ensure T

By P

Else by Q

Else Error

Where:

T is the acceptance test condition that is expected to be met by successful execution of either the primary routine P or the alternate routine Q.

The structure is easily expanded to accommodate several alternates Q1, Q2, Q3, ..., Qn.
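The "Ensure T By P Else by Q Else Error" structure can be sketched as a function that tries each routine in order. Giving every routine a fresh copy of the input is an assumption of this sketch (a stand-in for restoring state before an alternate runs), as are the names `recovery_block`, `primary_sort` and `alternate_sort`; sorting is used only as a convenient example task.

```python
import copy


class RecoveryBlockError(Exception):
    """The 'Else Error' branch: no routine passed the acceptance test."""


def recovery_block(acceptance_test, routines, data):
    """Ensure acceptance_test by routines[0], else by routines[1], ..., else error."""
    for routine in routines:                  # P first, then Q1, Q2, ..., Qn
        working_copy = copy.deepcopy(data)    # every routine starts from the same input
        try:
            result = routine(working_copy)
        except Exception:
            continue                          # a crash counts as a failed test
        if acceptance_test(data, result):     # T
            return result
    raise RecoveryBlockError("all routines failed the acceptance test")


def is_sorted_same_length(original, result):
    """T: the output is ordered and no elements were lost or added."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and len(result) == len(original)


def primary_sort(values):
    values.sort()
    return values[:-1]        # hypothetical fault: drops the largest element


def alternate_sort(values):
    return sorted(values)     # simpler alternate routine


print(recovery_block(is_sorted_same_length, [primary_sort, alternate_sort], [3, 1, 2]))
```

Here the primary routine fails T because an element is missing, so the alternate is invoked and its result is returned. Extending the list of routines gives the Q1 ... Qn form.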

The differences between Recovery Block and N-Version Programming are:

only a single implementation of the program is run at a time (in this case: P or Q);

the acceptability of the results is decided by a test rather than by comparison with functionally equivalent alternate versions.

Real-time control applications require that results furnished by a program be both correct and timely.

For this reason, the recovery block for a real-time program should incorporate a watchdog timer which initiates execution by Q (if P does not produce an acceptable result within the allocated time).
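A rough sketch of the watchdog idea, using a worker thread and a timeout from Python's standard `concurrent.futures` module. This is an approximation: Python cannot preempt a running thread, so an overdue P keeps running in the background while Q takes over, whereas a real-time system would use a timer interrupt. The names `timed_recovery_block`, `slow_primary` and `fast_alternate` and the 50 ms deadline are assumptions for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as WatchdogExpired


def timed_recovery_block(primary, alternate, acceptance_test, arg, deadline_s):
    """Run P under a deadline; switch to Q on timeout, crash, or rejected output."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary, arg)
    try:
        result = future.result(timeout=deadline_s)   # watchdog timer
        if acceptance_test(result):
            return result
    except WatchdogExpired:
        pass                     # P was not timely
    except Exception:
        pass                     # P aborted
    finally:
        executor.shutdown(wait=False)   # do not wait for an overdue P

    result = alternate(arg)
    if acceptance_test(result):
        return result
    raise RuntimeError("neither P nor Q produced an acceptable result")


# Hypothetical routines: P is correct but too slow, Q is fast.
def slow_primary(x):
    time.sleep(0.3)
    return ("P", 2 * x)

def fast_alternate(x):
    return ("Q", x + x)

def output_ok(result):
    return result[1] == 8


print(timed_recovery_block(slow_primary, fast_alternate, output_ok, 4, deadline_s=0.05))
```

The result comes from Q because P misses its 50 ms deadline, even though P's output would have passed the acceptance test.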

Figure **: Recovery block for real-time application. (Program flow under direction of the application module is shown in solid lines; timer-triggered interrupts are shown in dashed lines.)

Highlights ...

A single program is executed at any given time: no special demands on computer redundancy or computer architecture are made.

Performance penalty in normal operation is small: the execution of the acceptance test.

Storage requirements are expanded: in addition to the primary application program, the acceptance test and the backup program must also be available in memory.

SW development cost is increased: need to generate two programs and the associated acceptance test.

Details about the Basic Recovery Block Structure ...

The Acceptance Test is divided into separate tests which are invoked before and after the execution of the primary routine:

Before: the first acceptance test checks the call format and parameters.

The second acceptance test checks the validity of the input data. (When data errors are common, provision of an alternate data source may be considered; in the figure, dashed lines indicate the backup data.)

After: the last acceptance test examines the output data.
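The ordering of these tests around the primary routine can be sketched as a wrapper. The wrapper name `run_primary`, the `AcceptanceFailure` exception, and the averaging task with its 0 to 150 range are all assumptions made for illustration; the alternate data source is not modelled.

```python
class AcceptanceFailure(Exception):
    """Carries the name of the acceptance test that failed."""

    def __init__(self, stage):
        super().__init__(f"acceptance test failed: {stage}")
        self.stage = stage


def run_primary(primary, args, n_params, input_valid, output_valid):
    # Before (1): call format and parameters.
    if not isinstance(args, tuple) or len(args) != n_params:
        raise AcceptanceFailure("call format")
    # Before (2): validity of the input data.
    if not input_valid(*args):
        raise AcceptanceFailure("input data")

    result = primary(*args)

    # After: examine the output data.
    if not output_valid(result):
        raise AcceptanceFailure("output data")
    return result


# Hypothetical primary routine: average a list of sensor readings.
def mean(readings):
    return sum(readings) / len(readings)

def readings_valid(readings):
    return len(readings) > 0 and all(0.0 <= r <= 150.0 for r in readings)

def mean_valid(result):
    return 0.0 <= result <= 150.0


print(run_primary(mean, ([10.0, 20.0],), 1, readings_valid, mean_valid))
```

Knowing which stage failed lets the caller choose a response, for example switching to backup data after an input failure or invoking the alternate routine after an output failure.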

Figure: Internal structure for primary application module.

The integration of application modules structured as recovery blocks into a fault-tolerant SW system is shown in the next figure.

“Application Modules” and the decision diamond labeled “Return” together represent the structure shown in figure **.

In the absence of failures of the recovery blocks, the process will always remain in the inner loop.

If an abort is taken, the failure is recorded and some diagnostics may be performed. In case of a first failure in a recovery block, a retry may be initiated. If the failure persists, further execution of the task represented by the recovery block is suspended.
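The executive behaviour described here can be sketched as a loop over tasks. This is an interpretation under stated assumptions: each task is a callable that raises a hypothetical `Abort` exception when its recovery block fails, the retry happens immediately, and diagnostics are reduced to a failure log. The names `run_executive` and `Abort` are not from the source.

```python
class Abort(Exception):
    """Raised by a task when its recovery block has no acceptable result."""


def run_executive(tasks, cycles):
    """Run every task once per cycle; retry once on abort, then suspend."""
    suspended = set()
    failure_log = []
    for cycle in range(cycles):
        for name, block in tasks.items():
            if name in suspended:
                continue
            try:
                block()
                continue                     # no failure: stay in the inner loop
            except Abort as failure:
                failure_log.append((cycle, name, str(failure)))   # record it
            try:
                block()                      # first failure: retry
            except Abort as failure:
                failure_log.append((cycle, name, str(failure)))
                suspended.add(name)          # failure persists: suspend the task
    return suspended, failure_log


# Hypothetical tasks: one healthy, one with a transient fault, one broken.
state = {"transient_calls": 0}

def healthy():
    return None

def transient():
    state["transient_calls"] += 1
    if state["transient_calls"] == 1:
        raise Abort("transient fault")

def broken():
    raise Abort("persistent fault")


suspended, log = run_executive(
    {"healthy": healthy, "transient": transient, "broken": broken}, cycles=3)
print(sorted(suspended), len(log))
```

Suspending only the failing task lets the remaining application modules continue to run.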

Figure: Executive and application modules.