Date posted: 20-Jan-2016
8. Fault Tolerance in Software
8.1 Introduction
Is it true that a program that has once performed
a given task as specified will continue to do so?
Yes, provided that none of the following parameters change:
The inputs
The computing environment
The user requirements
Consistency of failure rates in time.
Federal Reserve Funds Transfer Program, active 12 hours/day, 5 days/week.
Failure rates of Command and Control Systems.
Data and Analysis Center for Software (DACS), fault density: the # of faults per 1000 lines of code, ranges from 10 – 50 for “good” SW and from 1 – 5 after intensive testing using automated tools.
Consequences of SW failure:
Most attendees have personal experience with incorrect billing or lost airline or hotel reservations.
More serious errors are reported in the media, such as the disruption of phone service to over 20 million customers during the summer of 1991 due to a coding error in a new generation digital switch.
The most serious consequences are related to real-time applications, such as those involving spacecraft: the launch failure of Mariner I (1962), the destruction of a French meteorological satellite in 1968, several problems during the Apollo missions in the early 1970s, the NASA Space Shuttle, the fly-by-wire Airbus A320, the Russian satellite “Mars”, the satellite launcher Ariane.
Causes of SW failure:
Malfunction of a process. E.g. exception handling, timeout
computation, design error (solution: check the outputs and
timer);
Erroneous control sequence (solution: set an upper limit on
loop iterations);
Data entry error (solution: use of error-detecting code and
type checks in input data).
8.3 Dealing with Faulty Programs
8.3.1 Robustness
The minimum requirement is that the program will properly handle inputs
out of range, or in a different type or format than defined, without
degrading its performance of functions not dependent on the nonstandard
input.
When these input data are found not to comply with the program
specification:
a new input may be requested;
the last acceptable value of a variable can be used;
or a predefined default can be assigned.
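The three fallback options can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name robust_read, its parameters, and the order in which the fallbacks are tried are all assumptions made for the example.

```python
from typing import Callable, Optional


def robust_read(raw: str, low: float, high: float,
                last_good: Optional[float] = None,
                default: float = 0.0,
                re_request: Optional[Callable[[], str]] = None) -> float:
    """Return a usable value even when `raw` violates the input specification."""

    def parse(text: str) -> Optional[float]:
        try:
            value = float(text)              # type/format check
        except (TypeError, ValueError):
            return None
        # Range check (NaN fails every comparison, so it is rejected too).
        return value if low <= value <= high else None

    value = parse(raw)
    if value is not None:
        return value
    if re_request is not None:               # option 1: request a new input
        value = parse(re_request())
        if value is not None:
            return value
    if last_good is not None:                # option 2: last acceptable value
        return last_good
    return default                           # option 3: predefined default
```

Functions that do not depend on the nonstandard input are unaffected, because the bad value never leaves robust_read.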
In general, Robustness is used to test:
the function of a process (e.g., by checking the outputs);
the control sequence (e.g., by setting an upper limit on loop
iterations);
the input data (e.g., by using error-detecting code and type
checks).
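A small sketch that combines the three kinds of check in one routine. The routine (a Newton iteration for square roots), the limit of 50 iterations, and the tolerances are assumptions chosen only for illustration.

```python
MAX_ITERATIONS = 50  # assumed upper limit on loop iterations


def checked_sqrt(x, tolerance=1e-9):
    """Square root by Newton iteration with input, control and output checks."""
    # Input data check: type and range.
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        raise TypeError("x must be a number")
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0

    estimate = float(x) if x >= 1 else 1.0
    # Control sequence check: the loop cannot run more than MAX_ITERATIONS times.
    for _ in range(MAX_ITERATIONS):
        next_estimate = 0.5 * (estimate + x / estimate)
        converged = abs(next_estimate - estimate) <= tolerance * next_estimate
        estimate = next_estimate
        if converged:
            break
    else:
        raise RuntimeError("iteration limit exceeded")

    # Output check: squaring the result must reproduce the input.
    if abs(estimate * estimate - x) > 1e-6 * max(1.0, x):
        raise RuntimeError("output failed the check")
    return estimate
```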
8.3.2 Temporal Redundancy
Temporal Redundancy consists of the reexecution of a program when an
error is encountered. The error may involve faulty data (as detected by
Robustness), faulty execution (e.g., accessing protected memory), or
incorrect output (as detected by Acceptance Tests).
Reexecution will clear errors that arose from temporary
circumstances that are no longer present when a new pass through the
program is taken, e.g., busy or noisy communication channels, full
buffers, or power supply transients.
When the error persists, Fault Containment Procedures must be triggered by the system.
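A minimal sketch of reexecution followed by fault containment. The names TransientError, run_with_retry and contain are hypothetical, and the limit of three attempts is an assumption.

```python
class TransientError(Exception):
    """Hypothetical error type for faults that may clear on a new pass."""


def run_with_retry(task, max_attempts=3, contain=None):
    """Reexecute `task` on error; trigger containment if the error persists."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()                    # a clean pass ends the retries
        except TransientError as error:
            last_error = error               # remember the error and try again
    if contain is not None:
        contain(last_error)                  # fault containment procedure
    raise RuntimeError("error persisted after reexecution") from last_error
```

A task that fails twice because of a busy channel and then succeeds returns normally on the third pass; a task that always fails reaches the containment hook exactly once.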
8.3.3 Software Diversity
SW Diversity permits uninterrupted system operation in the
presence of program faults through multiple implementations of a
given functional process and it is therefore particularly applicable
to real-time control systems.
It is divided into two categories:
Static SW Fault Tolerance: N-Version programming
Dynamic SW Fault Tolerance: Recovery Block
Static SW Fault Tolerance: N-Version Programming
In concept, a given task is executed by several programs (consecutively on
the same machine) and the result is accepted only if a specified # of
programs agree within specified limits. The same computer
performs comparison and selection of the results to be
propagated to the external system.
In practice, the programs are executed concurrently, and therefore
multiple computers are required to implement this technique.
Dynamic SW Fault Tolerance: Recovery Block
A single program is executed and the result (including
intermediate results) is subjected to an Acceptance Test.
The term STATIC is used because the selection of the acceptable
result does not affect the subsequent execution of the programs.
The term DYNAMIC is used because the selection between the
original and alternate program is made during execution based on
the outcome of the Acceptance Test.
8.4 Design of Fault Tolerant Software Using Diversity
8.4.1 N-Version Programming
Defined as the independent generation of N ≥ 2 functionally equivalent
programs, called versions, from the same initial specification. For N = 2,
fault masking is not provided and upon disagreement between the versions,
3 alternatives are available:
Retry or restart (in this case fault containment rather than FT is provided);
Transition to a predefined “safe state”, possibly followed by later retries;
Reliance on one of the versions, either designated in advance as more
reliable or selected by a diagnostic program (in the latter case the
technique takes on some aspects of dynamic redundancy).
For N > 2, majority voting logic can be implemented. For N = 3, the
following are required:
I. Three independent programs, each furnishing identical output formats;
II. An acceptance program that evaluates the outputs of (I) and selects the
result to be furnished as N-version output;
III. A driver (process controller) that invokes (I) and (II) and
furnishes the N-version output to other programs or the physical
system.
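The three requirements for N = 3 can be sketched as below. The three squaring versions (one deliberately faulty for negative inputs), the function names, and the use of exact equality instead of agreement "within specified limits" are assumptions made to keep the example short; the versions here run one after another on a single machine.

```python
from collections import Counter


# (I) Three independently written versions with identical output formats.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return abs(x) * x        # deliberately faulty: wrong sign for negative x


# (II) Acceptance program: majority vote over the collected outputs.
def vote(outputs, n_versions):
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > n_versions // 2 else None


# (III) Driver: invokes the versions and the voter, furnishes the result.
def driver(x, versions=(version_a, version_b, version_c)):
    outputs = []
    for version in versions:
        try:
            outputs.append(version(x))
        except Exception:
            continue         # a crashed version simply casts no vote
    result = vote(outputs, len(versions))
    if result is None:
        raise RuntimeError("no majority among the versions")
    return result
```

For x = -3 the faulty version returns -9, the other two return 9, and the fault is masked by the vote.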
Experiment carried out at UCLA (1978):
7 separate versions for the application program;
From these, twelve 3-version sets were constructed;
Each set was subject to 32 test cases, yielding 384 total tests.
One of the conclusions:
In cases where a single faulty version resulted in incorrect execution,
the OS of the computer intervened before the program reached the
voting stage. Most later N-version experiments overcame this
problem by incorporating acceptance tests for abort conditions and
precluding the intervention of the OS under these conditions.
Results of an Early N-Version Programming Experiment.
8.4.2 Recovery Block
Represents the Dynamic Redundancy Approach to SW fault tolerance.
Consists of 3 SW elements:
a primary routine, which executes critical SW functions;
an acceptance test, which tests the output of the primary routine
after every execution;
at least one alternate routine which performs the same function as
the primary routine (but may be less capable or slower) and is invoked
by the acceptance test upon detection of a failure.
The basic structure is:
Ensure T
By P
Else by Q
Else Error
Where:
T is the acceptance test condition that is expected to be met by successful execution of either the primary routine P or the alternate routine Q.
The structure is easily expanded to accommodate several alternates Q1, Q2, Q3,...,Qn.
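The "Ensure T By P Else by Q Else Error" structure can be sketched as follows. The sorting task, the deliberately faulty primary routine, and the checksum-style acceptance test are assumptions chosen for illustration; the test is a cheap plausibility check, not a proof of correctness.

```python
def recovery_block(acceptance_test, routines, data):
    """Ensure T by P, else by Q1, Q2, ..., else error."""
    for routine in routines:                  # P first, then the alternates
        try:
            result = routine(list(data))      # each routine gets a fresh copy
        except Exception:
            continue                          # a crash counts as a failed test
        if acceptance_test(data, result):     # T
            return result
    raise RuntimeError("all routines failed the acceptance test")


def acceptable(data, result):
    """T: same length, ascending order, same sum as the input."""
    return (len(result) == len(data)
            and all(a <= b for a, b in zip(result, result[1:]))
            and sum(result) == sum(data))


def primary_sort(items):
    """P: fast, but deliberately faulty (drops duplicate entries)."""
    return sorted(set(items))


def alternate_sort(items):
    """Q: slower insertion sort."""
    out = []
    for item in items:
        i = len(out)
        while i > 0 and out[i - 1] > item:
            i -= 1
        out.insert(i, item)
    return out
```

With the input [3, 1, 3, 2] the primary returns [1, 2, 3], fails T on length, and the alternate supplies [1, 2, 3, 3].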
The differences between Recovery Block and N-Version
Programming are:
only a single implementation of the program is run at a
time (in this case: P or Q);
the acceptability of the results is decided by a test rather than by comparison with functionally equivalent alternate versions.
Real-time control applications require that results furnished
by a program be both correct and timely.
For this reason, the recovery block for a real-time program
should incorporate a watchdog timer which initiates
execution of Q if P does not produce an acceptable result
within the allocated time.
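A rough sketch of the watchdog idea, assuming a worker thread for P and a timed wait standing in for the timer interrupt. All names are hypothetical. Python cannot preempt a running thread, so a late P keeps running in the background and its result is simply ignored; a real-time system would rely on a timer interrupt instead.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_recovery_block(primary, alternate, acceptance_test, data, deadline_s):
    """Run P under a deadline; switch to Q on timeout, crash or failed test."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(primary, data)
    try:
        result = future.result(timeout=deadline_s)   # watchdog: bounded wait
        if acceptance_test(result):
            return result
    except FutureTimeout:
        pass                                         # P missed its deadline
    except Exception:
        pass                                         # P raised an error
    finally:
        executor.shutdown(wait=False)                # do not block on a late P
    result = alternate(data)
    if acceptance_test(result):
        return result
    raise RuntimeError("neither P nor Q produced an acceptable result")


def slow_primary(x):
    time.sleep(0.3)          # simulated overrun of the time allocation
    return ("P", x + 1)


def fast_primary(x):
    return ("P", x + 1)


def alternate(x):
    return ("Q", x + 1)


def plausible(result):
    return result[1] > 0     # assumed acceptance test for this example
```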
Recovery block for real-time application**.
(Program flow under direction of the application module is shown in solid lines; timer-triggered interrupts are shown in dashed lines.)
A single program is executed at any given time: no special demands on computer redundancy or
computer architecture are made.
Performance penalty in normal operation is small: the execution of the acceptance test.
Highlights ...
Storage requirements are expanded: in addition to the primary application program,
the acceptance test and the backup program must also be available in memory.
SW development cost is increased: need to generate two programs and the
associated acceptance test.
The Acceptance Test is divided into two groups of tests, which are invoked
before and after the execution of the primary routine:
Before: The first acceptance test checks on the call format and
parameters.
The second acceptance test checks on the validity of the input data. (When data errors are common, provision of an alternate data source may be considered; dashed lines in the figure indicate the backup data.)
After: The last acceptance test examines the output data.
Details about the Basic Recovery Block Structure ...
Internal structure for primary application module.
The integration of application modules structured as recovery blocks
into a fault-tolerant SW system is shown in the next figure.
“Application Modules” and the decision diamond labeled “Return”
together represent the structure shown in figure **.
In the absence of failures of the recovery blocks, the process will always remain in the inner loop.
If an abort is taken, the failure is recorded and some diagnostics may be performed. In case of a first failure in a recovery block, a retry may be initiated. If the failure persists, further execution of the task represented by the recovery block is suspended.
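One pass of such an executive can be sketched as below. This is an illustration under stated assumptions: each application module is modelled as a callable that returns True on a normal return and False on an abort, and the names executive_cycle, suspended and failure_log are hypothetical.

```python
def executive_cycle(modules, suspended, failure_log):
    """Run every active module once; retry once on abort, then suspend."""
    for name, module in modules.items():
        if name in suspended:
            continue                     # task already suspended
        if module():
            continue                     # normal return: stay in the inner loop
        failure_log.append(name)         # abort taken: record the failure
        if not module():                 # first failure: one retry
            failure_log.append(name)
            suspended.add(name)          # failure persists: suspend the task
```

A module that aborts once and succeeds on the retry is logged but keeps running; a module that aborts twice is logged twice and then skipped in later cycles.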
Executive and application modules.