Real Time Systems IX


Transcript of Real Time Systems IX

  • 7/27/2019 Real Time Systems IX

Slide 1/40

Fault-Tolerance in Real-Time Systems

    Sidra Rashid

    Bahria University, Islamabad Campus

    Lecture IX

  • Slide 2/40

    Fault Tolerance

What is fault tolerance? The ability of an operational system to tolerate the presence of faults.

    Why tolerate faults? It is impossible to completely test a practical-sized system. Therefore, it is important to implement techniques which allow a system to detect and tolerate faults during normal operation.

    The 4 phases of fault tolerance:
    Error detection: detection of an erroneous state
    Damage assessment: computes the severity of the fault
    Error processing: substitutes an error-free state for the erroneous one
    Fault treatment: determines the cause of the error, then runs fault passivation to ensure it doesn't happen again

    5/3/13

  • Slide 3/40

    Fault Classification

Nature of faults: distinguishes the intention of the fault (accidental vs. intentional)

    Persistence of faults: determines the duration of the fault state (permanent vs. temporary)

    Origin of faults: categorized into 3 types:
    Phenomenon: is the fault from a physical or human-made phenomenon?
    Extent: does the internal or external environment cause the fault?
    Phase: is the fault caused within the design or operation of the system?

    [Diagram: classification tree. Faults branch into Nature (accidental, intentional), Origin (Phenomenon: physical, human-made; Extent: internal, external; Phase: design, operation), and Persistence (permanent, temporary).]

  • Slide 4/40

    Software Fault Tolerance Techniques

The key to fault tolerance is redundancy, in three domains:
    Space: several hardware channels, each executing the same task
    Information: recover the system via data structures storing system contents
    Repetition: restarts a module in the event of a faulty module

    Two major schemes have evolved:

    Recovery Block (RB), a 1H/NdS/NT system: there is only one hardware channel (1H), and faults are tolerated by executing several diverse software modules (NdS) sequentially (NT).

    N-Version Programming (NVP), an NH/NdS/1T system: the system has a number of (identical) hardware channels (NH), each executing one of the diverse software versions (NdS), hence no redundancy in time (1T).

  • Slide 5/40

    Software Fault Tolerance Recovery Block

[Diagram: Recovery Block flow. A checkpoint is taken on entry; a switch routes execution to the primary module (alternates 1 through N-1 on standby). Each result goes to the acceptance test: if it passes, the result is delivered; if it fails, the state is restored from the checkpoint and, provided more alternates remain and the deadline is not exceeded, the switch selects the next alternate; otherwise a fault is signalled.]
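The flow above can be sketched as a short routine. The names `primary`, `alternates`, and `acceptance_test`, and the square-root example, are hypothetical stand-ins, and the checkpoint is modeled as a saved copy of the state:

```python
import copy

def recovery_block(state, primary, alternates, acceptance_test, deadline_ok=lambda: True):
    """Recovery block: checkpoint the state on entry, run the primary,
    and on acceptance-test failure restore the checkpoint and try the
    alternates in turn (while the deadline allows)."""
    checkpoint = copy.deepcopy(state)                   # establish the checkpoint
    for module in [primary] + list(alternates):
        if not deadline_ok():                           # deadline exceeded: stop trying
            break
        candidate = module(copy.deepcopy(checkpoint))   # run from the restored state
        if acceptance_test(candidate):                  # one test for all modules
            return candidate
    raise RuntimeError("recovery block failed: no module passed the test")

# A hypothetical buggy primary and a correct alternate.
buggy_sqrt = lambda s: {"x": s["x"], "root": -1.0}           # always wrong
safe_sqrt  = lambda s: {"x": s["x"], "root": s["x"] ** 0.5}  # correct alternate
accept = lambda s: s["root"] >= 0 and abs(s["root"] ** 2 - s["x"]) < 1e-9
result = recovery_block({"x": 9.0}, buggy_sqrt, [safe_sqrt], accept)
```

Because every module starts from a copy of the checkpoint, a failed module cannot corrupt the state seen by the next alternate.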

  • Slide 6/40

    Software Fault Tolerance Recovery Block

Considerations:
    Software diversity. Idea: different teams, one specification, different products, in the hope that the failure domains do not overlap.
    Difficulty of designing the acceptance test: a single test serves all modules of the recovery block, and the test is the most crucial element in improving reliability.
    Design of the recovery cache: it must be sufficiently simple to ensure it contains no faults.
    Increased system overhead.
    Domino effect: recovery blocks can push concurrent tasks that communicate into uncontrolled rollback.

  • Slide 7/40

    Software Fault Tolerance N-Version Programming

N-Version Programming (NH/NdS/1T):
    Several hardware channels
    Software-diverse versions of the code
    Results are voted upon
    The initial specification is crucial

    [Diagram: a switch feeds versions 1 through N, which execute in parallel; their results are synchronized and passed to a voter. Majority agreement: output is delivered. No agreement: failure.]
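A minimal sketch of the voter, assuming the N versions are plain functions run on one machine rather than separate hardware channels:

```python
from collections import Counter

def nvp_vote(versions, x):
    """Run N diverse versions on the same input and deliver the majority
    result; if no result reaches a majority, the voter signals failure."""
    results = [v(x) for v in versions]              # real NVP: parallel hardware channels
    value, count = Counter(results).most_common(1)[0]
    if count > len(versions) // 2:                  # strict majority required
        return value
    raise RuntimeError("no majority agreement")

# Three hypothetical versions of the same computation; one is faulty.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
out = nvp_vote(versions, 5)    # two versions return 25, one returns 10
```

Exact-match voting is the simplest decision mechanism; as the next slide notes, real voters often need a tolerance range when valid results are not bit-identical.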

  • Slide 8/40

Software Fault Tolerance N-Version Programming

    Considerations:
    Software diversity! It is difficult to create a good specification.
    Decision mechanism: some results will not always be identical (valid and invalid); defining a range of valid solutions helps, but decreases the distance from the acceptance-test approach.
    System overhead. Temporal: synchronization and the decision algorithm. Space: multiple hardware channels and space for multiple software versions.

    Extensions:
    Community error recovery (forward recovery): there is enough information in the good versions to recover the failed versions.

  • Slide 9/40

Software Fault Tolerance Consensus Recovery Block (CRB)

    NH/NdS/1T: a synthesis of N-version programming and the recovery block.

    Basic assumption: no similar errors will occur (erroneous results resembling each other), so if two or more versions agree, the result is considered correct.

    [Diagram: input is switched to versions 1 through N, whose results go to a voter. Agreement: output is delivered. No agreement: the results are submitted to an acceptance test (AT) one at a time while versions remain untried and the time limit has not expired; if none is accepted, failure.]
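A sketch of the CRB decision logic under the same simplifying assumptions (versions as plain functions, no real time limit); the versions and the acceptance test below are made up for illustration:

```python
from collections import Counter

def consensus_recovery_block(versions, x, acceptance_test):
    """CRB: try the NVP voter first; under the no-similar-errors
    assumption, two or more agreeing versions are taken as correct.
    If there is no agreement, fall back RB-style to an acceptance
    test over the individual results."""
    results = [v(x) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value                     # agreement: deliver the voted output
    for r in results:                    # no agreement: test results one at a time
        if acceptance_test(r):
            return r
    raise RuntimeError("CRB failure: no agreement and no acceptable result")

# Three hypothetical versions that all disagree; the acceptance test
# (also made up) rescues the one correct result.
versions = [lambda x: x + 1, lambda x: x + 3, lambda x: x * 2]
res = consensus_recovery_block(versions, 2, lambda r: r == 4)
```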

  • Slide 10/40

    Software Fault Tolerance Distributed Recovery Block

NH/NS/1T or NH/NdS/1T: reproducing the RB scheme on multiple network nodes.

    Considerations: synchronization between the nodes, especially during rollback.

    [Diagram: a primary node and a secondary node each run version A and version B with an acceptance test on the same input. On failure, each node checks whether more alternates remain and the deadline is not exceeded and retries; accepted results are delivered by the primary node, with the secondary node standing by.]

  • Slide 11/40

    Extended Distributed Recovery Block

Heartbeat scheme with an active node, a shadow node, and a supervisor node.

    Each node contains: a primary version, an alternate version, an acceptance test, and device drivers.

    [Diagram: the active and shadow nodes each contain a node executive, a recovery manager, the primary version, the alternate version, the acceptance test, and device drivers, and both connect to the system. The nodes exchange heartbeats; heartbeat/reset requests and consent are coordinated with the supervisor.]
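The shadow node's side of the heartbeat scheme might be sketched as follows; the `ShadowNode` class and the timeout value are illustrative assumptions, and the supervisor's consent step is omitted:

```python
class ShadowNode:
    """Shadow node in the heartbeat scheme: it stays passive while the
    active node's heartbeats keep arriving, and takes over when they
    stop for longer than the timeout (supervisor consent omitted)."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout          # illustrative heartbeat timeout, seconds
        self.last_heartbeat = 0.0
        self.role = "shadow"

    def on_heartbeat(self, now):
        self.last_heartbeat = now       # active node reported in

    def tick(self, now):
        # Promote to active once the heartbeat has been silent too long.
        if self.role == "shadow" and now - self.last_heartbeat > self.timeout:
            self.role = "active"
        return self.role

node = ShadowNode(timeout=3.0)
node.on_heartbeat(1.0)
role_fresh = node.tick(2.0)    # heartbeat only 1 s old: stay shadow
role_stale = node.tick(5.0)    # 4 s of silence exceeds the timeout: take over
```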

  • Slide 12/40

  • Slide 13/40

Roll-Forward Checkpointing Scheme

    Used for multiprocessor systems.

    Pool of active processing modules, each with a processor, volatile storage, and stable storage.

    Checkpoint processor: detects module failures by comparing the states of each pair of processing modules that perform the same task.

    The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor. The checkpoint processor compares the states; if they match, the new checkpoint is considered correct and replaces the old checkpoint.
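The checkpoint processor's compare-and-accept step can be sketched as follows; the state dictionaries are stand-ins for the modules' checkpointed states:

```python
def checkpoint_compare(state_a, state_b, old_checkpoint):
    """One step of the checkpoint processor: compare the states sent by a
    pair of processing modules running the same task. If they match, the
    new state becomes the stored checkpoint; if not, a fault is flagged
    and the old checkpoint is retained for recovery."""
    if state_a == state_b:
        return state_a, False       # new checkpoint accepted, no fault detected
    return old_checkpoint, True     # mismatch: keep old checkpoint, flag fault

# Matching states advance the checkpoint; a mismatch flags a failure.
cp, fault = checkpoint_compare({"pc": 10}, {"pc": 10}, {"pc": 0})
cp2, fault2 = checkpoint_compare({"pc": 11}, {"pc": 99}, cp)
```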

  • Slide 14/40

  • Slide 15/40

  • Slide 16/40

    N Self-Checking Program

Made up of several self-checking components, which are in turn made up of different variants (equivalent to alternates in RB and versions in NVP) of the software.

    The components execute in parallel; fault tolerance is provided by this parallel execution, and each component is responsible for determining whether a delivered result is acceptable.

    A self-checking component is made up in one of two ways: (a) each variant is associated with an acceptance test which tests the results of the variant, or (b) variants are paired together and associated with a comparison algorithm.
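A minimal sequential sketch of variant (a), where each component pairs a variant with its own acceptance test (a real system would run the components in parallel); the variants below are made up for illustration:

```python
def n_self_checking(components, x):
    """Variant (a) of N self-checking programming: each component runs
    its variant and judges the result with its own acceptance test; the
    first acceptable result is delivered. (Sequential stand-in for what
    would be parallel execution.)"""
    for variant, accept in components:
        result = variant(x)
        if accept(result):          # the component judges its own result
            return result
    raise RuntimeError("no component delivered an acceptable result")

components = [
    (lambda x: x - 1, lambda r: r > 10),   # hypothetical faulty variant
    (lambda x: x * 2, lambda r: r > 10),   # variant whose result passes its test
]
out = n_self_checking(components, 6)       # first variant yields 5 and fails; second yields 12
```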

  • Slide 17/40

  • Slide 18/40

Data Diversity

    Retry block:
    Executes the algorithm normally and checks the result with an acceptance test
    If the results are accepted by the test, execution is complete
    If the results are not accepted, the algorithm runs again once the input data has been re-expressed

    N-copy programming:
    Upon entry to the block, the data is re-expressed in N-1 ways, creating N different data sets
    The copies execute in parallel
    The output is selected with a voting scheme
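The retry block can be sketched as follows; `fragile`, the acceptance test, and the perturbation used as a re-expression are all illustrative assumptions:

```python
def retry_block(algorithm, acceptance_test, data, re_express, max_tries=3):
    """Data-diversity retry block: run the algorithm on the original
    input; on acceptance-test failure, re-express the input into a
    logically equivalent form and retry."""
    for _ in range(max_tries):
        result = algorithm(data)
        if acceptance_test(result):
            return result
        data = re_express(data)     # restate the input, keep the same algorithm
    raise RuntimeError("retry block exhausted its attempts")

# Hypothetical algorithm with a failure region at exactly 0.0; a tiny
# perturbation serves as the re-expression and sidesteps the failure.
fragile = lambda x: None if x == 0.0 else x * x
res = retry_block(fragile, lambda r: r is not None, 0.0, lambda x: x + 1e-9)
```

Unlike design diversity, the same algorithm is reused; the diversity comes entirely from the restated input data.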

  • Slide 19/40

  • Slide 20/40

  • Slide 21/40

    Summary

Fault-tolerant design considerations:
    Anticipated faults: in most cases, a simple acceptance test is all that is needed.
    Unanticipated faults: designers must decide what the most practical solution is.

    Most of the techniques in this report are hardware-based, and many designers will not be able to use them. This leaves designers with:
    Recovery blocks (software design diversity)
    Retry blocks (data diversity)

  • Slide 22/40

Fault-Tolerance in Real-Time Databases

  • Slide 23/40

    Overview

The causes of downtime
    Availability solutions:
    CASE 1: Clustra
    CASE 2: TelORB
    CASE 3: RODAIN

  • Slide 24/40

    The Causes of Downtime

Planned downtime: hardware expansion, database software upgrades, operating system upgrades.

    Unplanned downtime: hardware failure, OS failure, database software bugs, power failure, disaster, human error.

  • Slide 25/40

Traditional Availability Solutions

    Replication: the standby system needs to duplicate transactions as they occur on the primary system. Ideally, this replication is done in near-real time, so the standby system is very close to current in the event of a primary system failure.

    Failover: failover is the moment of truth. When a failure occurs on the primary system, all connections must be re-established on the standby, and all active transactions must be rolled back and restarted. Because everything must be transferred, typical failover times are measured in minutes at best, during which time the database is unavailable.

    Primary restart: once the standby system takes over, there is no longer a standby. This is an especially vulnerable period, so the primary must be restarted as quickly as possible. In some schemes the primary becomes the new standby, and in other schemes processing must, at some point, be switched back to the primary.

  • Slide 26/40

CASE 1: Clustra

    Developed for telephony applications such as mobility management and intelligent networks.
    Relational database with location and replication transparency.
    Real-time data is locked in main memory, and the API provides precompiled transactions.
    NOT a real-time database!

  • Slide 27/40

    Clustra hardware architecture

  • Slide 28/40

    Data distribution and replication

  • Slide 29/40

How Clustra Handles Failures

    Real-time failover: hot-standby data is up to date, so failover occurs in milliseconds.
    Automatic restart and takeback: restart of the failed node and takeback of operations is automatic, and again transparent to users and operators.
    Self-repair: if a node fails completely, data is copied from the complementary node to a standby. This is also automatic and transparent.
    Limited failure effects.

  • Slide 30/40

How Clustra Handles Upgrades

    Hardware, operating system, and database software can be upgraded without ever going down, via a process called rolling upgrade:
    The required changes are performed node by node.
    Each node is upgraded, then catches up to the status of its complementary node.
    When this is completed, the operation proceeds to the next node.

  • Slide 31/40

CASE 2: TelORB

    Characteristics:
    Very high availability (HA); robustness implemented in software
    (Soft) real time
    Scalability by using loosely coupled processors

    Openness:
    Hardware: Intel/Pentium
    Languages: C++, Java
    Interoperability: CORBA/IIOP, TCP/IP, Java RMI
    3rd-party SW: Java

  • Slide 32/40

TelORB Availability

    Real-time object-oriented DBMS supporting:
    Distributed transactions
    The ACID properties expected from a DBMS
    Data replication (providing redundancy)
    Network redundancy
    Software configuration control

    Self-healing:
    Automatic restart, on the working processors, of processes that originally executed on a faulty processor
    In-service upgrade of software with no disturbance to operation
    Hot replacement of faulty processors

  • Slide 33/40

Automatic Reconfiguration

    [Figure: automatic reconfiguration (reloading)]

  • Slide 34/40

    Software upgrade

Smooth software upgrade when the old and new versions of the same process can coexist. The application can arrange for state transfer between the old and new static process (unless the important state is already stored in the database).

  • Slide 35/40

Partitioning: Types and Data

    [Diagram: data items 17, 18, 21, 22 and 19, 20 partitioned across types A and B and replicated between nodes.]

  • Slide 36/40

    Advantages

Standard interfaces through CORBA
    Standard languages: C++, Java
    Based on commercial hardware
    (Soft) real-time OS
    Fault tolerance implemented in software
    Fully scalable architecture
    Includes powerful middleware: a database management system and functions for software management
    Fully compatible simulated environment for development on Unix/Linux/NT workstations

  • Slide 37/40

    CASE 3: RODAIN

Real-Time Object-Oriented Database Architecture for Intelligent Networks
    A real-time main-memory database system
    Runs on a real-time OS: Linux

  • Slide 38/40

    Rodain Cluster

  • Slide 39/40

Rodain Database Node

    [Diagram: the database primary unit and the database mirror unit each contain a User Request Interpreter Subsystem, a Watchdog Subsystem, a Distributed Database Subsystem, a Fault-Tolerance and Recovery Subsystem, and an Object-Oriented Database Management Subsystem; the two units share a disk.]

  • Slide 40/40

RODAIN Database Node II

    [Diagram: same subsystem layout as the previous slide: primary and mirror units with User Request Interpreter, Watchdog, Distributed Database, Fault-Tolerance and Recovery, and Object-Oriented Database Management Subsystems, sharing a disk.]