Real Time Systems IX


Transcript of Real Time Systems IX

  • 7/27/2019 Real Time Systems IX

Slide 1/40

Fault-Tolerance in Real-Time Systems

    Sidra Rashid

    Bahria University, Islamabad Campus

    Lecture IX

  • Slide 2/40

    Fault Tolerance

What is fault tolerance? The ability of an operational system to tolerate the presence of faults.

    Why tolerate faults? It is impossible to completely test a practical-sized system. Therefore, it is important to implement techniques which allow a system to detect and tolerate faults during normal operation.

    The 4 phases of fault tolerance:
    Error detection: detection of an erroneous state
    Damage assessment: computes the severity of the fault
    Error processing: substitutes an error-free state for the erroneous one
    Fault treatment: determines the cause of the error, then runs fault passivation to ensure it doesn't happen again

    5/3/13

  • Slide 3/40

    Fault Classification

Nature of faults: distinguishes the intention of the fault (accidental vs. intentional)

    Persistence of faults: determines the duration of the fault state (permanent vs. temporary)

    Origin of faults: categorized into 3 types:
    Phenomenon: is the fault from a physical or human-made phenomenon?
    Extent: does the internal or external environment cause the fault?
    Phase: is the fault caused within the design or operation of the system?

    [Diagram: classification tree. Faults branch into Nature (accidental, intentional), Origin (Phenomenon: physical, human-made; Extent: internal, external; Phase: design, operation), and Persistence (permanent, temporary).]

  • Slide 4/40

    Software Fault Tolerance Techniques

The key to fault tolerance is redundancy, in three domains:
    Space: several hardware channels, each executing the same task
    Information: recover the system via data structures storing system contents
    Repetition: restarts a module in the event of a faulty module

    Two major schemes have evolved:

    Recovery Block (RB), a 1H/NdS/NT system: there is only one hardware channel (1H), and faults are tolerated by executing several diverse software modules (NdS) sequentially (NT).

    N-Version Programming (NVP), an NH/NdS/1T system: the system has a number of (identical) hardware channels (NH), each executing one of the diverse software versions (NdS), hence no redundancy in time (1T).

  • Slide 5/40

    Software Fault Tolerance Recovery Block

[Diagram: Recovery Block flow. A checkpoint is taken on entry; a switch routes execution to the primary module (alternates 1 through N-1 on standby). Each result goes to the acceptance test: if it passes, the result is delivered; if it fails, the state is restored from the checkpoint and, provided more alternates remain and the deadline is not exceeded, the switch selects the next alternate; otherwise a fault is signalled.]
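The flow above can be sketched as a short routine. The names `primary`, `alternates`, and `acceptance_test`, and the square-root example, are hypothetical stand-ins, and the checkpoint is modeled as a saved copy of the state:

```python
import copy

def recovery_block(state, primary, alternates, acceptance_test, deadline_ok=lambda: True):
    """Recovery block: checkpoint the state on entry, run the primary,
    and on acceptance-test failure restore the checkpoint and try the
    alternates in turn (while the deadline allows)."""
    checkpoint = copy.deepcopy(state)                   # establish the checkpoint
    for module in [primary] + list(alternates):
        if not deadline_ok():                           # deadline exceeded: stop trying
            break
        candidate = module(copy.deepcopy(checkpoint))   # run from the restored state
        if acceptance_test(candidate):                  # one test for all modules
            return candidate
    raise RuntimeError("recovery block failed: no module passed the test")

# A hypothetical buggy primary and a correct alternate.
buggy_sqrt = lambda s: {"x": s["x"], "root": -1.0}           # always wrong
safe_sqrt  = lambda s: {"x": s["x"], "root": s["x"] ** 0.5}  # correct alternate
accept = lambda s: s["root"] >= 0 and abs(s["root"] ** 2 - s["x"]) < 1e-9
result = recovery_block({"x": 9.0}, buggy_sqrt, [safe_sqrt], accept)
```

Because every module starts from a copy of the checkpoint, a failed module cannot corrupt the state seen by the next alternate.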

  • Slide 6/40

    Software Fault Tolerance Recovery Block

Considerations:
    Software diversity. Idea: different teams, one specification, different products, in the hope that the failure domains do not overlap.
    Difficulty of designing the acceptance test: a single test serves all modules of the recovery block, and the test is the most crucial element in improving reliability.
    Design of the recovery cache: it must be sufficiently simple to ensure it contains no faults.
    Increased system overhead.
    Domino effect: recovery blocks can push concurrent tasks that communicate into uncontrolled rollback.

  • Slide 7/40

    Software Fault Tolerance N-Version Programming

N-Version Programming (NH/NdS/1T):
    Several hardware channels
    Software-diverse versions of the code
    Results are voted upon
    The initial specification is crucial

    [Diagram: a switch feeds versions 1 through N, which execute in parallel; their results are synchronized and passed to a voter. Majority agreement: output is delivered. No agreement: failure.]
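A minimal sketch of the voter, assuming the N versions are plain functions run on one machine rather than separate hardware channels:

```python
from collections import Counter

def nvp_vote(versions, x):
    """Run N diverse versions on the same input and deliver the majority
    result; if no result reaches a majority, the voter signals failure."""
    results = [v(x) for v in versions]              # real NVP: parallel hardware channels
    value, count = Counter(results).most_common(1)[0]
    if count > len(versions) // 2:                  # strict majority required
        return value
    raise RuntimeError("no majority agreement")

# Three hypothetical versions of the same computation; one is faulty.
versions = [lambda x: x * x, lambda x: x ** 2, lambda x: x + x]
out = nvp_vote(versions, 5)    # two versions return 25, one returns 10
```

Exact-match voting is the simplest decision mechanism; as the next slide notes, real voters often need a tolerance range when valid results are not bit-identical.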

  • Slide 8/40

Software Fault Tolerance N-Version Programming

    Considerations:
    Software diversity! It is difficult to create a good specification.
    Decision mechanism: some results will not always be identical (valid and invalid); defining a range of valid solutions helps, but decreases the distance from the acceptance-test approach.
    System overhead. Temporal: synchronization and the decision algorithm. Space: multiple hardware channels and space for multiple software versions.

    Extensions:
    Community error recovery (forward recovery): there is enough information in the good versions to recover the failed versions.

  • Slide 9/40

Software Fault Tolerance Consensus Recovery Block (CRB)

    NH/NdS/1T: a synthesis of N-version programming and the recovery block.

    Basic assumption: no similar errors will occur (erroneous results resembling each other), so if two or more versions agree, the result is considered correct.

    [Diagram: input is switched to versions 1 through N, whose results go to a voter. Agreement: output is delivered. No agreement: the results are submitted to an acceptance test (AT) one at a time while versions remain untried and the time limit has not expired; if none is accepted, failure.]
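A sketch of the CRB decision logic under the same simplifying assumptions (versions as plain functions, no real time limit); the versions and the acceptance test below are made up for illustration:

```python
from collections import Counter

def consensus_recovery_block(versions, x, acceptance_test):
    """CRB: try the NVP voter first; under the no-similar-errors
    assumption, two or more agreeing versions are taken as correct.
    If there is no agreement, fall back RB-style to an acceptance
    test over the individual results."""
    results = [v(x) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value                     # agreement: deliver the voted output
    for r in results:                    # no agreement: test results one at a time
        if acceptance_test(r):
            return r
    raise RuntimeError("CRB failure: no agreement and no acceptable result")

# Three hypothetical versions that all disagree; the acceptance test
# (also made up) rescues the one correct result.
versions = [lambda x: x + 1, lambda x: x + 3, lambda x: x * 2]
res = consensus_recovery_block(versions, 2, lambda r: r == 4)
```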

  • Slide 10/40

    Software Fault Tolerance Distributed Recovery Block

NH/NS/1T or NH/NdS/1T: reproducing the RB scheme on multiple network nodes.

    Considerations: synchronization between the nodes, especially during rollback.

    [Diagram: a primary node and a secondary node each run version A and version B with an acceptance test on the same input. On failure, each node checks whether more alternates remain and the deadline is not exceeded and retries; accepted results are delivered by the primary node, with the secondary node standing by.]

  • Slide 11/40

    Extended Distributed Recovery Block

Heartbeat scheme with an active node, a shadow node, and a supervisor node.

    Each node contains: a primary version, an alternate version, an acceptance test, and device drivers.

    [Diagram: the active and shadow nodes each contain a node executive, a recovery manager, the primary version, the alternate version, the acceptance test, and device drivers, and both connect to the system. The nodes exchange heartbeats; heartbeat/reset requests and consent are coordinated with the supervisor.]
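The shadow node's side of the heartbeat scheme might be sketched as follows; the `ShadowNode` class and the timeout value are illustrative assumptions, and the supervisor's consent step is omitted:

```python
class ShadowNode:
    """Shadow node in the heartbeat scheme: it stays passive while the
    active node's heartbeats keep arriving, and takes over when they
    stop for longer than the timeout (supervisor consent omitted)."""

    def __init__(self, timeout=3.0):
        self.timeout = timeout          # illustrative heartbeat timeout, seconds
        self.last_heartbeat = 0.0
        self.role = "shadow"

    def on_heartbeat(self, now):
        self.last_heartbeat = now       # active node reported in

    def tick(self, now):
        # Promote to active once the heartbeat has been silent too long.
        if self.role == "shadow" and now - self.last_heartbeat > self.timeout:
            self.role = "active"
        return self.role

node = ShadowNode(timeout=3.0)
node.on_heartbeat(1.0)
role_fresh = node.tick(2.0)    # heartbeat only 1 s old: stay shadow
role_stale = node.tick(5.0)    # 4 s of silence exceeds the timeout: take over
```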

  • Slide 12/40

  • Slide 13/40

Roll-Forward Checkpointing Scheme

    Used for multiprocessor systems.

    Pool of active processing modules, each with a processor, volatile storage, and stable storage.

    Checkpoint processor: detects module failures by comparing the states of each pair of processing modules that perform the same task.

    The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor. The checkpoint processor compares the states; if they match, the new checkpoint is considered correct and replaces the old checkpoint.
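The checkpoint processor's compare-and-accept step can be sketched as follows; the state dictionaries are stand-ins for the modules' checkpointed states:

```python
def checkpoint_compare(state_a, state_b, old_checkpoint):
    """One step of the checkpoint processor: compare the states sent by a
    pair of processing modules running the same task. If they match, the
    new state becomes the stored checkpoint; if not, a fault is flagged
    and the old checkpoint is retained for recovery."""
    if state_a == state_b:
        return state_a, False       # new checkpoint accepted, no fault detected
    return old_checkpoint, True     # mismatch: keep old checkpoint, flag fault

# Matching states advance the checkpoint; a mismatch flags a failure.
cp, fault = checkpoint_compare({"pc": 10}, {"pc": 10}, {"pc": 0})
cp2, fault2 = checkpoint_compare({"pc": 11}, {"pc": 99}, cp)
```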

  • Slide 14/40

  • Slide 15/40

  • Slide 16/40

    N Self-Checking Program

Made up of several self-checking components, which are in turn made up of different variants (equivalent to alternates in RB and versions in NVP) of the software.

    The components execute in parallel; fault tolerance is provided by this parallel execution, and each component is responsible for determining whether a delivered result is acceptable.

    A self-checking component is made up in one of two ways: (a) each variant is associated with an acceptance test which tests the results of the variant, or (b) variants are paired together and associated with a comparison algorithm.
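A minimal sequential sketch of variant (a), where each component pairs a variant with its own acceptance test (a real system would run the components in parallel); the variants below are made up for illustration:

```python
def n_self_checking(components, x):
    """Variant (a) of N self-checking programming: each component runs
    its variant and judges the result with its own acceptance test; the
    first acceptable result is delivered. (Sequential stand-in for what
    would be parallel execution.)"""
    for variant, accept in components:
        result = variant(x)
        if accept(result):          # the component judges its own result
            return result
    raise RuntimeError("no component delivered an acceptable result")

components = [
    (lambda x: x - 1, lambda r: r > 10),   # hypothetical faulty variant
    (lambda x: x * 2, lambda r: r > 10),   # variant whose result passes its test
]
out = n_self_checking(components, 6)       # first variant yields 5 and fails; second yields 12
```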

  • Slide 17/40

  • Slide 18/40

Data Diversity

    Retry block:
    Executes the algorithm normally and checks the result with an acceptance test
    If the results are accepted by the test, execution is complete
    If the results are not accepted, the algorithm runs again once the input data has been re-expressed

    N-copy programming:
    Upon entry to the block, the data is re-expressed in N-1 ways, creating N different data sets
    The copies execute in parallel
    The output is selected with a voting scheme
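The retry block can be sketched as follows; `fragile`, the acceptance test, and the perturbation used as a re-expression are all illustrative assumptions:

```python
def retry_block(algorithm, acceptance_test, data, re_express, max_tries=3):
    """Data-diversity retry block: run the algorithm on the original
    input; on acceptance-test failure, re-express the input into a
    logically equivalent form and retry."""
    for _ in range(max_tries):
        result = algorithm(data)
        if acceptance_test(result):
            return result
        data = re_express(data)     # restate the input, keep the same algorithm
    raise RuntimeError("retry block exhausted its attempts")

# Hypothetical algorithm with a failure region at exactly 0.0; a tiny
# perturbation serves as the re-expression and sidesteps the failure.
fragile = lambda x: None if x == 0.0 else x * x
res = retry_block(fragile, lambda r: r is not None, 0.0, lambda x: x + 1e-9)
```

Unlike design diversity, the same algorithm is reused; the diversity comes entirely from the restated input data.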

  • Slide 19/40

  • Slide 20/40

  • Slide 21/40

    Summary

Fault-tolerant design considerations:
    Anticipated faults: in most cases, a simple acceptance test is all that is needed.
    Unanticipated faults: designers must decide what the most practical solution is.

    Most of the techniques in this report are hardware-based, and many designers will not be able to use them. This leaves designers with:
    Recovery blocks (software design diversity)
    Retry blocks (data diversity)

  • Slide 22/40

Fault-Tolerance in Real-Time Databases

  • Slide 23/40

    Overview

The causes of downtime
    Availability solutions:
    CASE 1: Clustra
    CASE 2: TelORB
    CASE 3: RODAIN

  • Slide 24/40

    The Causes of Downtime

Planned downtime: hardware expansion, database software upgrades, operating system upgrades.

    Unplanned downtime: hardware failure, OS failure, database software bugs, power failure, disaster, human error.

  • Slide 25/40

Traditional Availability Solutions

    Replication: the standby system needs to duplicate transactions as they occur on the primary system. Ideally, this replication is done in near-real time, so the standby system is very close to current in the event of a primary system failure.

    Failover: failover is the moment of truth. When a failure occurs on the primary system, all connections must be re-established on the standby, and all active transactions must be rolled back and restarted. Because everything must be transferred, typical failover times are measured in minutes at best, during which time the database is unavailable.

    Primary restart: once the standby system takes over, there is no longer a standby. This is an especially vulnerable period, so the primary must be restarted as quickly as possible. In some schemes the primary becomes the new standby, and in other schemes processing must, at some point, be switched back to the primary.

  • Slide 26/40

CASE 1: Clustra

    Developed for telephony applications such as mobility management and intelligent networks.
    Relational database with location and replication transparency.
    Real-time data is locked in main memory, and the API provides precompiled transactions.
    NOT a real-time database!

  • Slide 27/40

    Clustra hardware architecture

  • Slide 28/40

    Data distribution and replication

  • Slide 29/40

How Clustra Handles Failures

    Real-time failover: hot-standby data is up to date, so failover occurs in milliseconds.
    Automatic restart and takeback: restart of the failed node and takeback of operations is automatic, and again transparent to users and operators.
    Self-repair: if a node fails completely, data is copied from the complementary node to a standby. This is also automatic and transparent.
    Limited failure effects.

  • Slide 30/40

How Clustra Handles Upgrades

    Hardware, operating system, and database software can be upgraded without ever going down, via a process called rolling upgrade:
    The required changes are performed node by node.
    Each node is upgraded, then catches up to the status of its complementary node.
    When this is completed, the operation proceeds to the next node.

  • Slide 31/40

CASE 2: TelORB

    Characteristics:
    Very high availability (HA); robustness implemented in software
    (Soft) real time
    Scalability by using loosely coupled processors

    Openness:
    Hardware: Intel/Pentium
    Languages: C++, Java
    Interoperability: CORBA/IIOP, TCP/IP, Java RMI
    3rd-party SW: Java

  • Slide 32/40

TelORB Availability

    Real-time object-oriented DBMS supporting:
    Distributed transactions
    The ACID properties expected from a DBMS
    Data replication (providing redundancy)
    Network redundancy
    Software configuration control

    Self-healing:
    Automatic restart, on the working processors, of processes that originally executed on a faulty processor
    In-service upgrade of software with no disturbance to operation
    Hot replacement of faulty processors

  • Slide 33/40

Automatic Reconfiguration

    [Figure: automatic reconfiguration (reloading)]

  • Slide 34/40

    Software upgrade

Smooth software upgrade when the old and new versions of the same process can coexist. The application can arrange for state transfer between the old and new static process (unless the important state is already stored in the database).

  • Slide 35/40

Partitioning: Types and Data

    [Diagram: data items 17, 18, 21, 22 and 19, 20 partitioned across types A and B and replicated between nodes.]

  • Slide 36/40

    Advantages

Standard interfaces through CORBA
    Standard languages: C++, Java
    Based on commercial hardware
    (Soft) real-time OS
    Fault tolerance implemented in software
    Fully scalable architecture
    Includes powerful middleware: a database management system and functions for software management
    Fully compatible simulated environment for development on Unix/Linux/NT workstations

  • Slide 37/40

    CASE 3: RODAIN

Real-Time Object-Oriented Database Architecture for Intelligent Networks
    A real-time main-memory database system
    Runs on a real-time OS: Linux

  • Slide 38/40

    Rodain Cluster

  • Slide 39/40

Rodain Database Node

    [Diagram: the database primary unit and the database mirror unit each contain a User Request Interpreter Subsystem, a Watchdog Subsystem, a Distributed Database Subsystem, a Fault-Tolerance and Recovery Subsystem, and an Object-Oriented Database Management Subsystem; the two units share a disk.]

  • Slide 40/40

RODAIN Database Node II

    [Diagram: same subsystem layout as the previous slide: primary and mirror units with User Request Interpreter, Watchdog, Distributed Database, Fault-Tolerance and Recovery, and Object-Oriented Database Management Subsystems, sharing a disk.]