Vargas@computer.org Exploiting HW+SW Partitioning for Reliable Embedded Systems Part 2.

Date posted: 20-Dec-2015

Transcript of Vargas@computer.org Exploiting HW+SW Partitioning for Reliable Embedded Systems Part 2.

Page 1: Vargas@computer.org Exploiting HW+SW Partitioning for Reliable Embedded Systems Part 2.

Vargas@computer.org

Exploiting HW+SW Partitioning for Reliable Embedded Systems

Part 2

Page 2:

Summary

1. Introduction: targeting the problem

2. The Possible Solution

2.1. SW-Based Fault Detection Mechanisms

2.2. Migrating SW-Based Fault Detection Mechanisms into HW

3. Experimental Evaluation

4. Final Considerations

Page 3:

1. Introduction: targeting the problem

The increasing number of computer-based critical applications raises the question of which techniques can guarantee a sufficient degree of reliability while keeping design and manufacturing costs reasonable.

Page 4:

1. Introduction: targeting the problem

Techniques commonly used (on-chip and system level): stand-alone solutions

Fault-Tolerance Techniques (HW, SW, Time or Info domains):

• Duplication/Voter, TMR
• Layout-Driven Fault Avoidance
• Watch-Dogs
• Consistency Checks
• Capability Checks
• Re-computation
• EDAC

Page 5:

1. Introduction: targeting the problem

Techniques commonly used (on-chip and system level): stand-alone solutions

Fault-Tolerance Techniques (HW, SW, Time or Info domains):

• Duplication/Voter, TMR
• Layout-Driven Fault Avoidance
• Watch-Dog Timer
• Consistency Checks
• Capability Checks
• Re-computation
• EDAC

Impacts design: performance, weight, size/volume, power consumption, reliability.


Page 7:

1. Introduction: targeting the problem

HW Techniques: Disadvantages:

• High area overhead
• High development/fab cost

SW Techniques: Disadvantages:

• Significant performance degradation
• Memory overhead

Page 8:

2. The Possible Solution

Developing a hybrid methodology (HW+SW redundancies) able to detect errors at runtime in μprocessor-based SoCs may offer a very good cost-benefit return.

Page 9:

2. The Possible Solution

Returns:

• Minimization of area overhead and fab/development costs (benefits of SW-based redundancy techniques)
• Improvement of performance and minimization of memory overhead (benefits of HW-based redundancy techniques)

In summary: minimize fab cost and performance degradation, while improving reliability.

Target faults:

• Control flow errors
• Data handling errors

Page 10:

2. The Possible Solution

Hybrid methodology (HW+SW redundancies) explores:

• I-IP Core Architecture

• Software-Based Techniques

Page 11:

2. The Possible Solution

HW+SW SoC FT Architecture:

[Figure: SoC block diagram. Blocks: µP IP, Memory IP, Custom IP, I/O port and WDT / I-IP, connected by a bus; output: Mismatch signal.]

Figure annotations:

• Stores a hardened program
• Computes at run time and stores control flow signatures and data read from memory
• Information flow traveling on the bus

Page 12:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Data:

• Cerberus (Matteo et al.)

Faults Affecting Control:

• ECCA (Matteo et al.)
• CFCSS (McCluskey et al.)
• ECI (Miremadi et al.)

Page 13:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Data: Cerberus (Matteo et al.)

Original Code:

    a = b;

    a = b + c;

Modified Code:

    a0 = b0;
    a1 = b1;
    if (b0 != b1)
        error();

    a0 = b0 + c0;
    a1 = b1 + c1;
    if ((b0 != b1) || (c0 != c1))
        error();

Code modification for errors affecting data.
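The duplication rule shown on this slide can be tried on a host machine. The following is a minimal sketch, assuming a hypothetical `dup_int` pair type and a `report_error()` handler that only counts detections; none of these names come from the slide.

```c
/* Sketch of variable duplication with a consistency check.
   dup_int, dup_make, dup_add, report_error and error_count are
   illustrative names, not part of the original technique's code. */

int error_count = 0;            /* number of mismatches detected so far */

void report_error(void) { error_count++; }

typedef struct {
    int v0;                     /* first copy  */
    int v1;                     /* second copy */
} dup_int;

/* Build a duplicated variable from a plain value. */
dup_int dup_make(int x)
{
    dup_int d = { x, x };
    return d;
}

/* a = b + c, computed on both copies; the operands are checked after use. */
dup_int dup_add(dup_int b, dup_int c)
{
    dup_int a;
    a.v0 = b.v0 + c.v0;
    a.v1 = b.v1 + c.v1;
    if ((b.v0 != b.v1) || (c.v0 != c.v1))
        report_error();
    return a;
}
```

Flipping a bit in one copy of an operand before calling `dup_add` makes the two copies disagree, which is the situation the check is meant to catch.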

Page 14:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Data: Cerberus (Matteo et al.)

Original Code:

    res = search(a);
    …
    int search(int p)
    { int q;
      …
      q = p + 1;
      …
      return(1);
    }

Modified Code:

    search(a0, a1, &res0, &res1);
    …
    void search(int p0, int p1, int *r0, int *r1)
    { int q0, q1;
      …
      q0 = p0 + 1;
      q1 = p1 + 1;
      if (p0 != p1)
          error();
      …
      *r0 = 1;
      *r1 = 1;
      return;
    }

Code transformation for errors affecting procedure parameters.

Page 15:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: ECCA - (Error Control-Flow Checking using Assertions) (Matteo et al.)

Original Code:

    /* Basic Block beginning */
    …
    /* Basic Block end */

Modified Code:

    /* Basic Block beginning #371 */
    ecf = 371;
    …
    if (ecf != 371)
        error();
    /* Basic Block end */

Example: detecting errors that cause branches that are not allowed.
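A small host-side model shows how the block identifier catches a jump into the middle of a block. This is only a sketch: `ecf` mirrors the slide's variable, while `block_371`, its `entered_at_start` flag and `report_error` are hypothetical names introduced for illustration.

```c
/* Sketch of a basic-block identifier check. */

int ecf = 0;                    /* identifier of the block being executed */
int error_count = 0;            /* number of detections so far */

void report_error(void) { error_count++; }

/* Body of basic block #371. entered_at_start == 0 models an erroneous
   jump that lands inside the block and skips the assignment to ecf. */
int block_371(int x, int entered_at_start)
{
    if (entered_at_start)
        ecf = 371;              /* set at block beginning */

    x = x + 1;                  /* stands in for the original block body */

    if (ecf != 371)             /* checked before block end */
        report_error();
    return x;
}
```

When the block is entered normally the check passes; when it is entered past the assignment, `ecf` still holds the identifier of some other block and the check fails.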

Page 16:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: ECCA - (Error Control-Flow Checking using Assertions) (Matteo et al.)

Original Code:

    if (condition)
    { /* Block A */
      …
    }
    else
    { /* Block B */
      …
    }

Modified Code:

    if (condition)
    { /* Block A */
      if (!condition)
          error();
      …
    }
    else
    { /* Block B */
      if (condition)
          error();
      …
    }

Code transformation for a test statement

Page 17:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: ECCA - (Error Control-Flow Checking using Assertions) (Matteo et al.)

In summary

To harden a given program, this approach introduces the following assertions into each basic block vj:

• Test Assertion: checks the signature of basic block vj, verifying whether vi belongs to pred(vj).

• Set Assertion: updates the signature, setting it to the value Bj associated with vj.

Bj = (Bi XOR M1) XOR M2

Page 18:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: ECCA - (Error Control-Flow Checking using Assertions) (Matteo et al.)

    01: while(k1<DIM)
    02: {
    03: if( != M1 && != M2 )
    04: //Error detected
    05: A1 = matrixA1[i1][k1];
    06: B1 = matrixB1[k1][j1];
    07: C1 += A1*B1;
    08: matrixC1[i1][j1] = C1;
    09: k1++;
    10: j =(i ^M1)^M2;
    11: }

Page 19:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Principle: Modification of a Basic Block

Page 20:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Basically, the approach consists of six steps:

1) Divide the program into basic blocks. A basic block is a minimal set of ordered instructions whose execution begins at the first instruction and terminates at the last instruction. There is no branching instruction in a basic block except possibly the last one. A basic block terminates at either an instruction branching to another basic block or an instruction receiving transfer of control flow (CF) from two or more places in the program. Notation: (a) V = {vi: i = 1, 2, …, n}: set of vertices denoting basic blocks. (b) E: set of edges denoting possible CF between basic blocks.

2) Construct a graph for the program according to the instruction flow (each node represents a basic block). Note that a program can be represented by a program graph, P, where the branches bri,j are not necessarily explicit branch instructions; they also represent fall-through execution paths, jumps, subroutine calls, and returns. Fig. 2.5 is an example. Notation: P: Program Graph {V, E}.

3) Arbitrarily assign a signature to each node (compilation time).

4) Compute the signature difference between the source and the destination blocks.

5) Compute the new signature for each node (execution time).

6) Compare both signatures.

Page 21:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Sequence of instructions and its graph. Detection of an illegal branch.

General form: f = f(G, di) = G XOR di

G2 = f(G1, d2) = G1 XOR d2 = s1 XOR (s1 XOR s2) = s2

G4 = f(G1, d4) = G1 XOR d4 = G1 XOR (s3 XOR s4) = s1 XOR s3 XOR s4 ≠ s4
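The two signature computations on this slide can be reproduced with a few lines of C. This is a minimal sketch: the signature values `S1` to `S4` are arbitrary assumptions, and `cfcss_update` and `cfcss_check` are illustrative names rather than functions from the CFCSS authors.

```c
#include <stdint.h>

/* Arbitrary compile-time signatures for nodes v1..v4 (assumed values). */
enum { S1 = 0x1, S2 = 0x2, S3 = 0x3, S4 = 0x4 };

/* Run-time signature update on entering a node: G_new = G XOR d,
   where d is the precomputed difference between the signatures of
   the legal predecessor and the node itself. */
uint8_t cfcss_update(uint8_t G, uint8_t d)
{
    return (uint8_t)(G ^ d);
}

/* Returns 1 when the run-time signature equals the node's own signature. */
int cfcss_check(uint8_t G, uint8_t s)
{
    return G == s;
}
```

With d2 = s1 XOR s2, updating from G1 = s1 yields s2 and the check passes. With d4 = s3 XOR s4, updating from G1 = s1 yields s1 XOR s3 XOR s4, which differs from s4 whenever s1 differs from s3, so a branch from v1 into v4 is flagged.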

Page 22:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Detection of an illegal branch: a numerical example

Page 23:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Node v1 and node v3 have the same signatures: Branch Fan-in Nodes

Page 24:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Node v1 and node v3 have different signatures: Adjusting Signature D

Page 25:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: CFCSS (McCluskey et al.)

Node v1 and node v3 have different signatures: Adjusting Signature D

G5 = f(G1, d5, D1) = G1 XOR d5 XOR D1 = s1 XOR (s1 XOR s5) XOR "000" = s5

G5 = f(G3, d5, D3) = G3 XOR d5 XOR D3 = s3 XOR (s1 XOR s5) XOR (s1 XOR s3) = s5
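The effect of the adjusting signature D on a fan-in node can be checked numerically. The sketch below assumes arbitrary values for `S1`, `S3` and `S5`; `cfcss_update_fanin` is a hypothetical helper name.

```c
#include <stdint.h>

/* Arbitrary signatures for the two predecessors v1, v3 and the
   fan-in node v5 (assumed values). */
enum { S1 = 0x1, S3 = 0x3, S5 = 0x5 };

/* Signature update for a branch fan-in node:
   G_new = G XOR d XOR D, where d is computed against one chosen
   predecessor (v1 here) and D is set by the block that was left. */
uint8_t cfcss_update_fanin(uint8_t G, uint8_t d, uint8_t D)
{
    return (uint8_t)(G ^ d ^ D);
}
```

Arriving from v1 uses D1 = 0 and arriving from v3 uses D3 = s1 XOR s3; both paths produce s5, while a run-time signature that matches neither predecessor does not.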

Page 26:

2. The Possible Solution

SW-Based Fault Detection Mechanisms

Faults Affecting Control: ECI (Miremadi et al.)

Insertion of trap instructions in the program area, in the data area, and in the unused area of the memory.

The ECIs are inserted in main memory locations that are not used by the CPU during normal execution. Thus, the execution of an ECI is an indication that a control flow error has occurred.

The task of an ECI is to initiate a recovery process.
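Filling unused memory with trap instructions can be modeled on an array of 16-bit words. This is a sketch under stated assumptions: `ECI_OPCODE` uses the 68000 `TRAP #15` encoding purely as an example, and `eci_fill` and `eci_hit` are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed trap encoding (TRAP #15 on a 68000); any opcode that
   vectors to a recovery routine would serve the same purpose. */
#define ECI_OPCODE 0x4E4Fu

/* Fill every word outside [used_start, used_end) with the trap opcode.
   Returns the number of words written. */
size_t eci_fill(uint16_t *mem, size_t words, size_t used_start, size_t used_end)
{
    size_t filled = 0;
    for (size_t i = 0; i < words; i++) {
        if (i < used_start || i >= used_end) {
            mem[i] = ECI_OPCODE;
            filled++;
        }
    }
    return filled;
}

/* Returns 1 if the fetched word is an ECI, i.e. execution has strayed
   into a location that normal execution never reaches. */
int eci_hit(uint16_t fetched)
{
    return fetched == ECI_OPCODE;
}
```

On a real target the fill would be done by the linker or loader rather than at run time, and the trap handler would start the recovery process.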

Page 27:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

The WDT / I-IP works in symbiosis with the processor, which is not modified.

The WDT / I-IP continuously monitors the execution information flowing on the bus, which it uses to test and update signatures.

If a mismatch is detected, the WDT / I-IP outputs a mismatch signal.

Page 28:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Piece of code for control-flow fault detection (ECCA Partitioning):

    01: while(k1<DIM)
    02: {
    03: IIPtest( BB1 );
    04: IIPtest( BB2 );
    05: A1 = matrixA1[i1][k1];
    06: B1 = matrixB1[k1][j1];
    07: C1 += A1*B1;
    08: matrixC1[i1][j1] = C1;
    09: k1++;
    10: IIPset( BB2);
    11: }

Lines of the SW-only version shown next to the code:

    03: if( != M1 && != M2 )
    04: //Error detected

    10: j =(i ^M1)^M2;
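The slides do not spell out the semantics of `IIPtest` and `IIPset`. One possible reading, sketched below as a host-side behavioral model, is that each test offers an accepted predecessor signature and the following set latches a mismatch if none of the offered signatures matched the stored one. The `iip_model` type, the functions `iip_reset`, `iip_test` and `iip_set`, and the values of `BB1` and `BB2` are all assumptions made for this illustration.

```c
#include <stdbool.h>

enum { BB1 = 1, BB2 = 2 };      /* assumed block signatures */

/* Behavioral model of the monitor's state (hypothetical). */
typedef struct {
    int  stored;    /* signature written by the last iip_set           */
    bool tested;    /* at least one iip_test seen since the last set   */
    bool matched;   /* one of those tests matched the stored signature */
    bool mismatch;  /* latched mismatch signal                         */
} iip_model;

void iip_reset(iip_model *m, int initial)
{
    m->stored = initial;
    m->tested = false;
    m->matched = false;
    m->mismatch = false;
}

/* Models IIPtest(sig): sig is one accepted predecessor signature. */
void iip_test(iip_model *m, int sig)
{
    m->tested = true;
    if (m->stored == sig)
        m->matched = true;
}

/* Models IIPset(sig): close the test window, then store the new signature. */
void iip_set(iip_model *m, int sig)
{
    if (m->tested && !m->matched)
        m->mismatch = true;
    m->stored = sig;
    m->tested = false;
    m->matched = false;
}
```

In this model the loop body from the slide passes on every iteration, because the stored signature is `BB1` on entry and `BB2` afterwards; a corrupted stored signature makes the next set raise the mismatch flag.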

Page 29:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

WDT / I-IP Architecture:

• Three modules:
  - bus interface logic: detects signatures passing on the bus (adx, data)
  - consistency check logic: compares flow signatures
  - CAM memory: stores flow signatures

[Figure: the WDT / I-IP sits on the bus and contains the Bus Interface Logic, the Consistency Check Logic and the CAM Memory; its output is the Mismatch Signal.]

Page 30:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

WDT / I-IP Architecture:

[Figure: module-level diagram of the WDT / I-IP. Signal names as labeled in the figure:]

WDT / I-IP: Clk, Reset, Instruction_in, Ram_data_in, Ram_address_in

Module 1, Bus Interface Logic: Clk, Reset, Instruction_in, Ram_data_in, Ram_address_in, Data_memory_in, Data_memory_out, Adr_memory_out, Ctrl_rw_out, En_compare_out, Data_1_out, Data_2_out

Module 2, CAM Memory: Clk, Reset, Data_memory_out, Data_memory_in, Adr_memory_in, Ctrl_rw_in

Module 3, Consistency Check Logic: Clk, Reset, En_compare_out, Data_1_out, Data_2_out, Mismatch Signal

Page 31:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Consider now that the µprocessor-based SoC runs under an Operating System …

The application code accounts for only a fraction of the total execution time during system operation!

Page 32:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

• Critical applications need operating systems (OS) that guarantee correct and safe behavior despite the occurrence of errors.

• Faults can affect OS calls as well as the OS kernel: how does the system react to invalid or corrupted values handled by the kernel?

Page 33:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

HW-SW Partitioning for Fault-Detection in Complex Systems

[Figure: SoC block diagram. Labels: µProcessor, WDT / I-IP, Application, Driver, Address + Data Bus, Status Register, Memory (Operating System), Memory (Application Code + Data), Error Indication.]

Page 34:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

HW-SW Partitioning for Fault-Detection in Complex Systems

[Figure: SoC block diagram. Labels: µProcessor, WDT / I-IP, Application, Driver, Address + Data Bus, Status Register, Memory (Operating System), Memory (Application Code + Data), Error Indication, Com Channel.]

Figure annotations:

• DragonBall, ARM, Pentium, 8086, 68K
• Programmable Logic
• µCLinux, µC/OS-II
• SW Part (three occurrences), HW Part

Page 35:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

MC68VZ328 Block Diagram

[Figure: block diagram. Blocks: CGM & Power Control, Real-Time Clock, In-Circuit Emulation, Interrupt Controller, Memory Controller, Bootstrap Mode, 8/16-Bit 68000 Bus Interface, FLX68000 Static CPU, 16-Bit Timers (2), 8-Bit PWM 1, 16-Bit PWM 2, SPI 1, SPI 2, UART 1 IrDA 1.0, UART 2 IrDA 1.0, LCD Controller, GPIO Ports, 68000 Internal Bus. Highlighted: Special Function Pins (CPU Space).]

Page 36:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

Page 37:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

68000 Die

Special Function Pins (CPU Space): FC2, FC1, FC0

    Function Code Output    Processor Cycle Type
    FC2  FC1  FC0
     0    0    0            Undefined, reserved
     0    0    1            User Data
     0    1    0            User Program
     0    1    1            Undefined, reserved
     1    0    0            Undefined, reserved
     1    0    1            Supervisor Data
     1    1    0            Supervisor Program
     1    1    1            CPU space (interrupt acknowledge)
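The function-code table translates directly into a decoder. The sketch below is illustrative only: the `fc_cycle` enumeration and the helpers `fc_decode` and `fc_is_os_cycle` are hypothetical names, and treating supervisor cycles as operating-system activity is an assumption about how a monitor might use these pins.

```c
/* Processor cycle types from the FC2..FC0 table. */
typedef enum {
    FC_RESERVED,
    FC_USER_DATA,
    FC_USER_PROGRAM,
    FC_SUPERVISOR_DATA,
    FC_SUPERVISOR_PROGRAM,
    FC_CPU_SPACE
} fc_cycle;

/* Decode the three function-code pin levels (each 0 or 1). */
fc_cycle fc_decode(unsigned fc2, unsigned fc1, unsigned fc0)
{
    unsigned code = ((fc2 & 1u) << 2) | ((fc1 & 1u) << 1) | (fc0 & 1u);

    switch (code) {
    case 1:  return FC_USER_DATA;
    case 2:  return FC_USER_PROGRAM;
    case 5:  return FC_SUPERVISOR_DATA;
    case 6:  return FC_SUPERVISOR_PROGRAM;
    case 7:  return FC_CPU_SPACE;
    default: return FC_RESERVED;   /* codes 0, 3 and 4 */
    }
}

/* Assumption: supervisor cycles are taken to mean kernel (OS) activity. */
int fc_is_os_cycle(fc_cycle c)
{
    return c == FC_SUPERVISOR_DATA || c == FC_SUPERVISOR_PROGRAM;
}
```

A bus monitor could use such a decoder to decide whether the cycle it is observing belongs to application code or to the kernel.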

Page 38:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

68010 – 68030 Dies

A16 - A19 Pins

FC2 = FC1 = FC0 = 1 can also indicate CPU operations other than interrupt acknowledge cycles (e.g. co-processor communications).

The different CPU spaces are then indicated on pins A16 - A19, if properly decoded.

Page 39:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

68000 Die

Interrupt Control Pins: IPL2, IPL1, IPL0

    Interrupt Priority Level    Priority
    IPL2  IPL1  IPL0
     0     0     0              Lowest priority
     0     0     1                 |
     0     1     0                 |
     0     1     1                 |
     1     0     0                 |
     1     0     1                 |
     1     1     0                 |
     1     1     1              Highest priority

Page 40:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

Pentium Die

Event-Ticking Pins – ETPs: PM0, PM1

Event-Ticking Pins – ETPs are associated with Model Specific Registers – MSRs to monitor:

# cache memory misses, # committed instructions, # interruptions executed, # taken branches, ...

Model Specific Registers – MSRs: counters CTR0 and CTR1, programmed through the Control and Event Select Register – CESR

Page 41:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

Instructions used to program counters CTR0 and CTR1 through the Control and Event Select Register – CESR:

• WRMSR
• RDMSR

Both RDMSR and WRMSR are privileged instructions: they may only be executed at CPL0 (Current Privilege Level 0).

Page 42:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

DragonBall Core

Event-Ticking Pins – ETPs: d_i, s_u

• d_i: if "0": data; if "1": instruction; if "z": undefined.
• s_u: if "0": supervisor mode; if "1": user mode; if "z": undefined.

These pins were added to the processor core to serve as an interface with the I-IP (watch-dog).

Page 43:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Status Information

Event-Ticking Pins – ETPs: d_i, s_u

Page 44:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

• OS error detection coverage has been measured, and observations have been made about the OS critical data structures that should be improved in order to increase the final robustness of the µC/OS-II operating system.

Juan Pardo, 2004
Fault Tolerant Systems Group
Polytechnic University of Valencia, Spain

Page 45:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

µC/OS-II Operating System

• It was selected because it has been widely used for several years, particularly in embedded applications.

First version: µC/OS, 1992

• Industrial robots, motor control, medical instruments, etc.

• It is 99% compliant with the Motor Industry Software Reliability Association (MISRA) C Coding Standards.

• All Modified Condition Decision Coverage (MCDC) code in µC/OS-II has been removed, improving code quality for RTCA / EUROCAE DO-178B Level A-certified environments for avionics applications.

Page 46:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

µC/OS-II: Characteristics

• Portable: µC/OS-II is written in highly portable ANSI C, with target microprocessor-specific code written in assembly language.

• ROMable: µC/OS-II was designed for embedded applications. This means that if you have the proper tool chain (i.e., C compiler, assembler, and linker/locator), you can embed µC/OS-II as part of a product.

• Scalable: it is possible to use only the services needed in the application, which reduces the amount of memory (both RAM and ROM) needed. Scalability is accomplished with the use of conditional compilation (full version: 8KB).

• Preemptive: µC/OS-II is a fully preemptive real-time kernel. This means that µC/OS-II always runs the highest priority task that is ready.

• Multitasking: µC/OS-II can manage up to 64 tasks. The current version of the software reserves 8 of these tasks for system use, leaving up to 56 tasks for the application. Each task has a unique priority assigned to it, which means that µC/OS-II cannot do round-robin scheduling.

Page 47:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

µC/OS-II: Characteristics

• Deterministic: the execution time of all µC/OS-II functions and services is deterministic. You can always know how much time µC/OS-II will take to execute a function or a service. Furthermore, the execution time of all µC/OS-II services does not depend on the number of tasks running in your application.

• Task Stacks: each task requires its own stack. µC/OS-II allows each task to have a different stack size, which reduces the amount of RAM needed by the application.

• Services: system services such as mailboxes, queues, semaphores, fixed-sized memory partitions, time-related functions, etc.

• Interrupt Management: interrupts can suspend the execution of a task. If a higher priority task is awakened as a result of the interrupt, the highest priority task will run as soon as all nested interrupts complete. Interrupts can be nested up to 255 levels deep.

• Robust and Reliable: µC/OS-II is based on µC/OS, which has been used in hundreds of commercial applications since 1992.

Page 48:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Workload Design

Characteristics:

Worst-case application: maximum use of system calls.

System calls: Synchronization, Semaphores, Memory, Queues, Messages, Tasks Handling, Timing Management, etc.

Page 49:

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Workload Design

Addition of Consistency Checks

The system workload is continuously running and consists of a series of tasks executing the application.

Consistency checks are added to the application code and kernel to detect faults and invalid values at the kernel calls in order to improve system robustness.

The WDT / I-IP is the monitor.
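A consistency check at a kernel call can be as simple as validating the arguments before the kernel touches them. The sketch below is not µC/OS-II code: the `demo_sem` type, the `sem_table` array and the `checked_sem_pend` wrapper are hypothetical stand-ins that only illustrate the idea of rejecting a corrupted handle or counter.

```c
#include <stddef.h>

#define MAX_SEMS 4                  /* assumed size of the semaphore table */

typedef struct { int count; } demo_sem;

demo_sem sem_table[MAX_SEMS];       /* the only legitimate semaphores */
int invalid_calls = 0;              /* number of rejected kernel calls */

/* Returns 1 when psem points at an entry of sem_table. */
int sem_is_valid(const demo_sem *psem)
{
    for (size_t i = 0; i < MAX_SEMS; i++)
        if (psem == &sem_table[i])
            return 1;
    return 0;
}

/* Checked pend: returns -1 when a consistency check fails,
   0 when a unit was taken, 1 when the caller would have to wait. */
int checked_sem_pend(demo_sem *psem)
{
    if (!sem_is_valid(psem) || psem->count < 0) {
        invalid_calls++;
        return -1;
    }
    if (psem->count > 0) {
        psem->count--;
        return 0;
    }
    return 1;
}
```

A handle that does not point into the table, including a null pointer, is rejected before it is dereferenced, so a corrupted argument is reported instead of silently corrupting kernel data.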

Page 50: Vargas@computer.org Exploiting HW+SW Partitioning for Reliable Embedded Systems Part 2.

[email protected]

void RandomNumberTask(void *pdata) void RandomNumberTask(void *pdata)

{ { // Declare as auto to ensure reentrancy. // Declare as auto to ensure reentrancy. auto OS_TCB data; auto OS_TCB data; auto INT8U err; auto INT8U err; auto INT16U RNum;auto INT16U RNum;OSTaskQuery(OS_PRIO_SELF, &data); OSTaskQuery(OS_PRIO_SELF, &data); while(1) while(1) { { // Rand is not reentrant, so access must be controlled // Rand is not reentrant, so access must be controlled // via a semaphore. // via a semaphore. OSSemPend(RandomSem, 0, &err);OSSemPend(RandomSem, 0, &err); RNum = (int)(rand() * 100); RNum = (int)(rand() * 100); OSSemPost(RandomSem);OSSemPost(RandomSem);printf("Task%02d's random #: %d\n",data.OSTCBPrio,RNum);printf("Task%02d's random #: %d\n",data.OSTCBPrio,RNum);// Wait 3 seconds in order to view output from each task. // Wait 3 seconds in order to view output from each task. OSTimeDlySec(3); OSTimeDlySec(3); } } }}

2. The Possible Solution2. The Possible Solution

Migrating Migrating SW-BasedSW-Based Fault Detection Mechanism into Fault Detection Mechanism into HWHW// 1. Define necessary configuration constants for uC/OS-II // 1. Define necessary configuration constants for uC/OS-II #define OS_MAX_EVENTS 2 #define OS_MAX_EVENTS 2 #define OS_MAX_TASKS 20 #define OS_MAX_TASKS 20 #define OS_MAX_QS 0 #define OS_MAX_QS 0 #define OS_Q_EN 0 #define OS_Q_EN 0 #define OS_MBOX_EN 0 #define OS_MBOX_EN 0 #define OS_TICKS_PER_SEC 32#define OS_TICKS_PER_SEC 32

// 2. Define necessary stack configuration constants // 2. Define necessary stack configuration constants #define STACK_CNT_512 1 // initial program stack #define STACK_CNT_512 1 // initial program stack #define STACK_CNT_1K OS_MAX_TASKS // task stacks#define STACK_CNT_1K OS_MAX_TASKS // task stacks// 3. This ensures that the above definitions are used // 3. This ensures that the above definitions are used #use "ucos2.lib“#use "ucos2.lib“

void RandomNumberTask(void *pdata);void RandomNumberTask(void *pdata);// Declare semaphore global so all tasks have access // Declare semaphore global so all tasks have access

OS_EVENT* RandomSem;OS_EVENT* RandomSem;void main(){ void main(){ int i;int i;// Initialize OS internals // Initialize OS internals OSInit();OSInit();for(i = 0; i < OS_MAX_TASKS; i++){for(i = 0; i < OS_MAX_TASKS; i++){// Create each of the system tasks // Create each of the system tasks OSTaskCreate(RandomNumberTask, NULL, 1024, i);OSTaskCreate(RandomNumberTask, NULL, 1024, i);} } // semaphore to control access to random number generator // semaphore to control access to random number generator RandomSem = OSSemCreate(1);RandomSem = OSSemCreate(1);// 4. Set number of system ticks per second // 4. Set number of system ticks per second OSSetTicksPerSec(OS_TICKS_PER_SEC);OSSetTicksPerSec(OS_TICKS_PER_SEC);// Begin multi-tasking // Begin multi-tasking OSStart(); OSStart(); }}

Callouts on the code:
• OS call (task waits for a signal): OSSemPend
• OS call (task sends a signal): OSSemPost
• Initializing tasks: OSTaskCreate
• Starting tasks: OSStart

Workload Design

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Workload Design

OS_ENTER_CRITICAL

/* Code implemented for GNU GAS */
asm ("movea.l #0x0100, %a0\n\t"  /* load the signaling address 0x0100 into a0 */
     "move.b  #11, (%a0)");      /* write the byte 11 to the address held in a0 */

OS_EXIT_CRITICAL

/* Code implemented for GNU GAS */
asm ("movea.l #0x0100, %a0\n\t"  /* load the signaling address 0x0100 into a0 */
     "move.b  #00, (%a0)");      /* write the byte 00 to the address held in a0 */

A marker is set at the instant the processor enters supervisor mode (OS_ENTER_CRITICAL) and at the instant it leaves this mode (OS_EXIT_CRITICAL). The signaling is done by writing to a specific memory address.

2. The Possible Solution

Migrating SW-Based Fault Detection Mechanism into HW

Workload Design

System calls performed by Pend and Post through semaphore, mailbox and queue:

/**************************************************************
*                     PEND ON SEMAPHORE
**************************************************************/
UBYTE OSSemPend(OS_SEM *psem, UWORD timeout)
{
    UBYTE x, y, bitx, bity;

    OS_ENTER_CRITICAL();
    /* Code implemented for GNU GAS */
    asm ("movea.l #0x0100, %a0\n\t"  /* load the signaling address 0x0100 into a0 */
         "move.b  #4, (%a0)");       /* write the byte 4 to the address held in a0 */
    /* End */
    if (psem->OSSemCnt-- > 0) {
        OS_EXIT_CRITICAL();
        return (OS_NO_ERR);
    } else {
        OSTCBCur->OSTCBStat |= OS_STAT_SEM;
        OSTCBCur->OSTCBDly   = timeout;
        y                    = OSTCBCur->OSTCBPrio >> 3;
        x                    = OSTCBCur->OSTCBPrio & 0x07;
        bity                 = OSMapTbl[y];
        bitx                 = OSMapTbl[x];
        if ((OSRdyTbl[y] &= ~bitx) == 0)
            OSRdyGrp &= ~bity;
        psem->OSSemTbl[y] |= bitx;
        psem->OSSemGrp    |= bity;
        OS_EXIT_CRITICAL();
        OSSched();
        OS_ENTER_CRITICAL();
        if (OSTCBCur->OSTCBStat & OS_STAT_SEM) {
            if ((psem->OSSemTbl[y] &= ~bitx) == 0) {
                psem->OSSemGrp &= ~bity;
            }
            OSTCBCur->OSTCBStat = OS_STAT_RDY;
            OS_EXIT_CRITICAL();
            return (OS_TIMEOUT);
        } else {
            OS_EXIT_CRITICAL();
            return (OS_NO_ERR);
        }
    }
}

Callout on the code: Consistency Check (marked at three places on the slide).

Matteo Sonza Reorda, 2002-05

Fault Tolerant Systems Group, Politecnico di Torino

3. Experimental Evaluation

• An Intel 8051-based SoC was inspected.

• PANDORA I-IP: VHDL (~1500 lines).

3. Experimental Evaluation

• Fault detection capabilities evaluated via HW-based fault injection experiments (FPGA environment).

• Four benchmarks considered:

  – Matrix multiplication, Elliptical Filter, FIR Filter and Viterbi Algorithm.

3. Experimental Evaluation

Detection capabilities:

• Transient faults (30,000 bit-flips)

• Number of wrong answers evaluated (faults that escape detection).

Program   Plain [%]   Pandora [%]   ECCA [%]   CFCSS [%]
          Orig. SW    IP (HW+SW)    SW Sol.    SW Sol.
Matrix     9.78        0.18          0.99       4.88
Ellipf    20.83        0             2.38      14.29
FIR        5.64        0             2.12       4.49
Viterbi   21.06        4.89          6.33      17.48

3. Experimental Evaluation

Memory overhead:

• Additional code lines required to implement the hybrid technique.

Prog.     Plain [byte]   Pandora [byte]   ECCA [byte]   CFCSS [byte]
          Orig. SW       IP (HW+SW)       SW Sol.       SW Sol.
Matrix    223            385                902         456
Ellipf    303            361                640         347
FIR       194            364                701         320
Viterbi   436            707              1,115         725

3. Experimental Evaluation

Execution time overhead:

Prog.     Plain [cycle]   Pandora [cycle]   ECCA [cycle]   CFCSS [cycle]
          Orig. SW        IP (HW+SW)        SW Sol.        SW Sol.
Matrix     31,211          41,462           102,356         43,791
Ellipf     16,268          17,815            25,635         17,611
FIR        43,434          71,994           153,458         57,357
Viterbi   286,364         328,150           349,111        314,244

3. Experimental Evaluation

Area overhead:

PANDORA size: 992 gates

8051 size: 30,480 gates

PANDORA introduces about 3.2% of area overhead.

Area overhead is expected to decrease when processor size increases.

4. Final Considerations

Developing a hybrid methodology (HW+SW redundancies) able to perform runtime detection of errors in μprocessor-based SoCs may yield very good cost-benefit returns.

4. Final Considerations

Returns:

• Minimization of area overhead and fab/development costs (benefits of SW-based redundancy techniques)

• Improvement of performance and minimization of memory overhead (benefits of HW-based redundancy techniques)

In summary: minimize fab cost and performance degradation, while improving reliability.

Target faults:

• Control flow errors

• Data handling errors

4. Final Considerations

A hybrid methodology (HW+SW redundancies) explores:

• I-IP Core Architecture

• Software-Based Techniques

4. Final Considerations

System architecture co-implemented in HW+SW to detect faults in control flow and application data. The main characteristics of this architecture:

• SW-embedded structures at the application code level.

• Partial migration of the SW-embedded structures into HW: a specific I-IP monitors the application processor, much like a watchdog.

• Communication channel between the HW and SW entities: a driver embedded in the OS kernel and specific signals connecting the I-IP to the application processor.