DESIGN AND EVALUATION OF HYBRID FAULT-DETECTION SYSTEMS Qing Xu Kevin Wang.



OUTLINE

- Background
- Motivation
- Key Ideas
- Introduction to CRAFT
- Summary and Discussion Points


BACKGROUND

Smaller and Faster Transistors
- Lower threshold voltage
- Tighter noise margins
- Less reliable

Alpha Particle: Transient Faults

Results
- Incorrect program execution
- Recovery

REDUNDANCY

Software Only / Hardware Only

#include <iostream>
int main() { std::cout << "Hello\n"; }

#include <iostream>
int main() { std::cout << "Hello\n"; }

MOTIVATION AND GOAL

Software Only
- Inadequate coverage
- Slow

Hardware Only
- Large overhead/area
- High cost

Hybrid Solution
- Better reliability and performance
- Lower hardware area and cost

KEY IDEA: COMPILER ASSISTED FAULT TOLERANCE (CRAFT)

Characteristics:

- Based on a software technique

- Minimal hardware adaptations

- Takes advantage of both software and hardware solutions

Benefits:

- Nearly perfect reliability

- Low performance degradation

- Low hardware cost

[Diagram labels: Software, Hardware]

CRAFT: HYBRID OF EXISTING METHODS

Hardware Method
- Redundant Multithreading Technique (RMT)
- Error Correcting Codes (ECC)
- Advantages: almost-perfect fault coverage, low performance cost

Software Method
- Software Implemented Fault Tolerance (SWIFT)
- Error Detection by Duplicating Instructions (EDDI)
- Advantages: high fault coverage, modest performance cost, zero hardware cost

EXISTING METHOD: HARDWARE (RMT)

- RMT uses simultaneous multithreading (SMT) resources to run loosely synchronized redundant threads
- Components not covered by redundant execution must employ alternative techniques, such as Error Correction Code (ECC)

[Diagram: Original Thread and Checker Thread, Redundant Multi-threading (RMT)]

EXISTING METHOD: SOFTWARE (SWIFT)

- A compiler-based transformation
- The store instruction is the synchronization point
- Assumes that Error Correction Code (ECC) guards correctness of the memory subsystem

ld r3 = [r4]

add r1 = r2, r3

st m[r1] = r2

(Original Code)

ld r3 = [r4]
mov r3’ = r3

add r1 = r2, r3
add r1’ = r2’, r3’

br Fault, r1 != r1’
br Fault, r2 != r2’
br Fault, r3 != r3’

st m[r1] = r2

(SWIFT Code)
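
The SWIFT code above can be mimicked in a few lines of Python. This is only a sketch of the idea, not the compiler pass: the register file is a dictionary, shadow registers carry a `'` suffix, and `swift_store_sequence`, `FaultDetected`, `fresh_state`, and the `flip` parameter are hypothetical names introduced for illustration.

```python
class FaultDetected(Exception):
    """Raised where the SWIFT code would branch to the Fault label."""


def swift_store_sequence(regs, mem, flip=None):
    """Run the SWIFT version of: ld r3=[r4]; add r1=r2,r3; st m[r1]=r2.

    regs holds each register and its shadow copy (name + "'").
    flip, if given, is a (register, bit) pair: that bit is flipped after
    the adds to model a single transient fault (an assumption of this sketch).
    """
    regs["r3"] = mem[regs["r4"]]               # ld  r3  = [r4]
    regs["r3'"] = regs["r3"]                   # mov r3' = r3
    regs["r1"] = regs["r2"] + regs["r3"]       # add r1  = r2, r3
    regs["r1'"] = regs["r2'"] + regs["r3'"]    # add r1' = r2', r3'

    if flip is not None:                       # inject the transient fault
        name, bit = flip
        regs[name] ^= 1 << bit

    for r in ("r1", "r2", "r3"):               # br Fault, rX != rX'
        if regs[r] != regs[r + "'"]:
            raise FaultDetected(r)

    mem[regs["r1"]] = regs["r2"]               # st m[r1] = r2


def fresh_state():
    """Example starting state; the values are arbitrary."""
    regs = {"r2": 5, "r2'": 5, "r4": 100, "r4'": 100}
    mem = {100: 7}
    return regs, mem
```

With `fresh_state()`, a fault-free run stores 5 at address 12, while `flip=("r1", 0)` raises `FaultDetected` before the store executes. A flip that lands after the comparisons but before the store would slip through in this model as well.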

CRAFT: SUITE OF THREE DETECTION SYSTEMS

List of the Suite:

1. Checking Store Buffer (CSB)

2. Load Value Queue (LVQ)

3. CSB + LVQ

Preliminaries

- Assume Single Event Upset fault model
- Architecturally Correct Execution (ACE)
- Detected Unrecoverable Error (DUE)
- Silent Data Corruption (SDC)

SUITE 1: CHECKING STORE BUFFER (CSB)

Problem to Improve:
- SWIFT: vulnerable to faults in the time interval between the validation and the use of a register value

[Timeline: validated values, then a window vulnerable to faults, then use of validated values]

Solution:
- Add a store buffer to perform checks

CSB #      0    1    2     3
Address    --   --   0xFF  0xEE
Value      --   --   0x8   0x1
Validated  --   --   N     N

[Figure values: 0xFF, 0x8; 0xEE, 0x2]

Compiler duplicates stores: st [r1] = r2 becomes
st1 [r1] = r2
st2 [r1’] = r2’

No match: not OK to go to MEM

CSB: IMPLEMENTATION

Basic Idea: Commit a store when the two copies of the store data match
Method: Create a CSB to keep track of all original and duplicated store instructions

The table can fill up and cause a structural hazard

[Diagram: Insn duplicate #1 and Insn duplicate #2 are compared (Y / N)]

Y: Store value checks out! Send to MEM.
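
The commit rule can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not the hardware design: `CheckingStoreBuffer`, its `st1`/`st2` methods, the in-order matching against the oldest entry, and the string return values are all hypothetical choices made for illustration.

```python
from collections import deque


class CheckingStoreBuffer:
    """Toy CSB: a store reaches memory only when both copies agree."""

    def __init__(self, size=4):
        self.size = size
        self.entries = deque()   # unvalidated (address, value) pairs from st1
        self.memory = {}         # stands in for MEM

    def st1(self, address, value):
        """Original store: buffer it, or stall if the table is full."""
        if len(self.entries) >= self.size:
            return "stall"       # structural hazard: no free entry
        self.entries.append((address, value))
        return "buffered"

    def st2(self, address, value):
        """Duplicate store: compare against the oldest unvalidated entry."""
        if not self.entries or self.entries[0] != (address, value):
            return "mismatch"    # not OK to go to MEM; entry stays buffered
        self.entries.popleft()
        self.memory[address] = value
        return "committed"
```

Using the values from the table above, `st1(0xFF, 0x8)` followed by `st2(0xFF, 0x8)` commits, while `st1(0xEE, 0x1)` followed by `st2(0xEE, 0x2)` reports a mismatch and leaves the entry sitting in the buffer. In this model an unmatched entry is never freed, so it permanently occupies a slot.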

CSB: ADVANTAGES / DISADVANTAGES

Advantages
- Checking is implemented at the hardware level
- Validation code is no longer needed, which reduces code size
- Store instructions are no longer synchronization points (as they are in SWIFT)
- Exploits more dynamic scheduling

Disadvantages
- Additional compiler requirement: the distance between duplicated instructions should not exceed the size of the CSB
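
One way to read the compiler requirement is that the number of original stores still waiting for their duplicates must never exceed the number of CSB entries. The sketch below checks a schedule under that interpretation. The schedule format and the names `max_outstanding_stores` and `fits_in_csb` are assumptions made for illustration.

```python
def max_outstanding_stores(schedule):
    """Return the peak number of st1 stores still waiting for their st2.

    schedule is a list of (opcode, store_id) tuples; opcodes other than
    "st1" and "st2" are ignored.
    """
    waiting = set()
    peak = 0
    for opcode, store_id in schedule:
        if opcode == "st1":
            waiting.add(store_id)
            peak = max(peak, len(waiting))
        elif opcode == "st2":
            waiting.discard(store_id)
    return peak


def fits_in_csb(schedule, csb_size):
    """True if the schedule never needs more entries than the CSB has."""
    return max_outstanding_stores(schedule) <= csb_size
```

A schedule that issues three `st1` instructions before any `st2` needs three entries, so it fits a four-entry buffer but not a two-entry one.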

SUITE 2: LOAD VALUE QUEUE (LVQ)

Problem to Improve:
- SWIFT: window of vulnerability between the load instruction and the value duplication

[Timeline: loading values, then a window vulnerable to faults, then copying values]

Solution:
- Add a load value queue

LVQ : IMPLEMENTATION PROCEDURE


Basic Idea: Duplicate the load to enable redundant computation
Method: The LVQ provides redundant load instruction execution

LVQ #    0   1   2   3
Address  --  --  --  --
Value    --  --  --  --

Compiler duplicates loads: ld r2 = [r1] becomes
ld1 r2 = [r1]
ld2 r2’ = [r1’]

[Figure: the ld insn and the ld insn duplicate both use address 0xAA and both receive value 0x2]
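
A toy Python model can make the bypass behaviour concrete. It is a sketch under assumptions, not the hardware: the original load reads memory and queues its address and value, and the duplicate load takes its value from the queue instead of memory. The names `LoadValueQueue`, `ld1`, `ld2`, and `AddressMismatch`, and the choice to signal an address mismatch as a fault, are hypothetical.

```python
from collections import deque


class AddressMismatch(Exception):
    """Raised when a duplicate load does not line up with its original."""


class LoadValueQueue:
    """Toy LVQ: the duplicate load is served from the queue, not memory."""

    def __init__(self, memory):
        self.memory = memory
        self.queue = deque()     # (address, value) pairs from original loads
        self.memory_reads = 0    # counts real memory accesses

    def ld1(self, address):
        """Original load: read memory once and record what was seen."""
        value = self.memory[address]
        self.memory_reads += 1
        self.queue.append((address, value))
        return value

    def ld2(self, address):
        """Duplicate load: reuse the queued value if the addresses agree."""
        if not self.queue:
            raise AddressMismatch("no original load is waiting")
        queued_address, value = self.queue.popleft()
        if queued_address != address:
            raise AddressMismatch(hex(address))
        return value
```

With memory `{0xAA: 0x2}`, `ld1(0xAA)` and `ld2(0xAA)` both return 0x2 with a single memory read, even if the memory location changes between the two loads.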

LVQ: ADVANTAGES / DISADVANTAGES

Advantages
- Reduces the window of vulnerability by issuing a duplicated load instruction
- Keeps memory traffic low by bypassing the load value

Disadvantages
- Extra hardware to ensure that loads and their duplicates access the same entry in the LVQ

SUITE 3: CSB + LVQ

Adds both CSB and LVQ simultaneously to software-only solutions like SWIFT

EXPERIMENTAL EVALUATION

Evaluation Method: Performance vs. Reliability

- Inject randomly chosen faults into a detailed microarchitectural simulation

- Track each chosen bit flip until the program completes

- Analyze the final result to determine:

- How much SDC is converted to DUE

- How much work (# of applications) the program completed before encountering an SDC
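
The flavour of such a fault-injection campaign can be sketched in Python on a toy program. This is not the microarchitectural simulator used in the evaluation: the program, the 16-bit flip range, the duplication check, and the names `program`, `run_once`, and `campaign` are assumptions chosen to keep the example small.

```python
import random

MASK = 0xFF  # the toy program only uses the low 8 bits of each value


def program(values):
    """Toy workload: sum the low byte of every value."""
    return sum(v & MASK for v in values)


def run_once(values, flip, protected):
    """Inject one bit flip and classify the outcome.

    flip is (index, bit). With protected=True the result is compared with
    a redundant run on an untouched copy before it is released.
    """
    golden = program(values)
    faulty = list(values)
    index, bit = flip
    faulty[index] ^= 1 << bit          # single event upset
    result = program(faulty)
    if protected and result != program(values):
        return "DUE"                   # detected, but not recovered
    return "correct" if result == golden else "SDC"


def campaign(values, trials, protected, seed=0):
    """Run many random single-bit injections and count the outcomes."""
    rng = random.Random(seed)
    counts = {"correct": 0, "SDC": 0, "DUE": 0}
    for _ in range(trials):
        flip = (rng.randrange(len(values)), rng.randrange(16))
        counts[run_once(values, flip, protected)] += 1
    return counts
```

Running `campaign` twice with the same seed, once unprotected and once protected, injects the same faults in both runs. Flips in the unused upper bits stay harmless in both cases, and every SDC of the unprotected run shows up as a DUE in the protected run.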

EXPERIMENTAL EVALUATION

Results: Measures # of applications the program completed before encountering an SDC

Implementation  Performance
CSB             Enables better performance as it eliminates scheduling constraints
LVQ             Impact varies by benchmark

SUMMARY AND CONCLUSION

Hybrid technique can provide better reliability at relatively low cost

CRAFT, as compared to:

Software-only Technique
- Execution time reduction by 5%
- SDC to DUE conversion rate increase by 75%

Hardware-only Technique
- Significantly reduced area overhead
- Maintains comparable reliability

DISCUSSION POINTS

- CRAFT detects a fault when the CSB is clogged
- Tradeoff between detection latency and more flexible scheduling?
- Recovery method?
- Evaluation in terms of coverage?