Towards Understanding The Effects of Intermittent Hardware Faults on Programs

Click here to load reader

download Towards Understanding The Effects of Intermittent Hardware Faults on Programs

of 28

description

Towards Understanding The Effects of Intermittent Hardware Faults on Programs . Layali Rashid , Karthik Pattabiraman and Sathish Gopalakrishnan Department of Electrical and Computer Engineering The University of British Columbia. Motivation: Why Intermittent Faults?. - PowerPoint PPT Presentation

Transcript of Towards Understanding The Effects of Intermittent Hardware Faults on Programs

Towards Understanding the Effects of Intermittent Hardware Faults on Programs

Layali Rashid, Karthik Pattabiraman and Sathish GopalakrishnanDepartment of Electrical and Computer EngineeringThe University of British ColumbiaTowards Understanding The Effects of Intermittent Hardware Faults on Programs

(the first project to study ifault effect on programs)SG: Emphasize that this work is unique.

1Motivation: Why Intermittent Faults?Intermittent faults are likely to be a significant concern in future processorsDo not persist forever unlike permanent faultsPersist for longer duration than transient faultsMay impact program more than transient faults

Assumption:An intermittent fault affects two or more consecutive instructions in the program.One reason for ifaults only.Define the ifualt and explain the assumption.Intermittent hardware faults are bursts of errors that occur at the same location (micro-architectural component) and last from a few cycles to a few seconds.

PVT fluctuations. In-progress wear-out. Manufacturing residuals.We believe that understanding intermittent error propagation is important in designing fault tolerance techniques.Intermittent hardware faults are bursts of errors that occur at the same location (micro-architectural component) and last from a few cycles to a few seconds

2ContributionsStudy the impact of intermittent faults on programs.Model the propagation of intermittent faults in programs at the instruction-level.Validate the model using fault injections.

Dont mention the previous work too much, dont say other studies about tfaults.3Motivation: Why Model Error Propagation?Fault injection experiments are prohibitively expensive.Intermittent faults vary in location and duration.An order of magnitude slower than modeling.

Modeling error propagation provides more insights that may help in tolerating faults.

Don't mention previous work.

4Primary Research QuestionsDo all intermittent faults lead to program crash?

How many instructions are executed before the program crashes?

How many variables are corrupted by the fault before the program crashes?Dont say the next question.Over 60% of the faults lead to program crash, 4% cause SDCs and 36% are benign. 99% of the faults crash within a few thousands of instructions.99% of the faults corrupt at most few dozens of variables.

5ApproachCrash ModelDynamic Dependency GraphFault ModelSimpleScalarsimulatorEvaluate using FI6ApproachCrash ModelFault Model Decoder ALU Unit Load/Store UnitSimpleScalarsimulatorEvaluate using FIDynamic Dependency Graph7ApproachFault ModelSimpleScalarsimulatorCrash ModelMemory addressBranch/jump addressFunction call addressEvaluate using FIDynamic Dependency Graph8ApproachCrash ModelDynamic Dependency Graph is a directed acyclic graph that models the dynamic dependencies between instructions. [Agrawal '90]Fault ModelSimpleScalarsimulatorEvaluate using FI9Code FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47AA14257Array_Addr#5#636#7RR...AExampleThis program load 3 elements in the array and multiplies them without further details.10Code FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R4714257Array_Addr#5#636#7AAARR...A node is a value produced by a dynamic instruction Example11Code FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47AA14257Array_Addr#5#636#7RR...AThe edges represent the instructions operands: A is an address operand R is a regular operand.

ExampleDDG MetricsIntermittent Propagation Set (IPS): set of program values to which an intermittent fault propagates, Crash Distance (CD): number of instructions that execute from the time an intermittent fault occurs until the program crashes (due to fault).

Don't mention the different uses of DDG, just mention that we are the first for this context.13ExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47AA1257Array_Addr#5#636#7RR...AIntermittent Error4Intermittent Propagation Set (1,2) = {?}Crash Distance (1, 2) = ?Ifaults can be viewed as a sequence of transient faults.14ExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47AA1257Array_Addr#5#636#7RR...ATransient Propagation Set (1) = {1, 4}Transient Crash Distance (1) = 4

4Transient ErrorCrash Node15ExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47AA1427Array_Addr#5#636#7RR...A5Transient Propagation Set (1) = {1, 4}Transient Crash Distance (1) = 4Transient Propagation Set (2) = {2, 5}Transient Crash Distance (2) = 4Transient ErrorExampleCode FragmentNodemov R1, #51mov R2, #62mov R3, #73ld R4, R1, Array_Addr4ld R5, R2, Array_Addr5ld R6, R3, Array_Addr6mult R7, R5, R47AA1257Array_Addr#5#636#7RR...AIntermittent Propagation Set (1,2) = {1, 2, 4}Crash Distance (1, 2) = 44Intermittent ErrorCrash NodeApproachCrash ModelFault ModelSimpleScalarsimulatorEvaluate using FIDynamic Dependency Graph18Experimental SetupEvaluating the Models AccuracyIntermittent fault injections in instruction level simulator (SimpleScalar)Measure the difference between the predicted and the actual CD for crashes

Computation of Intermittent Fault PropagationConstruct the DDG of each program.Find the IPS and the CD for each fault

We modify the ss simulator to inject faults19BenchmarksPreliminary results for two programs: Matrix Multiply and Insertion Sort.Each program has about 11,000 static MIPS instructions.

Matrix multiply reads two matrices from memory, find the product and store them back to memory.Insertion Sort arranges an array of integers.Programs contain loops, memory instructions and I/O instructions.2088% of the expected CD fall within 10 nodes from the actual ones and 97% fall within 100 nodes.Results: DDG Model Vs. SimpleScalar

Mention axis only here.Point to the bars when you mention them.This shows that the DDG model is accurate in predicting the crash distances.

2195% of the faults cause program to crash within 10 nodes of the faults start. Results: CD Absolute values

Larger crash distances are infrequent in the programs studied.

22

Results: Effect of Fault Length

Crash distances for intermittent faults do not exceed few hundreds for the program studied.The IPS increases while increasing the fault length.If you manage your time you can present this slide.

23Conclusions and DiscussionWe enhanced Dynamic Dependency Graph to model intermittent fault propagation in programs.88% of the expected faults' CDs fall within 10 nodes of the actual CDs.The majority of the intermittent faults cause programs to crash within few hundreds of dynamic instructions.DiscussionDetection using software-based techniques of intermittent faults can be efficient.Diagnosis of intermittent faults is possibly feasible using software-based techniques.Recovery using check-pointing techniques on the order of thousands of instructions will be effective.

Say that this is the first study to study ifault impacts. Say few hundreds not few thousands.24Thank YouBackup SlidesInsertion Sort CD

Insertion Sort IPS