Encore: Low-Cost, Fine-Grained Transient Fault Recovery Authors: Shuguang Feng *


Page 1

University of Michigan, Electrical Engineering and Computer Science

Encore: Low-Cost, Fine-Grained Transient Fault Recovery

Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August

University of Michigan

*Currently with Northrop Grumman, Information Systems Sector

Page 2

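…many ways to fail

Negative Bias Temperature Instability

Oxide Breakdown

[Figure: transistor cross-section with gate (G), source (S), drain (D), and body (B) terminals, N+ regions in a P-well, and currents Igs, Igd, Igb, Igcs, Igcd]

Electromigration

Packaging Impurities

Cosmic Radiation

PVT Variation

NTC Computing

[Gupta`09] [Dreslinski`10]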

“Failure to prepare is preparing to fail…” - Benjamin Franklin

The distinction between a transient and permanent fault is becoming blurred

[Figure: spectrum from Transient (“soft”) Faults to Permanent (“hard”) Faults, with labels Rare, Periodic, Continuous]

Many permanent faults, particularly wearout-induced faults, initially manifest as timing errors.

Page 3

The Future of Soft Errors

Past: one failure per MONTH per 100 chips

Present: one failure per DAY per 100 chips

Future: one failure per DAY per chip

Aggressive voltage scaling (near-threshold computing)
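To compare the past, present, and future rates on this slide, they can be put on a common scale of expected failures per chip per day. The sketch below is illustrative only: it assumes a 30-day month (the slide does not define "month"), and the names `rates` and `increase` are hypothetical.

```python
from fractions import Fraction

DAYS_PER_MONTH = 30  # assumption: not stated on the slide

# Expected failures per chip per day for each era named on the slide.
rates = {
    "past": Fraction(1, DAYS_PER_MONTH * 100),  # one failure per month per 100 chips
    "present": Fraction(1, 100),                # one failure per day per 100 chips
    "future": Fraction(1, 1),                   # one failure per day per chip
}


def increase(earlier: str, later: str) -> Fraction:
    """Return how many times higher the per-chip failure rate is in `later`."""
    return rates[later] / rates[earlier]


if __name__ == "__main__":
    for a, b in [("past", "present"), ("present", "future"), ("past", "future")]:
        print(f"{a} -> {b}: {increase(a, b)}x")
```

Under the 30-day assumption, the step from past to present is a 30x increase and the step from present to future is a further 100x.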

Page 4

Realizing a Reliability “Pipeline”

[Figure: reliability pipeline from Vulnerable Computation to Reliable Output, with stages Detection, Diagnosis, Repair, and Recovery]

Detection

Recent interest in low-cost fault detection: ReStore [DSN`05], SWAT [ASPLOS`08], Shoestring [ASPLOS`10]

Not perfect…but very low-cost

Recovery

Generally involves some form of rollback/re-execution
1) Identify fault site
2) Restore processor to pre-fault state, before 1)
3) Resume execution from 1)

Many low-cost detection techniques rely on hardware speculation support

Commodity systems present both challenges and opportunities

Challenge: HW speculation support (if it exists) is limited

Challenge: Cannot afford expensive, heavyweight SW checkpointing

Opportunity: Typically not running mission-critical applications, so a small degree of reliability can be sacrificed

Exploit (probabilistic) idempotence in program execution

Page 5

The Role of Idempotence

Mathematical Definition: an operation that can be applied multiple times without changing the result

Computer Science Definition: a region of code without any exposed write-after-read (WAR, anti-) dependencies

[Figure: idempotent vs. non-idempotent example regions built from the statements “… = X”, “X++”, and “X = …”]

Idempotent code regions can be safely re-executed without additional checkpointing
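A minimal Python sketch of this claim, not taken from the slides: two toy regions operate on a dictionary standing in for memory. All function names are hypothetical, and running a region twice from its entry is used as a simplified stand-in for re-execution after a fault.

```python
def non_idempotent_region(mem: dict) -> None:
    # Reads X, then overwrites it: an exposed write-after-read (WAR) dependence.
    old = mem["X"]
    mem["X"] = old + 1
    mem["Y"] = old


def idempotent_region(mem: dict) -> None:
    # Writes X before any read of X, so the result never depends on the old X.
    mem["X"] = 10
    mem["Y"] = mem["X"] + 1


def run(region, initial: dict, times: int) -> dict:
    """Execute `region` `times` times on a private copy of `initial`."""
    mem = dict(initial)
    for _ in range(times):
        region(mem)
    return mem


def is_reexecution_safe(region, initial: dict) -> bool:
    """True if running the region twice leaves the same state as running it once."""
    return run(region, initial, 1) == run(region, initial, 2)


if __name__ == "__main__":
    start = {"X": 5, "Y": 0}
    print(is_reexecution_safe(idempotent_region, start))
    print(is_reexecution_safe(non_idempotent_region, start))
```

The non-idempotent region fails the check because its second run reads the X value that its first run already incremented.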

Page 6

Does Idempotence Exist?

Selectively checkpointing a *few* offending stores

Page 7

Challenges to Exploiting Idempotence

Must identify where to resume execution
1) Control flow
2) Rollback distance

Statically identifying optimal rollback distance is inherently intractable
↑ rollback dist. → ↑ Pr(recoverable)
↓ rollback dist. → ↑ Pr(idempotent)

Simplifying engineering solution based on single-entry, multiple-exit (SEME) regions

[Figure: control-flow graph of basic blocks bb 1 to bb 7 showing an execution path]

Page 8

Encore Vision

[Figure: Encore workflow. Source Code → Code Partitioning (CFG-based) → Idempotence Analysis (per region), which labels regions Idempotent or Non-idempotent → Instrumentation (per region), which adds “Chkpt X” and Recovery code → Runtime Behavior (post-fault): Fault Detected → Redirect Control → Restore State]

Page 9

Identifying Idempotence (High-level)

[Figure: control-flow graph of basic blocks bb 1 to bb 8]

With respect to a point, p, in the CFG…

Reachable Stores (RS): a store that may execute after p

Guarded Addresses (GA): an address that is guaranteed to be overwritten before reaching p

Exposed Addresses (EA): an address that may be referenced by an unguarded load prior to p

Idempotent IFF EA ∩ RS = Ø
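The definitions above can be illustrated on the simplest possible case, a straight-line region with no branches and no aliasing. The Python sketch below is an assumption-laden simplification, not the analysis from the talk (which works on a CFG with conservative alias analysis): `offending_stores` and `is_idempotent` are hypothetical names, and addresses are plain strings.

```python
from typing import List, Set, Tuple

Op = Tuple[str, str]  # ("load" or "store", symbolic address)


def offending_stores(region: List[Op]) -> List[int]:
    """Return indices of stores that overwrite an exposed address.

    Walking the region in order, an address becomes *guarded* once the region
    has stored to it, and *exposed* if it is loaded while still unguarded.
    A later store to an exposed address is a WAR hazard.
    """
    guarded: Set[str] = set()
    exposed: Set[str] = set()
    offenders: List[int] = []
    for index, (kind, addr) in enumerate(region):
        if kind == "load":
            if addr not in guarded:
                exposed.add(addr)
        elif kind == "store":
            if addr in exposed:
                offenders.append(index)
            else:
                guarded.add(addr)
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    return offenders


def is_idempotent(region: List[Op]) -> bool:
    """A straight-line region is idempotent if no store hits an exposed address."""
    return not offending_stores(region)


if __name__ == "__main__":
    region = [("store", "A"), ("load", "A"), ("load", "B"), ("store", "B"), ("store", "C")]
    print(offending_stores(region), is_idempotent(region))
```

In this sketch the load of A is guarded by the earlier store to A, while the load of B is exposed, so only the store to B is reported as offending.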


Additional Details…

1) Applies to both memory and registers
   Static, conservative alias analysis

2) Scalable hierarchical analysis
   Handles cyclic code

Page 10

Code Instrumentation

Live-in Checkpointing (bb 0): Save R1, Save R2, …, Save Rn

“On-demand” Checkpointing: MemCopy B, Save Address[B]

Recovery Code (bb r), Upon Fault Detection: *Restore B, Restore R1, Restore R2, …, Restore Rn

[Figure: control-flow graph of basic blocks bb 1 to bb 8 annotated with memory operations 1: Store A, 2: Store B, 3: Store C, 4: Load A, 5: Store C, 6: Load B, 7: Load B, 8: Load C, 9: Store A, 10: Store B, 11: Load C, 12: Store C]

Encore Heuristics

1) Selectively prune dynamically-dead code
   ↓ offending stores → ↑ Pr(idempotent)

2) Selectively fuse adjacent regions
   ↑ region size → ↑ Pr(recoverable)

3) Selectively instrument profitable regions

Page 11

Lightweight Checkpointing

[Figure: stack diagram. Labels: Traditional Call Stack (Input Parameters, Return Address, Local Variables), Encore Extensions (Live-in Registers; checkpoint entries addr_0/data_0, addr_1/data_1, …, addr_N/data_N), Frame Pointer, Stack Pointer]

Per live-in register checkpoint: 1 reg2mem store

Per on-demand memory checkpoint: 1 reg2mem store, 1 mem2mem copy, 1 stack ptr increment

Stack grows dynamically to accommodate checkpoint storage
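The checkpoint-then-restore behavior on this slide can be sketched in a few lines of Python. This is a hedged illustration, not Encore's implementation: `CheckpointLog`, `region`, and `run_with_recovery` are hypothetical names, a Python list stands in for the stack extension, a dictionary stands in for memory, and the fault is modeled as a single bit flip that a detector is assumed to catch before control leaves the region.

```python
class CheckpointLog:
    """Stack of (address, old value) pairs, analogous to addr_i/data_i entries."""

    def __init__(self) -> None:
        self.entries = []

    def checkpoint(self, mem: dict, addr: str) -> None:
        # Save the address and copy the current value before it is overwritten.
        self.entries.append((addr, mem[addr]))

    def restore(self, mem: dict) -> None:
        # Undo in reverse order so the oldest saved value wins.
        while self.entries:
            addr, old = self.entries.pop()
            mem[addr] = old


def region(mem: dict, log: CheckpointLog, inject_fault: bool = False) -> None:
    value = mem["B"]              # exposed load of B
    if inject_fault:
        value ^= 0x40             # model a transient bit flip in a register
    log.checkpoint(mem, "B")      # on-demand checkpoint before the offending store
    mem["B"] = value + 1          # offending store (write-after-read on B)


def run_with_recovery(mem: dict) -> dict:
    log = CheckpointLog()
    region(mem, log, inject_fault=True)   # faulty first attempt
    log.restore(mem)                      # recovery code: restore checkpointed state
    region(mem, log, inject_fault=False)  # re-execute from the region entry
    return mem


if __name__ == "__main__":
    print(run_with_recovery({"B": 5}))
```

Without the checkpoint, the re-executed region would read the corrupted value of B left by the faulty attempt rather than the original one.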

Page 12

Evaluation Methodology

Program analysis/instrumentation performed in the LLVM compiler

In-order, single-issue, embedded-class processor
   Dynamic instruction model based on profiled execution

Reliability coverage
   Analytical model in lieu of traditional fault injection
   Decouples evaluation from microarchitectural details

Page 13

Inherent Idempotence

[Figure: chart with legend entries 0% (dynamically-dead), <5%, <10%]

76% of application code is naturally idempotent

Page 14

Dynamic Execution Breakdown

Impact of detection latency

If control has left the region containing the original fault site, re-execution cannot correct the error

91% of execution time is spent within recoverable regions
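The effect of detection latency can be made concrete with a toy calculation. The Python sketch below is not the analytical model used in the talk; it assumes a single straight-line region, a fault site chosen uniformly among its dynamic instructions, and detection exactly `detection_latency` instructions after the fault. The function name and parameters are hypothetical.

```python
def recoverable_fraction(region_len: int, detection_latency: int) -> float:
    """Fraction of fault sites whose detection still happens inside the region.

    A fault at instruction i (0-based) is detected at instruction
    i + detection_latency; it is recoverable by re-execution only if that
    position is still within the region, i.e. i + detection_latency < region_len.
    """
    if region_len <= 0:
        raise ValueError("region_len must be positive")
    if detection_latency < 0:
        raise ValueError("detection_latency must be non-negative")
    safe_sites = max(0, region_len - detection_latency)
    return safe_sites / region_len


if __name__ == "__main__":
    for latency in (10, 100, 1000):
        print(latency, recoverable_fraction(1000, latency))
```

Under these assumptions, longer regions or shorter detection latencies both raise the fraction of faults that re-execution can correct.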

Page 15

Full System “Coverage”

[Figure: coverage chart with legend entries Existing (~100 instrs), Future (~10 instrs), Future (~1000 instrs)]

93% − 99.99% coverage, highly application dependent

Page 16

Overheads

3% − 22% performance degradation

Page 17

Summary

Large portions of applications, across domains, are (probabilistically) idempotent

Encore is a software-only solution that exploits this property to provide low-cost fault recovery

97% of faults on average are recoverable with current detection schemes, at a 15% performance penalty

Implementing Encore in a runtime system / virtual machine has the potential to yield even better results

Larger dynamic traces vs. static intervals

Dynamic vs. static memory analysis

Page 18

Questions?


http://cccp.eecs.umich.edu
