Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)

CS717, Jonathan Winter

Description

Hardware Fault Tolerance Through Simultaneous Multithreading (part 2). Jonathan Winter. 3 SMT + Fault Tolerance Papers. Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.

Transcript of Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)

Page 1: Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)

1

CS717

Hardware Fault Tolerance Through Simultaneous Multithreading (part 2)

Jonathan Winter

Page 2

3 SMT + Fault Tolerance Papers

• Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999.

• Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000.

• Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.

Page 3

Outline

1. Background
   • SMT
   • Hardware fault tolerance

2. AR-SMT
   • Basic mechanisms
   • Implementation issues
   • Simulation and Results

3. Transient Fault Detection via SMT
   • Sphere of replication
   • Basic mechanisms
   • Comparison to AR-SMT
   • Simulation and Results

4. Redundant Multithreading Alternatives
   • Realistic processor implementation
   • CRT
   • Simulation and Results

5. Fault Recovery

6. Next Lecture

Page 4

Transient Fault Detection via SMT

• More detailed analysis of Simultaneous and Redundant Threading (SRT)

• Introduces the Sphere of Replication concept
• Explores the SRT design space
• Discussion of input replication
• Architecture for output comparison
• Performance-improving mechanisms
• More depth in simulation

Page 5

Sphere of Replication

• Components inside sphere are protected against faults using replication

• External components must use other means of fault tolerance (parity, ECC, etc.)

• Inputs to sphere must be duplicated for each of the redundant processes

• Outputs of the redundant processes are compared to detect faults

• Simple to understand in lockstepping

• Larger sphere:
  – more state to replicate
  – less input replication and output comparison
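The sphere-of-replication idea above can be sketched in software. This is an illustrative toy, not the paper's hardware: a hypothetical `run_in_sphere` helper replicates the inputs to two copies of a computation and compares their outputs at the sphere boundary, where a mismatch signals a detected fault.

```python
# Toy software model of a sphere of replication (names are invented):
# replicate inputs to two redundant copies, compare outputs at the boundary.

def run_in_sphere(compute, inputs, inject_fault=False):
    # Input replication: both redundant copies see identical inputs.
    out_a = compute(inputs)
    out_b = compute(inputs)
    if inject_fault:
        out_b += 1  # model a transient fault corrupting one copy's output
    # Output comparison at the sphere boundary detects any divergence.
    fault_detected = out_a != out_b
    return out_a, fault_detected

result, fault = run_in_sphere(sum, [1, 2, 3])
assert (result, fault) == (6, False)
```

Everything inside `compute` is protected by replication; anything outside (here, the caller) would need other protection such as ECC, matching the bullets above.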

Page 6

Sphere of Replication (part 2)

• Size of sphere of replication
  – Two alternatives: with and without the register file
  – Instruction and data caches kept outside

Page 7

Input Replication

• Must ensure that both threads receive the same inputs to guarantee they follow the same execution path

• Instructions – assume no self-modifying code

• Cached load data
  – Out-of-order execution issues
  – Multiprocessor cache coherence issues

• Uncached load data – must synchronize

• External interrupts
  – Stall the leading thread and deliver the interrupt synchronously
  – Record the interrupt delivery point and deliver later

Page 8

Cached Load Data - ALAB

• Active Load Address Buffer (ALAB)
  – Delays cache block replacement or invalidation
  – ALAB is a table with an address tag, a counter, and a pending-invalidate bit
  – Counter tracks the trailing thread's outstanding loads
  – Blocks cannot be replaced or invalidated until the counter is zero
  – Pending-invalidate bit is set on an unevictable block
  – Leading thread stalls when the ALAB is full
  – Must detect and address deadlocks
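The ALAB bookkeeping above can be modeled in a few lines. This is a sketch with invented class and method names, not the hardware design: a counter per address tag tracks leading-thread loads the trailing thread has not yet repeated, and invalidation of a block is deferred until its counter reaches zero.

```python
# Toy software model of the ALAB (illustrative names, not the paper's RTL).

class ALAB:
    def __init__(self, size):
        self.size = size
        self.counts = {}            # address tag -> outstanding-load counter
        self.pending_inval = set()  # tags with the pending-invalidate bit set

    def leading_load(self, addr):
        if addr not in self.counts and len(self.counts) >= self.size:
            return False            # table full: the leading thread must stall
        self.counts[addr] = self.counts.get(addr, 0) + 1
        return True

    def trailing_load(self, addr):
        self.counts[addr] -= 1      # trailing thread repeats the load
        if self.counts[addr] == 0:  # last outstanding load retired
            del self.counts[addr]
            if addr in self.pending_inval:
                self.pending_inval.discard(addr)
                return "invalidate_now"   # deferred invalidation proceeds
        return "ok"

    def invalidate(self, addr):
        if addr not in self.counts:
            return "invalidate_now"       # no outstanding loads on this block
        self.pending_inval.add(addr)      # defer until the counter hits zero
        return "deferred"
```

The stall return value from `leading_load` is where the deadlock concern in the last bullet arises: the leading thread waits on the trailing thread, which may itself be blocked.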

Page 9

Cached Load Data - LVQ

• Load Value Queue (LVQ)– Explicit designation of leading and trailing thread– Only leading thread issues loads and stores– Load addresses and values forward to trailing

thread via LVQ– Trailing thread executes loads in-order and non-

speculatively (why?)– Input replication guaranteed– Design simpler and less pressure on cache– Earlier fault detection– Constrains scheduling of trailing thread loads
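A minimal sketch of the LVQ idea, under the assumption that it behaves as a simple FIFO (the names are invented): the leading thread alone touches the cache and forwards each load's address and value; the trailing thread consumes entries strictly in program order, which both guarantees input replication and checks the load address.

```python
from collections import deque

# Toy model of the Load Value Queue (illustrative, not the hardware).

class LoadValueQueue:
    def __init__(self):
        self.fifo = deque()

    def leading_load(self, addr, value_from_cache):
        # Only the leading thread accesses the cache; it forwards the result.
        self.fifo.append((addr, value_from_cache))
        return value_from_cache

    def trailing_load(self, addr):
        # In-order, non-speculative consumption: a FIFO pop must line up
        # with program order, so the trailing thread cannot reorder loads.
        lead_addr, value = self.fifo.popleft()
        if lead_addr != addr:
            raise RuntimeError("fault: load address mismatch")
        return value
```

Because the trailing thread never accesses the cache, cache pressure drops, and a corrupted load address is caught at the queue rather than at commit, matching the "earlier fault detection" bullet.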

Page 10

Output Comparison

• Store buffer used to verify the address and value of stores to be committed
  – Trailing thread searches for a matching entry
  – Mismatch means a fault occurred

• Cached load values require no checking

• Uncached load values could have side effects
  – Issued non-speculatively, so the leading thread stalls
  – Assumes uncached loads are always detected

• Register Check Buffer used to match register writebacks
  – 3 register files required: future files + architectural file
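The store-comparison step can be sketched as follows. This is a toy model with invented names, not the paper's design: the leading thread buffers each store's address and value, and the trailing thread's matching store releases it for commit, while any mismatch signals a detected fault.

```python
# Toy model of the output-comparison store buffer (illustrative names).

class CheckedStoreBuffer:
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.pending = []            # leading-thread stores awaiting a match

    def leading_store(self, addr, value):
        if len(self.pending) >= self.capacity:
            return "stall"           # buffer full: the leading thread stalls
        self.pending.append((addr, value))
        return "buffered"

    def trailing_store(self, addr, value):
        # Trailing thread searches for a matching (address, value) entry.
        if (addr, value) in self.pending:
            self.pending.remove((addr, value))
            return "commit"          # both threads agree: store may leave the sphere
        return "fault"               # mismatch means a fault occurred
</```

Stores are the point where state leaves the sphere of replication, which is why they are compared here while cached load values need no check.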

Page 11

Enhancing SRT Performance

• Slack Fetch
  – Maintain a constant lag between the threads' execution
  – Leading thread updates branch and data predictors
  – Leading thread prefetches loads
  – Traditional SMT ICount fetch policy is modified to maintain the slack

• Branch Outcome Queue
  – Delivers branch outcomes directly to the trailing thread
  – Trailing thread needs no control speculation
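The slack-fetch idea reduces to a fetch-arbitration rule. The sketch below uses an invented function and a simplified decision (the paper instead modifies the ICount policy): fetch for the leading thread until the target slack of committed instructions is built up, then let the trailing thread catch up.

```python
# Toy slack-fetch arbitration rule (illustrative; not the paper's exact
# modified-ICount policy). Slack is measured in committed instructions.

def pick_fetch_thread(lead_committed, trail_committed, target_slack=256):
    slack = lead_committed - trail_committed
    # Below the target slack, prioritize the leading thread so it can warm
    # predictors and prefetch; at or above it, let the trailing thread run.
    return "leading" if slack < target_slack else "trailing"
```

With the slack in place, the trailing thread's branches and loads hit warmed predictors and caches, which is where the speedups on the next slides come from.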

Page 12

AR-SMT versus SRT

• AR-SMT has space redundancy only in the functional units
• SRT can potentially have space redundancy across the pipeline
• AR-SMT is trace-processor-based, while SRT targets a conventional SMT core
• The register file of the R-stream must be protected
• AR-SMT forwards load data values
• AR-SMT checks every instruction during fault detection
• SRT requires no operating system modifications
• AR-SMT doesn't support uncached loads and stores or multiprocessor coherence
• The delay buffer performs the function of the register check buffer and branch outcome queue
• All of main memory is in the AR-SMT sphere
  – Better fault coverage but very costly

Page 13

Simulation Environment

• Modified SimpleScalar "sim-outorder"

• Long front-end pipeline because of out-of-order nature and SMT

• Simple approximation of trace cache

• Used 11 SPEC95 benchmarks

Page 14

Results

• Again, this paper only analyzes the performance impact of fault tolerance

• Baseline characterization
  – ORH-Dual: two pipelines, each with half the resources
  – SMT-Dual: replicated threads with no detection hardware

• ORH-Dual and SMT-Dual are 32% slower than SMT-Single

Page 15

Slack Fetch & Branch Outcome Queue

• 10%, 14%, and 15% (27% max) performance improvements for SF, BOQ, and SF + BOQ respectively

• Reduced memory stalls through prefetching

• Prevents trailing thread from wasting resources by speculating

• Performance is better with a slack of 256 instructions than with 32 or 128

Page 16

Input Replication

• Assumes output comparison performed by oracle

• Almost no performance penalty paid for 64-entry ALAB or LVQ

• With a 16-entry ALAB or LVQ, benchmark performance degrades by 8% and 5% respectively

Page 17

Output Comparison

• Assumes inputs are replicated by an oracle
• Leading thread can stall if the store buffer is full
• A 64-entry store buffer eliminates almost all stalls
• A register check buffer of 32, 64, or 128 entries degrades performance by 27%, 6%, and 1% respectively

Page 18

Overall Results

• Speedup of an SRT processor with a slack of 256 instructions, a 128-entry branch outcome queue, a 64-entry store buffer, and a 64-entry load value queue

• SRT demonstrates a 16% speedup on average (up to 29%) over a lockstepping processor with the "same" hardware

Page 19

Multi-cycle and Permanent Faults

• Transient faults could potentially persist for multiple cycles and affect both threads

• Increasing slack fetch decreases this possibility

• Spatial redundancy can be increased by partitioning the functional units and forcing the threads to execute on different groups

• Performance loss for this approach is less than 2%

Page 20

Conclusions

• Sphere of replication helps analysis of input replication and output comparison

• Keep the register file in the sphere
• LVQ is superior to ALAB (it is simpler)
• Slack fetch and the branch outcome queue mechanism enhance performance
• The SRT fault tolerance method performs 16% better on average than lockstepping