Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest...

46
Computer Architecture: Multithreading (III) Prof. Onur Mutlu Carnegie Mellon University

Transcript of Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest...

Page 1: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Computer Architecture:

Multithreading (III)

Prof. Onur Mutlu

Carnegie Mellon University

Page 2: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

A Note on This Lecture

These slides are partly from 18-742 Fall 2012, Parallel Computer Architecture, Lecture 13: Multithreading III

Video of that lecture:

http://www.youtube.com/watch?v=7vkDpZ1-hHM&list=PL5PHm2jkkXmh4cDkC3s1VBB7-njlgiG5d&index=13

2

Page 3: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Other Uses of Multithreading

Page 4: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Now that We Have MT Hardware …

… what else can we use it for?

Redundant execution to tolerate soft (and hard?) errors

Implicit parallelization: thread level speculation

Slipstream processors

Leader-follower architectures

Helper threading

Prefetching

Branch prediction

Exception handling4

Page 5: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

SMT for Transient Fault Detection

Transient faults: Faults that persist for a “short” duration

Also called “soft errors”

Caused by cosmic rays (e.g., neutrons)

Leads to transient changes in wires and state (e.g., 01)

Solution

no practical absorbent for cosmic rays

1 fault per 1000 computers per year (estimated fault rate)

Fault rate likely to increase in the feature

smaller feature size

reduced voltage

higher transistor count

reduced noise margin5

Page 6: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Need for Low-Cost Transient Fault Tolerance

The rate of transient faults is expected to increase significantly Processors will need some form of fault

tolerance.

However, different applications have different reliability requirements (e.g. server-apps vs. games) Users who do

not require high reliability may not want to pay the overhead.

Fault tolerance mechanisms with low hardware cost are attractive because they allow the designs to be used for a wide variety of applications.

6

Page 7: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Traditional Mechanisms for Transient Fault Detection

Storage structures

Space redundancy via parity or ECC

Overhead of additional storage and operations can be high in time-critical paths

Logic structures

Space redundancy: replicate and compare

Time redundancy: re-execute and compare

Space redundancy has high hardware overhead.

Time redundancy has low hardware overhead but high performance overhead.

What additional benefit does space redundancy have?

7

Page 8: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Lockstepping (Tandem, Compaq Himalaya)

Idea: Replicate the processor, compare the results of two processors before committing an instruction

8

R1 (R2)

Input

Replication

Output

Comparison

Memory covered by ECC

RAID array covered by parity

Servernet covered by CRC

R1 (R2)

microprocessor microprocessor

Page 9: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Transient Fault Detection with SMT (SRT)

Idea: Replicate the threads, compare outputs before committing an instruction

Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000.

Rotenberg, “AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors,” FTCS 1999.

9

R1 (R2)

Input

Replication

Output

Comparison

Memory covered by ECC

RAID array covered by parity

Servernet covered by CRC

R1 (R2)

THREAD THREAD

Page 10: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Sim. Redundant Threading vs. Lockstepping SRT Advantages

+ No need to replicate the processor

+ Uses fine-grained idle FUs/cycles (due to dependencies, misses) to execute the same program redundantly on the same processor

+ Lower hardware cost, better hardware utilization

Disadvantages

- More contention between redundant threads higher

performance overhead (assuming unequal hardware)

- Requires changes to processor core for result comparison, value communication

- Must carefully fetch & schedule instructions from threads

- Cannot easily detect hard (permanent) faults

10

Page 11: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Sphere of Replication

Logical boundary of redundant execution within a system

Need to replicate input data from outside of sphere of replication to send to redundant threads

Need to compare and validate output before sending it out of the sphere of replication

11

Rest of System

Sphere of Replication

Output

Compariso

n

Input

Replication

Execution

Copy 1

Execution

Copy 2

Page 12: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Sphere of Replication in SRT

12

Fetch PC

Instruction

Cache

Decode Register Rename

FpRegs

Int .Regs

FpUnits

Ld /StUnits

Int .Units

Thread 0

Thread 1

R1 (R2)

R1 (R2)

R3 = R1 + R7

R8 = R7 * 2

RUU

Page 13: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Input Replication

How to get the load data for redundant threads pair loads from redundant threads and access the cache when

both are ready: too slow – threads fully synchronized

allow both loads to probe cache separately: false alarms with I/O or multiprocessors

Load Value Queue (LVQ)

pre-designated leading & trailing threads

13

add

load R1(R2)

subadd

load R1 (R2)

sub

probe cache

LVQ

Page 14: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Output Comparison

<address, data> for stores from redundant threads compare & validate at commit time

How to handle cached vs. uncacheable loads

Stores now need to live longer to wait for trailing thread

Need to ensure matching trailing store can commit

14

Store: ...

Store: R1 (R2)Store: ...

Store: R1 (R2)Store: ...Store: ...

Store: ...Store

Queue

Output

Comparison To Data Cache

Page 15: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

SRT Performance Optimizations

Many performance improvements possible by supplying results from the leading thread to the trailing thread: branch outcomes, instruction results, etc

Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002.

15

Page 16: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Recommended Reading

16

Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002.

Page 17: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Branch Outcome Queue

17

Page 18: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Line Prediction Queue

Line Prediction Queue

Alpha 21464 fetches chunks using line predictions

Chunk = contiguous block of 8 instructions

18

Page 19: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Handling of Permanent Faults via SRT

SRT uses time redundancy

Is this enough for detecting permanent faults?

Can SRT detect some permanent faults? How?

Can we incorporate explicit space redundancy into SRT?

Idea: Execute the same instruction on different resources in an SMT engine

Send instructions from different threads to different execution units (when possible)

19

Page 20: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

SRT Evaluation

SPEC CPU95, 15M instrs/thread

Constrained by simulation environment

120M instrs for 4 redundant thread pairs

Eight-issue, four-context SMT CPU

Based on Alpha 21464

128-entry instruction queue

64-entry load and store queues

Default: statically partitioned among active threads

22-stage pipeline

64KB 2-way assoc. L1 caches

3 MB 8-way assoc L2

20

Page 21: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Performance Overhead of SRT

Performance degradation = 30% (and unavailable thread context)

Per-thread store queue improves performance by 4%

21

Page 22: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Chip Level Redundant Threading

SRT typically more efficient than splitting one processor into two half-size cores

What if you already have two cores?

Conceptually easy to run these in lock-step

Benefit: full physical redundancy

Costs:

Latency through centralized checker logic

Overheads (e.g., branch mispredictions) incurred twice

We can get both time redundancy and space redundancy if we have multiple SMT cores

SRT for CMPs

22

Page 23: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Chip Level Redundant Threading

23

Page 24: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Some Other Approaches to Transient Fault Tolerance

Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999.

Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005.

24

Page 25: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

DIVA

Idea: Have a “functional checker” unit that checks the correctness of the computation done in the “main processor”

Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999.

Benefit: Main processor can be prone to faults or sometimes incorrect (yet very fast)

How can checker keep up with the main processor?

Verification of different instructions can be performed in parallel (if an older one is incorrect all later instructions will be flushed anyway)

25

Page 26: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

DIVA (Austin, MICRO 1999)

Two cores

26

Page 27: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

DIVA Checker for One Instruction

27

Page 28: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

A Self-Tuned System using DIVA

28

Page 29: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

DIVA Discussion

Upsides?

Downsides?

29

Page 30: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Some Other Approaches to Transient Fault Tolerance

Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999.

Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005.

30

Page 31: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Microarchitecture Based Introspection

Idea: Use cache miss stall cycles to redundantly execute the program instructions

Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005.

Benefit: Redundant execution does not have high performance overhead (when there are stall cycles)

Downside: What if there are no/few stall cycles?

31

Page 32: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Introspection

32

Page 33: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

MBI (Qureshi+, DSN 2005)

33

Page 34: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

MBI Microarchitecture

34

Page 35: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Performance Impact of MBI

35

Page 36: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Food for Thought

Do you need to check that the result of every instruction is correct?

Do you need to check that the result of any instruction is correct?

What do you really need to check for to ensure correct operation?

Soft errors?

Hard errors?

36

Page 37: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Other Uses of Multithreading

Page 38: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

MT for Exception Handling

Exceptions cause overhead (especially if handled in software)

Some exceptions are recoverable from (TLB miss, unaligned access, emulated instructions)

Pipe flushes due to exceptions reduce thread performance

38

Page 39: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

MT for Exception Handling

Cost of software TLB miss handling

Zilles et al., “The use of multithreading for exception handling,” MICRO 1999.

39

Page 40: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

MT for Exception Handling

Observation:

The same application instructions are executed in the same order INDEPENDENT of the exception handler’s execution

The data dependences between the thread and exception handler are minimal

Idea: Execute the exception handler in a separate thread context; ensure appearance of sequential execution

40

Page 41: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

MT for Exception Handling

Better than pure software, not as good as pure hardware handling

41

Page 42: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Why These Uses?

What benefit of multithreading hardware enables them?

Ability to communicate/synchronize with very low latency between threads

Enabled by proximity of threads in hardware

Multi-core has higher latency to achieve this

42

Page 43: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Helper Threading for Prefetching

Idea: Pre-execute a piece of the (pruned) program solely for prefetching data

Only need to distill pieces that lead to cache misses

Speculative thread: Pre-executed program piece can be considered a “thread”

Speculative thread can be executed

On a separate processor/core

On a separate hardware thread context

On the same thread context in idle cycles (during cache misses)

43

Page 44: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Helper Threading for Prefetching

How to construct the speculative thread:

Software based pruning and “spawn” instructions

Hardware based pruning and “spawn” instructions

Use the original program (no construction), but

Execute it faster without stalling and correctness constraints

Speculative thread

Needs to discover misses before the main program

Avoid waiting/stalling and/or compute less

To get ahead, uses

Branch prediction, value prediction, only address generation computation

44

Page 45: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Generalized Thread-Based Pre-Execution

Dubois and Song, “Assisted Execution,” USC Tech Report 1998.

Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),”ISCA 1999.

Zilles and Sohi, “Execution-based Prediction Using Speculative Slices”, ISCA 2001.

45

Page 46: Computer Architecture: Multithreading (III)ece740/f13/lib/exe/fetch.php?... · 2013-11-10 · Rest of System Sphere of Replication Output Compariso n Input Replication Execution Copy

Thread-Based Pre-Execution Issues

Where to execute the precomputation thread?

1. Separate core (least contention with main thread)

2. Separate thread context on the same core (more contention)

3. Same core, same context

When the main thread is stalled

When to spawn the precomputation thread?

1. Insert spawn instructions well before the “problem” load

How far ahead?

Too early: prefetch might not be needed

Too late: prefetch might not be timely

2. When the main thread is stalled

When to terminate the precomputation thread?

1. With pre-inserted CANCEL instructions

2. Based on effectiveness/contention feedback

46