9/25/2006 eleg652-F06 1 Topic 3 Exploitation of Instruction Level Parallelism The secret to creativity is knowing how to hide your sources. Albert Einstein
  • date post

    15-Jan-2016

Page 1:

Topic 3

Exploitation of Instruction Level Parallelism

The secret to creativity is knowing how to hide your sources. Albert Einstein

Page 2:

Reading List

• Slides: Topic3x

• Henn&Patt: Chapters 3 & 4

• Other assigned readings from homework and classes

Page 3:

Instruction Level Parallelism

• Parallelism that is found between instructions (or intra instruction)

• Dynamic and Static Exploitation
  – Dynamic: Hardware related
  – Static: Software related (compiler and system software)

• VLIW and Superscalar

• Micro-Dataflow and Tomasulo’s Algorithm

Page 4:

RISC Concepts: Revisited

• Reduced Instruction Set Computer (RISC)
  – “Internal Computing Architecture in which processor instructions are pared down so that most of them can be executed in one clock cycle, theoretically improving computing efficiency” (Black Box Pocket Glossary of Computer Terms)

• Characteristics:
  – Uniform instruction encoding
  – Homogeneous register banks
  – Simplified addressing modes
  – Simplified data structures
  – Branch delay slot
  – Cache
  – Pipeline

Page 5:

RISC Concepts: Revisited

• What prevents one instruction per cycle (CPI = 1)?
  – Hazards
  – Dependencies
  – Long latency ops

• Cache Thrashing

Page 6:

Pipeline: A Review

• Hazards
  – Any situation that prevents the smooth flow of instructions along the pipeline
  – Types
    • Structural: due to limited resources and contention among them
    • Control: instructions that change the PC (program counter)
    • Data: an instruction depends on a value produced by a previous instruction
  – Stall
    • Hazards will “stall” the pipeline
    • Serious: a stall can hold up many instructions for many cycles

Page 7:

RISC Pipeline & Instruction Issue

• Instruction Issue
  – The process of letting an instruction move from ID to EXEC
  – Issue vs. Execution

• In DLX
  – ID checks for all data hazards and stalls if any exists

Typical RISC Pipeline:

Instruction Fetch -> Instruction Decode -> Execute -> Memory Op -> Register Update

Page 8:

Hazards

• Structural Hazards
  – Non-pipelined function units
  – One-port register bank and one-port memory bank

• Data Hazards
  – For some: Forwarding
  – For others: Pipeline Interlock

Example: LD R1 <- A, followed by R4 <- R1 + R7

Need Bubble / Stall

Page 9:

Structural Hazard

A single memory bank for instructions and data

Instruction        Pipeline stages (clock cycles 1 to 9)
Load instruction   IF ID EX MEM WB
Instruction i+1    IF ID EX MEM WB
Instruction i+2    IF ID EX MEM WB
Instruction i+3    IF ID EX MEM WB
Instruction i+4    IF ID EX MEM

Page 10:

Data Hazards

Instruction   1    2    3    4    5    6
ADD           IF   ID   EX   MEM  WB
SUB                IF   ID   EX   MEM  WB

Data is read in the ID stage and written in the WB stage.

The ADD instruction writes a register that is a source operand for the SUB instruction. But the ADD does not write the register file until its WB stage (cycle 5), two clock cycles after SUB reads it in its ID stage (cycle 3)!

The SUB instruction may read the incorrect value, so the result may be non-deterministic. This is solved by forwarding.

Page 11:

Data Dependency: A Review

Flow Dependency (RAW Conflicts)
  B + C -> A
  A + D -> E

Anti Dependency (WAR Conflicts)
  A + C -> B
  E + D -> A

Output Dependency (WAW Conflicts)
  B + C -> A
  E + D -> A

RAR is not really a problem
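
The three cases can be told apart mechanically by comparing destinations and sources. The following is a minimal sketch, assuming each instruction is encoded as a (destination, set of sources) pair; the function name and the encoding are illustrative and not part of the slides.

```python
def classify_dependences(first, second):
    """Return the dependence kinds that `second` has on the earlier `first`.

    Each instruction is a (dest, sources) pair, e.g. ("A", {"B", "C"})
    for B + C -> A. This encoding is an assumption of the sketch.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = []
    if dest1 in srcs2:
        kinds.append("RAW")  # flow: second reads what first writes
    if dest2 in srcs1:
        kinds.append("WAR")  # anti: second overwrites what first reads
    if dest1 == dest2:
        kinds.append("WAW")  # output: both write the same location
    return kinds


# The three instruction pairs from the slide.
flow = classify_dependences(("A", {"B", "C"}), ("E", {"A", "D"}))
anti = classify_dependences(("B", {"A", "C"}), ("A", {"E", "D"}))
output = classify_dependences(("A", {"B", "C"}), ("A", {"E", "D"}))
print(flow, anti, output)
```

Read-after-read never appears in the result because neither instruction changes the shared location.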

Page 12:

Forwarding Example

ADD R1,R2,R3      IF  ID  EX  MEM WB
SUB R4,R1,R5          IF  ID  EX  MEM WB
AND R6,R1,R7              IF  ID  EX  MEM WB
OR  R8,R1,R9                  IF  ID  EX  MEM WB
XOR R10,R1,R11                    IF  ID  EX  MEM WB

Page 13:

Bypassing Pitfalls

The Code

LW  R1, 32(R6)
ADD R4, R1, R7
SUB R5, R1, R8
AND R6, R1, R7

The Pipeline

LW    IF    ID    EX    MEM   WB
ADD         IF    ID    STALL EX    MEM   WB
SUB               IF    STALL ID    EX    MEM   WB
AND                     STALL IF    ID    EX    MEM   WB

Load Delay Slot cannot be eliminated by forwarding alone

Pipeline Interlock: Stall / Bubble for hazards that cannot be solved by forwarding

Page 14:

Pipelining

• Issue: pass the Instruction Decode stage

• DLX: only issue an instruction if there is no hazard

• Detecting interlocks early in the pipeline has the advantage that the hardware never needs to suspend an instruction and undo state changes.

Page 15:

Instruction Level Parallelism

• Static Scheduling
  – Simple Scheduling
  – Loop Unrolling
  – Loop Unrolling + Scheduling
  – Software Pipelining

• Dynamic Scheduling
  – Out of order execution
  – Data Flow computers

• Speculation

Page 16:

Constraint Graph

• Directed edges: data dependence

• Undirected edges: resource constraint

• An edge (u,v) (directed or undirected) of length e represents an interlock between nodes u and v, and they must be separated by at least e time units.

[Figure: constraint graph over nodes S1 to S6; the edges are labeled with operation latencies]

Page 17:

Code Scheduling For Single Pipeline

• Input: A constraint graph G = (V, E)

• Output: A sequence of the operations in G (v1, v2, v3, v4, v5, ..., vn) plus a number of no-ops, such that:
  – If the no-ops are deleted, the sequence is a topological sort of G.
  – Any two nodes x, y in the sequence are separated by a distance greater than or equal to d(x,y) in graph G
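
The two output conditions can be checked directly on a candidate sequence. The sketch below is one possible reading, assuming that "distance" means the difference in slot positions and that only directed (data-dependence) edges are given; the node names S1 to S3 and their edge lengths are made up for illustration and are not the graph from the slides.

```python
def is_valid_schedule(sequence, edges):
    """Check the two output conditions for a padded operation sequence.

    sequence: list of operation names, with "NOP" marking a no-op slot.
    edges: dict mapping a directed edge (u, v) to the minimum distance
    d(u, v), counted in slots. Both encodings are assumptions.
    """
    position = {op: i for i, op in enumerate(sequence) if op != "NOP"}
    for (u, v), distance in edges.items():
        # v must come after u (topological order) and far enough away.
        if position[v] - position[u] < max(distance, 1):
            return False
    return True


# Hypothetical three-node graph: S2 needs 2 slots after S1, S3 needs 1 after S2.
edges = {("S1", "S2"): 2, ("S2", "S3"): 1}
ok = is_valid_schedule(["S1", "NOP", "S2", "S3"], edges)
too_close = is_valid_schedule(["S1", "S2", "S3"], edges)
wrong_order = is_valid_schedule(["S2", "NOP", "S1", "S3"], edges)
print(ok, too_close, wrong_order)
```

A scheduler then searches for the shortest sequence that passes this check; the checker itself says nothing about how to find it.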

Page 18:

Advanced Pipelining

• Instruction reordering and scheduling within the loop body

• Loop Unrolling
  – Code size suffers

• Superscalar
  – Compact code
  – Multiple issue of different instruction types

• VLIW

Page 19:

VLIW

• Very Long Instruction Word

• The compiler has all responsibility for scheduling instructions

• Makes hardware simpler
  – Moves complexity to software

• Concept developed by Josh Fisher at Yale University in the early 1980s

Page 20:

An Example: X[i] + a

Loop: LD   F0, 0(R1)    ; load the vector element
      ADDD F4, F0, F2   ; add the scalar in F2
      SD   0(R1), F4    ; store the vector element
      SUB  R1, R1, #8   ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop     ; branch when it’s not zero

Instruction Producer | Instruction Consumer | Latency
FP ALU op            | FP ALU op            | 3
FP ALU op            | Store Double         | 2
Load Double          | FP ALU op            | 1
Load Double          | Store Double         | 0

Load can bypass the store.

Assume that the latency for integer ops is zero and the latency for an integer load is 1.

Page 21:

An Example: X[i] + a

Loop: LD   F0, 0(R1)    1
      STALL             2
      ADDD F4, F0, F2   3
      STALL             4
      STALL             5
      SD   0(R1), F4    6
      SUB  R1, R1, #8   7
      BNEZ R1, Loop     8
      STALL             9

Stall labels on the slide (top to bottom): Load Latency, FP ALU Latency, Load Latency

This requires 9 cycles per iteration

Constraint Graph: nodes LD, ADDD, SD, SUB, BNEZ; edge latencies 1, 2, 0, 0, 1

Page 22:

An Example: X[i] + a

Loop: LD   F0, 0(R1)    1
      STALL             2
      ADDD F4, F0, F2   3
      SUB  R1, R1, #8   4
      BNEZ R1, Loop     5
      SD   8(R1), F4    6

This requires 6 cycles per iteration

Constraint Graph: nodes LD, ADDD, SD, SUB, BNEZ; edge latencies 1, 2, 0, 0, 1

Scheduling
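
The 9-cycle and 6-cycle counts can be re-derived from the latency table. The sketch below assumes a single-issue, in-order pipeline, reads a latency of k as "k stall cycles between producer and consumer", and assumes one branch delay slot; the function name and the dictionary encoding are illustrative only.

```python
def issue_cycles(order, latency):
    """Assign an issue cycle to each op of a single-issue, in-order pipeline.

    order: operation names in program order.
    latency: dict mapping (producer, consumer) to the number of stall
    cycles needed between them; pairs not listed need none.
    """
    cycle = {}
    current = 0
    for op in order:
        earliest = current + 1  # at most one instruction issues per cycle
        for (producer, consumer), stalls in latency.items():
            if consumer == op and producer in cycle:
                earliest = max(earliest, cycle[producer] + stalls + 1)
        cycle[op] = earliest
        current = earliest
    return cycle


# Load Double -> FP ALU op: 1 stall; FP ALU op -> Store Double: 2 stalls.
latency = {("LD", "ADDD"): 1, ("ADDD", "SD"): 2}

unscheduled = issue_cycles(["LD", "ADDD", "SD", "SUB", "BNEZ"], latency)
unscheduled_total = unscheduled["BNEZ"] + 1  # plus the empty branch delay slot

scheduled = issue_cycles(["LD", "ADDD", "SUB", "BNEZ", "SD"], latency)
scheduled_total = scheduled["SD"]  # SD itself fills the delay slot

print(unscheduled_total, scheduled_total)
```

The model only counts cycles. It does not check that the store offset must change from 0(R1) to 8(R1) once SD is moved below SUB.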

Page 23:

An Example: X[i] + a

Unrolling

Loop: LD   F0, 0(R1)       1
      NOP                  2
      ADDD F4, F0, F2      3
      NOP                  4
      NOP                  5
      SD   0(R1), F4       6
      LD   F6, -8(R1)      7
      NOP                  8
      ADDD F8, F6, F2      9
      NOP                  10
      NOP                  11
      SD   -8(R1), F8      12
      LD   F10, -16(R1)    13
      NOP                  14
      ADDD F12, F10, F2    15
      NOP                  16
      NOP                  17
      SD   -16(R1), F12    18
      LD   F14, -24(R1)    19
      NOP                  20
      ADDD F16, F14, F2    21
      NOP                  22
      NOP                  23
      SD   -24(R1), F16    24
      SUB  R1, R1, #32     25
      BNEZ R1, LOOP        26
      NOP                  27

This requires about 6.8 cycles per iteration (27 cycles for 4 iterations)

Page 24:

An Example: X[i] + a

Unrolling + Scheduling

Loop: LD   F0, 0(R1)       1
      LD   F6, -8(R1)      2
      LD   F10, -16(R1)    3
      LD   F14, -24(R1)    4
      ADDD F4, F0, F2      5
      ADDD F8, F6, F2      6
      ADDD F12, F10, F2    7
      ADDD F16, F14, F2    8
      SD   0(R1), F4       9
      SD   -8(R1), F8      10
      SD   -16(R1), F12    11
      SUB  R1, R1, #32     12
      BNEZ R1, LOOP        13
      SD   8(R1), F16      14

This requires 3.5 cycles per iteration (14 cycles for 4 iterations)

Page 25:

Topic 3a

Multi Issue Architectures

Beyond Simple RISC

Page 26:

ILP

• ILP of a program
  – Average number of instructions that a superscalar processor might be able to execute at the same time
    • Data dependencies
    • Latencies and other processor difficulties

• ILP of a machine
  – The ability of a processor to take advantage of the ILP
    • Number of instructions that can be fetched and executed at the same time by such a processor

Page 27:

Multi Issue Architectures

• Superscalar
  – Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and runtime scheduler

• Very Long Instruction Word
  – A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue

Patterson & Hennessy P317 and P318

Page 28:

Multiple Instruction Issue

• Multiple Issue + Static Scheduling -> VLIW

• Dynamic Scheduling
  – Tomasulo
  – Scoreboarding

• Multiple Issue + Dynamic Scheduling -> Superscalar

• Decoupled Architectures
  – Static scheduling of R-R instructions
  – Dynamic scheduling of memory ops
    • Buffers

Page 29:

Five Primary Approaches

Common Name | Issue Structure | Hazard Detection | Scheduling | Distinguishing characteristics | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out-of-order execution | Pentium 3 and 4
VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium

Page 30:

Two-Issue Architecture: Unrolled and Scheduled Code

      Integer instruction   | FP instruction      | Clock cycle
Loop: LD  F0, 0(R1)         |                     | 1
      LD  F6, -8(R1)        |                     | 2
      LD  F10, -16(R1)      | ADDD F4, F0, F2     | 3
      LD  F14, -24(R1)      | ADDD F8, F6, F2     | 4
      LD  F18, -32(R1)      | ADDD F12, F10, F2   | 5
      SD  0(R1), F4         | ADDD F16, F14, F2   | 6
      SD  -8(R1), F8        | ADDD F20, F18, F2   | 7
      SD  -16(R1), F12      |                     | 8
      SD  -24(R1), F16      |                     | 9
      SUB R1, R1, #40       |                     | 10
      BNEZ R1, LOOP         |                     | 11
      SD  8(R1), F20        |                     | 12

The unrolled and scheduled code takes 2.4 cycles per iteration (5 iterations in 12 cycles)

Page 31:

A VLIW Code Sequence

Unrolling 6 times

Memory reference 1 | Memory reference 2 | FP operation 1 | FP operation 2 | Integer operation / branch
LD F0, 0(R1) | LD F6, -8(R1) | | |
LD F10, -16(R1) | LD F14, -24(R1) | | |
LD F18, -32(R1) | LD F22, -40(R1) | ADDD F4, F0, F2 | ADDD F8, F6, F2 |
LD F26, -48(R1) | | ADDD F12, F10, F2 | ADDD F16, F14, F2 |
 | | ADDD F20, F18, F2 | ADDD F24, F22, F2 |
SD 0(R1), F4 | SD -8(R1), F8 | ADDD F28, F26, F2 | |
SD -16(R1), F12 | SD -24(R1), F16 | | |
SD -32(R1), F20 | SD -40(R1), F24 | | | SUB R1, R1, #48
SD -0(R1), F28 | | | | BNEZ R1, LOOP

[Figure: seven per-iteration dataflow chains, each LD -> source register -> (+ a) -> result register -> SD, for the register pairs F0/F4, F6/F8, F10/F12, F14/F16, F18/F20, F22/F24, F26/F28]

7 iterations in 9 cycles: about 1.29 cycles per iteration

Page 32:

Trace Scheduling

• First used for VLIW architectures

• Trace
  – A straight-line sequence of instructions executed for some data, or a sequence of ops that constitutes a possible path based on “predicted” branches

• Trace Scheduling
  – Identify a “most likely” sequence of instructions and then “compact” the instructions in that path

• Tools
  – For loops: unrolling
  – For branches: static branch prediction

Page 33:

An Example: Traces

A; B; C;
if (D) { E; F; }
else { G; }
H; I;

Basic Block: an instruction sequence which has only one entry point and one exit point (no branch targets or branches in the middle)

[Figure: control flow graph with basic blocks {A, B, C, br D}, {E, F}, {G}, and {H, I}; Trace 1 and Trace 2 mark the two paths through the branch]
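
The candidate traces of this example are simply the paths through its control flow graph. The sketch below enumerates them, assuming the graph is stored as a dictionary from block name to successor list; the block names and the function are illustrative, and a real trace scheduler would additionally rank the paths by predicted or profiled likelihood.

```python
def enumerate_paths(cfg, entry, exit_block):
    """List every acyclic path of basic blocks from entry to exit_block."""
    paths = []

    def walk(block, path):
        path = path + [block]
        if block == exit_block:
            paths.append(path)
            return
        for successor in cfg[block]:
            if successor not in path:  # skip back edges
                walk(successor, path)

    walk(entry, [])
    return paths


# Basic blocks of the example: ABC ends in the branch on D.
cfg = {"ABC": ["EF", "G"], "EF": ["HI"], "G": ["HI"], "HI": []}
traces = enumerate_paths(cfg, "ABC", "HI")
print(traces)
```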

Page 34:

Code Motion & Compensation Code

[Figure: three control flow graphs.
 Original code: {A, B, C, br D}, then {E, F} or {G}, then {H, I}.
 Code moved to the succeeding blocks: {A, B, br D}, then {C, E} or {C, G}, then {F, H, I}.
 Code moved to the preceding block: {A, B, C, E, br D}, then {F, H} or {Undo E, G, H}, then {I}.]

Page 35:

Trace Scheduling

• Similar to basic block scheduling
  – The scheduling unit is a trace, not a basic block

• Reduces the execution time of likely traces
  – Using profiling

Page 36:

Software Pipeline

• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations

• Smaller code size
  – Compared to unrolling

• Some architectures have specific support for it
  – Rotating register banks
  – Predicated instructions

Page 37:

Software Pipelining

• Overlap instructions without unrolling the loop

• Given the vector M in memory, and ignoring the start-up and finishing code, we have:

Loop: SD   0(R1), F4     ; stores into M[i]
      ADDD F4, F0, F2    ; adds to M[i + 1]
      LD   F0, -8(R1)    ; loads M[i + 2]
      BNEZ R1, LOOP
      SUB  R1, R1, #8    ; subtract in delay slot

This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
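
The start-up and clean-up code that the slide leaves out can be seen in a small functional model. This is a sketch only: it uses ascending list indices instead of the descending pointer in R1, the variables f0 and f4 stand in for the registers F0 and F4, and both function names are made up for illustration.

```python
def add_scalar_plain(m, a):
    """Reference loop: each iteration loads, adds, and stores one element."""
    return [x + a for x in m]


def add_scalar_pipelined(m, a):
    """Software-pipelined form: store i, add i + 1, load i + 2 per iteration."""
    n = len(m)
    if n < 2:
        return add_scalar_plain(m, a)  # too short to overlap anything
    out = list(m)
    # Prologue: start the first two iterations.
    f0 = m[0]        # LD for element 0
    f4 = f0 + a      # ADDD for element 0
    f0 = m[1]        # LD for element 1
    # Steady state: one store, one add, one load per trip.
    for i in range(n - 2):
        out[i] = f4      # SD: result for element i
        f4 = f0 + a      # ADDD: element i + 1
        f0 = m[i + 2]    # LD: element i + 2
    # Epilogue: drain the two iterations still in flight.
    out[n - 2] = f4
    out[n - 1] = f0 + a
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(add_scalar_pipelined(data, 10.0))
```

The loop body touches three different iterations per trip, which is what lets the store, add, and load proceed without waiting on each other.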

Page 38:

Software Pipelining

Rows are time steps, columns are iterations ("-" marks an idle slot).

Time   Iter 1   Iter 2   Iter 3   Iter 4   Iter 5   Iter 6   Iter 7
1      LD       -        -        -        -        -        -
2      -        LD       -        -        -        -        -
3      ADDD     -        LD       -        -        -        -
4      -        ADDD     -        LD       -        -        -
5      -        -        ADDD     -        LD       -        -
6      SD       -        -        ADDD     -        LD       -
7      BNEZ     SD       -        -        ADDD     -        LD
8      -        BNEZ     SD       -        -        ADDD     -
9      -        -        BNEZ     SD       -        -        ADDD
10     -        -        -        BNEZ     SD       -        -
11     -        -        -        -        BNEZ     SD       -

Page 39:

Software Pipeline

Overhead for software pipelining: paid two times, once for the prologue and once for the epilogue

Overhead for an unrolled loop: paid M / N times, for M loop iterations and an unrolling factor of N

[Figure: two plots of the number of overlapped instructions versus time, one for software-pipelined code (with prologue and epilogue marked) and one for the unrolled loop]

Page 40:

Loop Unrolling vs. Software Pipelining

• When not running at maximum rate
  – Unrolling: pay the overhead m/n times, for m iterations and an unrolling factor of n
  – Software pipelining: pay it two times
    • Once at the prologue and once at the epilogue

• Moreover
  – Code compactness
  – Optimal runtime
  – Storage constraints

Page 41:

Comparison of Static Methods

Method                   | Cycles per iteration
w/o scheduling           | 9
scheduling               | 6
unrolling                | 6.8
unrolling + scheduling   | 3.5
2 issue                  | 2.4
4 issue                  | 1.29
SP 1-issue               | 5
SP 5-issue               | 1
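
Most of these figures are just total cycles divided by the number of original iterations covered by one pass through the code. A minimal sketch of that arithmetic, using the cycle and iteration counts read off the earlier code listings (the dictionary keys are labels chosen for this sketch):

```python
from fractions import Fraction

# (total cycles, original iterations covered), read off the code listings.
listings = {
    "w/o scheduling": (9, 1),
    "scheduling": (6, 1),
    "unrolling": (27, 4),
    "unrolling + scheduling": (14, 4),
    "2 issue": (12, 5),
    "VLIW listing": (9, 7),
}

cycles_per_iteration = {
    name: Fraction(cycles, iters) for name, (cycles, iters) in listings.items()
}

for name, value in cycles_per_iteration.items():
    print(f"{name}: {float(value):.2f}")
```

Exact fractions make the rounding visible: 27/4 is 6.75 and 9/7 is about 1.29.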

Page 42:

On a Final Note

Loop unrolling, trace scheduling, and software pipelining all aim at exposing fine grain parallelism.

“The effectiveness of these techniques and their suitability for various architectural approaches are among the most open research areas in pipelined processor design”

- Henn & Patt

Page 43:

Limitations of VLIW

• Limited parallelism in (statically scheduled) code
  – Basic blocks may be too small
  – Global code motion is difficult

• Limited hardware resources

• Code size

• Memory port limitations

• A stall is serious

• Cache is difficult to use effectively
  – i-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width
  – The cache miss penalty is increased because of the length of the instruction word

Page 44:

An Open Question

“...Whether there are large classes of applications that are not suitable for vector machines, but still offer enough parallelism to justify the VLIW approach rather than a simpler one, such as a superscalar machine?”

Henn & Patt 1990

Page 45:

A VLIW Example

[Figure: TMS320C62x/C67x block diagram]

Source: TMS320C6000 Technical Brief. February 1999

Page 46:

A VLIW Example

[Figure: TMS320C62x/C67x data paths]

Source: TMS320C6000 Technical Brief. February 1999

Assembly Example

Page 47:

Introduction to SuperScalar

Topic 3b

Page 48:

Instruction Issue Policy

• It determines the processor's look-ahead policy
  – Ability to examine instructions beyond the current PC

• Look-ahead must ensure correctness at all costs

• Issue policy
  – Protocol used to issue instructions

• Note: Issue, execution and completion

Page 49:

Issues in Out of Order Execution & Completion

R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)

[Figure: dependence graph over instructions (1) to (4), with flow, anti, and output dependence edges]

(2) and (3) cannot be completed out of order. Otherwise the anti-dependence may be violated: the R3 that (2) reads may already have been overwritten by (3), for example when (2) was stalled for some reason.

Page 50:

Issues in Out of Order Execution & Completion

R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)

[Figure: dependence graph over instructions (1) to (4), with flow, anti, and output dependence edges]

(1) and (3) cannot be completed out of order! Before an instruction is issued, its output dependence has to be checked against all preceding instructions that are already in the execution pipes, to ensure that results are written in the correct order. Otherwise R3 in (4) may get a wrong value.

Page 51:

Issues in Out of Order Execution & Completion

R := ...   (1)
... := R   (2)
R := ...   (3)

Note that the anti-dependence between (2) and (3) is handled correctly by stalling (3)’s issue if (1) has not completed.

Page 52:

Achieving High Performance in Multiple-Issue Machines

• Detection and resolution of storage conflicts
  – Extra “shadow” registers
  – Special bit for reservation

• Organization and control of the buses between the various units in the PU
  – Special controllers to detect write-backs and reads

Page 53:

How to Detect Data Dependencies

X1 = X2 + X3

Y1 = Y2 + Y3

How many dependencies are possible between these two instructions?

Five possible dependencies

A total of 5 * O(n^2) comparisons for n instructions
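
One way to arrive at these counts, assuming that every pair among the n instructions is compared and that each instruction has one destination and two sources:

```latex
\[
\underbrace{2}_{\text{RAW: } X_1 \text{ vs. } Y_2,\, Y_3}
+ \underbrace{2}_{\text{WAR: } Y_1 \text{ vs. } X_2,\, X_3}
+ \underbrace{1}_{\text{WAW: } X_1 \text{ vs. } Y_1}
= 5,
\qquad
5 \cdot \binom{n}{2} = \frac{5\,n(n-1)}{2} = O(n^2).
\]
```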

Page 54:

A Superscalar Architecture

Inst Fetch -> Inst Decode -> Issue Window (Wake Up, Select) -> Register File -> Exec -> Write Back

Issue Window (new!): holds the instructions that are ready and the ones that are waiting for dependencies

Page 55:

Data Dependencies & Superscalar

• Hardware mechanism (dynamic scheduling)
  - Scoreboarding
    - limited out-of-order issue/completion
    - centralized control
  - Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm)
    - Micro dataflow
  - Advantage: exact runtime information
    - Load/cache miss
    - resolve storage-location-related dependences

Page 56:

Scoreboarding

• Named after the CDC 6600

• Effective when there are enough resources and no data dependencies

• Out-of-order execution

• Issue: the scoreboard is checked; WAW will cause a stall

• Read operand: checks the availability of operands and resolves RAW dynamically at this step
  - WAR will not cause a stall

• EX

• Write result
  - WAR will be checked and will cause a stall

Page 57:

[Figure: The basic structure of a DLX processor with a scoreboard. The registers connect over data buses to the functional units (two FP multipliers, FP divide, FP add, integer unit); the scoreboard exchanges control/status signals with the registers and the functional units.]

Page 58:

Scoreboarding

[CDC6600, Thornton70], [WeissSmith84]

• A bit (called the “scoreboard bit”) is associated with each register; bit = 1 means the register is reserved by a pending write

• An instruction that has a source operand with bit = 1 will be issued, but put into an instruction window, with the register identifier denoting the “to-be-written” operand

• Copies of valid operands are also read with the pending instruction (this solves anti-dependence)

• When the missing operand is finally written, the register id in the pending instruction is compared and the value is written, so the instruction can be issued

• An instruction whose result register R is reserved will stall, so the output dependence (WAW) is handled correctly by the stall!

Page 59:

Example

(1) DIVF F0, F2, F4

(2) ADDF F10, F0, F8

(3) SUBF F8, F8, F14

(3) is allowed to be issued and executed while (2) is waiting for an operand.

(3) cannot write its result to F8 (it stalls!) if (2) has not yet read F8, since the registers are not renamed.
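
The bit-per-register bookkeeping can be sketched in a few lines. This is a simplified model, not the CDC 6600 logic: it tracks only the reservation bits, does not copy operand values or model the write-result stall described above, and the class name, method names, and the extra fourth instruction are hypothetical.

```python
class RegisterScoreboard:
    """One reservation bit per register, as in the single-bit scheme."""

    def __init__(self):
        self.reserved = set()  # registers with a pending write (bit = 1)

    def try_issue(self, dest, sources):
        """Return (issued, waiting_on) for an instruction writing `dest`."""
        if dest in self.reserved:
            return False, []  # WAW: stall until the earlier write lands
        waiting_on = [r for r in sources if r in self.reserved]
        self.reserved.add(dest)
        return True, waiting_on

    def write_result(self, dest):
        """Clear the bit once the pending write to `dest` completes."""
        self.reserved.discard(dest)


sb = RegisterScoreboard()
divf = sb.try_issue("F0", ["F2", "F4"])     # (1) DIVF F0, F2, F4
addf = sb.try_issue("F10", ["F0", "F8"])    # (2) ADDF F10, F0, F8
subf = sb.try_issue("F8", ["F8", "F14"])    # (3) SUBF F8, F8, F14
blocked = sb.try_issue("F0", ["F2", "F2"])  # hypothetical second write to F0
sb.write_result("F0")                       # (1) completes
retried = sb.try_issue("F0", ["F2", "F2"])
print(divf, addf, subf, blocked, retried)
```

In this model (2) issues but waits on F0, (3) issues with nothing pending, and only the hypothetical second write to F0 is held back until (1) has written its result.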

Page 60:

Motivation of Dynamic Scheduling

S1: X = ...

S2: ... = X

S3: X = ...

Q1

– How to issue S2 without waiting for S1 to complete? (scoreboard)

– How to issue S3 without waiting for S2 to complete?

– How to issue S3 without waiting for S1 to complete?

Page 61:

Features of Scoreboarding

• It permits out-of-order “issue” of instructions that are not related to each other.

• It permits out-of-order “completion” of instructions that are not related to each other.

• It prevents “execution” of an instruction if a flow dependence would be violated.

• It prevents “issue” of an instruction if an output dependence would be violated.

• It prevents “completion” of an instruction if an anti-dependence would be violated.

Page 62:

Scoreboarding

Advantage

• Single bit: simple

• Only one pending write per register, so there is no need to identify which is the latest

Page 63:

Micro Data Flow

• Fundamental Concepts
  – “Data Flow”
    • Instructions can only be fired when operands are available
  – Single assignment and register renaming

• Implementation
  – Tomasulo’s Algorithm
  – Reorder Buffer

Page 64:

Renaming/Single Assignment

Original code:

R0 = R2 / R4      (1)
R6 = R0 + R8      (2)
R1[0] = R6        (3)
R8 = R10 – R14    (4)
R6 = R10 * R8     (5)

[Figure: dependence graph over instructions (1) to (5)]

After renaming:

R0 = R2 / R4      (1)
S = R0 + R8       (2)
R1[0] = S         (3)
T = R10 – R14     (4)
R6 = R10 * T      (5)

[Figure: dependence graph over instructions (1) to (5) after renaming]

Page 65:

Principles of Register Renaming

• Additional registers reestablish a one-to-one correspondence between values and registers.

• Extra registers: scheduled by hardware and associated with values

• A new value: a new (hardware) register

• Anti and output dependencies are avoided

• Registers are reused according to program needs
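
The "new value, new register" rule can be sketched as a single pass over the program. This is a minimal illustration under stated assumptions: instructions are (destination, sources) pairs, a store has no destination, physical names P0, P1, ... are invented for the sketch, and registers are never freed or reused here.

```python
import itertools


def rename(instructions):
    """Give every written value a fresh physical register.

    instructions: list of (dest, sources) with architectural names;
    dest is None for an instruction that writes no register.
    Returns the same list expressed in physical names P0, P1, ...
    """
    fresh = (f"P{i}" for i in itertools.count())
    mapping = {}  # architectural register -> current physical register
    renamed = []
    for dest, sources in instructions:
        phys_sources = []
        for src in sources:
            if src not in mapping:  # first use of a live-in register
                mapping[src] = next(fresh)
            phys_sources.append(mapping[src])
        phys_dest = None
        if dest is not None:
            mapping[dest] = next(fresh)  # a new value gets a new register
            phys_dest = mapping[dest]
        renamed.append((phys_dest, phys_sources))
    return renamed


program = [
    ("R0", ["R2", "R4"]),    # (1) R0 = R2 / R4
    ("R6", ["R0", "R8"]),    # (2) R6 = R0 + R8
    (None, ["R1", "R6"]),    # (3) R1[0] = R6
    ("R8", ["R10", "R14"]),  # (4) R8 = R10 - R14
    ("R6", ["R10", "R8"]),   # (5) R6 = R10 * R8
]
renamed = rename(program)
for line in renamed:
    print(line)
```

After renaming, (4) no longer writes the register that (2) reads and (5) no longer writes the register that (2) wrote, while the flow dependences (1) to (2), (2) to (3), and (4) to (5) are kept.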


Baseline Superscalar Model

[Figure: baseline superscalar pipeline. Block labels: Inst Fetch, Inst Decode, Wake Up / Select, Register File, Exec, Data Cache, Bypass. Stage labels: Renaming, Issue Window, Execution Bypass, Data Cache Access, Register Write & Instruction Commit]


Micro Data Flow Conceptual Model

A → R1
R1 * B → R2
R2 / C → R1
R4 + R1 → R4

[Figure: dataflow graph of the sequence above. Nodes: Load, *, /, +; inputs A, B, C; operand registers OR1, OR3, OR4, OR5, OR6; registers R1 to R4]


Register Types

• Two kinds of registers
  – Forwarding Registers
    • Program / instruction visible
    • Compiler and programmer scheduled
  – Physical Operand Registers
    • Not visible
    • Scheduled and assigned by the hardware


Reorder Buffer & Instruction Commit

• Instruction Commit
  – When an instruction is allowed to update memory and/or registers
  – Concept used a lot in speculation

• Instruction Commit vs. Instruction Execution
  – When speculation is used, instruction commit may not happen immediately after instruction execution.
  – Reorder Buffer
    • A hardware buffer holding instructions that have completed but not yet committed
    • Execute out of order but commit in order
    • Extends the register set with extra registers


ROB Stages

• Issue
  – Dispatch an instruction from the instruction queue
  – Reserve a ROB entry and a reservation station
• Execute
  – Stall for operands
  – RAW resolved
• Write Result
  – Write back to any reservation stations waiting for it and to the ROB
• Commit
  – Normal Commit: update registers
  – Store Commit: update memory
  – False (mispredicted) Branch: flush the ROB and restart execution
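The "execute out of order, commit in order" behaviour of these stages can be sketched in Python. The class below is a hypothetical model, not a description of any real ROB: entries are kept in program order, results may arrive in any order, and commit only ever retires from the head.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: results arrive out of order, registers update in order."""

    def __init__(self):
        self.entries = deque()   # oldest instruction at the left
        self.registers = {}      # architectural (visible) register state

    def issue(self, dest):
        entry = {"dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def write_result(self, entry, value):
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        """Retire completed entries from the head only; return their dests."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["dest"])
        return retired

    def flush(self):
        """Mispredicted branch: discard everything not yet committed."""
        self.entries.clear()


rob = ReorderBuffer()
a, b, c = rob.issue("R1"), rob.issue("R2"), rob.issue("R3")
rob.write_result(c, 30)          # youngest finishes first
rob.write_result(b, 20)
first_commit = rob.commit()      # nothing retires: the head (R1) is not done
rob.write_result(a, 10)
second_commit = rob.commit()     # now all three retire, in program order
```

Flushing is cheap in this model because uncommitted results never reach the visible registers.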


ROB High Level Overview

[Figure: block diagram with Fetcher, Instruction Queue, Decoder, Reorder Buffer, Instruction Window, Register File, and three Functional Units]


ROB Organization

• Content Addressable
  – X = A + B: X is renamed to a ROB register (X') and all later references to X are replaced by it
  – If X' is needed as an operand, ROB[X] is looked up and
    • the value (if available) or a tag (if not) is returned if X' exists in the ROB, or …
    • a search of the "visible" register bank is performed if X' is not found in the ROB


ROB Organization

• If there is more than one ROB[X], then the most recent "entry" is fetched from the ROB

• When a result is produced:
  – All reservation stations that have a tag for that result are updated

• When an instruction commits:
  – Update register banks and memory
  – Flush the ROB in case of a mispredicted branch
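The operand lookup described above can be sketched in Python. The dictionary layout and the function name `lookup` are assumptions made for illustration; the entry names B5, B8, and B11 follow the example that comes next in the slides.

```python
def lookup(rob, register_file, reg):
    """Return ("value", v) or ("tag", name) for a source register.

    rob is a list of entries in program order (oldest first). The most
    recent entry that renames `reg` wins; if none exists, the visible
    register file is read instead.
    """
    for entry in reversed(rob):            # newest entry first
        if entry["reg"] == reg:
            if entry["full"]:
                return ("value", entry["value"])
            return ("tag", entry["name"])
    return ("value", register_file[reg])


register_file = {"R0": 1, "R1": 0, "R2": 0, "R4": 0}
rob = [
    {"name": "B5", "reg": "R1", "value": 6, "full": True},       # R1 = R0 + 5, done
    {"name": "B8", "reg": "R2", "value": None, "full": False},   # R2 = R1 + 6
    {"name": "B11", "reg": "R1", "value": None, "full": False},  # R1 = R1 + 3
]
r1_operand = lookup(rob, register_file, "R1")   # two entries for R1: newest wins
r0_operand = lookup(rob, register_file, "R0")   # not in the ROB: register file
```

Searching newest-first is what makes a later reader of R1 wait on B11 rather than pick up the older value 6 held in B5.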


An Example

R1 = R0 + 5 (1)
R2 = R1 + 6 (2)
R1 = R1 + 3 (3)
R4 = R1 + 9 (4)

Assume R0 = 1 initially

[Figure: Reg and R-buffer, with paths for inst results and inst operands]

(1) + (2) issued

R-buffer  R-name  Value  full
B5        R1             0
B8        R2             0

enabled  op  op1     op2  dest
*        +   R0 = 1  5    B5
         +   B5      6    B8

(3) issued, in flight

R-buffer  R-name  Value  full
B5        R1             0
B8        R2             0
B11       R1             0

op  op1  op2  dest
+   1    5    B5
+   B5   3    B11
+   B5   6    B8


An Example

R1 = R0 + 5 (1)
R2 = R1 + 6 (2)
R1 = R1 + 3 (3)
R4 = R1 + 9 (4)

(1) completed

R-buffer  R-name  Value  full
B5        R1      6      1
B8        R2             0
B11       R1             0

enabled  op  op1  op2  dest
*        +   6    6    B8
*        +   6    3    B11


An Example

R1 = R0 + 5 (1)
R2 = R1 + 6 (2)
R1 = R1 + 3 (3)
R4 = R1 + 9 (4)

(4) issued

R-buffer  R-name  Value  full
B5        R1      6      1
B8        R2             0
B11       R1             0
B13       R4             0

enabled  op  op1  op2  dest
*        +   6    6    B8
         +   B11  9    B13
*        +   6    3    B11

Note: op1 of (4) comes directly from B11 (not "6"), so flow dependence is handled!

Also note that these 2 instructions (e.g. (2) and (3)) can complete out of order, but "6" is not affected, so anti-dependence is resolved properly.


Questions

• Memory Renaming
  – Not as attractive as R-R Data Flow
  – Loads and stores are less frequent
  – Memory locations are less reused (in the register allocation sense)
  – Memory ops have only one memory operand

• Store Buffer
  – Give loads priority to access the data cache
  – In-order stores
  – Ensure that all preceding instructions have been performed before a store completes


Memory Dataflow

• More difficult
  – Memory addresses are longer
  – Memory address may not be available at decode stage

• Note:
  – In-order cache state: all stores must be performed in program order
    • All previous operations should have completed
  – No cache reorder / check mechanism


Load / Store Policy

• Loads may bypass stores if there are NO true dependencies among them
  • A check should be performed to ensure correctness
  • A load cannot bypass a store to the same address. If one is detected, the load is satisfied directly from the store buffer

• Loads are performed in program order at the data cache, with respect to other loads
  • Simplicity
  • Out of order doesn't help much anyway

• At the data cache interface
  • No anti-dependence: no store can bypass a load
  • No output dependence: no store can bypass another store
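The load side of this policy can be sketched in Python. Everything here is a simplified assumption: the store buffer is a list of hypothetical `(address, value)` pairs in program order, an address of `None` marks a store whose target is not yet resolved, and the conservative choice is made to hold the load whenever any earlier store is unresolved.

```python
def issue_load(address, store_buffer, cache):
    """Decide how a load is satisfied.

    store_buffer holds pending stores, oldest first, all older than the load.
    Returns one of ("stall", None), ("forward", value), ("cache", value).
    """
    # An unresolved store address might alias the load: withhold the load.
    if any(st_address is None for st_address, _ in store_buffer):
        return ("stall", None)
    # True dependence: take the value from the youngest matching store.
    for st_address, value in reversed(store_buffer):
        if st_address == address:
            return ("forward", value)
    # No dependence: the load may bypass the pending stores.
    return ("cache", cache[address])


cache = {0x100: 7, 0x200: 9}
pending = [(0x100, 41), (0x300, 5), (0x100, 42)]
bypassing = issue_load(0x200, pending, cache)             # no matching store
forwarded = issue_load(0x100, pending, cache)             # youngest match wins
held = issue_load(0x200, pending + [(None, 1)], cache)    # unknown store target
```

Forwarding from the youngest matching store keeps the load consistent with program order even though the stores have not reached the cache yet.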


Memory Dependencies & Event Ordering

• If a store's target address cannot be resolved
  – All subsequent loads are withheld until the address can be resolved

• If two loads are in the instruction window
  – Do they need to wait for each other to be resolved?


Out-Of-Order Architectures

[Figure: out-of-order pipeline with Fetch, Decode / Rename, INT, FP, and L/S units, ROB, and Mem]

i0: R2 * R3
i1: load@[R1 + R4]
…
i2: load@[R5]

Independent loads can execute in parallel

If i1 and i2 are independent, then they can be executed at the same time

Loads do NOT need to wait for each other, even when addressed to the same memory location


Summary

• Reorder Buffer
  – The most powerful of the complex dynamic scheduling techniques

• Simplest: Scoreboarding

• Hardware implementation is complex
  – Worth its returns?


Tomasulo’s Algorithm

• Tomasulo, R.M. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units", IBM J. of R&D 11:1 (Jan. 1967), pp. 25-33

• IBM 360/91 (three years after the CDC 6600 and just before caches)

• Features:
  • CDB: Common Data Bus
  • Reservation Stations: hardware features which allow the fetch, use and reuse of data as soon as it becomes available. They allow register renaming and are decentralized in nature (as opposed to Scoreboarding)


Tomasulo’s Algorithm

• Control and buffers distributed with Functional Units.
• HW renaming of registers
• CDB broadcasting
• Load / Store buffers are treated as Functional Units
• Reservation Stations:
  – Hazard detection and instruction control
  – 4-bit tag field to specify which station or buffer will produce the result
• Register Renaming
  – Tag assigned on issue (IS)
  – Tag discarded after write back


Comparison

• Scoreboarding
  – Centralized data structure and control
  – Register bit
    • Simple, low cost
  – Structural hazards solved by FU
  – Solve RAW by register bit
  – Solve WAR in write
  – Solve WAW: stalls on issue

• Tomasulo's Algorithm
  – Distributed control
  – Tagged registers + register renaming
  – Structural hazard: stalls on Reservation Station
  – Solve RAW by CDB
  – Solve WAR by copying operand to Reservation Station
  – Solve WAW by renaming
  – Limited: CDB
    • Broadcast
    • 1 per cycle


Reservation Station Fields

• Op: Operation to perform in the unit
• Vj, Vk: Values of the source operands
  – Store buffers have a V field: the result to be stored
• Qj, Qk: Reservation stations producing the source registers
  – Zero means ready
• Busy
• A: Memory address calculation
• Register file:
  – Qi: The number (tag) of the reservation station that will write to this register


The Architecture

[Figure: Tomasulo architecture block diagram. Labels: from memory, load buffers (1 to 6), from instruction unit, floating-point operations, FP registers, operation bus, operand bus, reservation stations (1 to 3 and 1 to 2), FP adders, FP multipliers, store buffers (1 to 3), to memory, common data bus (CDB)]

- 3 Adders
- 2 Multipliers
- Load buffers (6)
- Store buffers (3)
- FP Queue
- FP registers
- CDB: Common Data Bus


Tomasulo’s Algorithm’s Steps

• Issue
  – Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag
  – If no empty reservation station is found, stall and wait for one to become free
  – Renaming is performed here, and WAW and WAR are resolved

• Execute
  – If operands are not ready, monitor the CDB for them
  – RAWs are resolved
  – When they are ready, execute the op in the FU

• Write Back
  – Send the results to the CDB and update registers and the store buffers
  – Store buffers will write to memory during this step

• Exception Behavior
  – During Execute: no instruction is allowed to begin execution until all branches before it have completed


Tomasulo’s Algorithm

• Note that:
  • Upon entering a reservation station, source operands are either filled with values or renamed
  • The new names are in 1-to-1 correspondence with FU names

• Question:
  • How are output dependencies resolved?
    • Two pending writes to a register
  • How to determine that a read will get the most recent value if they complete out of order


Features of T. Alg.

• The value of an operand (for any inst already issued to a reservation station) will be read from the CDB. It will not be read from the register field.

• Instructions can be issued even before their operands have been produced (they know the operands are coming from the CDB)


An Example

LD   F6, 34(R2)     (1)
LD   F2, 45(R3)     (2)
MULD F0, F2, F4     (3)
SUBD F8, F2, F6     (4)
DIVD F10, F0, F6    (5)
ADDD F6, F8, F2     (6)

[Figure: dependence graph over instructions 1 to 6]


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj  Vk  Qj  Qk  Busy  Addr
L1       L           0   0   Yes   34+[R2]

FP register status (F0 to F11): F6 = L1

Instruction queue (top to bottom):
ADDD F6,F8,F2
DIVD F10,F0,F6
SUBD F8,F2,F6
MULD F0,F2,F4
LD F2,45(R3)


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj  Vk  Qj  Qk  Busy  Addr
L1       L           0   0   Yes   34+[R2]
L2       L           0   0   Yes   45+[R3]

FP register status (F0 to F11): F2 = L2, F6 = L1

Instruction queue (top to bottom):
ADDD F6,F8,F2
DIVD F10,F0,F6
SUBD F8,F2,F6
MULD F0,F2,F4


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj       Vk        Qj  Qk  Busy  Addr
L1       L                      0   0   Yes   34+[R2]
L2       L                      0   0   Yes   45+[R3]
M1       *   45+[R3]  [F4] = 4  L2  0   Yes

FP register status (F0 to F11): F0 = M1, F2 = L2, F6 = L1

Instruction queue (top to bottom):
ADDD F6,F8,F2
DIVD F10,F0,F6
SUBD F8,F2,F6


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj       Vk        Qj  Qk  Busy  Addr
L1       L                      0   0   Yes   34+[R2]
L2       L                      0   0   Yes   45+[R3]
M1       *   45+[R3]  [F4] = 4  L2  0   Yes
A1       -   45+[R3]  34+[R2]   L2  L1  Yes

FP register status (F0 to F11): F0 = M1, F2 = L2, F6 = L1, F8 = A1

Instruction queue (top to bottom):
ADDD F6,F8,F2
DIVD F10,F0,F6


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj       Vk        Qj  Qk  Busy  Addr
L1       L                      0   0   Yes   34+[R2]
L2       L                      0   0   Yes   45+[R3]
M1       *   45+[R3]  [F4] = 4  L2  0   Yes
M2       /            34+[R2]   M1  L1  Yes
A1       -   45+[R3]  34+[R2]   L2  L1  Yes

FP register status (F0 to F11): F0 = M1, F2 = L2, F6 = L1, F8 = A1, F10 = M2

Instruction queue (top to bottom):
ADDD F6,F8,F2


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj       Vk        Qj  Qk  Busy  Addr
L1       L                      0   0   Yes   34+[R2]
L2       L                      0   0   Yes   45+[R3]
M1       *   45+[R3]  [F4] = 4  L2  0   Yes
M2       /            34+[R2]   M1  L1  Yes
A1       -   45+[R3]  34+[R2]   L2  L1  Yes
A2       +            45+[R3]   A1  L2  Yes

FP register status (F0 to F11): F0 = M1, F2 = L2, F6 = A2, F8 = A1, F10 = M2

Instruction queue: empty


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj       Vk        Qj  Qk  Busy  Addr
L2       L                      0   0   Yes   45+[R3]
M1       *   45+[R3]  [F4] = 4  L2  0   Yes
M2       /            40        M1  0   Yes
A1       -   45+[R3]  40        L2  0   Yes
A2       +            45+[R3]   A1  L2  Yes

FP register status (F0 to F11): F0 = M1, F2 = L2, F6 = A2, F8 = A1, F10 = M2

Some time later L1 returns 40 and commits


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj  Vk        Qj  Qk  Busy  Addr
M1       *   32  [F4] = 4  0   0   Yes
M2       /       40        M1  0   Yes
A1       -   32  40        0   0   Yes
A2       +       32        A1  0   Yes

FP register status (F0 to F11): F0 = M1, F2 = 32, F6 = A2, F8 = A1, F10 = M2

Some time later L2 returns 32 and commits


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult; unlisted stations are free)

Station  OP  Vj   Vk  Qj  Qk  Busy  Addr
M2       /   128  40  0   0   Yes
A2       +   -8   32  0   0   Yes

FP register status (F0 to F11): F0 = 128, F2 = 32, F6 = A2, F8 = -8, F10 = M2

A1 and M1 complete and commit


An Example

Reservation stations (L1 to L3 = Memory Unit, A1 to A3 = FP Adders, M1 and M2 = FP Mult): all free

FP register status (F0 to F11): F0 = 128, F2 = 32, F6 = 24, F8 = -8, F10 = 3.2

A2 and M2 complete and commit
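The walk-through above can be replayed with a small Python model. It is a sketch under stated assumptions rather than the 360/91 design: the class `Tomasulo` and its methods are hypothetical, `None` plays the role of the slides' Q = 0 ("ready"), loads receive their data through an explicit `broadcast` call, and timing, the single-CDB limit, and station capacity are ignored.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


class Tomasulo:
    def __init__(self, registers):
        self.regs = dict(registers)  # register values
        self.qi = {}                 # register -> tag of its pending producer
        self.stations = {}           # tag -> reservation station fields

    def issue(self, tag, op, dest, src_j=None, src_k=None):
        st = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
        for v, q, src in (("Vj", "Qj", src_j), ("Vk", "Qk", src_k)):
            if src is None:
                continue
            if src in self.qi:
                st[q] = self.qi[src]      # value pending: remember the tag
            else:
                st[v] = self.regs[src]    # value ready: copy it now
        self.stations[tag] = st
        self.qi[dest] = tag               # rename: dest now waits on this tag

    def broadcast(self, tag, value):
        """Write back on the CDB: wake waiting stations, update registers."""
        del self.stations[tag]
        for st in self.stations.values():
            for v, q in (("Vj", "Qj"), ("Vk", "Qk")):
                if st[q] == tag:
                    st[v], st[q] = value, None
        for reg, pending in list(self.qi.items()):
            if pending == tag:            # tag match: this is the latest writer
                self.regs[reg] = value
                del self.qi[reg]
        # A register whose tag differs ignores the result.

    def execute(self, tag):
        st = self.stations[tag]
        assert st["Qj"] is None and st["Qk"] is None, "operands not ready"
        self.broadcast(tag, OPS[st["op"]](st["Vj"], st["Vk"]))


m = Tomasulo({"F4": 4})
m.issue("L1", "L", "F6")                  # LD   F6, 34(R2)
m.issue("L2", "L", "F2")                  # LD   F2, 45(R3)
m.issue("M1", "*", "F0", "F2", "F4")      # MULD F0, F2, F4
m.issue("A1", "-", "F8", "F2", "F6")      # SUBD F8, F2, F6
m.issue("M2", "/", "F10", "F0", "F6")     # DIVD F10, F0, F6
m.issue("A2", "+", "F6", "F8", "F2")      # ADDD F6, F8, F2
m.broadcast("L1", 40)                     # L1 returns 40; F6 is tagged A2
f6_after_l1 = m.regs.get("F6")            # so the register is left untouched
m.broadcast("L2", 32)                     # L2 returns 32
for tag in ("A1", "M1", "A2", "M2"):
    m.execute(tag)
final = m.regs
```

The load result 40 reaches the waiting stations A1 and M2 but never lands in F6, because by then F6 carries the tag A2; that tag comparison is how the output dependence between (1) and (6) is resolved in this model.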


ROB and Tomasulo’s Alg.

• Many elements of Tomasulo's algorithm are already included
• Major difference?

How is WAW handled?

In Tomasulo: by keeping a "tag" with each register X; the tag is updated each time an instruction that writes X (e.g. a "+") is issued, i.e. X-tag = "+3" means the 3rd + unit is reserved.

On write back via the CDB, the tag of the FU is compared with the tag of the register:

if tag of FU = X-tag, overwrite the register (e.g. X)
else ignore the result


Tomasulo’s Algorithm

• Advantages
  • Distribution of hazard detection logic
  • Register renaming and reservation stations take care of all data hazards.

• Disadvantages
  - Hardware cost: high-speed associative memory for tags + complex control logic
  - A single CDB may be a bottleneck, while multiple CDBs may be too costly (all associative memory must be duplicated)


Conclusions

- Good for pipelined architectures for which code is difficult to schedule and which are short of "visible" registers

- Future:
  - Hybrid between software and hardware techniques
  - Static scheduling of register-register operations
  - Dynamic scheduling of loads and stores


Example: Dynamic Scheduling in Pentium 4

• Fetch up to 3 IA-32 instructions per cycle

• Decode them into micro-ops and send them to the out-of-order execution engine.

• Commit up to 3 micro ops per cycle

• Pipeline takes 20 cycles

• Register renaming files
  – Potentially 128 outstanding results

• Seven Integer execution units


An Example of an OoO Engine

Intel Xeon Out of Order Engine Pipeline Picture Courtesy of Intel from “Hyper-Threading Technology Architecture and Microarchitecture”


VLIW vs Superscalar

• Superscalar
  • Advantages
    – Better code density
    – Code compatible
  • Difference
    – Dynamic scheduling
  • Disadvantages
    – More IF and ID
    – More delay slots are needed
    – Different FU

• VLIW
  • Advantages
    – Fixed instruction format
    – Explicit parallelism exposed
      • Trace scheduling
  • Difference
    – Static scheduling
  • Disadvantages
    – Static scheduling
    – No dynamic decision
    – Code explosion
    – Caches are difficult to use


Bibliography

• Texas Instruments, "TMS320C6000 Technical Brief." February 1999. www.ti.com

• Intel Pentium 4 Northwood. www.chip-architect.com. April 2003.

• "Hyper-Threading Technology Architecture and Microarchitecture." Intel Technology Journal, Volume 6, Issue 1, February 2002, pp. 4-15