
9/25/2006 eleg652-F06 1

Topic 3

Exploitation of Instruction Level Parallelism

The secret to creativity is knowing how to hide your sources. Albert Einstein

9/25/2006 eleg652-F06 2

Reading List

• Slides: Topic3x

• Henn&Patt: Chapters 3 & 4

• Other assigned readings from homework and classes

9/25/2006 eleg652-F06 3

Instruction Level Parallelism

• Parallelism that is found between instructions (or intra instruction)

• Dynamic and Static Exploitation
  – Dynamic: Hardware related.
  – Static: Software related (compiler and system software)

• VLIW and Superscalar

• Micro-Dataflow and Tomasulo’s Algorithm

9/25/2006 eleg652-F06 4

RISC Concepts: Revisit

• Reduced Instruction Set Computer
  – “Internal Computing Architecture in which processor instructions are pared down so that most of them can be executed in one clock cycle, theoretically improving computing efficiency” (Black Box Pocket Glossary of Computer Terms)

• Characteristics:
  – Uniform instruction encoding
  – Homogeneous register banks
  – Simplified addressing modes
  – Simplified data structures
  – Branch delay slot
  – Cache
  – Pipeline

9/25/2006 eleg652-F06 5

RISC Concepts: Revisited

• What prevents one instruction per cycle (CPI = 1)?
  – Hazards
  – Dependencies
  – Long-latency ops

• Cache Thrashing

9/25/2006 eleg652-F06 6

Pipeline: A Review

• Hazards
  – Any situation that prevents the smooth flow of instructions along the pipeline
  – Types
    • Structural: due to limited resources and contention among them
    • Control: instructions that change the PC (program counter)
    • Data: an instruction depends on a value produced by a previous instruction
  – Stall
    • Hazards will “stall” the pipeline
    • Serious: a stall can hold up many instructions for many cycles

9/25/2006 eleg652-F06 7

RISC Pipeline & Instruction Issue

• Instruction Issue
  – The process of letting an instruction move from ID to EXEC
  – Issue vs. execution

• In DLX
  – ID checks all data hazards and stalls if any exists

Typical RISC Pipeline:

Instruction Fetch → Instruction Decode → Execute → Memory Op → Register Update

9/25/2006 eleg652-F06 8

Hazards

• Structural Hazards
  – Non-pipelined function units
  – One-port register bank and one-port memory bank

• Data Hazards
  – For some: Forwarding
  – For others: Pipeline Interlock

LD R1, A
+  R4, R1, R7

Need Bubble / Stall

9/25/2006 eleg652-F06 9

Instruction Clock cycle number

1 2 3 4 5 6 7 8 9

Load instruction IF ID EX MEM WB

Instruction i+1 IF ID EX MEM WB



Instruction i+4 IF ID EX MEM

Structural Hazard

A single memory bank for insts and data

9/25/2006 eleg652-F06 10

Data Hazards

Instruction | 1  | 2  | 3  | 4   | 5   | 6
ADD         | IF | ID | EX | MEM | WB  |
SUB         |    | IF | ID | EX  | MEM | WB

Data is read in the ID stage; data is written in the WB stage.

The ADD instruction writes a register that is a source operand for the SUB instruction. But the ADD doesn’t finish writing the data into the register file until three clock cycles after SUB begins reading it!

The SUB instruction may read the incorrect value. Result may be non-deterministic. Solved by forwarding

9/25/2006 eleg652-F06 11

Data Dependency: A Review

Flow Dependency (RAW conflicts):
B + C → A
A + D → E

Anti Dependency (WAR conflicts):
A + C → B
E + D → A

Output Dependency (WAW conflicts):
B + C → A
E + D → A

RAR are not really a problem
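The three cases above can be checked mechanically by comparing the destination and source names of an earlier and a later instruction. The following is a minimal sketch; the `Instr` tuple and the `classify` helper are illustrative names that do not come from the slides.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str         # name written by the instruction
    srcs: frozenset   # names read by the instruction


def classify(first: Instr, second: Instr) -> set:
    """Return the dependences of `second` on the earlier instruction `first`."""
    deps = set()
    if first.dest in second.srcs:
        deps.add("RAW")   # flow dependence: second reads what first writes
    if second.dest in first.srcs:
        deps.add("WAR")   # anti dependence: second overwrites what first reads
    if first.dest == second.dest:
        deps.add("WAW")   # output dependence: both write the same name
    return deps


# The three instruction pairs from the slide.
flow = classify(Instr("A", frozenset("BC")), Instr("E", frozenset("AD")))
anti = classify(Instr("B", frozenset("AC")), Instr("A", frozenset("ED")))
output = classify(Instr("A", frozenset("BC")), Instr("A", frozenset("ED")))
print(flow, anti, output)
```

Two instructions that only share source names (RAR) produce an empty set, which matches the remark that RAR is not a problem.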

9/25/2006 eleg652-F06 12

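Forwarding Example

Instruction    | 1  | 2  | 3  | 4   | 5   | 6   | 7   | 8   | 9
ADD R1,R2,R3   | IF | ID | EX | MEM | WB  |     |     |     |
SUB R4,R1,R5   |    | IF | ID | EX  | MEM | WB  |     |     |
AND R6,R1,R7   |    |    | IF | ID  | EX  | MEM | WB  |     |
OR  R8,R1,R9   |    |    |    | IF  | ID  | EX  | MEM | WB  |
XOR R10,R1,R11 |    |    |    |     | IF  | ID  | EX  | MEM | WB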

9/25/2006 eleg652-F06 13

Bypassing Pitfalls

The code and the pipeline:

Instruction     | 1  | 2  | 3  | 4     | 5   | 6   | 7   | 8   | 9
LW  R1, 32(R6)  | IF | ID | EX | MEM   | WB  |     |     |     |
ADD R4, R1, R7  |    | IF | ID | STALL | EX  | MEM | WB  |     |
SUB R5, R1, R8  |    |    | IF | STALL | ID  | EX  | MEM | WB  |
AND R6, R1, R7  |    |    |    | STALL | IF  | ID  | EX  | MEM | WB

The load delay slot cannot be eliminated by forwarding alone.

Pipeline Interlock: Stall / Bubble for hazards that cannot be solved by forwarding

9/25/2006 eleg652-F06 14

Pipelining

• Issue: passing the Instruction Decode stage

• DLX: Only issue an instruction if there is no hazard

• Detecting interlocks early in the pipeline has the advantage that the hardware never needs to suspend an instruction and undo state changes.

9/25/2006 eleg652-F06 15

Instruction Level Parallelism

• Static Scheduling– Simple Scheduling– Loop Unrolling– Loop Unrolling + Scheduling– Software Pipelining

• Dynamic Scheduling– Out of order execution– Data Flow computers

• Speculation

9/25/2006 eleg652-F06 16

Constraint Graph

• Directed-edges: data-dependence

• Undirected-edges: Resources constraint

• An edge (u,v) (directed or undirected) of length e represents an interlock between nodes u and v: they must be separated by e time units.

[Figure: constraint graph over nodes S1 to S6; the edges are labeled with operation latencies]

9/25/2006 eleg652-F06 17

Code Scheduling For Single Pipeline

• Input: A constraint graph G = (V, E)

• Output: A sequence of the operations in G (v1, v2, v3, v4, v5, …, vn) plus a number of no-ops, such that:
  – If the no-ops are deleted, the sequence is a topological sort of G.
  – Any two nodes (x, y) in the sequence are separated by a distance greater than or equal to d(x,y) in graph G
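As an illustration of this input/output contract, here is a minimal greedy sketch rather than an optimal scheduler. It assumes that d(u, v) is the minimum number of slots between u and v in the output sequence; the node names `v1` to `v5` and the helper names are made up for the example.

```python
def schedule(ops, edges):
    """Greedy single-pipeline scheduler.

    ops:   operations in priority order.
    edges: dict mapping (u, v) to the minimum separation d(u, v) in slots.
    Returns a list of operations and "nop" entries.
    """
    preds = {v: [] for v in ops}
    for (u, v), d in edges.items():
        preds[v].append((u, d))
    slot = {}        # operation -> slot in which it was placed
    sequence = []
    while len(slot) < len(ops):
        t = len(sequence)
        for v in ops:
            if v in slot:
                continue
            # v is ready when every predecessor is placed far enough back.
            if all(u in slot and t - slot[u] >= d for u, d in preds[v]):
                slot[v] = t
                sequence.append(v)
                break
        else:
            sequence.append("nop")   # nothing is ready: fill the slot
    return sequence


def is_valid(sequence, edges):
    """Check the distance condition for every edge of the constraint graph."""
    pos = {v: i for i, v in enumerate(sequence) if v != "nop"}
    return all(pos[v] - pos[u] >= d for (u, v), d in edges.items())


edges = {("v1", "v2"): 2, ("v2", "v3"): 3, ("v4", "v5"): 1}
result = schedule(["v1", "v2", "v3", "v4", "v5"], edges)
print(result, is_valid(result, edges))
```

The scheduler fills the gap after `v1` with the independent operation `v4` instead of a no-op, which is the same idea the loop examples later in the slides apply by hand.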

9/25/2006 eleg652-F06 18

Advanced Pipelining

• Instruction Reordering and scheduling within loop body

• Loop Unrolling
  – Code size suffers

• Superscalar
  – Compact code
  – Multiple issue of different instruction types

• VLIW

9/25/2006 eleg652-F06 19

VLIW

• Very Long Instruction Word

• Compiler has all responsibility to schedule instructions

• Make hardware simpler
  – Move complexity to software

• Concept developed by Josh Fisher at Yale University in the early 1980s

9/25/2006 eleg652-F06 20

An Example: X[i] + a

Loop: LD   F0, 0(R1)     ; load the vector element
      ADDD F4, F0, F2    ; add the scalar in F2
      SD   0(R1), F4     ; store the vector element
      SUB  R1, R1, #8    ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop      ; branch when it's not zero

Instruction producer | Instruction consumer | Latency
FP ALU op            | FP ALU op            | 3
FP ALU op            | Store Double         | 2
Load Double          | FP ALU op            | 1
Load Double          | Store Double         | 0

A load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.

9/25/2006 eleg652-F06 21

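An Example: X[i] + a

Instruction              | Clock cycle
Loop: LD   F0, 0(R1)     | 1
      STALL              | 2
      ADDD F4, F0, F2    | 3
      STALL              | 4
      STALL              | 5
      SD   0(R1), F4     | 6
      SUB  R1, R1, #8    | 7
      BNEZ R1, Loop      | 8
      STALL              | 9

Stall annotations on the slide, top to bottom: Load Latency, FP ALU Latency, Load Latency

This requires 9 cycles per iteration

Constraint Graph: LD, ADDD, SD, SUB, BNEZ in a chain, with edge latencies 1, 2, 0, 0, 1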

9/25/2006 eleg652-F06 22

An Example: X[i] + a (Scheduling)

Instruction              | Clock cycle
Loop: LD   F0, 0(R1)     | 1
      STALL              | 2
      ADDD F4, F0, F2    | 3
      SUB  R1, R1, #8    | 4
      BNEZ R1, Loop      | 5
      SD   8(R1), F4     | 6

This requires 6 cycles per iteration

Constraint Graph: LD, ADDD, SD, SUB, BNEZ in a chain, with edge latencies 1, 2, 0, 0, 1

9/25/2006 eleg652-F06 23

An Example: X[i] + a (Unrolling)

Instruction               | Clock cycle
Loop: LD   F0, 0(R1)      | 1
      NOP                 | 2
      ADDD F4, F0, F2     | 3
      NOP                 | 4
      NOP                 | 5
      SD   0(R1), F4      | 6
      LD   F6, -8(R1)     | 7
      NOP                 | 8
      ADDD F8, F6, F2     | 9
      NOP                 | 10
      NOP                 | 11
      SD   -8(R1), F8     | 12
      LD   F10, -16(R1)   | 13
      NOP                 | 14
      ADDD F12, F10, F2   | 15
      NOP                 | 16
      NOP                 | 17
      SD   -16(R1), F12   | 18
      LD   F14, -24(R1)   | 19
      NOP                 | 20
      ADDD F16, F14, F2   | 21
      NOP                 | 22
      NOP                 | 23
      SD   -24(R1), F16   | 24
      SUB  R1, R1, #32    | 25
      BNEZ R1, LOOP       | 26
      NOP                 | 27

This requires 6.8 cycles per iteration (27 cycles for 4 iterations)

9/25/2006 eleg652-F06 24

An Example

X[i] + a

Unrolling + Scheduling

Instruction               | Clock cycle
Loop: LD   F0, 0(R1)      | 1
      LD   F6, -8(R1)     | 2
      LD   F10, -16(R1)   | 3
      LD   F14, -24(R1)   | 4
      ADDD F4, F0, F2     | 5
      ADDD F8, F6, F2     | 6
      ADDD F12, F10, F2   | 7
      ADDD F16, F14, F2   | 8
      SD   0(R1), F4      | 9
      SD   -8(R1), F8     | 10
      SD   -16(R1), F12   | 11
      SUB  R1, R1, #32    | 12
      BNEZ R1, LOOP       | 13
      SD   8(R1), F16     | 14

This requires 3.5 cycles per iteration (14 cycles for 4 iterations)

9/25/2006 eleg652-F06 25

Topic 3a

Multi Issue Architectures

Beyond Simple RISC

9/25/2006 eleg652-F06 26

ILP

• ILP of a program
  – Average number of instructions that a superscalar processor might be able to execute at the same time
    • Data dependencies
    • Latencies and other processor difficulties

• ILP of a machine
  – The ability of a processor to take advantage of the ILP
    • Number of instructions that can be fetched and executed at the same time by such a processor

9/25/2006 eleg652-F06 27

Multi Issue Architectures

• Superscalar
  – Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and runtime scheduler

• Very Long Instruction Word
  – A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue

Patterson & Hennessy, pp. 317 and 318

9/25/2006 eleg652-F06 28

Multiple Instruction Issue

• Multiple Issue + Static Scheduling → VLIW

• Dynamic Scheduling
  – Tomasulo
  – Scoreboarding

• Multiple Issue + Dynamic Scheduling → Superscalar

• Decoupled Architectures
  – Static scheduling of R-R instructions
  – Dynamic scheduling of memory ops
    • Buffers

9/25/2006 eleg652-F06 29

Five Primary Approaches

Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristics | Examples
Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSPARC II and III
Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2
Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Speculative out-of-order execution | Pentium 3 and 4
VLIW / LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860
EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium

9/25/2006 eleg652-F06 30

Two-Issue Architecture: Unrolled and Scheduled Code

Integer instruction       | FP instruction      | Clock cycle
Loop: LD   F0, 0(R1)      |                     | 1
      LD   F6, -8(R1)     |                     | 2
      LD   F10, -16(R1)   | ADDD F4, F0, F2     | 3
      LD   F14, -24(R1)   | ADDD F8, F6, F2     | 4
      LD   F18, -32(R1)   | ADDD F12, F10, F2   | 5
      SD   0(R1), F4      | ADDD F16, F14, F2   | 6
      SD   -8(R1), F8     | ADDD F20, F18, F2   | 7
      SD   -16(R1), F12   |                     | 8
      SD   -24(R1), F16   |                     | 9
      SUB  R1, R1, #40    |                     | 10
      BNEZ R1, LOOP       |                     | 11
      SD   8(R1), F20     |                     | 12

The unrolled and scheduled code takes 2.4 cycles per iteration (5 iterations in 12 cycles)

9/25/2006 eleg652-F06 31

A VLIW Code Sequence (unrolling 6 times)

Memory reference 1 | Memory reference 2 | FP operation 1    | FP operation 2    | Integer operation / branch
LD F0, 0(R1)       | LD F6, -8(R1)      |                   |                   |
LD F10, -16(R1)    | LD F14, -24(R1)    |                   |                   |
LD F18, -32(R1)    | LD F22, -40(R1)    | ADDD F4, F0, F2   | ADDD F8, F6, F2   |
LD F26, -48(R1)    |                    | ADDD F12, F10, F2 | ADDD F16, F14, F2 |
                   |                    | ADDD F20, F18, F2 | ADDD F24, F22, F2 |
SD 0(R1), F4       | SD -8(R1), F8      | ADDD F28, F26, F2 |                   |
SD -16(R1), F12    | SD -24(R1), F16    |                   |                   |
SD -32(R1), F20    | SD -40(R1), F24    |                   |                   | SUB R1, R1, #48
SD -0(R1), F28     |                    |                   |                   | BNEZ R1, LOOP

[Figure: seven independent chains, one per iteration: LD into F0, F6, F10, F14, F18, F22, F26; each value plus a into F4, F8, F12, F16, F20, F24, F28; then SD]

7 iterations in 9 cycles: 1.28 cycles per iteration

9/25/2006 eleg652-F06 32

Trace Scheduling

• First used for VLIW architectures

• Trace
  – A straight-line sequence of instructions executed for some input data, or a sequence of ops which constitutes a possible path based on “predicted” branches.

• Trace Scheduling
  – Identify a “most likely” sequence of instructions and then “compact” the instructions in that path

• Tools
  – For loops: Unrolling
  – For branches: Static branch prediction

9/25/2006 eleg652-F06 33

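An Example: Traces

A; B; C;
if (D) { E; F; }
else { G; }
H; I;

Basic Block: an instruction sequence which has only one entry point and one exit point (no branch targets and no branches in the middle)

[Figure: control flow graph with the blocks "A B C", "br D", "E F", "G", and "H I"; Trace 1 and Trace 2 are marked as paths through the graph]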

9/25/2006 eleg652-F06 34

Code Motion & Compensation Code

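[Figure: three versions of the control flow graph]

Original Code: blocks "A B C", "br D", "E F", "G", "H I"

Code Moved to the Succeeding Block: blocks "A B", "br D", "C E", "C G", "F H I"

Code Moved to the Preceding Block: blocks "A B C E", "br D", "F H", "Undo E, G, H", "I"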

9/25/2006 eleg652-F06 35

Trace Scheduling

• Similar to basic block scheduling
  – The scheduling unit is the trace, not the basic block

• Reduce the execution time of likely traces
  – Using profiling

9/25/2006 eleg652-F06 36

Software Pipeline

• Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations

• Uses less code space
  – Compared to unrolling

• Some architectures provide specific support for it
  – Rotating register banks
  – Predicated instructions

9/25/2006 eleg652-F06 37

Software Pipelining

• Overlap instructions without unrolling the loop
• Given the vector M in memory, and ignoring the start-up and finishing code, we have:

Loop: SD   0(R1), F4     ; stores into M[i]
      ADDD F4, F0, F2    ; adds to M[i + 1]
      LD   F0, -16(R1)   ; loads M[i + 2]
      BNEZ R1, LOOP
      SUB  R1, R1, #8    ; subtract in delay slot

This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
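To make the overlap concrete, the sketch below mirrors the software-pipelined loop in Python, including the start-up (prologue) and clean-up (epilogue) code that the slide leaves out. The function names are illustrative; `f0` and `f4` play the roles of registers F0 and F4, and the list index stands in for the pointer in R1.

```python
def add_scalar_plain(m, a):
    """Reference version: one load, add, and store per iteration."""
    return [x + a for x in m]


def add_scalar_software_pipelined(m, a):
    """Each kernel iteration stores element i, adds for i+1, and loads i+2."""
    n = len(m)
    if n < 3:
        return add_scalar_plain(m, a)   # too short to fill the pipeline
    out = list(m)

    # Prologue: start iterations 0 and 1.
    f0 = m[0]         # LD   for iteration 0
    f4 = f0 + a       # ADDD for iteration 0
    f0 = m[1]         # LD   for iteration 1

    # Kernel: three different iterations are in flight in each pass.
    for i in range(n - 2):
        out[i] = f4       # SD   stores the result of iteration i
        f4 = f0 + a       # ADDD for iteration i + 1
        f0 = m[i + 2]     # LD   for iteration i + 2

    # Epilogue: drain the two iterations still in flight.
    out[n - 2] = f4
    f4 = f0 + a
    out[n - 1] = f4
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(add_scalar_software_pipelined(data, 10.0))
```

The kernel body has the same three operations as the original loop, so code size stays close to the non-unrolled version; only the prologue and epilogue are extra.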

9/25/2006 eleg652-F06 38

Software Pipelining

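Time | Iter 1 | Iter 2 | Iter 3 | Iter 4 | Iter 5 | Iter 6 | Iter 7
1    | LD     |        |        |        |        |        |
2    |        | LD     |        |        |        |        |
3    | ADDD   |        | LD     |        |        |        |
4    |        | ADDD   |        | LD     |        |        |
5    |        |        | ADDD   |        | LD     |        |
6    | SD     |        |        | ADDD   |        | LD     |
7    | BNEZ   | SD     |        |        | ADDD   |        | LD
8    |        | BNEZ   | SD     |        |        | ADDD   |
9    |        |        | BNEZ   | SD     |        |        | ADDD
10   |        |        |        | BNEZ   | SD     |        |
11   |        |        |        |        | BNEZ   | SD     |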

9/25/2006 eleg652-F06 39

Software Pipeline

Overhead for software pipelining: paid twice, once for the prologue and once for the epilogue

Overhead for an unrolled loop: paid M / N times, for M loop executions and N-fold unrolling

[Figure: number of overlapped instructions versus time, for software-pipelined code (with prologue and epilogue marked) and for an unrolled loop]

9/25/2006 eleg652-F06 40

Loop Unrolling V.S. Software Pipelining

• When not running at maximum rate
  – Unrolling: pays the overhead m/n times, for m iterations and n-fold unrolling
  – Software pipelining: pays it twice
    • Once at the prologue and once at the epilogue

• Moreover
  – Code compactness
  – Optimal runtime
  – Storage constraints

9/25/2006 eleg652-F06 41

Comparison of Static Methods

Method                 | Cycles per iteration
w/o scheduling         | 9
Scheduling             | 6
Unrolling              | 6.8
Unrolling + Scheduling | 3.5
2 issue                | 2.4
4 issue                | 1.28
SP 1-issue             | 5
SP 5-issue             | 1
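The per-iteration figures follow from dividing the cycle count of each schedule by the number of original iterations it covers. A small sketch of that arithmetic, using the cycle and iteration counts from the earlier example slides (the dictionary labels are just descriptive names):

```python
def cycles_per_iteration(cycles, iterations):
    """Cycles spent per original loop iteration."""
    return cycles / iterations


# (total cycles, iterations covered), taken from the example listings.
schedules = {
    "w/o scheduling": (9, 1),
    "scheduling": (6, 1),
    "unrolling x4": (27, 4),
    "unrolling x4 + scheduling": (14, 4),
    "two-issue, unrolled x5": (12, 5),
    "VLIW, 7 iterations": (9, 7),
}

for name, (cycles, iterations) in schedules.items():
    print(f"{name}: {cycles_per_iteration(cycles, iterations):.2f}")
```

The exact quotients for the unrolled loop and the VLIW schedule are 6.75 and 9/7 (about 1.286); the slides quote them as 6.8 and 1.28.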

9/25/2006 eleg652-F06 42

On a Final Note

Loop unrolling, trace scheduling, and software pipelining all aim at exposing fine grain parallelism.

“The effectiveness of these techniques and their suitability for various architectural approaches are among the most open research areas in pipelined processor design”

- Henn & Patt

9/25/2006 eleg652-F06 43

Limitations of VLIW

• Limited parallelism in (statically scheduled) code
  – Basic blocks may be too small
  – Global code motion is difficult

• Limited hardware resources
• Code size
• Memory port limitations
• A stall is serious
• The cache is difficult to use effectively
  – i-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width
  – The cache miss penalty is increased because of the length of the instruction word

9/25/2006 eleg652-F06 44

An Open Question

“...Whether there are large classes of applications that are not suitable for vector machines, but still offer enough parallelism to justify the VLIW approach rather than a simpler one, such as a superscalar machine?”

Henn & Patt 1990

9/25/2006 eleg652-F06 45

A VLIW Example

[Figure: TMS320C62x/C67x Block Diagram]

Source: TMS320C600 Technical Brief. February 1999

9/25/2006 eleg652-F06 46

A VLIW Example

[Figure: TMS320C62x/C67x Data Paths]

Source: TMS320C600 Technical Brief. February 1999

Assembly Example

9/25/2006 eleg652-F06 47

Introduction to SuperScalar

Topic 3b

9/25/2006 eleg652-F06 48

Instruction Issue Policy

• It determines the processor's look-ahead policy
  – The ability to examine instructions beyond the current PC

• Look-ahead must ensure correctness at all costs

• Issue policy
  – The protocol used to issue instructions

• Note: issue, execution, and completion are distinct

9/25/2006 eleg652-F06 49

Issues in Out of Order Execution & Completion

R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)

[Figure: dependence graph over instructions 1 to 4, with edges marked as flow, anti, or output dependences]

(2) and (3) cannot be completed out of order; otherwise the anti-dependence may be violated, that is, the R3 read by (2) may already have been overwritten by (3) when (2) was stalled for some reason.

9/25/2006 eleg652-F06 50


R3 := R3 op R5   (1)
R4 := R3 + 1     (2)
R3 := R5 + 1     (3)
R7 := R3 op R4   (4)

[Figure: dependence graph over instructions 1 to 4, with edges marked as flow, anti, or output dependences]

(1) and (3) cannot be completed out of order! Before an instruction is issued, the output dependence has to be checked against all preceding instructions that are already in the execution pipes, to ensure that results are written in the correct order. Otherwise R3 in (4) may get a wrong value.

9/25/2006 eleg652-F06 51


R := ...   (1)
... := R   (2)
R := ...   (3)

Note that the anti-dependence between (2) and (3) is handled correctly by stalling (3)’s issue if (1) has not completed.

9/25/2006 eleg652-F06 52

Achieve High Performance in Multiple Issued Instruction Machines

• Detection and resolution of storage conflicts
  – Extra “shadow” registers
  – Special bit for reservation

• Organization and control of the buses between the various units in the PU
  – Special controllers to detect write-backs and reads

9/25/2006 eleg652-F06 53

How to Detect Data Dependencies

X1 = X2 + X3

Y1 = Y2 + Y3

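How many dependencies are possible between these two instructions?

Five possible dependencies: X1 with Y2, X1 with Y3 (flow); X2 with Y1, X3 with Y1 (anti); X1 with Y1 (output)

A total of 5 × O(n²) comparisons for n instructions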
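The comparison count can be made concrete with a naive pairwise check. The sketch below assumes two-source, one-destination instructions and applies the check to the four-instruction example from the out-of-order completion slides; `comparisons` and `dependencies` are illustrative names. The pairwise test ignores intervening redefinitions, so it also reports the (1) to (4) flow pair on R3 even though (3) rewrites R3 in between.

```python
from itertools import combinations


def comparisons(n):
    """Five register-name comparisons for each of the n(n-1)/2 instruction pairs."""
    return 5 * n * (n - 1) // 2


# (destination, sources) for instructions (1) to (4) of the earlier example.
program = [
    ("R3", ("R3", "R5")),
    ("R4", ("R3",)),
    ("R3", ("R5",)),
    ("R7", ("R3", "R4")),
]


def dependencies(prog):
    """Naive pairwise dependence detection between every earlier/later pair."""
    found = []
    for (i, (d1, s1)), (j, (d2, s2)) in combinations(enumerate(prog, start=1), 2):
        if d1 in s2:
            found.append((i, j, "flow"))
        if d2 in s1:
            found.append((i, j, "anti"))
        if d1 == d2:
            found.append((i, j, "output"))
    return found


deps = dependencies(program)
print(comparisons(len(program)), deps)
```

The quadratic growth of `comparisons(n)` is the reason wide issue windows need dedicated dependence-checking hardware.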

9/25/2006 eleg652-F06 54

A Super Scalar Architecture

[Figure: pipeline stages Inst Fetch → Inst Decode → Issue Window (Wake Up, Select) → Register File → Exec → Write Back]

New: the Issue Window holds the instructions that are ready and the ones that are waiting for dependencies

9/25/2006 eleg652-F06 55

Data Dependencies & SuperScalar

• Hardware Mechanism (dynamic scheduling)

- Scoreboarding

- limited out-of-order issue/completion

- centralized control

- Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm)

- Micro dataflow

- Advantage: exact runtime information

- Load/cache miss

- resolve storage location related dependence

9/25/2006 eleg652-F06 56

Scoreboarding

• Named after the CDC 6600 scoreboard

• Effective when there are enough resources and no data dependencies

• Out-of-order execution

• Issue: checks the scoreboard; a WAW hazard will cause a stall

• Read operands: checks the availability of operands and resolves RAW dynamically at this step
  - WAR will not cause a stall

• EX

• Write result: WAR is checked and will cause a stall

9/25/2006 eleg652-F06 57

[Figure: registers connected over data buses to the functional units (integer unit, FP add, FP divide, two FP mult); the scoreboard exchanges control/status signals with the registers and with the functional units]

The basic structure of a DLX processor with a scoreboard

9/25/2006 eleg652-F06 58

Scoreboarding

[CDC6600, Thornton70], [WeissSmith84]

• A bit (called the “scoreboard bit”) is associated with each register; bit = 1 means the register is reserved by a pending write
• An instruction that has a source operand with bit = 1 will be issued, but put into an instruction window, with the register identifier to denote the “to-be-written” operand
• Copies of the valid operands are also read with the pending instruction (this solves anti-dependences)
• When the missing operand is finally written, the register id in the pending instruction is compared and the value written, so the instruction can be issued
• An instruction whose result register R is reserved will stall, so the output dependence (WAW) is correctly handled by the stall!

9/25/2006 eleg652-F06 59

Example

(1) DIVF F0, F2, F4

(2) ADDF F10, F0, F8

(3) SUBF F8, F8, F14

(3) is allowed to be issued and executed while (2) is waiting for an operand.

(3) cannot write its result to F8 if (2) has not yet read F8: it stalls, since the registers are not renamed.
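The stage checks described for scoreboarding can be sketched as three predicates: WAW at issue, RAW at operand read, and WAR at result write. The class below is an illustrative model only; it ignores functional-unit availability and assumes that instruction ids are integers assigned in program order.

```python
class Scoreboard:
    """Tracks pending writes and unread operands of issued instructions."""

    def __init__(self):
        self.writer = {}   # register -> id of the issued instruction that will write it
        self.dest = {}     # instruction id -> destination register
        self.waits = {}    # instruction id -> producer ids whose results are still missing
        self.unread = {}   # instruction id -> source registers not read yet

    def can_issue(self, dest):
        # WAW: issue stalls while an earlier write to `dest` is pending.
        return dest not in self.writer

    def issue(self, iid, dest, srcs):
        if not self.can_issue(dest):
            raise RuntimeError("WAW hazard: issue must stall")
        self.waits[iid] = {self.writer[s] for s in srcs if s in self.writer}
        self.unread[iid] = set(srcs)
        self.dest[iid] = dest
        self.writer[dest] = iid

    def can_read(self, iid):
        # RAW: operands are read only after every producer has written.
        return not self.waits[iid]

    def read(self, iid):
        if not self.can_read(iid):
            raise RuntimeError("RAW hazard: operands not ready")
        self.unread[iid] = set()

    def can_write(self, iid):
        # WAR: the write stalls while an earlier instruction still has to
        # read the old value of the destination register.
        d = self.dest[iid]
        return all(d not in regs
                   for other, regs in self.unread.items() if other < iid)

    def write(self, iid):
        if not self.can_write(iid):
            raise RuntimeError("WAR hazard: write must stall")
        del self.writer[self.dest.pop(iid)]
        del self.waits[iid], self.unread[iid]
        for producers in self.waits.values():
            producers.discard(iid)


sb = Scoreboard()
sb.issue(1, "F0", ["F2", "F4"])     # DIVF F0, F2, F4
sb.issue(2, "F10", ["F0", "F8"])    # ADDF F10, F0, F8
sb.issue(3, "F8", ["F8", "F14"])    # SUBF F8, F8, F14
sb.read(3)                          # (3) runs ahead of (2)
print(sb.can_read(2), sb.can_write(3))   # (2) waits on F0, (3) waits on (2)
```

Once (1) writes F0 and (2) reads its operands, `can_write(3)` becomes true, which is the stall-then-release behavior the example describes.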

9/25/2006 eleg652-F06 60

Motivation of Dynamic Scheduling

S1: X = ...
S2: ... = X
S3: X = ...

Q1
  – How to issue S2 without waiting for S1 to complete? (scoreboard)
  – How to issue S3 without waiting for S2 to complete?
  – How to issue S3 without waiting for S1 to complete?

9/25/2006 eleg652-F06 61

Features of Scoreboarding

• It permits out-of-order “issue” of instructions that are not related to each other.

• It permits out-of-order “completion” of instructions that are not related to each other.

• It prevents “execution” of an instruction if a flow dependence would be violated.

• It prevents “issue” of an instruction if an output dependence would be violated.

• It prevents “completion” of an instruction if an anti-dependence would be violated.

9/25/2006 eleg652-F06 62

Scoreboarding: Advantages

• Single bit: simple

• Only one pending write per register, so there is no need to identify which one is the latest

9/25/2006 eleg652-F06 63

Micro Data Flow

• Fundamental Concepts
  – “Data Flow”
    • Instructions can only be fired when their operands are available
  – Single assignment and register renaming

• Implementation
  – Tomasulo's Algorithm
  – Reorder Buffer

9/25/2006 eleg652-F06 64

Renaming/Single Assignment

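Original code:

R0 = R2 / R4     (1)
R6 = R0 + R8     (2)
R1[0] = R6       (3)
R8 = R10 – R14   (4)
R6 = R10 * R8    (5)

After renaming (single assignment):

R0 = R2 / R4     (1)
S = R0 + R8      (2)
R1[0] = S        (3)
T = R10 – R14    (4)
R6 = R10 * T     (5)

[Figure: dependence graph over instructions 1 to 5, before and after renaming]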

9/25/2006 eleg652-F06 65

Principles of Register Renaming

• Additional registers re-establish a one-to-one correspondence between values and registers.

• Extra registers: scheduled by hardware and associated with values

• A new value → a new (hardware) register
• Anti and output dependencies are avoided
• Registers are reused according to program needs
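A minimal sketch of the renaming idea: every write allocates a fresh register, and every read is redirected to the most recent name of its source. It assumes an unlimited supply of physical registers (real hardware recycles them), and the names `rename` and `P0`, `P1`, ... are illustrative rather than taken from the slides.

```python
import itertools


def rename(program, prefix="P"):
    """Rename destinations so that each value gets its own register.

    program: list of (dest, op, srcs); dest is None for a store.
    """
    fresh = (f"{prefix}{n}" for n in itertools.count())
    current = {}          # architectural name -> latest physical name
    renamed = []
    for dest, op, srcs in program:
        # Sources read the latest mapping (or the original name if none).
        new_srcs = tuple(current.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)      # a new value gets a new register
            current[dest] = new_dest
        renamed.append((new_dest, op, new_srcs))
    return renamed


# The five-instruction example from the renaming slide.
program = [
    ("R0", "/", ("R2", "R4")),
    ("R6", "+", ("R0", "R8")),
    (None, "store", ("R6", "R1")),
    ("R8", "-", ("R10", "R14")),
    ("R6", "*", ("R10", "R8")),
]
renamed = rename(program)
for line in renamed:
    print(line)
```

After renaming, instruction (4) no longer overwrites the R8 that (2) reads, and (2) and (5) write different registers, so only the flow dependences remain.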

9/25/2006 eleg652-F06 66

Baseline Superscalar Model

[Figure: pipeline stages Inst Fetch → Inst Decode → Renaming → Issue Window (Wake Up, Select) → Register File → Bypass → Exec / Data Cache, followed by execution bypass and data cache access, then register write and instruction commit]

9/25/2006 eleg652-F06 67

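Micro Data Flow: Conceptual Model

A → R1
R1 * B → R2
R2 / C → R1
R4 + R1 → R4

[Figure: dataflow graph in which the Load, *, /, and + operations are connected through operand registers OR1 to OR6, with inputs A, B, C and the visible registers R1 to R4]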

9/25/2006 eleg652-F06 68

Register Types

• Two kinds of registers
  – Forwarding Registers
    • Program / instruction visible
    • Compiler and programmer scheduled
  – Physical Operand Registers
    • Not visible
    • Scheduled and assigned by the hardware

9/25/2006 eleg652-F06 69

Reorder Buffer & Instruction Commit

• Instruction Commit
  – When an instruction is allowed to update memory and/or registers
  – A concept used a lot in speculation

• Instruction Commit vs. Instruction Execution
  – When speculation is used, instruction commit may not happen immediately after instruction execution.

• Reorder Buffer
  – A hardware buffer holding instructions that have completed but not committed
  – Execute out of order but commit in order
  – Extends the register set with extra registers

9/25/2006 eleg652-F06 70

ROB Stages

• Issue
  – Dispatch an instruction from the instruction queue
  – Reserve a ROB entry and a reservation station

• Execute
  – Stall for operands
  – RAW resolved

• Write Result
  – Write back to any reservation stations waiting for it and to the ROB

• Commit
  – Normal commit: update registers
  – Store commit: update memory
  – Mispredicted branch: flush the ROB and restart execution

9/25/2006 eleg652-F06 71

ROB High Level Overview

[Figure: Fetcher, Instruction Queue, Decoder, Reorder Buffer, Instruction Window, Register File, and three Functional Units]

9/25/2006 eleg652-F06 72

ROB Organization

• Content Addressable
  – For X = A + B, X is renamed to a ROB register (X') and all later references are replaced by it
  – If X is needed as an operand, then ROB[X] is looked up, and
    • the value (if available) or a tag (if not) is returned if X' exists in the ROB, or …
    • a search of the “visible” register bank is executed if X' is not found in the ROB

9/25/2006 eleg652-F06 73

ROB Organization

• If there is more than one ROB[X], then the most recent “entry” is fetched from the ROB

• When a result is produced:
  – All reservation stations that hold a tag for that result are updated

• When an instruction commits:
  – Update the register banks and memory
  – Flush the ROB in case of a mispredicted branch
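The lookup and commit rules can be sketched with a small model of the buffer. This is an illustration under simplifying assumptions: there are no reservation stations or branches, tags are generated as `B0`, `B1`, ... (the tag values themselves are arbitrary), and the class and method names are made up for the example.

```python
class ReorderBuffer:
    def __init__(self, registers):
        self.regs = dict(registers)   # architecturally visible registers
        self.entries = []             # in program order
        self.next_tag = 0

    def lookup(self, reg):
        """Return ('value', v) or ('tag', t) from the most recent entry for reg."""
        for entry in reversed(self.entries):
            if entry["reg"] == reg:
                if entry["done"]:
                    return ("value", entry["value"])
                return ("tag", entry["tag"])
        return ("value", self.regs[reg])   # not in the ROB: visible register bank

    def allocate(self, reg):
        """Rename the destination `reg` to a new ROB entry and return its tag."""
        tag = f"B{self.next_tag}"
        self.next_tag += 1
        self.entries.append({"tag": tag, "reg": reg, "value": None, "done": False})
        return tag

    def complete(self, tag, value):
        """Record a result; execution may finish out of order."""
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"], entry["done"] = value, True

    def commit(self):
        """Retire finished entries from the head only, i.e. in program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.pop(0)
            self.regs[entry["reg"]] = entry["value"]
            retired.append(entry["tag"])
        return retired


rob = ReorderBuffer({"R0": 1, "R1": 0, "R2": 0, "R4": 0})
t1 = rob.allocate("R1")             # R1 = R0 + 5
src2 = rob.lookup("R1")             # operand of R2 = R1 + 6
t2 = rob.allocate("R2")
t3 = rob.allocate("R1")             # R1 = R1 + 3
src4 = rob.lookup("R1")             # operand of R4 = R1 + 9
print(src2, src4)
```

The last lookup returns the tag of the newest R1 entry rather than the older one, which is how a later reader is tied to the correct producer.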

9/25/2006 eleg652-F06 74

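An Example

R1 = R0 + 5   (1)
R2 = R1 + 6   (2)
R1 = R1 + 3   (3)
R4 = R1 + 9   (4)

Assume R0 = 1 initially. An asterisk marks an enabled instruction.

[Figure: the register file and the R-buffer supply the instruction operands; instruction results are written to the R-buffer]

After (1) and (2) are issued:

R-buffer entry | R-name | Value | Full
B5             | R1     |       | 0
B8             | R2     |       | 0

Enabled | op | op1      | op2 | dest
*       | +  | 1 (= R0) | 5   | B5
        | +  | B5       | 6   | B8

After (3) is issued, with (1) and (2) in flight:

R-buffer entry | R-name | Value | Full
B5             | R1     |       | 0
B8             | R2     |       | 0
B11            | R1     |       | 0

Enabled | op | op1 | op2 | dest
        | +  | 1   | 5   | B5
        | +  | B5  | 6   | B8
        | +  | B5  | 3   | B11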

9/25/2006 eleg652-F06 75

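An Example: (1) completed

R1 = R0 + 5   (1)
R2 = R1 + 6   (2)
R1 = R1 + 3   (3)
R4 = R1 + 9   (4)

R-buffer entry | R-name | Value | Full
B5             | R1     | 6     | 1
B8             | R2     |       | 0
B11            | R1     |       | 0

Enabled | op | op1 | op2 | dest
*       | +  | 6   | 6   | B8
*       | +  | 6   | 3   | B11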

9/25/2006 eleg652-F06 76

An Example: (4) issued

R1 = R0 + 5   (1)
R2 = R1 + 6   (2)
R1 = R1 + 3   (3)
R4 = R1 + 9   (4)

R-buffer entry | R-name | Value | Full
B5             | R1     | 6     | 1
B8             | R2     |       | 0
B11            | R1     |       | 0
B13            | R4     |       | 0

Enabled | op | op1 | op2 | dest
*       | +  | 6   | 6   | B8
*       | +  | 6   | 3   | B11
        | +  | B11 | 9   | B13

Note: the first operand of (4) comes directly from B11 (not “6”), so the flow dependence is handled!

Also note that the two enabled instructions, (2) and (3), can be completed out of order, but “6” is not affected, so the anti-dependence is resolved properly.

9/25/2006 eleg652-F06 77

Questions

• Memory Renaming
  – Not as attractive as R-R data flow
  – Loads and stores are less frequent
  – Memory locations are less reused (in the register allocation sense)
  – Memory ops have only one memory operand

• Store Buffer
  – Gives loads priority to access the data cache
  – In-order stores
  – Ensures that all previous instructions are performed before a store has completed

9/25/2006 eleg652-F06 78

Memory Dataflow

• More difficult
  – Memory addresses are longer
  – The memory address may not be available at the decode stage

• Note:
  – In-order cache state: all stores must be performed in program order
    • All previous operations should have completed
  – No cache reorder / check mechanism

9/25/2006 eleg652-F06 79

Load / Store Policy

• Loads may bypass stores if there are NO true dependencies among them
  • A check should be performed to ensure correctness
  • A load cannot bypass a store to the same address. If one is detected, the load is satisfied directly from the store buffer

• Loads are performed in program order at the data cache, with respect to other loads
  • Simplicity
  • Out of order doesn't help much anyway

• At the data cache interface
  • No anti-dependence: no store can bypass a load
  • No output dependence: no stores can bypass each other
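The load bypass rule can be sketched with a small store buffer model. The sketch assumes that every store address is already known when the load executes, and it uses a dictionary as a stand-in for the data cache; the class and method names are illustrative.

```python
class StoreBuffer:
    def __init__(self, memory):
        self.memory = memory    # dict: address -> value (stand-in for the data cache)
        self.pending = []       # buffered stores in program order: (address, value)

    def store(self, addr, value):
        """Buffer a store; memory is updated later, in program order."""
        self.pending.append((addr, value))

    def load(self, addr):
        """A load bypasses older stores unless one targets the same address."""
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value    # satisfied directly from the store buffer
        return self.memory[addr]

    def drain_one(self):
        """Perform the oldest buffered store, keeping stores in order."""
        if self.pending:
            addr, value = self.pending.pop(0)
            self.memory[addr] = value


mem = {0x10: 1, 0x18: 2}
sb = StoreBuffer(mem)
sb.store(0x10, 99)
print(sb.load(0x18), sb.load(0x10), mem[0x10])
```

The load from 0x18 goes straight to memory past the buffered store, while the load from 0x10 receives the buffered value even though memory has not been updated yet.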

9/25/2006 eleg652-F06 80

Memory Dependencies & Event Ordering

• In case a store target address cannot be resolved
  – All subsequent loads are withheld until the address can be resolved

• If two loads are in the instruction window
  – Do they need to wait for each other to be resolved?

9/25/2006 eleg652-F06 81

Out-Of-Order Architectures

[Figure: Fetch → Decode / Rename → issue queues and units (INT, FP, L/S) with the ROB and memory; the instruction stream contains i0: R2 * R3, i1: load @[R1 + R4], …, i2: load @[R5]]

Independent loads can execute in parallel

If i1 and i2 are independent, then they can execute at the same time

Loads do NOT need to wait for each other, even when addressed to the same memory location

9/25/2006 eleg652-F06 82

Summary

• Reorder Buffer
  – The most powerful of the complex dynamic scheduling techniques

• Simplest: Scoreboarding

• Hardware implementation is complex
  – Is it worth its returns?

9/25/2006 eleg652-F06 83

Tomasulo’s Algorithm

• Tomasulo, R.M. “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, IBM J. of R&D 11:1 (Jan. 1967), pp. 25-33

• IBM 360/91 (three years after the CDC 6600 and just before caches)

• Features:
  • CDB: Common Data Bus
  • Reservation Stations: hardware features which allow the fetch, use, and reuse of data as soon as it becomes available. They allow register renaming and are decentralized in nature (as opposed to scoreboarding)

9/25/2006 eleg652-F06 84


• Control and buffers distributed with the functional units
• HW renaming of registers
• CDB broadcasting
• Load / Store buffers are treated like functional units
• Reservation Stations:
  – Hazard detection and instruction control
  – 4-bit tag field to specify which station or buffer will produce the result

• Register Renaming
  – Tag assigned on IS
  – Tag discarded after write back

9/25/2006 eleg652-F06 85

Comparison

• Scoreboarding
  – Centralized data structure and control
  – Register bit
    • Simple, low cost
  – Structural hazards solved by FU
  – Solves RAW by register bit
  – Solves WAR in write
  – Solves WAW: stalls on issue

• Tomasulo's Algorithm
  – Distributed control
  – Tagged registers + register renaming
  – Structural hazard: stalls on reservation station
  – Solves RAW by CDB
  – Solves WAR by copying operands to the reservation station
  – Solves WAW by renaming
  – Limited: CDB
    • Broadcast
    • 1 per cycle

9/25/2006 eleg652-F06 86

Reservation Station Fields

• Op: Operation to perform in the unit
• Vj, Vk: Values of the source operands
  – Store buffers have a V field: the result to be stored
• Qj, Qk: Reservation stations producing the source registers
  – Zero means ready
• Busy
• A: Memory address calculation
• Register file:
  – Qi: The number of the reservation station that will write to this register

9/25/2006 eleg652-F06 87

The Architecture

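[Figure: load buffers (1 to 6) fed from memory; FP operation queue fed from the instruction unit; FP registers; operation bus and operand buses feeding the reservation stations (3 for the FP adders, 2 for the FP multipliers); store buffers (1 to 3) writing to memory; the common data bus (CDB) connects the results back to the reservation stations, registers, and store buffers]

- 3 Adders
- 2 Multipliers
- Load buffers (6)
- Store buffers (3)
- FP Queue
- FP registers
- CDB: Common Data Bus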

9/25/2006 eleg652-F06 88

Tomasulo’s Algorithm’s Steps

• Issue
  - Issue if an empty reservation station is found; fetch the operands if they are in registers, otherwise assign a tag
  - If no empty reservation station is found, stall and wait for one to become free
  - Renaming is performed here, and WAW and WAR are resolved

• Execute
  – If operands are not ready, monitor the CDB for them
  – RAWs are resolved
  – When they are ready, execute the op in the FU

• Write Back
  – Send the results to the CDB and update the registers and the store buffers
  – Store buffers will write to memory during this step

• Exception Behavior
  – During Execute: no instruction is allowed to initiate execution until all branches before it have completed
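The issue and write-back steps can be sketched as a small model. It is a simplification under stated assumptions: there is no timing, no store buffer, and no address calculation; load values are supplied by the caller when the load writes back; and the station tags (`L1`, `A1`, `M1`, ...) and all class and method names are illustrative. The run uses the six-instruction example from the slides, with load results 40 and 32 and F4 = 4.

```python
import operator

OPS = {"ADDD": operator.add, "SUBD": operator.sub,
       "MULD": operator.mul, "DIVD": operator.truediv}


class Tomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)   # register file values
        self.qi = {}             # register -> tag of the station that will write it
        self.rs = {}             # busy reservation stations / load buffers by tag

    def _read(self, reg):
        """Return (value, None) if reg is ready, else (None, producer tag)."""
        if reg in self.qi:
            return None, self.qi[reg]
        return self.regs[reg], None

    def issue(self, tag, op, dest, src1=None, src2=None):
        if tag in self.rs:
            raise RuntimeError("structural hazard: station busy, issue stalls")
        vj, qj = self._read(src1) if src1 else (None, None)
        vk, qk = self._read(src2) if src2 else (None, None)
        self.rs[tag] = {"op": op, "vj": vj, "vk": vk, "qj": qj, "qk": qk}
        self.qi[dest] = tag      # renaming: later readers of dest wait on this tag

    def ready(self, tag):
        st = self.rs[tag]
        return st["qj"] is None and st["qk"] is None

    def write_back(self, tag, value=None):
        """Broadcast on the CDB; loads pass in the value returned by memory."""
        if not self.ready(tag):
            raise RuntimeError("operands not ready: RAW not yet resolved")
        st = self.rs.pop(tag)
        if value is None:
            value = OPS[st["op"]](st["vj"], st["vk"])
        for other in self.rs.values():           # waiting stations grab the value
            if other["qj"] == tag:
                other["vj"], other["qj"] = value, None
            if other["qk"] == tag:
                other["vk"], other["qk"] = value, None
        for reg in [r for r, t in self.qi.items() if t == tag]:
            self.regs[reg] = value               # only the latest writer's tag matches
            del self.qi[reg]
        return value


def run_example():
    m = Tomasulo({f"F{i}": 0.0 for i in range(0, 12, 2)})
    m.regs["F4"] = 4.0
    m.issue("L1", "LD", "F6")
    m.issue("L2", "LD", "F2")
    m.issue("M1", "MULD", "F0", "F2", "F4")
    m.issue("A1", "SUBD", "F8", "F2", "F6")
    m.issue("M2", "DIVD", "F10", "F0", "F6")
    m.issue("A2", "ADDD", "F6", "F8", "F2")
    status = dict(m.qi)          # register status after all six are issued
    m.write_back("L1", 40.0)
    m.write_back("L2", 32.0)
    for tag in ("A1", "M1", "A2", "M2"):
        m.write_back(tag)
    return status, m.regs


status, regs = run_example()
print(status)
print(regs)
```

When L1 returns 40, register F6 is not updated because its status tag already names A2; the value only reaches the stations that captured the L1 tag. That tag comparison is how the output dependence on F6 is resolved in this model.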

9/25/2006 eleg652-F06 89


• Note that:
  • Upon entering a reservation station, source operands are either filled with values or renamed
  • The new names are in 1-to-1 correspondence with FU names

• Question:
  • How are the output dependencies resolved?
    • Two pending writes to a register
    • How to determine that a read will get the most recent value if they complete out of order

9/25/2006 eleg652-F06 90

Features of T. Alg.

• The value of an operand (for any instruction already issued to a reservation station) will be read from the CDB. It will not be read from the register field.

• Instructions can be issued even before their operands are produced (it is known that they will arrive on the CDB)

9/25/2006 eleg652-F06 91

An Example

LD   F6, 34(R2)    (1)
LD   F2, 45(R3)    (2)
MULD F0, F2, F4    (3)
SUBD F8, F2, F6    (4)
DIVD F10, F0, F6   (5)
ADDD F6, F8, F2    (6)

[Figure: dependence graph over instructions 1 to 6]

9/25/2006 eleg652-F06 92

An Example

State after LD F6, 34(R2) is issued:

Memory Unit (load buffers 1 to 3)
Station | OP | Vj | Vk | Qj | Qk | Busy | Addr
1       | L  |    |    | 0  | 0  | Yes  | 34+[R2]

FP Adders (stations 1 to 3): empty
FP Mult (stations 1 and 2): empty

Register status (F0 to F11): F6 = L1

Instruction queue, next to issue first: LD F2,45(R3); MULD F0,F2,F4; SUBD F8,F2,F6; DIVD F10,F0,F6; ADDD F6,F8,F2

9/25/2006 eleg652-F06 93

An Example



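State after LD F2, 45(R3) is issued:

Memory Unit (load buffers 1 to 3)
Station | OP | Vj | Vk | Qj | Qk | Busy | Addr
2       | L  |    |    | 0  | 0  | Yes  | 45+[R3]
1       | L  |    |    | 0  | 0  | Yes  | 34+[R2]

Register status (F0 to F11): F2 = L2, F6 = L1

Instruction queue, next to issue first: MULD F0,F2,F4; SUBD F8,F2,F6; DIVD F10,F0,F6; ADDD F6,F8,F2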

9/25/2006 eleg652-F06 94

An Example


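State after MULD F0, F2, F4 is issued (load buffers 1 and 2 remain busy):

FP Mult (stations 1 and 2)
Station | OP | Vj                | Vk       | Qj | Qk | Busy
1       | *  | 45+[R3] (pending) | [F4] = 4 | L2 | 0  | Yes

Register status (F0 to F11): F0 = M1, F2 = L2, F6 = L1

Instruction queue, next to issue first: SUBD F8,F2,F6; DIVD F10,F0,F6; ADDD F6,F8,F2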

9/25/2006 eleg652-F06 95

An Example


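State after SUBD F8, F2, F6 is issued (load buffers 1 and 2 remain busy):

FP Adders (stations 1 to 3)
Station | OP | Vj                | Vk                | Qj | Qk | Busy
1       | -  | 45+[R3] (pending) | 34+[R2] (pending) | L2 | L1 | Yes

FP Mult (stations 1 and 2)
Station | OP | Vj                | Vk       | Qj | Qk | Busy
1       | *  | 45+[R3] (pending) | [F4] = 4 | L2 | 0  | Yes

Register status (F0 to F11): F0 = M1, F2 = L2, F6 = L1, F8 = A1

Instruction queue, next to issue first: DIVD F10,F0,F6; ADDD F6,F8,F2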

9/25/2006 eleg652-F06 96

An Example


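State after DIVD F10, F0, F6 is issued (load buffers 1 and 2 remain busy):

FP Adders (stations 1 to 3)
Station | OP | Vj                | Vk                | Qj | Qk | Busy
1       | -  | 45+[R3] (pending) | 34+[R2] (pending) | L2 | L1 | Yes

FP Mult (stations 1 and 2)
Station | OP | Vj                | Vk                | Qj | Qk | Busy
2       | /  |                   | 34+[R2] (pending) | M1 | L1 | Yes
1       | *  | 45+[R3] (pending) | [F4] = 4          | L2 | 0  | Yes

Register status (F0 to F11): F0 = M1, F2 = L2, F6 = L1, F8 = A1, F10 = M2

Instruction queue: ADDD F6,F8,F2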

9/25/2006 eleg652-F06 97

An Example

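State after ADDD F6, F8, F2 is issued (load buffers 1 and 2 remain busy):

FP Adders (stations 1 to 3)
Station | OP | Vj                | Vk                | Qj | Qk | Busy
2       | +  |                   | 45+[R3] (pending) | A1 | L2 | Yes
1       | -  | 45+[R3] (pending) | 34+[R2] (pending) | L2 | L1 | Yes

FP Mult (stations 1 and 2)
Station | OP | Vj                | Vk                | Qj | Qk | Busy
2       | /  |                   | 34+[R2] (pending) | M1 | L1 | Yes
1       | *  | 45+[R3] (pending) | [F4] = 4          | L2 | 0  | Yes

Register status (F0 to F11): F0 = M1, F2 = L2, F6 = A2, F8 = A1, F10 = M2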


9/25/2006 eleg652-F06 98

An Example

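Some time later, L1 returns 40 and commits:

FP Adders (stations 1 to 3)
Station | OP | Vj                | Vk                | Qj | Qk | Busy
2       | +  |                   | 45+[R3] (pending) | A1 | L2 | Yes
1       | -  | 45+[R3] (pending) | 40                | L2 | 0  | Yes

FP Mult (stations 1 and 2)
Station | OP | Vj                | Vk       | Qj | Qk | Busy
2       | /  |                   | 40       | M1 | 0  | Yes
1       | *  | 45+[R3] (pending) | [F4] = 4 | L2 | 0  | Yes

Memory Unit (load buffers 1 to 3)
Station | OP | Vj | Vk | Qj | Qk | Busy | Addr
2       | L  |    |    | 0  | 0  | Yes  | 45+[R3]

Register status (F0 to F11): F0 = M1, F2 = L2, F6 = A2, F8 = A1, F10 = M2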

9/25/2006 eleg652-F06 99

An Example

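Some time later, L2 returns 32 and commits:

FP Adders (stations 1 to 3)
Station | OP | Vj | Vk | Qj | Qk | Busy
2       | +  |    | 32 | A1 | 0  | Yes
1       | -  | 32 | 40 | 0  | 0  | Yes

FP Mult (stations 1 and 2)
Station | OP | Vj | Vk       | Qj | Qk | Busy
2       | /  |    | 40       | M1 | 0  | Yes
1       | *  | 32 | [F4] = 4 | 0  | 0  | Yes

Memory Unit (load buffers 1 to 3): empty

Register status (F0 to F11): F0 = M1, F2 = 32, F6 = A2, F8 = A1, F10 = M2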

9/25/2006 eleg652-F06 100

An Example

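A1 and M1 complete and commit:

FP Adders (stations 1 to 3)
Station | OP | Vj | Vk | Qj | Qk | Busy
2       | +  | -8 | 32 | 0  | 0  | Yes

FP Mult (stations 1 and 2)
Station | OP | Vj  | Vk | Qj | Qk | Busy
2       | /  | 128 | 40 | 0  | 0  | Yes

Register status (F0 to F11): F0 = 128, F2 = 32, F6 = A2, F8 = -8, F10 = M2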

9/25/2006 eleg652-F06 101

An Example


A2 and M2 complete and commit:

All reservation stations and load buffers are empty.

Register status (F0 to F11): F0 = 128, F2 = 32, F6 = 24, F8 = -8, F10 = 3.2

9/25/2006 eleg652-F06 102

ROB and Tomasulo’s Alg.

• Many elements of Tomasulo's algorithm are already included
• Major difference? How WAW is handled.

In Tomasulo: this is done by keeping a “tag” with each register X; the tag is updated each time an instruction that writes X, such as

X ← ... “+” ...

is issued. For example, X-tag = “+3” means that the 3rd + unit is reserved to produce X.

When a result is written back via the CDB, the tag of the FU is compared with the tag of the register:

if tag of FU = X-tag, overwrite the register (e.g. X)
else ignore the result

9/25/2006 eleg652-F06 103


• Advantages
  • Distribution of hazard detection logic
  • Register renaming and reservation stations take care of all data hazards.

• Disadvantages
  - Hardware cost: high-speed associative memory for tags + complex control logic
  - One single CDB may be a bottleneck, while multiple CDBs may be too costly (all associative memory must be duplicated)

9/25/2006 eleg652-F06 104

Conclusions

- Good for pipelined architectures for which it is difficult to schedule code and which are short on “visible” registers

- Future:
  - Hybrid between software and hardware techniques
  - Static scheduling of R-R instructions
  - Dynamic scheduling of loads and stores

9/25/2006 eleg652-F06 105

9/25/2006 eleg652-F06 106

Example: Dynamic Scheduling in the Pentium 4

• Fetch up to 3 IA-32 instructions per cycle

• Decode them into micro-ops and send them to the out-of-order execution engine.

• Commit up to 3 micro-ops per cycle

• The pipeline takes 20 cycles

• Register renaming files
  – Potentially 128 outstanding results

• Seven integer execution units

9/25/2006 eleg652-F06 107

An Example of an OoO Engine

Intel Xeon Out of Order Engine Pipeline Picture Courtesy of Intel from “Hyper-Threading Technology Architecture and Microarchitecture”

9/25/2006 eleg652-F06 108

VLIW vs Superscalar

• Superscalar
  • Advantages
    – Better code density
    – Code compatible
  • Difference
    – Dynamic scheduling
  • Disadvantages
    – More IF and ID
    – More delay slots are needed
    – Different FU

• VLIW
  • Advantages
    – Fixed instruction format
    – Explicit parallelism exposed
      • Trace scheduling
  • Difference
    – Static scheduling
  • Disadvantages
    – Static scheduling
    – No dynamic decision
    – Code explosion
    – Caches are difficult to use

9/25/2006 eleg652-F06 109

Bibliography

• Texas Instruments, “TMS320C600 Technical Brief.” February 1999. www.ti.com

• Intel Pentium 4 Northwood. www.chip-architect.com. April 2003.

• “Hyper-Threading Technology Architecture and Microarchitecture.” Intel Technology Journal, Volume 6, Issue 1, February 2002, p4-15
