A. Moshovos ©ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential...

Post on 23-Dec-2015

220 views 0 download

Tags:

Transcript of A. Moshovos ©ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential...

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Instruction Level Parallel Processing• Sequential Execution Semantics• Superscalar Execution

– Interruptions• Out-of-Order Execution

– How it can help– Issues:

• Maintaining Sequential Semantics• Scheduling

– Scoreboard• Register Renaming

• Initially, we’ll focus on Registers, Memory later on

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Sequential Semantics - Review

• Instructions appear as if they executed:– In the order they appear in the program– One after the other

• Pipelining: Partial Overlap of Instructions– Initiate one instruction per cycle– Subsequent instructions overlap partially– Commit one instruction per cycle

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Can we do better than pipelining?

loop: ld r2, 10(r1)add r3, r3, r2sub r1, r1, 1bne r1, r0, loop

Pipelining:

sum += a[i--]

fetch decode ldfetch decode add

fetch decode subfetch decode bne

time

Superscalar:fetch decode ld

fetch decode addfetch decode sub

fetch decode bne

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Superscalar - In-order (initial def.)

• Two or more consecutive instructions (in the original program order) can execute in parallel

• Is this much better than pipelining?– What if all instructions were dependent?

• Superscalar buys us nothing

• Again key is typical program behavior– Some parallelism exists– Pipelining “drains” on dependences– Superscalar consumes “fill-up” time

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Practicalities

• Issue mechanism– At decode check:

• Dependences • Input operand availability

– Check against Instructions:• Simultaneously Decoded• In-progress in the pipeline (i.e., previously issued)

– Recall the register vector from pipelining

• Increasingly Complex with degree of superscalarity– 2-way, 3-way, …, n-way

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Issue Rules

• Stall at decode if:– RAW dependence and no data available

• Source registers against previous targets

– WAR or WAW dependence• Target register against previous targets + sources

– No resource available• This check is done in program order

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Issue Mechanism

• Assume 2 source & 1 target max per instr.– comparators for 2-way:

• 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)

– comparators for 4-way:• 2nd instr: 3 tgt and 2 src• 3rd instr: 6 tgt and 4 src• 4th instr: 9 tgt and 6 src

tgt src1 src1

tgt src1 src1

tgt src1 src1

� simplifications may be possible� resource checking not shown

Pro

gra

m o

rder

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Implications

• Need to multiport some structures– Register File

• Multiple Reads and Writes per cycle

– Register Availability Vector• Multiple Reads and Writes per cycle

– From Decode and Commit– Also need to worry about WAR and WAW

• Resource tracking– Additional issue conditions

• Many Superscalars had additional restrictions– E.g., execute one integer and one floating point op– one branch, or one store/load

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Preserving Sequential Semantics• In principle not much different than

pipelining• Program order is preserved in the pipeline• Some instructions proceed in parallel

– But order is clearly defined• Defer interrupts to commit stage (i.e.,

writeback)– Flush all subsequent instructions

• may include instructions committing simultaneously

– Allow all preceding instructions to commit• Recall comparisons are done in program

order• Must have sufficient time in clock cycle to

handle these

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Interrupts Example

fetch decode ldfetch decode addfetch decode div

fetch decode bne

Exceptionraised

Exceptiontaken

fetch decode bne

fetch decode ldfetch decode addfetch decode div

fetch decode bne

Exceptionraised

Exceptiontaken

fetch decode bne

Exceptionraised

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Superscalar vs. Pipelining

• In principle they are orthogonal– Superscalar non-pipelined machine– Pipelined non-superscalar– Superscalar and Pipelined (common)

• Additional functionality needed by Superscalar:– Another bound on clock cycle– At some point it limits the number of

pipeline stages

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Superscalar vs. Superpipelining• Superpipelining:

– Vaguely defined as deep pipelining, i.e., lots of stages• Superscalar issue complexity: limits super-

pipelining• How do they compare?

– Conceptually, not by much.fetch decode instfetch decode inst

fetch decode instfetch decode inst

fetch decode instfetch decode inst

fetch decode instfetch decode inst

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Case Study: Alpha 21164

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

21164: Int. Pipe

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

21164: Memory Pipeline

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

21164: Floating-Point Pipe

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

80486 Pipeline• Fetch

– Load 16-bytes from into prefetch buffer• Decode 1

– Determine instruction length and type• Decode 2

– Compute memory address– Generate immediate operands

• Execute– Register Read– ALU– Memory read/write

• Write-back– Update register file

• (source: CS740 CMU, ’97, all slides on 486)

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

80486 Pipeline detail• Fetch

– Moves 16 bytes of instruction stream into code queue

– Not required every time– About 5 instructions fetched at once (avg. length 2.5

bytes)– Only useful if don’t branch– Avoids need for separate instruction cache

• D1– Determine total instruction length– Signals code queue aligner where next instruction

begins– May require two cycles

• When multiple operands must be decoded• About 6% of “typical” DOS program

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

80486 Pipeline• D2

– Extract memory displacements and immediate operands

– Compute memory addresses– Add base register, and possibly scaled index register– May require two cycles

• If index register involved, or both address & immediate operand

• Approx. 5% of executed instructions• EX

– Read register operands– Compute ALU function– Read or write memory (data cache)

• WB– Update register result

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-of-Order Execution

• Also known as dynamic scheduling– Compilers do static scheduling

• We will start by considering register only– Register interface helps a lot

• Later on we will expand to memory– Tricky: Memory interface is more powerful

than registers – Makes it harder to figure out dependences

• In principle the same method will be used for both

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Beyond Superscalar Execution

do {sum += a[++m]; i--;

} while (i != 0);

out-of-order

loop: add r4, r4, 1ld r2, 10(r4)add r3, r3, r2sub r1, r1, 1bne r1, r0, loop

Superscalar:

fetch decode

fetch decode

sub

bne

fetch decode add

fetch decode ld

fetch decode add

fetch decode

fetch decode

sub

bne

fetch decode add

fetch decode ld

fetch decode add

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

fetch decode

fetch decode

sub

bne

fetch decode add

fetch decode ld

fetch decode add

Sequential Semantics?

• Execution does NOT adhere to sequential semantics

• To be precise: Eventually it may• Simplest solution: Define problem away

– Imprecise interrupts• On interrupt some instr. committed some not• software we’ll have to figure out what is going on• Horrible for debugging and programming

inconsistent

consistent

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Interrupt

• Recall we use the term interrupt to signify the need to observe the machine’s state after any instruction– This can be indeed the result of interrupt in

the classical sense– But, it could be a debugger (still uses

interrupts)

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-of-Order vs. Pipelining and Superscalar

• Definition: two or more instructions can execute in any order if they have no dependences (RAW, WAW, WAR)– beware of transitive dependences

• Is this better than pipelining or superscalar exec?– If all are independent: not– if all dependent: not– Programs have some parallelism– Pipelining “drain” and “fill-up” overheads– Superscalar, parallelism only when adjacent– OoO exploits par. even when not adjacent

• OoO Orthogonal to pipelining and Superscalar

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-of-order Execution Issues

• Preserving Sequential Semantics

• Stalling Instructions w/ dependences

• Issuing Instructions when dependences are satisfied

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Back to Sequential Semantics

• Instr. exec. in 3 phases:– In-progress, Completed (NEW), Committed– OOO for in-progress and Completed– In-order Commits

• Completed - out-of-order: ”Visible only inside”– Results visible to subsequent instructions– Results not visible to outsiders

• On interrupts completed results are discarded• Committed - in-order: ”Visible to all”

– Results visible to subsequent instructions– Results visible to outsiders

• On interrupt committed results are preserved

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

How Completes Help w/ Performance

Tim

e

DIV R3, _, _ADD R1, _, _ADD _, R1, _

In-ordercommits

in-ordercompletes

out-of-order completesin-order commits

complete

fetch decode

fetch decode

sub

bne

fetch decode add

fetch decode ld

fetch decode add

commit

commit

commit

commit

commit

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Implementing Completes/Commits

• Key idea:– Maintain sufficient state around to be able to

roll-back when necessary– Roll-back:

• Discard (aka Squash) all not committed

• One solution (conceptual):– Upon Complete instruction records previous

value of target register– Upon Discard, instruction restores target

value– Upon Commit, nothing to do

• We will return to this shortly • Focus on scheduling mechanisms

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-of-Order Execution the Big PictureProgram Form Processing Phase

Static program

dynamic inst.Stream (trace)

execution window

completed instructions

Dispatch/ dependences

inst. Issue

inst execution

inst. Reorder & commit

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-of-Order Execution: Stages

• Fetch: get instruction from memory• Decode/Dispatch: what is it? What are the

dependences• Issue: Go – all dependences satisfied• Execute: perform operation• Complete: result available to other insts.• Commit: result available to outsiders

• We’ll start w/ Decode/Dispatch• Then we’ll consider Issue

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

OOO Scheduling• Instruction @ Decode:

– Do I have dependences yet to be satisfied?– Yes, stall until they are– No, clear to issue

• Wakeup Instructions Stalled:– Dependences satisfied– Allow instruction to issue

• Dependence:– (later instruction, earlier instruction) & type

• We’ll first consider RAW and then move on to WAW and WAR

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Stalling @ Decode for RAW

• Are there unsatisfied dependences?– RAW: have to wait for register value– We don’t really care who is producing the

value– Only whether it is available

• Can use the Register Availability Vector as in pipelining/superscalar– Also known as scoreboard

• At Decode– Reset bit corresponding to your target– At writeback set– Check all bits for source regs: if any is 0 stall

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Issuing Instructions: Scheduling• Determine when an instruction can issue

– Ignore resources for the time being• Stalled because of RAW w/ preceding instruction• Concept:

– Producer (write) notifies consumers (read)• Requirements:

– Consumers need to be able to identify producer– The register name is one possible link

• Mechanism– Consumer placed in a reservation station – Producers on complete broadcasts identity– Waiting instructions observe– Update Operand Availability – Issue if all operands now available

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Reservation Station

• State pertaining to an instruction– What registers it reads– Whether they are available– What is the destination register– What state is the instruction in

• Waiting• Executing

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-Of-Order Exec. Exampleloop: add r4, r4, 4

ld r2, 10(r4) 4 cycles latadd r3, r3, r2sub r1, r1, 1bne r1, r0, loop

1 1 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

Cycle 0

status

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Out-Of-Order Exec. Example: Cycle 0

1 1 1 0

r1 r2 r3 r4

RAVop src1 src2 tgt

Cycle 0

add r4/1 NA/1 r4/0 Rdy

status

loop: add r4, r4, 4ld r2, 10(r4) 5 cycles latadd r3, r3, r2sub r1, r1, 1bne r1, r0, loop

Ready to be executed

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 1loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Exec

status

ld r4/1 NA/1 r2 RdyR4 gets produced now

Notify those waiting for R4

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 2loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 ExecWait for r2

Result available @ cycle 6

add r3/1 r2/0 r3 Wait

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 3loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

0 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 ExecWait for r2

Result available @ cycle 6

add r3/1 r2/0 r3 Wait

sub r1/1 NA/1 r1 RdyNo dependences

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 4loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 ExecWait for r2

Result available @ cycle 6

add r3/1 r2/0 r3 Wait

sub r1/1 NA/1 r1 Execr1 produced nowNotify consumers

bne r1/1 r0/1 NA Rdyr1 will be available next cycle

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 5loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 ExecWait for r2

Result available @ cycle 6

add r3/1 r2/0 r3 Wait

sub r1/1 NA/1 r1 ComplCompleted

bne r1/1 r0/1 NA Execexecuting

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 6loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 1 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 ExecWait for r2

Result available @ cycle 6Notify consumers

add r3/1 r2/1 r3 Rdy

sub r1/1 NA/1 r1 ComplCompleted

bne r1/1 r0/1 NA Execexecuting

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 7loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 1 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 CmtdExecuting

Notify consumers

add r3/1 r2/1 r3 Exec

sub r1/1 NA/1 r1 Compl

Completedbne r1/1 r0/1 NA Compl

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Cycle 8loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 1 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 Cmtd

add r3/1 r2/1 r3 Cmtd

sub r1/1 NA/1 r1 Cmtd

bne r1/1 r0/1 NA Cmtd

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Notifying Consumers

• Identity of Producer• Uniquely Identify the Instruction• Easily retrievable @ decode by others

– Target Register• Recall we stall on WAR or WAW

– Functional Unit • If not pipelined

– Place in instruction window– PC? not. Why?

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Name Dependences and OOO

• WAW or WAR: We need to update register but others are still using it– add r1, r1, 10– sw r1, 20(r2)– add r1, r3, 30– sub r2, r1, 40

• There is only one r1– sw needs to see the value of 1st add– sub needs to wait for 2nd add and not 1st

• Solution: Stall decode when WAW or WAR

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Detecting WAW and WAR

• WAW? Look at Scoreboard– If bit is 0 then there is a pending write– Stall

• WAR? Need to know whether all preceding consumers have read the value– Keep a count per register– Increase at decode for all reads– Decrease on issue

• More elegant solution via register renaming– Soon

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Window vs. Scheduler• Window

– Distance between oldest and youngest instruction that can co-exist inside the CPU

– Larger window Potential for more ILP• Scheduler

– Number of instructions that are waiting to be issued

• Window– Instructions enter at Fetch– Exit at Commit

• Scheduler– Instructions enter at Decode– Leave at writeback

• Window >= Scheduler– Can be the same structure

• In window but not in scheduler completed

inst

ruct

ion

s

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Scoreboarding• Schedule based on RAW dependences• WAW and WAR cause stalls

– WAW at decode– WAR at writeback

• Optimization: Why is this OK?

• Implemented in the CDC 6600 in ‘64– 18 non-pipelined FUs

• 4 FP: 2 mul, 1 add, 1 div• 7 MEM: 5 load, 2 store• 7 INT: add, shift, logical etc.

• Centralized Control Scheme– Controls all Instruction Issue– Detects all hazards

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

MIPS/DLX w/ Scoreboarding

RegisterFile

FP mul

FP mul

FP divide

FP add

FP integer

scoreboard

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Scoreboarding Overview• Ignore IF and MEM for simplicity• 4-stage execution

– Issue Check for structural hazardsCheck for WAW hazardsStall until all clear

– ReadOp Check for RAW hazardsWait until all operands readyRead Registers

– Execute Execute OperationsNotify scoreboard when complete

– Write Check for WAR hazardsStall Write until all clear

• A completing instruction cannot write dest if an earlier instruction has not read dest.

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Scoreboarding Optimizations/Tricks

• WAW as in original OOO• WAR is optimized

– Second Producer is allowed to execute up to complete

– It is stalled there until preceding consumers complete

• No Commit– No precise interrupts

• Window is implemented in the scoreboard• One entry per Functional Unit

– Recall not pipelined– Instructions identified by FU id

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Scoreboarding Organization• Three structures

– Instruction Status– Functional Unit Status– Register Result Status

• Instruction Status– Which stage the instruction is currently in

• Functional Unit Status: scheduling– Busy– OP Operation– Fi Dest. Reg.– Fj, Fk Source Regs– Qj, Qk FUs producing sources– Rj, Rk Ready bits for sources

• Register Result Status: dep. determination– Which FU will produce a register

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Scoreboarding explained

• Register status reg:– Which FU produces the register

• Use at decode– Source reg match is a RAW– Target reg macth is a WAW stall

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Functional Unit Status• Busy:

– resource allocation• OP:

– what to do once issued (e.g., add, sub)• Dest. Reg.:

– Where to write result– To find WAR

• Fj, Fk Source Regs– for WAR: can’t write if consumers pending for

previous value of register (if FU not the same)• Qj, Qk FUs producing sources

– To wait for appropriate producer• Rj, Rk Ready bits for sources

– To determine when ready: all ready

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Scoreboarding ExampleInstruction status Read Execution WriteInstruction j k Issue operandscomplete ResultLD F6 34+ R2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional Unit Statusdest S1 S2 FU for j FU for k Fj? Fk?

Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd NoDivide No

Register result status

ClockF0 F2 F4 F6 F8 F10 F12 ... F30

FU

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Example: Cycle 0Instruction status Read Execution WriteInstruction j k Issue operandscomplete ResultLD F6 34+ R2 1LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional Unit Statusdest S1 S2 FU for j FU for k Fj? Fk?

Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger yes LD F6Mult1 NoMult2 NoAdd NoDivide No

Register result status

ClockF0 F2 F4 F6 F8 F10 F12 ... F30

FU integer

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Example, contd.

• The rest you’ll find on the web site• Go through it• Source: Patterson

• Summary:– Execution proceeds in an order dictated by

dependences– RAW, WAR and WAW force ordering– Tricks may be possible

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Beyond Simple OoO

A B

CD

E

A: LF F6, 34(R2)B: LF F2, 45(R3)C: MULF F0, F2, F4D:SUBF F8, F2, F6E: ADDF F2, F7, F4

• E will wait for B, C and D. • WAR w/ C and D• WAW w/ B• Can we do better?

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

What if we had infinite registersA: LF F6, 34(R2)B: LF F2, 45(R3)C: MULF F0, F2, F4D:SUBF F8, F2, F6E: ADDF F2, F7, F4

A: LF F6, 34(R2)B: LF F2, 45(R3)C: MULF F0, F2, F4D:SUBF F8, F2, F6E: ADDF F9, F7, F4

No false dependences anymoreSince we do not reuse a name we can’t have WAW

and WAR

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Why we can’t have Infinite Registers

• False/Name dependences (WAR and WAW)– Artifact of having finite registers

• There is no such thing as infinite• There is no such thing as large enough

– Well there is (in a sec.)– Computers execute Billions of Instructions

per sec. Even a multi-billion register file would soon be exhausted

• Want to exploit parallelism across several instances of the same code– Loops, recursive functions (most frequent

part)

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Yes, there is “large enough”

� At any given point there will be a finite number of instructions in the window

� if each instruction has a single register target

� if there are N instructions� How many registers do we need?

� N?� N + X?

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Register Renaming• Register Version

– Every Write creates a new version– Uses read the last version– Need to keep a version until all uses have read it.

• Register Renaming:– Architectural vs. Physical Registers

• more phys. than arch.– Maintain a map of arch. to phys. regs.– Use in-order decoding to properly identify

dependences.– Instructions wait only for input op. availability.– Only last version is written to reg. file.

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Register RenamingA: DIVF F3, F1, F0 r1, -, -B: SUBF F2, F1, F0 r2, -, -C: MULF F0, F2, F4 r3, r2, -D: SUBF F6, F2, F3 r4, r2, r1E: ADDF F2, F5, F4 r5, -, -F: ADDF F0, F0, F2 r6, r3, r5

Register Rename TableF0 F1 F2 F3 F5 F6 F7 ... F30

A R1B R2 R1C R3 R2 R1D R3 R2 R1 R4E R3 R5 R1 R4F R6 R5 R1 R4

Need more physical registers than architecturalIgnore control flow for the time being.

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Register Renaming Process

• Only need to remember last producer of each architectural register– Vector

• At decode– Find the most recent producers for all

source registers– After: declare self as most recent producer

of target register• Complication:

– May have to retract• Speculative Execution, e.g., interrupts

– Need to be able to restore the mapping state

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Register Renaming Support Structures

• Register Rename Table– f(aR) = pR– one entry per architectural Register

• Free Register List– Lists not used Physical Registers

• At Decode– grab a new register from the free list– Change mapping in rename table

• At Commit– Release Register? Not… Why?– Could release previous version

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

How Many Physical Registers?

• Correctness:– At least as many architectural plus?

• Performance:– As many as possible– Not correctness– Recall not all instructions produce register

results• stores and branches

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Dynamic Scheduling

A: DIVF F3, F1, F0 r1, -, -

B: SUBF F2, F1, F0 r2, -, -

C: MULF F0, F2, F4 r3, r2, -

D: SUBF F6, F2, F3 r4, r2, r1

E: ADDF F2, F5, F4 r5, -, -

F: ADDF F0, F0, F2 r6, r3, r5

Name Value- Values and Names flow together- Writeback specifies both value and name- A waiting instruction inspects all results- It is allowed to execute when all inputs are available

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Physical Registers

• Physical register file is just one option• What we need is separate storage

– Consumers could keep values in their reservation station

– Tomasulo’s next

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s Algorithm• IBM 360/91 - Fast 360 for scientific code

– Completed in 1967– Dynamic scheduling– Predates cache memories

• Pipelined FUs– Adder up to 3 instructions– Multiplier up to 2 instructions

• Tomasulo vs. Scoreboard– Distributed hazard detection and control– Results are bypassed to FUs– Common Data Bus (CDB) for results

• All results visible to all instead of via a register

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

DLX w/ Tomasulo• Tomasulo’s Algorithm

– Use “tags” to identify data values– Reservation stations distributed control– CDB broadcasts all results to all RSs

• Extend DLX as example– Assume multiple FUs than pipelined– Main difference is Register-Memory Insts.

• I.e., DLX does not have them• But that’s really a detail :-)

• Physical Registers?– Not really. What we need is different storage and

name for every version.– Here it’s the producing reservation station

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Dynamic DLX

adders Mults

Load buffers Store buffers

CDB

RSRS

Operation Stack Registers

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s Algorithm• 3 major steps

– Dispatch• Get instruction from fetch queue• ALU op: check for available RS• Load: Check for available load buffer• If available: dispatch and copy read regs to RS or

load buffer• if not: stall - structural hazard

– Issue• If all ops are available: issue• If not monitor CDB for operands

– Complete• If CDB available, broadcast result• else stall

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s Algorithm contd.• Reservation stations

– Handle distributed hazard detection and instruction control

• Everything receiving data get its tag– 4-bit tag specifies reservation station or load buffer– Also which FU will produce result

• Register specifier is used to assign tags– Then they are discarded– Input register specifiers are ONLY used in dispatch.

(Rename table)• Common Data Bus:

– value + “tag” = where this comes from– vs. typical bus: value + “tag” = where this goes to

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s Algorithm Contd.

• Reservation Stations– Op Opcode– Qj, Qk Tag Fields (source ops)– Vj, VkOperand values (source ops)– Busy Currently in use

• Register file and Store Buffer– Qi Tag field– Busy Currently in use– Vi Value

• Load Buffers– Busy Currently in Use

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Arch.Reg. Name

Tomasulo’s: Understanding Speculative vs. Architectural State

• add r1, r2, 10• sub r4, r1, 20• add r1, r3, 30

Value of r1I have it

Value of r2I have it

Value of r3I have it

Value of r4I have it

Register file

Whe

re is

the

regi

ster

?

Can be: “I have it”, “reservation station id”

Value of Src1NA NA Value of Src2NA

tgt src2

Value of Src1NA NA Value of Src2NA

Reservation Stations

Reg Arch. name

src1

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Renaming 1st Instruction

• add r1, r2, 10• sub r4, r1, 20• add r1, r3, 30

-----RS0

Value of r2I have it

Value of r3I have it

Value of r4I have it

Register file

Value of R2r1 I have it 10I have it

tgt src2

Value of Src1NA NA Value of Src2NA

Reservation Stations

src1

Value of Src1NA NA Value of Src2NA

RS0

• Read sources (r2)• Rename r1 to RS0

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Renaming 2nd Instruction

-----RS0

Value of r2I have it

Value of r3I have it

----RS1

Register file

Value of R2r1 I have it 10I have it

tgt src2

----r4 RS0 20I have it

Reservation Stations

src1

Value of Src1NA NA Value of Src2NA

RS1

• Sources: r1 in RS0 NYA• Rename r4 to RS1

• add r1, r2, 10• sub r4, r1, 20• add r1, r3, 30

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Renaming 3rd Instruction

-----RS2

Value of r2I have it

Value of r3I have it

----RS1

Register file

Value of R2r1 I have it 10I have it

tgt src2

----r4 RS0 20I have it

Reservation Stations

src1

Value of R3r1 I have it 30I have itRS2

• Sources: r3 Avail.• Rename r1 to RS2

• add r1, r2, 10• sub r4, r1, 20• add r1, r3, 30

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Example: cycle 0Instruction status Execution Write

Instruction j k Issue complete Result

LD F6 34+ R2LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk

0 Add1 No0 Add2 No0 Add3 No0 Mult1 No0 Mult2 No

Register result status

F0 F2 F4 F6 F8 F10 ...

FU

Busy AddressLoad1 NoLoad2 NoLoad3 No

load buffers

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Example: cycle 1Instruction status Execution Write

Instruction j k Issue complete Result

LD F6 34+ R2 1LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk

0 A1 No0 A2 No0 A3 No0 M1 No0 M2 No

Register result status

F0 F2 F4 F6 F8 F10 ...

FU L1

Busy AddressL1 yesL2 NoL3 No

load buffers

34+R2

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Example: cycle 3Instruction status Execution Write

Instruction j k Issue complete Result

LD F6 34+ R2 1 3LD F2 45+ R3 2MULTDF0 F2 F4 3SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk

0 A1 No0 A2 No0 A3 No0 M1 Yes Mul R(F4) L20 M2 No

Register result status

F0 F2 F4 F6 F8 F10 ...

FU M1 L2 L1

Busy AddressL1 yesL2 NoL3 No

load buffers

34+R245+R3

- Mul is issued vs. scoreboard- What’s waiting for L1?

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Example…

• Check the web site…• Too much for in-class• Summary:

– Execution proceeds in any order that does not violate RAW dependences

– WAR and WAW are removed

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s vs. ScoreboardInstruction status Execution WriteInstruction j k Issue complete ResultLD F6 34+ R2 1 3 4LD F2 45+ R3 2 4 5MULTDF0 F2 F4 3 15 16SUBDF8 F6 F2 4 7 8DIVDF10 F0 F6 5 56 57ADDDF6 F8 F2 6 10 11

- In-order issue- Out-of-order execution- Out-of-order completion

Instruction status Read Execution WriteInstruction j k Issue operandscomplete ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61 62ADDD F6 F8 F2 13 14 16 22

Scoreboard:

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s• Out-of-order loads and stores?

– What about WAW, RAW and WAR?– Compare all load addresses against the addresses of

all preceding store buffers– Stall if they match

• CDB is a bottleneck– One write per cycle– Could duplicate– But, come at a cost– Datapath + duplicated tags and control

• Complex Implementation– Scalability?– All results to all sources– What if we want 128 instrs?

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Tomasulo’s• Advantages

– Distribution of hazard detection– Elimination of WAR and WAW stalls

• Common Data Bus– Broadcasts result to multiple instrs (+)– Bottleneck

• Register Renaming– Removes WAR and WAW hazards– More interesting when same code appears twice

• Think of loops• More on this later

– BUT: Associative lookups– RECALL: direct map is faster

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

In SummaryFeature Scoreboarding Tomasulo's

CDC6600 IBM 360

Structural Stall in Issue for Stall in DispatchFU for RS

Stall in RS for FURAW Via Registers From CDB

WAR Stall in WB Copy Value to RS

WAW Stall in Issue Register Renaming

Logic Centralized Distributed

Bottlenecks No Register One CDBBypassStall in issue block