A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Page 1: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-of-Order Execution Scheduling

Page 2: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Instruction Level Parallel Processing

• Sequential Execution Semantics
• Out-of-Order Execution
  – How it can help
  – Issues:
    • Maintaining Sequential Semantics
    • Scheduling
      – Scoreboard
    • Register Renaming
• Initially, we’ll focus on Registers, Memory later on

Page 3: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Sequential Semantics - Review

• Instructions appear as if they executed:
  – In the order they appear in the program
  – One after the other

[Figure: timing diagrams in program order for Pipelining, Superscalar and Out-of-Order execution]

Page 4: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-of-Order Execution

do {
  sum += a[++m];
  i--;
} while (i != 0);

loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

[Figure: pipeline timing diagrams (fetch, decode, execute) for add, ld, add, sub, bne, shown once for Superscalar and once for out-of-order execution]

Page 5: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Sequential Semantics?

[Figure: out-of-order pipeline timing diagram (fetch, decode, execute) for add, ld, add, sub, bne, with points in time labelled "inconsistent" and "consistent"]

• Execution does NOT adhere to sequential semantics
• To be precise: Eventually it may
• Simplest solution: Define problem away
• Not acceptable today: e.g., Virtual Memory
• Three-phase Instruction execution
  – In-Progress, Completed and Committed

Page 6: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-of-order Execution Issues

• Preserving Sequential Semantics

• Stalling Instructions w/ dependences

• Issuing Instructions when dependences are satisfied

Page 7: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Back to Sequential Semantics

• Instr. exec. in 3 phases:
  – In-progress, Completed, Committed
  – OOO for in-progress and Completed
  – In-order Commits
• Completed - out-of-order: “Visible only inside”
  – Results visible to subsequent instructions
  – Results not visible to outsiders
  – On interrupts completed results are discarded
• Committed - in-order: “Visible to all”
  – Results visible to subsequent instructions
  – Results visible to outsiders
  – On interrupt committed results are preserved

Page 8: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

How Completes Help w/ Performance

DIV R3, _, _
ADD R1, _, _
ADD _, R1, _

[Figure: timelines (time axis) comparing in-order completes with out-of-order completes and in-order commits; pipeline diagram (fetch, decode, execute, complete, commit) for add, ld, add, sub, bne]

Page 9: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Implementing Completes/Commits

• Key idea:
  – Maintain sufficient state around to be able to roll-back when necessary
  – Roll-back:
    • Discard (aka Squash) all not committed
• One solution (conceptual):
  – Upon Complete, instruction records previous value of target register
  – Upon Discard, instruction restores target value
  – Upon Commit, nothing to do
• We will return to this shortly
• Focus on scheduling mechanisms
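
The conceptual roll-back scheme on this slide can be sketched as a small undo log. This is only an illustration under simplifying assumptions: completions are logged in program order, and the names `UndoLog`, `complete`, `commit_oldest` and `squash` are made up for the example rather than taken from the lecture.

```python
class UndoLog:
    """Toy model: each completed instruction remembers the old value of its target."""

    def __init__(self, regs):
        self.regs = dict(regs)   # current (possibly speculative) register values
        self.log = []            # (target, previous value), oldest first

    def complete(self, target, value):
        # Upon Complete: record the previous value, then update the register.
        self.log.append((target, self.regs[target]))
        self.regs[target] = value

    def commit_oldest(self):
        # Upon Commit: nothing to restore, just drop the oldest record.
        self.log.pop(0)

    def squash(self):
        # Upon Discard: undo all uncommitted writes, youngest first.
        while self.log:
            target, old = self.log.pop()
            self.regs[target] = old


rf = UndoLog({"r1": 5, "r2": 7})
rf.complete("r1", 10)   # oldest instruction
rf.complete("r2", 20)
rf.complete("r1", 30)   # youngest instruction
rf.commit_oldest()      # the first write to r1 is now permanent
rf.squash()             # interrupt: discard everything not committed
print(rf.regs)
```

After the squash, only the committed write survives: r1 holds 10 and r2 is back to 7.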

Page 10: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-of-Order Execution Overview

[Figure: program forms and the processing phases that connect them]
Program Form: Static program, dynamic inst. stream (trace), execution window, completed instructions
Processing Phase: Dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit
Instruction states: In-Progress, Completed, Committed

Page 11: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-of-Order Execution: Stages

• Fetch: get instruction from memory
• Decode/Dispatch: what is it? What are the dependences?
• Issue: Go – all dependences satisfied
• Execute: perform operation
• Complete: result available to other insts.
• Commit: result available to outsiders

• We’ll start w/ Decode/Dispatch
• Then we’ll consider Issue

Page 12: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

OOO Scheduling

• Instruction @ Decode:
  – Do I have dependences yet to be satisfied?
  – Yes, stall until they are
  – No, clear to issue
• Wakeup Instructions Stalled:
  – Dependences satisfied
  – Allow instruction to issue
• Dependence:
  – (later instruction, earlier instruction) & type
• We’ll first consider RAW and then move on to WAW and WAR

Page 13: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Stalling @ Decode for RAW

• Are there unsatisfied dependences?
  – RAW: have to wait for register value
  – We don’t really care who is producing the value
  – Only whether it is available
• Can use the Register Availability Vector as in pipelining/superscalar
  – Also known as scoreboard
• At Decode
  – Reset bit corresponding to your target
  – At writeback set
  – Check all bits for source regs: if any is 0 stall
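
The Register Availability Vector check can be sketched in a few lines. This is a minimal model, not a description of real hardware; the class and method names are hypothetical, and it checks source bits before resetting the target bit so that an instruction such as add r4, r4, 4 does not stall on itself.

```python
class RegisterAvailabilityVector:
    def __init__(self, names):
        self.ready = {r: 1 for r in names}   # 1 = value available

    def must_stall(self, sources):
        # Check all bits for source regs: if any is 0, stall.
        return any(self.ready[r] == 0 for r in sources)

    def decode(self, target):
        # At Decode: reset the bit corresponding to the target register.
        if target is not None:
            self.ready[target] = 0

    def writeback(self, target):
        # At writeback: set the bit again.
        self.ready[target] = 1


rav = RegisterAvailabilityVector(["r1", "r2", "r3", "r4"])
rav.decode("r2")                                 # ld r2, 10(r4) is in flight
stalled_before = rav.must_stall(["r3", "r2"])    # add r3, r3, r2 must wait
rav.writeback("r2")                              # the load writes r2
stalled_after = rav.must_stall(["r3", "r2"])
print(stalled_before, stalled_after)
```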

Page 14: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Issuing Instructions: Scheduling

• Determine when an instruction can issue
  – Ignore resources for the time being
• Stalled because of RAW w/ preceding instruction
• Concept:
  – Producer (write) notifies consumers (read)
• Requirements:
  – Consumers need to be able to identify producer
  – The register name is one possible link
• Mechanism
  – Consumer placed in a reservation station
  – Producer, on complete, broadcasts its identity
  – Waiting instructions observe
  – Update Operand Availability
  – Issue if all operands now available
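
The producer-notifies-consumers mechanism can be sketched as follows. It assumes the register name is the link between producer and consumer, as the slide suggests; `Entry`, `Scheduler`, `place` and `broadcast` are illustrative names only.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    op: str
    waiting_on: set = field(default_factory=set)  # source regs not yet available
    state: str = "Wait"


class Scheduler:
    def __init__(self):
        self.stations = []

    def place(self, op, sources, unavailable):
        # Consumer is placed in a reservation station at decode.
        pending = {s for s in sources if s in unavailable}
        entry = Entry(op, pending, "Wait" if pending else "Rdy")
        self.stations.append(entry)
        return entry

    def broadcast(self, produced_reg):
        # Producer broadcasts its identity (here: the target register name).
        for e in self.stations:
            if e.state == "Wait":
                e.waiting_on.discard(produced_reg)
                if not e.waiting_on:
                    e.state = "Rdy"   # all operands now available


sched = Scheduler()
consumer = sched.place("add r3, r3, r2", ["r3", "r2"], unavailable={"r2"})
state_before = consumer.state
sched.broadcast("r2")          # the load completes and announces r2
state_after = consumer.state
print(state_before, "->", state_after)
```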

Page 15: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Reservation Station

• State pertaining to an instruction
  – What registers it reads
  – Whether they are available
  – What is the destination register
  – What state is the instruction in
    • Waiting
    • Executing

Page 16: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-Of-Order Exec. Example

loop: add r4, r4, 4
      ld  r2, 10(r4)    4 cycles lat
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

Cycle 0

RAV:  r1  r2  r3  r4
       1   1   1   1

op   src1  src2  tgt  status
(no entries yet)

Page 17: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Out-Of-Order Exec. Example: Cycle 0

loop: add r4, r4, 4
      ld  r2, 10(r4)    4 cycles lat
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   1   1   0

op   src1  src2  tgt   status
add  r4/1  NA/1  r4/0  Rdy     Ready to be executed

Page 18: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 1

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   0   1   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Exec    R4 gets produced now; notify those waiting for R4
ld   r4/1  NA/1  r2   Rdy

Page 19: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 2

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2

Page 20: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 3

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       0   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
sub  r1/1  NA/1  r1   Rdy     No dependences

Page 21: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 4

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
sub  r1/1  NA/1  r1   Exec    r1 produced now; notify consumers
bne  r1/1  r0/1  NA   Rdy     r1 will be available next cycle

Page 22: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 5

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   0   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6
add  r3/1  r2/0  r3   Wait    Wait for r2
sub  r1/1  NA/1  r1   Compl   Completed
bne  r1/1  r0/1  NA   Exec    Executing

Page 23: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 6

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   1   0   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Exec    Result available @ cycle 6; notify consumers
add  r3/1  r2/1  r3   Rdy
sub  r1/1  NA/1  r1   Compl   Completed
bne  r1/1  r0/1  NA   Exec    Executing

Page 24: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 7

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   1   1   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Cmtd
add  r3/1  r2/1  r3   Exec    Executing; notify consumers
sub  r1/1  NA/1  r1   Compl   Completed
bne  r1/1  r0/1  NA   Compl   Completed

Page 25: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Cycle 8

loop: add r4, r4, 4
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop

RAV:  r1  r2  r3  r4
       1   1   1   1

op   src1  src2  tgt  status
add  r4/1  NA/1  r4   Cmtd
ld   r4/1  NA/1  r2   Cmtd
add  r3/1  r2/1  r3   Cmtd
sub  r1/1  NA/1  r1   Cmtd
bne  r1/1  r0/1  NA   Cmtd

Page 26: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Notifying Consumers

• Identity of Producer
• Uniquely Identify the Instruction
• Easily retrievable @ decode by others
  – Target Register
    • Recall we stall on WAR or WAW
  – Functional Unit
    • If not pipelined
  – Place in instruction window
  – PC? No. Why not?

Page 27: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Name Dependences and OOO

• WAW or WAR: We need to update register but others are still using it
  – add r1, r1, 10
  – sw r1, 20(r2)
  – add r1, r3, 30
  – sub r2, r1, 40
• There is only one r1
  – sw needs to see the value of 1st add
  – sub needs to wait for 2nd add and not 1st
• Solution: Stall decode when WAW or WAR

Page 28: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Detecting WAW and WAR

• WAW? Look at Scoreboard
  – If bit is 0 then there is a pending write
  – Stall
• WAR? Need to know whether all preceding consumers have read the value
  – Keep a count per register
  – Increase at decode for all reads
  – Decrease on issue
• More elegant solution via register renaming
  – Soon
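
The WAW and WAR checks can be sketched with one pending-write bit and one reader count per register. The sketch models only name hazards (RAW stalls are ignored), and `NameHazardDetector` with its methods is a hypothetical helper; the usage replays the add / sw / add sequence from the Name Dependences slide.

```python
class NameHazardDetector:
    def __init__(self, names):
        self.write_pending = {r: False for r in names}  # scoreboard bit is 0
        self.pending_reads = {r: 0 for r in names}      # decoded, not yet issued readers

    def can_decode(self, target):
        if target is None:
            return True
        waw = self.write_pending[target]        # an earlier write is still pending
        war = self.pending_reads[target] > 0    # an earlier reader has not read yet
        return not (waw or war)

    def decode(self, sources, target):
        for s in sources:
            self.pending_reads[s] += 1          # increase at decode for all reads
        if target is not None:
            self.write_pending[target] = True

    def issue(self, sources):
        for s in sources:
            self.pending_reads[s] -= 1          # decrease on issue

    def writeback(self, target):
        self.write_pending[target] = False


d = NameHazardDetector(["r1", "r2", "r3"])
d.decode(["r1"], "r1")                # add r1, r1, 10
d.issue(["r1"])
d.decode(["r1", "r2"], None)          # sw r1, 20(r2)
waw_stall = not d.can_decode("r1")    # add r1, r3, 30: first add still pending
d.writeback("r1")                     # first add writes r1
war_stall = not d.can_decode("r1")    # sw has not read r1 yet
d.issue(["r1", "r2"])                 # sw reads its operands
clear = d.can_decode("r1")
print(waw_stall, war_stall, clear)
```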

Page 29: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Window vs. Scheduler

• Window
  – Distance between oldest and youngest instruction that can co-exist inside the CPU
  – Larger window: potential for more ILP
• Scheduler
  – Number of instructions that are waiting to be issued
• Window
  – Instructions enter at Fetch
  – Exit at Commit
• Scheduler
  – Instructions enter at Decode
  – Leave at writeback
• Window >= Scheduler
  – Can be the same structure
• In window but not in scheduler: completed instructions

Page 30: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Scoreboarding

• Schedule based on RAW dependences
• WAW and WAR cause stalls
  – WAW at decode
  – WAR at writeback
    • Optimization: Why is this OK?
• Implemented in the CDC 6600 in ‘64
  – 18 non-pipelined FUs
    • 4 FP: 2 mul, 1 add, 1 div
    • 7 MEM: 5 load, 2 store
    • 7 INT: add, shift, logical etc.
• Centralized Control Scheme
  – Controls all Instruction Issue
  – Detects all hazards

Page 31: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

MIPS/DLX w/ Scoreboarding

[Figure: block diagram of a Register File connected to the functional units (FP mul, FP mul, FP divide, FP add, integer), with the scoreboard controlling them]

Page 32: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Scoreboarding Overview

• Ignore IF and MEM for simplicity
• 4-stage execution
  – Issue: Check for structural hazards; check for WAW hazards; stall until all clear
  – ReadOp: Check for RAW hazards; wait until all operands ready; read registers
  – Execute: Execute operations; notify scoreboard when complete
  – Write: Check for WAR hazards; stall write until all clear
    • A completing instruction cannot write dest if an earlier instruction has not read dest.

Page 33: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Scoreboarding Optimizations/Tricks

• WAW as in original OOO
• WAR is optimized
  – Second Producer is allowed to execute up to complete
  – It is stalled there until preceding consumers complete
• No Commit
  – No precise interrupts
• Window is implemented in the scoreboard
• One entry per Functional Unit
  – Recall not pipelined
  – Instructions identified by FU id

Page 34: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Scoreboarding Organization

• Three structures
  – Instruction Status
  – Functional Unit Status
  – Register Result Status
• Instruction Status
  – Which stage the instruction is currently in
• Functional Unit Status: scheduling
  – Busy
  – OP: Operation
  – Fi: Dest. Reg.
  – Fj, Fk: Source Regs
  – Qj, Qk: FUs producing sources
  – Rj, Rk: Ready bits for sources
• Register Result Status: dep. determination
  – Which FU will produce a register

Page 35: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Scoreboarding explained

• Register status reg:
  – Which FU produces the register
• Use at decode
  – Source reg match is a RAW
  – Target reg match is a WAW stall

Page 36: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Functional Unit Status

• Busy:
  – resource allocation
• OP:
  – what to do once issued (e.g., add, sub)
• Dest. Reg.:
  – Where to write result
  – To find WAR
• Fj, Fk Source Regs
  – for WAR: can’t write if consumers pending for previous value of register (if FU not the same)
• Qj, Qk FUs producing sources
  – To wait for appropriate producer
• Rj, Rk Ready bits for sources
  – To determine when ready: all ready
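
A rough sketch of how these fields could drive the Issue and Write checks. It follows the common textbook convention that Rj/Rk are set while a source is ready but not yet read and cleared once it has been read; that convention, and the names `FUStatus`, `can_issue` and `can_write`, are assumptions of this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FUStatus:
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None   # FUs producing the sources
    qk: Optional[str] = None
    rj: bool = False           # source ready and not yet read
    rk: bool = False


def can_issue(fu, dest, units, result_status):
    # Structural hazard: the FU must be free. WAW: no pending write to dest.
    return not units[fu].busy and result_status.get(dest) is None


def can_write(fu, units):
    # WAR: no other busy unit may still be waiting to read our destination.
    fi = units[fu].fi
    for name, u in units.items():
        if name == fu or not u.busy:
            continue
        if (u.fj == fi and u.rj) or (u.fk == fi and u.rk):
            return False
    return True


units = {
    "Add": FUStatus(busy=True, op="ADDD", fi="F6", fj="F8", fk="F2"),
    "Divide": FUStatus(busy=True, op="DIVD", fi="F10", fj="F0", fk="F6",
                       qj="Mult1", rj=False, rk=True),
    "Mult2": FUStatus(),
}
result_status = {"F6": "Add", "F10": "Divide"}

blocked = not can_write("Add", units)   # DIVD has not read F6 yet
units["Divide"].rk = False              # DIVD reads its operands
released = can_write("Add", units)
waw = not can_issue("Mult2", "F6", units, result_status)
print(blocked, released, waw)
```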

Page 37: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Scoreboarding Example

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write Result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional Unit Status
Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
Integer  No
Mult1    No
Mult2    No
Add      No
Divide   No

Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
FU

Page 38: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Example: Cycle 0

Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write Result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Functional Unit Status
Name     Busy  Op  Fi (dest)  Fj (S1)  Fk (S2)  Qj (FU for j)  Qk (FU for k)  Rj (Fj?)  Rk (Fk?)
Integer  yes   LD  F6
Mult1    No
Mult2    No
Add      No
Divide   No

Register result status
Clock  F0  F2  F4  F6  F8  F10  F12  ...  F30
FU                 integer

Page 39: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Example, contd.

• The rest you’ll find on the web site
• Go through it
• Source: Patterson
• Summary:
  – Execution proceeds in an order dictated by dependences
  – RAW, WAR and WAW force ordering
  – Tricks may be possible

Page 40: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Beyond Simple OoO

A: LF F6, 34(R2)
B: LF F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4

[Figure: dependence graph over instructions A, B, C, D and E]

• E will wait for B, C and D.
  – WAR w/ C and D
  – WAW w/ B
• Can we do better?
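
The dependences listed for E can be checked mechanically. The sketch below classifies the dependence between one earlier and one later instruction; it is a pairwise check that ignores intervening writes, and the `(dest, [sources])` tuple encoding is an assumption made for the example.

```python
def hazards(earlier, later):
    """Return the dependence types from `earlier` to `later`.

    Each instruction is a (dest, [sources]) tuple.
    """
    e_dst, e_src = earlier
    l_dst, l_src = later
    found = set()
    if e_dst in l_src:
        found.add("RAW")   # later reads what earlier writes
    if l_dst in e_src:
        found.add("WAR")   # later overwrites what earlier reads
    if l_dst == e_dst:
        found.add("WAW")   # both write the same register
    return found


prog = {
    "A": ("F6", ["R2"]),         # LF   F6, 34(R2)
    "B": ("F2", ["R3"]),         # LF   F2, 45(R3)
    "C": ("F0", ["F2", "F4"]),   # MULF F0, F2, F4
    "D": ("F8", ["F2", "F6"]),   # SUBF F8, F2, F6
    "E": ("F2", ["F7", "F4"]),   # ADDF F2, F7, F4
}
for name in "ABCD":
    print(name, "-> E", sorted(hazards(prog[name], prog["E"])))
```

E has no true (RAW) dependence on any earlier instruction, so every stall it suffers here comes from reusing the name F2.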

Page 41: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

What if we had infinite registers

Original:
A: LF F6, 34(R2)
B: LF F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4

With a fresh register for E:
A: LF F6, 34(R2)
B: LF F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F9, F7, F4

No false dependences anymore
Since we do not reuse a name we can’t have WAW and WAR

Page 42: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Why we can’t have Infinite Registers

• False/Name dependences (WAR and WAW)
  – Artifact of having finite registers
• There is no such thing as infinite
• There is no such thing as large enough
  – Well there is (in a sec.)
  – Computers execute Billions of Instructions per sec. Even a multi-billion register file would soon be exhausted
• Want to exploit parallelism across several instances of the same code
  – Loops, recursive functions (most frequent part)

Page 43: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Yes, there is “large enough”

• At any given point there will be a finite number of instructions in the window
• if each instruction has a single register target
• if there are N instructions
• How many registers do we need?
  – N?
  – N + X?

Page 44: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Register Renaming

• Register Version
  – Every Write creates a new version
  – Uses read the last version
  – Need to keep a version until all uses have read it.
• Register Renaming:
  – Architectural vs. Physical Registers
    • more phys. than arch.
  – Maintain a map of arch. to phys. regs.
  – Use in-order decoding to properly identify dependences.
  – Instructions wait only for input op. availability.
  – Only last version is written to reg. file.

Page 45: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Register Renaming

Instruction             Renamed (dest, src1, src2)
A: DIVF F3, F1, F0      r1, -, -
B: SUBF F2, F1, F0      r2, -, -
C: MULF F0, F2, F4      r3, r2, -
D: SUBF F6, F2, F3      r4, r2, r1
E: ADDF F2, F5, F4      r5, -, -
F: ADDF F0, F0, F2      r6, r3, r5

Register Rename Table (state after each instruction)
     F0  F1  F2  F3  F5  F6  F7  ...  F30
A                R1
B            R2  R1
C    R3      R2  R1
D    R3      R2  R1      R4
E    R3      R5  R1      R4
F    R6      R5  R1      R4

Need more physical registers than architectural
Ignore control flow for the time being.
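
The renaming on this slide can be reproduced with a short sketch. It assumes an unbounded supply of physical names r1, r2, ... handed out in decode order, and prints "-" for a source that has no in-flight producer, matching the slide's notation; the function name `rename` is illustrative.

```python
def rename(program):
    table = {}            # architectural register -> latest physical name
    next_phys = 1
    renamed = []
    for dest, srcs in program:
        # Look up the sources first, using the mapping before this write.
        phys_srcs = [table.get(s, "-") for s in srcs]
        table[dest] = f"r{next_phys}"   # every write creates a new version
        next_phys += 1
        renamed.append((table[dest], *phys_srcs))
    return renamed, table


program = [
    ("F3", ["F1", "F0"]),  # A: DIVF F3, F1, F0
    ("F2", ["F1", "F0"]),  # B: SUBF F2, F1, F0
    ("F0", ["F2", "F4"]),  # C: MULF F0, F2, F4
    ("F6", ["F2", "F3"]),  # D: SUBF F6, F2, F3
    ("F2", ["F5", "F4"]),  # E: ADDF F2, F5, F4
    ("F0", ["F0", "F2"]),  # F: ADDF F0, F0, F2
]
renamed, final_table = rename(program)
for label, row in zip("ABCDEF", renamed):
    print(label, ", ".join(row))
```

Note that F reads r3 for F0 (the version written by C) while writing r6, which is why sources must be looked up before the destination mapping is updated.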

Page 46: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Register Renaming Process

• Only need to remember last producer of each architectural register
  – Vector
• At decode
  – Find the most recent producers for all source registers
  – After: declare self as most recent producer of target register
• Complication:
  – May have to retract
    • Speculative Execution, e.g., interrupts
  – Need to be able to restore the mapping state

Page 47: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Register Renaming Support Structures

• Register Rename Table
  – f(aR) = pR
  – one entry per architectural Register
• Free Register List
  – Lists not used Physical Registers
• At Decode
  – grab a new register from the free list
  – Change mapping in rename table
• At Commit
  – Release Register? Not… Why?
  – Could release previous version
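
A minimal sketch of the rename table and free list working together. It takes the slide's hint literally: at commit, the physical register that held the previous version of the destination is returned to the free list. Physical registers are plain integers here, and `Renamer` is a made-up name.

```python
from collections import deque


class Renamer:
    def __init__(self, arch_regs, num_phys):
        # Initially architectural register i lives in physical register i.
        self.table = {a: p for p, a in enumerate(arch_regs)}
        self.free = deque(range(len(arch_regs), num_phys))

    def decode(self, dest):
        """Allocate a new physical register; return (new, previous) or None."""
        if not self.free:
            return None              # no free register: decode must stall
        new = self.free.popleft()    # grab a new register from the free list
        prev = self.table[dest]
        self.table[dest] = new       # change mapping in rename table
        return new, prev

    def commit(self, prev):
        # The previous version is dead once the overwriting instruction
        # commits, so its register goes back to the free list.
        self.free.append(prev)


r = Renamer(["F0", "F2"], num_phys=3)
first = r.decode("F0")     # uses the only spare physical register
second = r.decode("F2")    # nothing free: stall
r.commit(first[1])         # first instruction commits, old F0 is released
third = r.decode("F2")     # now succeeds
print(first, second, third)
```

The new register itself cannot be released at commit because it now holds the architectural value; only the version it replaced is free.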

Page 48: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

How Many Physical Registers?

• Correctness:
  – At least as many as architectural, plus?
• Performance:
  – As many as possible
  – Not correctness
  – Recall not all instructions produce register results
    • stores and branches

Page 49: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Dynamic Scheduling

A: DIVF F3, F1, F0 r1, -, -

B: SUBF F2, F1, F0 r2, -, -

C: MULF F0, F2, F4 r3, r2, -

D: SUBF F6, F2, F3 r4, r2, r1

E: ADDF F2, F5, F4 r5, -, -

F: ADDF F0, F0, F2 r6, r3, r5

Name Value
- Values and Names flow together
- Writeback specifies both value and name
- A waiting instruction inspects all results
- It is allowed to execute when all inputs are available

Page 50: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Physical Registers

• Physical register file is just one option
• What we need is separate storage
  – Consumers could keep values in their reservation station
  – Tomasulo’s next

Page 51: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s Algorithm

• IBM 360/91 - Fast 360 for scientific code
  – Completed in 1967
  – Dynamic scheduling
  – Predates cache memories
• Pipelined FUs
  – Adder up to 3 instructions
  – Multiplier up to 2 instructions
• Tomasulo vs. Scoreboard
  – Distributed hazard detection and control
  – Results are bypassed to FUs
  – Common Data Bus (CDB) for results
    • All results visible to all instead of via a register

Page 52: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

DLX w/ Tomasulo

• Tomasulo’s Algorithm
  – Use “tags” to identify data values
  – Reservation stations: distributed control
  – CDB broadcasts all results to all RSs
• Extend DLX as example
  – Assume multiple FUs rather than pipelined
  – Main difference is Register-Memory Insts.
    • I.e., DLX does not have them
    • But that’s really a detail :-)
• Physical Registers?
  – Not really. What we need is different storage and name for every version.
  – Here it’s the producing reservation station

Page 53: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Dynamic DLX

[Figure: block diagram with Operation Stack, Registers, Load buffers, Store buffers, reservation stations (RS) in front of the adders and Mults, and the CDB]

Page 54: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s Algorithm

• 3 major steps
  – Dispatch
    • Get instruction from fetch queue
    • ALU op: check for available RS
    • Load: Check for available load buffer
    • If available: dispatch and copy read regs to RS or load buffer
    • if not: stall - structural hazard
  – Issue
    • If all ops are available: issue
    • If not monitor CDB for operands
  – Complete
    • If CDB available, broadcast result
    • else stall

Page 55: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s Algorithm contd.

• Reservation stations
  – Handle distributed hazard detection and instruction control
• Everything receiving data gets its tag
  – 4-bit tag specifies reservation station or load buffer
  – Also which FU will produce result
• Register specifier is used to assign tags
  – Then they are discarded
  – Input register specifiers are ONLY used in dispatch. (Rename table)
• Common Data Bus:
  – value + “tag” = where this comes from
  – vs. typical bus: value + “tag” = where this goes to
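
The "value + tag = where this comes from" idea can be sketched as a broadcast that every reservation station snoops. The field names follow the Vj/Vk and Qj/Qk convention used on the following slides; the class and function names, and the operand values, are invented for the example.

```python
class ReservationStation:
    def __init__(self, tag):
        self.tag = tag
        self.busy = False
        self.op = None
        self.v = [None, None]   # Vj, Vk: operand values
        self.q = [None, None]   # Qj, Qk: tags of producers still awaited

    def ready(self):
        # Can issue once no source is waiting on a producer.
        return self.busy and self.q == [None, None]

    def snoop(self, tag, value):
        # Capture any result on the CDB whose source tag we are waiting for.
        for i in range(2):
            if self.busy and self.q[i] == tag:
                self.v[i], self.q[i] = value, None


def cdb_broadcast(stations, tag, value):
    # One result per cycle goes to all stations, labelled by its producer.
    for rs in stations:
        rs.snoop(tag, value)


m1 = ReservationStation("M1")
m1.busy, m1.op = True, "MUL"
m1.v = [None, 2.5]          # second operand (F4) already read at dispatch
m1.q = ["L2", None]         # first operand will come from load buffer L2
ready_before = m1.ready()
cdb_broadcast([m1], "L2", 4.0)
ready_after = m1.ready()
print(ready_before, ready_after, m1.v)
```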

Page 56: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s Algorithm Contd.

• Reservation Stations
  – Op: Opcode
  – Qj, Qk: Tag Fields (source ops)
  – Vj, Vk: Operand values (source ops)
  – Busy: Currently in use
• Register file and Store Buffer
  – Qi: Tag field
  – Busy: Currently in use
  – Vi: Value
• Load Buffers
  – Busy: Currently in Use

Page 57: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s: Understanding Speculative vs. Architectural State

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file
Arch. Reg. Name  Where is the register?  Value
r1               I have it               Value of r1
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               I have it               Value of r4

“Where is the register?” can be: “I have it”, “reservation station id”

Reservation Stations
tgt (Arch. name)  src1 (where / value)   src2 (where / value)
NA                NA / Value of Src1     NA / Value of Src2
NA                NA / Value of Src1     NA / Value of Src2

Page 58: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Renaming 1st Instruction

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file
Arch. Reg. Name  Where is the register?  Value
r1               RS0                     -----
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               I have it               Value of r4

Reservation Stations
     tgt  src1 (where / value)      src2 (where / value)
RS0  r1   I have it / Value of R2   I have it / 10
     NA   NA / Value of Src1        NA / Value of Src2
     NA   NA / Value of Src1        NA / Value of Src2

• Read sources (r2)
• Rename r1 to RS0

Page 59: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Renaming 2nd Instruction

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file
Arch. Reg. Name  Where is the register?  Value
r1               RS0                     -----
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               RS1                     ----

Reservation Stations
     tgt  src1 (where / value)      src2 (where / value)
RS0  r1   I have it / Value of R2   I have it / 10
RS1  r4   RS0 / ----                I have it / 20
     NA   NA / Value of Src1        NA / Value of Src2

• Sources: r1 in RS0 NYA
• Rename r4 to RS1

Page 60: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Renaming 3rd Instruction

• add r1, r2, 10
• sub r4, r1, 20
• add r1, r3, 30

Register file
Arch. Reg. Name  Where is the register?  Value
r1               RS2                     -----
r2               I have it               Value of r2
r3               I have it               Value of r3
r4               RS1                     ----

Reservation Stations
     tgt  src1 (where / value)      src2 (where / value)
RS0  r1   I have it / Value of R2   I have it / 10
RS1  r4   RS0 / ----                I have it / 20
RS2  r1   I have it / Value of R3   I have it / 30

• Sources: r3 Avail.
• Rename r1 to RS2
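
The three renaming steps can be replayed with a small sketch. It assumes one reservation station per instruction, allocated in order as RS0, RS1, RS2 and never freed, and represents each register-file entry as either ("val", value) for "I have it" or ("tag", station id); these encodings are choices made for the example.

```python
def dispatch(program, regfile):
    """Rename each target to the reservation station that will produce it."""
    stations = []
    for i, (dest, src, imm) in enumerate(program):
        tag = f"RS{i}"
        stations.append({
            "tag": tag,
            "tgt": dest,
            "src1": regfile[src],      # a value, or the tag of the producer
            "src2": ("val", imm),      # the immediate is always available
        })
        regfile[dest] = ("tag", tag)   # later readers must look in this station
    return stations


regfile = {r: ("val", f"value of {r}") for r in ("r1", "r2", "r3", "r4")}
program = [
    ("r1", "r2", 10),   # add r1, r2, 10
    ("r4", "r1", 20),   # sub r4, r1, 20
    ("r1", "r3", 30),   # add r1, r3, 30
]
stations = dispatch(program, regfile)
for rs in stations:
    print(rs["tag"], rs["tgt"], rs["src1"], rs["src2"])
print(regfile["r1"], regfile["r4"])
```

The sub captures the tag RS0 rather than a register name, so the third instruction can rename r1 to RS2 without disturbing it.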

Page 61: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Example: cycle 0

Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Reservation Stations
Time  Name   Busy  Op  Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     Add1   No
0     Add2   No
0     Add3   No
0     Mult1  No
0     Mult2  No

Register result status
      F0  F2  F4  F6  F8  F10  ...
FU

Load buffers
Name   Busy  Address
Load1  No
Load2  No
Load3  No

Page 62: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Example: cycle 1

Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2   1
LD    F2      45+  R3
MULTD F0      F2   F4
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Reservation Stations
Time  Name  Busy  Op  Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     A1    No
0     A2    No
0     A3    No
0     M1    No
0     M2    No

Register result status
      F0  F2  F4  F6  F8  F10  ...
FU                L1

Load buffers
Name  Busy  Address
L1    yes   34+R2
L2    No
L3    No

Page 63: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Example: cycle 3

Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2   1      3
LD    F2      45+  R3   2
MULTD F0      F2   F4   3
SUBD  F8      F6   F2
DIVD  F10     F0   F6
ADDD  F6      F8   F2

Reservation Stations
Time  Name  Busy  Op   Vj (S1)  Vk (S2)  Qj (RS for j)  Qk (RS for k)
0     A1    No
0     A2    No
0     A3    No
0     M1    Yes   Mul           R(F4)    L2
0     M2    No

Register result status
      F0  F2  F4  F6  F8  F10  ...
FU    M1  L2      L1

Load buffers
Name  Busy  Address
L1    yes   34+R2
L2    yes   45+R3
L3    No

- Mul is issued vs. scoreboard
- What’s waiting for L1?

Page 64: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Example…

• Check the web site…
• Too much for in-class
• Summary:
  – Execution proceeds in any order that does not violate RAW dependences
  – WAR and WAW are removed

Page 65: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s vs. Scoreboard

Tomasulo’s:
Instruction status
Instruction   j    k    Issue  Execution complete  Write Result
LD    F6      34+  R2   1      3                   4
LD    F2      45+  R3   2      4                   5
MULTD F0      F2   F4   3      15                  16
SUBD  F8      F6   F2   4      7                   8
DIVD  F10     F0   F6   5      56                  57
ADDD  F6      F8   F2   6      10                  11

- In-order issue
- Out-of-order execution
- Out-of-order completion

Scoreboard:
Instruction status
Instruction   j    k    Issue  Read operands  Execution complete  Write Result
LD    F6      34+  R2   1      2              3                   4
LD    F2      45+  R3   5      6              7                   8
MULTD F0      F2   F4   6      9              19                  20
SUBD  F8      F6   F2   7      9              11                  12
DIVD  F10     F0   F6   8      21             61                  62
ADDD  F6      F8   F2   13     14             16                  22

Page 66: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s

• Out-of-order loads and stores?
  – What about WAW, RAW and WAR?
  – Compare all load addresses against the addresses of all preceding store buffers
  – Stall if they match
• CDB is a bottleneck
  – One write per cycle
  – Could duplicate
  – But, comes at a cost
  – Datapath + duplicated tags and control
• Complex Implementation
  – Scalability?
  – All results to all sources
  – What if we want 128 instrs?
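
The load/store address check from the first bullet can be sketched in a few lines. It is a conservative model that only compares exact addresses of older, still-buffered stores; the function name and the example addresses are made up.

```python
def load_may_issue(load_addr, older_store_addrs):
    """Return True if no older, still-buffered store targets the load's address."""
    return all(load_addr != a for a in older_store_addrs)


store_buffer = [0x1000, 0x2040]              # addresses of preceding stores
independent = load_may_issue(0x3000, store_buffer)
conflicting = load_may_issue(0x2040, store_buffer)
print(independent, conflicting)
```

A load that matches must stall until the store leaves the buffer; one that matches nothing may go ahead of the older stores.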

Page 67: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

Tomasulo’s

• Advantages
  – Distribution of hazard detection
  – Elimination of WAR and WAW stalls
• Common Data Bus
  – Broadcasts result to multiple instrs (+)
  – Bottleneck
• Register Renaming
  – Removes WAR and WAW hazards
  – More interesting when same code appears twice
    • Think of loops
    • More on this later
  – BUT: Associative lookups
  – RECALL: direct map is faster

Page 68: A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Scheduling.

In Summary

Feature      Scoreboarding (CDC6600)   Tomasulo's (IBM 360)
Structural   Stall in Issue for FU     Stall in Dispatch for RS; Stall in RS for FU
RAW          Via Registers             From CDB
WAR          Stall in WB               Copy Value to RS
WAW          Stall in Issue            Register Renaming
Logic        Centralized               Distributed
Bottlenecks  No Register Bypass;       One CDB
             Stall in issue block