A. Moshovos ©ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential...

A. Moshovos © ECE1773 - Fall ‘06 ECE Toronto

Instruction Level Parallel Processing• Sequential Execution Semantics• Superscalar Execution

– Interruptions• Out-of-Order Execution

– How it can help– Issues:

• Maintaining Sequential Semantics• Scheduling

– Scoreboard• Register Renaming

• Initially, we’ll focus on Registers, Memory later on

Sequential Semantics - Review

• Instructions appear as if they executed:– In the order they appear in the program– One after the other

• Pipelining: Partial Overlap of Instructions– Initiate one instruction per cycle– Subsequent instructions overlap partially– Commit one instruction per cycle

Can we do better than pipelining?

loop: ld r2, 10(r1)add r3, r3, r2sub r1, r1, 1bne r1, r0, loop

Pipelining:

sum += a[i--]

fetch decode ldfetch decode add

fetch decode subfetch decode bne

Superscalar:fetch decode ld

fetch decode addfetch decode sub

fetch decode bne

Superscalar - In-order (initial def.)

• Two or more consecutive instructions (in the original program order) can execute in parallel

• Is this much better than pipelining?– What if all instructions were dependent?

• Superscalar buys us nothing

• Again key is typical program behavior– Some parallelism exists– Pipelining “drains” on dependences– Superscalar consumes “fill-up” time

Practicalities

• Issue mechanism– At decode check:

• Dependences • Input operand availability

– Check against Instructions:• Simultaneously Decoded• In-progress in the pipeline (i.e., previously issued)

– Recall the register vector from pipelining

• Increasingly Complex with degree of superscalarity– 2-way, 3-way, …, n-way

Issue Rules

• Stall at decode if:– RAW dependence and no data available

• Source registers against previous targets

– WAR or WAW dependence• Target register against previous targets + sources

– No resource available• This check is done in program order

Issue Mechanism

• Assume 2 source & 1 target max per instr.– comparators for 2-way:

• 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)

– comparators for 4-way:• 2nd instr: 3 tgt and 2 src• 3rd instr: 6 tgt and 4 src• 4th instr: 9 tgt and 6 src

tgt src1 src1

� simplifications may be possible� resource checking not shown

Implications

• Need to multiport some structures– Register File

• Multiple Reads and Writes per cycle

– Register Availability Vector• Multiple Reads and Writes per cycle

– From Decode and Commit– Also need to worry about WAR and WAW

• Resource tracking– Additional issue conditions

• Many Superscalars had additional restrictions– E.g., execute one integer and one floating point op– one branch, or one store/load

Preserving Sequential Semantics• In principle not much different than

pipelining• Program order is preserved in the pipeline• Some instructions proceed in parallel

– But order is clearly defined• Defer interrupts to commit stage (i.e.,

writeback)– Flush all subsequent instructions

• may include instructions committing simultaneously

– Allow all preceding instructions to commit• Recall comparisons are done in program

order• Must have sufficient time in clock cycle to

handle these

Interrupts Example

fetch decode ldfetch decode addfetch decode div

fetch decode bne

Exceptionraised

Exceptiontaken

fetch decode bne

fetch decode ldfetch decode addfetch decode div

fetch decode bne

Exceptionraised

Exceptiontaken

fetch decode bne

Exceptionraised

Superscalar vs. Pipelining

• In principle they are orthogonal– Superscalar non-pipelined machine– Pipelined non-superscalar– Superscalar and Pipelined (common)

• Additional functionality needed by Superscalar:– Another bound on clock cycle– At some point it limits the number of

pipeline stages

Superscalar vs. Superpipelining• Superpipelining:

– Vaguely defined as deep pipelining, i.e., lots of stages• Superscalar issue complexity: limits super-

pipelining• How do they compare?

– Conceptually, not by much.fetch decode instfetch decode inst

fetch decode instfetch decode inst

Case Study: Alpha 21164

21164: Int. Pipe

21164: Memory Pipeline

21164: Floating-Point Pipe

80486 Pipeline• Fetch

– Load 16-bytes from into prefetch buffer• Decode 1

– Determine instruction length and type• Decode 2

– Compute memory address– Generate immediate operands

• Execute– Register Read– ALU– Memory read/write

• Write-back– Update register file

• (source: CS740 CMU, ’97, all slides on 486)

80486 Pipeline detail• Fetch

– Moves 16 bytes of instruction stream into code queue

– Not required every time– About 5 instructions fetched at once (avg. length 2.5

bytes)– Only useful if don’t branch– Avoids need for separate instruction cache

• D1– Determine total instruction length– Signals code queue aligner where next instruction

begins– May require two cycles

• When multiple operands must be decoded• About 6% of “typical” DOS program

80486 Pipeline• D2

– Extract memory displacements and immediate operands

– Compute memory addresses– Add base register, and possibly scaled index register– May require two cycles

• If index register involved, or both address & immediate operand

• Approx. 5% of executed instructions• EX

– Read register operands– Compute ALU function– Read or write memory (data cache)

• WB– Update register result

Out-of-Order Execution

• Also known as dynamic scheduling– Compilers do static scheduling

• We will start by considering register only– Register interface helps a lot

• Later on we will expand to memory– Tricky: Memory interface is more powerful

than registers – Makes it harder to figure out dependences

• In principle the same method will be used for both

Beyond Superscalar Execution

do {sum += a[++m]; i--;

} while (i != 0);

out-of-order

loop: add r4, r4, 1ld r2, 10(r4)add r3, r3, r2sub r1, r1, 1bne r1, r0, loop

Superscalar:

fetch decode

fetch decode add

fetch decode ld

fetch decode add

fetch decode

fetch decode add

fetch decode ld

fetch decode add

fetch decode

fetch decode add

fetch decode ld

fetch decode add

Sequential Semantics?

• Execution does NOT adhere to sequential semantics

• To be precise: Eventually it may• Simplest solution: Define problem away

– Imprecise interrupts• On interrupt some instr. committed some not• software we’ll have to figure out what is going on• Horrible for debugging and programming

inconsistent

consistent

Interrupt

• Recall we use the term interrupt to signify the need to observe the machine’s state after any instruction– This can be indeed the result of interrupt in

the classical sense– But, it could be a debugger (still uses

interrupts)

Out-of-Order vs. Pipelining and Superscalar

• Definition: two or more instructions can execute in any order if they have no dependences (RAW, WAW, WAR)– beware of transitive dependences

• Is this better than pipelining or superscalar exec?– If all are independent: not– if all dependent: not– Programs have some parallelism– Pipelining “drain” and “fill-up” overheads– Superscalar, parallelism only when adjacent– OoO exploits par. even when not adjacent

• OoO Orthogonal to pipelining and Superscalar

Out-of-order Execution Issues

• Preserving Sequential Semantics

• Stalling Instructions w/ dependences

• Issuing Instructions when dependences are satisfied

Back to Sequential Semantics

• Instr. exec. in 3 phases:– In-progress, Completed (NEW), Committed– OOO for in-progress and Completed– In-order Commits

• Completed - out-of-order: ”Visible only inside”– Results visible to subsequent instructions– Results not visible to outsiders

• On interrupts completed results are discarded• Committed - in-order: ”Visible to all”

– Results visible to subsequent instructions– Results visible to outsiders

• On interrupt committed results are preserved

How Completes Help w/ Performance

DIV R3, _, _ADD R1, _, _ADD _, R1, _

In-ordercommits

in-ordercompletes

out-of-order completesin-order commits

complete

fetch decode

fetch decode add

fetch decode ld

fetch decode add

commit

Implementing Completes/Commits

• Key idea:– Maintain sufficient state around to be able to

roll-back when necessary– Roll-back:

• Discard (aka Squash) all not committed

• One solution (conceptual):– Upon Complete instruction records previous

value of target register– Upon Discard, instruction restores target

value– Upon Commit, nothing to do

• We will return to this shortly • Focus on scheduling mechanisms

Out-of-Order Execution the Big PictureProgram Form Processing Phase

Static program

dynamic inst.Stream (trace)

execution window

completed instructions

Dispatch/ dependences

inst. Issue

inst execution

inst. Reorder & commit

Out-of-Order Execution: Stages

• Fetch: get instruction from memory• Decode/Dispatch: what is it? What are the

dependences• Issue: Go – all dependences satisfied• Execute: perform operation• Complete: result available to other insts.• Commit: result available to outsiders

• We’ll start w/ Decode/Dispatch• Then we’ll consider Issue

OOO Scheduling• Instruction @ Decode:

– Do I have dependences yet to be satisfied?– Yes, stall until they are– No, clear to issue

• Wakeup Instructions Stalled:– Dependences satisfied– Allow instruction to issue

• Dependence:– (later instruction, earlier instruction) & type

• We’ll first consider RAW and then move on to WAW and WAR

Stalling @ Decode for RAW

• Are there unsatisfied dependences?– RAW: have to wait for register value– We don’t really care who is producing the

value– Only whether it is available

• Can use the Register Availability Vector as in pipelining/superscalar– Also known as scoreboard

• At Decode– Reset bit corresponding to your target– At writeback set– Check all bits for source regs: if any is 0 stall

Issuing Instructions: Scheduling• Determine when an instruction can issue

– Ignore resources for the time being• Stalled because of RAW w/ preceding instruction• Concept:

– Producer (write) notifies consumers (read)• Requirements:

– Consumers need to be able to identify producer– The register name is one possible link

• Mechanism– Consumer placed in a reservation station – Producers on complete broadcasts identity– Waiting instructions observe– Update Operand Availability – Issue if all operands now available

Reservation Station

• State pertaining to an instruction– What registers it reads– Whether they are available– What is the destination register– What state is the instruction in

• Waiting• Executing

Out-Of-Order Exec. Exampleloop: add r4, r4, 4

ld r2, 10(r4) 4 cycles latadd r3, r3, r2sub r1, r1, 1bne r1, r0, loop

1 1 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

Cycle 0

status

Out-Of-Order Exec. Example: Cycle 0

1 1 1 0

r1 r2 r3 r4

RAVop src1 src2 tgt

Cycle 0

add r4/1 NA/1 r4/0 Rdy

status

loop: add r4, r4, 4ld r2, 10(r4) 5 cycles latadd r3, r3, r2sub r1, r1, 1bne r1, r0, loop

Ready to be executed

Cycle 1loop: add r4, r4, 4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Exec

status

ld r4/1 NA/1 r2 RdyR4 gets produced now

Notify those waiting for R4

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

add r4/1 NA/1 r4 Cmtd

status

ld r4/1 NA/1 r2 ExecWait for r2

Result available @ cycle 6

add r3/1 r2/0 r3 Wait

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

0 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

status

sub r1/1 NA/1 r1 RdyNo dependences

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

status

sub r1/1 NA/1 r1 Execr1 produced nowNotify consumers

bne r1/1 r0/1 NA Rdyr1 will be available next cycle

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 0 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

status

sub r1/1 NA/1 r1 ComplCompleted

bne r1/1 r0/1 NA Execexecuting

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 1 0 1

r1 r2 r3 r4

RAVop src1 src2 tgt

status

Result available @ cycle 6Notify consumers

add r3/1 r2/1 r3 Rdy

sub r1/1 NA/1 r1 ComplCompleted

bne r1/1 r0/1 NA Execexecuting

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 1 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

status

ld r4/1 NA/1 r2 CmtdExecuting

Notify consumers

add r3/1 r2/1 r3 Exec

sub r1/1 NA/1 r1 Compl

Completedbne r1/1 r0/1 NA Compl

ld r2, 10(r4)add r3, r3, r2

sub r1, r1, 1

bne r1, r0, loop

1 1 1 1

r1 r2 r3 r4

RAVop src1 src2 tgt

status

ld r4/1 NA/1 r2 Cmtd

add r3/1 r2/1 r3 Cmtd

sub r1/1 NA/1 r1 Cmtd

bne r1/1 r0/1 NA Cmtd

Notifying Consumers

• Identity of Producer• Uniquely Identify the Instruction• Easily retrievable @ decode by others

– Target Register• Recall we stall on WAR or WAW

– Functional Unit • If not pipelined

– Place in instruction window– PC? not. Why?

Name Dependences and OOO

• WAW or WAR: We need to update register but others are still using it– add r1, r1, 10– sw r1, 20(r2)– add r1, r3, 30– sub r2, r1, 40

• There is only one r1– sw needs to see the value of 1st add– sub needs to wait for 2nd add and not 1st

• Solution: Stall decode when WAW or WAR

Detecting WAW and WAR

• WAW? Look at Scoreboard– If bit is 0 then there is a pending write– Stall

• WAR? Need to know whether all preceding consumers have read the value– Keep a count per register– Increase at decode for all reads– Decrease on issue

• More elegant solution via register renaming– Soon

Window vs. Scheduler• Window

– Distance between oldest and youngest instruction that can co-exist inside the CPU

– Larger window Potential for more ILP• Scheduler

– Number of instructions that are waiting to be issued

• Window– Instructions enter at Fetch– Exit at Commit

• Scheduler– Instructions enter at Decode– Leave at writeback

• Window >= Scheduler– Can be the same structure

• In window but not in scheduler completed

Scoreboarding• Schedule based on RAW dependences• WAW and WAR cause stalls

– WAW at decode– WAR at writeback

• Optimization: Why is this OK?

• Implemented in the CDC 6600 in ‘64– 18 non-pipelined FUs

• 4 FP: 2 mul, 1 add, 1 div• 7 MEM: 5 load, 2 store• 7 INT: add, shift, logical etc.

• Centralized Control Scheme– Controls all Instruction Issue– Detects all hazards

MIPS/DLX w/ Scoreboarding

RegisterFile

FP mul

FP divide

FP add

FP integer

scoreboard

Scoreboarding Overview• Ignore IF and MEM for simplicity• 4-stage execution

– Issue Check for structural hazardsCheck for WAW hazardsStall until all clear

– ReadOp Check for RAW hazardsWait until all operands readyRead Registers

– Execute Execute OperationsNotify scoreboard when complete

– Write Check for WAR hazardsStall Write until all clear

• A completing instruction cannot write dest if an earlier instruction has not read dest.

Scoreboarding Optimizations/Tricks

• WAW as in original OOO• WAR is optimized

– Second Producer is allowed to execute up to complete

– It is stalled there until preceding consumers complete

• No Commit– No precise interrupts

• Window is implemented in the scoreboard• One entry per Functional Unit

– Recall not pipelined– Instructions identified by FU id

Scoreboarding Organization• Three structures

– Instruction Status– Functional Unit Status– Register Result Status

• Instruction Status– Which stage the instruction is currently in

• Functional Unit Status: scheduling– Busy– OP Operation– Fi Dest. Reg.– Fj, Fk Source Regs– Qj, Qk FUs producing sources– Rj, Rk Ready bits for sources

• Register Result Status: dep. determination– Which FU will produce a register

Scoreboarding explained

• Register status reg:– Which FU produces the register

• Use at decode– Source reg match is a RAW– Target reg macth is a WAW stall

Functional Unit Status• Busy:

– resource allocation• OP:

– what to do once issued (e.g., add, sub)• Dest. Reg.:

– Where to write result– To find WAR

• Fj, Fk Source Regs– for WAR: can’t write if consumers pending for

previous value of register (if FU not the same)• Qj, Qk FUs producing sources

– To wait for appropriate producer• Rj, Rk Ready bits for sources

– To determine when ready: all ready

Scoreboarding ExampleInstruction status Read Execution WriteInstruction j k Issue operandscomplete ResultLD F6 34+ R2LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional Unit Statusdest S1 S2 FU for j FU for k Fj? Fk?

Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger NoMult1 NoMult2 NoAdd NoDivide No

Register result status

ClockF0 F2 F4 F6 F8 F10 F12 ... F30

Example: Cycle 0Instruction status Read Execution WriteInstruction j k Issue operandscomplete ResultLD F6 34+ R2 1LD F2 45+ R3MULTD F0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2

Functional Unit Statusdest S1 S2 FU for j FU for k Fj? Fk?

Name Busy Op Fi Fj Fk Qj Qk Rj RkInteger yes LD F6Mult1 NoMult2 NoAdd NoDivide No

ClockF0 F2 F4 F6 F8 F10 F12 ... F30

FU integer

Example, contd.

• The rest you’ll find on the web site• Go through it• Source: Patterson

• Summary:– Execution proceeds in an order dictated by

dependences– RAW, WAR and WAW force ordering– Tricks may be possible

Beyond Simple OoO

A: LF F6, 34(R2)B: LF F2, 45(R3)C: MULF F0, F2, F4D:SUBF F8, F2, F6E: ADDF F2, F7, F4

• E will wait for B, C and D. • WAR w/ C and D• WAW w/ B• Can we do better?

What if we had infinite registersA: LF F6, 34(R2)B: LF F2, 45(R3)C: MULF F0, F2, F4D:SUBF F8, F2, F6E: ADDF F2, F7, F4

A: LF F6, 34(R2)B: LF F2, 45(R3)C: MULF F0, F2, F4D:SUBF F8, F2, F6E: ADDF F9, F7, F4

No false dependences anymoreSince we do not reuse a name we can’t have WAW

and WAR

Why we can’t have Infinite Registers

• False/Name dependences (WAR and WAW)– Artifact of having finite registers

• There is no such thing as infinite• There is no such thing as large enough

– Well there is (in a sec.)– Computers execute Billions of Instructions

per sec. Even a multi-billion register file would soon be exhausted

• Want to exploit parallelism across several instances of the same code– Loops, recursive functions (most frequent

Yes, there is “large enough”

� At any given point there will be a finite number of instructions in the window

� if each instruction has a single register target

� if there are N instructions� How many registers do we need?

� N?� N + X?

Register Renaming• Register Version

– Every Write creates a new version– Uses read the last version– Need to keep a version until all uses have read it.

• Register Renaming:– Architectural vs. Physical Registers

• more phys. than arch.– Maintain a map of arch. to phys. regs.– Use in-order decoding to properly identify

dependences.– Instructions wait only for input op. availability.– Only last version is written to reg. file.

Register RenamingA: DIVF F3, F1, F0 r1, -, -B: SUBF F2, F1, F0 r2, -, -C: MULF F0, F2, F4 r3, r2, -D: SUBF F6, F2, F3 r4, r2, r1E: ADDF F2, F5, F4 r5, -, -F: ADDF F0, F0, F2 r6, r3, r5

Register Rename TableF0 F1 F2 F3 F5 F6 F7 ... F30

A R1B R2 R1C R3 R2 R1D R3 R2 R1 R4E R3 R5 R1 R4F R6 R5 R1 R4

Need more physical registers than architecturalIgnore control flow for the time being.

Register Renaming Process

• Only need to remember last producer of each architectural register– Vector

• At decode– Find the most recent producers for all

source registers– After: declare self as most recent producer

of target register• Complication:

– May have to retract• Speculative Execution, e.g., interrupts

– Need to be able to restore the mapping state

Register Renaming Support Structures

• Register Rename Table– f(aR) = pR– one entry per architectural Register

• Free Register List– Lists not used Physical Registers

• At Decode– grab a new register from the free list– Change mapping in rename table

• At Commit– Release Register? Not… Why?– Could release previous version

How Many Physical Registers?

• Correctness:– At least as many architectural plus?

• Performance:– As many as possible– Not correctness– Recall not all instructions produce register

results• stores and branches

Dynamic Scheduling

A: DIVF F3, F1, F0 r1, -, -

B: SUBF F2, F1, F0 r2, -, -

C: MULF F0, F2, F4 r3, r2, -

D: SUBF F6, F2, F3 r4, r2, r1

E: ADDF F2, F5, F4 r5, -, -

F: ADDF F0, F0, F2 r6, r3, r5

Name Value- Values and Names flow together- Writeback specifies both value and name- A waiting instruction inspects all results- It is allowed to execute when all inputs are available

Physical Registers

• Physical register file is just one option• What we need is separate storage

– Consumers could keep values in their reservation station

– Tomasulo’s next

Tomasulo’s Algorithm• IBM 360/91 - Fast 360 for scientific code

– Completed in 1967– Dynamic scheduling– Predates cache memories

• Pipelined FUs– Adder up to 3 instructions– Multiplier up to 2 instructions

• Tomasulo vs. Scoreboard– Distributed hazard detection and control– Results are bypassed to FUs– Common Data Bus (CDB) for results

• All results visible to all instead of via a register

DLX w/ Tomasulo• Tomasulo’s Algorithm

– Use “tags” to identify data values– Reservation stations distributed control– CDB broadcasts all results to all RSs

• Extend DLX as example– Assume multiple FUs than pipelined– Main difference is Register-Memory Insts.

• I.e., DLX does not have them• But that’s really a detail :-)

• Physical Registers?– Not really. What we need is different storage and

name for every version.– Here it’s the producing reservation station

Dynamic DLX

adders Mults

Load buffers Store buffers

Operation Stack Registers

Tomasulo’s Algorithm• 3 major steps

– Dispatch• Get instruction from fetch queue• ALU op: check for available RS• Load: Check for available load buffer• If available: dispatch and copy read regs to RS or

load buffer• if not: stall - structural hazard

– Issue• If all ops are available: issue• If not monitor CDB for operands

– Complete• If CDB available, broadcast result• else stall

Tomasulo’s Algorithm contd.• Reservation stations

– Handle distributed hazard detection and instruction control

• Everything receiving data get its tag– 4-bit tag specifies reservation station or load buffer– Also which FU will produce result

• Register specifier is used to assign tags– Then they are discarded– Input register specifiers are ONLY used in dispatch.

(Rename table)• Common Data Bus:

– value + “tag” = where this comes from– vs. typical bus: value + “tag” = where this goes to

Tomasulo’s Algorithm Contd.

• Reservation Stations– Op Opcode– Qj, Qk Tag Fields (source ops)– Vj, VkOperand values (source ops)– Busy Currently in use

• Register file and Store Buffer– Qi Tag field– Busy Currently in use– Vi Value

• Load Buffers– Busy Currently in Use

Arch.Reg. Name

Tomasulo’s: Understanding Speculative vs. Architectural State

• add r1, r2, 10• sub r4, r1, 20• add r1, r3, 30

Value of r1I have it

Register file

Can be: “I have it”, “reservation station id”

Value of Src1NA NA Value of Src2NA

tgt src2

Reservation Stations

Reg Arch. name

Renaming 1st Instruction

-----RS0

Register file

Value of R2r1 I have it 10I have it

tgt src2

• Read sources (r2)• Rename r1 to RS0

Renaming 2nd Instruction

-----RS0

----RS1

Register file

tgt src2

----r4 RS0 20I have it

• Sources: r1 in RS0 NYA• Rename r4 to RS1

Renaming 3rd Instruction

-----RS2

----RS1

Register file

tgt src2

----r4 RS0 20I have it

Value of R3r1 I have it 30I have itRS2

• Sources: r3 Avail.• Rename r1 to RS2

Example: cycle 0Instruction status Execution Write

Instruction j k Issue complete Result

LD F6 34+ R2LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

Time Name Busy Op Vj Vk Qj Qk

0 Add1 No0 Add2 No0 Add3 No0 Mult1 No0 Mult2 No

F0 F2 F4 F6 F8 F10 ...

Busy AddressLoad1 NoLoad2 NoLoad3 No

load buffers

LD F6 34+ R2 1LD F2 45+ R3MULTDF0 F2 F4SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

0 A1 No0 A2 No0 A3 No0 M1 No0 M2 No

F0 F2 F4 F6 F8 F10 ...

Busy AddressL1 yesL2 NoL3 No

load buffers

LD F6 34+ R2 1 3LD F2 45+ R3 2MULTDF0 F2 F4 3SUBD F8 F6 F2DIVD F10 F0 F6ADDD F6 F8 F2Reservation Stations S1 S2 RS for j RS for k

0 A1 No0 A2 No0 A3 No0 M1 Yes Mul R(F4) L20 M2 No

F0 F2 F4 F6 F8 F10 ...

FU M1 L2 L1

Busy AddressL1 yesL2 NoL3 No

load buffers

34+R245+R3

- Mul is issued vs. scoreboard- What’s waiting for L1?

Example…

• Check the web site…• Too much for in-class• Summary:

– Execution proceeds in any order that does not violate RAW dependences

– WAR and WAW are removed

Tomasulo’s vs. ScoreboardInstruction status Execution WriteInstruction j k Issue complete ResultLD F6 34+ R2 1 3 4LD F2 45+ R3 2 4 5MULTDF0 F2 F4 3 15 16SUBDF8 F6 F2 4 7 8DIVDF10 F0 F6 5 56 57ADDDF6 F8 F2 6 10 11

- In-order issue- Out-of-order execution- Out-of-order completion

Instruction status Read Execution WriteInstruction j k Issue operandscomplete ResultLD F6 34+ R2 1 2 3 4LD F2 45+ R3 5 6 7 8MULTD F0 F2 F4 6 9 19 20SUBD F8 F6 F2 7 9 11 12DIVD F10 F0 F6 8 21 61 62ADDD F6 F8 F2 13 14 16 22

Scoreboard:

Tomasulo’s• Out-of-order loads and stores?

– What about WAW, RAW and WAR?– Compare all load addresses against the addresses of

all preceding store buffers– Stall if they match

• CDB is a bottleneck– One write per cycle– Could duplicate– But, come at a cost– Datapath + duplicated tags and control

• Complex Implementation– Scalability?– All results to all sources– What if we want 128 instrs?

Tomasulo’s• Advantages

– Distribution of hazard detection– Elimination of WAR and WAW stalls

• Common Data Bus– Broadcasts result to multiple instrs (+)– Bottleneck

• Register Renaming– Removes WAR and WAW hazards– More interesting when same code appears twice

• Think of loops• More on this later

– BUT: Associative lookups– RECALL: direct map is faster

In SummaryFeature Scoreboarding Tomasulo's

CDC6600 IBM 360

Structural Stall in Issue for Stall in DispatchFU for RS

Stall in RS for FURAW Via Registers From CDB

WAR Stall in WB Copy Value to RS

WAW Stall in Issue Register Renaming

Logic Centralized Distributed

Bottlenecks No Register One CDBBypassStall in issue block

A. Moshovos ©ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential...

Documents

Transcript of A. Moshovos ©ECE1773 - Fall ‘06 ECE Toronto Instruction Level Parallel Processing Sequential...

Prediction-Based Superpage-Friendly TLB Designsclass.ece.iastate.edu/tyagi/cpre581/papers/HPCA15TLB.pdfMisel-Myrto Papadopoulou Xin Tong Andre Seznec´ y Andreas Moshovos Dept. of

Best Execution Quality of Execution Report 2018 - …...Best Execution – Quality of Execution Report – 2018 - Retail I. Asset class: Equities – Shares & Depositary Receipts (2,000

ECE 1773 Fall 2007 © A. Moshovos (Toronto) Control Flow Prediction #1.

Stratified Sampling of Execution Traces: Execution Phases ...

A Framework for Coarse-Grain Optimizations in the On-Chip Memory Hierarchy Jason Zebchuk, Elham Safi, and Andreas Moshovos {zebchuk,elham,moshovos}@eecg.toronto.edu.

WP4 execution report. Testing, implementation and execution on Pilot Action.pdf · WP4 execution report. Testing, implementation and execution Testing, implementation and execution

Comment utiliser le plan execution le plan execution

T OR A AMODT (aamodt@eecg.utoronto Andreas Moshovos Paul Chow

Mixed Speculative Multithreaded Execution Models · improve the execution efﬁciency (i.e., ILP) of the latter. Yet another execution model, runahead execution (RA), has also been

Ioana Burcea * Stephen Somogyi §, Andreas Moshovos*, Babak Falsafi § # Predictor Virtualization *University of Toronto Canada § Carnegie Mellon University.

STRATEGY = EXECUTION · Organizations that excel in strategy execution start on average 3 months earlier with execution (the ‘lead-time-to-execution’), reduce execution- and innovation

Moshovos © 1 Memory State Compressors for Gigascale Checkpoint/Restore Andreas Moshovos moshovos@eecg.toronto.edu .

Pipelining and Precise Interruptsmoshovos/ACA07/lecture... · 2007. 10. 3. · 3-stage pipeline observation point. Moshovos ECE1773 10 Critical Path Determines Clock Cycle Overlap

Introduction to CUDA Programming Introduction to Programming Massively Parallel Graphics processors Andreas Moshovos moshovos@eecg.toronto.edu ECE, Univ.

A. Moshovos ©ECE1773 - Fall ‘07 ECE Toronto Out-of-Order Execution Structures.

Programming Massively Parallel Graphics Multiprocessors using …moshovos/CUDA08/arx/TransientStability.pdf · Programming Massively Parallel Graphics Multiprocessors using CUDA Final

Building Strategy Execution Officer to Enable execution of ...

Moshovos © 1 RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence Andreas Moshovos moshovos@eecg.toronto.edu .

Wormhole: Wisely Predicting Multidimensional …Wormhole: Wisely Predicting Multidimensional Branches Jorge Albericio, Joshua San Miguel, Natalie Enright Jerger, and Andreas Moshovos

Project Execution Strategy ROAD - Global CCS Institute · This report aims to provide an overview of the project’s execution ... ROAD’s execution strategy. The actual execution

Ioana Burcea * Stephen Somogyi §, Andreas Moshovos, Babak Falsafi § # Predictor Virtualization University of Toronto Canada § Carnegie Mellon University.