Advanced Pipelining


Page 1: Advanced Pipelining

Advanced Pipelining

• Optimally Scheduling Code

• Optimally Programming Code

• Scheduling for Superscalars (6.9)

• Exceptions (5.6, 6.8)

Page 2: Advanced Pipelining

Optimally schedule code

• for (i = 0; i < N; i++)
      A[i] = A[i] + 10;

• &(A[0]) in $s1

• &(A[N]) in $s2 (the loop runs while $s1 < $s2, so $s2 holds the address one past the last element)

slt $t1, $s3, $s0

beq $t1, $0, end

loop:

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop
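
For reference, the loop above can be written in C in the same pointer-walking shape as the assembly. This is only a sketch: the function name `add_ten` is made up for illustration, and the guard before the loop is an assumption about what the `slt`/`beq` pair ahead of `loop:` is meant to do.

```c
#include <stddef.h>

/* Pointer-walking form of A[i] = A[i] + 10.
 * p plays the role of $s1; end plays the role of $s2
 * (one past the last element). add_ten is a hypothetical name. */
void add_ten(int *a, size_t n)
{
    int *p = a;            /* $s1 = &A[0] */
    int *end = a + n;      /* $s2 = &A[N] */

    if (p >= end)          /* assumed guard: skip the loop when N == 0 */
        return;
    do {
        int t = *p;        /* lw   $t0, 0($s1)   */
        t += 10;           /* addi $t0, $t0, 10  */
        *p = t;            /* sw   $t0, 0($s1)   */
        p++;               /* addi $s1, $s1, 4   */
    } while (p < end);     /* slt + bne          */
}
```

Calling `add_ten(a, 3)` on an array holding 1, 2, 3 leaves it holding 11, 12, 13.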

Page 3: Advanced Pipelining

1. Identify Dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

$t0 – lw->addi – RAW

$t0 – addi->sw – RAW

Page 4: Advanced Pipelining

2. Draw timing diagram WITH DATA FORWARDING

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

F D X M W

Page 5: Advanced Pipelining

3. Remove WAR/WAW dependencies

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

RAW, WAR, WAW

[Timing diagram: F D X M W stages for lw, addi, sw, addi, slt, bne, with repeated F and D stages marking a stall]

Target the false dependencies

Page 6: Advanced Pipelining

3. Remove WAR/WAW dependencies

Original              Incorrect             Correct
lw   $t0, 0($s1)      lw   $t0, 0($s1)      lw   $t0, 0($s1)
sw   $t0, 0($s1)      addi $s1, $s1, 4      addi
addi $s1, $s1, 4      sw   $t0, 0($s1)      sw

Page 7: Advanced Pipelining

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

sw $t0, ____($s1)

slt $t1, $s1, $s2

bne $t1, $0, loop

lw $t0, 0($s1)

addi $t0, $t0, 10

sw $t0, 0($s1)

addi $s1, $s1, 4

slt $t1, $s1, $s2

bne $t1, $0, loop

Page 8: Advanced Pipelining

3. Remove WAR/WAW dependencies

lw $t0, 0($s1)

addi $s1, $s1, 4

addi $t0, $t0, 10

slt $t1, $s1, $s2

sw $t0, -4($s1)

bne $t1, $0, loop

[Timing diagram: F D X M W stages for the rescheduled sequence]
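
The reordering on this slide can be checked with a small C model. It is a sketch only: `add_ten_scheduled` is a hypothetical name, and the point is just that once the pointer increment is hoisted above the store, the store has to reach back one element, which is what the -4 byte offset does.

```c
#include <stddef.h>

/* C model of the scheduled loop body. The pointer bump now happens
 * before the store, so the store uses index -1 (byte offset -4). */
void add_ten_scheduled(int *a, size_t n)
{
    int *p = a;
    int *end = a + n;
    int more;

    if (p >= end)
        return;
    do {
        int t = *p;          /* lw   $t0, 0($s1)   */
        p++;                 /* addi $s1, $s1, 4   */
        t += 10;             /* addi $t0, $t0, 10  */
        more = (p < end);    /* slt  $t1, $s1, $s2 */
        p[-1] = t;           /* sw   $t0, -4($s1)  */
    } while (more);          /* bne  $t1, $0, loop */
}
```

The result is the same as the unscheduled loop; only the order of independent instructions and the store offset change.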

Page 9: Advanced Pipelining

Software Control Hazard Removal

if ((x % 2) == 1)
    isodd = 1;
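
One way to remove this branch is to compute the flag directly from the low bit. A minimal sketch, assuming `x` is unsigned (for negative signed values `x % 2` is -1 in C, so the original test and the bit test differ) and assuming it is acceptable to write 0 when `x` is even, which the original `if` does not do. The name `is_odd` is made up for illustration.

```c
/* Branch-free parity test: the low bit of x is 1 exactly when x is odd. */
int is_odd(unsigned x)
{
    return (int)(x & 1u);
}
```

With this, `isodd = is_odd(x);` replaces the conditional assignment.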

Page 10: Advanced Pipelining

Software Control Hazard Removal

if (x == true)
    y = false;
else
    y = true;
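
The if/else above only computes the logical negation of `x`, so it can be written as a calculation with no branch in the source. A small sketch, assuming `x` and `y` are `bool`; the function name `negate` is hypothetical.

```c
#include <stdbool.h>

/* y = false when x is true, y = true otherwise: simply !x. */
bool negate(bool x)
{
    return !x;
}
```

Usage: `y = negate(x);`, or just `y = !x;` inline.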

Page 11: Advanced Pipelining

Software Control Hazard Removal

if ((x == MON) || (x == TUE) || (x == WED)) {}
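
A chain of `||` tests like the Monday/Tuesday/Wednesday check may compile to several branches, one per comparison. One possible rewrite is a single bitmask lookup. This is a sketch under assumptions: the `enum day` values (MON = 0 through SUN = 6) and the name `is_mon_tue_wed` are invented here, and `x` must be a valid day.

```c
/* Hypothetical encoding of the days; the slide does not define one. */
enum day { MON, TUE, WED, THU, FRI, SAT, SUN };

/* Returns 1 when x is MON, TUE, or WED, using one shift and one AND
 * instead of three comparisons. */
int is_mon_tue_wed(enum day x)
{
    const unsigned mask = (1u << MON) | (1u << TUE) | (1u << WED);
    return (int)((mask >> (unsigned)x) & 1u);
}
```

The caller still branches once on the result, but the three data-dependent comparisons become straight-line calculation.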

Page 12: Advanced Pipelining

Increasing Branch Performance

if ((TheCoinTossIsHeads) || (StudentStudiedForExam)) {}

Page 13: Advanced Pipelining

What does it all mean?

• Does that mean that error-checking code is bad? That is a whole lot of branches if you do it well!!!

Page 14: Advanced Pipelining

The moral is…..

• Calculation is less expensive than …..

Page 15: Advanced Pipelining

Superscalars - Parallelism

Ford mass produces cars. We want to “mass produce” instructions.

Increase Depth – assembly line – build many cars at the same time, but each car is in a different stage of assembly.

Increase Width – multiple assembly lines – build many cars at the same time by building many lines, all of which operate simultaneously.

Page 16: Advanced Pipelining

“Superpipelining” (deep pipelining – many stages)

• Limiting returns because….

• Register delays are __________________________ of clock

• Difficult to __________________

Page 17: Advanced Pipelining

SuperScalars

• __________ parts of pipeline

• Multiple instructions in _______ stage at once

Page 18: Advanced Pipelining

SuperScalars

• Which instructions can execute in parallel?

• Fetching multiple instructions per cycle

Page 19: Advanced Pipelining

Static Scheduling – VLIW or EPIC (Itanium)

• __________ schedules the instructions

• If one instruction stalls, all following instructions stall

• Book Example: SuperScalar MIPS:

– Two instructions / cycle

– One ALU/branch, one ld/st each cycle

Page 20: Advanced Pipelining

Schedule for SS MIPS

Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop

PC    ALU/branch    ld/st
0
8
16
24
32

Page 21: Advanced Pipelining

SuperScalars - Static

[Figure: pipeline stages Fetch, Decode, Execute, Memory, WriteBack, annotated "Read Values" and "Write Values", with lw, addu, sw, addi, and bne in flight]

Page 22: Advanced Pipelining

Loop Problem

• Problem:

– Too many _______________ in loop

– Not enough ______________ to fill in holes

• Solution:

– Do ______________ at once

– More instructions

– Only one branch

Page 23: Advanced Pipelining

Loop Unrolling

1. Unroll Loop

Original:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Unrolled:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      bne  $s1, $zero, Loop

Page 24: Advanced Pipelining

Loop Unrolling

2. Rename Registers

But wait!!! How has this helped? There are tons of dependencies. Whatever are we to do? Register Renaming!!!

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, 4($s1)
      lw   $t1, 0($s1)
      addi $s1, $s1, -4
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

Page 26: Advanced Pipelining

Loop Unrolling

3. Reduce Instructions

Both addi instructions moved together (fill in the offsets):

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -4
      addi $s1, $s1, -4
      addu $t0, $t0, $s2
      sw   $t0, ___($s1)
      lw   $t1, ___($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop

The two addi instructions combined into one:

Loop: lw   $t0, 0($s1)
      addi $s1, $s1, -8
      addu $t0, $t0, $s2
      sw   $t0, 8($s1)
      lw   $t1, 4($s1)
      addu $t1, $t1, $s2
      sw   $t1, 4($s1)
      bne  $s1, $zero, Loop
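
At the source level, the unroll, rename, and reduce steps amount to something like the following C sketch. It is an illustration, not a translation of the assembly: the name `add_k_unrolled2`, the parameter `k` (standing in for `$s2`), and the even-length assumption are all introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Adds k to every element, two elements per iteration, walking from the
 * last element toward the first as the assembly does.
 * Assumes n is even (a real version would handle the leftover element). */
void add_k_unrolled2(int *a, size_t n, int k)
{
    assert(n % 2 == 0);
    for (size_t i = n; i > 0; i -= 2) {   /* one decrement, one branch */
        int t0 = a[i - 1];                /* first copy uses t0 ($t0) */
        int t1 = a[i - 2];                /* second copy uses t1 ($t1) */
        a[i - 1] = t0 + k;
        a[i - 2] = t1 + k;
    }
}
```

Using two temporaries mirrors the register renaming step: the two copies of the body no longer share `$t0`, so their loads and adds can be interleaved by the scheduler.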

Page 27: Advanced Pipelining

Loop Unrolling

4. Schedule

Loop: lw1   $t0, 0($s1)
      addi  $s1, $s1, -8
      addu1 $t0, $t0, $s2
      sw1   $t0, 8($s1)
      lw2   $t1, 4($s1)
      addu2 $t1, $t1, $s2
      sw2   $t1, 4($s1)
      bne   $s1, $zero, Loop

ALU/branch    lw/sw
              lw1

Page 28: Advanced Pipelining

Performance Comparison

Original:

ALU/branch              ld/st
                        lw $t0, 0($s1)
addi $s1, $s1, -4
addu $t0, $t0, $s2
bne  $s1, $zero, L      sw $t0, 4($s1)

Unrolled:

Page 29: Advanced Pipelining

Static Scheduling Summary

• Code size ______________ (because of nops)

• It can not resolve __________ dependencies

• If one instruction stalls, ___________________

Page 30: Advanced Pipelining

Dynamic Scheduling

• _________ schedules ready instructions

• Only ___________ instructions stall

• _______________ resolved in hardware

Page 31: Advanced Pipelining

4-wide Dynamic Superscalar: Fetch

[Figure: 4-wide dynamic superscalar datapath with Register File, Register Alias Table, Instruction Window, functional units (Ld/St, 1Add, 2Add, 3Add), Ld/St Queue, and Commit Buffer]

Loop: lw   r2, 0(r1)
      addu r2, r2, r5
      sw   r2, 0(r1)
      addi r1, r1, -4
      bne  r1, r7, Loop

Fetch 4 instructions each cycle

Page 32: Advanced Pipelining

4-wide Dynamic Superscalar: Decode

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop]

Register Alias Table records

1. Current Register Number (WAW/WAR Register Renaming)

or

Page 33: Advanced Pipelining

4-wide Dynamic Superscalar: Decode

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop, with pending source operands replaced by functional-unit tags, e.g. addu r2, ldst1, r5; sw 1add1, 0(s1); bne 2add1, r7, Loop]

Register Alias Table records

1. Current Register Number (WAW/WAR Register Renaming)

or

2. Functional Unit (RAW – result not ready)
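
The renaming half of the Register Alias Table can be sketched in a few lines of C. Everything here is an assumption made for illustration (the sizes, the `struct rat` layout, the function names, and the never-recycled free counter); it models only the "current register number" role and not the functional-unit tags used for results that are not ready yet.

```c
#include <assert.h>

#define NUM_ARCH_REGS 32   /* assumed architectural register count */
#define NUM_PHYS_REGS 64   /* assumed physical register count */

/* Hypothetical Register Alias Table: maps each architectural register
 * to the physical register that holds its newest value. */
struct rat {
    int map[NUM_ARCH_REGS];
    int next_free;
};

void rat_init(struct rat *r)
{
    for (int i = 0; i < NUM_ARCH_REGS; i++)
        r->map[i] = i;                 /* identity mapping at reset */
    r->next_free = NUM_ARCH_REGS;
}

/* Source operands read the current mapping, which preserves true
 * (RAW) dependences on the most recent writer. */
int rat_read(const struct rat *r, int arch_reg)
{
    return r->map[arch_reg];
}

/* Every destination gets a fresh physical register, so a later write
 * cannot clobber a value an earlier instruction still needs (WAR/WAW). */
int rat_write(struct rat *r, int arch_reg)
{
    assert(r->next_free < NUM_PHYS_REGS);  /* sketch: no recycling */
    r->map[arch_reg] = r->next_free++;
    return r->map[arch_reg];
}
```

For the loop on this slide, the `lw` and the `addu` both write `r2`; under this scheme each write receives its own physical register, and the `sw` reads whichever mapping is current when it is decoded.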

Page 34: Advanced Pipelining

4-wide Dynamic Superscalar: Execute

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop]

Wait until your inputs are ready

Page 35: Advanced Pipelining

4-wide Dynamic Superscalar: Execute

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop]

Execute once they are ready

Page 36: Advanced Pipelining

4-wide Dynamic Superscalar: Memory

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop, with sw and lw in the Ld/St Queue]

First calculate the address

Page 37: Advanced Pipelining

4-wide Dynamic Superscalar: Memory

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop]

Ld/St Queue checks memory addresses – out of order lw/sw

Page 38: Advanced Pipelining

4-wide Dynamic Superscalar: Commit

[Figure: 4-wide dynamic superscalar datapath running the lw/addu/sw/addi/bne loop. Key: waiting for value / reading value]

Instructions wait until all previous instructions have completed

Page 39: Advanced Pipelining

Fallacies & Pitfalls

• Pipelining is easy

– ______________ is difficult

• Instruction set has no impact on pipelining

– Complicated _____________ & _____________________ instructions complicate pipelining immensely

Page 40: Advanced Pipelining

Technology Influences

• Pipelining ideas are good ideas regardless of technology

– Only recently, with extra chip space, has ___________________ become better than ____________________

– Now, pipelining limited by ________

Page 41: Advanced Pipelining

Exceptions – Unexpected Events

• Internal

• External

Page 42: Advanced Pipelining

Definitions

a. Anything unexpected happens

b. External event occurs

c. Internal event occurs

d. Change in control flow

            Exception    Interrupt
PowerPC
Intel
MIPS

Page 43: Advanced Pipelining

Exception-Handling

• Stop

• Transfer control to OS

• Tell OS what happened

• Begin executing where we left off

Page 44: Advanced Pipelining

1. Detect Exception

• Add control lines to detect errors

Page 45: Advanced Pipelining

Step 2: Store PC into EPC

[Figure: single-cycle datapath with PC, Instruction Memory, Register File, Data Memory, sign extension (16 to 32 bits), and shift-left-2 units]

Page 46: Advanced Pipelining

Step 3: Tell OS the problem

• Store error code in the _________

• Use vectored interrupts

– Use error code to determine _________

Page 47: Advanced Pipelining

Cause Register

• Set a flag in the cause register

• How does the OS find out if an overflow occurred if the bit corresponding to an overflow is bit 5?
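
The usual answer to the question above is to mask out that one bit. A minimal sketch, assuming the cause register has already been read into a 32-bit value; `CAUSE_OVERFLOW_BIT` and `overflow_occurred` are names made up for this example, and bit 5 is taken from the slide rather than from any particular MIPS manual.

```c
#include <stdint.h>

#define CAUSE_OVERFLOW_BIT 5u   /* bit position given on the slide */

/* Returns 1 if the overflow flag (bit 5) is set in the cause value. */
int overflow_occurred(uint32_t cause)
{
    return (int)((cause >> CAUSE_OVERFLOW_BIT) & 1u);
}
```

Equivalently, the handler can test `cause & 0x20`, since 1 << 5 is 0x20.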

Page 48: Advanced Pipelining

Vectored Interrupts

• The address of trap handler is determined by cause

Exception type           Exception vector address (in hex)
Undefined Instruction    C0 00 00 00
Arithmetic Overflow      C0 00 00 20
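
With vectored interrupts the exception type itself selects the handler address. The table can be modelled as a lookup; this is a sketch in which the enum and function names are hypothetical, and only the two addresses come from the slide.

```c
#include <stdint.h>

/* Hypothetical names for the two exception types listed in the table. */
enum exception_type {
    EXC_UNDEFINED_INSTRUCTION,
    EXC_ARITHMETIC_OVERFLOW
};

/* Returns the vector address for an exception type, or 0 if unknown. */
uint32_t handler_address(enum exception_type type)
{
    switch (type) {
    case EXC_UNDEFINED_INSTRUCTION:
        return 0xC0000000u;
    case EXC_ARITHMETIC_OVERFLOW:
        return 0xC0000020u;
    }
    return 0u;
}
```

The two vectors are 0x20 bytes apart, which leaves room for eight 4-byte instructions at each entry point.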

Page 49: Advanced Pipelining

Cause Register – Go to OS

[Figure: single-cycle datapath extended with an EPC register (with a -4 adjustment), a Cause register, and a Handler PC input]

Page 50: Advanced Pipelining

Vectored Interrupt – Go to OS

[Figure: single-cycle datapath extended with an EPC register (with a -4 adjustment) and a Cause register that indexes a Vector Table]

Page 51: Advanced Pipelining

Steps for Exceptions

• Detect exception

• Place processor in state before offending instruction

• Record exception type

• Record instruction’s PC in EPC

• Transfer control to OS
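
The last three steps can be modelled as updates to a small piece of processor state. This is a software sketch only: the `struct cpu` fields and the `take_exception` signature are assumptions, and detection and squashing of later instructions (the first two steps) are hardware actions that are not modelled here.

```c
#include <stdint.h>

/* Hypothetical slice of processor state involved in taking an exception. */
struct cpu {
    uint32_t pc;     /* next instruction to fetch */
    uint32_t epc;    /* PC of the offending instruction */
    uint32_t cause;  /* why the exception happened */
};

/* Record the exception type, save the offending PC, and redirect fetch
 * to the OS handler. */
void take_exception(struct cpu *c, uint32_t offending_pc,
                    uint32_t cause_code, uint32_t handler_pc)
{
    c->cause = cause_code;    /* record exception type */
    c->epc = offending_pc;    /* record instruction's PC in EPC */
    c->pc = handler_pc;       /* transfer control to OS */
}
```

After the handler finishes, execution can resume from the address saved in `epc`.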

Page 52: Advanced Pipelining

1. Detection

What happens if the third instruction is undefined?

Time->              1    2    3    4    5    6    7    8
add $s0, $0, $0     IF   ID   EX   MEM  WB
lw $s1, 0($t0)           IF   ID   EX   MEM  WB
undefined                     IF   ID   EX   MEM  WB
or $s3, $s4, $t3                   IF   ID   EX   MEM  WB

In what stage is it detected? In what cycle?

Page 53: Advanced Pipelining

1. Detection

• Must associate exception with proper instruction

• What happens if multiple exceptions happen in the same cycle?

Page 54: Advanced Pipelining

2. Preserve state before instruction

Time->              1    2    3    4    5    6    7    8
add $s0, $0, $0     IF   ID   EX   MEM
lw $s1, 0($t0)           IF   ID   EX
undefined                     IF   ID
or $s3, $s4, $t3                   IF

What? What does that mean?!?

Page 55: Advanced Pipelining

3. Record exception type

• Place value in cause register or

• Use vectored interrupts

– (exception routine address dependent on exception type)

Page 56: Advanced Pipelining

4. Record PC in EPC

Machine in detection cycle

[Figure: pipelined datapath (PC, Inst Mem, RegFile, ALU, DataMem) with the undefined instruction, add, lw, and or in flight]

Page 57: Advanced Pipelining

4. Record PC in EPC

Machine before transfer

[Figure: pipelined datapath (PC, Inst Mem, RegFile, ALU, DataMem) with only the undefined instruction remaining]

Where is the proper PC? Long gone!!!

Page 58: Advanced Pipelining

4. Record PC in EPC

• Non-trivial because PC changes each cycle, and exceptions can be detected in several stages (decode, execute, memory)

• Precise exceptions

• Imprecise exceptions

Page 59: Advanced Pipelining

5. Transfer control to OS

• Same as before