Gary MarsdenSlide 1University of Cape Town Pipelining Technique where multiple instructions are...


Pipelining

Technique where multiple instructions are overlapped in execution (key for speed)

[Figure: task order (tasks A, B, C, D) against time from 6 PM to 2 AM, shown twice.]


Analogy

Each step is called a pipe stage or pipe segment
Pipelining improves throughput rather than the speed of a given instruction
– Concorde vs 747
Only possible in multi-cycle datapaths
All stages must be ready to proceed at the same time
Clock cycle determined by slowest stage
Goal: balance length of each stage


Pipelined datapath

Having 5 steps in an instruction means a 5 stage pipeline
– 5 instructions being executed at a given time
1. IF: Instruction fetch
2. ID: Instruction decode
3. EX: Execute and effective address calculation
4. MEM: Memory access
5. WB: Write back
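As a rough illustration of the five stages above, the following Python sketch lists which instruction occupies which stage in each clock cycle. It assumes an ideal pipeline with no hazards or stalls. The names `ideal_schedule` and `total_cycles` are made up for this example, and the three loads are borrowed from the timing slides.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def ideal_schedule(instructions):
    """Return {cycle: {stage: instruction}} for a pipeline with no hazards.

    Instruction i (0-based) enters IF in cycle i + 1 and moves on one stage
    per cycle, so it occupies stage s (0-based) in cycle i + s + 1.
    """
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s + 1, {})[stage] = instr
    return schedule


def total_cycles(n_instructions):
    """Cycles to finish n instructions: fill the pipe once, then one per cycle."""
    if n_instructions == 0:
        return 0
    return len(STAGES) + n_instructions - 1


if __name__ == "__main__":
    program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]
    sched = ideal_schedule(program)
    for cycle in sorted(sched):
        row = ", ".join(
            f"{stage}: {sched[cycle][stage]}"
            for stage in STAGES
            if stage in sched[cycle]
        )
        print(f"CC {cycle}: {row}")
```

With three instructions the last write back falls in cycle 7, and with five or more instructions every stage is busy from cycle 5 onward.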


Comparative timing

[Figure: lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0) in program execution order against time. Top: without pipelining, each instruction (Instruction fetch, Reg, ALU, Data access, Reg) takes 8 ns and the next one starts 8 ns later. Bottom: with pipelining, each stage takes 2 ns and a new instruction starts every 2 ns.]
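The figure's numbers can be checked with a small calculation. In the sketch below, only the 8 ns total per load and the 2 ns pipelined stage time come from the slide. The per-stage split (2, 1, 2, 2, 1 ns) is an assumption, chosen so that the stages add up to 8 ns with a slowest stage of 2 ns. The function names are hypothetical.

```python
# Stage latencies in ns for a load. The per-stage split is an assumption;
# the slide only shows the 8 ns total and the 2 ns pipelined stage time.
STAGE_NS = {"IF": 2, "ID": 1, "EX": 2, "MEM": 2, "WB": 1}


def unpipelined_time(n, stage_ns=STAGE_NS):
    """Each instruction runs start to finish before the next one begins."""
    return n * sum(stage_ns.values())


def pipelined_time(n, stage_ns=STAGE_NS):
    """Every stage is stretched to the slowest one; one instruction
    completes per clock once the pipe is full."""
    if n == 0:
        return 0
    clock = max(stage_ns.values())
    return (len(stage_ns) + n - 1) * clock


if __name__ == "__main__":
    for n in (1, 3, 1000):
        print(n, unpipelined_time(n), pipelined_time(n))
```

Under these assumptions three loads take 24 ns without pipelining and 14 ns with it, while a single load is slower when pipelined (10 ns against 8 ns): throughput improves, not the latency of one instruction.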



View of datapath

[Figure: the datapath (PC, instruction memory, registers, sign extend, ALU, data memory, adders and muxes) divided into five stages. IF: Instruction fetch. ID: Instruction decode/register file read. EX: Execute/address calculation. MEM: Memory access. WB: Write back.]


Progression in pipe

General left-right progression
– Like a car assembly line
Two exceptions
– Write back stage places result back into the register file, which is in the middle of the datapath
– Selection of PC value - could be a branch
Right-left flow may affect subsequent instructions
Like the multi-cycle datapath, we need registers to hold values between stages


Symbolic view

[Figure: lw $1, 100($0), lw $2, 200($0) and lw $3, 300($0) each pass through IM, Reg, ALU, DM, Reg, starting one clock cycle apart, over CC 1 to CC 7 (time in clock cycles).]


Extra registers

[Figure: the datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB added between the stages.]


Execution of load instruction

[Figure: execution of lw on the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]


Execution of store

[Figure: execution of sw on the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]


Ooops

When doing a write back for ‘lw’ we don’t know where to write!
– By the WB stage, IF/ID holds a later instruction, so the write register number has to be carried along with the load through the pipeline registers

[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB.]


A note on notations

[Figure: two equivalent notations for lw $10, 20($1) followed by sub $11, $2, $3, over CC 1 to CC 6 (time in clock cycles). Symbolic form: IM, Reg, ALU, DM, Reg. Labelled form: Instruction fetch, Instruction decode, Execution, Data access, Write back.]


Pipeline control

Just like we did for the single datapath machine, but with a twist
– Label control lines on existing data path
– Assume PC written on each cycle (no PCWrite)
– To control pipeline stage, need only control values for that stage
– Usual five stages for control: IF, ID, EXE, MEM, WB


Pipeline control diagram

[Figure: pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with the control lines labelled: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg. Instruction fields shown: Instruction[20–16] and Instruction[15–0].]


Buffering pipeline control

[Figure: Control decodes the instruction from IF/ID into three groups of control values. ID/EX holds EX, M and WB; EX/MEM holds M and WB; MEM/WB holds WB.]
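The buffering idea can be sketched in Python: the control values are generated once and split into EX, M and WB groups, and each pipeline register passes on only the groups that later stages still need. The signal names are the ones labelled on the control diagram. The values per instruction follow the usual MIPS textbook control table and are an assumption here, as is the helper name `carried`.

```python
# Control values grouped by the stage that consumes them.
# Values are assumed from the usual MIPS control table; None is a don't-care.
CONTROL = {
    "R-format": {
        "EX": {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 0},
    },
    "lw": {
        "EX": {"RegDst": 0, "ALUOp": 0b00, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 1, "MemWrite": 0},
        "WB": {"RegWrite": 1, "MemtoReg": 1},
    },
    "sw": {
        "EX": {"RegDst": None, "ALUOp": 0b00, "ALUSrc": 1},
        "M": {"Branch": 0, "MemRead": 0, "MemWrite": 1},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
    "beq": {
        "EX": {"RegDst": None, "ALUOp": 0b01, "ALUSrc": 0},
        "M": {"Branch": 1, "MemRead": 0, "MemWrite": 0},
        "WB": {"RegWrite": 0, "MemtoReg": None},
    },
}

# Which control groups each pipeline register still has to hold.
GROUPS_HELD = {
    "ID/EX": ("EX", "M", "WB"),
    "EX/MEM": ("M", "WB"),
    "MEM/WB": ("WB",),
}


def carried(control_word, pipeline_register):
    """Return the control groups stored in the given pipeline register."""
    return {group: control_word[group] for group in GROUPS_HELD[pipeline_register]}


if __name__ == "__main__":
    for register in GROUPS_HELD:
        print(register, carried(CONTROL["lw"], register))
```

Each group is dropped as soon as its stage has used it, which is why the stack of control fields gets shorter from ID/EX to MEM/WB.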


Another scary picture

[Figure: the full pipelined datapath with the Control unit, whose EX, M and WB control groups are carried through ID/EX, EX/MEM and MEM/WB. Control lines shown: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemRead, MemWrite, MemtoReg.]


Observations

Although a new instruction starts every clock cycle, each instruction still needs 5 cycles to complete
Takes four cycles before we are up to full efficiency
When stage is inactive, control lines are deasserted
Control sequencing is implicit in pipeline stages
– No microinstructions like before


Data hazard

Sequences of instructions with dependencies make high-performance pipelines hard to design
– sub $2, $1, $3
– and $12, $2, $5
– Oopsie! (and reads $2 before sub has written it back)
Resolving
– Forbid the compiler to do this
  • Interleave only independent instructions
  • Use a No-op (wasteful)
– Stall
– Forward
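The compiler-side option can be illustrated with a short Python sketch that finds read-after-write dependences and pads them with no-ops. Everything here is an assumption made for illustration: the tuple encoding of instructions, the function names, and the `hazard_distance` parameter (3 means a result cannot be read by any of the next three instructions; 2 would apply if the register file could be written and read in the same cycle). The instruction sequence is the one used in the hazard diagrams.

```python
# Each instruction: (text, destination register or None, source registers).
PROGRAM = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $5", "$12", ("$2", "$5")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $2, $2", "$14", ("$2", "$2")),
    ("sw $15, 100($2)", None, ("$15", "$2")),
]
NOP = ("nop", None, ())


def raw_hazards(program, hazard_distance=3):
    """List (producer index, consumer index, register) for every read that
    falls within hazard_distance instructions of the write it depends on."""
    hazards = []
    for c, (_, _, sources) in enumerate(program):
        for p in range(max(0, c - hazard_distance), c):
            dest = program[p][1]
            if dest is not None and dest in sources:
                hazards.append((p, c, dest))
    return hazards


def insert_nops(program, hazard_distance=3):
    """Pad with no-ops until no consumer is within hazard_distance of its producer."""
    out = []
    for instr in program:
        sources = instr[2]
        needed = 0
        # back = 1 is the instruction emitted immediately before this one.
        for back, prev in enumerate(reversed(out[-hazard_distance:]), start=1):
            if prev[1] is not None and prev[1] in sources:
                needed = max(needed, hazard_distance - back + 1)
        out.extend([NOP] * needed)
        out.append(instr)
    return out


if __name__ == "__main__":
    print(raw_hazards(PROGRAM))
    for text, _, _ in insert_nops(PROGRAM):
        print(text)
```

Padding this sequence costs three no-ops for a single dependence, which is the waste the slide refers to and the motivation for stalling and forwarding in hardware instead.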


Data hazard diagram

[Figure: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) in the pipeline over CC 1 to CC 9.]

Clock cycle:           CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:  10    10    10    10    10/–20  –20   –20   –20   –20


Overcoming data hazards

The ‘add’ problem we can overcome with hardware design
– Write register file in first half of clock cycle, read in second
Doesn’t help with ‘and’ and ‘or’
– Need to detect hazard and forward correct value


Detecting hazards

We can’t get the computer to draw a diagram, instead we use the following notation
– 1(a) EX/MEM.WriteReg = IF/ID.ReadReg1
– 1(b) EX/MEM.WriteReg = IF/ID.ReadReg2
– 2(a) MEM/WB.WriteReg = IF/ID.ReadReg1
– 2(b) MEM/WB.WriteReg = IF/ID.ReadReg2
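The four conditions amount to one comparison per read register against each of the two later pipeline registers. A minimal Python sketch follows, using register numbers as plain integers. Three details are assumptions that the slide does not state: the producing instruction must actually write a register (a `reg_write` flag), register 0 is never forwarded, and EX/MEM wins over MEM/WB because it holds the more recent result. The function name `forward_select` is hypothetical.

```python
def forward_select(read_reg, ex_mem, mem_wb):
    """Choose where one ALU operand should come from.

    read_reg: register number the instruction wants to read.
    ex_mem, mem_wb: (write_reg, reg_write) pairs describing the instructions
    whose results sit in the EX/MEM and MEM/WB pipeline registers.
    Returns "EX/MEM", "MEM/WB" or "REGFILE".
    """
    # Check EX/MEM first: it holds the younger, more recent result (assumed priority).
    for name, (write_reg, reg_write) in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        if reg_write and write_reg != 0 and write_reg == read_reg:
            return name
    return "REGFILE"


if __name__ == "__main__":
    # sub $2, $1, $3 is one stage ahead of and $12, $2, $5: conditions 1(a)/1(b).
    print(forward_select(2, ex_mem=(2, True), mem_wb=(0, False)))
    # sub is two stages ahead of or $13, $6, $2: conditions 2(a)/2(b).
    print(forward_select(2, ex_mem=(12, True), mem_wb=(2, True)))
    # $6 is not being written by either earlier instruction.
    print(forward_select(6, ex_mem=(12, True), mem_wb=(2, True)))
```

Calling the function once per read register gives the two selections that would steer the operand muxes in front of the ALU.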


Forwarding

If we can detect a hazard, we can forward the correct value as soon as it is available
– We will see how to do this soon
By ‘forwarding’ we can pull the value from the appropriate pipeline register rather than waiting for it to be written back at the end of an instruction


Forwarding to resolve hazards

[Figure: sub $2, $1, $3; and $12, $2, $5; or $13, $6, $2; add $14, $2, $2; sw $15, 100($2) in program execution order over CC 1 to CC 9, with the result of sub forwarded from the pipeline registers.]

Clock cycle:           CC 1  CC 2  CC 3  CC 4  CC 5    CC 6  CC 7  CC 8  CC 9
Value of register $2:  10    10    10    10    10/–20  –20   –20   –20   –20
Value of EX/MEM:       X     X     X     –20   X       X     X     X     X
Value of MEM/WB:       X     X     X     X     –20     X     X     X     X


Achieving control for forwarding

[Figure: two versions of the datapath around the ALU (Registers, ALU, data memory, pipeline registers ID/EX, EX/MEM and MEM/WB). a. No forwarding. b. With forwarding: a forwarding unit takes Rs, Rt, Rd, EX/MEM.RegisterRd and MEM/WB.RegisterRd as inputs and produces ForwardA and ForwardB for the muxes in front of the ALU.]


Until…

[Figure: pipelined datapath with Control (EX, M and WB groups), pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB, and the forwarding unit. Labelled signals: IF/ID.RegisterRs, IF/ID.RegisterRt, IF/ID.RegisterRd, EX/MEM.RegisterRd and MEM/WB.RegisterRd.]


Stalls

Forwarding is an efficient way to solve data hazards, but not all can be solved this way

[Figure: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 in the pipeline over clock cycles up to CC 9.]


Load Word problems

We cannot forward when an instruction tries to read a register immediately following a lw that is writing to that register
We need to detect this
– Hazard detection unit in addition to the forwarding unit
Conditions
– If (ID/EX.MemRead AND
– ((ID/EX.WriteReg = IF/ID.ReadReg1) OR
– (ID/EX.WriteReg = IF/ID.ReadReg2)))
lw is the only instruction to set this line (MemRead)
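The condition above translates almost directly into code. The Python sketch below applies it to the instruction sequence from the stall diagrams and counts the bubbles, assuming forwarding takes care of every other dependence. The dictionary encoding of instructions and the names `load_use_hazard`, `count_stalls` and `cycles_with_stalls` are hypothetical.

```python
def load_use_hazard(id_ex_mem_read, id_ex_write_reg, if_id_read_regs):
    """True when the instruction in ID reads the register that a load,
    currently one stage ahead in EX, is about to write."""
    return bool(id_ex_mem_read and id_ex_write_reg in if_id_read_regs)


# Sequence from the stall diagrams.
PROGRAM = [
    {"op": "lw", "dest": "$2", "srcs": ("$1",)},          # lw  $2, 20($1)
    {"op": "and", "dest": "$4", "srcs": ("$2", "$5")},    # and $4, $2, $5
    {"op": "or", "dest": "$8", "srcs": ("$2", "$6")},     # or  $8, $2, $6
    {"op": "add", "dest": "$9", "srcs": ("$4", "$2")},    # add $9, $4, $2
    {"op": "slt", "dest": "$1", "srcs": ("$6", "$7")},    # slt $1, $6, $7
]


def count_stalls(program):
    """Count one-cycle bubbles; only a load followed immediately by a
    reader of its result needs one when forwarding is available."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if load_use_hazard(prev["op"] == "lw", prev["dest"], curr["srcs"]):
            stalls += 1
    return stalls


def cycles_with_stalls(program, n_stages=5):
    """Cycles for the whole sequence on an n-stage pipeline, bubbles included."""
    return n_stages + len(program) - 1 + count_stalls(program)


if __name__ == "__main__":
    print(count_stalls(PROGRAM), cycles_with_stalls(PROGRAM))
```

For this sequence only the and stalls, so five instructions finish in cycle 10 rather than cycle 9.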


Stalls

Once detected, we have to stall execution until the value is available (whereupon it is forwarded)
Sometimes called a ‘bubble’, the idea being that we send an air bubble up the pipe, not data
Not strictly true
– The control unit just gets the stalled stages of the pipeline to repeat what they were doing until the value is available


Bubbles in the pipe

[Figure: lw $2, 20($1); and $4, $2, $5; or $8, $2, $6; add $9, $4, $2; slt $1, $6, $7 over CC 1 to CC 10 (time in clock cycles), with a bubble inserted into the pipeline.]


Adding hazard detection

[Figure: pipelined datapath with Control, the forwarding unit and a hazard detection unit. Signals shown around the hazard detection unit: ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs, IF/ID.RegisterRt, IF/IDWrite and PCWrite, plus a mux that can select 0 in place of the control values.]


Branch hazards

Another type of hazard involves branches: an instruction must be fetched every cycle to keep the pipeline full… but the decision about a branch does not come until the MEM stage
Called a ‘control’ or a ‘branch’ hazard
– Occur less frequently than data hazards
– Are simple to understand
– Not much we can do really


Effect of a branch

[Figure: instructions at addresses 40 beq $1, $3, 7; 44 and $12, $2, $5; 48 or $13, $6, $2; 52 add $14, $2, $2; 72 lw $4, 50($7) in the pipeline over CC 1 to CC 9.]


Coping with branching

Stall subsequent instructions on ‘beq’
– This increases the cost of a branch from one cycle to four cycles
Assume branch not taken
– Carry on as before
– Only penalty will be if the branch is taken
– We can then ‘flush’ buffers


Lessening the impact

Currently branch decisions are made at stage 4
We could save one stage by getting the value from the buffer at stage 3 (like forwarding)
Can even calculate the branch in the ID stage!
– Move branch adder from MEM to ID stage
– Add a bunch of XOR gates to do comparison of register values (do not use the ALU)
– Need to alter forwarding unit to cope with this
Impact down to one lost cycle
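To get a feel for what these options are worth, here is a back-of-the-envelope Python sketch of average CPI on an otherwise ideal pipeline. The model is an assumption: a branch resolved in stage k wastes k - 1 cycles, which matches the three extra cycles for stage 4 and the one lost cycle for ID quoted on these slides. The function name and the example branch mix (20% branches, half of them taken) are hypothetical figures, not measurements.

```python
def branch_cpi(branch_fraction, taken_fraction, resolve_stage, assume_not_taken=True):
    """Average CPI with a base CPI of 1 and only branch penalties added.

    resolve_stage: 1-based stage in which the branch outcome is known
                   (4 = MEM, 2 = ID).
    assume_not_taken: if True, cycles are lost only when the branch is taken;
                      if False, the pipeline stalls on every branch.
    """
    penalty = resolve_stage - 1
    if assume_not_taken:
        lost = branch_fraction * taken_fraction * penalty
    else:
        lost = branch_fraction * penalty
    return 1.0 + lost


if __name__ == "__main__":
    mix = dict(branch_fraction=0.2, taken_fraction=0.5)  # hypothetical workload
    print(branch_cpi(resolve_stage=4, assume_not_taken=False, **mix))  # always stall
    print(branch_cpi(resolve_stage=4, **mix))  # assume not taken, decide in MEM
    print(branch_cpi(resolve_stage=2, **mix))  # assume not taken, decide in ID
```

With this made-up mix the three schemes come out at roughly 1.6, 1.3 and 1.1 cycles per instruction, so both steps help and moving the decision earlier matters most when many branches are taken.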


Datapath to lessen branch impact

[Figure: pipelined datapath with an equality comparator (=), sign extend and shift left 2 for early branch resolution, an IF.Flush control line, the hazard detection unit and the forwarding unit.]


Branch prediction

‘Assume branch not taken’ is a very primitive form of branch prediction
We can use a ‘branch prediction buffer’ or ‘branch history table’ to see what happened the last time the branch was executed
– Think about loops
Buffers are usually 2-bit
– One bit buffers can flip-flop
– 2 bit buffers need two wrong guesses before they change
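The difference between one and two bits shows up clearly on a loop branch. The Python sketch below models a single buffer entry as a saturating counter, which is one common way to build a 2-bit predictor and is assumed here rather than taken from the slides. The starting state (0, predicting not taken) and the function name are also assumptions.

```python
def mispredictions(outcomes, bits):
    """Count wrong guesses made by one n-bit saturating-counter predictor.

    outcomes: sequence of booleans, True meaning the branch was taken.
    The counter starts at 0 (assumed) and predicts taken in its upper half.
    With bits=1 this is simply "predict whatever happened last time".
    """
    top = (1 << bits) - 1
    counter = 0
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            wrong += 1
        # Move towards the observed outcome, saturating at 0 and top.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


if __name__ == "__main__":
    # A loop branch: taken 9 times, then falls through; the loop runs 10 times.
    loop = ([True] * 9 + [False]) * 10
    print(mispredictions(loop, bits=1))
    print(mispredictions(loop, bits=2))
```

On this pattern the 1-bit entry is wrong twice per pass through the loop (on exit and again on re-entry, 20 misses in total), while the 2-bit entry settles down to one miss per pass after warming up (12 in total).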


It doesn’t stop there

Some processors support ‘superpipeline’
– These are simply pipelines with more stages
Others have ‘superscalar’ pipelines
– Basically the entire pipeline is replicated
– Big overhead in control
– Usually between 2 and 9 datapaths
  • 4 superscalar pipelines give an ideal CPI of 0.25!
Final wrinkle is dynamic pipeline scheduling
– Copes with stalls by stalling the next instruction but allowing subsequent non-dependent instructions to go


Pipelining for real

Both the Pentium and PPC 604 use dynamically scheduled pipelines
– Have a 512 entry branch prediction table


Pipelines in reality

30% of Pentium is legacy

[Figure: processor block diagram with these blocks: Branch; Instruction cache and fetch unit; Instruction decode; Microcode (control); Reorder buffer (control); Reservation stations (control); Memory buffer; I/O unit; Data cache; Integer datapath; Floating-point datapath.]


Pentium fetch/execute

1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.
2. Decode1: Instructions are decoded into the Pentium's internal instruction format. Branch prediction also takes place at this stage.
3. Decode2: Same as above, and microcode ROM kicks in here, if necessary. Also, address computations take place at this stage.
4. Execute: The integer hardware executes the instruction.
5. Write-back: The results of the computation are written back to the register file.


Pentium branch prediction

3 types of prediction
– Only 20% miss


P4 pipeline

20 stages deep


PowerPC processor

Scary!


Beware

Pipelining is not as easy as it looks
– Subtle and complex interplay
Instruction set has a huge impact on pipeline efficiency
– Variable instruction lengths and addressing modes problematic
Increasing depth of pipe does not always improve performance


Performance trade off

[Figure: relative performance (axis from 0.0 to 3.0) against pipeline depth (1, 2, 4, 8, 16).]


Comparisons

[Figure: two charts placing the single-cycle datapath (section 5.3), the multicycle datapath (section 5.4) and the pipelined datapath (Chapter 6). First chart: clock rate (slower to faster) against instruction throughput (instructions per clock cycle or 1/CPI, slower to faster). Second chart: hardware (shared to specialized) against clock cycles of latency for an instruction (1 to several).]


Summary

Pipelines speed up throughput
Pipeline has stages corresponding to execution steps of multi-cycle instructions
Requires buffers and special purpose components to be added
Problems with data hazards
– Forwarding and stalling
Problems with branch prediction
– Do nothing, assume not taken, move comparison early, use branch prediction table
