Gary Marsden Slide 1 University of Cape Town
Pipelining
Technique where multiple instructions are overlapped in execution (key for speed)
[Figure: two timelines of tasks A, B, C and D in task order, plotted against time from 6 PM to 2 AM]
Gary Marsden Slide 2 University of Cape Town
Analogy
Each step is called a pipe stage or pipe segment
Pipelining improves throughput rather than the speed of a given instruction
– Concorde vs 747
Only possible in multi-cycle datapaths
All stages must be ready to proceed at the same time
Clock cycle determined by slowest stage
Goal: balance length of each stage
Gary Marsden Slide 3 University of Cape Town
Pipelined datapath
Having 5 steps in an instruction means a 5 stage pipeline
– 5 instructions being executed at a given time
1. IF: Instruction fetch
2. ID: Instruction decode
3. EX: Execute and effective address calculation
4. MEM: Memory access
5. WB: Write back
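The overlap of the five stages can be sketched in a few lines of Python. This is only an illustration, assuming an ideal pipeline with no stalls; the function name `pipeline_diagram` and the sample program of three loads are choices made for this sketch.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return one row per instruction: the stage it occupies in each clock cycle."""
    total_cycles = len(instructions) + len(STAGES) - 1
    rows = []
    for i, _ in enumerate(instructions):
        row = []
        for cycle in range(total_cycles):
            stage_index = cycle - i  # instruction i enters IF in cycle i (0-based)
            in_pipe = 0 <= stage_index < len(STAGES)
            row.append(STAGES[stage_index] if in_pipe else "--")
        rows.append(row)
    return rows

program = ["lw $1, 100($0)", "lw $2, 200($0)", "lw $3, 300($0)"]
for text, row in zip(program, pipeline_diagram(program)):
    print(f"{text:16s}", " ".join(f"{stage:3s}" for stage in row))
```

Three instructions occupy 3 + 5 - 1 = 7 clock cycles in this model, and the pipeline only holds five instructions at once when the program has at least five.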
Gary Marsden Slide 4 University of Cape Town
Comparative timing
[Figure: program execution order (in instructions) against time for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Top: without pipelining, each instruction (Instruction fetch, Reg, ALU, Data access, Reg) takes 8 ns and the next one starts only when it has finished. Bottom: with pipelining, each stage takes 2 ns and a new instruction starts every 2 ns.]
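The totals in the timing figure can be checked with a little arithmetic. The sketch below assumes the figure's numbers (8 ns per non-pipelined instruction, five 2 ns pipeline stages) and an ideal pipeline with no stalls; the function names are made up for illustration.

```python
INSTRUCTION_TIME_NS = 8  # one lw without pipelining (from the figure)
STAGE_TIME_NS = 2        # one pipeline stage (from the figure)
NUM_STAGES = 5

def non_pipelined_time(n):
    """Total time for n instructions executed one after another."""
    return n * INSTRUCTION_TIME_NS

def pipelined_time(n):
    """Total time for n instructions in an ideal pipeline:
    the first takes NUM_STAGES stage times, each later one adds one more."""
    return (NUM_STAGES + n - 1) * STAGE_TIME_NS

for n in (1, 3, 1000):
    print(n, non_pipelined_time(n), pipelined_time(n))
```

For the three loads this gives 24 ns against 14 ns. A single instruction actually gets slower (10 ns instead of 8 ns) because every stage is stretched to the length of the slowest one, but for long instruction sequences the ratio approaches 8 / 2 = 4.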
Gary Marsden Slide 5 University of Cape Town
View of datapath
[Figure: the datapath (PC, instruction memory, adders, registers, sign extend, shift left 2, ALU, data memory, multiplexers) divided into five stages:
IF: Instruction fetch
ID: Instruction decode/register file read
EX: Execute/address calculation
MEM: Memory access
WB: Write back]
Gary Marsden Slide 6 University of Cape Town
Progression in pipe
General left-right progression
– Like a car assembly line
Two exceptions
– Write back stage places the result back into a register, which is in the middle of the datapath
– Selection of PC value - could be a branch
Right-left flow may affect subsequent instructions
Like the multi-cycle datapath, we need registers to hold values between stages
Gary Marsden Slide 7 University of Cape Town
Symbolic view
[Figure: pipeline diagram over clock cycles CC 1 to CC 7, in program execution order, for lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through IM, Reg, ALU, DM, Reg and starts one cycle after the previous one.]
Gary Marsden Slide 8 University of Cape Town
Extra registers
[Figure: the datapath with the pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB inserted between the stages]
Gary Marsden Slide 9 University of Cape Town
Execution of load instruction
[Figure: the pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) showing the execution of lw]
Gary Marsden Slide 10 University of Cape Town
Execution of store
[Figure: the pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) showing the execution of sw]
Gary Marsden Slide 11 University of Cape Town
Ooops
When doing a write back for ‘lw’ we don’t know where to write!
[Figure: the pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB]
Gary Marsden Slide 12 University of Cape Town
A note on notations
[Figure: the same two instructions, lw $10, 20($1) and sub $11, $2, $3, drawn in program execution order over CC 1 to CC 6 in two notations: with hardware symbols (IM, Reg, ALU, DM, Reg) and with stage names (Instruction fetch, Instruction decode, Execution, Data access, Write back)]
Gary Marsden Slide 13 University of Cape Town
Pipeline control
Just like we did for the single datapath machine, but with a twist
– Label control lines on existing datapath
– Assume PC written on each cycle (no PCWrite)
– To control a pipeline stage, need only control values for that stage
– Usual five stages for control: IF, ID, EX, MEM, WB
Gary Marsden Slide 14 University of Cape Town
Pipeline control diagram
[Figure: the pipelined datapath (IF/ID, ID/EX, EX/MEM, MEM/WB) with its control lines labelled: PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg. Instruction fields shown: Instruction[15–0], Instruction[20–16], Instruction[15–11].]
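Gary Marsden Slide 15 University of Cape Town
Buffering pipeline control
[Figure: the Control unit derives the control values from the instruction in IF/ID; they travel with the instruction through the pipeline registers. ID/EX holds the EX, M and WB values, EX/MEM holds M and WB, MEM/WB holds WB.]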
Gary Marsden Slide 16 University of Cape Town
Another scary picture
[Figure: the complete pipelined datapath with the Control unit and the EX, M and WB control values carried through ID/EX, EX/MEM and MEM/WB; control lines PCSrc, RegWrite, RegDst, ALUSrc, ALUOp, Branch, MemRead, MemWrite, MemtoReg]
Gary Marsden Slide 17 University of Cape Town
Observations
Although a new instruction starts every clock cycle, each still needs 5 cycles to complete
Takes four cycles before we are up to full efficiency
When a stage is inactive, its control lines are deasserted
Control sequencing is implicit in the pipeline stages
– No microinstructions like before
Gary Marsden Slide 18 University of Cape Town
Data hazard
Sequences of instructions with dependencies make high-performance pipelines hard to design
– sub $2, $1, $3
– and $12, $2, $5
– Oopsie!
Resolving
– Forbid the compiler to do this
• Interleave only independent instructions
• Use a No-op (wasteful)
– Stall
– Forward
Gary Marsden Slide 19 University of Cape Town
Data hazard diagram
[Figure: pipeline diagram over CC 1 to CC 9, in program execution order, for:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9]
Gary Marsden Slide 20 University of Cape Town
Overcoming data hazards
The ‘add’ problem we can overcome with hardware design
– Write register file in first half of clock cycle, read in second
Doesn’t help with ‘and’ and ‘or’
– Need to detect hazard and forward correct value
Gary Marsden Slide 21 University of Cape Town
Detecting hazards
We can’t get the computer to draw a diagram, instead we use the following notation
– 1(a) EX/MEM.WriteReg = IF/ID.ReadReg1
– 1(b) EX/MEM.WriteReg = IF/ID.ReadReg2
– 2(a) MEM/WB.WriteReg = IF/ID.ReadReg1
– 2(b) MEM/WB.WriteReg = IF/ID.ReadReg2
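The four conditions translate almost directly into code. The sketch below is an illustration only: it uses the slide's field names as plain function arguments and assumes (this is not on the slide) that `None` stands for an older instruction that writes no register.

```python
def detect_hazards(ex_mem_write_reg, mem_wb_write_reg, read_reg1, read_reg2):
    """Return the labels of the hazard conditions that hold.

    ex_mem_write_reg / mem_wb_write_reg: register number written by the
    instruction held in that pipeline register, or None (assumption) if
    that instruction writes no register.
    read_reg1 / read_reg2: register numbers read by the later instruction.
    """
    hazards = []
    if ex_mem_write_reg is not None:
        if ex_mem_write_reg == read_reg1:
            hazards.append("1a")
        if ex_mem_write_reg == read_reg2:
            hazards.append("1b")
    if mem_wb_write_reg is not None:
        if mem_wb_write_reg == read_reg1:
            hazards.append("2a")
        if mem_wb_write_reg == read_reg2:
            hazards.append("2b")
    return hazards

# sub $2, $1, $3 followed directly by and $12, $2, $5
print(detect_hazards(2, None, 2, 5))   # ['1a']
# sub $2, $1, $3 two instructions ahead of or $13, $6, $2
print(detect_hazards(None, 2, 6, 2))   # ['2b']
```

A real detector would also check that the older instruction actually asserts RegWrite; here that check is folded into the `None` convention.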
Gary Marsden Slide 22 University of Cape Town
Forwarding
If we can detect a hazard, we can forward the correct value as soon as it is available
– We will see how to do this soon
By ‘forwarding’ we can pull the value from the appropriate pipeline register rather than waiting for it to be written back at the end of an instruction
Gary Marsden Slide 23 University of Cape Town
Forwarding to resolve hazards
[Figure: pipeline diagram over CC 1 to CC 9, in program execution order, for:
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 in CC 1 to CC 4, 10/-20 in CC 5, -20 in CC 6 to CC 9
Value of EX/MEM: -20 in CC 4, X in all other cycles
Value of MEM/WB: -20 in CC 5, X in all other cycles]
Gary Marsden Slide 24 University of Cape Town
Achieving control for forwarding
[Figure: a. No forwarding: Registers, ID/EX, ALU, EX/MEM, Data memory, MEM/WB. b. With forwarding: multiplexers on both ALU inputs, selected by ForwardA and ForwardB from a Forwarding unit whose inputs are Rs and Rt from ID/EX, EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
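A forwarding unit of this shape can be sketched as a pure function. The signal names Rs, Rt, EX/MEM.RegisterRd and MEM/WB.RegisterRd come from the figure; everything else is an assumption made for the sketch: the RegWrite flags, the check that register $0 is never forwarded, the rule that the more recent result in EX/MEM takes priority, and the string labels used in place of real mux select encodings.

```python
def forwarding_unit(id_ex_rs, id_ex_rt,
                    ex_mem_rd, ex_mem_reg_write,
                    mem_wb_rd, mem_wb_reg_write):
    """Return (ForwardA, ForwardB): where each ALU input should come from.

    "REG"    -> value read from the register file (no forwarding)
    "EX/MEM" -> ALU result of the previous instruction
    "MEM/WB" -> result of the instruction two ahead
    """
    def select(source_reg):
        # Assumed priority: the newer value (EX/MEM) wins over MEM/WB.
        if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == source_reg:
            return "EX/MEM"
        if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == source_reg:
            return "MEM/WB"
        return "REG"

    return select(id_ex_rs), select(id_ex_rt)

# and $12, $2, $5 in EX while sub $2, $1, $3 is in MEM
print(forwarding_unit(2, 5, 2, True, 0, False))    # ('EX/MEM', 'REG')
# or $13, $6, $2 in EX while sub is in WB and 'and' (writes $12) is in MEM
print(forwarding_unit(6, 2, 12, True, 2, True))    # ('REG', 'MEM/WB')
```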
Gary Marsden Slide 25 University of Cape Town
Until…
[Figure: the pipelined datapath with Control, the EX/M/WB control values in ID/EX, EX/MEM and MEM/WB, and the Forwarding unit. IF/ID.RegisterRs, IF/ID.RegisterRt and IF/ID.RegisterRd are passed into ID/EX as Rs, Rt and Rd; the Forwarding unit also receives EX/MEM.RegisterRd and MEM/WB.RegisterRd.]
Gary Marsden Slide 26 University of Cape Town
Stalls
Forwarding is an efficient way to solve data hazards, but not all can be solved this way
[Figure: pipeline diagram over CC 1 to CC 9, in program execution order, for:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7]
Gary Marsden Slide 27 University of Cape Town
Load Word problems
We cannot forward when an instruction tries to read a register following a lw that is writing to that register
We need to detect this
– Hazard detection unit in addition to the forwarding unit
Conditions
– If (ID/EX.MemRead AND
– ((ID/EX.RegWrite = IF/ID.RegRead1) OR
– (ID/EX.RegWrite = IF/ID.RegRead2)))
lw is the only instruction to set the MemRead line
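The stall condition can be written as a one-line predicate. This is a sketch using the slide's notation as argument names; here `id_ex_write_reg` is the number of the register the load will write (what the slide writes as ID/EX.RegWrite), and the function name `must_stall` is made up.

```python
def must_stall(id_ex_mem_read, id_ex_write_reg, if_id_read_reg1, if_id_read_reg2):
    """True when the instruction in ID needs a value that the lw in EX
    has not yet fetched from memory (a load-use hazard)."""
    return bool(id_ex_mem_read and
                id_ex_write_reg in (if_id_read_reg1, if_id_read_reg2))

# lw $2, 20($1) in EX, and $4, $2, $5 in ID: stall
print(must_stall(True, 2, 2, 5))    # True
# sub $2, $1, $3 in EX (MemRead not set): forwarding is enough
print(must_stall(False, 2, 2, 5))   # False
```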
Gary Marsden Slide 28 University of Cape Town
Stalls
Once detected, we have to stall execution until the value is available (whereupon it is forwarded)
Sometimes called a ‘bubble’, the idea being that we send an air bubble up the pipe, not data
Not strictly true
– The control unit just gets the stalled stages of the pipeline to repeat what they were doing until the value is available
Gary Marsden Slide 29 University of Cape Town
Bubbles in the pipe
[Figure: pipeline diagram over CC 1 to CC 10, in program execution order, for:
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
A bubble is inserted after the lw, so the instructions that follow it are delayed by one cycle.]
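The effect of the bubble on the schedule can be reproduced with a small model. It is a sketch under simplifying assumptions: full forwarding, so the only stall is one cycle when an instruction reads the register loaded by the lw immediately before it. The `Instr` tuple and the function `ex_cycles` are invented for this illustration.

```python
from typing import NamedTuple, Tuple

class Instr(NamedTuple):
    text: str
    dest: int                 # register number written
    sources: Tuple[int, ...]  # register numbers read
    is_load: bool = False

def ex_cycles(program):
    """Clock cycle (1-based) in which each instruction is in its EX stage."""
    cycles = []
    for i, instr in enumerate(program):
        if i == 0:
            cycles.append(3)  # IF in CC 1, ID in CC 2, EX in CC 3
            continue
        prev = program[i - 1]
        # Load-use hazard: wait one extra cycle for the loaded value.
        bubble = 1 if prev.is_load and prev.dest in instr.sources else 0
        cycles.append(cycles[-1] + 1 + bubble)
    return cycles

program = [
    Instr("lw $2, 20($1)", 2, (1,), is_load=True),
    Instr("and $4, $2, $5", 4, (2, 5)),
    Instr("or $8, $2, $6", 8, (2, 6)),
    Instr("add $9, $4, $2", 9, (4, 2)),
    Instr("slt $1, $6, $7", 1, (6, 7)),
]
total_cycles = ex_cycles(program)[-1] + 2  # MEM and WB follow EX
print(ex_cycles(program), total_cycles)
```

The five instructions would finish in 9 cycles without the hazard; the single bubble pushes the last write back to CC 10, matching the diagram.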
Gary Marsden Slide 30 University of Cape Town
Adding hazard detection
[Figure: the pipelined datapath with a Hazard detection unit added alongside the Forwarding unit. The Hazard detection unit takes ID/EX.MemRead, ID/EX.RegisterRt, IF/ID.RegisterRs and IF/ID.RegisterRt as inputs and drives PCWrite, IF/IDWrite and a multiplexer that can replace the control values entering ID/EX with 0.]
Gary Marsden Slide 31 University of Cape Town
Branch hazards
Another type of hazard involves branches: an instruction must be fetched every cycle to keep the pipeline full… but the decision about a branch is not made until the MEM stage
Called a ‘control’ or a ‘branch’ hazard
– Occur less frequently than data hazards
– Are simple to understand
– Not much we can do really
Gary Marsden Slide 32 University of Cape Town
Effect of a branch
[Figure: pipeline diagram over CC 1 to CC 9, in program execution order, with instruction addresses:
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)]
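The addresses in the diagram follow from the way the branch target is formed: the datapath adds the sign-extended offset, shifted left by 2, to PC + 4. A quick check in Python (the function name is made up for this sketch):

```python
def branch_target(pc, offset_words):
    """Target address of a taken beq: (PC + 4) + (offset << 2)."""
    return pc + 4 + (offset_words << 2)

print(branch_target(40, 7))  # 72, the address of lw $4, 50($7)

# Instructions fetched after the beq before the decision is known in MEM:
fetched_before_decision = [40 + 4 * k for k in (1, 2, 3)]
print(fetched_before_decision)  # [44, 48, 52]
```

If the branch is taken, the three instructions at 44, 48 and 52 have already entered the pipeline by the time the lw at 72 can be fetched.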
Gary Marsden Slide 33 University of Cape Town
Coping with branching
Stall subsequent instructions on ‘beq’
– This increases the cost of a branch from one cycle to four cycles
Assume branch not taken
– Carry on as before
– Only penalty will be if the branch is taken
– We can then ‘flush’ buffers
Gary Marsden Slide 34 University of Cape Town
Lessening the impact
Currently branch decisions are made at stage 4
We could save one stage by getting the value from the buffer at stage 3 (like forwarding)
Can even calculate the branch as early as the ID stage!
– Move branch adder from MEM to ID stage
– Add a bunch of XOR gates to do comparison of register values (do not use the ALU)
– Need to alter forwarding unit to cope with this
Impact down to one lost cycle
Gary Marsden Slide 35 University of Cape Town
Datapath to lessen branch impact
[Figure: the pipelined datapath with the branch adder (Shift left 2, Sign extend) and a register comparator (=) moved into the ID stage, an IF.Flush control line into IF/ID, plus the Hazard detection unit and Forwarding unit]
Gary Marsden Slide 36 University of Cape Town
Branch prediction
‘Assume branch not taken’ is a very primitive form of branch prediction
We can use a ‘branch prediction buffer’ or ‘branch history table’ to see what happened the last time the branch was executed
– Think about loops
Buffers are usually 2-bit
– One bit buffers can flip-flop
– 2 bit buffers need two wrong guesses before they change
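The difference between the two schemes shows up on a simple loop. The sketch below models a single buffer entry; it assumes a saturating counter for the 2-bit case, which is one common way to build such a predictor and not necessarily the exact state machine the slide has in mind. Class and function names are invented for the illustration.

```python
class OneBitPredictor:
    """Predicts whatever the branch did last time."""
    def __init__(self):
        self.last_taken = True
    def predict(self):
        return self.last_taken
    def update(self, taken):
        self.last_taken = taken

class TwoBitPredictor:
    """2-bit saturating counter: 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self):
        self.counter = 3  # start in 'strongly taken'
    def predict(self):
        return self.counter >= 2
    def update(self, taken):
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

def mispredictions(predictor, outcomes):
    """Count wrong guesses over a sequence of actual branch outcomes."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            wrong += 1
        predictor.update(taken)
    return wrong

# A loop branch taken 9 times then not taken on exit; the loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
print(mispredictions(OneBitPredictor(), outcomes))  # 3
print(mispredictions(TwoBitPredictor(), outcomes))  # 2
```

The one-bit entry is wrong at each loop exit and again on re-entry, while the two-bit counter, starting from its strong state, is wrong only at the exits.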
Gary Marsden Slide 37 University of Cape Town
It doesn’t stop there
Some processors support ‘superpipelining’
– These are simply pipelines with more stages
Others have ‘superscalar’ pipelines
– Basically the entire pipeline is replicated
– Big overhead in control
– Usually between 2 and 9 datapaths
• 4 superscalar pipelines give a CPI of 0.25!
Final wrinkle is dynamic pipeline scheduling
– Copes with stalls by stalling the next instruction but allowing subsequent, non-dependent instructions to go
Gary Marsden Slide 38 University of Cape Town
Pipelining for real
Both the Pentium and PPC 604 use dynamically scheduled pipelines
– Have a 512-entry branch prediction table
Gary Marsden Slide 39 University of Cape Town
Pipelines in reality
30% of Pentium is legacy
[Figure: processor block layout labelled Branch, Instruction cache and fetch unit, Instruction decode, Microcode (control), Reorder buffer (control), Reservation stations (control), Memory buffer, I/O unit, Data cache, Integer datapath, Floating-point datapath]
Gary Marsden Slide 40 University of Cape Town
Pentium fetch/execute
1. Prefetch/Fetch: Instructions are fetched from the instruction cache and aligned in prefetch buffers for decoding.
2. Decode1: Instructions are decoded into the Pentium's internal instruction format. Branch prediction also takes place at this stage.
3. Decode2: Same as above, and microcode ROM kicks in here, if necessary. Also, address computations take place at this stage.
4. Execute: The integer hardware executes the instruction.
5. Write-back: The results of the computation are written back to the register file.
Gary Marsden Slide 41 University of Cape Town
Pentium branch prediction
3 types of prediction
– Only 20% miss
Gary Marsden Slide 42 University of Cape Town
P4 pipeline
20 stages deep
Gary Marsden Slide 43 University of Cape Town
PowerPC processor
Scary!
Gary Marsden Slide 44 University of Cape Town
Beware
Pipelining is not as easy as it looks
– Subtle and complex interplay
Instruction set has a huge impact on pipeline efficiency
– Variable instruction lengths and addressing modes are problematic
Increasing depth of pipe does not always improve performance
Gary Marsden Slide 45 University of Cape Town
Performance trade off
[Figure: relative performance (0.0 to 3.0) plotted against pipeline depth (1, 2, 4, 8, 16)]
Gary Marsden Slide 46 University of Cape Town
Comparisons
[Figure: two charts placing the single-cycle datapath (section 5.3), the multicycle datapath (section 5.4) and the pipelined datapath (Chapter 6). First chart: clock rate (slower to faster) against instruction throughput (instructions per clock cycle or 1/CPI, slower to faster). Second chart: clock cycles of latency for an instruction (1 to several) against hardware (shared to specialized).]
Gary Marsden Slide 47 University of Cape Town
Summary
Pipelines speed up throughput
Pipeline has stages corresponding to execution steps of multi-cycle instructions
Requires buffers and special purpose components to be added
Problems with data hazards
– Forwarding and stalling
Problems with branches
– Do nothing, assume not taken, move comparison early, use branch prediction table