1
Designing a Pipelined Processor
In this Chapter, we will study
1. Pipelined datapath
2. Pipelined control
3. Data Hazards
4. Forwarding
5. Branch Hazards
6. Performance
2
Pipelining is Natural
Car Wash Station (non-pipelined)
[Figure: Vacuum, Wash, Dry timeline. Each customer (Brian, Steve, Bob) finishes all three stations before the next customer starts.]
Car Wash Station (pipelined)
[Figure: Vacuum, Wash, Dry timeline. Customers (Brian, Steve, Bob, Mike, Gorge) overlap: as one customer moves on to Wash, the next enters Vacuum.]
Vacuum, Wash, and Dry each take 5 minutes
Non-pipelined car wash station: serves one customer every 15 minutes
Pipelined car wash station: the first customer takes 15 minutes; after that, service for a new customer finishes every 5 minutes
3
What we learn from car wash station
• Pipelining does not help latency of a single task, but it helps throughput of entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operating simultaneously using different resources
• Potential speedup = Number of stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to fill the pipe and time to drain the pipe reduce speedup
• The pipe must stall for dependences
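The car wash arithmetic can be sketched in a few lines of Python. This is an illustration only, not part of the original slides: the function name `total_time` and the lockstep assumption (every car advances at the pace of the slowest station) are assumptions made for the example.

```python
def total_time(stage_minutes, customers, pipelined):
    """Minutes until the last customer leaves the car wash."""
    if customers == 0:
        return 0
    if not pipelined:
        # Each customer uses every station before the next one starts.
        return customers * sum(stage_minutes)
    # Assumed lockstep pipeline: all cars advance together, so the
    # slowest station sets the pace for every step.
    step = max(stage_minutes)
    return (len(stage_minutes) + customers - 1) * step


stages = [5, 5, 5]  # Vacuum, Wash, Dry
print(total_time(stages, 4, pipelined=False))  # 60
print(total_time(stages, 4, pipelined=True))   # 30
```

With balanced stages, the ratio of the two times approaches the number of stages as the number of customers grows. With unbalanced stages such as `[5, 10, 5]`, the pipelined time is set by the 10-minute station and the speedup drops.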
4
Performance
• Suppose we execute 100 instructions
• Single-cycle implementation
– 50 ns/cycle x 1 CPI x 100 inst. = 5000 ns
• Multiple-cycle implementation
– 10 ns/cycle x 4.6 CPI (instr. mix) x 100 inst. = 4600 ns
• Ideal pipelined processor
– 10 ns/cycle x (1 CPI x 100 inst. + 4 cycles to fill up the pipe) = 1040 ns
– here we assume 5 stages in the pipeline: IF, ID, EX, MEM, WB
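The three totals above can be reproduced with a short Python sketch. The function names are hypothetical; the cycle times, the CPI of 4.6, and the 5-stage depth are the figures given on the slide.

```python
def single_cycle_ns(instructions, cycle_ns=50):
    # One long cycle per instruction, CPI = 1.
    return cycle_ns * 1 * instructions


def multi_cycle_ns(instructions, cycle_ns=10, cpi=4.6):
    # Short cycles, but several cycles per instruction on average.
    return cycle_ns * cpi * instructions


def pipelined_ns(instructions, cycle_ns=10, stages=5):
    # Ideal pipeline: one instruction completes per cycle once the
    # pipe is full; filling it costs (stages - 1) extra cycles.
    fill_cycles = stages - 1
    return cycle_ns * (instructions + fill_cycles)


n = 100
print(f"single cycle:   {single_cycle_ns(n):.0f} ns")  # 5000 ns
print(f"multiple cycle: {multi_cycle_ns(n):.0f} ns")   # 4600 ns
print(f"pipelined:      {pipelined_ns(n):.0f} ns")     # 1040 ns
```

The fill cost is fixed, so its share of the total shrinks as the instruction count grows.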
5
Why pipeline?
Because the resources are there! Why waste them?
• Pipelining: start fetching and executing the next instruction before the current one has completed
• Remember the performance equation: CPU time = CPI * CC * IC (cycles per instruction x clock cycle time x instruction count)
• Superscalar processing: fetch (and execute) more than one instruction at a time
6
A Pipelined MIPS Processor
• Start the next instruction before the current one has completed
– improves throughput: the total amount of work done in a given time
– instruction latency (execution time, delay time, response time: the time from the start of an instruction to its completion) is not reduced

         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
lw       IFetch   Dec      Exec     Mem      WB
sw                IFetch   Dec      Exec     Mem      WB
R-type                     IFetch   Dec      Exec     Mem      WB

» clock cycle (pipeline stage time) is limited by the slowest stage
» for some instructions, some stages are wasted cycles
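The timing chart for this slide can be generated programmatically. The following is a minimal sketch, assuming one instruction enters the pipeline per cycle with no stalls; `pipeline_chart` is a hypothetical helper name, not something defined in the slides.

```python
STAGES = ["IFetch", "Dec", "Exec", "Mem", "WB"]


def pipeline_chart(instructions):
    """Map each instruction to {cycle: stage}, issuing one per cycle."""
    chart = {}
    for issue, name in enumerate(instructions):
        # Instruction number `issue` (0-based) starts in cycle issue + 1.
        chart[name] = {issue + 1 + s: stage for s, stage in enumerate(STAGES)}
    return chart


chart = pipeline_chart(["lw", "sw", "R-type"])
for name, cycles in chart.items():
    print(name, cycles)

# Last cycle in which any instruction is still active.
total_cycles = max(max(cycles) for cycles in chart.values())
print(total_cycles)  # 7
```

Three instructions finish in 7 cycles, compared with 15 if each had to complete all five stages before the next one started.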
7
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing for the three implementations]
Single Cycle Implementation: lw occupies Cycle 1 and sw occupies Cycle 2; part of the long cycle is marked "Waste".
Multiple Cycle Implementation (Cycles 1 to 10): lw takes IFetch, Dec, Exec, Mem, WB; sw follows with IFetch, Dec, Exec, Mem; the R-type instruction then begins with IFetch.
Pipeline Implementation: lw, sw, and R-type each pass through IFetch, Dec, Exec, Mem, WB, overlapped in time.
8
MIPS Pipeline Datapath Modifications
• What do we need to add/modify in our MIPS datapath?
– State registers between each pipeline stage to isolate them
[Figure: pipelined MIPS datapath. Components: PC, Add (PC + 4), Instruction Memory, Register File, Sign Extend (16 to 32 bits), Shift left 2, Add, ALU, Data Memory. Pipeline state registers: IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB, clocked by the System Clock. Stages: IF:IFetch, ID:Dec, EX:Execute, MEM:MemAccess, WB:WriteBack.]
9
Pipelining the MIPS ISA
• What makes it easy
– all instructions are the same length (32 bits)
» can fetch in the 1st stage and decode in the 2nd stage
– few instruction formats (three) with symmetry across formats
» can begin reading register file in 2nd stage
– memory operations can occur only in loads and stores
» can use the execute stage to calculate memory addresses
– each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)
• What makes it hard
– structural hazards: what if we had only one memory?
– control hazards: what about branches?
– data hazards: what if an instruction’s input operands depend on the output of a previous instruction?
10
Can pipelining cause us trouble?
• Pipeline hazards:
– structural hazards: two different instructions attempt to use the same resource at the same time
– data hazards: an attempt to use data before it is valid, such as when an instruction depends on the result of a prior instruction still in the pipeline
– control hazards: an attempt to act on a branch before its condition has been computed by prior instructions
• Resolving pipeline hazards:
– simplest solution: wait until the hazards are gone
– take action to resolve hazards; the earlier a hazard is resolved, the higher the performance the machine can achieve
• Pipeline overhead:
– need pipeline registers between stages
11
Why Pipeline? For Performance!
[Figure: pipeline diagram. Vertical axis: instruction order (Inst 0 to Inst 4); horizontal axis: time (clock cycles). Each instruction passes through IM, Reg, ALU, DM, Reg. The initial cycles are labeled "Time to fill the pipeline".]
Once the pipeline is full, one instruction is completed every cycle, so CPI = 1
12
Structural Hazard Example
[Figure: instructions 1 to 4 plotted against time; each passes through Mem, Reg, ALU, Mem, Reg]
Two instructions want to access the MEM at the same time
Structural Hazard!
How to resolve this hazard?
13
A Single Memory Would Be a Structural Hazard
[Figure: pipeline diagram. Vertical axis: instruction order (lw, Inst 1 to Inst 4); horizontal axis: time (clock cycles). Each instruction passes through Mem, Reg, ALU, Mem, Reg. In one cycle, lw is reading data from memory while a later instruction is reading its instruction from memory.]
• Fix with separate instr and data memories (I$ and D$)
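The conflict can be located mechanically. The sketch below assumes the 5-stage pipeline used throughout these slides, one instruction issued per cycle with no stalls, and a single shared memory port; `memory_conflicts` is a hypothetical helper name.

```python
def memory_conflicts(ops):
    """Return (cycle, data_accessor, fetcher) index triples that collide.

    Instruction i (0-based) is fetched in cycle i + 1 and, if it is a
    load or store, accesses data memory in cycle i + 4.
    """
    conflicts = []
    for i, op in enumerate(ops):
        if op not in ("lw", "sw"):
            continue
        mem_cycle = i + 4
        fetcher = i + 3  # the instruction whose fetch falls in mem_cycle
        if fetcher < len(ops):
            conflicts.append((mem_cycle, i, fetcher))
    return conflicts


program = ["lw", "inst1", "inst2", "inst3", "inst4"]
print(memory_conflicts(program))  # [(4, 0, 3)]
```

With separate instruction and data memories, the fetch and the data access use different ports and the conflict disappears.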
14
How About Register File Access?
[Figure: pipeline diagram. Vertical axis: instruction order; horizontal axis: time (clock cycles). Each instruction passes through IM, Reg, ALU, DM, Reg.]
add $1,
Inst 1
Inst 2
add $2,$1,
15
How About Register File Access?
[Figure: pipeline diagram. Vertical axis: instruction order; horizontal axis: time (clock cycles). Each instruction passes through IM, Reg, ALU, DM, Reg. Annotations mark the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers.]
add $1,
Inst 1
Inst 2
add $2,$1,
Fix register file access hazard by doing reads in the second half of the cycle and writes in the first half
16
Register Usage Can Cause Data Hazards
[Figure: pipeline diagram in instruction order; each instruction passes through IM, Reg, ALU, DM, Reg]
add $1,
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
• Dependencies backward in time cause hazards
• Read after write data hazard
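Read-after-write hazards in a sequence like the one above can be found with a simple scan. This is a sketch under stated assumptions: a 5-stage pipeline with no forwarding, a register file that writes in the first half of a cycle and reads in the second half (so only the next two instructions are at risk), and placeholder source operands `$2` and `$3` for the `add`, whose operands the slide does not give. `Instr` and `raw_hazards` are hypothetical names.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) triples."""
    hazards = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == "$0":
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                hazards.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break  # a later write supersedes this producer
    return hazards


program = [
    Instr("add", "$1", ("$2", "$3")),  # source operands are placeholders
    Instr("sub", "$4", ("$1", "$5")),
    Instr("and", "$6", ("$1", "$7")),
    Instr("xor", "$4", ("$1", "$5")),
    Instr("or",  "$8", ("$1", "$9")),
]
print(raw_hazards(program))  # [(0, 1, '$1'), (0, 2, '$1')]
```

Under these assumptions only `sub` and `and` read `$1` too early; `xor` reads it in the same cycle that `add` writes it, and `or` reads it afterwards.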
18
Loads Can Cause Data Hazards
[Figure: pipeline diagram in instruction order; each instruction passes through IM, Reg, ALU, DM, Reg]
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
• Dependencies backward in time cause hazards
• Load-use data hazard
19
One Way to “Fix” a Data Hazard
[Figure: pipeline diagram in instruction order; each instruction passes through IM, Reg, ALU, DM, Reg, with two stall cycles inserted after the add]
add $1,
stall
stall
sub $4,$1,$5
and $6,$1,$7
Can fix a data hazard by waiting (stall), but this impacts CPI
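The cost of stalling can be put into numbers with a small sketch. It assumes the 5-stage pipeline from these slides, no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer may read in the same cycle the producer writes back. The function names are hypothetical.

```python
def stalls_needed(distance):
    """Stall cycles when the consumer is `distance` instructions behind.

    The producer writes back 3 cycles after its decode stage, and the
    consumer reads registers in its own decode stage, so a gap of 3 or
    more instructions needs no stall.
    """
    return max(0, 3 - distance)


def total_cycles(n_instructions, stall_cycles, stages=5):
    # Fill time plus one cycle per instruction plus any stall cycles.
    return n_instructions + (stages - 1) + stall_cycles


stalls = stalls_needed(1)        # sub directly follows add
print(stalls)                    # 2
print(total_cycles(3, stalls))   # 9
```

Without the hazard the three instructions would finish in 7 cycles, so the two stalls are a direct addition to execution time.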
20
Another Way to “Fix” a Data Hazard
[Figure: pipeline diagram in instruction order; each instruction passes through IM, Reg, ALU, DM, Reg]
add $1,
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
Fix data hazards by forwarding results as soon as they are available to where they are needed
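The selection logic behind forwarding can be sketched as a function that decides where one ALU operand should come from. The signal names (`ex_mem_regwrite`, `mem_wb_rd`, and so on) follow the common textbook naming for a 5-stage MIPS pipeline and are assumptions here; the slides do not define them.

```python
def forward_select(id_ex_rs, ex_mem_rd, ex_mem_regwrite,
                   mem_wb_rd, mem_wb_regwrite):
    """Choose the source of an ALU operand for the instruction in EX."""
    # Most recent result first: the instruction one ahead is in MEM.
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_rs:
        return "EX/MEM"
    # Otherwise the instruction two ahead, now in WB.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_rs:
        return "MEM/WB"
    # No pending write to this register: use the value read in decode.
    return "REGFILE"


# sub $4,$1,$5 in EX while add $1,... is in MEM
print(forward_select(1, ex_mem_rd=1, ex_mem_regwrite=True,
                     mem_wb_rd=0, mem_wb_regwrite=False))  # EX/MEM
# and $6,$1,$7 in EX while add $1,... is in WB and sub $4,... is in MEM
print(forward_select(1, ex_mem_rd=4, ex_mem_regwrite=True,
                     mem_wb_rd=1, mem_wb_regwrite=True))   # MEM/WB
```

Checking the EX/MEM register first matters when two in-flight instructions write the same register: the younger result is the one the consumer should see. Register 0 is excluded because it always reads as zero in MIPS.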
22
Forwarding with Load-use Data Hazards
[Figure: pipeline diagram in instruction order; each instruction passes through IM, Reg, ALU, DM, Reg]
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
23
Forwarding with Load-use Data Hazards
[Figure: pipeline diagram in instruction order; each instruction passes through IM, Reg, ALU, DM, Reg]
lw $1,4($2)
sub $4,$1,$5
and $6,$1,$7
xor $4,$1,$5
or $8,$1,$9
• Will still need one stall cycle even with forwarding
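The one remaining stall can be detected in the decode stage. The sketch below uses the usual textbook condition for a 5-stage MIPS pipeline; the signal names (`id_ex_memread`, `id_ex_rt`, `if_id_rs`, `if_id_rt`) are assumptions, since the slides do not name them.

```python
def load_use_stall(id_ex_memread, id_ex_rt, if_id_rs, if_id_rt):
    """True if the instruction in decode must wait one cycle.

    A load's data only becomes available at the end of its MEM stage,
    which is too late for an instruction that needs it in EX during
    that same cycle.
    """
    return bool(
        id_ex_memread
        and id_ex_rt != 0
        and id_ex_rt in (if_id_rs, if_id_rt)
    )


# lw $1,4($2) is in EX; sub $4,$1,$5 is in decode and reads $1.
print(load_use_stall(True, id_ex_rt=1, if_id_rs=1, if_id_rt=5))   # True
# One cycle later sub is in EX (not a load); and $6,$1,$7 is in decode.
print(load_use_stall(False, id_ex_rt=5, if_id_rs=1, if_id_rt=7))  # False
```

After the single stall, the loaded value can be forwarded from the MEM/WB register to the ALU input, so later readers of `$1` proceed without delay.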
24
Branch Instructions Cause Control Hazards
[Figure: pipeline diagram. Instruction order: beq, lw, Inst 3, Inst 4; each instruction passes through IM, Reg, ALU, DM, Reg]
• Dependencies backward in time cause hazards
25
One Way to “Fix” a Control Hazard
[Figure: pipeline diagram. Instruction order: beq, three stall cycles, lw, Inst 3; each instruction passes through IM, Reg, ALU, DM, Reg]
Fix branch hazard by waiting (stall), but this affects CPI
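How much stalling on branches affects CPI depends on how often branches occur. The sketch below uses the three stall cycles shown in the figure; the 20% branch frequency is an assumed value chosen only for illustration, and `effective_cpi` is a hypothetical name.

```python
def effective_cpi(branch_fraction, stall_cycles=3, base_cpi=1.0):
    """Average CPI when every branch stalls the pipeline."""
    return base_cpi + branch_fraction * stall_cycles


# Assumed workload: 20% of instructions are branches.
print(f"{effective_cpi(0.20):.2f}")  # 1.60
```

Under that assumption the pipeline runs at 1.6 cycles per instruction instead of the ideal 1, which is why resolving branches earlier in the pipeline pays off.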
26
Corrected Datapath to Save RegWrite Addr
• Need to preserve the destination register address in the pipeline state registers
[Figure: pipelined MIPS datapath with pipeline state registers IF/ID, ID/EX, EX/MEM, MEM/WB. Components: PC, Add (PC + 4), Instruction Memory, Register File, Sign Extend (16 to 32 bits), Shift left 2, Add, ALU, Data Memory.]
28
MIPS Pipeline Control Path Modifications
• All control signals can be determined during Decode
– and held in the state registers between pipeline stages
[Figure: pipelined MIPS datapath with a Control unit added and pipeline state registers IF/ID, ID/EX, EX/MEM, MEM/WB. Components: PC, Add (PC + 4), Instruction Memory, Register File, Sign Extend (16 to 32 bits), Shift left 2, Add, ALU, Data Memory.]
29
Other Pipeline Structures Are Possible
• What about the (slow) multiply operation?
– make the clock twice as slow or …
– let it take two cycles (since it doesn’t use the DM stage)
[Figure: pipeline stages IM, Reg, ALU, DM, Reg, with a MUL unit]
• What if the data memory access is twice as slow as the instruction memory?
– make the clock twice as slow or …
– let data memory access take two cycles (and keep the same clock rate)
[Figure: pipeline stages IM, Reg, ALU, DM1, DM2, Reg]
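The two options for a slow data memory can be compared with a rough calculation. This sketch assumes an ideal pipeline with no hazards and reuses the 10 ns cycle and 100-instruction run from the earlier performance slide; `run_time_ns` is a hypothetical helper name.

```python
def run_time_ns(instructions, cycle_ns, stages):
    # Ideal pipeline: (stages - 1) cycles to fill, then one per instruction.
    return cycle_ns * (instructions + stages - 1)


# Option 1: keep 5 stages but double the clock period.
slow_clock = run_time_ns(100, cycle_ns=20, stages=5)
# Option 2: keep the 10 ns clock and split DM into DM1 and DM2.
two_cycle_dm = run_time_ns(100, cycle_ns=10, stages=6)

print(slow_clock)    # 2080
print(two_cycle_dm)  # 1050
```

Splitting the slow stage costs one extra fill cycle, while slowing the clock doubles the cost of every cycle. The estimate leaves out hazards; a deeper memory stage would likely add to the load-use delay, which this sketch does not model.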
30
Sample Pipeline Alternatives
• ARM7 (stages: IM, Reg, EX)
– IM: PC update, IM access
– Reg: decode, reg access
– EX: ALU op, DM access, shift/rotate, commit result (write back)
• StrongARM-1 (stages: IM, Reg, ALU, DM, Reg)
• XScale (stages: IM1, IM2, Reg, SHFT, ALU, DM1, DM2/Reg)
– IM1: PC update, BTB access, start IM access
– IM2: IM access
– Reg: decode, reg 1 access
– SHFT: shift/rotate, reg 2 access
– ALU: ALU op
– DM1: start DM access, exception
– DM2/Reg: DM write, reg write
31
Summary
• All modern day processors use pipelining
• Pipelining doesn’t help latency of a single task; it helps throughput of the entire workload
• Potential speedup: a CPI of 1 and a fast CC
• Pipeline rate is limited by the slowest pipeline stage
– Unbalanced pipe stages make for inefficiencies
– The time to “fill” the pipeline and the time to “drain” it can impact speedup for deep pipelines and short code runs
• Must detect and resolve hazards
– Stalling negatively affects CPI (makes CPI greater than the ideal of 1)