Computer Organization and Design Pipelining Montek Singh Mon, Dec 2, 2013 Lecture 16.
Computer Organization and Design
Pipelining
Montek Singh
Mon, Dec 2, 2013
Lecture 16
Pipelining
Between 411 problem sets, I haven't had a minute to do laundry.
Now that's what I call dirty laundry.
Read Chapter 4.5-4.8
Laundry Example
Device: Washer
Function: Fill, Agitate, Spin
WasherPD = 30 mins
Device: Dryer
Function: Heat, Spin
DryerPD = 60 mins
INPUT: dirty laundry
OUTPUT: 4 more weeks
Laundry: One Load at a Time
Everyone knows that the real reason one puts off doing laundry so long is not because we procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart.
Step 1:
Step 2:
Total = WasherPD + DryerPD
= 30 + 60 = 90 mins
Laundry: Doing N Loads!
Here's how one would do laundry the "unpipelined" way.
Step 1:
Step 2:
Step 3:
Step 4:
Total = N * (WasherPD + DryerPD)
= N * 90 mins
…
Laundry: Doing N Loads!
Here's how to "pipeline" the laundry process. That's why we wait!
Step 1:
Step 2:
Step 3:
Total = N * Max(WasherPD, DryerPD)
= N * 60 mins
…Actually, it’s more like N*60 + 30 if we account for the startup time (i.e., filling up the pipeline) correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
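The laundry timing above can be sketched as plain arithmetic; a minimal example (function names are illustrative, using the slide's 30-minute washer and 60-minute dryer):

```python
def unpipelined_total(n_loads, washer_pd=30, dryer_pd=60):
    """Each load finishes the dryer before the next enters the washer."""
    return n_loads * (washer_pd + dryer_pd)

def pipelined_total(n_loads, washer_pd=30, dryer_pd=60):
    """A new load enters every max(WasherPD, DryerPD) minutes, plus the
    startup time needed to fill the pipeline."""
    stage = max(washer_pd, dryer_pd)
    return n_loads * stage + (washer_pd + dryer_pd - stage)

print(unpipelined_total(4))  # 360: N*90 for N = 4
print(pipelined_total(4))    # 270: N*60 + 30 for N = 4
```

The `+ 30` startup term is exactly the pipeline-fill correction mentioned above; in steady-state analysis it is usually dropped.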
Recall Our Performance Measures
Latency: delay from input to corresponding output
Unpipelined Laundry = 90 mins
Pipelined Laundry = 120 mins
Throughput: rate at which inputs or outputs are processed
Unpipelined Laundry = 1/90 outputs/min
Pipelined Laundry = 1/60 outputs/min
Assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available.
Even though we increase latency, it takes less time per load.
Okay, Back to Circuits…
[Diagram: combinational circuit; F and G each take input X, their outputs feed H, which produces P(X)]
For combinational logic: latency = tPD, throughput = 1/tPD. We can't get the answer faster, but are we making effective use of our hardware at all times?
F & G are "idle", just holding their outputs stable while H performs its computation.
Pipelined Circuits
Use registers to hold H's input stable!
[Diagram: the same circuit with a register after F, after G, and after H; propagation delays 15, 20, 25 ns]
Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.
Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):

                   latency        throughput
unpipelined        45 ns          1/45
2-stage pipeline   50 ns (worse)  1/25 (better)

Pipelining uses registers to improve the throughput of combinational circuits.
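The table above follows from two small formulas; a sketch with the slide's delays (function names are illustrative):

```python
def combinational_latency(f_pd=15, g_pd=20, h_pd=25):
    # unpipelined: latency along the longest path; throughput = 1/latency
    return max(f_pd, g_pd) + h_pd

def two_stage_pipeline(f_pd=15, g_pd=20, h_pd=25):
    # clock period is set by the slowest stage; latency = 2 clock periods
    clock = max(f_pd, g_pd, h_pd)
    latency = 2 * clock
    return clock, latency

print(combinational_latency())        # 45, so throughput 1/45
clock, latency = two_stage_pipeline()
print(clock, latency)                 # 25 50: throughput 1/25, latency 50 ns
```

Latency gets worse (45 → 50 ns) but throughput improves (1/45 → 1/25), matching the table.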
Pipeline Diagrams
[Diagram: the 2-stage pipelined circuit as before; F (15 ns) and G (20 ns) feed H (25 ns), with registers after each]

Clock cycle:   i      i+1      i+2      i+3
Input:         Xi     Xi+1     Xi+2     Xi+3
F Reg:         -      F(Xi)    F(Xi+1)  F(Xi+2)
G Reg:         -      G(Xi)    G(Xi+1)  G(Xi+2)
H Reg:         -      -        H(Xi)    H(Xi+1)

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.
This is an example of parallelism. At any instant we are computing 2 results.
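The diagonal movement in the diagram above can be generated mechanically; a small illustrative sketch (the (name, depth) encoding is an assumption of this example, with F and G at depth 1 and H at depth 2):

```python
def pipeline_diagram(stages, n_cycles):
    """stages: list of (name, depth) pairs, depth 1 = first pipeline stage.
    Returns one row per stage register across n_cycles clock cycles."""
    table = {}
    for name, depth in stages:
        row = []
        for cycle in range(n_cycles):
            i = cycle - depth  # index of the input that has reached this register
            row.append(f"{name}(X{i})" if i >= 0 else "-")
        table[name] = row
    return table

diag = pipeline_diagram([("F", 1), ("G", 1), ("H", 2)], 4)
print(diag["F"])  # ['-', 'F(X0)', 'F(X1)', 'F(X2)']
print(diag["H"])  # ['-', '-', 'H(X0)', 'H(X1)']
```

Input X0 appears in the F/G registers one cycle after it is presented and in the H register one cycle later, exactly the diagonal in the table.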
Pipelining Summary
Advantages:
Higher throughput than combinational system
Different parts of the logic work on different parts of the problem…
Disadvantages:
Generally, increases latency
Only as good as the *weakest* link (often called the pipeline's BOTTLENECK)
Review of CPU Performance
MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction
MIPS = Freq / CPI
To Increase MIPS:
1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? State-of-the-art multiple instruction issue
2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improving performance.
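The MIPS = Freq / CPI relation is just a ratio; a tiny sketch (the frequency and CPI values below are made-up examples):

```python
def mips(freq_mhz, cpi):
    """Millions of instructions per second = clock (MHz) / clocks-per-instruction."""
    return freq_mhz / cpi

print(mips(500, 1.0))  # 500.0: a RISC pipeline holding CPI at 1.0
print(mips(500, 2.5))  # 200.0: same clock, higher CPI
```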
Where Are the Bottlenecks?
Pipelining goal: break LONG combinational paths, putting memories and the ALU in separate stages.
[Diagram: single-cycle MIPS datapath. PC with +4 increment and PCSEL mux (branch target BT, jump target JT, PC<31:29>:J<25:0>:00, and exception vectors 0x80000080/0x80000040/0x80000000, RESET, IRQ); Instruction Memory; Register File (RA1/RA2 from Rs:<25:21> and Rt:<20:16>, WA via WASEL from Rd:<15:11>/Rt:<20:16>); SEXT of Imm:<15:0>, shamt:<10:6>, "16"; ASEL/BSEL operand muxes; ALU with ALUFN and Z/V/N/C flags; Data Memory (Adr, WD, RD, R/W, Wr); WDSEL write-back mux (0/1/2, PC+4); Control Logic driving PCSEL, WASEL, ASEL, BSEL, ALUFN, Wr, WDSEL, WERF]
Goal: 5-Stage Pipeline
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)
APPROACH: structure processor as 5-stage pipeline:
IF: Instruction Fetch stage: maintains PC, fetches one instruction per cycle and passes it to …
ID/RF: Instruction Decode/Register File stage: decodes control lines and selects source operands, passing them to …
ALU: ALU stage: performs specified operation, passes result to …
MEM: Memory stage: if it's a lw, uses ALU result as an address, passes mem data (or ALU result if not lw) to …
WB: Write-Back stage: writes result back into register file.
5-Stage miniMIPS
[Diagram: 5-stage pipelined miniMIPS datapath. Pipeline registers (PCREG/IRREG; PCALU/IRALU/A/B/WDALU; PCMEM/IRMEM/YMEM/WDMEM; PCWB/IRWB/YWB) separate the Instruction Fetch, Register File, ALU, Memory, and Write-Back stages. The register file is read (RA1/RA2) in the RF stage and written (WA/WD/WE, via WASEL and WDSEL, gated by WERF) in WB; the branch-zero comparison (=BZ) happens in the RF stage; the PCSEL mux and exception vectors are as in the single-cycle datapath.]
Address is available right after the instruction enters the Memory stage.
Data is needed just before the rising clock edge at the end of the Write-Back stage.
• Omits some details
Pipelining
Improve performance by increasing instruction throughput.
[Diagram: three lw instructions executed sequentially; each takes 8 ns (Instruction fetch, Reg, ALU, Data access, Reg), so a new instruction starts only every 8 ns]
lw $1, 100($0)
lw $2, 200($0)
lw $3, 300($0)
[Diagram: the same three lw instructions pipelined; the stages overlap, each stage takes 2 ns, so a new instruction starts every 2 ns]
Ideal speedup is the number of stages in the pipeline. Do we achieve this?
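The timing comparison above can be turned into a quick speedup calculation; a sketch using the slide's numbers (8 ns per unpipelined instruction, five 2 ns pipeline stages):

```python
def unpipelined_time(n_instr, instr_time=8):
    """Each instruction runs start-to-finish before the next begins."""
    return n_instr * instr_time

def pipelined_time(n_instr, n_stages=5, cycle=2):
    """The first instruction fills the pipeline (n_stages cycles),
    then one instruction completes every cycle."""
    return n_stages * cycle + (n_instr - 1) * cycle

def speedup(n_instr):
    return unpipelined_time(n_instr) / pipelined_time(n_instr)

print(round(speedup(3), 2))          # 1.71: startup cost dominates short runs
print(round(speedup(1_000_000), 2))  # 4.0: limited by 8 ns / 2 ns, not the stage count
```

So no: even with unlimited instructions the speedup here tops out at 8/2 = 4, not 5, because the unpipelined stages were not all equal in length.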
Pipelining
What makes it easy?
all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
What makes it hard?
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
Net effect:
Individual instructions still take the same number of cycles
But improved throughput by increasing the number of simultaneously executing instructions
Data Hazards
Problem with starting the next instruction before the first is finished: dependencies that "go backward in time" are data hazards.
[Diagram: pipeline diagram over clock cycles CC 1 through CC 9 (IM, Reg, DM, Reg per instruction) for the sequence below; sub writes $2 at the end of CC 5, but and and or try to read it earlier]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 10 10 10 10/-20 -20 -20 -20 -20
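The "backward in time" dependencies above can be found mechanically; an illustrative sketch (the (dest, src1, src2) encoding and the helper name are assumptions of this example, with window=2 because the register file writes in the first half of a cycle and reads in the second half, as in the diagram):

```python
def raw_hazards(program, window=2):
    """Return (producer, consumer, register) triples where a consumer
    reads a register within `window` instructions of its write."""
    hazards = []
    for i, (_, *srcs) in enumerate(program):
        for j in range(max(0, i - window), i):
            dest = program[j][0]
            if dest is not None and dest in srcs:
                hazards.append((j, i, dest))
    return hazards

# the slide's sequence, encoded as (dest, src1, src2); sw writes no register
prog = [("$2",  "$1", "$3"),   # sub $2, $1, $3
        ("$12", "$2", "$5"),   # and $12, $2, $5   <- hazard on $2
        ("$13", "$6", "$2"),   # or  $13, $6, $2   <- hazard on $2
        ("$14", "$2", "$2"),   # add $14, $2, $2   (safe: 3 slots later)
        (None,  "$2", "$15")]  # sw  $15, 100($2)
print(raw_hazards(prog))  # [(0, 1, '$2'), (0, 2, '$2')]
```

Only and and or hazard on $2; add and sw are far enough behind sub that the write-back has already happened.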
Software Solution
Have the compiler guarantee no hazards.
Where do we insert the "nops"? Between "producing" and "consuming" instructions!
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Problem: this really slows us down!
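The compiler's nop-insertion rule can be sketched with the same (dest, src1, src2) encoding and 2-instruction hazard window as before (an illustrative sketch, not an optimizing scheduler):

```python
NOP = (None, None, None)

def insert_nops(program, window=2):
    """Pad with nops so no instruction reads a register written by one
    of the previous `window` instructions."""
    out = []
    for instr in program:
        _, *srcs = instr
        needed = 0
        for back in range(1, window + 1):
            # if a recent instruction produced one of our sources,
            # pad until it is more than `window` slots behind us
            if len(out) >= back and out[-back][0] in srcs:
                needed = max(needed, window + 1 - back)
        out.extend([NOP] * needed)
        out.append(instr)
    return out

prog = [("$2",  "$1", "$3"),   # sub $2, $1, $3
        ("$12", "$2", "$5"),   # and $12, $2, $5
        ("$13", "$6", "$2")]   # or  $13, $6, $2
padded = insert_nops(prog)
print(len(padded))  # 5: two nops now separate sub from and
```

The two wasted slots per producer/consumer pair are exactly why this "really slows us down".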
Forwarding
Bypass/forward results as soon as they are produced/needed. Don't wait for them to be written back into registers!
[Diagram: the same five-instruction pipeline diagram over CC 1 through CC 9, with forwarding paths from sub's EX/MEM and MEM/WB pipeline registers to the ALU inputs of and and or]
sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)
Value of register $2: 10 10 10 10 10/-20 -20 -20 -20 -20
Value of EX/MEM: X X X -20 X X X X X
Value of MEM/WB: X X X X -20 X X X X
Can't always forward
Load word can still cause a hazard: an instruction tries to read a register following a load instruction that writes to the same register. STALL!
[Diagram: pipeline diagram over CC 1 through CC 9 for the sequence below; lw's data is available only at the end of its DM stage, after and's ALU stage already needs it]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
Stalling
When needed, stall the pipeline by keeping an instruction in the same stage for an extra clock cycle.
[Diagram: the same sequence over CC 1 through CC 10; a bubble is inserted so and repeats a stage and every instruction after lw slips one cycle]
lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
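The stall rule above is narrow: with forwarding, only a load followed immediately by a consumer of the loaded register needs a bubble. A sketch (the (op, dest, src1, src2) encoding is an assumption of this example):

```python
def count_load_use_stalls(program):
    """Count one-cycle bubbles for load-use hazards."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        op, dest = prev[0], prev[1]
        if op == "lw" and dest in curr[2:]:  # next instruction reads lw's dest
            stalls += 1
    return stalls

prog = [("lw",  "$2", "$1", None),   # lw  $2, 20($1)
        ("and", "$4", "$2", "$5"),   # uses $2 immediately -> one bubble
        ("or",  "$8", "$2", "$6"),   # $2 can be forwarded, no stall
        ("add", "$9", "$4", "$2"),
        ("slt", "$1", "$6", "$7")]
print(count_load_use_stalls(prog))  # 1
```

One bubble stretches the 9-cycle diagram to the 10 cycles shown above.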
Branch Hazards
When branching, other instructions are in the pipeline! We need to add hardware for flushing instructions if we are wrong.
[Diagram: pipeline diagram over CC 1 through CC 9; the beq at address 40 is taken, so the sequentially fetched and/or/add at 44/48/52 are flushed and the lw at the branch target 72 is fetched]
40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
Pipeline Summary
A very common technique to improve throughput of any circuit; used in all modern processors!
Fallacies:
"Pipelining is easy." No, smart people get it wrong all of the time!
"Pipelining is independent of ISA." No, many ISA decisions impact how easy/costly it is to implement pipelining (e.g., branch semantics, addressing modes).
"Increasing pipeline stages improves performance." No, returns diminish because of increasing complexity.