Lecture 7: Pipelining Review
Kai Bu
Appendix C, Lectures 4-6
Pipelining
Start executing one instruction before completing the previous one.
Outline
• What’s Pipelining
• How Pipelining Works
• Pipeline Hazards
• Pipeline with Multicycle FP Operations
Laundry Example
Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold.
• Washer: 30 mins
• Dryer: 40 mins
• Folder: 20 mins
Sequential Laundry
What would you do?
[Figure: task order A, B, C, D along a time axis; each load takes 30 + 40 + 20 mins in turn, 6 hours in total.]
Pipelined Laundry
Observations
• A task has a series of stages;
• Stage dependency: e.g., wash before dry;
• Multiple tasks with overlapping stages;
• Simultaneously use different resources to speed up;
• The slowest stage determines the finish time.
[Figure: task order A, B, C, D along a time axis of 30, 40, 40, 40, 40, 20 mins; 3.5 hours in total.]
Pipelined Laundry
Observations
• No speedup for an individual task; e.g., A still takes 30 + 40 + 20 = 90 mins.
• But speedup for the average task execution time; e.g., 3.5 × 60 / 4 = 52.5 mins < 30 + 40 + 20 = 90 mins.
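The laundry numbers above can be checked with a short sketch. It assumes that a finished load can wait between machines (for example, in a basket) until the next machine is free; the function names are illustrative and not from the lecture.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Loads overlap: a stage starts as soon as both the load and the machine are ready."""
    # machine_free[j] = time at which machine j finishes its most recent load
    machine_free = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for j, minutes in enumerate(stage_minutes):
            start = max(ready, machine_free[j])
            machine_free[j] = start + minutes
            ready = machine_free[j]
    return machine_free[-1]


stages = [30, 40, 20]  # washer, dryer, folder
print(sequential_minutes(stages, 4) / 60)  # 6.0 hours
print(pipelined_minutes(stages, 4) / 60)   # 3.5 hours
print(pipelined_minutes(stages, 4) / 4)    # 52.5 mins per load on average
```

The 40-min dryer is the bottleneck: after the first load, one load finishes every 40 mins regardless of how fast the washer and folder are.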
Assembly Line
[Figures: auto and cola assembly lines.]
Pipelining
• An implementation technique whereby multiple instructions are overlapped in execution; e.g., B washes while A dries.
• Essence: start executing one instruction before completing the previous one.
• Significance: makes CPUs fast.
Balanced Pipeline
• Equal-length pipe stages
e.g., wash, dry, fold = 40 mins each;
unpipelined laundry time = 40 × 3 mins per load;
3 pipe stages: wash, dry, fold.
[Figure: loads A, B, C, D enter the pipeline one per 40-min time slot, T1 to T4.]
• Performance
One task/instruction completes every 40 mins.
$$\text{Time per instruction}_{\text{pipelined}} = \frac{\text{Time per instruction}_{\text{unpipelined}}}{\text{Number of pipe stages}}$$
$$\text{Speedup from pipelining} = \text{Number of pipe stages}$$
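As a rough check of the balanced-pipeline formulas, the sketch below also counts the time needed to fill the pipeline, which the ideal formula ignores. The function names and the task counts are illustrative assumptions.

```python
def time_per_instruction_pipelined(unpipelined_time, stages):
    """Ideal case: perfectly balanced stages and a full pipeline."""
    return unpipelined_time / stages


def measured_speedup(tasks, stages):
    """Speedup for a finite number of tasks, in units of one stage time."""
    unpipelined = tasks * stages        # every task takes all stages in turn
    pipelined = stages + tasks - 1      # fill the pipeline, then one task per stage time
    return unpipelined / pipelined


print(time_per_instruction_pipelined(40 * 3, 3))  # 40.0 mins, as in the laundry example
print(measured_speedup(4, 3))                     # 2.0 for only four loads
print(measured_speedup(1000, 3))                  # approaches 3, the number of stages
```

With few tasks the fill time matters; with many tasks the speedup approaches the number of pipe stages.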
Pipelining Terminology
• Latency: the time for an instruction to complete.
• Throughput of a CPU: the number of instructions completed per second.
• Clock cycle: everything in CPU moves in lockstep; synchronized by the clock.
• Processor cycle: time required to move an instruction one step down the pipeline;
= time required to complete a pipe stage;
= max(times for completing all stages);
= one or two clock cycles, but rarely more.
• CPI: clock cycles per instruction
How Pipelining Works
RISC: Five-Stage Pipeline
• How it works
Separate instruction and data memories eliminate conflicts over a single memory between instruction fetch (IF) and data memory access (MEM);
the register file is used in two stages: read in ID, write in WB. Each takes half a clock cycle; within one clock cycle, the write happens before the read;
pipeline registers are introduced between successive stages; a pipeline register stores the results of one stage and supplies them as the input of the next stage;
pipeline registers are omitted from the diagrams for simplicity, but they are required in an implementation.
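The overlap of the five stages can be sketched as a small table generator. It assumes an ideal pipeline with no stalls, where instruction i enters IF in clock cycle i + 1; the function name is illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(num_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(STAGES):
            row[i + s] = name  # each instruction starts one cycle after the previous one
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_diagram(4)):
    print(f"instr i+{i}: " + " ".join(f"{cell:<3}" for cell in row))
```

Reading one column of the output shows which stages are busy in that clock cycle; in a full pipeline every stage works on a different instruction.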
RISC: Reduced Instruction Set Computer
At most 5 clock cycles per instruction: IF ID EX MEM WB
• 1. Instruction Fetch (IF) cycle
send the PC to memory;
fetch the current instruction from memory;
PC = PC + 4; // each instruction is 4 bytes
• 2. Instruction Decode/register fetch (ID) cycle
decode the instruction;
read the registers (corresponding to the register source specifiers).
• 3. Execution/effective address (EX) cycle
the ALU operates on the operands from ID; 3 functions depending on the instruction type:
1. Memory reference: the ALU adds the base register and the offset to form the effective address;
2. Register-Register ALU instruction: the ALU performs the operation specified by the opcode on the values read from the register file;
3. Register-Immediate ALU instruction: the ALU operates on the first value read from the register file and the sign-extended immediate.
• 4. Memory access (MEM) cycle
for a load: the memory does a read using the effective address;
for a store: the memory writes the data from the second register using the effective address.
• 5. Write-Back (WB) cycle
for a Register-Register ALU instruction or a load;
write the result into the register file, whether it comes from the memory (for a load) or from the ALU (for an ALU instruction).
RISC: Reduced Instruction Set Computer
3 classes of instructions
• 1. ALU (Arithmetic Logic Unit) instructions
operate on two registers or a register + a sign-extended immediate;
store the result into a third register;
e.g., add (DADD), subtract (DSUB), logical operations AND, OR.
• 2. Load (LD) and store (SD) instructions
operands: base register + offset;
the sum (called the effective address) is used as a memory address;
Load: use a second register operand as the destination for the data loaded from memory;
Store: use a second register operand as the source of the data stored into memory.
• 3. Branches and jumps
conditional transfers of control;
Branch: specify the branch condition with a set of condition bits or comparisons between two registers or between a register and zero;
decide the branch destination by adding a sign-extended offset to the current PC (program counter).
MIPS Instruction
• at most 5 clock cycles per instruction
• IF ID EX MEM WB
IF:
IR ← Mem[PC];
NPC ← PC + 4;
ID:
A ← Regs[rs];
B ← Regs[rt];
Imm ← sign-extended immediate field of IR (lower 16 bits);
EX:
memory reference: ALUOutput ← A + Imm;
register-register ALU: ALUOutput ← A func B;
register-immediate ALU: ALUOutput ← A op Imm;
branch: ALUOutput ← NPC + (Imm << 2); Cond ← (A == 0);
MEM:
load: LMD ← Mem[ALUOutput];
store: Mem[ALUOutput] ← B;
branch: if (Cond) PC ← ALUOutput;
WB:
register-register ALU: Regs[rd] ← ALUOutput;
register-immediate ALU: Regs[rt] ← ALUOutput;
load: Regs[rt] ← LMD;
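The register transfers for the five cycles can be traced with a minimal, unpipelined sketch. The decoded-instruction dictionaries, the small opcode subset, and the use of a Python dict as data memory are assumptions made for illustration; 64-bit wraparound and the hardwired-zero behavior of R0 are not modeled.

```python
def sign_extend16(value):
    """Interpret the lower 16 bits of value as a signed number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def execute(instr, pc, regs, mem):
    """Run one instruction through IF, ID, EX, MEM, WB and return the next PC."""
    # IF: IR <- Mem[PC] (the caller supplies the already fetched instruction); NPC <- PC + 4
    ir, npc = instr, pc + 4
    # ID: read both source registers and sign-extend the immediate
    a = regs[ir.get("rs", 0)]
    b = regs[ir.get("rt", 0)]
    imm = sign_extend16(ir.get("imm", 0))
    # EX: one of the four ALU functions, chosen by instruction type
    op, cond = ir["op"], False
    if op in ("LD", "SD", "DADDI"):
        alu_output = a + imm            # effective address, or register-immediate ALU
    elif op == "DADD":
        alu_output = a + b              # register-register ALU
    elif op == "DSUB":
        alu_output = a - b
    elif op == "BEQZ":
        alu_output = npc + (imm << 2)   # branch target
        cond = (a == 0)
    else:
        raise ValueError(f"unsupported opcode: {op}")
    # MEM: loads read memory, stores write memory, taken branches update the PC
    next_pc, lmd = npc, None
    if op == "LD":
        lmd = mem[alu_output]
    elif op == "SD":
        mem[alu_output] = b
    elif op == "BEQZ" and cond:
        next_pc = alu_output
    # WB: write the ALU result or the loaded data into the register file
    if op in ("DADD", "DSUB"):
        regs[ir["rd"]] = alu_output
    elif op == "DADDI":
        regs[ir["rt"]] = alu_output
    elif op == "LD":
        regs[ir["rt"]] = lmd
    return next_pc


regs = [0] * 32
regs[2] = 100
mem = {100: 7}
pc = execute({"op": "LD", "rs": 2, "rt": 1, "imm": 0}, 0, regs, mem)   # LD R1, 0(R2)
pc = execute({"op": "DADD", "rs": 1, "rt": 1, "rd": 4}, pc, regs, mem)  # DADD R4, R1, R1
print(regs[1], regs[4], pc)  # 7 14 8
```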
MIPS Instruction Demo
• Prof. Gurpur Prabhu, Iowa State Univ: http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/DLXimplem.html
• Load, Store
• Register-register ALU
• Register-immediate ALU
• Branch
Pipeline Hazards
When Pipeline Is Stuck
LD R1, 0(R2)
DSUB R4, R1, R5
[Figure: DSUB reads R1, which LD is still loading.]
Structural Hazard
• Example
1 memory port causes a memory conflict: data access (MEM of Load) vs instruction fetch (IF of Instr i+3).
Stall Instr i+3 till CC 5.
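The stall in this example can be reproduced with a small sketch. It assumes a single memory port shared by IF and MEM, in-order fetch of one instruction per cycle, and that MEM falls three cycles after IF as in the five-stage pipeline; the function name is illustrative.

```python
def fetch_cycles(is_memory_op):
    """Return the clock cycle in which each instruction performs IF.

    is_memory_op[i] is True when instruction i is a load or store.
    """
    port_busy = set()   # cycles in which a data access occupies the memory port
    cycles = []
    next_cycle = 1
    for memory_op in is_memory_op:
        while next_cycle in port_busy:
            next_cycle += 1              # stall: the port is doing a data access
        cycles.append(next_cycle)
        if memory_op:
            port_busy.add(next_cycle + 3)  # MEM is three stages after IF
        next_cycle += 1
    return cycles


# Load, Instr i+1, Instr i+2, Instr i+3
print(fetch_cycles([True, False, False, False]))  # [1, 2, 3, 5]
```

Instr i+3 would normally fetch in CC 4, which is when the load uses the memory port for its data access, so its fetch slips to CC 5.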
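Data Hazard
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
All four later instructions read R1, which DADD writes.
No hazard when the register write and read fall in the same clock cycle: 1st half cycle: write; 2nd half cycle: read.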
Data Hazard
• Solution: forwarding
directly feed the results in the EX/MEM and MEM/WB pipeline registers back to the ALU inputs;
if the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source of the current ALU operation, control logic selects the forwarded result as the ALU input.
Data Hazard: Forwarding
DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11
DSUB receives R1 from the EX/MEM pipeline register.
AND receives R1 from the MEM/WB pipeline register.
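The forwarding decision for this sequence can be sketched as a lookup on how far back the producing instruction is. The sketch assumes ALU instructions only, the five-stage pipeline, and a register file that writes in the first half cycle and reads in the second; the tuple encoding and names are illustrative.

```python
PATHS = {
    1: "EX/MEM",
    2: "MEM/WB",
    3: "register file (write in 1st half, read in 2nd half)",
}


def forwarding_sources(program):
    """Map (instruction index, source register) to where the operand comes from.

    program is a list of (op, dest, src1, src2) tuples in program order.
    """
    sources = {}
    for i, (op, dest, *srcs) in enumerate(program):
        for src in srcs:
            where = "register file"
            for distance in (1, 2, 3):          # nearest producer wins
                j = i - distance
                if j >= 0 and program[j][1] == src:
                    where = PATHS[distance]
                    break
            sources[(i, src)] = where
    return sources


program = [
    ("DADD", "R1", "R2", "R3"),
    ("DSUB", "R4", "R1", "R5"),
    ("AND", "R6", "R1", "R7"),
    ("OR", "R8", "R1", "R9"),
    ("XOR", "R10", "R1", "R11"),
]
for (i, reg), where in forwarding_sources(program).items():
    if reg == "R1":
        print(program[i][0], reg, "<-", where)
```

OR needs no forwarding because DADD writes R1 in the first half of the cycle in which OR reads it; XOR reads R1 one cycle later from the register file.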
Data Hazard: Forwarding
• Generalized forwarding
pass a result directly to the functional unit that requires it;
forward results not only to ALU inputs but also to other types of functional units;
e.g.,
DADD R1, R2, R3
LD R4, 0(R1)
SD R4, 12(R1)
[Figure: R1 is forwarded from DADD to LD and SD; R4 is forwarded from LD to SD.]
Data Hazard
• Sometimes a stall is necessary
LD R1, 0(R2)
DSUB R4, R1, R5
The loaded R1 reaches the MEM/WB pipeline register only after DSUB needs it in EX.
Forwarding cannot go backward in time, so the pipeline has to stall.
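A load-use check of this kind can be sketched as a scan over adjacent instruction pairs. It assumes the five-stage pipeline with forwarding, where only an ALU instruction that immediately follows the load and reads its destination costs one stall cycle; the tuple encoding (op, dest, sources...) is illustrative.

```python
def count_load_use_stalls(program):
    """Count stall cycles caused by an instruction using a load result right away."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        op, dest = previous[0], previous[1]
        if op == "LD" and dest in current[2:]:
            stalls += 1   # the loaded value is not available until the end of MEM
    return stalls


program = [
    ("LD", "R1", "R2"),          # LD   R1, 0(R2)
    ("DSUB", "R4", "R1", "R5"),  # DSUB R4, R1, R5
]
print(count_load_use_stalls(program))  # 1
```

Placing an independent instruction between the load and its consumer removes the stall, since the value can then be forwarded from MEM/WB.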
Branch Hazard
• Redo IF
The first IF is essentially a stall.
If the branch is untaken, the stall is unnecessary.
Branch Hazard: Solutions
4 simple compile-time schemes
• 1. Freeze or flush the pipeline
hold or delete any instructions after the branch until the branch destination is known;
i.e., redo IF without the first IF.
• 2. Predicted-untaken
simply treat every branch as untaken;
when the branch is untaken, pipelining proceeds as if there were no hazard;
but if the branch is taken: turn the fetched instruction into a no-op (idle) and restart the IF at the branch target address.
• 3. Predicted-taken
simply treat every branch as taken;
does not apply to the five-stage pipeline;
applies when the branch target address is known before the branch outcome.
• 4. Delayed branch
delay the branch execution until after the next instruction;
pipelining sequence: branch instruction, sequential successor, branch target if taken;
branch delay slot: the next instruction (the sequential successor).
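The cost of the schemes can be compared with the usual relation, speedup = pipeline depth / (1 + branch frequency × branch penalty), assuming an ideal CPI of 1. The branch frequency, taken fraction, and one-cycle penalty below are hypothetical numbers chosen only for illustration.

```python
def effective_cpi(branch_frequency, average_penalty, base_cpi=1.0):
    """CPI once the average branch stall cycles are added to the ideal CPI."""
    return base_cpi + branch_frequency * average_penalty


def pipeline_speedup(depth, branch_frequency, average_penalty):
    """Speedup over the unpipelined machine, counting only branch stalls."""
    return depth / effective_cpi(branch_frequency, average_penalty)


# Hypothetical workload: 20% branches, 60% of them taken, 1-cycle penalty
branch_frequency, taken_fraction, penalty = 0.2, 0.6, 1

flush = pipeline_speedup(5, branch_frequency, penalty)                        # every branch stalls
untaken = pipeline_speedup(5, branch_frequency, taken_fraction * penalty)     # only taken branches stall
print(round(flush, 3), round(untaken, 3))
```

Under these assumed numbers, predicted-untaken loses fewer cycles than flushing because the untaken branches cost nothing.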
Pipeline with Multicycle FP Operations
Multicycle FP Operation
• FP pipeline
allow for a longer latency for operations;
two changes over the integer pipeline: repeat EX; use multiple FP functional units.
FP Pipeline
Functional units:
• integer unit: loads and stores, integer ALU operations, branches
• FP adder: FP add, FP subtract, FP conversion
• FP and integer multiplier
• FP and integer divider
Generalized FP Pipeline
• EX is pipelined (except for the FP divider)
• Additional pipeline registers, e.g., ID/A1
• FP divider: 24 CCs
Generalized FP Pipeline
• Example
italics: stage where data is needed
bold: stage where a result is available
Hazard
• The divider is not fully pipelined: structural hazard.
• Instructions have varying running times, so more than one register write may be needed in a cycle: structural hazard.
• Instructions no longer reach WB in order: write after write (WAW) hazard.
• Instructions may complete in a different order than they were issued: problems with exceptions.
• Longer latency of operations: more frequent stalls for RAW hazards.
RAW Hazards
Structural Hazards
WAW Hazards
• If L.D were issued one cycle earlier, it would write F2 one cycle earlier than ADD.D: a WAW hazard.
• What if another instruction used F2 between them? Then there is no WAW hazard.
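The write-port conflict and the WAW case can be sketched by computing the cycle in which each instruction reaches WB. The execute-stage counts (1 for L.D, 4 for ADD.D, 7 for MUL.D) and the issue cycles follow the classic textbook example and are assumptions here, since the lecture text only gives the divider latency.

```python
from itertools import combinations

EX_CYCLES = {"L.D": 1, "ADD.D": 4, "MUL.D": 7}  # assumed execute-stage counts


def wb_cycle(op, issue_cycle):
    """IF in issue_cycle, then ID, the execute stages, MEM, and finally WB."""
    return issue_cycle + 1 + EX_CYCLES[op] + 2


def find_write_hazards(instructions):
    """instructions: list of (op, dest, issue_cycle) in program order."""
    hazards = []
    for (op1, d1, c1), (op2, d2, c2) in combinations(instructions, 2):
        w1, w2 = wb_cycle(op1, c1), wb_cycle(op2, c2)
        if w1 == w2:
            hazards.append(("structural", op1, op2))  # two writes in the same cycle
        elif d1 == d2 and w2 < w1:
            hazards.append(("WAW", op1, op2))          # later instruction writes first
    return hazards


base = [("MUL.D", "F0", 1), ("ADD.D", "F2", 4), ("L.D", "F2", 7)]
earlier = [("MUL.D", "F0", 1), ("ADD.D", "F2", 4), ("L.D", "F2", 6)]
print(find_write_hazards(base))     # all three reach WB in the same cycle
print(find_write_hazards(earlier))  # L.D now writes F2 before ADD.D does
```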
All in MIPS R4000
MIPS R4000
• 5-stage -> 8-stage
• Higher clock rate
• IF: first half of instruction fetch; PC selection; initiation of instruction cache access;
• IS: second half of instruction fetch; completion of instruction cache access;
• RF: instruction decode and register fetch; hazard checking; instruction cache hit detection;
• EX: execution; effective address calculation; ALU operation; branch-target computation and condition evaluation;
• DF: data fetch; first half of data cache access;
• DS: second half of data fetch; completion of data cache access;
• TC: tag check; determine whether the data cache access hit;
• WB: write back for loads and register-register operations.
MIPS R4000
• 2-cycle load delay
• 3-cycle branch delay
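One way to read the two delays is as distances between stages of the eight-stage pipeline. The sketch below assumes that load data becomes available at the end of DS and is needed by a dependent instruction in EX, and that the branch condition is evaluated in EX; the helper name is illustrative.

```python
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def stages_between(earlier, later):
    """Number of pipeline stages from one stage to a later one."""
    return R4000_STAGES.index(later) - R4000_STAGES.index(earlier)


load_delay = stages_between("EX", "DS")    # data ready at end of DS, consumed in EX
branch_delay = stages_between("IF", "EX")  # outcome known in EX, fetch restarts at IF
print(load_delay, branch_delay)  # 2 3
```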
MIPS R4000
• FP unit with eight different stages
• FP operations: latency and initiation interval
• FP operations, Example 1: FP multiply + FP add
• FP operations, Example 2: FP add + FP multiply
• FP operations, Example 3: FP divide + FP add
• FP operations, Example 4: FP add + FP divide
?