Review : Pipelining

Review: Pipelining

Pipelining

• Laundry Example

• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

A B C D

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Pipelined LaundryStart work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads

6 PM 7 8 9 10 11 Midnight

30 40 40 40 40 20

Pipelining: Observations• Multiple tasks operating

simultaneously• Pipelining doesn’t help

latency of single task, it helps throughput of entire workload

• Pipeline rate limited by slowest pipeline stage

• Potential speedup = Number pipe stages

• Unbalanced lengths of pipe stages reduces speedup

• Time to “fill” pipeline and time to “drain” it reduces speedup

6 PM 7 8 9

30 40 40 40 40 20

5 Steps of DLX DatapathFigure 3.1

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

Inst.Mem.

Imm.SignExt.

LMDALU DataMem.

ALUOutput

Zero? Cond.

Pipelined DLX DatapathFigure 3.4

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc.

Inst.Mem.

SignExt.

ALU DataMem.

ID/EX EX/MEM MEM/WB

Visualizing PipeliningFigure 3.3

Instr.

Time (clock cycles)

Limits to Pipelining

• Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Pipelining of branches & other instructions that change the PC

• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

One Memory Port/Structural HazardsFigure 3.6

Instr.

Time (clock cycles)

Instr 1

Instr 2

Instr 3

Instr 4

One Memory Port/Structural HazardsFigure 3.7

Instr.

Instr 1

Instr 2

Instr 3

Speed Up Equation for Pipelining

Speedup from pipelining = Ave Instr Time unpipelined

Ave Instr Time pipelined

= CPIunpipelined x Clock Cycleunpipelined

CPIpipelined x Clock Cyclepipelined

= CPIunpipelined Clock Cycleunpipelined

CPIpipelined Clock Cyclepipelined

Ideal CPI = CPIunpipelined/Pipeline depth

Speedup = Ideal CPI x Pipeline depth Clock Cycleunpipelined

CPIpipelined Clock Cyclepipelined

Speed Up Equation for Pipelining

CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instr

Speedup = Ideal CPI x Pipeline depth Clock Cycleunpipelined

Ideal CPI + Pipeline stall CPI Clock Cyclepipelined

What is the maximum possible speedup?

Speedup = Pipeline depth Clock Cycleunpipelined

1 + Pipeline stall CPI Clock Cyclepipelined

Example: Dual-port vs. Single-port

• Machine A: Dual ported memory• Machine B: Single ported memory, but its pipelined

implementation has a 1.05 times faster clock rate• Ideal CPI = 1 for both• Loads are 40% of instructions executed SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth

SpeedUpB = Pipeline Depth/(1 + 0.4) x (clockunpipe/(clockpipe)

= (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth

SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

• Machine A is 1.33 times faster

Data Hazard on R1Figure 3.9

Three Generic Data HazardsInstrI followed by InstrJ

• Read After Write (RAW) InstrJ tries to read operand before InstrI writes it

Three Generic Data HazardsInstrI followed by InstrJ

• Write After Read (WAR) InstrJ tries to write operand before InstrI reads it

• Can’t happen in DLX 5 stage pipeline because:

– All instructions take 5 stages,

– Reads are always in stage 2, and

– Writes are always in stage 5

Three Generic Data Hazards

InstrI followed by InstrJ

• Write After Write (WAW) InstrJ tries to write operand before InstrI writes it

– Leaves wrong result ( InstrI not InstrJ)

• Can’t happen in DLX 5 stage pipeline because:

– All instructions take 5 stages, and

– Writes are always in stage 5

• Will see WAR and WAW in later more complicated pipes

Forwarding to Avoid Data HazardFigure 3.10

HW Change for ForwardingFigure 3.20

Data Hazard Even with ForwardingFigure 3.12

Data Hazard Even with ForwardingFigure 3.13

Try producing fast code for

a = b + c;

d = e – f;

assuming a, b, c, d ,e, and f in memory. Slow code:

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

Software Scheduling to Avoid Load Hazards

Fast code:

LW Rb,b

LW Rc,c

LW Re,e

ADD Ra,Rb,Rc

LW Rf,f

SW a,Ra

SUB Rd,Re,Rf

SW d,Rd

Compiler Avoiding Load Stalls

% loads stalling pipeline

0% 20% 40% 60% 80%

scheduled unscheduled

Control Hazard on BranchesThree Stage Stall

Branch Stall Impact

• If CPI = 1, if 30% branch, Stall 3 cycles => new CPI = 1.9!

• Two part solution:– Determine branch taken or not sooner, AND

– Compute taken branch address earlier

• DLX branch tests if register = 0 or <> 0

• DLX Solution:– Move Zero test to ID/RF stage

– Adder to calculate new PC in ID/RF stage

– 1 clock cycle penalty for branch versus 3

Pipelined DLX DatapathFigure 3.22

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken– Execute successor instructions in sequence

– “Squash” instructions in pipeline if branch actually taken

– Advantage of late pipeline state update

– 47% DLX branches not taken on average

– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken– 53% DLX branches taken on average

– But haven’t calculated branch target address in DLX

» DLX still incurs 1 cycle branch penalty

Four Branch Hazard Alternatives

#4: Delayed Branch– Define branch to take place AFTER a following instruction

branch instructionsequential successor1

sequential successor2

........sequential successorn

branch target if taken

– 1 slot delay allows proper decision and branch target address in 5 stage pipeline

– DLX uses this

Branch delay of length n

Delayed Branch

• Where to get instructions to fill branch delay slot?– Before branch instruction

– From the target address: only valuable when branch taken

– From fall through: only valuable when branch not taken

• Compiler effectiveness for single branch delay slot:– Fills about 60% of branch delay slots

– About 80% of instructions executed in branch delay slots useful in computation

– About 50% (60% x 80%) of slots usefully filled

Pipelining Summary

• Just overlap tasks, and easy if tasks are independent

• Speed Up vs Pipeline Depth:

• Hazards limit performance on computers:– Structural: need more HW resources

– Data: need forwarding, compiler scheduling

– Control: discuss next time

Speedup =Pipeline Depth

1 + Pipeline stall CPIX

Clock Cycle Unpipelined

Clock Cycle Pipelined

Review : Pipelining

Documents

Transcript of Review : Pipelining

CSE 502 Graduate Computer Architecture Lec 3-5 – Performance + Instruction Pipelining Review

Pipelining IV

Lecture 7: Pipelining Review Kai Bu kaibu@zju.edu.cn .

Pipelining and Retiming Prepared by Mark Jarvin. Agenda Synchronous circuit retiming Pipelining Software pipelining.

Pipelining Principles

Pipelining - howard huanghowardhuang.us/teaching/cs232/15-Pipelining.pdf · March 17, 2003 Pipelining 2 Instruction execution review Executing a MIPS instruction can take up to five

EECC551 - Shaaban #1 Lec # 2 Fall 2005 9-8-2005 Instruction Pipelining Review:Instruction Pipelining Review: –MIPS In-Order Single-Issue Integer Pipeline.

Instruction Pipelining

Pipelining I

Slide 1 Instruction-Level Parallelism Review of Pipelining (the laundry analogy)

CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.

Pipelining & Parallel Processing - KAISTics.kaist.ac.kr/ee877_2015s/3_Pipelining_and_Parallel_Processing.pdf · Pipelining processing By using pipelining latches to reduce critical

Review: Pipelining - The University of Alabama in …gcox/613/613s12reviewpipe.pdf · Morgan Kaufmann Publishers 1 February, 2012 Chapter 4 — The Processor 1 Review: Pipelining

Pipelining Verilog

Instruction Pipelining Review - Rochester Institute of ...meseec.ce.rit.edu/eecc551-fall2003/551-9-11-2003.pdf · EECC551 - Shaaban #1 Lec # 2 Fall 2003 9-11-2003 Instruction Pipelining

Kapitel 10: Pipelining - Multimediale Informationsverarbeitung · 2019-01-28 · 1 Kapitel 10: Pipelining • Pipelining (Fließbandtechnik): Implementierungstechnik Vielfältig angewendet

Pipelining - University of Toronto · 2005-09-17 · Pipelining • Principles of pipelining † Simple pipelining † Structural Hazards † Data Hazards † Control Hazards †

Pipelining & Parallel Processing - ics.kaist.ac.krics.kaist.ac.kr/ee878_2018f/[EE878]3 Pipelining and Parallel Processing.pdf · Pipelining processing By using pipelining latches

Recap (Pipelining)

PIPELINING INSTRUCTION