Review : Pipelining

31
Review: Pipelining

description

Review : Pipelining. A. B. C. D. Pipelining. Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “ Folder ” takes 20 minutes. A. B. C. D. Sequential Laundry. 6 PM. Midnight. 7. 8. 9. 11. - PowerPoint PPT Presentation

Transcript of Review : Pipelining

Page 1: Review : Pipelining

Review: Pipelining

Page 2: Review : Pipelining

Pipelining

• Laundry Example

• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 40 minutes

• “Folder” takes 20 minutes

A B C D

Page 3: Review : Pipelining

Sequential Laundry

• Sequential laundry takes 6 hours for 4 loads

A

B

C

D

30 40 20 30 40 20 30 40 20 30 40 20

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

Page 4: Review : Pipelining

Pipelined LaundryStart work ASAP

• Pipelined laundry takes 3.5 hours for 4 loads

A

B

C

D

6 PM 7 8 9 10 11 Midnight

Task

Order

Time

30 40 40 40 40 20

Page 5: Review : Pipelining

Pipelining: Observations• Multiple tasks operating

simultaneously• Pipelining doesn’t help

latency of single task, it helps throughput of entire workload

• Pipeline rate limited by slowest pipeline stage

• Potential speedup = Number pipe stages

• Unbalanced lengths of pipe stages reduces speedup

• Time to “fill” pipeline and time to “drain” it reduces speedup

A

B

C

D

6 PM 7 8 9

Task

Order

Time

30 40 40 40 40 20

Page 6: Review : Pipelining

5 Steps of DLX DatapathFigure 3.1

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

PC

+

Inst.Mem.

4

NPC

IR

Regs

Imm.SignExt.

A

BMUX

MUX

MUX

MUX

LMDALU DataMem.

ALUOutput

Zero? Cond.

16 32

Page 7: Review : Pipelining

Pipelined DLX DatapathFigure 3.4

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc.

PC

+

Inst.Mem.

4

IF/ID

Regs

SignExt.

MUX

MUX

MUX

MUX

ALU DataMem.

Zero?

16 32

ID/EX EX/MEM MEM/WB

Page 8: Review : Pipelining

Visualizing PipeliningFigure 3.3

Instr.

Order

Time (clock cycles)

Page 9: Review : Pipelining

Limits to Pipelining

• Hazards prevent next instruction from executing during its designated clock cycle

– Structural hazards: HW cannot support this combination of instructions

– Data hazards: Instruction depends on result of prior instruction still in the pipeline

– Control hazards: Pipelining of branches & other instructions that change the PC

• Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

Page 10: Review : Pipelining

One Memory Port/Structural HazardsFigure 3.6

Instr.

Order

Time (clock cycles)

Load

Instr 1

Instr 2

Instr 3

Instr 4

Page 11: Review : Pipelining

One Memory Port/Structural HazardsFigure 3.7

Instr.

Order

Load

Instr 1

Instr 2

stall

Instr 3

Page 12: Review : Pipelining

Speed Up Equation for Pipelining

Speedup from pipelining = Ave Instr Time unpipelined

Ave Instr Time pipelined

= CPIunpipelined x Clock Cycleunpipelined

CPIpipelined x Clock Cyclepipelined

= CPIunpipelined Clock Cycleunpipelined

CPIpipelined Clock Cyclepipelined

Ideal CPI = CPIunpipelined/Pipeline depth

Speedup = Ideal CPI x Pipeline depth Clock Cycleunpipelined

CPIpipelined Clock Cyclepipelined

x

x

Page 13: Review : Pipelining

Speed Up Equation for Pipelining

CPIpipelined = Ideal CPI + Pipeline stall clock cycles per instr

Speedup = Ideal CPI x Pipeline depth Clock Cycleunpipelined

Ideal CPI + Pipeline stall CPI Clock Cyclepipelined

What is the maximum possible speedup?

Speedup = Pipeline depth Clock Cycleunpipelined

1 + Pipeline stall CPI Clock Cyclepipelined

x

x

Page 14: Review : Pipelining

Example: Dual-port vs. Single-port

• Machine A: Dual ported memory• Machine B: Single ported memory, but its pipelined

implementation has a 1.05 times faster clock rate• Ideal CPI = 1 for both• Loads are 40% of instructions executed SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe) = Pipeline Depth

SpeedUpB = Pipeline Depth/(1 + 0.4) x (clockunpipe/(clockpipe)

= (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth

SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33

• Machine A is 1.33 times faster

Page 15: Review : Pipelining

Data Hazard on R1Figure 3.9

Page 16: Review : Pipelining

Three Generic Data HazardsInstrI followed by InstrJ

• Read After Write (RAW) InstrJ tries to read operand before InstrI writes it

Page 17: Review : Pipelining

Three Generic Data HazardsInstrI followed by InstrJ

• Write After Read (WAR) InstrJ tries to write operand before InstrI reads it

• Can’t happen in DLX 5 stage pipeline because:

– All instructions take 5 stages,

– Reads are always in stage 2, and

– Writes are always in stage 5

Page 18: Review : Pipelining

Three Generic Data Hazards

InstrI followed by InstrJ

• Write After Write (WAW) InstrJ tries to write operand before InstrI writes it

– Leaves wrong result ( InstrI not InstrJ)

• Can’t happen in DLX 5 stage pipeline because:

– All instructions take 5 stages, and

– Writes are always in stage 5

• Will see WAR and WAW in later more complicated pipes

Page 19: Review : Pipelining

Forwarding to Avoid Data HazardFigure 3.10

Page 20: Review : Pipelining

HW Change for ForwardingFigure 3.20

Page 21: Review : Pipelining

Data Hazard Even with ForwardingFigure 3.12

Page 22: Review : Pipelining

Data Hazard Even with ForwardingFigure 3.13

Page 23: Review : Pipelining

Try producing fast code for

a = b + c;

d = e – f;

assuming a, b, c, d ,e, and f in memory. Slow code:

LW Rb,b

LW Rc,c

ADD Ra,Rb,Rc

SW a,Ra

LW Re,e

LW Rf,f

SUB Rd,Re,Rf

SW d,Rd

Software Scheduling to Avoid Load Hazards

Fast code:

LW Rb,b

LW Rc,c

LW Re,e

ADD Ra,Rb,Rc

LW Rf,f

SW a,Ra

SUB Rd,Re,Rf

SW d,Rd

Page 24: Review : Pipelining

Compiler Avoiding Load Stalls

% loads stalling pipeline

0% 20% 40% 60% 80%

tex

spice

gcc

25%

14%

31%

65%

42%

54%

scheduled unscheduled

Page 25: Review : Pipelining

Control Hazard on BranchesThree Stage Stall

Page 26: Review : Pipelining

Branch Stall Impact

• If CPI = 1, if 30% branch, Stall 3 cycles => new CPI = 1.9!

• Two part solution:– Determine branch taken or not sooner, AND

– Compute taken branch address earlier

• DLX branch tests if register = 0 or <> 0

• DLX Solution:– Move Zero test to ID/RF stage

– Adder to calculate new PC in ID/RF stage

– 1 clock cycle penalty for branch versus 3

Page 27: Review : Pipelining

Pipelined DLX DatapathFigure 3.22

Page 28: Review : Pipelining

Four Branch Hazard Alternatives

#1: Stall until branch direction is clear

#2: Predict Branch Not Taken– Execute successor instructions in sequence

– “Squash” instructions in pipeline if branch actually taken

– Advantage of late pipeline state update

– 47% DLX branches not taken on average

– PC+4 already calculated, so use it to get next instruction

#3: Predict Branch Taken– 53% DLX branches taken on average

– But haven’t calculated branch target address in DLX

» DLX still incurs 1 cycle branch penalty

Page 29: Review : Pipelining

Four Branch Hazard Alternatives

#4: Delayed Branch– Define branch to take place AFTER a following instruction

branch instructionsequential successor1

sequential successor2

........sequential successorn

branch target if taken

– 1 slot delay allows proper decision and branch target address in 5 stage pipeline

– DLX uses this

Branch delay of length n

Page 30: Review : Pipelining

Delayed Branch

• Where to get instructions to fill branch delay slot?– Before branch instruction

– From the target address: only valuable when branch taken

– From fall through: only valuable when branch not taken

• Compiler effectiveness for single branch delay slot:– Fills about 60% of branch delay slots

– About 80% of instructions executed in branch delay slots useful in computation

– About 50% (60% x 80%) of slots usefully filled

Page 31: Review : Pipelining

Pipelining Summary

• Just overlap tasks, and easy if tasks are independent

• Speed Up vs Pipeline Depth:

• Hazards limit performance on computers:– Structural: need more HW resources

– Data: need forwarding, compiler scheduling

– Control: discuss next time

Speedup =Pipeline Depth

1 + Pipeline stall CPIX

Clock Cycle Unpipelined

Clock Cycle Pipelined