Computer Organization and Design Pipelining Montek Singh Mon, Dec 2, 2013 Lecture 16.


Page 1:

Computer Organization and Design

Pipelining

Montek Singh
Mon, Dec 2, 2013

Lecture 16

Page 2:

Pipelining

Between 411 problem sets, I haven't had a minute to do laundry.

Now that's what I call dirty laundry!

Read Chapter 4.5-4.8

Page 3:

Laundry Example

Device: Washer
Function: Fill, Agitate, Spin
Washer PD = 30 mins

Device: Dryer
Function: Heat, Spin
Dryer PD = 60 mins

INPUT: dirty laundry

OUTPUT: 4 more weeks

Page 4:

Laundry: One Load at a Time

Everyone knows that the real reason one puts off doing laundry so long is not because we procrastinate, are lazy, or even have better things to do. The fact is, doing laundry one load at a time is not smart.

Step 1:

Step 2:

Total = WasherPD + DryerPD = 30 + 60 = 90 mins

Page 5:

Laundry: Doing N Loads!

Here's how one would do laundry the "unpipelined" way.

Step 1:

Step 2:

Step 3:

Step 4:

Total = N * (WasherPD + DryerPD) = N * 90 mins

Page 6:

Laundry: Doing N Loads!

Here's how to "pipeline" the laundry process. That's why we wait!

Step 1:

Step 2:

Step 3:

Total = N * Max(WasherPD, DryerPD) = N * 60 mins

…Actually, it’s more like N*60 + 30 if we account for the startup time (i.e., filling up the pipeline) correctly. When doing pipeline analysis, we’re mostly interested in the “steady state” where we assume we have an infinite supply of inputs.
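The totals above can be checked with a short calculation. This is only a sketch: the function names are made up for illustration, and it assumes a two-stage process (wash, then dry) in which the next wash starts as soon as the washer is free.

```python
WASHER_PD = 30  # minutes, from the Laundry Example slide
DRYER_PD = 60   # minutes, from the Laundry Example slide

def unpipelined_total(n_loads):
    """Each load is washed and dried completely before the next one starts."""
    return n_loads * (WASHER_PD + DRYER_PD)

def pipelined_total(n_loads):
    """Overlap washing of the next load with drying of the current one.

    The first load takes WasherPD + DryerPD; every later load adds one
    period of the slower device.
    """
    if n_loads == 0:
        return 0
    return WASHER_PD + DRYER_PD + (n_loads - 1) * max(WASHER_PD, DRYER_PD)

for n in (1, 4, 10):
    print(n, unpipelined_total(n), pipelined_total(n))
```

For these delays the pipelined total works out to N*60 + 30, which is the startup-corrected figure quoted above.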

Page 7:

Recall Our Performance Measures

Latency: Delay from input to corresponding output
Unpipelined Laundry = 90 mins
Pipelined Laundry = 120 mins

Throughput: Rate at which inputs or outputs are processed
Unpipelined Laundry = 1/90 outputs/min
Pipelined Laundry = 1/60 outputs/min

Assuming that the wash is started as soon as possible and waits (wet) in the washer until the dryer is available.

Even though we increase latency, it takes less time per load.

Page 8:

Okay, Back to Circuits…

[Figure: combinational circuit. Input X feeds blocks F and G; their outputs feed block H, which produces P(X).]

For combinational logic: latency = tPD, throughput = 1/tPD. We can’t get the answer faster, but are we making effective use of our hardware at all times?

[Timing diagram: signals X, F(X), G(X), P(X)]

F & G are “idle”, just holding their outputs stable while H performs its computation

Page 9:

Pipelined Circuits

use registers to hold H's input stable!

[Figure: X feeds F (15 ns) and G (20 ns); registers hold their outputs as inputs to H (25 ns); a register holds H's output, P(X).]

Now F & G can be working on input Xi+1 while H is performing its computation on Xi. We've created a 2-stage pipeline: if we have a valid input X during clock cycle j, P(X) is valid during clock j+2.

Suppose F, G, H have propagation delays of 15, 20, 25 ns and we are using ideal zero-delay registers (ts = 0, tpd = 0):

                    latency (ns)    throughput (1/ns)
unpipelined         45              1/45
2-stage pipeline    50 (worse)      1/25 (better)

Pipelining uses registers to improve the throughput of combinational circuits
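The latency and throughput figures for this circuit can be reproduced with a few lines. The sketch below assumes the topology in which F and G operate in parallel on X and both feed H, and it assumes ideal zero-delay registers as stated on the slide; the function names are illustrative only.

```python
def unpipelined(f_pd, g_pd, h_pd):
    """Latency and throughput of the purely combinational circuit."""
    latency = max(f_pd, g_pd) + h_pd   # F and G run in parallel, then H
    return latency, 1 / latency

def two_stage(f_pd, g_pd, h_pd):
    """Latency and throughput with registers between {F, G} and H."""
    clock = max(max(f_pd, g_pd), h_pd)  # clock period set by the slowest stage
    return 2 * clock, 1 / clock         # two stages, one result per clock

print(unpipelined(15, 20, 25))  # latency 45 ns, throughput 1/45 per ns
print(two_stage(15, 20, 25))    # latency 50 ns, throughput 1/25 per ns
```

The 5 ns of extra latency comes from the first stage waiting out a 25 ns clock even though F and G finish within 20 ns.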

Page 10:

Pipeline Diagrams

Rows are pipeline stages, columns are clock cycles:

Clock cycle    i       i+1       i+2        i+3        i+4
Input          Xi      Xi+1      Xi+2       Xi+3
F Reg                  F(Xi)     F(Xi+1)    F(Xi+2)
G Reg                  G(Xi)     G(Xi+1)    G(Xi+2)
H Reg                            H(Xi)      H(Xi+1)    H(Xi+2)

The results associated with a particular set of input data move diagonally through the diagram, progressing through one pipeline stage each clock cycle.

[Figure: the 2-stage pipelined circuit, with F = 15, G = 20, H = 25]

This is an example of parallelism. At any instant we are computing 2 results.
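A diagram like this can be generated mechanically: the register at the end of stage s holds the result for input k during cycle k + s. The sketch below is one way to do that, with cycle and input indices counted from 0 (so X0 plays the role of Xi); the function name and data layout are assumptions made for illustration.

```python
def pipeline_diagram(stages, n_inputs, n_cycles):
    """Return {row name: [cell per clock cycle]} for a simple linear pipeline.

    stages is a list of lists: the blocks that make up each pipeline stage.
    """
    rows = {"Input": [f"X{c}" if c < n_inputs else "" for c in range(n_cycles)]}
    for s, blocks in enumerate(stages, start=1):
        for name in blocks:
            rows[f"{name} Reg"] = [
                f"{name}(X{c - s})" if 0 <= c - s < n_inputs else ""
                for c in range(n_cycles)
            ]
    return rows

diagram = pipeline_diagram(stages=[["F", "G"], ["H"]], n_inputs=4, n_cycles=5)
for name, cells in diagram.items():
    print(f"{name:6}", " ".join(f"{cell:6}" for cell in cells))
```

Reading any input's results across the rows shows the diagonal movement described above.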

Page 11:

Pipelining Summary

Advantages:
- Higher throughput than combinational system
- Different parts of the logic work on different parts of the problem…

Disadvantages:
- Generally, increases latency
- Only as good as the *weakest* link (often called the pipeline's BOTTLENECK)

Page 12:

Review of CPU Performance

MIPS = Millions of Instructions/Second
Freq = Clock Frequency, MHz
CPI = Clocks per Instruction

MIPS = Freq / CPI

To Increase MIPS:

1. DECREASE CPI.
- RISC simplicity reduces CPI to 1.0.
- CPI below 1.0? State-of-the-art multiple instruction issue

2. INCREASE Freq.
- Freq limited by delay along longest combinational path; hence
- PIPELINING is the key to improving performance.
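As a quick illustration of the two levers, the formula can be evaluated directly. The clock frequencies and CPI values below are hypothetical numbers chosen only to show the arithmetic; they do not describe any particular processor.

```python
def mips(freq_mhz, cpi):
    """Millions of instructions per second = clock frequency (MHz) / CPI."""
    return freq_mhz / cpi

baseline = mips(freq_mhz=100, cpi=1.0)      # hypothetical starting point
faster_clock = mips(freq_mhz=200, cpi=1.0)  # lever 2: raise Freq (e.g., by pipelining)
multi_issue = mips(freq_mhz=100, cpi=0.5)   # lever 1: CPI below 1.0 via multiple issue

print(baseline, faster_clock, multi_issue)
```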

Page 13:

Where Are the Bottlenecks?

Pipelining goal: Break LONG combinational paths: memories, ALU in separate stages

[Figure: unpipelined miniMIPS datapath, showing the PC, Instruction Memory, Register File, ALU, Data Memory, and Control Logic]

Page 14:

Goal: 5-Stage Pipeline

GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed to barely include slowest components (mems, regfile, ALU)

APPROACH: structure processor as 5-stage pipeline:

IF: Instruction Fetch stage: Maintains PC, fetches one instruction per cycle and passes it to …

ID/RF: Instruction Decode/Register File stage: Decode control lines and select source operands

ALU: ALU stage: Performs specified operation, passes result to …

MEM: Memory stage: If it's a lw, use ALU result as an address, pass mem data (or ALU result if not lw) to …

WB: Write-Back stage: writes result back into register file.

Page 15:

5-Stage miniMIPS

[Figure: pipelined miniMIPS datapath with stages Instruction Fetch, Register File, ALU, Memory, and Write Back. Pipeline registers by stage: REG: PC, IR; ALU: PC, IR, A, B, WD; MEM: PC, IR, Y, WD; WB: PC, IR, Y.]

Address is available right after instruction enters Memory stage

Data is needed just before rising clock edge at end of Write Back stage

• Omits some details

Page 16:

Pipelining

Improve performance by increasing instruction throughput

[Figure: unpipelined execution of lw $1, 100($0); lw $2, 200($0); lw $3, 300($0). Each instruction passes through Instruction fetch, Reg, ALU, Data access, Reg and takes 8 ns; the next instruction starts only when the previous one has finished. Axes: Time (ns) versus Program execution order (in instructions).]

[Figure: pipelined execution of the same three instructions. Each stage takes 2 ns and a new instruction starts every 2 ns. Axes: Time (ns) versus Program execution order (in instructions).]

Ideal speedup is number of stages in the pipeline. Do we achieve this?
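One way to approach the question is to compute both timelines from the figure's numbers: 8 ns per instruction unpipelined, and five stages of 2 ns each when pipelined. The sketch below uses only those two figures; the function names are illustrative, and it ignores hazards entirely.

```python
def unpipelined_time(n_instr, instr_ns=8):
    """Each instruction runs start to finish before the next begins."""
    return n_instr * instr_ns

def pipelined_time(n_instr, n_stages=5, stage_ns=2):
    """The first instruction takes n_stages clocks; each later one adds a clock."""
    if n_instr == 0:
        return 0
    return (n_stages + n_instr - 1) * stage_ns

print(unpipelined_time(3), pipelined_time(3))  # the three lw instructions
big = 1_000_000
print(unpipelined_time(big) / pipelined_time(big))  # long-run speedup
```

With these numbers the long-run speedup approaches 8/2 = 4 rather than 5: every stage is clocked at the 2 ns of the slowest stage, so a pipelined instruction spends 10 ns in flight instead of 8 ns.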

Page 17:

Pipelining

What makes it easy
- all instructions are the same length
- just a few instruction formats
- memory operands appear only in loads and stores

What makes it hard?
- structural hazards: suppose we had only one memory
- control hazards: need to worry about branch instructions
- data hazards: an instruction depends on a previous instruction

Net effect:
- Individual instructions still take the same number of cycles
- But improved throughput by increasing the number of simultaneously executing instructions

Page 18:

Data Hazards

Problem with starting next instruction before first is finished
- dependencies that "go backward in time" are data hazards

[Figure: pipeline diagram over clock cycles CC 1 to CC 9. Each instruction passes through IM, Reg, ALU, DM, Reg. Axes: Time (in clock cycles) versus Program execution order (in instructions).]

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Value of register $2 (CC 1 to CC 9): 10, 10, 10, 10, 10/-20, -20, -20, -20, -20
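The dependences in this sequence can be found by scanning for reads of a register that an earlier instruction writes. The sketch below uses a hand-built tuple representation (text, destination register, source registers), which is an assumption made for illustration. The `window=2` default is also an assumption: it is read off the register-value row above, where $2 changes in CC 5, too late for the next two instructions but in time for the third.

```python
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),  # sw writes no register
]

def raw_dependences(program):
    """Return (producer index, consumer index, register) for each read-after-write."""
    deps = []
    for j, (_, _, sources) in enumerate(program):
        for reg in sorted(set(sources)):
            # Find the most recent earlier instruction that writes reg.
            for i in range(j - 1, -1, -1):
                if program[i][1] == reg:
                    deps.append((i, j, reg))
                    break
    return deps

def data_hazards(program, window=2):
    """Dependences whose consumer follows the producer too closely (assumed window)."""
    return [d for d in raw_dependences(program) if d[1] - d[0] <= window]

print(raw_dependences(PROGRAM))
print(data_hazards(PROGRAM))
```

All four later instructions depend on `sub`, but under the assumed window only `and` and `or` would read the stale value 10.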

Page 19:

Software Solution

Have compiler guarantee no hazards
- Where do we insert the "nops"? Between "producing" and "consuming" instructions!

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Problem: this really slows us down!
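A compiler pass for this could look roughly like the sketch below. The slide does not say how many nops are required, so `gap=2` (at least two instructions between a producer and its consumer) is an assumption, as is the tuple representation of instructions.

```python
PROGRAM = [
    ("sub $2, $1, $3",  2,    (1, 3)),
    ("and $12, $2, $5", 12,   (2, 5)),
    ("or $13, $6, $2",  13,   (6, 2)),
    ("add $14, $2, $2", 14,   (2, 2)),
    ("sw $15, 100($2)", None, (15, 2)),
]

def insert_nops(program, gap=2):
    """Insert nops so that `gap` instructions separate each producer and consumer."""
    out = []  # emitted (text, destination register) pairs
    for text, dest, sources in program:
        needed = 0
        recent = out[-gap:] if gap > 0 else []
        # back = 1 is the instruction emitted immediately before this one.
        for back, (_, prev_dest) in enumerate(reversed(recent), start=1):
            if prev_dest is not None and prev_dest in sources:
                needed = max(needed, gap - back + 1)
        out.extend([("nop", None)] * needed)
        out.append((text, dest))
    return [text for text, _ in out]

for line in insert_nops(PROGRAM):
    print(line)
```

Under this assumption two nops land between `sub` and `and`, and the remaining instructions are already far enough from `sub`. Those two wasted slots are the slowdown the slide complains about.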

Page 20:

Forwarding

Bypass/forward results as soon as they are produced/needed. Don't wait for them to be written back into registers!

[Figure: pipeline diagram over clock cycles CC 1 to CC 9. Each instruction passes through IM, Reg, ALU, DM, Reg. Axes: Time (in clock cycles) versus Program execution order (in instructions).]

sub $2, $1, $3
and $12, $2, $5
or $13, $6, $2
add $14, $2, $2
sw $15, 100($2)

Value of register $2 (CC 1 to CC 9): 10, 10, 10, 10, 10/-20, -20, -20, -20, -20
Value of EX/MEM (CC 1 to CC 9):      X, X, X, -20, X, X, X, X, X
Value of MEM/WB (CC 1 to CC 9):      X, X, X, X, -20, X, X, X, X
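The choice of where an ALU operand comes from can be sketched as a small selection function. This follows the usual textbook rule (prefer the newest result, never forward register $0) and is not necessarily the exact logic of the miniMIPS datapath; the function and parameter names are made up for illustration.

```python
def forward_select(src_reg, ex_mem_rd, ex_mem_writes, mem_wb_rd, mem_wb_writes):
    """Pick the source of an ALU operand: 'EX/MEM', 'MEM/WB', or 'REGFILE'."""
    # $0 is hardwired to zero in MIPS, so it is never forwarded.
    if ex_mem_writes and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"   # result of the instruction one ahead (newest value)
    if mem_wb_writes and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"   # result of the instruction two ahead
    return "REGFILE"      # no pending write: the register file is up to date

# CC 4: `and` is in the ALU stage and sub's result (-20) sits in EX/MEM.
print(forward_select(2, ex_mem_rd=2, ex_mem_writes=True,
                     mem_wb_rd=0, mem_wb_writes=False))
# CC 5: `or` is in the ALU stage; EX/MEM holds and's result ($12), MEM/WB holds sub's.
print(forward_select(2, ex_mem_rd=12, ex_mem_writes=True,
                     mem_wb_rd=2, mem_wb_writes=True))
```

The two calls correspond to the -20 entries in the EX/MEM and MEM/WB rows above.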

Page 21:

Can't always forward

Load word can still cause a hazard:
- an instruction tries to read a register following a load instruction that writes to the same register. STALL!

[Figure: pipeline diagram over clock cycles CC 1 to CC 9. Each instruction passes through IM, Reg, ALU, DM, Reg. Axes: Time (in clock cycles) versus Program execution order (in instructions).]

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7

Page 22:

Stalling

When needed, stall the pipeline by keeping an instruction in the same stage for an extra clock cycle.

[Figure: pipeline diagram over clock cycles CC 1 to CC 10. The instructions after the lw are held for one cycle, leaving a bubble in the pipeline. Axes: Time (in clock cycles) versus Program execution order (in instructions).]

lw $2, 20($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
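The stall condition can be sketched as a check on adjacent instructions: a bubble is needed when an instruction reads the register that the immediately preceding load writes. The sketch assumes forwarding handles every other dependence; the tuple layout (text, is_load, destination, sources) and the function names are illustrative assumptions.

```python
PROGRAM = [
    ("lw $2, 20($1)",  True,  2, (1,)),
    ("and $4, $2, $5", False, 4, (2, 5)),
    ("or $8, $2, $6",  False, 8, (2, 6)),
    ("add $9, $4, $2", False, 9, (4, 2)),
    ("slt $1, $6, $7", False, 1, (6, 7)),
]

def needs_stall(prev_is_load, prev_dest, sources):
    """True when an instruction reads the result of the load right before it."""
    return prev_is_load and prev_dest != 0 and prev_dest in sources

def schedule_with_bubbles(program):
    """Return the issue order, with 'bubble' wherever the pipeline must stall."""
    out = []
    prev = None
    for instr in program:
        text, _, _, sources = instr
        if prev is not None and needs_stall(prev[1], prev[2], sources):
            out.append("bubble")
        out.append(text)
        prev = instr
    return out

schedule = schedule_with_bubbles(PROGRAM)
total_cycles = 5 + len(schedule) - 1  # 5 stages, one new slot issued per cycle
print(schedule, total_cycles)
```

One bubble after the `lw` stretches the five instructions from 9 to 10 clock cycles, matching the CC 10 column in the figure.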

Page 23:

Branch Hazards

When branching, other instructions are in the pipeline!
- need to add hardware for flushing instructions if we are wrong

[Figure: pipeline diagram over clock cycles CC 1 to CC 9. Each instruction passes through IM, Reg, ALU, DM, Reg. Axes: Time (in clock cycles) versus Program execution order (in instructions).]

40 beq $1, $3, 7
44 and $12, $2, $5
48 or $13, $6, $2
52 add $14, $2, $2
72 lw $4, 50($7)
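The addresses in this example follow from the MIPS branch rule: the target is PC + 4 plus the word offset times 4. The sketch below checks that arithmetic and lists the sequential instructions already fetched behind the branch. The count of three in-flight instructions is an assumption read off the listing above (44, 48, 52), not a general property of every pipeline.

```python
def branch_target(pc, offset_words):
    """MIPS beq target: address after the branch plus the offset in words."""
    return pc + 4 + 4 * offset_words

def fetched_behind_branch(pc, in_flight=3):
    """Addresses fetched sequentially before the branch outcome is known.

    in_flight is assumed; these are the instructions to flush if the
    branch turns out to be taken.
    """
    return [pc + 4 * k for k in range(1, in_flight + 1)]

print(branch_target(40, 7))       # where `beq $1, $3, 7` at address 40 goes if taken
print(fetched_behind_branch(40))  # instructions to flush on a taken branch
```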

Page 24:

Pipeline Summary

A very common technique to improve throughput of any circuit
- used in all modern processors!

Fallacies:
- "Pipelining is easy." No, smart people get it wrong all of the time!
- "Pipelining is independent of ISA." No, many ISA decisions impact how easy/costly it is to implement pipelining (e.g., branch semantics, addressing modes).
- "Increasing pipeline stages improves performance." No, returns diminish because of increasing complexity.