1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined...

1

Designing a Pipelined Processor

In this Chapter, we will study

1. Pipelined datapath2. Pipelined control3. Data Hazards4. Forwarding5. Branch Hazards6. Performance

2

Pipelining is Natural

Car Wash Station (non-pipelined)

Vacuum Wash DryBrianSteve BrianSteve Brian Steve BrianBob Steve

Car Wash Station (pipelined)

Vacuum Wash DryBrianSteve BrianBob Steve BrianMike Bob Steve BrianGorge Mike Bob Steve

Vacuum, Wash, Dry each takes5 minutes

Nonpipelined Car Wash Stationserving one customer per 15minutes

Pipelined Car Wash Stationafter the first customer (it takes15 minutes), then every 5minutes finishing service fora new customer

3

What we learn from car wash station

• Pipelining does not help latency of a single task, but it helps throughput of entire workload

• Pipeline rate is limited by the slowest pipeline stage

• Multiple tasks operating simultaneously using different resources

• Potential speedup = Number of stages

• Unbalanced lengths of pipe stages reduces speedup

• Time to fill the pipe and time to drain the pipe -> reduce speedup

• Stall the pipe for dependences

4

Performance

• Suppose we execute 100 instruction

• single cycle implementation– 50 ns/cycle x 1 CPI x 100 inst. = 5000 ns

• multiple cycle implementation– 10 ns/cycle x 4.6 CPI (instr. mix) x 100 inst = 4600 ns

• ideal pipelined processor– 10 ns/cycle x (1 CPI x 100 inst + 4 cycle to fill up the pipe)

= 1040 ns

– here we assume that 5 stages in the pipeline

– 5 stages : IF, ID, EX, MEM, WB

5

Why pipeline ?

Because the resources are there ! Why waste !

Start fetching and executing the next instruction before the current one has completed

Remember the performance equation: CPU time = CPI * CC * IC

Fetch (and execute) more than one instruction at a timeSuperscalar processing

6

A Pipelined MIPS Processor• Start the next instruction before the current one has completed

– improves throughput - total amount of work done in a given time

– instruction latency (execution time, delay time, response time - time from the start of an instruction to its completion) is not reduced

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

IFetch Dec Exec Mem WBlw

Cycle 7Cycle 6 Cycle 8

sw IFetch Dec Exec Mem WB

R-type IFetch Dec Exec Mem WB

» clock cycle (pipeline stage time) is limited by the slowest stage

» for some instructions, some stages are wasted cycles

7

Single Cycle, Multiple Cycle, vs. Pipeline

Multiple Cycle Implementation:

Clk

Cycle 1

IFetch Dec Exec Mem WB

Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10

IFetch Dec Exec Mem

lw sw

IFetch

R-type

lw IFetch Dec Exec Mem WB

Pipeline Implementation:

IFetch Dec Exec Mem WBsw

IFetch Dec Exec Mem WBR-type

Clk

Single Cycle Implementation:

lw sw Waste

Cycle 1 Cycle 2

8

MIPS Pipeline Datapath Modifications• What do we need to add/modify in our MIPS datapath?

– State registers between each pipeline stage to isolate them

ReadAddress

InstructionMemory

Add

PC

4

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read Data 1

Read Data 2

16 32

ALU

Shiftleft 2

Add

DataMemory

Address

Write Data

ReadDataIF

etc

h/D

ec

De

c/E

xe

c

Ex

ec

/Me

m

Me

m/W

B

IF:IFetch ID:Dec EX:Execute MEM:MemAccess

WB:WriteBack

System Clock

SignExtend

9

Pipelining the MIPS ISA• What makes it easy

– all instructions are the same length (32 bits)

» can fetch in the 1st stage and decode in the 2nd stage

– few instruction formats (three) with symmetry across formats

» can begin reading register file in 2nd stage

– memory operations can occur only in loads and stores

» can use the execute stage to calculate memory addresses

– each MIPS instruction writes at most one result (i.e., changes the machine state) and does so near the end of the pipeline (MEM and WB)

• What makes it hard– structural hazards: what if we had only one memory?

– control hazards: what about branches?

– data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

10

Can pipelining cause us trouble ?

• Pipeline Hazards :– structural hazards : attempt to use the same resource by two

different instructions

– data hazards : attempt to use data before it is valid. Such as an instruction depends on results of prior instructions still in the pipeline

– control hazards : attempt to execute branch before condition is computed by prior instructions

• Resolve Pipeline Hazards :– simplest solution : wait until hazards are gone

– take action to resolve hazards, the earlier the hazards are solved, the higher the performance the machine can achieve.

• Pipeline Overhead : – need pipeline registers between stages

11

Why Pipeline? For Performance!

Instr.

Order

Time (clock cycles)

Inst 0

Inst 1

Inst 2

Inst 4

Inst 3

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM RegA

LUIM Reg DM Reg

AL

UIM Reg DM Reg

Once the pipeline is full, one instruction

is completed every cycle, so

CPI = 1

Time to fill the pipeline

12

Structural Hazard Example

Mem Reg ALU Mem Reg

Mem Reg ALU Mem Reg

Mem Reg ALU Mem Reg

Mem Reg ALU Mem Reg

instruction 1

Instruction 2

Instruction 3

instruction 4

time

Two instructions want to access the MEM at the same time

Structural Hazard !

How to resolve this hazard ?

13

Instr.

Order

Time (clock cycles)

lw

Inst 1

Inst 2

Inst 4

Inst 3

AL

UMem Reg Mem Reg

AL

UMem Reg Mem Reg

AL

UMem Reg Mem RegA

LUMem Reg Mem Reg

AL

UMem Reg Mem Reg

A Single Memory Would Be a Structural Hazard

Reading data from memory

Reading instruction from memory

• Fix with separate instr and data memories (I$ and D$)

14

How About Register File Access?

Instr.

Order

Time (clock cycles)

add $1,

Inst 1

Inst 2

add $2,$1,

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

15

How About Register File Access?

Instr.

Order

Time (clock cycles)

Inst 1

Inst 2

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM RegA

LUIM Reg DM Reg

Fix register file access hazard by doing reads in the second half of the

cycle and writes in the first half

add $1,

add $2,$1,

clock edge that controls register writing

clock edge that controls loading of pipeline state registers

16

Register Usage Can Cause Data Hazards

Instr.

Order

add $1,

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9A

LUIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

• Dependencies backward in time cause hazards

• Read After write data hazard

17

Register Usage Can Cause Data Hazards

AL

UIM Reg DM Reg

AL

UIM Reg DM RegA

LUIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg


add $1,

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

• Read after write data hazard

18

Loads Can Cause Data Hazards

Instr.

Order

lw $1,4($2)

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9A

LUIM Reg DM Reg

AL

UIM Reg DM RegA

LUIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg


• Load-use data hazard

19

stall

stall

One Way to “Fix” a Data Hazard

Instr.

Order

add $1,

AL

UIM Reg DM Reg

sub $4,$1,$5

and $6,$1,$7

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

Can fix data hazard by

waiting – stall – but impacts CPI

20

Another Way to “Fix” a Data Hazard

Instr.

Order

add $1,

AL

UIM Reg DM Reg

sub $4,$1,$5

and $6,$1,$7A

LUIM Reg DM Reg

AL

UIM Reg DM Reg

Fix data hazards by forwarding

results as soon as they are available to where they are

needed

xor $4,$1,$5

or $8,$1,$9

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

21

Another Way to “Fix” a Data Hazard

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

Fix data hazards by forwarding

results as soon as they are available to where they are

neededA

LUIM Reg DM Reg

AL

UIM Reg DM Reg

Instr.

Order

add $1,

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

22

Forwarding with Load-use Data Hazards

Instr.

Order

lw $1,4($2)

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9A

LUIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

23

Forwarding with Load-use Data Hazards

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

AL

UIM Reg DM Reg

• Will still need one stall cycle even with forwarding

Instr.

Order

lw $1,4($2)

sub $4,$1,$5

and $6,$1,$7

xor $4,$1,$5

or $8,$1,$9

24

Branch Instructions Cause Control Hazards

Instr.

Order

lw

Inst 4

Inst 3

beq

AL

UIM Reg DM Reg

AL

UIM Reg DM RegA

LUIM Reg DM Reg

AL

UIM Reg DM Reg


25

stall

stall

stall

One Way to “Fix” a Control Hazard

Instr.

Order

beq

AL

UIM Reg DM Reg

lw

AL

UIM Reg DM Reg

AL

U

Inst 3IM Reg DM

Fix branch hazard by waiting –

stall – but affects CPI

26

Corrected Datapath to Save RegWrite Addr• Need to preserve the destination register address in the

pipeline state registers

ReadAddress

InstructionMemory

Add

PC

4

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read Data 1

Read Data 2

16 32

ALU

Shiftleft 2

Add

DataMemory

Address

Write Data

ReadData

IF/ID

SignExtend

ID/EX EX/MEM

MEM/WB

27

Corrected Datapath to Save RegWrite Addr• Need to preserve the destination register address in the

pipeline state registers

ReadAddress

InstructionMemory

Add

PC

4

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read Data 1

Read Data 2

16 32

ALU

Shiftleft 2

Add

DataMemory

Address

Write Data

ReadData

IF/ID

SignExtend

ID/EX EX/MEM

MEM/WB

28

MIPS Pipeline Control Path Modifications• All control signals can be determined during Decode

– and held in the state registers between pipeline stages

ReadAddress

InstructionMemory

Add

PC

4

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read Data 1

Read Data 2

16 32

ALU

Shiftleft 2

Add

DataMemory

Address

Write Data

ReadData

IF/ID

SignExtend

ID/EXEX/MEM

MEM/WB

Control

29

Other Pipeline Structures Are Possible• What about the (slow) multiply operation?

– Make the clock twice as slow or …

– let it take two cycles (since it doesn’t use the DM stage)

AL

UIM Reg DM Reg

MUL

AL

UIM Reg DM1 RegDM2

• What if the data memory access is twice as slow as the instruction memory?

– make the clock twice as slow or …

– let data memory access take two cycles (and keep the same clock rate)

30

Sample Pipeline Alternatives• ARM7

• StrongARM-1

• XScale

AL

UIM1 IM2 DM1 RegDM2

IM Reg EX

PC updateIM access

decodereg access

ALU opDM accessshift/rotatecommit result (write back)

AL

UIM Reg DM Reg

Reg SHFT

PC updateBTB access

start IM access

IM access

decodereg 1 access

shift/rotatereg 2 access

ALU op

start DM accessexception

DM writereg write

31

Summary

• All modern day processors use pipelining

• Pipelining doesn’t help latency of single task, it helps throughput of entire workload

• Potential speedup: a CPI of 1 and fast a CC

• Pipeline rate limited by slowest pipeline stage

– Unbalanced pipe stages makes for inefficiencies

– The time to “fill” pipeline and time to “drain” it can impact speedup for deep pipelines and short code runs

• Must detect and resolve hazards

– Stalling negatively affects CPI (makes CPI less than the ideal of 1)

1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined...

Documents

Transcript of 1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined...