Processor Pipelining

8/12/2019 Processor Pipelining

http://slidepdf.com/reader/full/processor-pipelining 1/144

14:332:331

The Processor(datapath and pipelining)

Address Instruction

Instruction

Memory

Write Data

Reg Addr

Reg Addr

Reg Addr

Register

File ALUData

Memory

Address

Write Data

Read DataPC

Read

Data

Read

Data



Review: Design Principles Simplicity favors regularity

fixed size instructions and data – 32-bits Good design demands good compromises-

Only three instruction formats

Smaller is faster-

limited instruction set

limited number of registers in register file

limited number of addressing modes

Make the common case fast-arithmetic operands from the register file (load-store

machine)

allow instructions to contain immediate operands



We're ready to look at an implementation of the MIPS

Simplified to contain only:

memory-reference instructions: lw, sw

arithmetic-logical instructions: add, sub, and,

or, slt

control flow instructions: beq, j

The Processor: Datapath & Control



Generic implementation (first two stages are same):

use the program counter (PC) to supplythe instruction address and fetch theinstruction from memory (and update the PC)

decode the instruction (and read registers)

execute the instruction

All instructions (except j) use the ALU after reading the

registers

The Processor: Datapath & ControlFetch

PC = PC+4

DecodeExec



Abstract Implementation View Two types of functional units:

elements that operate on data (combinational – likeALU) elements that contain state (sequential - registers and

memory)



Abstract Implementation View

Control

Unit

DataMemory

Address

Write Data

Read Dataoverflow

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU

z

ero

Address Instruction

Instruction

Memory

PC

Single cycle operation (multi-cycle presented later)

Split memory (Harvard ) model - one memory forinstructions (instruction cache) and one for data

Shows how PC is incremented or changed by branchtaken, does not show multiplexors or control lines



Clocking Methodologies Clocking methodology defines when signals can be read

and when they can be written. Do not read while writing- unpredictable result falling (negative) edge

rising (positive) edgecycle time

We adopt an edge-triggered clocking methodology =update on clock edge

Values stored in the state elements are updated only on a

clock edge. Input to combinatorial element are values stored by the state elements in a previous clock cycle,

Combinatorial elements output can be used in the following clock cycle.



Synchronous digital system All signals must propagate from State element 1

through combinatorial block and to State element 2 in asingle clock cycle.

State element is read in half cycle and written secondhalf.

The time needed for the logic gates to settle determinesthe length of the clock cycle.

State

element1

State

element2

Combinationallogic

clock

one clock cycle

Assumes state elements are written on every clock cycle; If not, need explicit write control signal - write occurs

only when both the write control is asserted and clockedge occurs



Building the Datapath The datapath assures that data travels between

various memory units to registers and ALUs;

The control unit regulates this transfer anddetermines what actions are to be taken on the data.

It does so using control lines that are connected to

various hardware units.

The edge triggered clock methodology allows a givenstate element to be both read and be written to in asingle clock cycle.

Either we have a long clock cycle for one instruction(length is determined by the slowest instruction), ormultiple cycles per instruction.



Building the Datapath

What are the “building blocks” needed to implement a

subset of the MIPS instructions (lw, sw, arithmetic andlogic instructions - like add , sub, and , or, slt)? Any instruction needs to be fetched and the PC

incremented by 4

Read

AddressInstruction

InstructionMemory

Add

PC

4

Special ALU that only adds

PC is updated on

every clock cycle

Instruction Memory is

read every cycle, so it

doesn’t need an explicitread control signal

Extend later for j, beq



Decoding Instructions Decoding instructions involves

sending the fetched instruction’s opcode andfunction field bits to the control unit

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

Control

Unit

reading two values from the Register FileRegister File addresses are contained in the

instruction

5

532

32



overflowInstruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU zero

R format operations (add , sub, slt, and , or)

Executing R Format Operations

R-type:31 25 20 15 5 0

op rs rt rd functshamt

10

ALU control

4

32

32

– perform the indicated (by op and funct) operation onvalues in rs and rt

– store the result back into the Register File (intolocation rd )

The fourth ALU control line supports NOR



Note that Register File is not written every cycle (e.g.

sw), so we need an explicit write control signal forthe Register File (RegWrite)

Executing R Format Operations

RegWrite

32

Instruction

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU

overflow

zero

ALU control

4

32

32

To execute lw or sw operations we need a Data

Memory unit with two control signals for write into( MemWrite) and read from ( MemRead )

MemWrite

MemRead

Data

Memory

Address

Write Data

Read Data



Executing Load and Store Operations

Load and store operations compute a memory address by adding the base register (in rs) to the 16-bit signed

offset field in the instruction

The 32 bits in the base register were read from the

Register File during decode

The offset value in the low order 16 bits of theinstruction must be sign extended to create a 32-bit

signed value

I-Type: op rs rt address offset

31 25 20 15 0



Data

Memory

Address

Write Data

Read Data

MemWrite

MemRead

16 32

Sign

Extend

16-bit offset

RegWrite

Read Addr 1 zero

ALU

overflow

Write Data

InstructionRead Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU control

sw - value read from the Register File during decode must be written to the Data Memory

lw - value read from the Data Memory must be stored in the

Register File

Executing Load and Store Operations



Executing Branch Operations

Branch operations have to compare the operands readfrom the Register File during decode (rs and rt

values) for equality (zero ALU output is asserted) compute the branch target address by adding the

updated PC to the sign extended 16-bit signed offset

field in the instruction

offset value in the low order 16 bits of the instruction

must be sign extended to create a 32-bit signed value

and then shifted left 2 bits to turn it into a word address


31 25 20 15 0



Executing Branch Operations

RegWrite

Read Addr 1 zero

ALU

overflow

Write Data

InstructionRead Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU control

(to branch control logic -take

branch if zero line is high)

Add

Branch

target

addressShift

left 2

Add

4

PC

PC + 4

(perform a subtraction)

3216

16-bit offset

Sign

Extend



Executing Jump Operations

Jump operations have to replace the lower 28 bits of thePC+4 with the lower 26 bits of the fetched instructionshifted left by 2 bits

Read

AddressInstruction

Instruction

Memory

Add

PC

4

26

4

Shift

left 2 28

Jumpaddress

J-Type: op

31 25 0

jump target address



Creating a Single Datapath from the Parts We need to assemble the datapath segments, add

control lines as needed, and design the control path Fetch, decode and execute each instruction in one

clock cycle – single cycle design. Cycle time isdetermined by length of the longest path

No datapath resource can be used more than once perinstruction, so some must be duplicated (that is whywe have a separate Instruction Memory and DataMemory)

To share datapath elements between differentinstruction classes will need multiplexors at theinput of the shared elements

Need control lines to do the selection of inputs



Fetch, R, and Memory Access Portions

Read

AddressInstruction

Instruction

Memory

Add

PC

4RegWrite

ovf

zero

ALU control

ALUData

Memory

Address

Write Data

Read Data

MemWrite

MemRead

lw

R

R

Sign

Extend16 32

lw / sw

Register

File

Write Data

Read Addr 1

Read Addr 2

Write Addr

Read

Data 1

Read

Data 2



Multiplexor Insertion

MemtoReg

Read

AddressInstruction

Instruction

Memory

Add

PC

4

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU

ovf

zero

ALU controlRegWrite

DataMemory

Address

Write Data

Read Data

MemWrite

MemReadSign

Extend16 32

ALUSrc



Adding the Branch Portion

Add

4 Shift

left 2

Add

Read

AddressInstruction

Instruction

Memory

PC

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU

ovf

zero

ALU controlRegWrite

Data

Memory

Address

Write Data

Read Data

MemWrite

MemReadSign

Extend16 32

MemtoReg ALUSrc

R

lw / sw

R

lw

PCSrc

Branch not taken, R, lw /sw



We wait for everything to settle down - ALU might not

produce “right answer” right away

Cycle time determined by length of the longest path

Split memory (Harvard) model - single cycle operation

Simplified to contain only the instructions: lw, sw,add, sub, and, or, slt, beq .

Sequential components (PC, RegFile, Memory) are edgetriggered - state elements are written on every clock

cycle; if not, need explicit write control signal write occurs only when both the write control is

asserted and the clock edge occurs

Datapath - review



Single cycle datapath

Observations - op field always in bits 31-26

address of the two registers to be read are always specified by the rs and rt fields (bits 25-21 and 20-16)

address of register to be written is in one of two places

– in rt (bits 20-16) for lw; in rd (bits 15-11) for R -

type instructions

base register for lw and sw always in rs (bits 25-21)

offset for beq , lw, and sw always in bits 15-0


31 25 20 15 0

R-type:

31 25 20 15 5 0

op rs rt rd funct shamt

10



(Almost) Complete Single Cycle Datapath

Read

AddressInstr[31-0]

Instruction

Memory

Add

PC

4

Write Data

Read Addr 1

Read Addr 2

Write AddrALU

ovf

zeroData

Memory

Address

Write Data

Read Data

MemWrite

MemRead

Register

File

Read

Data 1

Read

Data 2

RegWrite

Sign

Extend16 32

Shift

left 2

Add

RegDst

0

1

ALUSrc

0

1

MemtoReg

1

0

PCSrc

1

0

ALU

control

ALUOp

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[

15 -11]



ALU's operation based on instruction type and functioncode

MIPS uses multiple control levels to increase speed ofthe main control unit and decrease its size

ALU Control

ALU control input

(Ainvert+Binvert +Operation)

Function

0000 and

0001 or

0010 add

0110 subtract

0111 set on less than

1100 NOR

set



Multiple levels of control

main control unit generates the ALUOp bits

ALU control unit generates ALU controlinputs (main control is smaller)

ALU Control, continued

Instr op funct ALUOp desired action ALU control input

lw xxxxxx 00 add 0010

sw xxxxxx 00 add 0010

beq xxxxxx 01 subtract 0110

add 100000 10 add 0010

sub 100010 10 subtract 0110and 100100 10 AND 0000

or 100101 10 OR 0001

slt 101010 10 Set on less than 0111

ALU

control

ALUOp(from

control block)

Instr[5-0]

ALU control

input



ALU Control Truth Table

Can make use of more don’t cares

since ALUOp does not use the encoding 11

since F5 and F4 are always 10

F5 F4 F3 F2 F1 F0 ALUOp1 ALUOp0 Op3 Op2 Op1 Op0

X X X X X X 0 0 0 0 1 0

X X X X X X x 1 0 1 1 0

X X 0 0 0 0 1 x 0 0 1 0

X X 0 0 1 0 1 x 0 1 1 0X X 0 1 0 0 1 x 0 0 0 0

X X 0 1 0 1 1 x 0 0 0 1

X X 1 0 1 0 1 x 0 1 1 1



ALU Control Combinational Logic

From the truth table can design the ALU Control logic



Summary of control lines

RegDest Source of the destination register for the

operation

RegWrite Enables writing a register in the register file

ALUsrc Source of second ALU operand, can be a

register or part of the instruction PCsrc Source of the PC (increment [PC + 4] or

branch)

MemRead/MemWrite Reading / Writing from datamemory

MemtoReg Source of write register contents

D t th ith C t l U it



Read Addr 2

ALU

ovf

zeroData

Memory

Write Data

Address

Read Data 1

0Write Data

Read Addr 1

Write Addr

Register

File

Read

Data 1

Read

Data 2

Sign

Extend16 32

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15

-11]

Shift

left 2

Instr[31-0]

Add

0

1

Read

Address

Instruction

Memory

Add

PC

4

1

01

0

Instr[5-0]

Datapath with Control Unit

RegDst

Control

UnitInstr[31-26]

PCSrc

MemReadMemtoReg

MemWriteRegWrite

ALUSrc

ALU

control

ALUOpBranch



Main Control Unit

Instr RegDst ALUSrc MemReg RegWr MemRd MemWr Branch ALUOp1 ALUOp0

R-

type

000000

1 0 0 1 X 0 0 1 X

lw100011 0 1 1 1 1 0 0 0 0

sw

101011X 1 X 0 X 1 0 0 0

beq000100

X 0 X 0 X 0 1 X 1



Control Unit Logic From the truth table can design the Main Control logic

R-type lw sw beq

MemtoReg

MemReadMemWrite

ALUOp1

ALUOp0

Branch

ALUSrc

RegDst

RegWrite

000000 100011 101011 000100

Instr[31]Instr[30]

Instr[29]Instr[28]Instr[27]Instr[26]



Adding the Jump Operation

PCSrc

0

1

Shift

left 2

0

1

Jump

32Instr[25-0]26

PC+4[31-28]

28 PC

J has the opfield 000010 followed by 26-bit jump

address. We need to add to the datapath.

The instruction bits [25-0] are shifted left two bits and

concatenated to the four MSB of PC+4.

A new multiplexer and control line are needed to select

this input for the program counter

C t l U it L i



Control Unit Logic One more gate is needed in the Main Control logic for JInstr[31]Instr[30]

Instr[29]Instr[28]Instr[27]Instr[26]

RegDst

ALUSrc

MemtoReg

RegWrite

MemReadMemWrite

Branch

ALUOp1

ALUOp0

R-type lw sw beq

Jump 000010



Add

0

1

ALUOpBranch

PCSrcShift

left 2

Read

Address

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU

ovf

zero

RegWrite

Data

Memory

Address

Write Data

Read Data

MemWrite

MemRead

Sign

Extend16 32

MemtoReg

ALUSrc

RegDst

ALU

control

1

11

0

0

0

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15

-11]

Control

UnitInstr[31-26]

Instr[31-0]

Instruction

Memory

Add

PC

4

PC+4[31-28]

Adding Jump Operation

Jump

0

1Shift

left 232

Instr[25-0]

2628

C l U i L i



Control Unit Logic One more gate is needed in the Main Control logic forJal ( J type instruction which stores PC+4 in $ra $31)

Instr[31]Instr[30]Instr[29]Instr[28]Instr[27]Instr[26]

RegDst

MemtoReg

RegWrite

ALUOp1

ALUOp0

R-typeJal

000011000000

Jump

3 26-bit address J format

Addi J l O ti



Add

0

1

ALUOpBranch

PCSrcShift

left 2

Read

Address

Write Data

Read Addr 1

Read Addr 2

Write Addr

Register

File

Read

Data 1

Read

Data 2

ALU

ovf

zero

RegWrite

Data

Memory

Address

Write Data

Read Data

MemWrite

MemRead

Sign

Extend16 32

MemtoReg

ALUSrc

RegDst

ALU

control

1

1

1

0

0

0

Instr[5-0]

Instr[15-0]

Instr[25-21]

Instr[20-16]

Instr[15-11]

Control

UnitInstr[31-26]

Instr[31-0]

Instruction

Memory

Add

PC

4

PC+4[31-28]

Adding Jal Operation

Jump

0

1Shift

left 232

Instr[25-0]

2628

PC+4

31



Memto- Reg Mem Mem

RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUOp0 Jump

R-format 01 0 00 1 0 0 0 1 0 0

lw 00 1 01 1 1 0 0 0 0 0

sw xx 1 xx 0 0 1 0 0 0 0

beq xx 0 xx 0 0 0 1 0 1 0

J xx x xx 0 0 0 x x x 1

PC Jump Address

PC+ 4R[31]

MemtoReg

Is now 2 bits

RegDst

Is now 2 bits

Adding Control line Settings for Jal Operation

JAL 10 x 10 1 0 0 x x x 1



Single Cycle Implementation Cycle Time Unfortunately, though simple, the single cycle approach

is not used because it is inefficient Clock cycle must have the same length for every

instruction

What is the longest path (slowest instruction)?

Calculate cycle time assuming negligible delays (formuxes, control unit, sign extend, PC access, shift left 2,wires) except :

Instruction and Data Memory (2ns)

ALU and adders (2ns)

Register File access (reads or writes) (1ns)

floating point operations even longer



Instruction Critical Paths

Instr. I Mem Reg Rd ALU Op D Mem Reg Wr Total delay

(ns)

R-type

load

store

beq

jump

2 1 2 1 6

2 1 2 2 1 8

2 1 2 2 7

2 1 2 5

2 2



A floating point add.d = Instr. Fetch (2 ns)+ Reg. Read (1 ns)+ALU add(8 ns)+ Reg. Write (1 ns)= 12 ns

Floating point load l.s=2+1+2(ALUop)+2(data mem)+1 (Reg.) =8 ns

Floating point store s.s =2+1+2(ALU)+2(data mem) = 7 ns.

The longest instruction is floating point multiply mul = Inst. Fetch

(2 ns)+Reg. Read (1 ns)+ALU multiply (16 ns)+ Reg. Write (1 ns) =20 ns

Floating point branch = 5 ns, floating point jump=2 (fetch)

If clock period is variable in length, then we need to look atinstruction frequency. For example Loads (31%), stores (21%), R-type (27%), beq(5%), j (2%), add.d, sub.d (7%), mult.d, div.d(7%).

Combining to compute the clockcycle=8x31%+7x21%+6x27%+5x5%+2x2%+20x7%+12x7%=

7 ns

What about floating point operations?



Instead of a fixed cycle time, we allow cycle time to depend oninstruction class.

We can then compare performance, considering that CPI will still be 1, and Instruction count does not change.

Perf. CPU variable cycle time = CPU exec. time fixed cycle time

Perf. CPU fixed cycle time CPU exec. time var. cycle time

= Clock period fixed

Clock period variable

because performance= _____________1______________ )

Instr. Count x CPI x Clock Period

Performance improvement = 20 ns (fixed cycle clock period) = 2.86 faster

7 ns (variable cycle clock period)

What about variable cycle length?



Single cycle/instr. datapath is wasteful of area sincesome functional units must be duplicated since they cannot be “shared” during an instruction execution

e.g., need separate adders to do PC update and branchtarget address calculations, as well as an ALU to do R-type arithmetic/logic operations and data memoryaddress calculations

Where We are Headed Single cycle/instr. uses the clock cycle inefficiently – the

clock cycle must be timed to accommodate the slowest instruction – we cannot make common case fast.

especially problematic for more complex instructions

like floating point multiplications



Is wasteful of area since some functional units must be

duplicated since they can not be shared during a clockcycle (e.g., adders, memory units)

But, it is simple(r) and easy to understand

Single Cycle Disadvantages & Advantages

Uses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowest instruction

Single Cycle Implementation:

lw sw Waste

Clk

Cycle 1 Cycle 2



The Five Stages of a lw Instruction

We will consider only a subset of instructions (lw,

sw, add , sub, and , or, slt, beq )

IFetch: Instruction Fetch and Update PC

Dec: Registers Read and Instruction Decode

Exec: calculate memory address Mem: Read the data from the Data Memory

WB: Write the data back to the register file

IFetch Dec RR Exec Mem WB

Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5

lw

2 ns

Wh t if



What if…. Several instructions were worked on by the CPU at the “same time”

Each major logic unit works on a different stage of a different

instruction - Like doing laundry for different roommates



Single Cycle vs. Pipelined


Clk

Load Store Waste

Cycle 1 Cycle 2

lw IFetch Dec Exec Mem WB

Pipeline Implementation:

IFetch Dec Exec Mem WBsw

IFetch Dec Exec Mem WBR-type

wasted cycle

Pipelined lw finishes later than

non-pipelined version

Time savings

Si l C l Pi li d




Pipelined Implementation:

Single Cycle, vs. Pipelined

Time savings Instruction 2

Time savings Instruction 3

Assume memory and ALU ops take 200 ps, Reg ops take 100 ps

Register read in second half of cycle



What makes it easy - all instructions are the same length(32 bits) - The first two pipeline stages are the same forall instructions.

few instruction formats (three) with symmetry across

formats - registers addresses are in the same location andthus can be read while instructions are being decoded(read reg file in the first ½ of the clock cycle)

memory operations can occur only in loads and stores,

thus the ALU can compute memory addresses in EXstage

operands are aligned in memory so a single data transferrequires only one memory access

Designing MIPS Instructions for Pipelining

MIPS Pipeline Datapath Modifications



MIPS Pipeline Datapath Modifications What do we need to add/modify in our single-cycle per

instruction datapath to make it pipelined?

The MIPS instruction has (up to) five stages, thus pipelienehas 5 stages:

Ifetch to fetch the instruction from Instruction memory Dec to decode the instruction and read Register File

registers Exec to do the ALU operations Mem to read from/write into Data Memory WB to write back into the register file. So we need a way to

separate the data path into five pieces, without losingintermediate results. We will introduce Pipeline registers between

stages to isolate them and store intermediate results




Data

Memory Address

Write Data

Read

Data

M

e m / W B

System Clock


ALU

1

0

Shiftleft 2

Add

E x

e c / M e m 1

0

All instructions advance during one clock cycle between one pipeline register and the next

IF (fetch) EX (execute) Mem WB

(write back)

Read

Address

Instruction

Memory

Add

P C

4

0

1

I F e t c h / D e c

Read

Data 2

16

Read Addr 1

Write Addr

Write Data

Read Addr 2

Register

File

Read

Data 1

32

D e c / E x e c

ID (decode)

Sign

Extend




MIPS Pipeline Datapath Modifications Because all data is passed through the pipeline, the address of the

register where data needs to be loaded (lw) also needs to be passedIFetch

System Clock

Read

Address

Instruction

Memory

Add

P C

4

0

1

I F e t c h / D e c

1

0

Read

Data 2

16

Read Addr 1

Write Addr

Write Data

Read Addr 2

Register

File

Read

Data 1

32

D e c / E x e c

Sign

Extend

Data

Memory Address

Write Data

Read

Data

M e m / W B

ALU

1

0

Shiftleft 2

Add

E x e c / M e m

Dec Exec Mem WB

Pipeline Register size (for now)

IF/ID 64

ID/EX 32x4+5=133

EX/MEM 32x3+5=101

MEM/WB 32x2=5=69

Extends pipeline reg. to hold address of destination reg.

MIPS Pipeline Control Path Modifications



IFetch

Data

Memory Address

Write Data

Read

Data

M

e m / W B

Read

Address

Instruction

Memory

Add

P C

4

0

1

I F e t c h / D e c

System Clock

1

0

ALU

1

0

Shiftleft 2

Add

E x

e c / M e m

Read

Data 2

16

Read Addr 1

Write Addr

Write Data

Read Addr 2

Register

File

Read

Data 1

32

D e c / E x e c

Sign

Extend

MIPS Pipeline Control Path Modifications All control signals are determined during Decode and held in the pipeline registers between pipeline stages

Dec

Control

Exec Mem WB





The modifiedcontrol path is

6

2

4

3

Pipeline Example



Pipeline Example

How does the non-dependent instruction sequence

execute in a pipeline ? (no support for forwarding) before <4>

before <3>

before <2>

before <1>

lw $10, 20($1)

sub $11, $2, $3

and $12, $4, $5

or $13, $6, $7

add $14, $8, $9after <1>

after <2>

Pi li E l b f <4> l t



Pipeline Example - before <4> completes

Pipeline E ample b f <3> l t



Pi li E l b f <1> l t




3

11

$4, $5

Pipeline Example l completes



Pipeline Example - lw completes

12

$5

destination register

D a t a m e m o r y n

o t u s e d

( M E M

c o n t r o l l i n e s 0 )

Pipeline Example sub completes



13

$7

Pipeline Example - sub completes

D a t a m e m o r y n

o t u s e d

( M E M

c o n t r o l l i n e s 0 )

$6, $7

Pipeline Example and completes



Pipeline Example - and completes

$9

14

N o r m a l P C + 4 i n c r e m e n t

( P C S r c = 0 )

Pipeline Example or completes



Pipeline Example - or completes

Pipeline Example add completes



Pipeline Example - add completes

Written in first half of cycle

G hi ll R ti MIPS Pi li



So-far we saw the single-clock-cycle pipeline diagrams – show the state of the entire datapath during a clockcycle (instructions are identified above the pipeline

stages). Multi-clock-cycle pipeline diagrams are simpler, andcan help answer how many cycles does it take toexecute this code

Or what is the ALU doing during a certain cycle Can represent multiple instructions in a single figure If there is a hazard, it shows why it occurs, and how it

can be fixed

Graphically Representing MIPS Pipeline

IM Reg Reg

AL U

DM

Why Pipeline? For Throughput!



Why Pipeline? For Throughput!Time (clock cycles)

I

n

s

t

r.

O

r

d

e

r

Inst 0AL U

IM Reg DM Reg

Inst 1AL U

IM Reg DM Reg

Inst 2AL U

IM Reg DM Reg

Inst 3AL U

IM Reg DM Reg

Inst 4AL U

IM Reg DM Reg

O n c e

t h e

p i p e

l i n e

i s

f u

l l ,

o n e

i n s

t r u c

t i o n

i s

c o m p

l e

t e

d

e v e r y

c y c

l e

Time to fill the pipeline

Example of graphical representation



Example of graphical representation

M

Can be converted in a

single-clock-cycle

pipeline diagram

DM

Example of single-clock-cycle (cc 5)



Example of single clock cycle (cc 5)

pipeline representation

Pipelining the MIPS ISA



What makes it hard - structural hazards: what ifwe had only one memory - then the pipeline cannot have

one instruction read from memory (fetch stage), while atthe same time another instruction writes into memory (sw)

control hazards: need to make a decision based onthe results of one instruction, while that instruction is still

executing. what about branches?

pe g t e S S

Stalling

Impact of branch stalling



We assume that all instructions in the pipeline have a

CPI of 1. Branches which always are followed by a stall

have a CPI of 2.

In a typical program branches occur 13% of the time.

Thus we can compute the aggregate CPI of the always-

stall for branch architecture as:

Impact of branch stalling

Then CPI = Σ CPI i x F ii=1

n

CPI always stall = 1 x 87% + 2 x 13% = 1.13 cycles/instruction

Thus CPU Perform. always stall = Inst. Count x CPI no stall x ClockPerform. no stall Inst. CountxCPI always stall x Clock

Perform. always stall = 1 = 0.885 ( 88.5%)

Perform. no stall 1.13




control hazards: Another approach is “prediction” - either static - always execute the instruction following a branch (assumealways that the branch is not taken), or predict dynamically (keep ahistory of each branch as taken or not taken - accurate 90% of time).


Branch not taken

Branch taken




data hazards: what if an instruction’s input operands

depend on the output of a previous instruction that did notfinish? Example an add followed by a sub.


Forwarding

The pipeline simplified representation is shading the

blocks that are used in a given clock cycle.





Forwarding will fail for a lw followed immediately by

an instruction that uses the results of the lw operation.

Example lw followed by a sub.



How About Register File Access?

I

n

s

t

r.

O

r

d

er

Time (clock cycles)

add

Inst 1

Inst 2

Inst 4

add

AL U

IM Reg DM Reg

AL

U IM Reg DM Reg

AL U

IM Reg DM Reg

AL

U

IM Reg DM Reg

AL U

IM Reg DM Reg

Can fix register file

access hazard by

doing reads in the

second half of the

cycle and writes in

the first half.

Branch Instructions Cause Control Hazards



Branch Instructions Cause Control Hazards

I

n

s

t

r.

O

r

d

e

r

add

beq

lw

Inst 4

Inst 3

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL

U IM Reg DM Reg

AL U

IM Reg DM Reg

Dependencies backward in time cause hazards time

One Way to “Fix” a Control Hazard



One Way to Fix a Control Hazard

I

ns

t

r.

O

r

d

e

r

add

beq

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

Inst 3

lw

AL

U IM Reg DM Reg

AL U

IM Reg DM Reg

Can fix branch

hazard by waiting –

stall – but affects

throughput

stall

stall

Register Usage Can Cause Data Hazards



I

n

s

t

r.

O

r

d

er

No data hazard

Register Usage Can Cause Data Hazards

add r1,r2,r3AL U

IM Reg DM Reg

sub r4,r1,r5AL

U IM Reg DM Reg

and r6,r1,r7AL U

IM Reg DM Reg

or r8, r1, r9

AL

U

IM Reg DM Reg

xor r4,r1,r5AL U

IM Reg DM Reg

Dependencies backward in time cause hazards

Data hazard

O W “Fi ” D H d



One Way to “Fix” a Data Hazard

I

n

s

t

r.

O

r

d

e

r

add r1,r2,r3AL U

IM Reg DM Reg

sub r4,r1,r5

and r6,r1,r7

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

stall

stall

Can fix data hazard

by waiting –

stall –

but affects

throughput

Loads Can Cause Data Hazards



Loads Can Cause Data Hazards

I

n

s

t

r.

O

r

d

e

r

lw r1,100(r2)

sub r4,r1,r5

and r6,r1,r7

xor r4,r1,r5

or r8, r1, r9

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg


Stores Can Cause Data Hazards



Stores Can Cause Data Hazards

I

n

s

t

r.

O

r

d

e

r

add r1,r2,r3

sw r1,100(r5)

and r6,r1,r7

xor r4,r1,r5

or r8, r1, r9

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg

AL U

IM Reg DM Reg


Pipeline Changes to



Pipeline Changes to

accommodate Forwarding To avoid slowing down throughput, we need to add a hardware

that detects data hazards. We call this the forwarding unit.

Data needs to be forwarded to the ALU when a data hazard isdetected. Thus the forwarding unit controls forwarding data

through additional multiplexing at the ALU input.

This logic unit needs input from the three pipeline registers.

It also needs to detect if the RegWrite control signal is asserted

– so it needs input from the control lines also.

No forwarding if EX/MEM.RegisterRd=$0 and

MEM/WB.RegisterRd=$0

Pipeline Changes to accommodate Forwarding




It needs to detect one of four cases of data hazards:

if (EX/MEM.RegWrite

and (EX/MEM.RegisterRd 0

and (EX/MEM.RegisterRd=ID/EX.RegisterRs) Forward

similarly

if (EX/MEM.RegWrite

and (EX/MEM.RegisterRd 0

and (EX/MEM.RegisterRd=ID/EX.RegisterRt) Forward





similarly

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd 0

and (MEM/WB.RegisterRd=ID/EX.RegisterRs) Forward

similarly

if (MEM/WB.RegWrite

and (MEM/WB.RegisterRd 0

and (MEM/WB.RegisterRd=ID/EX.RegisterRt) Forward

Using Pipeline Registers to solve data hazards



g p g



22

0

1

2

0

1

2

ForwardA 00 ID/EX input to ALU1 - no fwd

01 MEM/WB input to ALU1

10 EX/MEM input to ALU1

ForwardB 00 ID/EX input to ALU2

01 MEM/WB input to ALU2

10 EX/MEM input to ALU2

ForwardB 11 sign extension

input to ALU2

OR add another multiplexer

ALUSrc

0

1

Forwarding Pipeline Example



Forwarding Pipeline Example

How does the dependent instruction sequence execute in

a pipeline with support for forwarding? before <4>

before <3>

before <2>

before <1>sub $2, $1, $3

and $4, $2, $5

or $4, $4, $2

add $9, $4, $2

after <1>

after <2>

…

Forw. Pipeline Example - before <2> completes



p p p

Forw. Pipeline Example - before <1> completes



EX/MEM.RegisterRd=ID/EX.RegisterRs

Use this value of $2 not the one fetched

from register file

EX/MEM.RegWrite isasserted

Forw. Pipeline Example - sub completes



p

EX/MEM.RegisterRd=ID/EX.RegisterRs MEM/WB.RegisterRd=ID/EX.RegisterRt

Both $4 and $2 are forwardedEX/MEM.RegWrite isasserted

MEM/WB.RegWritis asserted

Forwarding Pipeline Example - and completes



EX/MEM.RegisterRd=ID/EX.RegisterRs

EX/MEM.RegWrite isasserted

Use this value of $4 not the one fetched

from register file



Example (corrected)



p ( )

Example (corrected)



p ( )

Pipeline Changes to accommodate Stalls Forwarding does not work when an instruction following a lw



Forwarding does not work when an instruction following a lw tries to read the value from the destination register of lw

lw $2 , 20($1)

and $4, $2 , $5or $8, $2 , $6

The pipeline needs to be stalled, and data forwarded from the

MEM/WB pipeline register

Forwarding does not work

How Stalls are insertedStalls happen in the EX stage such that the subsequent



Stalls happen in the EX stage, such that the subsequent

two instructions in the pipeline both repeat what they

were doing for one cycle This allows forwarding to work

Stall in CC4 and and or repeat what they did in

CC3

OR is fetched in CC3 but it stays in

ID stage for another cycle

Pipeline Changes to accommodate Stalls



We need a logic unit which detects hazards and then stalls.

The hazard detection unit operates in the instruction

decode stage, and tests to see if the instruction is a load (ifID/EX.MemRead control line is asserted)

Then it checks if either of the source registers of the instructioncurrently being decoded is the same as the target/destinationregister of the lw being executed (that is ifID/EX.RegisterRt=IF/ID.RegisterRs or

ID/ EX.RegisterRt= IF/ID.RegisterRt)

During stalling the PC is prevented from incrementing and theinstruction in the IF/ID pipeline register is preserved. Need

additional control lines for the IF/ID register and for the PC. The bubble is inserted by setting the pipelined control signals in

the ID/EX pipeline register to 0. So we need a way to change thevalues of the control lines.

Pipeline Changes to do Hazard Detection



I n s t r u c t i o n s o u r c e r e g i s t e r s

I D / E X . R e g i s t e r R

t

Pipeline Changes to do Hazard Detection



I F / I D w

r i t e

P C w r i t e

Stall by 0-ing all 9 control lines

Pipeline stalling example before<3> completes



Pipeline stalling example before<1> completesPCWrite



0

0

Bubble inserted

Registers continue to be read

IF/IDWrite

is asserted

PCWrite

is asserted

Pipeline stalling example lw completes



Forwarding unit sets ALUsrc multiplexer to use value from WB register

Pipeline stalling example bubble completes



Forwarding unit sets ALUsrc multiplexer to use value from EX/MEM register

Consider executing the following code

Example



Consider executing the following codeadd $5, $6, $7lw $6 , 100($7)sub $7, $6 , $8

How many cycles will it take to execute the code? Draw a diagram thatillustrates the dependencies that need to be resolved

CC 7 CC 8

add $5,$6,$7

lw $6,100($7)

sub $7, $6, $8



6

2

3 Branch decision in MEM stage

Pipeline Changes to accommodate Control Hazards

Control hazards are due to branch hazards and to exceptions (I/O



interrupts, requests from the OS, overflow, or an unknowninstruction).

A branch hazard occurs less frequently than data hazards, andis detected in the MEM stage of the pipeline.

Assume branch not taken, the three instructions following a branchthat is taken will be in the pipeline, and need to be flushed.

40 beq $1,$3,7

44 and $12,$2,$5

48 or $13,$6,$2

52 add $14,$2,$2

56 …

72 lw $4,50($7)

branch detected CC4

Pipeline Changes to accommodate Branch Hazards

The pipeline throughput can be improved by moving the



The pipeline throughput can be improved by moving thedecision whether the branch is taken or not to the

Decode stage of the pipeline; Then if the branch is taken, only one instructionneeds to be flushed (discarded) - the instructionimmediately after the branch instruction.

Thus we need a new logic circuit which compares thecontents of the register file outputs; Since the decision is taken in the decode stage, the

branch address needs to be computed in the decode phase too, in case the branch is to be taken

Thus we need a new adder in the decode phase, aswell as add an IF Flush control line to flush theIF/ID pipeline register.


Branch



Compute branch address Check for equality

S w i t c h t o b r a n c h a

d d r e s s

Pipelined branch example <before 2> completes



2

Branch

I F F l u s h

P C - r e l a t i v e b r a n c h 4 0 + 4 + 7 *

4 = 7 2

Flushing means instruction field is 0s

Pipelined branch example <before 1> completes



3


The above scheme will fail if we have the following



The above scheme will fail if we have the following

series of instructions:

36 add $1, $6, $7

40 beq $1, $3, 28

44 and $12, $2, $5

… 72 lw…

Because the correct value of register $1 is not in the

decode stage (in the register file) at the time when the

comparator needs it Pipeline needs to be stalled and the value of $1 needs to

be forwarded from EX/Mem pipeline register




36 add $1, $6, $7

40 beq $1, $3, 28

44 and $12, $2, $5

…

72 lw…




Stall

36 add $1, $6, $7

40 beq $1, $3, 28

…

72 lw…

flush




How can the following code be modified to make use of a delayed

Example



branch slot?:Loop: lw $2, 100($3)

addi $3, $3, 4 beq $3, $4, Loop

We cannot put addi after the beq since it modifies register $3

We cannot just put lw after the beq since register $3 had changed

First we re-write the code asLoop: addi $3, $3, 4

lw $2, 96($3) beq $3, $4, Loop

Then we can move the lw after the beq

Loop: addi $3, $3, 4 beq $3, $4, Looplw $2, 96($3)

How can the following code be modified to make use of a delayed

Example



branch slot?:Loop: lw $2, 100($3)

addi $3, $3, 4 beq $3, $4, Loop

We cannot put addi after the beq since it modifies register $3

We cannot just put lw after the beq since register $3 had changed

First we re-write the code asLoop: addi $3, $3, 4

lw $2, 96($3) beq $3, $4, Loop

Then we can move the lw after the beq

Loop: addi $3, $3, 4 beq $3, $4, Looplw $2, 96($3)

Consider the pipelined datapath that does not accommodate branch

Example 2



Consider the pipelined datapath that does not accommodate branch

hazards. Can an attempt to flush and an attempt to stall occur

simultaneously? You may want to consider the following codesequence to help you answer this question: beq $1, $2, TARGET #assume the branch

is takenlw $3, 40($4)

add $3, $3, $3sw $3, 40($4)TARGET: or $10,$11, $12

If the beq resolution is in the MEM stage, and the branch is taken,

it requires a flush of the IF/ID pipeline register (means the registerneeds to be written to) and a change of the PC to the branch

address; this happens in clock cycle 4.

At the same time a hazard is detected between lw and the

Example 2 - continued



At the same time a hazard is detected between lw and thenext instruction (add ) which is dependent (due to $3

used as source register). Thus the hazard detection unitissues a stall, and requests that the PC and the IF/IDregisters not be written to. The answer is YES, a flushand a stall are issued simultaneously.

If there are any conflicting actions, which should take priority?

Flush should take priority

Is there a simple change you can make to the datapath to

ensure the necessary priority?

The hazard detection unit should be changed to see the

Example 2- continued



The hazard detection unit should be changed to see theRegWrite signal in the execution stage after it goes

through the MUX used to flush the pipelineRegWrite

Dynamic branch prediction The static branch “predicts” that it will not be taken and



The static branch predicts that it will not be taken and

then flush if it was taken works for simple pipelines, but

is wasteful for performance for aggressive pipeliningarchitecture (such as the multiple issue of Pentium IV).

One approach is to have a branch prediction buffer (a

small memory unit indexed by the lower portion of the

address in the branch instruction). It contains a bit that

says if the branch was recently taken or not.

The value of the prediction bit is inverted if the

prediction turned out to be wrong. When the branch is

almost always taken, this 1-bit predictor will predict

wrong twice (at the start and end of the run of branches).

Dynamic branch prediction A better approach is to use a two-bit scheme, which must be wrong twice to

h th di ti f di ti



change the direction of prediction. The branch prediction is stored in a special buffer which is accessed with the beq instruction in the IF stage. If the beq is predicted as taken, then fetching begins from the target once beq is in ID.

Not Taken

Taken

Not Taken

Taken

Not TakenTaken

Further optimization with a global predictor taking into consideration the global behavior of recently executed branches. Each branch has two predictors, andtournament predictor keeps track and favors the one that was more accurate.

Dynamic branch prediction with compiler optimization

Furthermore, compilers place instructions that always execute in the“dela spot”



Best choice For mostly taken branches“delay spot”

Pipeline Changes to accommodate Exceptions Overflow is discovered at the end of the execute stage when the

ALU d i l h l i



8000180

ALU sends a signal to the control unit.

Following notification of an overflow the control unit has to flush

the two instructions that followed the one causing the overflow.

These instructions are now in the IF and ID stages of the pipeline.

Thus we add an input to the MUX in the ID stage that 0s the

control signals using an ID.Flush signal

IF.Flush

ID.Flush

Overflow

Pipeline Changes to accommodate Exceptions

The instruction that causes the overflow (which is detected in the



The instruction that causes the overflow (which is detected in theEX stage) needs to be flushed from the pipeline. This means that

an EX.Flush signal needs to be sent to two multiplexers to zerothe control signals for the last two stages of the pipeline.

Overflow is only one of the many possible exception causes. Thecause is stored in a Cause register below:

4 – address error exception (load)

5- address error exception (store) 10 – unknown instructions or reserved instruction

12 – arithmetic overflow

15 – floating point exception

Pipeline Changes to accommodate Exceptions An additional input is added to the PC MUX that sends to the PC



8000 0180hex (system reserved memory address for overflow)

The address of the instruction following the offending command issaved in the Exception Program Counter (EPC) registerand the cause in the Cause Register.

If there are multiple exceptions, their causes are stored in the causeregister, such that hardware can interrupt based on later exceptions

once the earliest exception has been serviced. In case of an I/o interrupt, the execution jumps to the system routine

needed to deal with the I/o, followed by a return to the address storedin the EPC for program completion.

The OS responds to an exception either by terminating the processthat caused the exception or by performing some action.

The process who’s exception is due to an unimplemented instructionis killed by the OS.

Branch

Pipeline Changes to accommodate Overflow

EX.Flush



0

0

EX.Flush

80000180

Overflow

Branch

Pipeline Changes to accommodate Unknown Instruction

EX.Flush (LOW)



0

0

( )

80000180

Pipelined exception example: and completes



50 add causes an overflow

Overflow

54

80000180

Pipelined exception example or completesOS instruction fetched



80000180

80000180

80000184

80000184



Static Multiple Issue

U d i b dd d d VLIW



Used in embedded processors and VLIW processors

Can improve performance by up to 200%

Layout is restricted to simplify the decoding and instruction issue

Instructions are issued in pairs, aligned on a 64-bit boundary with

the ALU and branch portion operating first;

If one of the instruction of the pair cannot be used, it is replaced by ano-op.

The hardware detects data hazards and generates stalls between two

issue packets, but the compiler is required to avoid all dependencies

within the instruction pair.

A load will cause the next two instructions to stall if they were to

use the loaded word.

CC 7 CC 8



add …

lw

beq …

sw

sub …

lw …

Static two-issue datapath We need two output ports for Instruction memory, two more read

and one more write ports for the Register file two ALUs (one



and one more write ports for the Register file, two ALUs (onehandles address computation for Data memory access), and two

sign-extending units



AMD Opteron X4 12-stage pipeline

Speculative pipeline that executes

3 instructions/clock cycle



Speculative pipeline that executes 3 instructions/clock cycle

Register renaming removes anti-

dependencies. In case of incorrect

speculation, the mapping between

architectural and physical registers is

undone.

Memory address calculation

Actual memory access

Intel Core pipeline



Each core can execute 4

instructions

simultaneously

A Core duo can execute 8

instructions

simultaneouslyBetter branch prediction

Enhanced ALU

Less power consumption

Processor Pipelining

Documents

Transcript of Processor Pipelining