Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

64
Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz

Transcript of Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Page 1: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline1

Computer Structure

Pipeline

Lihu Rappoport and Adi Yoaz

Page 2: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline2

A Basic Processor

Memory Writeback

ExecuteDecodeFetch

IP

DataCache

Register File

Control signals

AL

UInst.

src1

Inst.Cache

address

Sign

Ext.

Mem Rd/Wr

ALU control

decode

src1data

src2data

src2

dst

datadst reg

ALU source

imm

src2

src1

Write-back value

imm

opcode

src1 reg

src2 reg

Page 3: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline3

Pipelined Car Assembly

chassis engine finish

1 hour 2 hours 1 hour

Car 1Car 2Car 3

Page 4: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline4

DataAccess

DataAccess

DataAccess

DataAccess

DataAccess

Pipelining Instructions

Ideal speedup is number of stages in the pipeline. Do we achieve this?

2 4 6 8 10 12 14 16 18

2 4 6 8 10 12 14

. ..

InstFetch

Reg ALU Reg

InstFetch

Reg ALU Reg

InstFetch

InstFetch

Reg ALU Reg

InstFetch

Reg ALU Reg

InstFetch

Reg ALU Reg

2 ns

2 ns

2 ns 2 ns 2 ns 2 ns 2 ns

8 ns

8 ns

8 ns

Time

Programexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

Time

Programexecutionorder

lw R1, 100(R0)

lw R2, 200(R0)

lw R3, 300(R0)

Page 5: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline5

Pipelining Pipelining does not reduce the latency of single task,

it increases the throughput of entire workload

Potential speedup = Number of pipe stages Pipeline rate is limited by the slowest pipeline stage

Þ Partition the pipe to many pipe stagesÞ Make the longest pipe stage to be as short as possible Þ Balance the work in the pipe stages

Pipeline adds overhead (e.g., latches) Time to “fill” pipeline and time to “drain” it reduces speedup Stall for dependencies

Þ Too many pipe-stages start to loose performance

IPC of an ideal pipelined machine is 1 Every clock one instruction finishes

Page 6: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline6

Pipelined CPU

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

dst

Page 7: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline7

Structural Hazard

Different instructions using the same resource at the same time Register File:

Accessed in 2 stages: Read during stage 2 (ID) Write during stage 5 (WB)

Solution: 2 read ports, 1 write port

Memory Accessed in 2 stages:

Instruction Fetch during stage 1 (IF) Data read/write during stage 4 (MEM)

Solution: separate instruction cache and data cache

Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all

instructions

IM Reg

IM Reg

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

Reg

DM

Page 8: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline8

Pipeline Example: cycle 1

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

0

dst

Page 9: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline9

Pipeline Example: cycle 2

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

lw4

dst

Page 10: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline10

Pipeline Example: cycle 3

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

lw

8sub

[R1]

9

10dst

Page 11: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline11

Pipeline Example: cycle 4

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

lw

12and

[R2]

[R1]+9

sub

[R3]

1011dst

Page 12: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline12

Pipeline Example: cycle 5

0 lw R10,9(R1) 4 sub R11,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

lw

16or

[R4]

sub

[R5]

1112

and

10

[R2]-[R3] M[[R1]+9]

dst

Page 13: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline13

RAW Dependency

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

DECFetch MEMEXE WB

DECFetch MEMEXE WB

DECFetch MEMEXE WB

DECFetch MEMEXE WB

DECFetch MEMEXE WB

Page 14: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline14

Using Bypass to Solve RAW Dependency

sub R2, R1, R3

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

DECFetch MEMEXE WB

DECFetch MEMEXE WB

DECFetch MEMEXE WB

DECFetch MEMEXE WB

DECFetch MEMEXE WB

Bypass result directly from EXE output to EXE input

Page 15: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline15

RAW Dependency

0 lw R4,9(R1) 4 sub R5,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

lw

16or

[R4]

sub

[R5]

R5

and

R4

[R2]-[R3] M[[R1]+9]

dst

Page 16: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline16

Forwarding Hardware

0 lw R4,9(R1) 4 sub R5,R2,R3 8 and R12,R4,R512 or R13,R6,R7

IP

DataCache

Register File

Control

signals

AL

U

Decode Execute Memory

Inst

ruct

ion

src1

Fetch WB

Inst.Cache

Mem Rd

address

Sign

Ext.

Mem Wr

ALU Control

decode

src1data

src2data

ALU source

+4

src2

dst

data

L1 L2 L3 L4

dst

Bypass Control

src2

dst t-

1

src1

dst t-

2

2

1

lw

16or

[R4]

sub

[R5]

R5

and

R4

[R2]-[R3] M[[R1]+9]

R4 R5 R5 R4

Page 17: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline17

Forwarding Control

Forwarding from EXE (L3) if (L3.RegWrite and (L3.dst == L2.src1)) ALUSelA = 1

if (L3.RegWrite and (L3.dst == L2.src2)) ALUSelB = 1

Forwarding from MEM (L4) if (L4.RegWrite and ((not L3.RegWrite) or (L3.dst L2.src1)) and (L4.dst = L2.src1)) ALUSelA = 2

if (L4.RegWrite and ((not L3.RegWrite) or (L3.dst L2.src2)) and (L4.dst = L2.src2)) ALUSelB = 2

Page 18: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline18

Register File Split

Register file is written during first half of the cycle Register file is read during second half of the cycleÞ Register file is written before it is read returns the correct

data

sub R2, R1, R3

xxx

xxx

and R12,R2,R11

IM Reg

IM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

DMReg

Reg

Page 19: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline19

Load word can still causes a hazard: an instruction tries to read a register following a load

instruction that writes to the same register

Þ A hazard detection unit is needed to “stall” the load instruction

Can't Always Forward

Reg

IM

Reg

Reg

IM

IM Reg DM Reg

IM DM Reg

IM DM Reg

DM Reg

Reg

Reg

DM

1 2 3 4 5 6Time (clock cycles)

7 8 9Programexecutionorder

lw R2, 30(R1)

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Page 20: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline20

De-assert the enable to the L1 latch, and to the IP The dependent instruction (and) stays another cycle in L1

Issue a NOP into the L2 latch (instead of the stalled inst.) Allow the stalling instruction (lw) to move on

Reg

IM

Reg

Reg

IM DM

IM Reg DM RegIM

IM DM Reg

IM DM Reg

DM Reg

RegReg

Reg

bubble

1 2 3 4 5 6Time (clock cycles)

7 8 9Programexecutionorder

lw R2, 30(R1)

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Stall If Cannot Forwardif (L2.RegWrite and (L2.opcode == lw) and

( (L2.dst == L1.src1) or (L2.dst == L1.src2) ) then stall

Page 21: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline21

Example: code for (assume all variables are in memory):a = b + c;

d = e – f;Slow code

LW Rb,bLW Rc,c

Stall ADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,f

Stall SUB Rd,Re,RfSW d,Rd

Instruction order can be changed as long as the correctness is kept

Software Scheduling to Avoid Load Hazards

Fast code

LW Rb,bLW Rc,cLW Re,e ADD Ra,Rb,RcLW Rf,fSW a,Ra SUB Rd,Re,RfSW d,Rd

Page 22: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline22

Control Hazards

Page 23: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline23

Control Hazard on BranchesDecode Execute MemoryFetch WB

0 or 4 jcc 48 (offset=40)

8 and12 mul16 sub

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

jcc target; if cond then IP IP + 4 + offset

Page 24: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline24

Control Hazard on Branches

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

Decode Execute MemoryFetch WB

8

8

jcc4

or

0 or 4 jcc 48 (offset=40)

8 and12 mul16 sub

jcc target; if cond then IP IP + 4 + offset

Page 25: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline25

Control Hazard on Branches

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

Decode Execute MemoryFetch WB

40

812

12

jccand or

0 or 4 jcc 48 (offset=40)

8 and12 mul16 sub

jcc target; if cond then IP IP + 4 + offset

Page 26: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline26

Control Hazard on Branches

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

Decode Execute MemoryFetch WB

cond

1216

16

jccandsw

48

or

0 or 4 jcc 48 (offset=40)

8 and12 mul16 sub

jcc target; if cond then IP IP + 4 + offset

Page 27: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline27

Control Hazard on Branches

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

Decode Execute MemoryFetch WB

16

50

jccandsw

20

sub

0 or 4 jcc 48 (offset=40)

8 and12 mul16 sub

jcc target; if cond then IP IP + 4 + offset

Page 28: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline28

Control Hazard on Branches

And

Beq

sub

sw

The 3 instructions following the branch get into the pipe even if the branch is taken

Inst from target

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

IM Reg DM Reg

PC

Page 29: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline29

Control Hazard: Stall

Stall pipe when branch is encountered until resolved

Stall impact: assumptions CPI = 1 20% of instructions are branches Stall 3 cycles on every branch

CPI new = 1 + 0.2 × 3 = 1.6

(CPI new = CPI Ideal + avg. stall cycles / instr.)

We loose 60% of the performance

Page 30: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline30

Control Hazard: Predict Not Taken

Execute instructions from the fall-through (not-taken) path As if there is no branch If the branch is not-taken (~50%), no penalty is paid

If branch actually taken Flush the fall-through path instructions before they change

the machine state (memory / registers) Fetch the instructions from the correct (taken) path

Assuming ~50% branches not taken on average

CPI new = 1 + (0.2 × 0.5) × 3 = 1.3

Page 31: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline31

Dynamic Branch Prediction Add a Branch Target Buffer (BTB) that predicts (at

fetch) Instruction is a branch Branch taken / not-taken Taken branch target

BTB allocated at execute – after all branch info is known

BTB is looked up at instruction fetch

IP of inst in fetch?=

Branch predicted taken or not taken

Inst predictedto be branch

Branch IP Target IP History

Predicted Target

Page 32: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline32

BTB Allocation

Allocate instructions identified as branches (after decode) Both conditional and unconditional branches are allocated

Not taken branches need not be allocated BTB miss implicitly predicts not-taken

Prediction BTB lookup is done parallel to IC lookup BTB provides

Indication that the instruction is a branch (BTB hits) Branch predicted target Branch predicted direction Branch predicted type (e.g., conditional, unconditional)

Update (when branch outcome is known) Branch target Branch history (taken / not-taken)

Page 33: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline33

BTB (cont.)

Wrong prediction Predict not-taken, actual taken Predict taken, actual not-taken, or actual taken but wrong

target

In case of wrong prediction – flush the pipeline Reset latches (same as making all instructions to be NOPs) Select the PC source to be from the correct path

Need get the fall-through with the branch Start fetching instruction from correct path

Assuming P% correct prediction rate

CPI new = 1 + (0.2 × (1-P)) × 3 For example, if P=0.7

CPI new = 1 + (0.2 × 0.3) × 3 = 1.18

Page 34: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline34

Adding a BTB to the PipelineDecode Execute MemoryFetch WB

calc. target

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

BTB

Next seq. address

Predicted target

Repair target

Flush and Repair

Predicted direction

target

direction

≠allocate

45050

Lookup current IP in I$ and in BTB in parallel

I$ provides the instruction bytes

jcc

BTB provides predicted target and direction

0 or 4 jcc 50 8 and ...50 sub54 mul58 add

or

takentaken

8

Page 35: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline35

Adding a BTB to the PipelineDecode Execute MemoryFetch WB

calc. target

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

BTB

Next seq. address

Predicted target

Repair target

Flush and Repair

Predicted direction

target

direction

≠allocate

50

taken8

jcc

sub

5054

0 or 4 jcc 50 8 and ...50 sub54 mul58 add

or

42

Page 36: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline36

Adding a BTB to the PipelineDecode Execute MemoryFetch WB

calc. target

IP

DataCache

Register File

AL

UInst

ruct

ion

src1

Inst.Cache

address

Sign

Ext.

src1data

src2data

+4

+

src2

dst

data

BTB

Next seq. address

Predicted target

Repair target

Flush and Repair

Predicted direction

target

direction

≠allocate

orjcc

50

taken

0 or 4 jcc 50 8 and ...50 sub54 mul58 add

5458

sub

mul

850

taken

Verify direction

Verify target (if taken)

Issue flush in case of mismatchAlong with

the repair IP

Page 37: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline37

Using The BTBPC moves to next instruction

Inst Mem gets PCand fetches new inst

BTB gets PCand looks it up

IF/ID latch loadedwith new inst

BTB Hit ?

Br taken ?

PC PC + 4PC perd addr

IF

IDIF/ID latch loaded

with pred instIF/ID latch loaded

with seq. instBranch ?

yes no

noyes

noyesEXE

Page 38: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline38

Using The BTB (cont.)ID

EXE

MEM

WB

Branch ?

Calculate brcond & trgt

Flush pipe &update PC

Corect pred ?

yes no

IF/ID latch loadedwith correct inst

continue

Update BTB

yes no

continue

Page 39: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline39

Backup

Page 40: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline40

· R-type (register insts)

· I-type (Load, Store, Branch, inst’s w/imm data)· J-type (Jump)

op: operation of the instructionrs, rt, rd: the source and destination register specifiersshamt: shift amountfunct: selects the variant of the operation in the “op” fieldaddress / immediate: address offset or immediate valuetarget address: target address of the jump instruction

op target address

02631

6 bits 26 bits

op rs rt rd shamt funct

061116212631

6 bits 6 bits5 bits5 bits5 bits5 bits

op rs rt immediate016212631

6 bits 16 bits5 bits5 bits

MIPS Instruction Formats

Page 41: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline41

Each memory location is 8 bit = 1 byte wide has an address

We assume 32 byte address An address space of 232 bytes

Memory stores both instructions and data Each instruction is 32 bit wide

stored in 4 consecutive bytes in memory

Various data types have different width

The Memory Space1 byte

00000000000000010000000200000003000000040000000500000006

FFFFFFFAFFFFFFFBFFFFFFFCFFFFFFFDFFFFFFFEFFFFFFFF

Page 42: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline42

Register File

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

RegisterFile

32

32

32

5

5

5

The Register File holds 32 registers Each register is 32 bit wide The RF supports parallel

reading any two registers and writing any register

Inputs Read reg 1/2: #register whose value

will be output on Read data 1/2 RegWrite: write enable

Write reg (relevant when RegWrite=1) #register to which the value in Write data is written to

Write data (relevant when RegWrite=1) data written to Write reg

Outputs Read data 1/2: data read from Read reg 1/2

Page 43: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline43

Memory Components Inputs

Address: address of the memory location we wish to access

Read: read data from location Write: write data into location Write data (relevant when Write=1)

data to be written into specified location Outputs

Read data (relevant when Read=1) data read from specified location

Write

Address

WriteData

ReadDataMemory

Read

32

32

32

Cache Memory components are slow relative to the CPU

A cache is a fast memory which contains only small part of the memory Instruction cache stores parts of the memory space which hold code Data Cache stores parts of the memory space which hold data

Page 44: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline44

The Program Counter (PC)

Holds the address (in memory) of the next instruction to be executed

After each instruction, advanced to point to the next instruction If the current instruction is not a taken branch,

the next instruction resides right after the current instruction

PC PC + 4 If the current instruction is a taken branch,

the next instruction resides at the branch target

PC target (absolute jump)

PC PC + 4 + offset×4 (relative jump)

Page 45: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline45

Fetch Fetch instruction pointed by PC from I-Cache

Decode Decode instruction (generate control signals) Fetch operands from register file

Execute For a memory access: calculate effective address For an ALU operation: execute operation in ALU For a branch: calculate condition and target

Memory Access For load: read data from memory For store: write data into memory

Write Back Write result back to register file update program counter

Instruction Execution Stages

Instruction

Fetch

Instruction

Decode

Execute

Memory

Result

Store

Page 46: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline46

The MIPS CPU

Instruction Decode /

register fetch

Instruction fetch

Execute /address

calculation

Memoryaccess

Writeback

ALUSrcALU

ALU res

Zero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4

Add

PC

InstructionCache

Address

Instruction

0

1

mux

MemRead

MemWrite

Address

WriteData

ReadData

DataCache

Branch

PCSrc

MemtoReg

1

0

mux

RegDst

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e

[20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

Co

ntr

ol

Page 47: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline47

Executing an Add Instruction

ALUSrc=0ALU

ALU res

Zero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4

Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead=0

MemWrite=0

Address

WriteData

ReadData

DataMemory

Branch=0

PCSrc=0

MemtoReg=0

1

0

mux

RegDst=1

RegWrite=1

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R3

R5

35

R3 + R5

+

[PC]+4

2

op rs rt rd shamt funct 061116212631

3 5 2 0 0 = AddALU

Add R2, R3, R5 ; R2 R3+R5

ADD

Page 48: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline48

Executing a Load Instruction

op rs rt immediate 016212631

2 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

ALUSrc=1ALU

ALU res

Zero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4

Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead=1

MemWrite=0

Address

WriteData

ReadData

DataMemory

Branch=0

PCSrc=0

MemtoReg=11

0

mux

RegDst=0

RegWrite=1

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R2

30

2

R2+30+

[PC]+4

1

Mem[R2+30]

LW

Page 49: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline49

Executing a Store Instruction

op rs rt immediate 016212631

2 1 30SW

SW R1, (30)R2 ; Mem[R2+30] R1

ALUSrc = 1ALU

ALU res

Zero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4

Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead

MemWrite = 1

Address

WriteData

ReadData

DataMemory

Branch=0

PCSrc = 0

MemtoReg =1

0

mux

RegDst =

RegWrite = 0

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R2

30

2

R2+30+

[PC]+4

1SW

R1

Page 50: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline50

Executing a BEQ InstructionBEQ R4, R5, 27 ; if (R4-R5=0) then PC PC+4+SignExt(27)*4 ; else PC PC+4

op rs rt immediate 016212631

4 5 27BEQ

ALUSrc=0ALU

ALU res

Zero

AddShift left 2

0

1

mux

ALUControl

ALUOp

Signextend

16 32

Instruction

4

Add

PC

InstructionMemory

Address

Instruction

0

1

mux

MemRead=0

MemWrite=0

Address

WriteData

ReadData

DataMemory

Branch=1

PCSrc= (R4 – R5 == 0)

MemtoReg= 1

0

mux

RegDst=

RegWrite=0

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e [20-16]

[25-21]

0

1

mux [15-11]

[15-0]

[5-0]

R4

R5

45

R4-R5-

PC+4

27

ADD

PC+4orPC+4+SignExt(27)*4

Page 51: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline51

Control Signals

add sub ori lw sw beq jump

RegDst

ALUSrc

MemtoReg

RegWrite

MemWrite

Branch

Jump

ALUctr<2:0>

1

0

0

1

0

0

0

Add

1

0

0

1

0

0

0

Subtract

0

1

0

1

0

0

0

Or

0

1

1

1

0

0

0

Add

x

1

x

0

1

0

0

Add

x

0

x

0

0

1

0

Subtract

x

x

x

0

0

x

1

xxx

func

op 00 0000 00 0000 00 1101 10 0011 10 1011 00 0100 00 0010

10 0000 10 0010 Don’t Care

Page 52: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline52

Pipelined CPU: Load (cycle 1 – Fetch)

ALUSrc

6

ALUresult

Zero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e

[15-0]

[20-16]

[15-11]

Signextend

16 32

L2 L3 L4

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction lw

op rs rt immediate 016212631

2 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

PC+4

Page 53: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline53

Pipelined CPU: Load (cycle 2 – Dec)

ALUSrc

6

ALUresult

Zero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e

[15-0]

[20-16]

[15-11]

Signextend

16 32

L2 L3 L4

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 016212631

2 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

PC+4

R2

30

Page 54: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline54

Pipelined CPU: Load (cycle 3 – Exe)

ALUSrc

6

ALUresult

Zero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e

[15-0]

[20-16]

[15-11]

Signextend

16 32

L2 L3 L4

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 016212631

2 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

R2+30

Page 55: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline55

Pipelined CPU: Load (cycle 4 – Mem)

ALUSrc

6

ALUresult

Zero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e

[15-0]

[20-16]

[15-11]

Signextend

16 32

L2 L3 L4

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 016212631

2 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

D

Page 56: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline56

Pipelined CPU: Load (cycle 5 – WB)

ALUSrc

6

ALUresult

Zero

Add result

AddShift left 2

ALUControl

ALUOp

RegDst

RegWrite

Readreg 1

Readreg 2

Writereg

Writedata

Readdata 1

Readdata 2

Re

gis

ter

Fil

e

[15-0]

[20-16]

[15-11]

Signextend

16 32

L2 L3 L4

Inst

ruct

ion

MemRead

MemWrite

Address

WriteData

ReadData

DataMemory

Branch

PCSrc

MemtoReg

4

InstructionMemory

Address

Add

IF/ID

PC

0

1

mux

0

1

mux

0

1

mux

0

1

mux

Instruction

op rs rt immediate 016212631

2 1 30LW

LW R1, (30)R2 ; R1 Mem[R2+30]

D

Page 57: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline57

PC

Instructionmemory

Inst

ruct

ion

Add

Instruction[20– 16]

Me

mto

Re

g

ALUOp

Branch

RegDst

ALUSrc

4

16 32Instruction[15– 0]

0

0

Mux

0

1

Add Addresult

RegistersWriteregister

Writedata

Readdata 1

Readdata 2

Readregister 1

Readregister 2

Signextend

Mux

1

ALUresult

Zero

Writedata

Readdata

Mux

1

ALUcontrol

Shiftleft 2

Re

gWrit

e

MemRead

Control

ALU

Instruction[15– 11]

6

EX

M

WB

M

WB

WBIF/ID

PCSrc

ID/EX

EX/MEM

MEM/WB

Mux

0

1

Me

mW

rite

AddressData

memory

Address

Datapath with Control

Page 58: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline58

Multi-Cycle Control

Pass control signals along just like the data

Execution/Address Calculation stage control lines

Memory access stage control lines

Write-back stage control

lines

InstructionReg Dst

ALU Op1

ALU Op0

ALU Src Branch

Mem Read

Mem Write

Reg write

Mem to Reg

R-format 1 1 0 0 0 0 0 1 0lw 0 0 0 1 0 1 0 1 1sw X 0 0 1 0 0 1 0 Xbeq X 0 1 0 1 0 0 0 X

Control

EX

M

WB

M

WB

WB

IF/ID ID/EX EX/MEM MEM/WB

Instruction

Page 59: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline59

Five Execution Steps Instruction Fetch

Use PC to get instruction and put it in the Instruction Register. Increment the PC by 4 and put the result back in the PC.

IR = Memory[PC];PC = PC + 4;

Instruction Decode and Register Fetch Read registers rs and rt Compute the branch address

A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC + (sign-extend(IR[15-0]) << 2);

We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)

Page 60: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline60

Five Execution Steps (cont.)• Execution

ALU is performing one of three functions, based on instruction type: Memory Reference: effective address calculation.

ALUOut = A + sign-extend(IR[15-0]); R-type:

ALUOut = A op B; Branch:

if (A==B) PC = ALUOut;

• Memory Access or R-type instruction completion

• Write-back step

Page 61: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline61

The Store Instruction

sw rt, rs, imm16

mem[PC] Fetch the instruction from memory

Addr <- R[rs] + SignExt(imm16)

Calculate the memory addressMem[Addr] <- R[rt] Store the register into memoryPC <- PC + 4 Calculate the next instruction’s

address

op rs rt immediate

016212631

6 bits 16 bits5 bits5 bits

Mem[Rs + SignExt[imm16]] <- Rt Example: sw rt, rs, imm16

Page 62: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline62

RAW Hazard: SW Solution

IM Reg

CC 1 CC 2 CC 3 CC 4 CC 5 CC 6Time (clock cycles)

CC 7 CC 8 CC 910 10/–20Value of R2

DM Reg

IM DM RegReg

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

sub R2, R1, R3

NOP

NOP

NOP

and R12,R2, R5

or R13,R6, R2

add R14,R2, R2

sw R15,100(R2)

Programexecutionorder

10 10 10 -20 -20 -20 -20

IM Reg

IM Reg DM Reg

IM DM Reg

Reg

Reg

DM

• Have compiler avoid hazards by adding NOP instructions

• Problem: this really slows us down!

Page 63: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline63

Delayed Branch

Define branch to take place AFTER n following instruction HW executes n instructions following the branch

regardless of branch is taken or not SW puts in the n slots following the branch instructions

that need to be executed regardless of branch resolution Instructions that are before the branch instruction, or Instructions from the converged path after the branch

If cannot find independent instructions, put NOP

Original Coder3 = 23R4 = R3+R5If (r1==r2) goto xR1 = R4 + R5X: R7 = R1

New CodeIf (r1==r2) goto x r3 = 23R4 = R3 +R5NOPR1 = R4 + R5X: R7 = R1

Page 64: Computer Structure 2014 – Pipeline 1 Computer Structure Pipeline Lihu Rappoport and Adi Yoaz.

Computer Structure 2014 – Pipeline64

Delayed Branch Performance Filling 1 delay slot is easy, 2 is hard, 3 is harder Assuming we can effectively fill d% of the delayed slots

CPInew = 1 + 0.2 × (3 × (1-d))

For example, for d=0.5, we get CPInew = 1.3 Mixing architecture with micro-arch

New generations requires more delay slots Cause computability issues between generations