
Pipelining (Chapter 8)
TU-Delft TI1400/11-PDS
http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_10.ppt


Basic idea (1)

[Diagram: sequential execution of I1–I4 over time, each instruction as Fi followed by Ei; below it, a hardware sketch of an instruction fetch unit feeding an execution unit through buffer B1.]


Basic idea (2): Overlap

[Diagram: pipelined execution of I1–I4 over clock cycles 1–5: while Ii executes (Ei), Ii+1 is already being fetched (Fi+1), so fetch and execution overlap.]


Instruction phases

• F Fetch instruction

• D Decode instruction and fetch operands

• O Perform operation

• W Write result
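The four phases above can be turned into a tiny schedule calculator. This is an illustrative Python sketch (not part of the slides); it assumes an ideal pipeline with no stalls, where instruction i enters F in cycle i:

```python
# Which stage each instruction occupies in each clock cycle of an
# ideal, stall-free four-stage pipeline.
STAGES = ["F", "D", "O", "W"]

def schedule(n_instructions):
    """Return {cycle: [(instruction, stage), ...]}."""
    table = {}
    for i in range(n_instructions):        # instruction i+1 enters F in cycle i+1
        for s, stage in enumerate(STAGES):
            table.setdefault(i + 1 + s, []).append((f"I{i+1}", stage))
    return table

for cycle, work in sorted(schedule(4).items()):
    print(f"cycle {cycle}: " + "  ".join(f"{name}:{st}" for name, st in work))
```

Running it reproduces the four-stage diagram that follows: in cycle 4, I1 writes back while I4 is being fetched.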


Four-stage pipeline

[Diagram: four-stage pipelined execution of I1–I4 over clock cycles 1–7; instruction Ii passes through Fi, Di, Oi, Wi in consecutive cycles, and from cycle 4 onward one instruction completes per cycle.]


Hardware organization (1)

[Diagram: four-stage hardware organization: fetch unit → B1 → decode-and-operand-fetch unit → B2 → operation unit → B3 → write unit.]


Hardware organization (2)

During cycle 4, the buffers contain:
• B1: instruction I3
• B2: the source operands of I2, the specification of the operation, and the specification of the destination operand
• B3: the result of the operation of I1 and the specification of the destination operand


Hardware organization (3)

[Diagram: the four-stage organization during cycle 4: B1 holds I3, B2 holds the operands and operation of I2, and B3 holds the result of I1.]


Pipeline stall (1)

• Pipeline stall: a delay in a stage of the pipeline caused by an instruction
• Reasons for a pipeline stall:
  - cache miss
  - long operation (for example, division)
  - dependency between successive instructions
  - branching
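How a stall in one instruction delays the ones behind it can be sketched with a small model. This is an assumption-laden Python sketch, not from the slides: an in-order four-stage pipeline, a stage stays occupied until its successor accepts the instruction, and only the fetch stage can take extra cycles (as in a cache miss):

```python
# Completion cycle of each instruction in an in-order F-D-O-W pipeline
# where a fetch may take several cycles (e.g. an instruction cache miss).
def completion_cycles(fetch_cost):
    """fetch_cost[i]: cycles instruction i spends in F; D, O, W take 1 each."""
    stage_free = [0, 0, 0, 0]              # cycle after which F, D, O, W are free
    finish = []
    for f_cost in fetch_cost:
        t = 0
        for s, cost in enumerate([f_cost, 1, 1, 1]):
            t = max(t, stage_free[s]) + cost   # wait for the stage, then occupy it
            stage_free[s] = t
        finish.append(t)
    return finish

print(completion_cycles([1, 1, 1]))   # no miss: I1..I3 finish in cycles 4, 5, 6
print(completion_cycles([1, 4, 1]))   # a 3-cycle miss in F2 delays I2 and I3
```

With no miss, I1–I3 complete in cycles 4, 5, 6; the miss in I2's fetch pushes every later completion back by the same three cycles.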


Pipeline stall (2): Cache miss

[Diagram: execution of I1–I3 over clock cycles 1–8; a cache miss in I2 stretches F2 over several cycles, pushing D2, O2, W2 and the whole of I3 later.]
Cache miss in I2.


Pipeline stall (3): Cache miss

[Diagram: the same cache miss viewed per stage: while F2 is repeated during the miss, the D, O, and W stages are each idle for three cycles.]
Effect of cache miss in F2.


Pipeline stall (4): Long operation

[Diagram: execution of I1–I4 over clock cycles 1–8; one instruction's O stage takes several cycles (a long operation such as a division), delaying the O and W stages of the instructions behind it.]


Pipeline stall (5): Dependencies

• The instructions
    ADD R1, 3(R1)
    ADD R4, 4(R1)
  cannot be executed in parallel: the second uses R1, which the first modifies.
• The instructions
    ADD R2, 3(R1)
    ADD R4, 4(R3)
  can be executed in parallel.
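The underlying check can be sketched as follows (illustrative Python; treating the first operand as the destination is an assumption about the instruction format):

```python
# Registers causing a conflict between two successive instructions:
# RAW (write1 then read2), WAW (write1 then write2), WAR (read1 then write2).
def hazard(writes1, reads1, writes2, reads2):
    return (writes1 & reads2) | (writes1 & writes2) | (reads1 & writes2)

# ADD R1,3(R1) then ADD R4,4(R1): the first updates R1, the second reads it.
print(hazard({"R1"}, {"R1"}, {"R4"}, {"R1"}))   # {'R1'}: must serialize
# ADD R2,3(R1) then ADD R4,4(R3): no shared registers, so they can overlap.
print(hazard({"R2"}, {"R1"}, {"R4"}, {"R3"}))   # set(): parallel is safe
```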


Pipeline stall (6): Branch

[Diagram: Ii is a branch; the target instruction Ik is only fetched (Fk) after the branch has been executed (Ei), so the pipeline stalls.]
Only start fetching instructions after the branch has been executed.


Data dependency (1): example

MUL R2,R3,R4 /* R4 destination */
ADD R5,R4,R6 /* R6 destination */

The new value of R4 must be available before the ADD instruction uses it.


Data dependency (2): example

[Diagram: MUL (I1) followed by ADD (I2) in the four-stage pipeline; D2 must wait for W1 to write R4, so I2 and the instructions behind it slip.]
Pipeline stall due to the data dependence between W1 and D2.


Branching: Instruction queue

[Diagram: the fetch unit fills an instruction queue, from which instructions are dispatched to the operation and write stages.]


Idling at branch

[Diagram: Ij is a branch; Ij+1 has already been fetched and is discarded; while the target Ik is fetched, the execution unit is idle for one cycle, after which Ik and Ik+1 proceed normally.]


Branch with instruction queue

[Diagram: with the instruction queue filled ahead, I1–I3 execute back to back; I4, fetched beyond the branch, is discarded, and the branch target Ij is fetched early enough that its execution follows without an idle cycle.]
Branch folding: the branch is handled (i.e., its target is computed) simultaneously with the execution of other instructions.


Delayed branch (1): reordering

Original (always lose a cycle on the branch):

  LOOP  Shift_left R1
        Decrement R2
        Branch_if>0 LOOP
  NEXT  Add R1,R3

Reordered (the Shift_left in the delay slot is always executed):

  LOOP  Decrement R2
        Branch_if>0 LOOP
        Shift_left R1
  NEXT  Add R1,R3


Delayed branch (2): execution timing

[Diagram: F/E timing of the reordered loop: Decrement, Branch, Shift, Decrement, Branch, Shift, Add; the Shift in the delay slot keeps the pipeline busy while the branch resolves.]


Branch prediction (1)

[Diagram: I1 (Compare) and I2 (Branch-if>) are followed by the speculatively fetched I3 and I4; when the prediction turns out wrong, E3 and D4 are cancelled (marked X) and fetching restarts at the correct target Ik.]
Effect of incorrect branch prediction.


Branch prediction (2)

A possible implementation:
- use a single bit per branch
- the bit records the previous choice of the branch
- the bit tells from which location to fetch the next instructions
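A minimal sketch of this scheme (illustrative Python; the predict-not-taken initial value and the per-address table are assumptions):

```python
# One bit per branch: predict the same outcome as last time.
class OneBitPredictor:
    def __init__(self):
        self.last = {}                     # branch address -> last outcome

    def predict(self, pc):
        return self.last.get(pc, False)    # assumption: predict not-taken at first

    def update(self, pc, outcome):
        self.last[pc] = outcome            # the bit simply records the outcome

p = OneBitPredictor()
hits = 0
for outcome in [True, True, True, False]:  # a loop branch: taken 3x, then exit
    hits += (p.predict(0x40) == outcome)
    p.update(0x40, outcome)
print(hits, "of 4 predicted correctly")    # 2 of 4 on this short run
```

Note the characteristic weakness: a loop branch mispredicts twice per loop visit, once on entry and once on exit.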


Data paths of CPU (1)

[Diagram: the register file supplies Source 1 and Source 2 to the SRC1 and SRC2 registers feeding the ALU; the ALU result lands in RSLT and is written back to the destination register. Operand forwarding taps RSLT directly back into SRC1/SRC2.]


Data paths of CPU (2)

[Diagram: the same data path mapped onto the Operation and Write stages: a forwarding data path carries RSLT from the ALU output directly to SRC1/SRC2, bypassing the register file.]


Pipelined operation

I1: Add R1, R2, R3
I2: Shift_left R3

[Diagram: in the four-stage pipeline, I1's O stage computes R1 + R2 and its W stage writes R3; I2's O stage shifts R3, so the result of the Add has to be available in time.]


Short pipeline

[Diagram: in a short pipeline, the R1 + R2 result of I1 is forwarded directly into I2's shift of R3, so the dependent instructions proceed without a stall.]


Long pipeline

[Diagram: in a longer pipeline whose operation phase takes three sub-stages O1–O3, the result is forwarded from I1's last O sub-stage; I2 still has to wait for it before its own operation can proceed.]


Compiler solution

I1: Add R1, R2, R3
I2: Shift_left R3

becomes

I1: Add R1, R2, R3
    NOP
    NOP
I2: Shift_left R3

The compiler inserts no-operations to wait for the result.
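The compiler transformation can be sketched like this (illustrative Python; the tuple encoding of instructions and the fixed two delay slots are assumptions matching the example above):

```python
# Insert NOPs after any instruction whose destination register is read
# by the next instruction, so the result is written before it is used.
def insert_nops(program, slots=2):
    """program: list of (mnemonic, dest, sources) tuples."""
    out = []
    for i, (op, dest, srcs) in enumerate(program):
        out.append((op, dest, srcs))
        nxt = program[i + 1] if i + 1 < len(program) else None
        if nxt is not None and dest in nxt[2]:       # RAW on the next instruction
            out.extend([("NOP", None, ())] * slots)
    return out

prog = [("Add", "R3", ("R1", "R2")), ("Shift_left", "R3", ("R3",))]
for op, dest, srcs in insert_nops(prog):
    print(op, dest, srcs)
```

A real compiler would prefer to fill these slots with independent instructions, falling back to NOPs only when none are available.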


Side effects

I2: ADD D1, D2
I3: ADDX D3, D4    (the carry of I2 is copied and used)

Another form of (implicit) data dependency: instructions can have side effects that are used by the next instruction.


Complex addressing mode

Load (X(R1)), R2    (X is taken from the instruction)

[Diagram: the Load must compute the address X+[R1], fetch [X+[R1]] and then [[X+[R1]]] from memory, and finally deliver the value to R2; the extra cycles cause a pipeline stall, and the next instruction must wait.]


Simple addressing modes

Add #X,R1,R2
Load (R2),R2
Load (R2),R2

[Diagram: the same access built up from simple instructions: the Add computes X+[R1] into R2, the first Load fetches [X+[R1]], the second fetches [[X+[R1]]], with forwarding between them; the sequence takes the same amount of time as the complex addressing mode.]


Addressing modes

• Requirements on addressing modes with pipelining:
  - operand access takes no more than one memory access
  - only load and store instructions access memory
  - addressing modes have no side effects
• Possible addressing modes:
  - register
  - register indirect
  - index


Condition codes (1)

• Problems in RISC with condition codes (CCs):
  - do instructions, after reordering, have access to the right CC values?
  - are the CCs already available to the next instruction?
• Solutions:
  - compiler detection
  - no automatic use of CCs, only when explicitly specified in the instruction


Explicit specification of CCs

Increment R5
Add R2, R4
Add-with-increment R1, R3    (double-precision addition)

As PowerPC instructions (C: change carry flag, E: use carry flag):

ADDI R5, R5, 1
ADDC R4, R2, R4
ADDE R3, R1, R3


Two execution units

[Diagram: the fetch unit fills an instruction queue; a dispatch unit issues instructions to a floating-point (FP) unit and an integer unit, which share the write stage.]


Instruction flow (superscalar)

[Diagram: I1 (Fadd) and I3 (Fsub) spend three cycles in the O stage of the FP unit, while I2 (Add) and I4 (Sub) need one O cycle in the integer unit; the integer results are written before the floating-point ones, i.e., out of program order.]
Simultaneous execution of floating-point and integer operations.


Completion in program order

[Diagram: the same instruction mix, but each W stage waits until the previous instruction has completed, so results are written in program order.]


Consequences completion order

When an exception occurs:
• writes not necessarily in the order of the instructions: imprecise exceptions
• writes in program order: precise exceptions


PowerPC pipeline

[Diagram: PowerPC pipeline: the instruction cache feeds the instruction fetch and branch units; instructions pass through an instruction queue to the dispatcher, which issues to the IU, LSU, and FPU; the LSU reaches the data cache through a store queue, and a completion queue retires instructions in order.]


Performance Effects (1)

• Execution time of a program: T

• Dynamic instruction count: N

• Number of cycles per instruction: S

• Clock rate: R

• Without pipelining: T = (N x S) / R

• With an n-stage pipeline: T’ = T / n ???
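Why the "???": the ideal n-stage pipeline only approaches T / n. A quick check (assuming n = S stages of one cycle each at the same clock rate R, and no stalls):

```python
# Compare unpipelined execution time with an ideal n-stage pipeline.
def exec_time(N, S, R):
    return N * S / R                   # T = (N x S) / R without pipelining

def exec_time_pipelined(N, n, R):
    return (n + N - 1) / R             # fill the n stages, then 1 instr/cycle

N, S, R = 10**6, 4, 500e6
speedup = exec_time(N, S, R) / exec_time_pipelined(N, S, R)
print(speedup)                         # just under n = 4, never exactly T / n
```

The fill time of the pipeline (and, in practice, stalls) keeps the speedup strictly below n.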


Performance Effects (2)

• Cycle time: 2 ns (R is 500 MHz)
• Cache hit (miss) ratio for instructions: 0.95 (0.05)
• Cache hit (miss) ratio for data: 0.90 (0.10)
• Fraction of instructions that need data from memory: 0.30
• Cache miss penalty: 17 cycles
• Average extra delay per instruction: (0.05 + 0.3 x 0.1) x 17 = 1.36 cycles, so a slowdown by a factor of more than 2!
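The same arithmetic, spelled out (all numbers are the slide's assumptions):

```python
# Average extra cycles per instruction caused by cache misses.
i_miss = 0.05     # instruction cache miss ratio
d_frac = 0.30     # fraction of instructions that need data from memory
d_miss = 0.10     # data cache miss ratio
penalty = 17      # cycles per cache miss

extra = (i_miss + d_frac * d_miss) * penalty
print(extra)      # ~1.36 extra cycles per instruction
print(1 + extra)  # ~2.36 cycles per instruction on average: slowdown > 2
```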


Performance Effects (3)

• On average, the fetch stage takes, due to instruction cache misses: 1 + (0.05 x 17) = 1.85 cycles
• On average, the decode stage takes, due to operand cache misses: 1 + (0.3 x 0.1 x 17) = 1.51 cycles
• For a total additional cost of 0.85 + 0.51 = 1.36 cycles


Performance Effects (4)

• If only one stage takes longer, the additional time should be counted relative to one stage, not relative to the complete instruction.
• In other words: the pipeline is as slow as its slowest stage.



Performance Effects (5)

• A delay of 1 cycle for every 4 instructions in only one stage: average penalty 0.25 cycles
• Average inter-completion time: (3 x 1 + 1 x 2) / 4 = 1.25 cycles

[Diagram: I1–I5 in the four-stage pipeline, with every fourth instruction holding its stage one cycle longer.]


Performance Effects (6)

• Delays in two stages:
  - k% of the instructions delayed in one stage, with a penalty of s cycles
  - l% of the instructions delayed in another stage, with a penalty of t cycles
• Average inter-completion time: ((100-k-l) x 1 + k(1+s) + l(1+t)) / 100 = (100 + ks + lt) / 100
• In the example (k=5, l=3, s=t=17): 2.36
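Both this example and the earlier one (25% of the instructions delayed by 1 cycle, giving 1.25) are instances of the same formula:

```python
# Average cycles between instruction completions when k% of instructions
# stall s extra cycles in one stage and l% stall t extra cycles in another.
def inter_completion(k, s, l, t):
    return (100 + k * s + l * t) / 100

print(inter_completion(25, 1, 0, 0))    # the 1-in-4, 1-cycle case: 1.25
print(inter_completion(5, 17, 3, 17))   # the cache miss example: 2.36
```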


Performance Effects (7)

• A large number of pipeline stages seems advantageous, but:
  - more instructions are processed simultaneously, so there is more opportunity for conflicts
  - the branch penalty becomes larger
  - the ALU is usually the bottleneck, so there is no use in having smaller time steps