3 Pipelining

Pipelining: Basic and Intermediate Concepts


Outline

• What is pipelining?
• The basic pipeline for a RISC instruction set
• The major hurdle of pipelining – pipeline hazards
  – Structural hazards
  – Data hazards
  – Control hazards

What Is Pipelining?


Pipelining: It's Natural!

• Laundry example: A, B, C and D each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• Folder takes 20 minutes

Sequential Laundry

[Figure: timeline from 6 PM to midnight. Loads A, B, C and D run one after another in task order, each taking 30 + 40 + 20 minutes.]

• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP

[Figure: timeline starting at 6 PM. Loads A, B, C and D overlap in task order; the stage times along the time axis are 30, 40, 40, 40, 40, 20 minutes.]

• Pipelined laundry takes 3.5 hours for 4 loads
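The two totals can be checked with a short sketch. The stage times come from the laundry example; the function names and the assumption that a finished load can wait in a basket between machines are illustrative, not part of the slides.

```python
# Stage times in minutes, taken from the laundry example.
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, folder


def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load finishes every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """Start each stage as soon as both the load and the machine are free.

    Assumes a load can wait between machines (hypothetical buffering).
    """
    free_at = [0] * len(stages)  # time at which each machine becomes free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, free_at[s])
            ready = start + duration
            free_at[s] = ready
        finish = ready
    return finish


if __name__ == "__main__":
    print(sequential_minutes(4) / 60)  # 6.0 hours
    print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: once the pipeline is full, one load completes every 40 minutes, the time of the slowest stage (the dryer).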

Pipelining Lessons

• Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
  – Unbalanced lengths of pipe stages reduce speedup
  – Time to fill the pipeline and time to drain it reduce speedup

What is Pipelining?

• Pipelining is an implementation technique whereby multiple instructions are overlapped in execution.
• In a computer pipeline, each step in the "pipeline" completes a part of an instruction. Each of these steps is called a pipe stage or a pipe segment.
• The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.
• The throughput of an instruction pipeline is determined by how often an instruction exits the pipeline.
• The time required to move an instruction one step down the pipeline is a processor cycle.

What is Pipelining? (Cont.)

• Because all stages proceed at the same time, the length of a processor cycle is determined by the time required for the slowest pipe stage.
• In a computer, this processor cycle is usually 1 clock cycle (sometimes it is 2, rarely more).
• If the stages are perfectly balanced, then the time per instruction on the pipelined processor, assuming ideal conditions, is equal to

$$\frac{\text{Time per instruction on unpipelined machine}}{\text{Number of pipe stages}}$$
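A minimal sketch of this relationship, and of what happens when the stages are not balanced. The stage delays below are made-up numbers for illustration only, and the function names are hypothetical.

```python
def ideal_time_per_instruction(unpipelined_time, num_stages):
    """Time per instruction with perfectly balanced stages and no overhead."""
    return unpipelined_time / num_stages


def pipelined_speedup(stage_delays):
    """Speedup over the unpipelined machine once the pipeline is full.

    The unpipelined machine needs the sum of all stage delays per
    instruction; the pipelined machine completes one instruction per
    processor cycle, and the cycle is set by the slowest stage.
    """
    return sum(stage_delays) / max(stage_delays)


if __name__ == "__main__":
    # Hypothetical delays in nanoseconds.
    print(ideal_time_per_instruction(10.0, 5))       # balanced: 2.0 ns
    print(pipelined_speedup([2.0, 2.0, 2.0, 2.0, 2.0]))  # 5.0, equals stage count
    print(pipelined_speedup([1.0, 2.0, 3.0, 2.0, 2.0]))  # below 5: slowest stage dominates
```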

MIPS Architecture

• RISC, load-store architecture, simple addressing
• 32-bit instructions, fixed format
• 32 64-bit GPRs, R0-R31
  – Really, only 31: R0 is just a constant 0
• 32 64-bit FPRs, F0-F31
  – Can hold 32-bit floats also (with the other ½ unused)
  – "SIMD" extensions operate on more floats in 1 FPR
• A few special registers
  – Floating-point status register
• Load/store 8-, 16-, 32-, 64-bit integers
  – All sign-extended to fill 64-bit GPR
  – Also 32-bit floats/doubles

MIPS Addressing Modes

• Register (arith./logical ops only)
• Immediate (arith./logical only) & Displacement (load/stores only)
  – 16-bit immediate / offset field
  – Register indirect: use 0 as displacement offset
  – Direct (absolute): use R0 as displacement base
• Byte-addressed memory, 64-bit address
• Software-settable big-endian/little-endian flag
• Alignment required

Start with Unpipelined RISC (MIPS)

• Every instruction can be executed in 5 steps
  – Every instruction takes at most 5 clock cycles
• Each step's outputs are just passed to the next step (no latches)

Implementation of RISC Instruction Set

• Implementing the instruction set requires the introduction of several temporary registers that are not part of the architecture.

• Every instruction takes at most 5 clock cycles:

1. IF - instruction fetch
2. ID - instruction decode and register fetch
3. EX - execution/effective address
4. MEM - memory access/branch completion
5. WB - write back

The Basic Pipeline for MIPS


Simple MIPS Pipeline

• Stages now get executed 1 per cycle
  › Ideal result is the CPI reduced from 5 to 1
  › Is it really this simple? Of course not, but it's a start
• Different operations use the same resource on the same cycle? Structural hazard!
  › Separate instruction and data memories (IM, DM)
  › Register file: read in ID and write in WB (distinct use)
  › Write PC in IF and write either the incremented PC or the value of the branch target of an earlier branch (branch handling problem)
• Registers are needed between two adjacent stages for storing intermediate results
  › Otherwise, they will be overwritten by the next instruction

Best Case Pipeline Scenario

[Figure: best-case pipeline timing with fill, stable (5 times throughput), and drain phases.]

A pipeline can be thought of as a series of data paths (resources) shifted in time

• Perform the register write in the first half of the clock cycle (CC) and the register read in the second half

[Figure: a pipeline showing the pipeline registers between successive pipeline stages.]

Important Pipeline Characteristics

• Latency
  › Time it takes for an instruction to go through the pipe
  › Latency = # stages × stage-delay
  › Dominant feature if there are lots of exceptions
• Throughput
  › Determined by the rate at which instructions can start/finish
  › Dominant feature if no exceptions
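To make the latency and throughput distinction concrete, here is a small sketch that counts cycles for a stall-free pipeline. It assumes, for illustration, that the unpipelined machine spends one cycle per stage on every instruction; the function names are hypothetical.

```python
def pipelined_cycles(num_instructions, num_stages):
    """Cycles to run the instructions with no stalls.

    The first instruction needs num_stages cycles to fill the pipe;
    every later instruction finishes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def unpipelined_cycles(num_instructions, num_stages):
    """Assumes each instruction takes num_stages cycles, one after another."""
    return num_instructions * num_stages


def speedup(num_instructions, num_stages):
    return (unpipelined_cycles(num_instructions, num_stages)
            / pipelined_cycles(num_instructions, num_stages))


if __name__ == "__main__":
    print(speedup(4, 5))     # 2.5: fill and drain time dominate short runs
    print(speedup(1000, 5))  # close to 5: throughput dominates long runs
```

The latency of a single instruction stays at 5 cycles in both cases; only the rate at which instructions complete improves.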

Basic Performance Issues

• Pipelining improves CPU instruction throughput
  › Does not reduce the execution time of an individual instruction
  › Slightly increases the execution time of an individual instruction
    – Overhead in the control of the pipeline
    – Pipeline register delay + clock skew (Appendix A-10)
    – Limits the practical depth of a pipeline
  › A program runs faster and has lower total execution time, even though no single instruction runs faster

Pipeline Hazards

• Pipeline hazards prevent the next instruction in the instruction stream from executing during its designated clock cycle
• Hazards reduce the pipeline performance from the ideal speedup

Pipeline Hazards

• Structural hazards
  › Caused by resource conflicts
  › Possible to avoid by adding resources, but may be too costly
• Data hazards
  › An instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline
  › Can be mitigated somewhat by a smart compiler
• Control hazards
  › When the PC does not just get incremented
  › Branches and jumps - not too bad

Hazards cause Stalls – Two Policy Choices

• How about just stalling all stages?
  – OK, but the problem is usually adjacent stage conflicts
  – Hence nothing moves and the stall condition never clears
  – Cheap option, but it does not work
• Stall later instructions, let earlier ones progress
  – Instructions issued later than the stalled instruction are also stalled
  – Instructions issued earlier than the stalled instruction must continue

Structural Hazards

• If some combination of instructions cannot be accommodated because of resource conflicts, the machine is said to have a structural hazard.
  – Some functional unit is not fully pipelined
  – Some resource has not been duplicated enough to allow all combinations of instructions in the pipeline to execute
    • Single-port register file - conflict with multiple stage needs
    • Memory fetch - may need one in both IF and MEM stages
• The pipeline stalls instructions until the required unit is available
  – A stall is commonly called a pipeline bubble or just a bubble

Structural Hazard Example


Remove Structural Hazard


(Only load/store/branch use stage 4)

No real hazard if inst1 is not a load or store

Pipeline Stalled for a Structural Hazard (Another View)

Clock Cycle Number

Inst.     1     2     3     4     5     6     7     8     9     10
i (Load)  IF    ID    EX    MEM   WB
i+1             IF    ID    EX    MEM   WB
i+2                   IF    ID    EX    MEM   WB
i+3                         STALL IF    ID    EX    MEM   WB
i+4                                     IF    ID    EX    MEM   WB
i+5                                           IF    ID    EX    MEM
i+6                                                 IF    ID    EX
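The stall pattern in the table can be reproduced with a rough sketch. It assumes a 5-stage pipeline with a single memory port shared by IF and MEM, and models only that one conflict; the function name and the boolean input format are illustrative.

```python
def fetch_cycles(is_mem_op):
    """Return the cycle in which each instruction performs IF.

    is_mem_op[k] is True when instruction k is a load or store.
    Only the IF/MEM conflict on one memory port is modeled.
    """
    mem_offset = 3   # MEM happens three cycles after IF (IF, ID, EX, MEM)
    port_busy = set()  # cycles in which some MEM stage occupies the port
    fetch = []
    cycle = 1
    for uses_mem in is_mem_op:
        while cycle in port_busy:  # stall the fetch: insert a bubble
            cycle += 1
        fetch.append(cycle)
        if uses_mem:
            port_busy.add(cycle + mem_offset)
        cycle += 1
    return fetch


if __name__ == "__main__":
    # Instruction i is a load; i+1 .. i+6 do not access data memory.
    print(fetch_cycles([True] + [False] * 6))  # [1, 2, 3, 5, 6, 7, 8]
```

Instruction i+3 is fetched in cycle 5 instead of cycle 4 because the load occupies the memory port in cycle 4, which matches the STALL entry in the table.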

Why Would a Designer Allow Structural Hazard?

• A machine without structural hazards will always have a lower CPI (if all other factors are equal)
• Why would a designer allow a structural hazard?
  – Reduce cost
    • Pipelining or duplicating all the functional units may be too costly

Data Hazards


Introduction

• Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially executing instructions on an unpipelined machine.
  – Example: a later instruction uses a result that has not yet been produced by an earlier instruction
• Example
  – ADD R1, R2, R3
  – SUB R4, R1, R5
  – AND R6, R1, R7
  – OR R8, R1, R9
  – XOR R10, R1, R11

R1 ← R2 + R3

R1 gets produced in the first instruction, and used in every subsequent instruction.

The use of the result of ADD in the next three instructions causes a hazard, since the register is not written until after those instructions read it.

Forwarding -- also called bypassing, shorting, short-circuiting

• Key is to keep the ALU result around
• Example
  – ADD R1, R2, R3
  – SUB R4, R1, R5
  – ADD produces the R1 value at the ALU output; SUB needs it again at the ALU input
• How do we handle this in general?
  – Forwarded value can be at the ALU output or the MEM stage output

Forwarding (Cont.)

• Use the previous example
  – Forward the result from where ADD produces it (EX/MEM register) to where SUB needs it (ALU input latch)
  – Forwarding works as follows:
    • The ALU result from the EX/MEM register is fed back to the ALU input latch
    • If the forwarding hardware detects that the previous ALU operation has written the register corresponding to a source for the current ALU operation, control logic selects the forwarded result as the ALU input rather than the value read from the register file
• Generalization of forwarding
  – Pass a result directly to the functional unit that requires it: a result is forwarded from the pipeline register corresponding to the output of one unit to the input of another
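The selection logic described above can be sketched in software. This is an illustrative model, not the actual MIPS control logic: the `Latch` class, its field names, and `select_alu_operand` are hypothetical, and the priority given to the newer EX/MEM value is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Latch:
    """Hypothetical view of a pipeline register holding a pending result."""
    writes_reg: bool  # the instruction in this latch will write a register
    rd: int           # destination register number
    value: int        # result carried in the latch


def select_alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the ALU input for source register src_reg.

    Prefer the newest pending result (EX/MEM), then MEM/WB, and fall
    back to the value read from the register file. R0 is the constant
    0 and is never forwarded.
    """
    if src_reg != 0:
        if ex_mem.writes_reg and ex_mem.rd == src_reg:
            return ex_mem.value
        if mem_wb.writes_reg and mem_wb.rd == src_reg:
            return mem_wb.value
    return regfile_value


if __name__ == "__main__":
    # ADD R1, R2, R3 has just left EX with result 7; SUB R4, R1, R5 is in EX.
    add_result = Latch(writes_reg=True, rd=1, value=7)
    nothing = Latch(writes_reg=False, rd=0, value=0)
    stale_r1 = 0  # the register file has not been updated yet
    print(select_alu_operand(1, stale_r1, add_result, nothing))  # 7
```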

Result With Forwarding

[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) with forwarding paths for the sequence ADD R1, R2, R3; SUB R4, R1, R5; AND R6, R1, R7; OR R8, R1, R9; XOR R10, R1, R11.]

Another Forwarding Example

• Example
  – ADD R1, R2, R3
  – LW R4, 0(R1)
  – SW 12(R1), R4
• Forwarding result: shown on the next slide

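[Figure: forwarding paths for the three instructions, annotated per instruction as follows.]

• ADD R1, R2, R3: A ← R2, B ← R3; AO = A + B (produces R1); do nothing in MEM; R1 ← AO
• LW R4, 0(R1): A ← R1, B ← R4, Imm ← 0; AO = A + Imm (uses R1); LMD = Mem[AO] (produces R4); R4 ← LMD
• SW 12(R1), R4: A ← R1, B ← R4, Imm ← 12; AO = A + Imm (uses R1); Mem[AO] ← B (uses R4)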

When Forwarding Fails

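[Figure: a load that produces R1 followed by three instructions that use R1, annotated as follows.]

• DM: LMD ← MEM[ALUO]; RD: R1 ← LMD
• RS: A ← R1, B ← R5; ALU: ALUO ← A - B
• RS: A ← R1, B ← R7; ALU: ALUO ← A AND B
• RS: A ← R1, B ← R5; ALU: ALUO ← A OR B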

Stalls

• Some latencies can't be absorbed, as in the case on the previous slide
  – Stalls are the result
  – Need pipeline interlock circuits
    • Detect a hazard and introduce bubbles until the hazard clears
  – CPI for stalled instructions will bloat by the number of bubbles
• Bubbles cause the forwarding paths to change
• In MIPS, if the instruction after a load uses the load result, a one-clock-cycle stall will occur!

Bubbles and new Forwarding Paths


Handling Stalls

• Hardware: pipeline interlocks
  – Must detect when required data cannot be provided
  – Stall stages to create a bubble
• Software: pipeline or instruction scheduling
  – Performed by a smart compiler

Hardware vs. Software

Pipeline scheduling for A = B + C; D = E - F

Unscheduled code:
  LW  RB, B
  LW  RC, C
  ADD RA, RB, RC
  SW  A, RA
  LW  RE, E
  LW  RF, F
  SUB RD, RE, RF
  SW  D, RD

Scheduled code:
  LW  RB, B
  LW  RC, C
  LW  RE, E
  ADD RA, RB, RC
  LW  RF, F
  SW  A, RA
  SUB RD, RE, RF
  SW  D, RD
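The benefit of scheduling can be checked by counting load-use stalls in both orderings. The sketch below uses a deliberately simplified model: one stall whenever the instruction immediately after a load reads the loaded register, with store data treated like any other source (a design that forwards load data directly to a store would not stall there). The parser and function names are hypothetical.

```python
def parse(line):
    """Split 'ADD RA, RB, RC' into ('ADD', ['RA', 'RB', 'RC'])."""
    op, rest = line.split(None, 1)
    return op, [operand.strip() for operand in rest.split(",")]


def load_use_stalls(program):
    """Count stalls under the simplified one-cycle load-use model."""
    stalls = 0
    prev_load_dest = None
    for line in program:
        op, operands = parse(line)
        if op == "LW":        # LW Rdest, symbol: no register sources
            sources = []
        elif op == "SW":      # SW symbol, Rsrc: reads the stored register
            sources = [operands[1]]
        else:                 # ALU op: Rdest, Rsrc1, Rsrc2
            sources = operands[1:]
        if prev_load_dest is not None and prev_load_dest in sources:
            stalls += 1
        prev_load_dest = operands[0] if op == "LW" else None
    return stalls


UNSCHEDULED = ["LW RB, B", "LW RC, C", "ADD RA, RB, RC", "SW A, RA",
               "LW RE, E", "LW RF, F", "SUB RD, RE, RF", "SW D, RD"]
SCHEDULED = ["LW RB, B", "LW RC, C", "LW RE, E", "ADD RA, RB, RC",
             "LW RF, F", "SW A, RA", "SUB RD, RE, RF", "SW D, RD"]

if __name__ == "__main__":
    print(load_use_stalls(UNSCHEDULED))  # 2
    print(load_use_stalls(SCHEDULED))    # 0
```

In the unscheduled code, ADD follows the load of RC and SUB follows the load of RF, so each waits one cycle; the scheduled code places an independent instruction after every load.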

Data Hazard Forms

Instruction i occurs before instruction j in program execution order.

• RAW - read after write
  – j reads before i writes; hence j gets the wrong (old) value
  – Most common form of data hazard
  – As we have seen, forwarding can overcome this one
• WAW - write after write
  – j writes before i writes, leaving an incorrect value
  – Can this happen in MIPS? Why?
    • WAW can happen only in pipelines that write in more than one pipe stage (or allow an instruction to proceed even when a previous instruction is stalled)

Data Hazard Forms (Cont.)

• WAR - write after read
  – i then j is the intended order
  – j writes before i reads; i ends up with the incorrect (new) value
  – Is this a problem in MIPS? Why?
    • May happen only when some instructions write results early in the pipe stages, and others read a source late in the stages
• RAR - read after read
  – Not a hazard
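As a small illustration of the three forms, the sketch below classifies the register dependences between an earlier instruction i and a later instruction j. It only reports which dependences exist; whether one becomes an actual hazard depends on the pipeline, as the bullets above note. The tuple representation and function name are hypothetical.

```python
def classify_dependences(i, j):
    """Classify dependences between instruction i and a later instruction j.

    Each instruction is a (dest, sources) tuple; dest is None when the
    instruction writes no register.
    """
    i_dest, i_sources = i
    j_dest, j_sources = j
    found = []
    if i_dest is not None and i_dest in j_sources:
        found.append("RAW")  # j reads what i writes
    if i_dest is not None and i_dest == j_dest:
        found.append("WAW")  # both write the same register
    if j_dest is not None and j_dest in i_sources:
        found.append("WAR")  # j writes what i reads
    return found


if __name__ == "__main__":
    add = ("R1", ["R2", "R3"])  # ADD R1, R2, R3
    sub = ("R4", ["R1", "R5"])  # SUB R4, R1, R5
    print(classify_dependences(add, sub))  # ['RAW']
```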

Control Hazards


Introduction

• Control hazards: how does a branch influence the pipeline?
• Problem is more complex - need 2 things
  – Branch target (taken means not PC+4; not taken means the condition fails) (MEM)
  – CC valid - in the MIPS case the result of the Zero detect unit (EX)
  – Both happen late in the pipe
• How to deal with a branch?
  – Stall the pipeline as soon as we detect the branch (ID), and stall the pipeline until we reach the MEM stage
    • Three-cycle stall
  – The first IF is essentially a stall (when the branch is taken)
  – Consider a 30% branch frequency and an ideal CPI of 1…
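As a rough worked example of the last bullet, assuming every branch pays the full three-cycle stall and there are no other stalls (both simplifying assumptions):

$$\text{CPI} = \text{CPI}_{\text{ideal}} + f_{\text{branch}} \times \text{stall cycles} = 1 + 0.30 \times 3 = 1.9$$

Under these assumptions the pipeline delivers only 1/1.9, roughly 53%, of its ideal performance.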

A branch causes a 3-cycle stall in the MIPS pipeline

Branch instruction    IF    ID    EX    MEM   WB
Branch successor            IF    Stall Stall IF    ID    EX    MEM   WB
Branch successor + 1                                IF    ID    EX    MEM   WB
Branch successor + 2                                      IF    ID    EX    MEM
Branch successor + 3                                            IF    ID    EX
Branch successor + 4                                                  IF    ID
Branch successor + 5                                                        IF

The branch successor is PC+4 or the branch target address (BTA), depending on the CC.

Control Hazard Avoidance

• Simplest scheme
  – Freeze the pipe until you know the CC and branch target
  – Cheap but slow
  – Too slow, since 2 or 3 bubbles would negate half of the pipeline speedup
• Predict not taken (47% of MIPS branches are not taken on average)
  – Make sure any state change (destructive phase) is deferred until you know whether you guessed right
  – If not, then back out or flush
• Predict taken (53% of MIPS branches are taken on average)
  – No use in MIPS (target address and branch outcome are known at the same stage)
• Or let the compiler decide - same options
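A back-of-the-envelope comparison of freezing the pipe with predicting not taken. The 53% taken fraction comes from the list above and the 30% branch frequency from the earlier control hazard slide; the 3-cycle penalty for a mispredicted branch, the absence of other stalls, and the function names are assumptions of this sketch.

```python
def cpi_freeze(branch_freq, stall_cycles, ideal_cpi=1.0):
    """Every branch pays the full stall."""
    return ideal_cpi + branch_freq * stall_cycles


def cpi_predict_not_taken(branch_freq, taken_fraction, penalty_cycles,
                          ideal_cpi=1.0):
    """Only taken branches (the mispredictions) pay the penalty."""
    return ideal_cpi + branch_freq * taken_fraction * penalty_cycles


if __name__ == "__main__":
    print(cpi_freeze(0.30, 3))                   # about 1.9
    print(cpi_predict_not_taken(0.30, 0.53, 3))  # about 1.48
```

Because roughly half of the branches are taken, predicting not taken removes a little under half of the branch penalty in this model.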

Predict-Not-Taken


A Stall indeed

What Makes Pipelining Hard to Implement?

• Hazards prevent the next instruction from executing during its designated clock cycle.
• Exceptions and interrupts add complexity to the pipelining unit and decrease its efficiency:
  • The terms interrupt, fault, and exception are used to describe exceptional situations where the normal execution order of instructions is changed in unexpected ways.
  • The occurrence of an event is usually signaled by an interrupt from either the hardware or the software. Hardware may trigger an interrupt at any time by sending a signal to the CPU. Software may trigger an interrupt by executing a special operation called a system call (exception or trap).

What Makes Pipelining Hard to Implement? (Cont.)

• Examples: I/O device request, invoking an operating system service from a user program, integer arithmetic overflow, power failure, page fault, divide error.
• Other instructions in the pipeline can raise exceptions that may force the CPU to abort the instructions in the pipeline before they complete.
• The pipeline must be safely shut down and the state saved so that the instruction can be restarted in the correct state after the exception is serviced.
• When an exception occurs, the pipeline control can take the following steps to save the pipeline state safely:
  • Force a trap instruction into the pipeline on the next IF.

Stopping and Restarting Execution

• Until the trap is taken, turn off all writes (WB) for the faulting instruction and for all instructions that follow in the pipeline; this can be done by placing zeros into the pipeline latches of all instructions in the pipeline, starting with the instruction that generates the exception, but not those that precede that instruction.

• After the exception-handling routine in the OS receives control, it immediately saves the PC of the faulting instruction (and other PCs). This value will be used to return from the exception later.

• After the exception has been handled, special instructions return the processor from the exception by reloading the PCs and restarting the instruction stream.

• If the pipeline can be stopped so that the instructions just before the faulting instruction are completed and those after it can be restarted from scratch, the pipeline is said to have precise exceptions.
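The squashing step can be sketched as follows. This is an illustrative software model of the idea, not the hardware mechanism itself: the `InFlight` class, its fields, and `take_exception` are hypothetical names, and the pipeline is modeled as a list ordered from oldest to youngest instruction.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    """Hypothetical record for one instruction currently in the pipeline."""
    pc: int
    write_enabled: bool = True


def take_exception(pipeline, faulting_index):
    """Disable writes for the faulting instruction and everything younger.

    Older instructions (before faulting_index) keep their writes and are
    allowed to complete. Returns the PC to save so that execution can
    restart at the faulting instruction after the handler finishes.
    """
    for instr in pipeline[faulting_index:]:
        instr.write_enabled = False
    return pipeline[faulting_index].pc


if __name__ == "__main__":
    pipe = [InFlight(pc=100), InFlight(pc=104), InFlight(pc=108), InFlight(pc=112)]
    saved_pc = take_exception(pipe, faulting_index=2)
    print(saved_pc)                              # 108
    print([instr.write_enabled for instr in pipe])  # [True, True, False, False]
```

Letting the older instructions finish while discarding the effects of the faulting instruction and its successors is what makes the exception precise in this model.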

