ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl’s Law
Pipelining is a method of processing in which a problem is divided into a number of sub-problems, and the solutions of the sub-problems for different instances of the problem are overlapped in time.
Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3,…,n
[Figure: a linear pipeline of four adders. The inputs b[1], c[1], d[1], e[1], f[1] (then b[2], c[2], ... on later clocks) enter on the left; delay latches (D) hold the later operands so that each partial sum meets its next operand at the right adder, and the results a[1], a[2], ... emerge on the right.]
Each adder has delay D to compute.
Computation time = 4D + (n − 1)D = nD + 3D
Speed-up = 4nD / (nD + 3D) → 4 for large n.
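The timing argument above can be checked numerically. The sketch below assumes the 4-adder pipeline model just described (the function names are mine):

```python
# A minimal sketch of the 4-stage adder pipeline timing, assuming each
# adder has delay D and a new set of operands enters every D.
def pipeline_time(n, D=1.0):
    # first sum emerges after 4D; one more result every D thereafter
    return 4 * D + (n - 1) * D

def pipeline_speedup(n, D=1.0):
    serial = 4 * n * D  # four dependent additions per element, no overlap
    return serial / pipeline_time(n, D)
```

For n = 1 the speed-up is 1 (nothing overlaps); as n grows it approaches the number of stages, 4.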
We can describe the computation process in a linear pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying the pipeline.
(linear pipeline)
while (1) {
    resetLatches();
    clock = 0;
    // fill the pipeline
    for (j = 0; j <= n-1; j++) {
        for (k = 0; k <= j; k++)
            segment(k);
        clock++;
    }
    // execute all segments until the last input arrives
    while (clock <= m) {
        for (j = 0; j <= n-1; j++)
            segment(j);
        clock++;
    }
    // empty the pipeline
    for (j = 0; j <= n-1; j++) {
        for (k = j; k <= n-1; k++)
            segment(k);
        clock++;
    }
}
Instruction pipelines:

clock  fetch  decode  execute
0      I1     -       -
1      I2     I1      -
2      I3     I2      I1
3      I4     I3      I2
4      -      I4      I3
Goal: (i) to increase the throughput (number of instructions/sec) in executing programs; (ii) to reduce the execution time (clock cycles/instruction, etc.).
clock  fetch  decode  execute  memory  write-back
0      I1     -       -        -       -
1      I2     I1      -        -       -
2      I3     I2      I1       -       -
3      I4     I3      I2       I1      -
4      I5     I4      I3       I2      I1
Speed-up of pipelined execution of instructions over a sequential execution:

S(5) = T_u / T_p = (CPI_u × N_u / f_u) / (CPI_p × N_p / f_p)

where N is the number of instructions executed, CPI the average number of clock cycles per instruction, and f the clock frequency (subscript u: unpipelined, p: pipelined). Assuming that the systems operate at the same clock rate and execute the same number of instructions:

S(5) = CPI_u / CPI_p = CPI_u / 1
Example: Suppose that the instruction mix of programs executed on serial and pipelined machines is 40% ALU, 20% branching, and 40% memory instructions, with 4, 2, and 4 cycles per instruction in the three classes, respectively.

Then, under ideal conditions (no stalls due to hazards):

S(5) = CPI_u / CPI_p = (4 × 0.4 + 2 × 0.2 + 4 × 0.4) / 1 = 3.6
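A quick numeric check of this instruction-mix example (the dictionary layout is mine):

```python
# instruction mix: (fraction of instructions, unpipelined cycles) per class
mix = {"ALU": (0.40, 4), "branch": (0.20, 2), "memory": (0.40, 4)}

cpi_unpipelined = sum(frac * cycles for frac, cycles in mix.values())
cpi_pipelined = 1.0  # ideal pipeline, one instruction completed per clock
speedup = cpi_unpipelined / cpi_pipelined
```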
If the clock period needs to be increased for the pipelined implementation (e.g., due to latch overhead), then the speed-up will have to be scaled down accordingly.
MIPS Pipeline

Register operations:        IF ID EX WB
Register/Memory operations: IF ID EX ME WB

Instruction Pipelines (Hennessy & Patterson)
Hazards

1- Structural Hazards
2- Data Hazards
3- Control Hazards

Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period.
Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)
Clock  IF  ID  EX  ME  WB
0      I1  -   -   -   -
1      I2  I1  -   -   -
2      I3  I2  I1  -   -
3      I4  I3  I2  I1  -
4      I5  I4  I3  I2  I1
5      I6  I5  I4  I3  I2
6      I7  I6  I5  I4  I3
Fixes: duplicate the hardware (too expensive), or stall the pipeline to serialize the conflicting operations (too slow).
Clock  IF  ID  EX  ME  WB
0      I1  -   -   -   -
1      I2  I1  -   -   -
2      -   I2  I1  -   -
3      -   -   I2  I1  -
4      I3  -   -   I2  I1
5      I4  I3  -   -   I2
6      -   I4  I3  -   -
7      -   -   I4  I3  -
8      I5  -   -   I4  I3
9      I6  I5  -   -   I4

(The fetch stage is stalled whenever the conflicting stage is busy, so instructions enter the pipeline in pairs.)
Speed-up = T_serial / T_pipeline = 5n·t_s / (2n·t_s + 2t_s) for odd n, and 5n·t_s / (2n·t_s + 3t_s) for even n → 5/2 as the number of instructions, n, tends to infinity.
Thus, we lose half the throughput due to stalls.
Note: The pipeline time of execution can be computed using the recurrences
T1 = 4
Ti = Ti-1 + 1 for even i
Ti = Ti-1 + 3 for odd i
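The recurrence can be evaluated directly; the sketch below also confirms that the speed-up tends to 5/2 (the function name is mine):

```python
# completion clock of instruction i under the stall pattern above:
# T1 = 4; each even-numbered instruction adds 1 clock, each odd one adds 3
def completion_clock(i):
    t = 4
    for j in range(2, i + 1):
        t += 1 if j % 2 == 0 else 3
    return t
```

For n instructions the serial machine needs 5n clocks, so the ratio 5n / completion_clock(n) approaches 5/2 as n grows.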
Data Hazards
They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) Hazard (data dependency)
Write After Read (WAR) Hazard (anti-dependency)
Write After Write (WAW) Hazard (output dependency)
RAW Hazards
They occur when reads are early and writes are late.
Clock  IF  ID  EX         ME  WB
0      I1  -   -          -   -
1      I2  I1  -          -   -
2      I3  I2  I1         -   -
3      I4  I3  I2 (Read)  I1  -
4      I5  I4  I3         I2  I1 (Write)
5      I6  I5  I4         I3  I2
6      I7  I6  I5         I4  I3

I1: R1 = R1 + R2
I2: R3 = R1 + R2

I2 reads R1 (clock 3) before I1 writes it back (clock 4), so I2 uses a stale value.
RAW Hazards (Cont'd): They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:
Clock  IF  ID  EX         ME  WB
0      I1  -   -          -   -
1      I2  I1  -          -   -
2      I3  I2  I1         -   -
3      I4  I3  I2 (Read)  I1  -
4      I5  I4  I3         I2  I1 (Write)
5      I6  I5  I4         I3  I2
6      I7  I6  I5         I4  I3

I1: R1 = R1 + R2
I2: R3 = R1 + R2

With forwarding, I1's result is passed from the pipeline latch directly to I2's EX stage at clock 3, so I2 reads the correct value without stalling.
WAR Hazards: They occur when writes are early and reads are late.
Clock  IF  ID  EX  ME  WB          |  EX  ME  WB
0      I1
1      I2  I1
2      I3  I2  I1
3      I4  I3  I2  I1
4      I5  I4  I3  I2  I1
5      I6  I5  I4  I3  I2 (Write)  |  I1 (Read)
6      I7  I6  I5  I4  I3          |  I2  I1

(Each instruction issues a second operation that flows through a second, lagging EX/ME/WB pipe.)

I1: R2 = R2 + R3 ; R9 = R3 + R4
I2: R3 = R7 + R5 ; R6 = R2 + R8

I2's write of R3 (clock 5, first pipe) occurs before I1's second operation reads R3 in the lagging pipe, so I1 reads the wrong value.
Branch Prediction in Pipeline Instruction Sequencing
One of the major issues in pipelined instruction processing is scheduling conditional branch instructions. When the pipeline controller encounters a conditional branch instruction, it must choose between two instruction streams: if the branch condition is met, execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows it.
Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):

      JCD R0 < 10, add;
      SUB R0,R1;
      JMP D, halt;
add:  ADD R0,R1;
halt: HLT;
If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we would need another fetch cycle to fetch the ADD instruction.
Classification of branch prediction algorithms
Static Branch Prediction: The branch decision does not change over time-- we use a fixed branching policy.
Dynamic Branch Prediction: The branch decision does change over time-- we use a branching policy that varies over time.
Static Branch Prediction Algorithms
1- Don't predict (stall the pipeline)
2- Never take the branch
3- Always take the branch
4- Delayed branch
1- Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.
JCD  IF  ID  EX  ME  WB
SUB      (not fetched; pipeline stalled one cycle)
ADD          IF  ID  EX  ME  WB

Stall one cycle and decide the branch.
Pipeline Execution Speed (stall case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + number of idle cycles per instruction
                    = 1 + branch penalty × branch frequency
                    = 1 + branch frequency (penalty = 1 cycle)
In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards
Pros: Straightforward to implement
Cons: The time overhead is high when the instruction mix includes a high percentage of branch instructions.
2- Never take the branch: Execution continues along the fall-through path; the instruction in the pipeline is flushed if, after the ID stage is carried out, it is determined that the branch should have been taken.
JCD  IF  ID  EX  ME  WB
SUB      IF  ID  EX  ME  WB
IOR          IF  ID  EX  ME  WB   (branch not taken)
XOR          IF  ID  EX  ME  WB   (branch taken; SUB is flushed)

The SUB instruction is always fetched: if the branch is not taken, SUB completes and the IOR instruction executes next; if the branch is taken, SUB is flushed and the XOR instruction at the branch target is executed.
Pipeline Execution Speed (Never take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + number of idle cycles per instruction
                    = 1 + branch penalty × branch frequency × misprediction rate
                    = 1 + branch frequency × misprediction rate (penalty = 1 cycle)
Pros: If the prediction is highly accurate then the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward and requires flushing if decoding the branch address takes more than 1 clock cycle.
3- Always take the branch: The instruction in the pipeline is flushed if it is determined that the branch should not have been taken.

JCD  IF  ID  EX  ME  WB
SUB      IF  ID  EX  ME  WB
IOR          IF  ID  EX  ME  WB
XOR          IF  ID  EX  ME  WB

(The branch target address computation completes only after the EX segment.)
Pipeline Execution Speed (Always take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + number of idle cycles per instruction
                    = 1 + branch penalty × branch frequency × prediction rate + branch penalty × branch frequency × misprediction rate
                    = 1 + branch frequency × prediction rate + 2 × branch frequency × misprediction rate
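The CPI formulas for the three static policies can be compared directly. This is a sketch with illustrative (assumed) numbers, not measurements:

```python
def cpi_stall(branch_freq):
    # every branch idles the pipeline for one cycle
    return 1 + branch_freq

def cpi_never_taken(branch_freq, miss_rate):
    # a 1-cycle flush is paid only on mispredicted branches
    return 1 + branch_freq * miss_rate

def cpi_always_taken(branch_freq, miss_rate):
    # 1 cycle even when right (target address computed late), 2 when wrong
    return 1 + branch_freq * (1 - miss_rate) + 2 * branch_freq * miss_rate
```

With an assumed branch frequency of 0.2 and misprediction rate of 0.3, these give 1.20, 1.06 and 1.26 cycles per instruction, respectively.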
Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome; see the next slide).
Cons: Implementation is not as straightforward, and it has a higher misprediction penalty. Not as advantageous as never taking the branch, since the branch address computation is not completed until after the EX segment is carried out.
Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;
"Branch always" will not work well without the compiler's help:

      CLR R0;
loop: JCD R0 >= 10, exit;
      LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JMP D, loop;
exit:
----------------------------------------------------------
"Branch always" will work well without the compiler's help:

      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 < 10, loop;
4- Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: The pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a suitable delay-slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It also makes the compiler work harder.
Which instruction to place into the delayed branch slot?
4.1- Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:
ADD R1,R2;
JCD R2 > 10, exit;

can be rescheduled as

JCD R2 > 10, exit;
ADD R1,R2;    (delay slot)
4.2- Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.

Example:

      ADD R1,R2;
      JCD R2 > 10, sub;
      JMP D, add;
      ...
sub:  SUB R4,R5;
add:  ADI R3,5;

can be rescheduled as

      ADD R1,R2;
      JCD R2 > 10, sub;
      ADI R3,5;    (delay slot)
      ...
sub:  SUB R4,R5;
4.3- Choose an instruction from the anti-target (fall-through path) of the branch, but make sure that the moved instruction is executable when the branch is taken.

Example:

      JCD R2 > 10, exit;
      ADD R3,R2;
exit: SUB R4,R5;

can be rescheduled as

      JCD R2 > 10, exit;
      ADD R3,R2;    (delay slot; schedule for execution only if it does not alter the program flow or output)
exit: SUB R4,R5;
Dynamic Branch Prediction
-- Dynamic branch prediction relies on the history of how branch conditions were resolved in the past.
-- The history of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, it is indexed by some fixed number of low-order bits of the address of the branch instruction.
-- The assumption is that the low-order address bits are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program which remains within a block of 256 locations, 8 bits should suffice.
[Figure: a branch-history buffer indexed by the low-order 8 bits of the branch address; JCD instructions at addresses x and x+256 map to the same buffer entry.]
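The indexing scheme can be sketched in a few lines (`buffer_index` is my name, and the addresses are made up for illustration):

```python
def buffer_index(branch_addr, bits=8):
    # keep only the low-order `bits` bits of the branch address
    return branch_addr & ((1 << bits) - 1)

x = 0x1234  # an arbitrary branch address
collides = buffer_index(x) == buffer_index(x + 256)  # True: 256 apart
```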
Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.
Memory location   Program              Branch prediction field
x                 Branch instruction   0 (branch was not taken)
x+4               ...
x+8               Branch instruction   0 (branch was not taken)
x+12              ...
x+16              ...
x+20              Branch instruction   1 (branch was taken)
Branch prediction: In the simplest case, the field is a 1-bit tag:

0 <=> branch was not taken last time (State A)
1 <=> branch was taken last time (State B)

State transitions:
A --taken--> B        A --not taken--> A
B --not taken--> A    B --taken--> B

While in state A, predict the branch as "not to be taken".
While in state B, predict the branch as "to be taken".
This works relatively well: it accurately predicts the branches of a loop in all but two of the iterations.

      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 < 10, loop;
Assuming that we begin in state A, prediction fails when R0 = 1 (branch is predicted not taken when it should be taken) and when R0 = 10 (branch is predicted taken when it should not be).
Assuming that we begin in state B, prediction fails only when R0 = 10 (branch is predicted taken when it should not be).
We can modify the loop so that the branch prediction algorithm fails twice when we begin in state B as well.

      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 >= 10, exit;
      JMP D, loop;
exit:

Assuming that we begin in state B, prediction fails when R0 = 1 (branch is predicted taken when it should not be) and when R0 = 10 (branch is predicted not taken when it should be taken).
What is worse, we can make this branch prediction algorithm fail on every prediction:

      LDI R0,1;
loop: JCD R0 > 0, neg;
      LDI R0,1;
      JMP D, loop;
neg:  LDI R0,-1;
      JMP D, loop;

Assuming that we begin in state A, prediction fails when R0 = 1 (branch is predicted not taken when it should be taken), then when R0 = -1 (branch is predicted taken when it should not be), and so on, alternating forever.
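The 1-bit scheme can be simulated directly. The sketch below encodes state A as "predict not taken" and remembers only the last outcome:

```python
def misses_1bit(outcomes, state="A"):
    # state "A": predict not taken; state "B": predict taken
    misses = 0
    for taken in outcomes:
        if (state == "B") != taken:
            misses += 1
        state = "B" if taken else "A"  # remember only the last outcome
    return misses
```

The loop branch (taken for R0 = 1..9, not taken at R0 = 10) misses twice starting from state A, while the alternating branch of the last example misses every single time.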
2-bit prediction (a more reluctant flip in decision):

While in states A1 and A2, predict the branch as "not to be taken".
While in states B2 and B1, predict the branch as "to be taken".

The states form a saturating counter:
on taken:     A1 --> A2 --> B2 --> B1 --> B1
on not taken: B1 --> B2 --> A2 --> A1 --> A1

From a strong state (A1 or B1), two consecutive mispredictions are needed to reverse the predicted direction.
      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 < 10, loop;

Assuming that we begin in state A1, prediction fails when R0 = 1, 2 (branch is predicted not taken when it should be taken) and when R0 = 10 (branch is predicted taken when it should not be).
Assuming that we begin in state B1, prediction fails only when R0 = 10 (branch is predicted taken when it should not be).
2-bit predictors are more resilient to branch inversions (the prediction is reversed only when it misses twice):

      LDI R0,1;
loop: JCD R0 > 0, neg;
      LDI R0,1;
      JMP D, loop;
neg:  LDI R0,-1;
      JMP D, loop;

Assuming that we begin in state B1, prediction succeeds when R0 = 1 (branch is taken as predicted), fails when R0 = -1 (branch is predicted taken when it should not be), succeeds again when R0 = 1, fails again when R0 = -1, and so on: it fails only every other time.
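A sketch of the 2-bit saturating counter, with the loop and inversion examples as checks (the state encoding 0..3 = A1, A2, B2, B1 is mine):

```python
def misses_2bit(outcomes, state=0):
    # states 0,1 (A1, A2) predict not taken; states 2,3 (B2, B1) predict taken
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # saturating counter: step toward taken (up) or not-taken (down)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses
```

Starting from A1, the loop branch misses three times (R0 = 1, 2 and the exit); starting from B1, the alternating branch misses only every other time.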
Amdahl's Law (Fixed Load Speed-up)
Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors, p > 1, assuming the remaining work parallelizes linearly. Then
T(p) >= q·T(1) + (1 − q)·T(1)/p

S(p) = T(1)/T(p) <= 1/(q + (1 − q)/p) → 1/q as p → ∞

All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially: the execution time using p processors can be reduced to qT(1) at best, and the speed-up cannot exceed 1/q.
Example: A 4-processor computer executes instructions that are fetched from a random-access memory over a shared bus, as shown below.

The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds.
2. Execute instruction (parallel part): it takes 10 microseconds.

Thus q = 30/(30 + 10) = 0.75, and
S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23
Now, suppose that the number of processors is doubled. Then
S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28
Suppose that the number of processors is doubled again. Then
S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 = 1.30.
What is the limit? S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) → 1/0.75 = 1.333 as p → ∞.
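The whole shared-bus example can be reproduced in a few lines (`amdahl` is my name for the function):

```python
def amdahl(p, q=0.75):
    # q = 30/(30+10): fraction of the task (the fetch) that stays serial
    return 1 / (q + (1 - q) / p)
```

amdahl(4), amdahl(8) and amdahl(16) reproduce the 1.23, 1.28 and 1.30 values above, and the limit for large p is 1/0.75 = 1.333.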
Alternate Forms of Amdahl's Law
S = T(1) / (T_unenhanced + T_enhanced) = T(1) / (T(1)·(q + (1 − q)/s)) = 1/(q + (1 − q)/s) → 1/q as s → ∞

where s is the speed-up of the part of the computation that can be enhanced.
Example: Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you expect in executing a typical program assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
Using Amdahl's Law with q = 0.8 (the fetch fraction, which is not enhanced) and s = 2, we have

S = 1/(0.8 + 0.2/2) = 1/0.9 = 1.111
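As a numeric check (a sketch using the fractions from the example):

```python
# fetch is 4x the execute time, so q = 0.8 of the time is memory-bound and
# unaffected by the upgrade; only the execute fraction is sped up by s = 2
q, s = 0.8, 2
upgrade_speedup = 1 / (q + (1 - q) / s)  # = 1/0.9, about 1.111
```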
Very disappointing as you are likely to have paid quite a bit of money for the upgrade!
Generalized Amdahl's Law
In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speed-up of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:
S(p1, p2, ..., pk) = T(1) / T(p1, p2, ..., pk)
                  <= T(1) / (q1·T(1)/p1 + q2·T(1)/p2 + ... + qk·T(1)/pk)
                   = 1 / (q1/p1 + q2/p2 + ... + qk/pk)

where q1 + q2 + ... + qk = 1.

When k = 2, q1 = q, q2 = 1 − q, p1 = 1, p2 = p, this formula reduces to Amdahl's Law.
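The generalized formula, with the k = 2 reduction as a sanity check (the function name is mine):

```python
def generalized_amdahl(qs, ps):
    # qs: work fractions (summing to 1); ps: processors given to each part
    assert abs(sum(qs) - 1.0) < 1e-12
    return 1 / sum(q / p for q, p in zip(qs, ps))
```

generalized_amdahl([q, 1 - q], [1, p]) coincides with the plain Amdahl bound 1/(q + (1 − q)/p).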
Remark:The generalized Amdahl's Law can also be rewritten to express the speed-up due to different amounts of speed enhancement (Se) that can be made to different parts of a system:
Se(s1, s2, ..., sk) = T(1) / T(s1, s2, ..., sk)
                    = T(1) / (q1·T(1)/s1 + q2·T(1)/s2 + ... + qk·T(1)/sk)
                    = 1 / (q1/s1 + q2/s2 + ... + qk/sk)

where q1 + q2 + ... + qk = 1.
Example: Suppose that your computer executes a program that has the following profile of execution:

(a) 30% integer operations, (b) 20% floating-point operations, (c) 50% memory reference instructions.

How much speed-up will you expect if you double the speed of the floating-point unit of your computer? Using the formula above:

Se = 1/(0.3 + 0.2/2 + 0.5) = 1/0.9 = 1.11
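Numerically (a sketch using the enhancement form of the formula):

```python
# 30% integer, 20% floating point (enhanced by s = 2), 50% memory reference
fp_upgrade_speedup = 1 / (0.3 + 0.2 / 2 + 0.5)  # about 1.11
```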
Example: Suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require
(a) 40% integer operations, (b) 60% floating-point operations,
If every dollar spent on the integer unit after $50 decreases its execution time by 2%, and if every dollar spent on the floating-point unit after $100 decreases its execution time by 1%, how would you spend the $500?
Example (Continued):
S = T(1) / (Ti(x1) + Tf(x2)),  where x1 + x2 = 350

(x1 and x2 are the dollars spent beyond the fixed costs of $50 and $100.)

Ti(x1) = (1 − 0.02)·Ti(x1 − 1)  ⇒  Ti(x1) = 0.98^x1 · Ti(0)
Tf(x2) = (1 − 0.01)·Tf(x2 − 1)  ⇒  Tf(x2) = 0.99^x2 · Tf(0)

Ti(0) = 0.4·T(1),  Tf(0) = 0.6·T(1)

Substituting these into the generalized Amdahl speed-up expression gives:

S = T(1) / (0.98^x1 × 0.4 × T(1) + 0.99^x2 × 0.6 × T(1)) = 1 / (0.4 × 0.98^x1 + 0.6 × 0.99^x2)
Example (Continued):

So we maximize

1 / (0.4 × 0.98^x1 + 0.6 × 0.99^x2)  subject to x1 + x2 = 350,

or, equivalently, maximize

1 / (0.4 × 0.98^x1 + 0.6 × 0.99^(350 − x1))  subject to 0 <= x1 <= 350.
Computing the values in the neighborhood of x1 = 120 reveals that the speed-up is maximized when x1 = 126 (i.e., spend $50 + $126 = $176 on the integer unit and $100 + $224 = $324 on the floating-point unit).
From Mathematica:
Table[1/ (0.4 * 0.98^x + 0.6 * 0.99 ^(350 - x)),{ x, 120,128,1}]
{10.5398,10.5518,10.5616,10.5691,10.5744,10.5776,10.5785,10.5773,10.574}
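The optimum can also be confirmed by brute force over all integer splits (a sketch; the variable names are mine):

```python
def budget_speedup(x1):
    # x1 dollars beyond the fixed cost go to the integer unit,
    # the remaining x2 = 350 - x1 to the floating-point unit
    x2 = 350 - x1
    return 1 / (0.4 * 0.98**x1 + 0.6 * 0.99**x2)

best_x1 = max(range(351), key=budget_speedup)  # 126, matching the table
```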
Note: It is possible to obtain a higher speed-up by investing all of the money in one of the units if the fixed cost of the other unit becomes sufficiently large.
Addendum: If the changes in performance due to upgrades are specified in terms of speed rather than time, we can then use the following formulation, where Δs/s denotes the fractional change in speed.
t = L/s

Δt/Δx = (Δt/Δs)·(Δs/Δx) = −(L/s^2)·(Δs/Δx) = −(L/s)·(Δs/s)·(1/Δx)

⇒ Δt = −(L/s)·(Δs/s) = −t·(Δs/s)

⇒ Δt = T(x) − T(x − 1) = −T(x − 1)·(Δs/s)

⇒ T(x) = (1 − Δs/s)·T(x − 1)