ENEE350 Lecture Notes-Weeks 14 and 15 Pipelining & Amdahl’s Law
Pipelining is a method of processing in which a problem is divided into a number of sub-problems, and the solutions of the sub-problems for different instances of the problem are overlapped in time.
Example: a[i] = b[i] + c[i] + d[i] + e[i] + f[i], i = 1, 2, 3,…,n
[Figure: a linear pipeline of four adders. The inputs b[1], c[1], d[1], e[1], f[1] (then b[2], c[2], ... on later clocks) enter on the left; delay latches (D) hold the later operands so that each partial sum meets its next operand at the right adder, and the results a[1], a[2], ... emerge on the right.]
Each adder has delay D to compute.
Computation time = 4D + (n − 1)D = nD + 3D
Speed-up = 4nD / (nD + 3D) → 4 for large n.
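The timing argument above can be checked numerically. The sketch below assumes the 4-adder pipeline model just described (the function names are mine):

```python
# A minimal sketch of the 4-stage adder pipeline timing, assuming each
# adder has delay D and a new set of operands enters every D.
def pipeline_time(n, D=1.0):
    # first sum emerges after 4D; one more result every D thereafter
    return 4 * D + (n - 1) * D

def pipeline_speedup(n, D=1.0):
    serial = 4 * n * D  # four dependent additions per element, no overlap
    return serial / pipeline_time(n, D)
```

For n = 1 the speed-up is 1 (nothing overlaps); as n grows it approaches the number of stages, 4.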
We can describe the computation process in a linear pipeline algorithmically. There are three distinct phases to this computation: (a) filling the pipeline, (b) running the pipeline in the filled state until the last input arrives, and (c) emptying the pipeline.
(linear pipeline)
while (1) {
    resetLatches();
    clock = 0;
    // fill the pipeline
    for (j = 0; j <= n-1; j++) {
        for (k = 0; k <= j; k++)
            segment(k);
        clock++;
    }
    // execute all segments until the last input arrives
    while (clock <= m) {
        for (j = 0; j <= n-1; j++)
            segment(j);
        clock++;
    }
    // empty the pipeline
    for (j = 0; j <= n-1; j++) {
        for (k = j; k <= n-1; k++)
            segment(k);
        clock++;
    }
}
Instruction pipelines:

clock  fetch  decode  execute
0      I1     -       -
1      I2     I1      -
2      I3     I2      I1
3      I4     I3      I2
4      -      I4      I3
Goal: (i) to increase the throughput (number of instructions/sec) in executing programs; (ii) to reduce the execution time (clock cycles/instruction, etc.).
clock  fetch  decode  execute  memory  write-back
0      I1     -       -        -       -
1      I2     I1      -        -       -
2      I3     I2      I1       -       -
3      I4     I3      I2       I1      -
4      I5     I4      I3       I2      I1
Speed-up of pipelined execution of instructions over a sequential execution:

S(5) = T_u / T_p = (CPI_u × N_u / f_u) / (CPI_p × N_p / f_p)

where N is the number of instructions executed, CPI the average number of clock cycles per instruction, and f the clock frequency (subscript u: unpipelined, p: pipelined). Assuming that the systems operate at the same clock rate and execute the same number of instructions:

S(5) = CPI_u / CPI_p = CPI_u / 1
Example: Suppose that the instruction mix of programs executed on serial and pipelined machines is 40% ALU, 20% branching, and 40% memory instructions, with 4, 2, and 4 cycles per instruction in the three classes, respectively.

Then, under ideal conditions (no stalls due to hazards):

S(5) = CPI_u / CPI_p = (4 × 0.4 + 2 × 0.2 + 4 × 0.4) / 1 = 3.6
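A quick numeric check of this instruction-mix example (the dictionary layout is mine):

```python
# instruction mix: (fraction of instructions, unpipelined cycles) per class
mix = {"ALU": (0.40, 4), "branch": (0.20, 2), "memory": (0.40, 4)}

cpi_unpipelined = sum(frac * cycles for frac, cycles in mix.values())
cpi_pipelined = 1.0  # ideal pipeline, one instruction completed per clock
speedup = cpi_unpipelined / cpi_pipelined
```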
If the clock period needs to be increased for the pipelined implementation (e.g., due to latch overhead), then the speed-up will have to be scaled down accordingly.
MIPS Pipeline

Register operations:        IF ID EX WB
Register/Memory operations: IF ID EX ME WB

Instruction Pipelines (Hennessy & Patterson)
Hazards

1- Structural Hazards
2- Data Hazards
3- Control Hazards

Structural Hazards: They arise when limited resources are scheduled to operate concurrently on different streams during the same clock period.
Example: Memory conflict (data fetch + instruction fetch) or datapath conflict (arithmetic operation + PC update)
Clock  IF  ID  EX  ME  WB
0      I1  -   -   -   -
1      I2  I1  -   -   -
2      I3  I2  I1  -   -
3      I4  I3  I2  I1  -
4      I5  I4  I3  I2  I1
5      I6  I5  I4  I3  I2
6      I7  I6  I5  I4  I3
Fixes: duplicate the hardware (too expensive), or stall the pipeline to serialize the conflicting operations (too slow).
Clock  IF  ID  EX  ME  WB
0      I1  -   -   -   -
1      I2  I1  -   -   -
2      -   I2  I1  -   -
3      -   -   I2  I1  -
4      I3  -   -   I2  I1
5      I4  I3  -   -   I2
6      -   I4  I3  -   -
7      -   -   I4  I3  -
8      I5  -   -   I4  I3
9      I6  I5  -   -   I4

(The fetch stage is stalled whenever the conflicting stage is busy, so instructions enter the pipeline in pairs.)
Speed-up = T_serial / T_pipeline = 5n·t_s / (2n·t_s + 2t_s) for odd n, and 5n·t_s / (2n·t_s + 3t_s) for even n → 5/2 as the number of instructions, n, tends to infinity.
Thus, we lose half the throughput due to stalls.
Note: The pipeline time of execution can be computed using the recurrences
T1 = 4
Ti = Ti-1 + 1 for even i
Ti = Ti-1 + 3 for odd i
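The recurrence can be evaluated directly; the sketch below also confirms that the speed-up tends to 5/2 (the function name is mine):

```python
# completion clock of instruction i under the stall pattern above:
# T1 = 4; each even-numbered instruction adds 1 clock, each odd one adds 3
def completion_clock(i):
    t = 4
    for j in range(2, i + 1):
        t += 1 if j % 2 == 0 else 3
    return t
```

For n instructions the serial machine needs 5n clocks, so the ratio 5n / completion_clock(n) approaches 5/2 as n grows.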
Data Hazards
They occur when the executions of two instructions may result in the incorrect reading of operands and/or writing of a result.
Read After Write (RAW) Hazard (data dependency)
Write After Read (WAR) Hazard (anti-dependency)
Write After Write (WAW) Hazard (output dependency)
RAW Hazards
They occur when reads are early and writes are late.
Clock  IF  ID  EX         ME  WB
0      I1  -   -          -   -
1      I2  I1  -          -   -
2      I3  I2  I1         -   -
3      I4  I3  I2 (Read)  I1  -
4      I5  I4  I3         I2  I1 (Write)
5      I6  I5  I4         I3  I2
6      I7  I6  I5         I4  I3

I1: R1 = R1 + R2
I2: R3 = R1 + R2

I2 reads R1 (clock 3) before I1 writes it back (clock 4), so I2 uses a stale value.
RAW Hazards (Cont'd): They can be avoided by stalling the reads, but this increases the execution time. A better approach is to use data forwarding:
Clock  IF  ID  EX         ME  WB
0      I1  -   -          -   -
1      I2  I1  -          -   -
2      I3  I2  I1         -   -
3      I4  I3  I2 (Read)  I1  -
4      I5  I4  I3         I2  I1 (Write)
5      I6  I5  I4         I3  I2
6      I7  I6  I5         I4  I3

I1: R1 = R1 + R2
I2: R3 = R1 + R2

With forwarding, I1's result is passed from the pipeline latch directly to I2's EX stage at clock 3, so I2 reads the correct value without stalling.
WAR Hazards: They occur when writes are early and reads are late.
Clock  IF  ID  EX  ME  WB          |  EX  ME  WB
0      I1
1      I2  I1
2      I3  I2  I1
3      I4  I3  I2  I1
4      I5  I4  I3  I2  I1
5      I6  I5  I4  I3  I2 (Write)  |  I1 (Read)
6      I7  I6  I5  I4  I3          |  I2  I1

(Each instruction issues a second operation that flows through a second, lagging EX/ME/WB pipe.)

I1: R2 = R2 + R3 ; R9 = R3 + R4
I2: R3 = R7 + R5 ; R6 = R2 + R8

I2's write of R3 (clock 5, first pipe) occurs before I1's second operation reads R3 in the lagging pipe, so I1 reads the wrong value.
Branch Prediction in Pipeline Instruction Sequencing
One of the major issues in pipelined instruction processing is scheduling conditional branch instructions. When the pipeline controller encounters a conditional branch instruction, it must choose between two instruction streams: if the branch condition is met, execution continues from the target of the conditional branch instruction; otherwise, it continues with the instruction that follows it.
Example: Suppose that we execute the following assembly code on a 5-stage pipeline (IF, ID, EX, ME, WB):

      JCD R0 < 10, add;
      SUB R0,R1;
      JMP D, halt;
add:  ADD R0,R1;
halt: HLT;
If we assume that R0 < 10, then the SUB instruction would have been incorrectly fetched during the second clock cycle, and we would need another fetch cycle to fetch the ADD instruction.
Classification of branch prediction algorithms
Static Branch Prediction: The branch decision does not change over time-- we use a fixed branching policy.
Dynamic Branch Prediction: The branch decision does change over time-- we use a branching policy that varies over time.
Static Branch Prediction Algorithms
1- Don't predict (stall the pipeline)
2- Never take the branch
3- Always take the branch
4- Delayed branch
1- Stall the pipeline by 1 clock cycle: This allows us to determine the target of the branch instruction.
JCD  IF  ID  EX  ME  WB
SUB      (not fetched; pipeline stalled one cycle)
ADD          IF  ID  EX  ME  WB

Stall one cycle and decide the branch.
Pipeline Execution Speed (stall case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + number of idle cycles per instruction
                    = 1 + branch penalty × branch frequency
                    = 1 + branch frequency (penalty = 1 cycle)
In general, CPI of the pipeline > 1 + branch frequency because of data and possibly structural hazards
Pros: Straightforward to implement
Cons: The time overhead is high when the instruction mix includes a high percentage of branch instructions.
2- Never take the branch: Execution continues along the fall-through path; the instruction in the pipeline is flushed if, after the ID stage is carried out, it is determined that the branch should have been taken.
JCD  IF  ID  EX  ME  WB
SUB      IF  ID  EX  ME  WB
IOR          IF  ID  EX  ME  WB   (branch not taken)
XOR          IF  ID  EX  ME  WB   (branch taken; SUB is flushed)

The SUB instruction is always fetched: if the branch is not taken, SUB completes and the IOR instruction executes next; if the branch is taken, SUB is flushed and the XOR instruction at the branch target is executed.
Pipeline Execution Speed (Never take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + number of idle cycles per instruction
                    = 1 + branch penalty × branch frequency × misprediction rate
                    = 1 + branch frequency × misprediction rate (penalty = 1 cycle)
Pros: If the prediction is highly accurate then the pipeline can operate close to its full throughput.
Cons: Implementation is not as straightforward and requires flushing if decoding the branch address takes more than 1 clock cycle.
3- Always take the branch: The instruction in the pipeline is flushed if it is determined that the branch should not have been taken.

JCD  IF  ID  EX  ME  WB
SUB      IF  ID  EX  ME  WB
IOR          IF  ID  EX  ME  WB
XOR          IF  ID  EX  ME  WB

(The branch target address computation completes only after the EX segment.)
Pipeline Execution Speed (Always take the branch case):
Assuming only branch hazards, we can compute the average number of clock cycles per instruction (CPI) as
CPI of the pipeline = CPI of ideal pipeline + number of idle cycles per instruction
                    = 1 + branch penalty × branch frequency × prediction rate + branch penalty × branch frequency × misprediction rate
                    = 1 + branch frequency × prediction rate + 2 × branch frequency × misprediction rate
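The CPI formulas for the three static policies can be compared directly. This is a sketch with illustrative (assumed) numbers, not measurements:

```python
def cpi_stall(branch_freq):
    # every branch idles the pipeline for one cycle
    return 1 + branch_freq

def cpi_never_taken(branch_freq, miss_rate):
    # a 1-cycle flush is paid only on mispredicted branches
    return 1 + branch_freq * miss_rate

def cpi_always_taken(branch_freq, miss_rate):
    # 1 cycle even when right (target address computed late), 2 when wrong
    return 1 + branch_freq * (1 - miss_rate) + 2 * branch_freq * miss_rate
```

With an assumed branch frequency of 0.2 and misprediction rate of 0.3, these give 1.20, 1.06 and 1.26 cycles per instruction, respectively.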
Pros: Better suited for the execution of loops without the compiler's intervention (but this can generally be overcome; see the next slide).
Cons: Implementation is not as straightforward, and it has a higher misprediction penalty. Not as advantageous as never taking the branch, since the branch address computation is not completed until after the EX segment is carried out.
Example: for (i = 0; i < 10; i++) a[i] = a[i] + 1;
"Branch always" will not work well without the compiler's help:

      CLR R0;
loop: JCD R0 >= 10, exit;
      LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JMP D, loop;
exit:
----------------------------------------------------------
"Branch always" will work well without the compiler's help:

      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 < 10, loop;
4- Delayed branch: Insert an instruction after a branch instruction, and always execute it whether or not the branch condition applies. Of course, this must be an instruction that can be executed without any side effects on the correctness of the program.
Pros: The pipeline is never stalled or flushed, and with the correct choice of delay-slot instruction, performance can approach that of an ideal pipeline.
Cons: It is not always possible to find a suitable delay-slot instruction, in which case a NOP instruction may have to be inserted into the delay slot to make sure that the program's integrity is not violated. It also makes the compiler work harder.
Which instruction to place into the delayed branch slot?
4.1- Choose an instruction from before the branch, but make sure that the branch does not depend on the moved instruction. If such an instruction can be found, this always pays off.
Example:
ADD R1,R2;
JCD R2 > 10, exit;

can be rescheduled as

JCD R2 > 10, exit;
ADD R1,R2;    (delay slot)
4.2- Choose an instruction from the target of the branch, but make sure that the moved instruction is executable when the branch is not taken.

Example:

      ADD R1,R2;
      JCD R2 > 10, sub;
      JMP D, add;
      ...
sub:  SUB R4,R5;
add:  ADI R3,5;

can be rescheduled as

      ADD R1,R2;
      JCD R2 > 10, sub;
      ADI R3,5;    (delay slot)
      ...
sub:  SUB R4,R5;
4.3- Choose an instruction from the anti-target (fall-through path) of the branch, but make sure that the moved instruction is executable when the branch is taken.

Example:

      JCD R2 > 10, exit;
      ADD R3,R2;
exit: SUB R4,R5;

can be rescheduled as

      JCD R2 > 10, exit;
      ADD R3,R2;    (delay slot; schedule for execution only if it does not alter the program flow or output)
exit: SUB R4,R5;
Dynamic Branch Prediction
-- Dynamic branch prediction relies on the history of how branch conditions were resolved in the past.
-- The history of branches is kept in a buffer. To keep this buffer reasonably small and easy to access, it is indexed by some fixed number of low-order bits of the address of the branch instruction.
-- The assumption is that the low-order address bits are unique enough to prevent frequent collisions or overrides. Thus, if we are trying to predict branches in a program which remains within a block of 256 locations, 8 bits should suffice.
[Figure: a branch-history buffer indexed by the low-order 8 bits of the branch address; JCD instructions at addresses x and x+256 map to the same buffer entry.]
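The indexing scheme can be sketched in a few lines (`buffer_index` is my name, and the addresses are made up for illustration):

```python
def buffer_index(branch_addr, bits=8):
    # keep only the low-order `bits` bits of the branch address
    return branch_addr & ((1 << bits) - 1)

x = 0x1234  # an arbitrary branch address
collides = buffer_index(x) == buffer_index(x + 256)  # True: 256 apart
```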
Branch instructions in the instruction cache include a branch prediction field that is used to predict if the branch should be taken.
Memory location   Program              Branch prediction field
x                 Branch instruction   0 (branch was not taken)
x+4               ...
x+8               Branch instruction   0 (branch was not taken)
x+12              ...
x+16              ...
x+20              Branch instruction   1 (branch was taken)
Branch prediction: In the simplest case, the field is a 1-bit tag:

0 <=> branch was not taken last time (State A)
1 <=> branch was taken last time (State B)

State transitions:
A --taken--> B        A --not taken--> A
B --not taken--> A    B --taken--> B

While in state A, predict the branch as "not to be taken".
While in state B, predict the branch as "to be taken".
This works relatively well: it accurately predicts the branches of a loop in all but two of the iterations.

      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 < 10, loop;
Assuming that we begin in state A, prediction fails when R0 = 1 (branch is predicted not taken when it should be taken) and when R0 = 10 (branch is predicted taken when it should not be).
Assuming that we begin in state B, prediction fails only when R0 = 10 (branch is predicted taken when it should not be).
We can modify the loop so that the branch prediction algorithm fails twice when we begin in state B as well.

      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 >= 10, exit;
      JMP D, loop;
exit:

Assuming that we begin in state B, prediction fails when R0 = 1 (branch is predicted taken when it should not be) and when R0 = 10 (branch is predicted not taken when it should be taken).
What is worse, we can make this branch prediction algorithm fail on every prediction:

      LDI R0,1;
loop: JCD R0 > 0, neg;
      LDI R0,1;
      JMP D, loop;
neg:  LDI R0,-1;
      JMP D, loop;

Assuming that we begin in state A, prediction fails when R0 = 1 (branch is predicted not taken when it should be taken), then when R0 = -1 (branch is predicted taken when it should not be), and so on, alternating forever.
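The 1-bit scheme can be simulated directly. The sketch below encodes state A as "predict not taken" and remembers only the last outcome:

```python
def misses_1bit(outcomes, state="A"):
    # state "A": predict not taken; state "B": predict taken
    misses = 0
    for taken in outcomes:
        if (state == "B") != taken:
            misses += 1
        state = "B" if taken else "A"  # remember only the last outcome
    return misses
```

The loop branch (taken for R0 = 1..9, not taken at R0 = 10) misses twice starting from state A, while the alternating branch of the last example misses every single time.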
2-bit prediction (a more reluctant flip in decision):

While in states A1 and A2, predict the branch as "not to be taken".
While in states B2 and B1, predict the branch as "to be taken".

The states form a saturating counter:
on taken:     A1 --> A2 --> B2 --> B1 --> B1
on not taken: B1 --> B2 --> A2 --> A1 --> A1

From a strong state (A1 or B1), two consecutive mispredictions are needed to reverse the predicted direction.
      CLR R0;
loop: LDD R1,R0;
      ADD R1,1;
      ST+ R1,R0;
      JCD R0 < 10, loop;

Assuming that we begin in state A1, prediction fails when R0 = 1, 2 (branch is predicted not taken when it should be taken) and when R0 = 10 (branch is predicted taken when it should not be).
Assuming that we begin in state B1, prediction fails only when R0 = 10 (branch is predicted taken when it should not be).
2-bit predictors are more resilient to branch inversions (the prediction is reversed only when it misses twice):

      LDI R0,1;
loop: JCD R0 > 0, neg;
      LDI R0,1;
      JMP D, loop;
neg:  LDI R0,-1;
      JMP D, loop;

Assuming that we begin in state B1, prediction succeeds when R0 = 1 (branch is taken as predicted), fails when R0 = -1 (branch is predicted taken when it should not be), succeeds again when R0 = 1, fails again when R0 = -1, and so on: it fails only every other time.
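A sketch of the 2-bit saturating counter, with the loop and inversion examples as checks (the state encoding 0..3 = A1, A2, B2, B1 is mine):

```python
def misses_2bit(outcomes, state=0):
    # states 0,1 (A1, A2) predict not taken; states 2,3 (B2, B1) predict taken
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        # saturating counter: step toward taken (up) or not-taken (down)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses
```

Starting from A1, the loop branch misses three times (R0 = 1, 2 and the exit); starting from B1, the alternating branch misses only every other time.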
Amdahl's Law (Fixed Load Speed-up)
Let q be the fraction of a load L that cannot be sped up by introducing more processors, and let T(p) be the amount of time it takes to execute L on p processors, p > 1, assuming the remaining work parallelizes linearly. Then
T(p) >= q·T(1) + (1 − q)·T(1)/p

S(p) = T(1)/T(p) <= 1/(q + (1 − q)/p) → 1/q as p → ∞

All this means is that the maximum speed-up of a system is limited by the fraction of the work that must be completed sequentially: the execution time using p processors can be reduced to qT(1) at best, and the speed-up cannot exceed 1/q.
Example: A 4-processor computer executes instructions that are fetched from a random-access memory over a shared bus, as shown below.

The task to be performed is divided into two parts:
1. Fetch instruction (serial part): it takes 30 microseconds.
2. Execute instruction (parallel part): it takes 10 microseconds.

Thus q = 30/(30 + 10) = 0.75, and
S(4) = T(1)/T(4) = 1/(0.75 + 0.25/4) = 4/3.25 = 1.23
Now, suppose that the number of processors is doubled. Then
S(8) = T(1)/T(8) = 1/(0.75 + 0.25/8) = 8/6.25 = 1.28
Suppose that the number of processors is doubled again. Then
S(16) = T(1)/T(16) = 1/(0.75 + 0.25/16) = 16/12.25 = 1.30.
What is the limit? S(p) = T(1)/T(p) = 1/(0.75 + 0.25/p) → 1/0.75 = 1.333 as p → ∞.
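The whole shared-bus example can be reproduced in a few lines (`amdahl` is my name for the function):

```python
def amdahl(p, q=0.75):
    # q = 30/(30+10): fraction of the task (the fetch) that stays serial
    return 1 / (q + (1 - q) / p)
```

amdahl(4), amdahl(8) and amdahl(16) reproduce the 1.23, 1.28 and 1.30 values above, and the limit for large p is 1/0.75 = 1.333.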
Alternate Forms of Amdahl's Law
S = T(1) / (T_unenhanced + T_enhanced) = T(1) / (T(1)·(q + (1 − q)/s)) = 1/(q + (1 − q)/s) → 1/q as s → ∞

where s is the speed-up of the part of the computation that can be enhanced.
Example: Suppose that you've upgraded your computer from a 2 GHz processor to a 4 GHz processor. What is the maximum speed-up you expect in executing a typical program assuming that (1) the speed of fetching each instruction is directly proportional to the speed of reading an instruction from the primary memory of your computer, and reading an instruction takes four times longer than executing it, (2) the speed of executing each instruction is directly proportional to the clock speed of the processor of your computer?
Using Amdahl's Law with q = 0.8 (the fetch fraction, which is not enhanced) and s = 2, we have

S = 1/(0.8 + 0.2/2) = 1/0.9 = 1.111
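As a numeric check (a sketch using the fractions from the example):

```python
# fetch is 4x the execute time, so q = 0.8 of the time is memory-bound and
# unaffected by the upgrade; only the execute fraction is sped up by s = 2
q, s = 0.8, 2
upgrade_speedup = 1 / (q + (1 - q) / s)  # = 1/0.9, about 1.111
```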
Very disappointing as you are likely to have paid quite a bit of money for the upgrade!
Generalized Amdahl's Law
In general, a task may be partitioned into a set of subtasks, with each subtask requiring a designated number of processors to execute. In this case, the speed-up of the parallel execution of the task over its sequential execution can be characterized by the following, more general formula:
S(p1, p2, ..., pk) = T(1) / T(p1, p2, ..., pk)
                  <= T(1) / (q1·T(1)/p1 + q2·T(1)/p2 + ... + qk·T(1)/pk)
                   = 1 / (q1/p1 + q2/p2 + ... + qk/pk)

where q1 + q2 + ... + qk = 1.

When k = 2, q1 = q, q2 = 1 − q, p1 = 1, p2 = p, this formula reduces to Amdahl's Law.
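The generalized formula, with the k = 2 reduction as a sanity check (the function name is mine):

```python
def generalized_amdahl(qs, ps):
    # qs: work fractions (summing to 1); ps: processors given to each part
    assert abs(sum(qs) - 1.0) < 1e-12
    return 1 / sum(q / p for q, p in zip(qs, ps))
```

generalized_amdahl([q, 1 - q], [1, p]) coincides with the plain Amdahl bound 1/(q + (1 − q)/p).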
Remark:The generalized Amdahl's Law can also be rewritten to express the speed-up due to different amounts of speed enhancement (Se) that can be made to different parts of a system:
Se(s1, s2, ..., sk) = T(1) / T(s1, s2, ..., sk)
                    = T(1) / (q1·T(1)/s1 + q2·T(1)/s2 + ... + qk·T(1)/sk)
                    = 1 / (q1/s1 + q2/s2 + ... + qk/sk)

where q1 + q2 + ... + qk = 1.
Example: Suppose that your computer executes a program that has the following profile of execution:

(a) 30% integer operations, (b) 20% floating-point operations, (c) 50% memory reference instructions.

How much speed-up will you expect if you double the speed of the floating-point unit of your computer? Using the formula above:

Se = 1/(0.3 + 0.2/2 + 0.5) = 1/0.9 = 1.11
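Numerically (a sketch using the enhancement form of the formula):

```python
# 30% integer, 20% floating point (enhanced by s = 2), 50% memory reference
fp_upgrade_speedup = 1 / (0.3 + 0.2 / 2 + 0.5)  # about 1.11
```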
Example: Suppose that you have a fixed budget of $500 to upgrade each of the computers in your laboratory, and you find out that the computations you perform on your computers require
(a) 40% integer operations, (b) 60% floating-point operations,
If every dollar spent on the integer unit after $50 decreases its execution time by 2%, and if every dollar spent on the floating-point unit after $100 decreases its execution time by 1%, how would you spend the $500?
Example (Continued):
S = T(1) / (Ti(x1) + Tf(x2)),  where x1 + x2 = 350

(x1 and x2 are the dollars spent beyond the fixed costs of $50 and $100.)

Ti(x1) = (1 − 0.02)·Ti(x1 − 1)  ⇒  Ti(x1) = 0.98^x1 · Ti(0)
Tf(x2) = (1 − 0.01)·Tf(x2 − 1)  ⇒  Tf(x2) = 0.99^x2 · Tf(0)

Ti(0) = 0.4·T(1),  Tf(0) = 0.6·T(1)

Substituting these into the generalized Amdahl speed-up expression gives:

S = T(1) / (0.98^x1 × 0.4 × T(1) + 0.99^x2 × 0.6 × T(1)) = 1 / (0.4 × 0.98^x1 + 0.6 × 0.99^x2)
Example (Continued):

So we maximize

1 / (0.4 × 0.98^x1 + 0.6 × 0.99^x2)  subject to x1 + x2 = 350,

or, equivalently, maximize

1 / (0.4 × 0.98^x1 + 0.6 × 0.99^(350 − x1))  subject to 0 <= x1 <= 350.
Computing the values in the neighborhood of x1 = 120 reveals that the speed-up is maximized when x1 = 126 (i.e., spend $50 + $126 = $176 on the integer unit and $100 + $224 = $324 on the floating-point unit).
From Mathematica:
Table[1/ (0.4 * 0.98^x + 0.6 * 0.99 ^(350 - x)),{ x, 120,128,1}]
{10.5398,10.5518,10.5616,10.5691,10.5744,10.5776,10.5785,10.5773,10.574}
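The optimum can also be confirmed by brute force over all integer splits (a sketch; the variable names are mine):

```python
def budget_speedup(x1):
    # x1 dollars beyond the fixed cost go to the integer unit,
    # the remaining x2 = 350 - x1 to the floating-point unit
    x2 = 350 - x1
    return 1 / (0.4 * 0.98**x1 + 0.6 * 0.99**x2)

best_x1 = max(range(351), key=budget_speedup)  # 126, matching the table
```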
Note: It is possible to obtain a higher speed-up by investing all of the money in one of the units if the fixed cost of the other unit becomes sufficiently large.
Addendum: If the changes in performance due to upgrades are specified in terms of speed rather than time, we can then use the following formulation, where Δs/s denotes the fractional change in speed.
t = L/s

Δt/Δx = (Δt/Δs)·(Δs/Δx) = −(L/s^2)·(Δs/Δx) = −(L/s)·(Δs/s)·(1/Δx)

⇒ Δt = −(L/s)·(Δs/s) = −t·(Δs/s)

⇒ Δt = T(x) − T(x − 1) = −T(x − 1)·(Δs/s)

⇒ T(x) = (1 − Δs/s)·T(x − 1)