CPSC614Lec 6.1
Exploiting Instruction-Level Parallelism with Software Approach
#1
E. J. Kim
CPSC614Lec 6.2
• To avoid a pipeline stall, a dependent instruction must be separated from the source instruction by a distance in clock cycles equal to the pipeline latency of that source instruction.
• Goal: to keep the pipeline full.
CPSC614Lec 6.3
Latencies
Inst. producing result   Inst. using result   Latency in clock cycles
FP ALU op                Another FP op        3
FP ALU op                Store double         2
Load double              FP ALU op            1
Load double              Store double         0

Also assumed:
  Branch delay: 1
  Integer ALU op - branch: 1
  Integer load: 1
  Integer ALU - integer ALU: 1
CPSC614Lec 6.4
Example
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, LOOP
CPSC614Lec 6.5
Without any Scheduling
Clock cycle issued
                              Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      stall                   2
      ADD.D  F4, F0, F2       3
      stall                   4
      stall                   5
      S.D    F4, 0(R1)        6
      DADDIU R1, R1, #-8      7
      stall                   8
      BNE    R1, R2, LOOP     9
      stall                   10

10 clock cycles per element.
CPSC614Lec 6.6
With Scheduling
Clock cycle issued
                              Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      DADDIU R1, R1, #-8      2
      ADD.D  F4, F0, F2       3
      stall                   4
      BNE    R1, R2, LOOP     5    (delayed branch)
      S.D    F4, 8(R1)        6    (offset adjusted past the DADDIU: not trivial)

6 clock cycles per element.
CPSC614Lec 6.7
• The actual work of operating on the array element takes 3 of the 6 clock cycles (load, add, store).
• The remaining 3 cycles are:
  – Loop overhead (DADDIU, BNE)
  – Stall
• To eliminate these 3 cycles, we need to get more operations within the loop relative to the number of overhead instructions.
CPSC614Lec 6.8
Reducing Loop Overhead

• Loop Unrolling
  – Simple scheme for increasing the number of instructions relative to the branch and overhead instructions
  – Simply replicates the loop body multiple times, adjusting the loop termination code.
  – Improves scheduling: it allows instructions from different iterations to be scheduled together.
  – Uses different registers for each iteration.
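As a C-level sketch of what this transformation does (our own illustration, not from the text: the function name is hypothetical, and we assume the trip count of 1000 is a multiple of the unroll factor 4), each replicated copy gets its own temporary, mirroring the use of different registers per iteration:

```c
/* Original loop: for (i = 1000; i > 0; i--) x[i] = x[i] + s;
   Unrolled by 4. A distinct temporary ("register") per copy makes the
   copies independent, so a scheduler can interleave them freely.
   Assumes the trip count (1000) is a multiple of 4. */
void add_scalar_unrolled(double *x, double s) {
    for (int i = 1000; i > 0; i -= 4) {
        double t0 = x[i]     + s;   /* iteration i   */
        double t1 = x[i - 1] + s;   /* iteration i-1 */
        double t2 = x[i - 2] + s;   /* iteration i-2 */
        double t3 = x[i - 3] + s;   /* iteration i-3 */
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Only one loop-overhead pair (decrement, branch) now executes per four array elements.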
CPSC614Lec 6.9
Unrolled Loop (No Scheduling)
Clock cycle issued
                              Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      stall                   2
      ADD.D  F4, F0, F2       3
      stall                   4
      stall                   5
      S.D    F4, 0(R1)        6
      L.D    F6, -8(R1)       7
      stall                   8
      ADD.D  F8, F6, F2       9
      stall                   10
      stall                   11
      S.D    F8, -8(R1)       12
      L.D    F10, -16(R1)     13
      stall                   14
      ADD.D  F12, F10, F2     15
      stall                   16
      stall                   17
      S.D    F12, -16(R1)     18
      L.D    F14, -24(R1)     19
      stall                   20
      ADD.D  F16, F14, F2     21
      stall                   22
      stall                   23
      S.D    F16, -24(R1)     24
      DADDIU R1, R1, #-32     25
      stall                   26
      BNE    R1, R2, LOOP     27
      stall                   28

28 clock cycles for 4 elements: 7 per element.
CPSC614Lec 6.10
Loop Unrolling
• Loop unrolling is normally done early in the compilation process, so that redundant computations can be exposed and eliminated by the optimizer.
• Unrolling improves the performance of the loop by eliminating overhead instructions.
CPSC614Lec 6.11
Loop Unrolling (Scheduling)
Clock cycle issued
                              Clock cycle issued
Loop: L.D    F0, 0(R1)        1
      L.D    F6, -8(R1)       2
      L.D    F10, -16(R1)     3
      L.D    F14, -24(R1)     4
      ADD.D  F4, F0, F2       5
      ADD.D  F8, F6, F2       6
      ADD.D  F12, F10, F2     7
      ADD.D  F16, F14, F2     8
      S.D    F4, 0(R1)        9
      S.D    F8, -8(R1)       10
      DADDIU R1, R1, #-32     11
      S.D    F12, 16(R1)      12
      BNE    R1, R2, LOOP     13
      S.D    F16, 8(R1)       14

14 clock cycles for 4 elements: 3.5 per element.
CPSC614Lec 6.12
Summary
• Goal: To know when and how the ordering among instructions may be changed.
• This process must be performed in a methodical fashion either by a compiler or by hardware.
CPSC614Lec 6.13
• To obtain the final unrolled code, we must:
  – Determine that it is legal to move the S.D after the DADDIU and BNE, and find the amount to adjust the S.D offset.
  – Determine that unrolling the loop will be useful by finding that the loop iterations are independent, except for the loop maintenance code.
  – Use different registers to avoid unnecessary constraints.
  – Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
CPSC614Lec 6.14
  – Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This transformation requires analyzing the memory addresses and finding that they do not refer to the same address.
  – Schedule the code, preserving any dependences needed to yield the same result as the original code.
Loop Unrolling I (No Delayed Branch)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F0, -8(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -8(R1)
      L.D    F0, -16(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -16(R1)
      L.D    F0, -24(R1)
      ADD.D  F4, F0, F2
      S.D    F4, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, LOOP

Reusing F0 and F4 across the copies creates name dependences between copies; the L.D -> ADD.D -> S.D chain within each copy is a true dependence.
Loop Unrolling II (Register Renaming)

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDIU R1, R1, #-32
      BNE    R1, R2, LOOP

Only the true dependences within each copy remain.
CPSC614Lec 6.17
• With the renaming, the copies of each loop body become independent and can be overlapped or executed in parallel.
  – Potential shortfall in registers
• Register pressure
  – It arises because scheduling code to increase ILP causes the number of live values to increase. It may not be possible to allocate all the live values to registers.
  – The combination of unrolling and aggressive scheduling can cause this problem.
CPSC614Lec 6.18
• Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively.
CPSC614Lec 6.19
Unrolling with Two-Issue
      Integer instruction        FP instruction            Clock cycle
Loop: L.D    F0, 0(R1)                                     1
      L.D    F6, -8(R1)                                    2
      L.D    F10, -16(R1)        ADD.D  F4, F0, F2         3
      L.D    F14, -24(R1)        ADD.D  F8, F6, F2         4
      L.D    F18, -32(R1)        ADD.D  F12, F10, F2       5
      S.D    F4, 0(R1)           ADD.D  F16, F14, F2       6
      S.D    F8, -8(R1)          ADD.D  F20, F18, F2       7
      S.D    F12, -16(R1)                                  8
      DADDIU R1, R1, #-40                                  9
      S.D    F16, 16(R1)                                   10
      BNE    R1, R2, LOOP                                  11
      S.D    F20, 8(R1)                                    12

12 clock cycles for 5 elements: 2.4 per element.
CPSC614Lec 6.20
Static Branch Prediction
• Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.
CPSC614Lec 6.21
Static Branch Prediction
• Predict a branch taken
  – Simplest
  – Average misprediction rate for SPEC: 34% (9% ~ 59%)
• Predict on the basis of branch direction
  – Backward-going branches: taken
  – Forward-going branches: not taken
  – Unlikely to generate an overall misprediction rate of less than 30% ~ 40%.
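The direction heuristic can be sketched in a few lines of C (our own illustration, not from the text; the function name is hypothetical). A branch whose target lies below its own PC closes a loop, so it is predicted taken:

```c
#include <stdint.h>

/* Backward-taken / forward-not-taken static prediction:
   a branch whose target address is less than the branch's own PC is
   (usually) a loop-closing branch, so predict taken; otherwise
   predict not taken. Real ISAs encode the target as a PC-relative
   displacement; here we compare the resolved addresses directly. */
int predict_taken(uint64_t branch_pc, uint64_t target_pc) {
    return target_pc < branch_pc;   /* backward branch => taken */
}
```

Because loop-closing branches are taken far more often than not, this simple rule beats always-taken on loop-heavy code, but as the slide notes its overall misprediction rate remains high.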
CPSC614Lec 6.22
Static Branch Prediction
• Predict branches on the basis of profile information collected from earlier runs.
  – An individual branch is often highly biased toward taken or untaken (bimodally distributed).
  – Changing the input so that the profile is for a different run leads to only a small change in the accuracy of profile-based prediction.
CPSC614Lec 6.23
VLIW
• Very Long Instruction Word:
  – Relies on compiler technology to minimize the potential data hazard stalls.
  – Actually formats the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
  – Wide instructions with multiple operations per instruction (64, 128 bits or more).
  – Intel IA-64 architecture
CPSC614Lec 6.24
Basic VLIW Approach
• VLIWs use multiple, independent functional units.
• A VLIW packages the multiple operations into one very long instruction.
• The hardware a superscalar needs to make multiple-issue decisions is unnecessary.
• Uses loop unrolling, scheduling…
CPSC614Lec 6.25
• Local scheduling: scheduling the code within a single basic block.
• Global scheduling: scheduling code across branches
  – Much more complex
• Trace scheduling: Section 4.5
• Figure 4.5: VLIW instructions
CPSC614Lec 6.26
Problems
• Increase in code size
• Wasted functional units
  – In the previous example, only about 60% of the functional units were used.
CPSC614Lec 6.27
Detecting and Enhancing Loop-level Parallelism
• Loop-level parallelism: source level
• ILP: machine-level code after compilation

for (i = 1000; i > 0; i--)
    x[i] = x[i] + s;
CPSC614Lec 6.28
Advanced Compiler Support for Exposing and Exploiting ILP
for (i = 1; i <= 100; i++) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}
CPSC614Lec 6.29
Loop-Carried Dependence
• Data accesses in later iterations are dependent on data values produced in earlier iterations.
for (i = 1; i <= 100; i++) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

Loop-carried dependences: S1 uses A[i] computed by S1 in the previous iteration, and S2 uses B[i] computed by S2 in the previous iteration. This dependence forces successive iterations of this loop to execute in series.
CPSC614Lec 6.30
Does a loop-carried dependence mean there is no parallelism???
• Consider:

for (i = 0; i < 8; i = i + 1) {
    A = A + C[i];  /* S1 */
}

Could compute:

"Cycle 1": temp0 = C[0] + C[1];
           temp1 = C[2] + C[3];
           temp2 = C[4] + C[5];
           temp3 = C[6] + C[7];
"Cycle 2": temp4 = temp0 + temp1;
           temp5 = temp2 + temp3;
"Cycle 3": A = temp4 + temp5;
• Relies on associative nature of “+”.
CPSC614Lec 6.31
for (i = 1; i <= 100; i++) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

Loop-carried dependence: S1 uses the B[i] value computed by S2 in the previous iteration. Despite this loop-carried dependence, this loop can be made parallel, because the dependence is not circular.
CPSC614Lec 6.32
A[1] = A[1] + B[1];
for (i = 1; i <= 99; i++) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
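As a quick sanity check (our own sketch; the function names `original` and `transformed` are ours, not from the text), the two forms compute identical arrays, since the rewritten loop body contains no loop-carried dependence:

```c
#define N 100

/* Original loop: S2 of iteration i feeds S1 of iteration i+1,
   a loop-carried (but non-circular) dependence. */
void original(double A[N+2], double B[N+2],
              const double C[N+2], const double D[N+2]) {
    for (int i = 1; i <= N; i++) {
        A[i]   = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];      /* S2 */
    }
}

/* Transformed loop: first S1 peeled off the front, last S2 peeled
   off the back; within the new body the statements depend only on
   each other in the same iteration, so iterations are independent. */
void transformed(double A[N+2], double B[N+2],
                 const double C[N+2], const double D[N+2]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];   /* no loop-carried dependence */
    }
    B[N+1] = C[N] + D[N];
}
```

The transformed loop's iterations can now run in parallel.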
CPSC614Lec 6.33
Recurrence
• A recurrence occurs when a variable is defined based on the value of that variable in an earlier iteration, often the one immediately preceding.
• Detecting a recurrence can be important:
  – Some architectures (especially vector computers) have special support for executing recurrences.
  – Some recurrences can be the source of a reasonable amount of parallelism.
CPSC614Lec 6.34
for (i = 2; i <= 100; i = i + 1)
    Y[i] = Y[i-1] + Y[i];

Dependence distance: 1

for (i = 6; i <= 100; i = i + 1)
    Y[i] = Y[i-5] + Y[i];

Dependence distance: 5
The larger the distance, the more potential parallelism can beobtained by unrolling the loop.
CPSC614Lec 6.35
Finding Dependences
• Determining whether a dependence actually exists => NP-complete
• Dependence Analysis
  – Basic tool for detecting loop-level parallelism
  – Applies only under a limited set of circumstances.
  – Greatest common divisor (GCD) test, points-to analysis, interprocedural analysis, …
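The GCD test mentioned above can be sketched in a few lines of C (our own illustration; the function names are hypothetical). For a write to X[a*i + b] and a read of X[c*i + d], a loop-carried dependence can exist only if gcd(a, c) divides (d - b):

```c
/* Euclid's algorithm for the greatest common divisor. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for accesses X[a*i + b] (write) and X[c*i + d] (read).
   Returns 1 if a dependence MAY exist, 0 if it provably cannot.
   The test is conservative: it can report a possible dependence
   that the actual loop bounds rule out. */
int gcd_test_may_depend(int a, int b, int c, int d) {
    int g = gcd(a < 0 ? -a : a, c < 0 ? -c : c);
    if (g == 0)                  /* both strides zero: same element? */
        return b == d;
    return (d - b) % g == 0;     /* divisible => dependence possible */
}
```

For example, a write to X[2*i] and a read of X[2*i + 1] touch disjoint (even vs. odd) elements: gcd(2, 2) = 2 does not divide 1, so the test proves independence.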
CPSC614Lec 6.36
Eliminating Dependent Computation
• Algebraic Simplifications of Expressions
• Copy propagation
  – Eliminates operations that copy values.

DADDIU R1, R2, #4
DADDIU R1, R1, #4
        =>
DADDIU R1, R2, #8
CPSC614Lec 6.37
Eliminating Dependent Computation
• Tree Height Reduction– Reduces the height of the tree structure
representing a computation.
ADD R1, R2, R3
ADD R4, R1, R6
ADD R8, R4, R7
        =>
ADD R1, R2, R3
ADD R4, R6, R7
ADD R8, R1, R4
CPSC614Lec 6.38
Eliminating Dependent Computation
• Recurrences
sum = sum + x1 + x2 + x3 + x4 + x5
        =>
sum = (sum + x1) + (x2 + x3) + (x4 + x5)
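The rewritten form shortens the dependence chain from five sequential additions to three levels, since the parenthesized pairs can execute in parallel. A small C sketch (our own; function names are hypothetical) of the two forms:

```c
/* Serial recurrence: every + waits on the one before it
   (dependence chain of depth 5). */
long serial_sum(long sum, const long x[5]) {
    return sum + x[0] + x[1] + x[2] + x[3] + x[4];
}

/* Balanced form: (sum + x[0]), (x[1] + x[2]), (x[3] + x[4]) are
   independent and can execute in parallel (chain depth 3).
   Exactly equal for integers; for floating point the reassociation
   can change rounding, so compilers apply it only under relaxed
   FP rules. */
long balanced_sum(long sum, const long x[5]) {
    return (sum + x[0]) + ((x[1] + x[2]) + (x[3] + x[4]));
}
```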
CPSC614Lec 6.39
Software Pipelining
• Technique for reorganizing loops such that each iteration in the software-pipelined code is made from instructions chosen from different iterations of the original loop.
• By choosing instructions from different iterations, dependent computations are separated from one another by an entire loop body.
CPSC614Lec 6.40
Software Pipelining
• Counterpart to what Tomasulo’s algorithm does in hardware
• Software pipelining symbolically unrolls the loop and then selects instructions from each iteration.
• Start-up code before the loop and finish-up code after the loop are required.
CPSC614Lec 6.41
Software Pipelining
CPSC614Lec 6.42
Software Pipelining - Example
• Show a software-pipelined version of the following loop. Omit the start-up and finish-up code.
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
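The idea can be sketched at the C level (our own illustration; the function name is hypothetical, and unlike the exercise statement we include the start-up and finish-up code so the function is self-contained). In the steady-state kernel, each pass stores the result for element i, adds for element i-1, and loads element i-2, so no operation waits on one issued in the same pass:

```c
/* Software-pipelined version of: for (i = n; i > 0; i--) x[i] += s;
   The kernel interleaves the store, add, and load of three different
   original iterations, separating dependent operations by a full
   loop body. */
void add_scalar_pipelined(double *x, int n, double s) {
    if (n < 3) {                       /* too short to pipeline */
        for (int i = n; i > 0; i--) x[i] = x[i] + s;
        return;
    }
    /* Start-up: fill the pipeline. */
    double loaded = x[n];              /* load for iteration n    */
    double summed = loaded + s;        /* add  for iteration n    */
    loaded = x[n - 1];                 /* load for iteration n-1  */
    /* Steady-state kernel. */
    for (int i = n; i > 2; i--) {
        x[i] = summed;                 /* store for iteration i   */
        summed = loaded + s;           /* add   for iteration i-1 */
        loaded = x[i - 2];             /* load  for iteration i-2 */
    }
    /* Finish-up: drain the pipeline. */
    x[2] = summed;
    x[1] = loaded + s;
}
```

The assembly kernel corresponding to the loop body is the S.D / ADD.D / L.D triple, with the loop-maintenance instructions unchanged.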
CPSC614Lec 6.43
Software Pipelining
• Software pipelining consumes less code space.
• Loop unrolling reduces the overhead of the loop (branch, counter update code).
• Software pipelining reduces the time when the loop is not running at peak speed to once per loop at the beginning and end.
CPSC614Lec 6.44
CPSC614Lec 6.45
HW Support for More Parallelism at Compile Time

Conditional Instructions

• Predicated instructions
• Extension of the instruction set
• Conditional instruction: an instruction that refers to a condition, which is evaluated as part of the instruction execution
  – Condition is true: executed normally
  – False: no-op
  – ex) conditional move
CPSC614Lec 6.46
Example
if (A == 0) { S = T; }

With branches (R1 = A, R2 = S, R3 = T):

      BNEZ  R1, L
      ADDU  R2, R3, R0
L:

With a conditional move (CMOVZ performs the move only if the third operand is equal to zero):

      CMOVZ R2, R3, R1
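In C terms (our own sketch; the function names are hypothetical), the conditional move corresponds to replacing the if statement with a select expression, which a compiler can lower to a single conditional-move instruction:

```c
/* Branchy form: control dependence on A.
   Corresponds to: BNEZ R1,L ; ADDU R2,R3,R0 ; L: */
int with_branch(int A, int S, int T) {
    if (A == 0) { S = T; }
    return S;
}

/* Branchless form: the control dependence becomes a data dependence
   on A. Corresponds to a single CMOVZ R2,R3,R1. */
int with_cmov(int A, int S, int T) {
    return (A == 0) ? T : S;
}
```

Both forms compute the same result; the branchless one simply cannot be mispredicted.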
CPSC614Lec 6.47
• Conditional moves are used to change a control dependence into a data dependence.
• Handling multiple branches per cycle is complex. => Conditional moves provide a way of reducing branch pressure.
• A conditional move can often eliminate a branch that is hard to predict, increasing the potential gain.