EECS 583 – Class 22 Research Topic 4: Automatic SIMDization - Superword Level Parallelism
EECS 583 – Class 9 Classic and ILP Optimization
-
Upload
margaret-emerson -
Category
Documents
-
view
23 -
download
0
description
Transcript of EECS 583 – Class 9 Classic and ILP Optimization
![Page 1: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/1.jpg)
EECS 583 – Class 9Classic and ILP Optimization
University of Michigan
October 3, 2012
![Page 2: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/2.jpg)
- 2 -
Announcements & Reading Material
Homework 2 available on course web page» Due Oct 22 (Monday) at midnight
» Look on Phorum for help with HW #2 Adding/deleting instructions Splitting a BB
Today’s class» “Compiler Code Transformations for Superscalar-Based High-
Performance Systems,” Scott Mahlke, William Chen, John Gyllenhaal, Wen-mei Hwu, Pohua Chang, and Tokuzo Kiyohara, Proceedings of Supercomputing '92, Nov. 1992, pp. 808-817
Next class (instruction scheduling)» “Machine Description Driven Compilers for EPIC Processors”,
B. Rau, V. Kathail, and S. Aditya, HP Technical Report, HPL-98-40, 1998. (long paper but informative)
![Page 3: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/3.jpg)
- 3 -
From Last Time: Code Optimization
Make the code run faster on the target processor» My (Scott’s) favorite topic !!
» Other objectives: Power, code size
Classes of optimization» 1. Classical (machine independent)
Reducing operation count (redundancy elimination) Simplifying operations Generally good for any kind of machine
» 2. Machine specific Peephole optimizations Take advantage of specialized hardware features
» 3. Parallelism enhancing Increasing parallelism (ILP or TLP) Possibly increase instructions
![Page 4: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/4.jpg)
- 4 -
From Last Time: Dead Code Elimination Remove any operation who’s
result is never consumed Rules
» X can be deleted no stores or branches
» DU chain empty or dest register not live
This misses some dead code!!» Especially in loops» Critical operation
store or branch operation
» Any operation that does not directly or indirectly feed a critical operation is dead
» Trace UD chains backwards from critical operations
» Any op not visited is dead
r1 = 3r2 = 10
r4 = r4 + 1r7 = r1 * r4
r2 = 0 r3 = r3 + 1
r3 = r2 + r1
store (r1, r3)
![Page 5: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/5.jpg)
- 5 -
Local Constant Propagation
Consider 2 ops, X and Y in a BB, X is before Y» 1. X is a move
» 2. src1(X) is a literal
» 3. Y consumes dest(X)
» 4. There is no definition of dest(X) between X and Y
Note, ignore operation format issues, so all operations can have literals in either operand position
r1 = 5r2 = ‘_x’r3 = 7r4 = r4 + r1r1 = r1 + r2r1 = r1 + 1r3 = 12r8 = r1 - r2r9 = r3 + r5r3 = r2 + 1r10 = r3 – r1
![Page 6: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/6.jpg)
- 6 -
Global Constant Propagation
Consider 2 ops, X and Y in different BBs» 1. X is a move
» 2. src1(X) is a literal
» 3. Y consumes dest(X)
» 4. X is in a_in(BB(Y))
» 5. Dest(x) is not modified between the top of BB(Y) and Y
r1 = 5r2 = ‘_x’
r1 = r1 + r2 r7 = r1 – r2
r8 = r1 * r2
r9 = r1 + r2
![Page 7: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/7.jpg)
- 7 -
Constant Folding
Simplify 1 operation based on values of src operands» Constant propagation creates opportunities for this
All constant operands» Evaluate the op, replace with a move
r1 = 3 * 4 r1 = 12 r1 = 3 / 0 ??? Don’t evaluate excepting ops !, what about floating-point?
» Evaluate conditional branch, replace with BRU or noop if (1 < 2) goto BB2 BRU BB2 if (1 > 2) goto BB2 convert to a noop
Algebraic identities» r1 = r2 + 0, r2 – 0, r2 | 0, r2 ^ 0, r2 << 0, r2 >> 0
r1 = r2
» r1 = 0 * r2, 0 / r2, 0 & r2 r1 = 0
» r1 = r2 * 1, r2 / 1 r1 = r2
![Page 8: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/8.jpg)
- 8 -
Class Problemr1 = 0r2 = 10r3 = 0
r4 = 1r7 = r1 * 4r6 = 8if (r3 > 0)
store (r1, r3)
Optimize this applying1. constant propagation2. constant folding3. dead code elimination
r2 = 0r6 = r6 * r7r3 = r2 / r6
r3 = r4r3 = r3 + r2r1 = r6
r2 = r2 + 1r1 = r1 + 1if (r1 < 100)
![Page 9: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/9.jpg)
- 9 -
Forward Copy Propagation
Forward propagation of the RHS of moves» r1 = r2
» …
» r4 = r1 + 1 r4 = r2 + 1 Benefits
» Reduce chain of dependences
» Eliminate the move Rules (ops X and Y)
» X is a move
» src1(X) is a register
» Y consumes dest(X)
» X.dest is an available def at Y
» X.src1 is an available expr at Y
r1 = r2r3 = r4
r2 = 0 r6 = r3 + 1
r5 = r2 + r3
![Page 10: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/10.jpg)
- 10 -
CSE – Common Subexpression Elimination Eliminate recomputation of an
expression by reusing the previous result
» r1 = r2 * r3
» r100 = r1
» …
» r4 = r2 * r3 r4 = r100 Benefits
» Reduce work
» Moves can get copy propagated Rules (ops X and Y)
» X and Y have the same opcode
» src(X) = src(Y), for all srcs
» expr(X) is available at Y
» if X is a load, then there is no store that may write to address(X) along any path between X and Y
r1 = r2 * r6r3 = r4 / r7
r2 = r2 + 1 r6 = r3 * 7
r5 = r2 * r6r8 = r4 / r7r9 = r3 * 7
if op is a load, call it redundantload elimination rather than CSE
![Page 11: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/11.jpg)
- 11 -
Class ProblemOptimize this applying1. dead code elimination2. forward copy propagation3. CSE
r4 = r1r6 = r15r2 = r3 * r4r8 = r2 + r5r9 = rr7 = load(r2)if (r2 > r8)
r5 = r9 * r4r11 = r2r12 = load(r11)if (r12 != 0)
r3 = load(r2)r10 = r3 / r6
r11 = r8store (r11, r7)
store (r12, r3)
![Page 12: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/12.jpg)
- 12 -
Loop Invariant Code Motion (LICM) Move operations whose source
operands do not change within the loop to the loop preheader» Execute them only 1x per
invocation of the loop» Be careful with memory
operations!» Be careful with ops not
executed every iteration
r1 = 3r5 = &A
r4 = load(r5)r7 = r4 * 3
r8 = r2 + 1r7 = r8 * r4
r3 = r2 + 1
r1 = r1 + r7
store (r1, r3)
![Page 13: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/13.jpg)
- 13 -
LICM (2) Rules
» X can be moved» src(X) not modified in loop body» X is the only op to modify dest(X)» for all uses of dest(X), X is in the
available defs set» for all exit BB, if dest(X) is live on the
exit edge, X is in the available defs set on the edge
» if X not executed on every iteration, then X must provably not cause exceptions
» if X is a load or store, then there are no writes to address(X) in loop
r1 = 3r5 = &A
r4 = load(r5)r7 = r4 * 3
r8 = r2 + 1r7 = r8 * r4
r3 = r2 + 1
r1 = r1 + r7
store (r1, r3)
![Page 14: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/14.jpg)
- 14 -
Homework 2 – Speculative LICM
r1 = 3r5 = &A
r4 = load(r5)r7 = r4 * 3r2 = r2 + 1
r8 = r2 + 1 store (r1, r7)
r1 = r1 + r7
Cannot perform LICM on load, because theremay be an alias with the store
Memory profile says that these rarelyalias
Speculative LICM:1) Remove infrequent dependence between
loads and stores2) Perform LICM on load3) Perform LICM on any consumers of the
load that become invariant (done for you!)4) Check that an alias occurred at run-time5) Insert fix-up code to restore correct execution
![Page 15: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/15.jpg)
- 15 -
Speculative LICM (2)r1 = 3
r5 = &Ar4 = load(r5)r7 = r4 * 3
if (alias)
r8 = r2 + 1store (r1, r7)
check for alias
r1 = r1 + r7
r4 = load(r5)r7 = r4 * 3
redo load and any dependentinstructions when alias occurred
Check for alias by comparing addresses of the loadand store at run-time
r2 = r2 + 1
RedoBB
SplitBB
OriginalBB
![Page 16: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/16.jpg)
- 16 -
Global Variable Migration
Assign a global variable temporarily to a register for the duration of the loop» Load in preheader
» Store at exit points Rules
» X is a load or store
» address(X) not modified in the loop
» if X not executed on every iteration, then X must provably not cause an exception
» All memory ops in loop whose address can equal address(X) must always have the same address as X
r4 = load(r5)r4 = r4 + 1
r8 = load(r5)r7 = r8 * r4
store(r5, r4)
store(r5,r7)
![Page 17: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/17.jpg)
- 17 -
Induction Variable Strength Reduction Create basic induction
variables from derived induction variables
Induction variable» BIV (i++)
0,1,2,3,4,...
» DIV (j = i * 4) 0, 4, 8, 12, 16, ...
» DIV can be converted into a BIV that is incremented by 4
Issues» Initial and increment vals
» Where to place increments
r5 = r4 - 3r4 = r4 + 1
r7 = r4 * r9
r6 = r4 << 2
![Page 18: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/18.jpg)
- 18 -
Induction Variable Strength Reduction (2) Rules
» X is a *, <<, + or – operation» src1(X) is a basic ind var» src2(X) is invariant» No other ops modify dest(X)» dest(X) != src(X) for all srcs» dest(X) is a register
Transformation» Insert the following into the preheader
new_reg = RHS(X)
» If opcode(X) is not add/sub, insert to the bottom of the preheader new_inc = inc(src1(X)) opcode(X) src2(X)
» else new_inc = inc(src1(X))
» Insert the following at each update of src1(X) new_reg += new_inc
» Change X dest(X) = new_reg
r5 = r4 - 3r4 = r4 + 1
r7 = r4 * r9
r6 = r4 << 2
![Page 19: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/19.jpg)
- 19 -
Class ProblemOptimize this applyinginduction var str reduction
r5 = r5 + 1r11 = r5 * 2
r10 = r11 + 2r12 = load (r10+0)
r9 = r1 << 1r4 = r9 - 10
r3 = load(r4+4)r3 = r3 + 1
store(r4+0, r3)r7 = r3 << 2
r6 = load(r7+0)r13 = r2 - 1r1 = r1 + 1r2 = r2 + 1
r1 = 0r2 = 0
r13, r12, r6, r10liveout
![Page 20: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/20.jpg)
- 20 -
ILP Optimization
Traditional optimizations» Redundancy elimination
» Reducing operation count
ILP (instruction-level parallelism) optimizations» Increase the amount of parallelism and the ability to overlap
operations
» Operation count is secondary, often trade parallelism for extra instructions (avoid code explosion)
ILP increased by breaking dependences» True or flow = read after write dependence
» False or (anti/output) = write after read, write after write
![Page 21: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/21.jpg)
- 21 -
Global Register Renaming
Straight-line code strategy does not work» A single use may have
multiple reaching defs Web = Collection of defs/uses
which have possible value flow between them» Identify webs
Take a def, add all uses Take all uses, add all
reaching defs Take all defs, add all uses repeat until stable soln
» Each web renamed if name is the same as another web
x =y =
= y= x
x =y =
= y = x
y =
= y
![Page 22: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/22.jpg)
- 22 -
Back Substitution Generation of expressions by
compiler frontends is very sequential» Account for operator
precedence» Apply left-to-right within
same precedence Back substitution
» Create larger expressions Iteratively substitute RHS
expression for LHS variable
» Note – may correspond to multiple source statements
» Enable subsequent optis Optimization
» Re-compute expression in a more favorable manner
r9 = r1 + r2r10 = r9 + r3r11 = r10 - r4r12 = r11 + r5r13 = r12 – r6
Subs r12: r13 = r11 + r5 – r6Subs r11: r13 = r10 – r4 + r5 – r6Subs r10 r13 = r9 + r3 – r4 + r5 – r6Subs r9 r13 = r1 + r2 + r3 – r4 + r5 – r6
y = a + b + c – d + e – f;
![Page 23: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/23.jpg)
- 23 -
Tree Height Reduction Re-compute expression as a
balanced binary tree» Obey precedence rules
» Essentially re-parenthesize
» Combine literals if possible Effects
» Height reduced (n terms) n-1 (assuming unit latency) ceil(log2(n))
» Number of operations remains constant
» Cost Temporary registers “live”
longer
» Watch out for Always ok for integer arithmetic Floating-point – may not be!!
r9 = r1 + r2r10 = r9 + r3r11 = r10 - r4r12 = r11 + r5r13 = r12 – r6
r13 = r1 + r2 + r3 – r4 + r5 – r6
r1 + r2 r3 – r4 r5 – r6
+
+
t1 = r1 + r2t2 = r3 – r4t3 = r5 – r6t4 = t1 + t2r13 = t4 + t3
r13
after back subs:
original:
final code:
![Page 24: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/24.jpg)
- 24 -
Class Problem
Assume: + = 1, * = 3
0r1
0r2
0r3
1r4
2r5
0r6
operandarrival times
r10 = r1 * r2r11 = r10 + r3r12 = r11 + r4r13 = r12 – r5r14 = r13 + r6
Back susbstituteRe-express in tree-height reduced form
Account for latency and arrival times
![Page 25: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/25.jpg)
- 25 -
Optimizing Unrolled Loops
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4if (r4 < 400) goto loop
loop: r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4if (r4 < 400) goto loop
iter1
iter2
iter3
Unroll = replicate loop body n-1 times.
Hope to enable overlap ofoperation execution fromdifferent iterations
Not possible!
loop:
unroll 3 times
![Page 26: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/26.jpg)
- 26 -
Register Renaming on Unrolled Loop
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4if (r4 < 400) goto loop
iter1
iter2
iter3
loop: r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r11 = load(r2)r13 = load(r4)r15 = r11 * r13r6 = r6 + r15r2 = r2 + 4r4 = r4 + 4
r21 = load(r2)r23 = load(r4)r25 = r21 * r23r6 = r6 + r25r2 = r2 + 4r4 = r4 + 4if (r4 < 400) goto loop
iter1
iter2
iter3
loop:
![Page 27: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/27.jpg)
- 27 -
Register Renaming is Not Enough!
Still not much overlap possible Problems
» r2, r4, r6 sequentialize the iterations
» Need to rename these 2 specialized renaming optis
» Accumulator variable expansion (r6)
» Induction variable expansion (r2, r4)
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r11 = load(r2)r13 = load(r4)r15 = r11 * r13r6 = r6 + r15r2 = r2 + 4r4 = r4 + 4
r21 = load(r2)r23 = load(r4)r25 = r21 * r23r6 = r6 + r25r2 = r2 + 4r4 = r4 + 4if (r4 < 400) goto loop
iter1
iter2
iter3
loop:
![Page 28: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/28.jpg)
- 28 -
Accumulator Variable Expansion
Accumulator variable» x = x + y or x = x – y
» where y is loop variant!! Create n-1 temporary
accumulators Each iteration targets a
different accumulator Sum up the accumulator
variables at the end May not be safe for floating-
point values
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 4r4 = r4 + 4
r11 = load(r2)r13 = load(r4)r15 = r11 * r13r16 = r16 + r15r2 = r2 + 4r4 = r4 + 4
r21 = load(r2)r23 = load(r4)r25 = r21 * r23r26 = r26 + r25r2 = r2 + 4r4 = r4 + 4if (r4 < 400) goto loop
iter1
iter2
iter3
loop:r16 = r26 = 0
r6 = r6 + r16 + r26
![Page 29: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/29.jpg)
- 29 -
Induction Variable Expansion
Induction variable» x = x + y or x = x – y
» where y is loop invariant!! Create n-1 additional induction
variables Each iteration uses and
modifies a different induction variable
Initialize induction variables to init, init+step, init+2*step, etc.
Step increased to n*original step
Now iterations are completely independent !!
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5r2 = r2 + 12r4 = r4 + 12
r11 = load(r12)r13 = load(r14)r15 = r11 * r13r16 = r16 + r15r12 = r12 + 12r14 = r14 + 12
r21 = load(r22)r23 = load(r24)r25 = r21 * r23r26 = r26 + r25r22 = r22 + 12r24 = r24 + 12if (r4 < 400) goto loop
iter1
iter2
iter3
loop:r16 = r26 = 0
r6 = r6 + r16 + r26
r12 = r2 + 4, r22 = r2 + 8r14 = r4 + 4, r24 = r4 + 8
![Page 30: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/30.jpg)
- 30 -
Better Induction Variable Expansion
With base+displacement addressing, often don’t need additional induction variables» Just change offsets in each
iterations to reflect step
» Change final increments to n * original step
r1 = load(r2)r3 = load(r4)r5 = r1 * r3r6 = r6 + r5
r11 = load(r2+4)r13 = load(r4+4)r15 = r11 * r13r16 = r16 + r15
r21 = load(r2+8)r23 = load(r4+8)r25 = r21 * r23r26 = r26 + r25r2 = r2 + 12r4 = r4 + 12if (r4 < 400) goto loop
iter1
iter2
iter3
loop:r16 = r26 = 0
r6 = r6 + r16 + r26
![Page 31: EECS 583 – Class 9 Classic and ILP Optimization](https://reader035.fdocuments.net/reader035/viewer/2022062422/568131c3550346895d982ce0/html5/thumbnails/31.jpg)
- 31 -
Homework Problem
r1 = load(r2)r5 = r6 + 3r6 = r5 + r1r2 = r2 + 4if (r2 < 400) goto loop
loop:r1 = load(r2)r5 = r6 + 3r6 = r5 + r1r2 = r2 + 4
r1 = load(r2)r5 = r6 + 3r6 = r5 + r1r2 = r2 + 4
r1 = load(r2)r5 = r6 + 3r6 = r5 + r1r2 = r2 + 4if (r2 < 400) goto loop
loop:
Optimize the unrolledloop
RenamingTree height reductionInd/Acc expansion