instruction selection 745


Transcript of instruction selection 745

Page 1: instruction selection 745

School of Computer Science

Instruction Selection

15-745 Optimizing Compilers, Spring 2007

Compiler Backend

    IR → instruction selector → Assem → instruction scheduler → Assem → register allocator → TempMap

Intermediate Representations

Front end - produces an intermediate representation (IR)
Middle end - transforms the IR into an equivalent IR that runs more efficiently
Back end - transforms the IR into native code

IR encodes the compiler’s knowledge of the program
Middle end usually consists of several passes

    Source Code → Front End → IR → Middle End → IR → Back End → Target Code

Page 2: instruction selection 745

Intermediate Representations

Decisions in IR design affect the speed and efficiency of the compiler
Some important IR properties
– Ease of generation
– Ease of manipulation
– Procedure size
– Freedom of expression
– Level of abstraction
The importance of different properties varies between compilers
– Selecting an appropriate IR for a compiler is critical

Types of Intermediate Representations

Structural
– Graphically oriented
– Heavily used in source-to-source translators
– Tend to be large
Examples: trees, DAGs

Linear
– Pseudo-code for an abstract machine
– Level of abstraction varies
– Simple, compact data structures
– Easier to rearrange
Examples: 3 address code, stack machine code

Hybrid
– Combination of graphs and linear code
Example: control-flow graph

Level of Abstraction

The level of detail exposed in an IR influences the profitability and
feasibility of different optimizations.

Two different representations of an array reference:

High level AST (good for memory disambiguation):

    subscript(A, i, j)

Low level linear code (good for address calculation):

    loadI 1      => r1
    sub   rj, r1 => r2
    loadI 10     => r3
    mult  r2, r3 => r4
    sub   ri, r1 => r5
    add   r4, r5 => r6
    loadI @A     => r7
    add   r7, r6 => r8
    load  r8     => rAij

Level of Abstraction

Structural IRs are usually considered high-level
Linear IRs are usually considered low-level
Not necessarily true:

Low level AST (the address arithmetic made explicit):

    load(+(@A, +(*(10, -(j, 1)), -(i, 1))))

High level linear code:

    loadArray A, i, j

Page 3: instruction selection 745

Abstract Syntax Tree

An abstract syntax tree (AST) is the procedure’s parse tree with most
non-terminal nodes removed

x - 2 * y as an AST:

    -(x, *(2, y))

When is an AST a good IR?

Structural IR

Directed Acyclic Graph

A directed acyclic graph (DAG) is an AST with a unique node for each value
Makes sharing explicit
Encodes redundancy

    z ← x - 2 * y
    w ← x / 2

(In the DAG, the nodes for x and 2 are shared between the two statements.)

Same expression twice means that the compiler might arrange to evaluate it just once!

Structural IR
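The sharing a DAG makes explicit can be sketched with hash-consing: every distinct value gets exactly one node, so a repeated variable or subexpression is built only once. This is an illustrative sketch; the tuple encoding and node numbering are assumptions, not the slides' representation.

```python
# Minimal DAG construction by hash-consing: each distinct value maps to
# a single node id, so repeated subexpressions are shared.

def make_dag(exprs):
    nodes = {}                      # canonical key -> node id
    def build(e):
        # leaves are variables/constants; interior nodes are (op, l, r)
        key = e if not isinstance(e, tuple) else (e[0], build(e[1]), build(e[2]))
        if key not in nodes:
            nodes[key] = len(nodes)
        return nodes[key]
    return [build(e) for e in exprs], nodes

# z <- x - 2 * y ; w <- x / 2
roots, nodes = make_dag([("-", "x", ("*", 2, "y")), ("/", "x", 2)])
# As trees the two statements would need 8 nodes; the DAG shares the
# nodes for x and 2, leaving only 6.
```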

Pegasus IR

Predicated Explicit GAted Simple Uniform SSA
Structural IR used in CASH (Compiler for Application Specific Hardware)

Stack Machine Code

Originally used for stack-based computers, now Java
Example: x - 2 * y becomes

    push x
    push 2
    push y
    multiply
    subtract

Advantages
– Compact form
– Introduced names are implicit, not explicit
– Simple to generate and execute code

Useful where code is transmitted over slow communication links (the net).
Implicit names take up no space, where explicit ones do!

Linear IR
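Stack code is also simple to execute, which a toy evaluator shows; the (op, arg) encoding below is illustrative, not any real VM's format.

```python
# Toy evaluator for the slide's stack-machine code. Operands live on an
# implicit stack, so no temporaries are ever named.

def eval_stack(code, env):
    stack = []
    for op, arg in code:
        if op == "push":
            # push a variable's value or a literal constant
            stack.append(env[arg] if isinstance(arg, str) else arg)
        elif op == "multiply":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "subtract":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
    return stack.pop()

# x - 2 * y
program = [("push", "x"), ("push", 2), ("push", "y"),
           ("multiply", None), ("subtract", None)]
```

With x = 10 and y = 3 the program leaves 10 - 2*3 = 4 on the stack.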

Page 4: instruction selection 745

Three Address Code

Three address code has statements of the form:

    x ← y op z

with 1 operator (op) and, at most, 3 names (x, y, and z)

Example: z ← x - 2 * y becomes

    t ← 2 * y
    z ← x - t

Advantages:
– Resembles many machines (RISC)
– Compact form

Linear IR
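Lowering an expression tree to three address code can be sketched in a few lines: each operator gets its own statement and a fresh explicit temporary. The tuple tree encoding and naming scheme are illustrative assumptions.

```python
# Sketch of generating three address code from an expression tree:
# one operator per statement, fresh temporaries for intermediate values.

import itertools

def to_tac(tree):
    code, fresh = [], itertools.count(1)
    def walk(node):
        if not isinstance(node, tuple):     # variable or constant leaf
            return str(node)
        op, left, right = node
        a, b = walk(left), walk(right)
        t = f"t{next(fresh)}"
        code.append(f"{t} <- {a} {op} {b}")
        return t
    return walk(tree), code

# z <- x - 2 * y
result, code = to_tac(("-", "x", ("*", 2, "y")))
code.append(f"z <- {result}")
```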

Two Address Code

Allows statements of the form

    x ← x op y

with 1 operator (op) and, at most, 2 names (x and y)

Example: z ← x - 2 * y becomes

    t1 ← 2
    t2 ← load y
    t2 ← t2 * t1
    z ← load x
    z ← z - t2

– Can be very compact
– Destructive operations make reuse hard
– Good model for machines with destructive ops (x86)

Linear IR

Control-flow Graph

Models the transfer of control in the procedure
Nodes in the graph are basic blocks
– Straight-line code
– Either linear or structural representation
Edges in the graph represent control flow

Example: if (x = y) branches to either { a ← 2; b ← 5 } or { a ← 3; b ← 4 },
and both paths join at c ← a * b

Basic blocks: maximal length sequences of straight-line code

Hybrid IR
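Partitioning linear code into basic blocks is usually done with the classic "leaders" rule: the first instruction, every branch target, and every instruction after a branch start a new block. A sketch, where the (op, arg) instruction format and the "br"/"cbr"/"label" opcodes are illustrative assumptions:

```python
# Find basic blocks via leaders: block boundaries fall at the first
# instruction, at branch targets, and after branches.

def basic_blocks(instrs):
    labels = {arg: i for i, (op, arg) in enumerate(instrs) if op == "label"}
    leaders = {0}
    for i, (op, arg) in enumerate(instrs):
        if op in ("br", "cbr"):
            leaders.add(labels[arg])        # branch target leads a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)          # so does the fall-through
    cuts = sorted(leaders) + [len(instrs)]
    return [instrs[a:b] for a, b in zip(cuts, cuts[1:])]

# shape of the slide's example: a conditional, two assignments, a join
example = [("op", "a ← 2"), ("cbr", "L1"),
           ("op", "a ← 3"),
           ("label", "L1"), ("op", "c ← a * b")]
blocks = basic_blocks(example)
```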

Using Multiple Representations

Repeatedly lower the level of the intermediate representation
– Each intermediate representation is suited towards certain optimizations

Example: the Open64 compiler
– WHIRL intermediate format
  • Consists of 5 different IRs that are progressively more detailed
– gcc
  • but not explicit about it :-(

    Source Code → Front End → IR 1 → Middle End → IR 2 → Middle End → IR 3 → Back End → Target Code

Page 5: instruction selection 745

Instruction Selection

    IR → instruction selector → Assem

The instruction selector is driven by IR -> Assem templates.

Instruction selection example

Suppose we have

    MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))

We can generate the x86 code

    movl (,%s,c), %r

…if c = 1, 2, or 4; otherwise

    imull $c, %s, %r
    movl (%r), %r

Selection dependencies

    MOVE(TEMP r, MEM(BINOP(TIMES, TEMP s, CONST c)))

We can see that the selection of instructions can depend on the constants.
The context of an IR expression can also affect the choice of instruction.
Consider:

Example, cont’d

For

    MEM(BINOP(TIMES, TEMP s, CONST c))

we might sometimes want to generate

    testl $0, (,%s,c)
    je L

What context might cause us to do this?

Page 6: instruction selection 745

Instruction selection as tree matching

In order to take context into account, instruction selectors often use
pattern-matching on IR trees
– each pattern specifies what instructions to select

Sample tree-matching rules

    IR pattern                          code              cost
    BINOP(PLUS,i,j)                     leal (i,j),r      1
    BINOP(TIMES,i,j)                    movl j,r          2
                                        imull i,r
    BINOP(PLUS,i,CONST c)               leal c(i),r       1
    MEM(BINOP(PLUS,i,CONST c))          movl c(i),r       1
    MOVE(MEM(BINOP(PLUS,i,j)),k)        movl k,(i,j)      1
    BINOP(TIMES,i,CONST c)              leal (,i,c),r     1    (if c is 1, 2, or 4)
    BINOP(TIMES,i,CONST c)              movl c,r          2
                                        imull i,r
    MEM(i)                              movl (i),r        1
    MOVE(MEM(i),j)                      movl j,(i)        1
    MOVE(MEM(i),MEM(j))                 movl (j),t        2
                                        movl t,(i)

Tiling an IR tree, v.1

a[x] = *y;  (assume a is a formal parameter passed on the stack)

The IR tree:

    MOVE(MEM(BINOP(PLUS, MEM(BINOP(PLUS, EBP, CONST a)), BINOP(TIMES, x, CONST 4))), MEM(y))

One tiling, with r1-r4 naming the intermediate results:

    leal $a(%ebp),r1
    movl (r1),r2
    leal (,x,$4),r3
    leal (r2,r3),r4
    movl (y),r5
    movl r5,(r4)

Tiling an IR tree, v.2

a[x] = *y;

The same IR tree, tiled with fewer, larger tiles:

    movl $a(%ebp),r1
    leal (,x,4),r2
    movl (y),r3
    movl r3,(r1,r2)

Page 7: instruction selection 745

Tiling choices

In general, for any given tree, many tilings are possible
– each resulting in a different instruction sequence
We can ensure pattern coverage by covering, at a minimum, all atomic IR trees

The best tiling?

We want the “lowest cost” tiling
– usually, the shortest sequence
– but can also take into account cost/delay of each instruction

Optimum tiling
– lowest-cost tiling

Locally optimal tiling
– no two adjacent tiles can be combined into one tile of lower cost

Locally optimal tilings

Locally optimal tiling is easy
– A simple greedy algorithm works extremely well in practice:
– Maximal munch
  • start at root
  • use “biggest” match (in # of nodes)
  • use cost to break ties
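The greedy algorithm fits in a few lines. In this sketch the patterns, wildcard markers, and instruction strings are illustrative, not the slides' full rule set: SUB marks a subtree the tile leaves uncovered (to be tiled recursively), ANY matches a leaf value the tile absorbs.

```python
# Sketch of maximal munch: at each node, greedily take the biggest
# matching pattern, then recursively tile the uncovered subtrees.

SUB, ANY = "SUB", "ANY"

def match(pat, tree, kids):
    if pat == SUB:
        kids.append(tree)
        return True
    if pat == ANY:
        return True
    if not isinstance(pat, tuple):
        return pat == tree
    return (isinstance(tree, tuple) and len(pat) == len(tree)
            and all(match(p, t, kids) for p, t in zip(pat, tree)))

RULES = [  # (nodes covered, pattern, instruction template)
    (3, ("MEM", ("BINOP", "PLUS", SUB, ("CONST", ANY))), "movl c(i),r"),
    (2, ("BINOP", "PLUS", SUB, SUB), "leal (i,j),r"),
    (1, ("MEM", SUB), "movl (i),r"),
    (1, ("TEMP", ANY), "temp"),
    (1, ("CONST", ANY), "loadI c,r"),
]

def munch(tree, out):
    for size, pat, instr in sorted(RULES, key=lambda r: -r[0]):  # biggest first
        kids = []
        if match(pat, tree, kids):
            for k in kids:
                munch(k, out)       # tile the uncovered subtrees
            out.append(instr)
            return
    raise ValueError(f"no pattern matches {tree!r}")

out = []
munch(("MEM", ("BINOP", "PLUS", ("TEMP", "fp"), ("CONST", 8))), out)
```

Here the 3-node MEM-of-PLUS-of-CONST tile wins at the root, leaving only the TEMP leaf to cover.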

Maximal munch

Tiling the tree for a[x] = *y; again: choose the largest pattern with
lowest cost, i.e., the “maximal munch”. At the root MOVE, the candidates are:

    IR pattern                        code            cost
    MOVE(MEM(BINOP(PLUS,i,j)),k)      movl k,(i,j)    1
    MOVE(MEM(i),j)                    movl j,(i)      1
    MOVE(MEM(i),MEM(j))               movl (j),t      2
                                      movl t,(i)

Page 8: instruction selection 745

Maximal munch

Maximal munch does not necessarily produce the optimum selection of instructions
But:
– it is easy to implement
– it tends to work well for current instruction-set architectures

Maximal munch is not optimum

Consider what happens, for example, if two of our rules are changed as follows:

    IR pattern                        code          cost
    MEM(BINOP(PLUS,i,CONST c))        movl c,r      3
                                      addl i,r
                                      movl (r),r
    MOVE(MEM(BINOP(PLUS,i,j)),k)      movl j,r      3
                                      addl i,r
                                      movl k,(r)

Sample tree-matching rules

    Rule #  IR pattern                        cost
    0       TEMP t                            0
    1       CONST c                           1
    2       BINOP(PLUS,i,j)                   1
    3       BINOP(TIMES,i,j)                  2
    4       BINOP(PLUS,i,CONST c)             1
    5       MEM(BINOP(PLUS,i,CONST c))        3
    6       MOVE(MEM(BINOP(PLUS,i,j)),k)      3
    7       BINOP(TIMES,i,CONST c)            1    (if c is 1, 2, or 4)
    8       BINOP(TIMES,i,CONST c)            2
    9       MEM(i)                            1
    10      MOVE(MEM(i),j)                    1
    11      MOVE(MEM(i),MEM(j))               2

Tiling an IR tree, new rules

a[x] = *y;

With the changed rules, the same tree tiles as:

    movl $a,r1
    addl %ebp,r1
    movl (r1),r1
    leal (,x,4),r2
    movl (y),r3
    movl r2,r4
    addl r1,r4
    movl r3,(r4)

Page 9: instruction selection 745

Optimum selection

To achieve optimum instruction selection, we must use a more complex algorithm
– dynamic programming
In contrast to maximal munch, the trees are matched bottom-up

Dynamic programming

The idea is fairly simple
– Working bottom up…
– Given the optimum tilings of all subtrees, generate the optimum tiling of the current tree
  • consider all tiles for the root of the current tree
  • sum cost of best subtree tiles and each tile
  • choose tile with minimum total cost

Second pass generates the code using the results from the bottom-up pass
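The bottom-up cost computation can be sketched as follows. The rules and costs here are illustrative, and the sketch only computes the optimum cost; a real selector would also record the winning rule at each node for the code-emission pass.

```python
# Sketch of the dynamic-programming pass: each node's best cost considers
# every rule matching at its root, plus the best costs of the subtrees
# the rule leaves uncovered.

SUB, ANY = "SUB", "ANY"   # SUB = uncovered subtree, ANY = absorbed leaf

def match(pat, tree, kids):
    if pat == SUB:
        kids.append(tree)
        return True
    if pat == ANY:
        return True
    if not isinstance(pat, tuple):
        return pat == tree
    return (isinstance(tree, tuple) and len(pat) == len(tree)
            and all(match(p, t, kids) for p, t in zip(pat, tree)))

RULES = [  # (cost, pattern)
    (1, ("MEM", ("BINOP", "PLUS", SUB, ("CONST", ANY)))),
    (1, ("BINOP", "PLUS", SUB, SUB)),
    (1, ("MEM", SUB)),
    (0, ("TEMP", ANY)),
    (1, ("CONST", ANY)),
]

def best_cost(tree):
    options = []
    for cost, pat in RULES:
        kids = []
        if match(pat, tree, kids):
            options.append(cost + sum(best_cost(k) for k in kids))
    return min(options)     # optimum over all tilings of `tree`

tree = ("MEM", ("BINOP", "PLUS", ("TEMP", "fp"), ("CONST", 8)))
# the single big tile (cost 1) beats MEM over PLUS over CONST (cost 3)
```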

Bottom-up CG, pass 1

[Figure: an IR tree MOVE(MEM(...), MEM(...)) with leaves %EBP, CONST a,
CONST 4, and CONST 99. Pass 1 annotates every node bottom-up with a pair
(r,c), meaning rule r gives total cost c — e.g. (0,0) on TEMP leaves,
(1,1) on CONST leaves, up to (10,6),(11,6) at the MOVE root.]

Bottom-up CG, pass 2

[Figure: the same annotated tree. Pass 2 uses the recorded (r,c) pairs —
rule r, total cost c — to choose the tiles and generate the code.]

Page 10: instruction selection 745

Bottom-up code generation

How does the running time compare to maximal munch?
Memory usage?

Tools

A lot of tools have been developed to automatically generate instruction selectors
– TWIG [Aho, Ganapathi, Tjiang, ’86]
– BURG [Fraser, Henry, Proebsting, ’92]
– BEG [Emmelmann, Schroer, Landwehr, ’89]
– …
These generate bottom-up instruction selectors from tree-matching rule specifications

Code-generator generators

Twig, Burg, and Beg use the dynamic programming approach
– A lot of work has gone into making these cgg’s highly efficient
There are also grammar-based cgg’s that use LR(k) parsing to perform the tree-matching

Tiling a DAG

How would you tile a DAG in which &x feeds two adds (+4 and +8), each
followed by a load into t0 and t1?

Tiling the shared node once:

    mov &x -> t3
    add t3,4 -> t4
    load (t4) -> t0
    add t3,8 -> t5
    load (t5) -> t1

Or simply:

    mov x+4 -> t0
    mov x+8 -> t1

Page 11: instruction selection 745

Peephole Matching

Basic idea
Compiler can discover local improvements locally
– Look at a small set of adjacent operations
– Move a “peephole” over code & search for improvement

Classic example was store followed by load

    Original code             Improved code
    storeAI r1 ⇒ r0,8         storeAI r1 ⇒ r0,8
    loadAI r0,8 ⇒ r15         i2i r1 ⇒ r15

Peephole Matching

Simple algebraic identities

    Original code             Improved code
    addI r2,0 ⇒ r7            mult r4,r2 ⇒ r10
    mult r4,r7 ⇒ r10

Peephole Matching

Jump to a jump

    Original code             Improved code
         jumpI → L10          L10: jumpI → L11
    L10: jumpI → L11

Peephole Matching

Implementing it
Early systems used limited set of hand-coded patterns
Window size ensured quick processing

Modern peephole instruction selectors (Davidson)
Break problem into three tasks, applying symbolic interpretation &
simplification systematically:

    IR → Expander (IR→LLIR) → LLIR → Simplifier (LLIR→LLIR) → LLIR → Matcher (LLIR→ASM) → ASM

Page 12: instruction selection 745

Peephole Matching: Expander

Turns IR code into a low-level IR (LLIR) such as RTL
Operation-by-operation, template-driven rewriting
LLIR form includes all direct effects (e.g., setting cc)
Significant, albeit constant, expansion of size

Peephole Matching: Simplifier

Looks at LLIR through window and rewrites it
Uses forward substitution, algebraic simplification, local constant
propagation, and dead-effect elimination
Performs local optimization within window

This is the heart of the peephole system
– Benefit of peephole optimization shows up in this step

Peephole Matching: Matcher

Compares simplified LLIR against a library of patterns
Picks low-cost pattern that captures effects
Must preserve LLIR effects, may add new ones (e.g., set cc)
Generates the assembly code output

Example

Original IR Code:

    OP    Arg1  Arg2  Result
    mult  2     y     t1
    sub   x     t1    w

Expand ⇒ LLIR Code:

    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← MEM(r12)
    r14 ← r10 x r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← MEM(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    MEM(r20) ← r18

Page 13: instruction selection 745

Example

Simplify ⇒ LLIR Code:

    r13 ← MEM(r0 + @y)
    r14 ← 2 x r13
    r17 ← MEM(r0 + @x)
    r18 ← r17 - r14
    MEM(r0 + @w) ← r18

Example

Match ⇒ ILOC Code:

    loadAI r0,@y ⇒ r13
    multI 2 x r13 ⇒ r14
    loadAI r0,@x ⇒ r17
    sub r17 - r14 ⇒ r18
    storeAI r18 ⇒ r0,@w

Steps of the Simplifier

The simplifier slides a 3-operation window over the expanded LLIR code:

    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11
    r13 ← MEM(r12)
    r14 ← r10 x r13
    r15 ← @x
    r16 ← r0 + r15
    r17 ← MEM(r16)
    r18 ← r17 - r14
    r19 ← @w
    r20 ← r0 + r19
    MEM(r20) ← r18

Steps of the Simplifier

The first window:

    r10 ← 2
    r11 ← @y
    r12 ← r0 + r11

Page 14: instruction selection 745

Steps of the Simplifier

    r10 ← 2                   r10 ← 2
    r11 ← @y              ⇒   r12 ← r0 + @y
    r12 ← r0 + r11            r13 ← MEM(r12)

Steps of the Simplifier

    r10 ← 2                   r10 ← 2
    r12 ← r0 + @y         ⇒   r13 ← MEM(r0 + @y)
    r13 ← MEM(r12)            r14 ← r10 x r13

Steps of the Simplifier

    r10 ← 2                   r13 ← MEM(r0 + @y)
    r13 ← MEM(r0 + @y)    ⇒   r14 ← 2 x r13
    r14 ← r10 x r13           r15 ← @x

Steps of the Simplifier

    r13 ← MEM(r0 + @y)        r14 ← 2 x r13
    r14 ← 2 x r13         ⇒   r15 ← @x
    r15 ← @x                  r16 ← r0 + r15

r13 ← MEM(r0 + @y) is the 1st op it has rolled out of the window

Page 15: instruction selection 745

Steps of the Simplifier

    r14 ← 2 x r13             r14 ← 2 x r13
    r15 ← @x              ⇒   r16 ← r0 + @x
    r16 ← r0 + r15            r17 ← MEM(r16)

Steps of the Simplifier

    r14 ← 2 x r13             r14 ← 2 x r13
    r16 ← r0 + @x         ⇒   r17 ← MEM(r0 + @x)
    r17 ← MEM(r16)            r18 ← r17 - r14

Steps of the Simplifier

    r14 ← 2 x r13             r17 ← MEM(r0 + @x)
    r17 ← MEM(r0 + @x)    ⇒   r18 ← r17 - r14
    r18 ← r17 - r14           r19 ← @w

Steps of the Simplifier

    r17 ← MEM(r0 + @x)        r18 ← r17 - r14
    r18 ← r17 - r14       ⇒   r19 ← @w
    r19 ← @w                  r20 ← r0 + r19

Page 16: instruction selection 745

Steps of the Simplifier

    r18 ← r17 - r14           r18 ← r17 - r14
    r19 ← @w              ⇒   r20 ← r0 + @w
    r20 ← r0 + r19            MEM(r20) ← r18

Steps of the Simplifier

    r18 ← r17 - r14           r18 ← r17 - r14
    r20 ← r0 + @w         ⇒   MEM(r0 + @w) ← r18
    MEM(r20) ← r18


Example

After the last window, the simplified LLIR code is:

    r13 ← MEM(r0 + @y)
    r14 ← 2 x r13
    r17 ← MEM(r0 + @x)
    r18 ← r17 - r14
    MEM(r0 + @w) ← r18
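The forward substitution the window performs can be sketched as below. This is a toy: the (dst, rhs) LLIR encoding is an assumption, and it skips the windowing and the "is the folded definition really dead?" check a real simplifier needs.

```python
# Toy forward substitution: simple values (constants, addresses) and
# '+'-address sums are folded into their later uses and their
# definitions dropped; real work (loads, multiplies, ...) stays.

def simplify(code, fold_ops=("+",)):
    env, out = {}, []
    def subst(x):
        return env.get(x, x)
    for dst, rhs in code:
        if isinstance(rhs, tuple):
            rhs = (rhs[0],) + tuple(subst(a) for a in rhs[1:])
        else:
            rhs = subst(rhs)
        if not isinstance(rhs, tuple) or rhs[0] in fold_ops:
            env[dst] = rhs          # propagate forward; emit nothing
        else:
            out.append((dst, rhs))  # keep the real operation
    return out

# the first five operations of the slides' expanded LLIR
llir = [("r10", 2),
        ("r11", "@y"),
        ("r12", ("+", "r0", "r11")),
        ("r13", ("MEM", "r12")),
        ("r14", ("x", "r10", "r13"))]
simplified = simplify(llir)
# r13 ← MEM(r0 + @y) ; r14 ← 2 x r13, matching the slides' result
```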

Page 17: instruction selection 745

Making It All Work

Details
LLIR is largely machine independent (RTL)
Target machine described as LLIR → ASM pattern
Actual pattern matching
– Use a hand-coded pattern matcher (gcc)
– Turn patterns into grammar & use LR parser (VPO)

Several important compilers use this technology
It seems to produce good portable instruction selectors
Key strength appears to be late low-level optimization

SIMD Instructions

We’ve assumed each tile is connected.
What about SIMD instructions? Ex. TI C62x add2, which performs two
additions within one 32-bit word (subword parallelism)

Automatic Vectorization

Loop level:

    for (i = 0; i < 1024; i++) {         for (i = 0; i < 1024; i += 4) {
      C[i] = A[i]*B[i];              ⇒     C[i:i+3] = A[i:i+3]*B[i:i+3];
    }                                    }

Basic block level:

    x = a+b;                         ⇒   (x,y) = (a,c)+(b,d);
    y = c+d;

Looking for independent, isomorphic operations
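The loop-level transformation can be illustrated in plain code. In Python the 4-wide step is only notational, but it mirrors the slide's C[i:i+3] = A[i:i+3]*B[i:i+3], where one SIMD multiply would produce all four products at once; function names and the width of 4 are illustrative.

```python
# Scalar vs. 4-wide versions of the slide's elementwise-multiply loop.

def mul_scalar(A, B):
    return [a * b for a, b in zip(A, B)]

def mul_vec4(A, B):
    C = [0] * len(A)
    for i in range(0, len(A), 4):
        # one vector lane-wise multiply per iteration of the strip-mined loop
        C[i:i+4] = [a * b for a, b in zip(A[i:i+4], B[i:i+4])]
    return C
```

Both versions compute the same result; the payoff of the second form is that a backend can map each 4-element step to a single SIMD instruction.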