
Adaptive Compilation (results from current work)

Fall 2003

Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved.

Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use.

Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

Comp 512, Spring 2011

COMP 512, Rice University


Roadmap

In the lecture on optimizing for code size (Lecture 17), we looked at some early results on using genetic algorithms (GAs) to pick compilation orders that produce compact programs.

In the lecture on adaptive blocksize choice for dgemm with the MIPSPro compiler, we saw more adaptive behavior.

In this lecture:

• More experimental results

• Discussion of search algorithms other than GAs

• Insight into construction & use of adaptive compilers


We are exploring new organizing principles

• Explicit objective function (chosen by the user)

• Steering algorithm controls the optimizer and back end
  - Picks & orders methods, chooses parameters
  - Uses multiple trials to explore the solution space
  - Finds a configuration to minimize the objective function

[Diagram: the front end remains intact; a steering algorithm varies parameters of the optimizer & back end, guided by an objective function evaluated on the executable code]


These adaptive compilers

• Adapt to the program, the target, & the objective function

• Use power of modern computers to improve code quality

• Produce better results than their fixed-sequence siblings


• Are much more expensive (today) than fixed-sequence compilers


A practical way to find many optimization bugs


What’s the Point?

These compiler configurations matter ...

• Choice of transformations
  - Significant overlap in coverage
  - Each input program has different opportunities
  - Best choice depends on input, target, & objective function

• Order of transformations
  - They create & destroy opportunities for each other
  - We see significant variations (up & down) on simple codes
  - Best choice depends on input, target, & objective function

• We do not know enough, today, to predict a good sequence

• Our prototype systems search the space of sequences


Adaptive Compilation

Research prototype

• Based on MSCP compiler

• 13 transformations
  - Run in any order (not easy)
  - Many options & variants

• Search-based steering algorithms
  - Hill climber
  - Genetic algorithms
  - Constructive greedy method
  - Exhaustive enumeration

• Objective functions
  - Run-time speed
  - Code space
  - Dynamic bit-transitions


• Experimental tool
  - Exploring applications
  - Learning about the search space
  - Designing better searches


Experimental Results with Compilation Order

Early Experiments (1999, Lecture 17)

• Space, then speed (10-transformation subset)
  - 13% smaller code than the fixed sequence (0 to 41%)
  - Code was generally faster (26% to -25%; 5 of 14 slower)

• Speed, then space
  - 20% faster code than the fixed sequence (best was 26%)
  - Code was generally smaller (0 to 5%)

• Genetic algorithm
  - Evaluate each sequence
  - Replace worst + 3 at each generation
  - Generate new strings with crossover
  - Apply mutation to all, save the best

• Found a “best” sequence in 200 to 300 generations of size 20

Register-relative procedure abstraction gets 5% space, -2% speed


GA took many fewer probes to find “best” sequence than did random sampling.

NB: 10-of-10 fmin space, optimizing for size


Experience from Early Experiments

Improving the Genetic Algorithm

• Experiments aimed at understanding & improving convergence

• Larger populations help
  - Tried pools from 50 to 1000; work mostly with 50 to 100
  - 100 generations is about right

• Use weighted selection in reproductive choice
  - Fitness scaling to exaggerate late, small improvements

• Crossover
  - 2-point, random crossover
  - Apply “mutate until unique” to each new string

• Variable-length chromosomes
  - With fixed strings, the GA discovers NOPs
  - Varying strings rarely have a useless pass

Current experiments use fixed length (9) strings… 9-of-13 space
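The GA loop above can be sketched in a few lines of Python. This is a minimal illustration, not the actual research code: the 13 pass names, the string length, and the `evaluate` function are all stand-ins (the real objective function compiles and times the program).

```python
import random

PASSES = "abcdefghijklm"   # 13 hypothetical pass names (stand-ins)
LENGTH = 9                 # fixed-length strings, as in the 9-of-13 space

def evaluate(seq):
    # Stand-in objective: the real system compiles and times the program
    # under pass sequence `seq` and returns, e.g., a cycle count (lower is better).
    return sum(ord(c) * (i + 1) for i, c in enumerate(seq)) % 251

def two_point_crossover(a, b):
    # 2-point, random crossover of two parent strings.
    i, j = sorted(random.sample(range(1, LENGTH), 2))
    return a[:i] + b[i:j] + a[j:]

def mutate_until_unique(seq, seen):
    # Mutate a random position until the string has never been seen before.
    while seq in seen:
        i = random.randrange(LENGTH)
        seq = seq[:i] + random.choice(PASSES) + seq[i + 1:]
    seen.add(seq)
    return seq

def ga(pop_size=20, generations=50, replace=4):
    seen = set()
    pop = [mutate_until_unique("a" * LENGTH, seen) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=evaluate)                    # best (lowest) first
        survivors = pop[:pop_size - replace]
        # Weighted (fitness-scaled) selection: better rank, higher weight.
        weights = [len(survivors) - r for r in range(len(survivors))]
        children = []
        for _ in range(replace):
            a, b = random.choices(survivors, weights=weights, k=2)
            children.append(mutate_until_unique(two_point_crossover(a, b), seen))
        pop = survivors + children                # replace the worst strings
    return min(pop, key=evaluate)
```

Because the worst strings are replaced each generation while the best survive, the minimum fitness in the population never increases.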


Improving the Search

Two kinds of experiments

• Enumerations
  - Choose a code & a subspace
  - Evaluate every point in that subspace

• Explorations
  - Choose a code & a search algorithm
  - Run a series of searches

Enumerations are critical for understanding the space

Explorations are critical for evaluating the search algorithms


Enumerating FMIN

• One application, FMIN
  - 150 lines of Fortran, 44 basic blocks
  - Exhibited complex behavior in other experiments

• Ran the hill climber from many random starting points
  - Picked the 5 most-used passes from the winners: p, l, o, s, & n  (A bad way to pick them?)

• Enumerated the 10-of-5 subspace, fmin+plosn
  - 9,765,625 distinct combinations (5^10)
  - 34,000 to 37,000 experiments/machine/day
  - Produced each sequence & the number of cycles it takes to execute
  - 14 CPU-months to complete

• Results shed light on FMIN & the search space


Results from Enumerating FMIN

In the set of 9,765,625 sequences

• Best sequence is 986 cycles

• Seven at 987

• Worst sequence is 1716

• Unoptimized code takes 1716

• Range is 0% to 43% improvement


With full set of transformations:

GA: 822 in 184 generations of 100; 839 in 100 generations of 50

Hill climber: 843 in 1,355 probes

Local minima

• 258 strict local minima (value < all Hamming-1 neighbors)

• 27,315 non-strict local minima (value ≤ all Hamming-1 neighbors)

• All local minima are improvements

• They come in many shapes

fmin+plosn
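The strict/non-strict counts above come from checking every point against its Hamming-1 neighbors. The sketch below shows that check on a toy 4-of-5 space (so it enumerates quickly); the `evaluate` function is a hypothetical stand-in for the measured cycle counts.

```python
from itertools import product

PASSES = "plosn"   # the five passes from the fmin subspace
LENGTH = 4         # 4-of-5 here so the sketch enumerates quickly (5^4 = 625 points)

def evaluate(seq):
    # Stand-in objective function; the real one runs the compiled code
    # and returns a cycle count.
    return sum((i + 1) * PASSES.index(p) for i, p in enumerate(seq)) % 97

def hamming1_neighbors(seq):
    # Every sequence differing from `seq` in exactly one position.
    for i, cur in enumerate(seq):
        for p in PASSES:
            if p != cur:
                yield seq[:i] + p + seq[i + 1:]

def classify_minima():
    space = ["".join(s) for s in product(PASSES, repeat=LENGTH)]
    strict, nonstrict = [], []
    for seq in space:
        v = evaluate(seq)
        neigh = [evaluate(n) for n in hamming1_neighbors(seq)]
        if all(v < n for n in neigh):
            strict.append(seq)        # value < all Hamming-1 neighbors
        elif all(v <= n for n in neigh):
            nonstrict.append(seq)     # value <= all Hamming-1 neighbors
    return strict, nonstrict
```

The real enumeration did exactly this over all 9,765,625 points of the 10-of-5 space, with each `evaluate` a full compile-and-run.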


Results from Enumerating FMIN

Distribution of solutions

[Histogram: counts of sequences within +5%, +10%, and +20% of the best]

Fair number of “bad” solutions!

fmin+plosn


Results from Enumerating FMIN

Strict local minima have many shapes

132 of these spikes, from loop peeling

[Sketch: a spike minimum surrounded by its Hamming-1 points; magnitude is wall height (≥ 1)]

Landscape looks like an asteroid, in 10 dimensions

fmin+plosn


Results from Enumerating FMIN

[Histogram: local minima within +5%, +10%, +20%, and +30% of the best; the bar at 1026 reaches 14,879]

fmin+plosn

These are local minima, not all solutions


Results from Enumerating FMIN

What does the space look like?

• 4-of-5 space

• fmin+plosn

• Prefix vs suffix plot

Looks like noise to me

• Not convex

• Not differentiable

• Outright hostile to the searches

Does order of p,l,o,s, & n matter?


Results from Enumerating FMIN

What does the space look like?

This looks better

• Same space

• Different order

• Troughs, valleys, …

fmin+plosn

Unfortunately, this does not exist

• Inconsistent axes…


Results from Enumerating FMIN

What does the space look like?

(The same 4-of-5 prefix-vs-suffix plot of fmin+plosn as before: it looks like noise; not convex, not differentiable, outright hostile to the searches.)

We still need to search this space!


Searching FMIN

Steering algorithm needs to search the space for good sequences

• Hill climbers
  - Start somewhere (a random point) and walk downhill until stuck
  - To pick the next point, try steepest descent or a random downhill step

• Random probes
  - Generate sequences at random
  - If enough good sequences exist, you’re likely to hit one

• Greedy constructive algorithm
  - Repeat until done: take the best step available
  - In practice: consider prepending & appending one pass

• Genetic algorithm (of course)
  - Combines big steps (crossover) & small steps (mutation)
  - Survival of the fittest plus “mutate until unique” make it a search
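The hill climber, the simplest of these, can be sketched as follows. This is an illustrative sketch, not the research prototype: `evaluate` stands in for compiling and running fmin, and the restart count is arbitrary.

```python
import random

PASSES = "plosn"
LENGTH = 10        # the 10-of-5 fmin+plosn space

def evaluate(seq):
    # Stand-in for compiling & running fmin under pass sequence `seq`.
    return sum((i + 1) * PASSES.index(p) for i, p in enumerate(seq)) % 199

def hamming1_neighbors(seq):
    for i, cur in enumerate(seq):
        for p in PASSES:
            if p != cur:
                yield seq[:i] + p + seq[i + 1:]

def walk_downhill(seq):
    # Steepest descent: step to the lowest Hamming-1 neighbor until stuck.
    while True:
        best = min(hamming1_neighbors(seq), key=evaluate)
        if evaluate(best) >= evaluate(seq):
            return seq                 # local minimum: no lower neighbor
        seq = best

def hill_climb(restarts=50):
    # Two-pronged attack: many random starting points, each walked down.
    starts = ["".join(random.choices(PASSES, k=LENGTH)) for _ in range(restarts)]
    return min((walk_downhill(s) for s in starts), key=evaluate)
```

A random downhill variant would instead step to any lower neighbor chosen at random, trading fewer evaluations per step for a longer walk.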



Hill climbing in FMIN


Local downhill walk

• Each step moves to the lowest Hamming-1 point

• Reaches minima quickly
  - 12 steps in 10-of-5 for fmin+plosn

• Suggests a two-pronged attack
  - Pick a random point & walk to a minimum
  - Start from many random points

• Rarely finds the best point

fmin+plosn


Hill Climbing in FMIN

Narrowing the range of “good” to 5%

• Steepest descent (to the lowest Hamming-1 point)
  - Gets stuck in local minima

• Random downhill walk (take any lower Hamming-1 point)
  - Gets within 5% of the minimum 88% of the time
  - 7-step average, 18-step worst case

• Best downhill walk
  - Gets within 5% of the minimum 99% of the time

• Worst downhill walk
  - Gets within 5% of the minimum 38% of the time

Good news: can reliably get within 5%. Bad news: cannot get within 1 to 2%.

Suggests that we need to take bigger steps, as the GA does!

fmin+plosn


Searching in FMIN

Greedy Constructive Algorithm

• Start with null sequence

• Add best option to the sequence (evaluate prefixes & suffixes)

• Repeat until desired length is reached

Ties complicate the matter

• Often, multiple additions have the same fitness values

• Each may lead to a different part of the search space

• How should the algorithm handle this situation?
  - GC-EXH: explore all ties
  - GC-50: pick one at random & go forward (repeat 50 times)
  - GC-fast: use breadth-first, not depth-first, search
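One step of the greedy constructive algorithm can be sketched as below. The pass set, target length, `evaluate` stand-in, and the deterministic tie-break are all hypothetical; GC-50 would pick among the ties at random and repeat the whole construction 50 times.

```python
PASSES = "plosn"
TARGET_LEN = 6     # hypothetical target sequence length

def evaluate(seq):
    # Stand-in fitness; the real one compiles & measures the program.
    if not seq:
        return float("inf")
    return sum((i + 2) * PASSES.index(p) for i, p in enumerate(seq)) % 113

def greedy_constructive(pick_tie=0):
    # Grow from the null sequence; each step considers prepending or
    # appending one pass and keeps a best-scoring extension.
    seq = ""
    while len(seq) < TARGET_LEN:
        candidates = [p + seq for p in PASSES] + [seq + p for p in PASSES]
        best = min(evaluate(c) for c in candidates)
        ties = [c for c in candidates if evaluate(c) == best]
        # GC-EXH would explore every member of `ties`; GC-50 would pick one
        # at random.  Here: a deterministic pick, so the sketch is reproducible.
        seq = ties[pick_tie % len(ties)]
    return seq
```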


Search Results in FMIN

[Plot: instructions executed vs. sequences evaluated, fmin 9-of-13 space]

• GA-50 did best

• Tie breaking did not matter in this case; GC was cheap

• HC-50 was cost effective

• RC-3000 is not bad; GC-fast beat RC-300

fmin 9-of-13 space


Search Results in FMIN

Data on more programs

• Lots of trends here …

• GA wins (for more effort)

• GC & HC do well for less effort


Search Results

Does tie-breaking ever matter in GC?

• 91,663 sequences versus 325 sequences in adpcm

• Small variation in fitness value (see next slide)

[Chart: sequences tested]


Search Results in FMIN

Factor of 282 difference in sequences evaluated

2.2% difference in instructions executed


Road Map for our Project

Our goals

• Short term (now)
  - Characterize the problems, the potential, & the search space
  - Learn to find outstanding sequences quickly (search)

• Medium term (3 to 5 years)
  - Develop practical adaptive compilers for scalar optimization
  - Develop a framework for self-tuning optimizers
  - Develop proxies and estimators for performance (speed)

• Long term (5 to 10 years)
  - Apply these techniques to harder problems
    - Data distribution, parallelization schemes on real codes
    - Compiling for complex environments, like the Grid
  - Build systems that use knowledge from search to make adaptive compilation both practical & routine



Relating Code Properties to Sequences

Looking at the input program might let us predict good sequences

• Lots of work needed to generalize our understanding

• We want to look at four aspects of the problem
  - Search strategies
  - Objective functions
  - Program characteristics
  - Accumulated experience

• Each may contribute to finding the best sequence efficiently

With enough experience, we may be able to predict good sequences.


Making It Practical

Full-blown search may not be the answer

Many strategies are possible for a production system

• Best of k trials

• Pick k sequences from a database of many (say 5,000)
  - Might use program properties to pick the sequence

• Incrementalize the search over many compilations

• Find a good program-specific sequence and keep using it

• Find the best sequence in a fixed amount of compile time

This should be fertile ground for research & experimentation
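The first two strategies above combine naturally: draw k candidates from a stored database and keep the best. A minimal sketch, with an entirely hypothetical database and an `evaluate` stand-in for compile-and-measure:

```python
import random

def evaluate(seq):
    # Stand-in for compiling with sequence `seq` and measuring the result
    # under the chosen objective function (lower is better).
    return sum(ord(c) * (i + 1) for i, c in enumerate(seq)) % 503

def best_of_k(database, k):
    # Best of k trials: sample k stored sequences, compile with each,
    # keep whichever scores best.
    trials = random.sample(database, k)
    return min(trials, key=evaluate)

# Hypothetical database of previously discovered good sequences; a production
# system might hold thousands, indexed by program properties.
DATABASE = ["plosnplosn", "losnplospn", "snplosnpol", "polsnlospn", "nlopsnlops"]
```

Incrementalizing the search would amount to carrying the best-so-far sequence (and a few untried candidates) forward from one compilation to the next.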


Another Piece of the Puzzle

Engineering Reconfigurable Compilers

• We built ours (almost) by accident
  - Pass-structured optimizer (no surprise there)
  - Uses a definitive, low-level, linear IR
  - Each pass does its own analysis
  - Nothing is passed along unless it is written in the IR

• This eliminates most of the impediments to reordering
  - Name space for PRE & LCM
  - The OSR algorithm requires constant propagation & code motion

• Need ways to specify interactions & manage them
  - Check potential configurations for conflicts
  - Package predetermined subsequences together

• Need ways to use the results of earlier searches to improve the process

Lots of effort on building fast components


EXTRA SLIDES START HERE


Results from FMIN

Analysis (K-L divergence testing) suggests that we can obtain a good model of the distribution with 1,000 probes of the subspace

• GA used 6,000 in a 14-transformation subspace

• Need to test other subspaces

• Suggests that the search engine can build a probabilistic model
  - We are devising an algorithm for this one …

Near-term Experiments

• Running another subspace for FMIN

• Running subspaces on other, dissimilar codes

• Speed improvements to the experimental apparatus
  - Goal is 500,000 compiles/day in the lab
  - Will port the framework to the Rice TeraScale Cluster (264 IA-64s)

• Next step is to characterize program properties


Results from FMIN

Can we take advantage of clustering in the good solutions?

• Cluster is connected set of points, all below threshold value

• Clustering explains the need for multiple trials



Results from FMIN

Cluster Sizes

[Chart: number of clusters vs. distance from the best (as %)]


Results from FMIN

Looking at cluster size

• Cluster is connected set of points, all below threshold value

• Clustering explains the need for multiple trials

• Cluster numbers and shapes depend on the threshold
  - Phase transition between 2% and 2.5% (for FMIN)

• At 10%, there are 13 clusters
  - 12 are singular points involving a “peel” minimum
  - The other one contains the rest of the 10%-good points
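Cluster counting as defined above is a flood fill over the Hamming-1 graph, restricted to points below the threshold. The sketch below runs on a toy 4-of-5 space; `evaluate` is a hypothetical stand-in (offset so the best value is nonzero and the percentage threshold is meaningful).

```python
from collections import deque
from itertools import product

PASSES = "plosn"
LENGTH = 4         # small subspace so the sketch enumerates quickly

def evaluate(seq):
    # Stand-in objective; offset by 100 so a percent-of-best cutoff works.
    return 100 + sum((i + 1) * PASSES.index(p) for i, p in enumerate(seq)) % 59

def hamming1_neighbors(seq):
    for i, cur in enumerate(seq):
        for p in PASSES:
            if p != cur:
                yield seq[:i] + p + seq[i + 1:]

def clusters(threshold_pct):
    # A cluster is a connected set of points (in the Hamming-1 graph),
    # all below the threshold value.
    space = ["".join(s) for s in product(PASSES, repeat=LENGTH)]
    best = min(evaluate(s) for s in space)
    cutoff = best * (1 + threshold_pct / 100.0)
    good = {s for s in space if evaluate(s) <= cutoff}
    seen, comps = set(), []
    for start in good:
        if start in seen:
            continue
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:                       # breadth-first flood fill
            cur = queue.popleft()
            comp.append(cur)
            for n in hamming1_neighbors(cur):
                if n in good and n not in seen:
                    seen.add(n)
                    queue.append(n)
        comps.append(comp)
    return comps
```

Running this at several thresholds shows how cluster count and shape shift with the cutoff, which is exactly the phase-transition behavior observed for FMIN.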
