Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of...

40
Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware Rapidly Selecting Good Compiler Optimizations Using Performance Counters

Transcript of Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of...

Page 1: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

John Cavazos

Department ofComputer and Information Sciences

University of Delaware

Rapidly Selecting Good Compiler Optimizations

Using Performance Counters

Page 2: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Traditional Compilers

► “One size fits all” approach► Tuned for average performance► Aggressive opts often turned off► Target hard to model analytically

application

runtime system

compiler

operating system

hardware

virtualization

Page 3: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Solution

► Use performance counter characterization► Train model off-line► Counter values are “features” of program► Out-performs state-of-the-art compiler► 2 orders of magnitude faster

than pure searchapplication

runtime system

compiler

operating system

hardware

virtualization

Performance CounterInformation

Page 4: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Performance Counters

► 60 counters available► 5 categories

► FP, Branch, L1 cache, L2 cache, TLB, Others► Examples:

Mnemonic Description Avg Values►FPU_IDL (Floating Unit Idle) 0.473►VEC_INS (Vector Instructions) 0.017►BR_INS (Branch Instructions) 0.047►L1_ICH (L1 Icache Hits) 0.0006

Page 5: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of SPEC FP

Page 6: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Larger number of L1 icache misses, L1 store misses and L2 D-cache writesLarger number of L1 icache misses, L1 store misses and L2 D-cache writes

Characterization of SPEC FP

Page 7: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of MiBenchExercises the cache less than SPEC FP.Exercises the cache less than SPEC FP.

Page 8: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of MiBenchMore branches than SPEC FP and more are mispredictions.More branches than SPEC FP and more are mispredictions.

Page 9: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of 181.mcf

Problem: Greater number of memory accesses per instruction than averageProblem: Greater number of memory accesses per instruction than average

Page 10: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of 181.mcf

Problem: BUT also Branch InstructionsProblem: BUT also Branch Instructions

Page 11: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of 181.mcf

Reduce total/branch instructions and L1 I-cache and D-cache accesses.Reduce total/branch instructions and L1 I-cache and D-cache accesses.

Page 12: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Characterization of 181.mcf

Model reduces L1 cache misses which reduces L2 cache accesses.Model reduces L1 cache misses which reduces L2 cache accesses.

Page 13: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Putting Perf Counters to Use

► Important aspects of programs captured with performance counters

► Automatically construct model (PC Model)► Map performance counters to good opts

► Model predicts optimizations to apply► Uses performance counter characterization

Page 14: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Training PC Model

Compiler and

Page 15: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Programs to train model (different from test program).

Compiler and

Training PC Model

Page 16: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Baseline runs to capture performance counter values.

Compiler and

Training PC Model

Page 17: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Obtain performance counter values for a benchmark.

Compiler and

Training PC Model

Page 18: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Best optimizations runs to get speedup values.

Compiler and

Training PC Model

Page 19: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Best optimizations runs to get speedup values.

Compiler and

Training PC Model

Page 20: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

New program interested in obtaining good performance.

Compiler and

Using PC Model

Page 21: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Baseline run to capture performance counter values.

Compiler and

Using PC Model

Page 22: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Feed performance counter values to model.

Compiler and

Using PC Model

Page 23: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Model outputs a distribution that is use to generate sequences

Compiler and

Using PC Model

Page 24: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Optimization sequences drawn from distribution.

Compiler and

Using PC Model

Page 25: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

►Trained on data from Random Search

►500 evaluations for each benchmark

►Leave-one-out cross validation

►Training on N-1 benchmarks

►Test on Nth benchmark

►Logistic Regression

PC Model

Page 26: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

►Variation of ordinary regression

►Inputs

►Continuous, discrete, or a mix

►60 performance counters►All normalized to cycles executed

►Ouputs

►Restricted to two values (0,1)

►Probability an optimization is beneficial

Logistic Regression

Page 27: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

► PathScale compiler► Compare to highest optimization level► 121 compiler flags

► AMD Athlon processor► Real machine; Not simulation

► 57 benchmarks► SPEC (INT 95, INT/FP 2000), MiBench,

Polyhedral

Experimental Methodology

Page 28: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

► RAND► Randomly select 500 optimization seqs

► Combined Elimination [CGO 2006]► Pure search technique

► Evaluate optimizations one at a time► Eliminate negative optimizations in one go

► Out-performed other pure search techniques

► PC Model

Evaluated Search Strategies

Page 29: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

PC Model vs CE (MiBench/Polyhedral)

Page 30: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

1. 9 benchmarks over 20% improvement and 17% on average!2. CE uses 607 iterations (240-1550) and PC Model 25 iterations.

PC Model vs CE (MiBench/Polyhedral)

Page 31: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

PC Model vs CE (SPEC 95/SPEC 2000)

Page 32: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

1. Obtain over 25% improvement on 7 benchmarks!2. On average, CE obtains 9% and PC Model 17% over -ofast!

PC Model vs CE (SPEC 95/SPEC 2000)

Page 33: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Performance vs Evaluations

Page 34: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Random (17%)

Combined Elimination (12%)

PC Model (17%)

Performance vs Evaluations

Page 35: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

► Combined Elimination► Dependent on dimensions of space► Easily stuck in local minima

► RAND► Probabilistic technique► Depends on distribution of good points► Not susceptible to local minima

Note: CE may improve in space with many bad opts.

Why is CE worse than RAND?

Page 36: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

► Characterizing large programs hard► Performance counters effectively

summarize program's dynamic behavior► Previously* used static features [CGO 2006]

► Does not work for whole program characterization

Program Characterization

Page 37: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

► Use performance counters to find good optimization settings

► Out-performs production compiler in few evaluations (+ 3 for counters)

► 2 orders of magnitude faster than best known pure search technique

Conclusions

Page 38: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Backup Slides

Page 39: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

Static vs Dynamic Features

Page 40: Dept. of Computer and Information Sciences : University of Delaware John Cavazos Department of Computer and Information Sciences University of Delaware.

Dept. of Computer and Information Sciences : University of Delaware

1. L1 Cache Accesses2. L1 Dcache Hits3. TLB Data Misses4. Branch Instructions5. Resource Stalls6. Total Cycles7. L2 Icache Hits8. Vector Instructions

9. L2 Dcache Hits10. L2 Cache Accesses11. L1 Dcache Accesses12. Hardware Interrupts13. L2 Cache Hits14. L1 Cache Hits15. Branch Misses

Most Informative Performance Counters

Most Informative Features