Learning A Better Compiler


Page 1: Learning A Better Compiler

Learning A Better Compiler

Predicting Unroll Factors using Supervised Classification

And

Integrated CPU and L2 Cache Voltage Scaling using Machine Learning

Page 2: Learning A Better Compiler

Predicting Unroll Factors

• Loop unrolling is sensitive to the unroll factor

• Current solution: expert design
  – Difficult: hand-tuned heuristics
  – Must be rewritten frequently

• Predict parameters with machine learning
  – Easy: data collection takes ~1 week
    • No human time required
  – Algorithm does not change with the compiler

Page 3: Learning A Better Compiler

Loop Unrolling

• Combines multiple iterations of the loop body (illustrated below)

• Fewer iterations ⇒ less branching

• Allows other transformations:
  – Exposes adjacent memory locations
  – Allows instruction reordering across iterations
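A minimal illustration in C (the reduction loop and the factor of 4 are chosen for the example, not taken from the paper):

    /* Original loop: one add, one compare, and one branch per element. */
    double sum_orig(const double *a, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* Unrolled by a factor of 4: one branch per four elements, and the
     * adjacent loads a[i..i+3] become visible to later transformations
     * such as instruction reordering across iterations. */
    double sum_unrolled4(const double *a, int n) {
        double sum = 0.0;
        int i;
        for (i = 0; i + 3 < n; i += 4) {
            sum += a[i];
            sum += a[i + 1];
            sum += a[i + 2];
            sum += a[i + 3];
        }
        for (; i < n; i++)      /* remainder iterations */
            sum += a[i];
        return sum;
    }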

Page 4: Learning A Better Compiler

Unroll Factors

• How many iterations to combine?

• Too few?
  – Provides little benefit

• Too large?
  – Increased cache pressure
  – Increased live ranges ⇒ register pressure

Page 5: Learning A Better Compiler

Optimal Unroll Factors

[Figure]

Page 6: Learning A Better Compiler

Classification Problems

• Input: a vector of features (see the sketch below)
  – E.g. nest depth, # of branches, # of ops

• Output: a class
  – E.g. unroll factor, 1–8

• No prior knowledge required:
  – Meaning of features/classes
  – Relevance of features
  – Relationships between features
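As a rough sketch in C (the struct fields shown are only three of the 38 features, and the names are illustrative), the problem is a map from a feature vector to a small integer class:

    /* Illustrative framing only: a handful of per-loop features and the
     * class the learner must predict (the unroll factor, 1..8). */
    typedef struct {
        int nest_depth;     /* loop nest level        */
        int num_branches;   /* # branches in the body */
        int num_ops;        /* # operations           */
        /* ... the full feature vector has 38 entries ... */
    } loop_features;

    /* Any trained classifier is just a function of this shape. */
    typedef int (*unroll_classifier)(const loop_features *f);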

Page 7: Learning A Better Compiler

Nearest Neighbors

• The paper describes a kernel density estimator

• All dimensions normalized to [0,1]

• Given a test point p:
  – Consider training points “close” to p
    • Within a fixed distance, e.g. 0.3
  – Majority vote among qualifying training points (sketched below)
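A compact sketch of this fixed-radius voting scheme, assuming Euclidean distance over the already-normalized features (the array layout and tie-breaking are illustrative choices):

    #include <math.h>

    #define NUM_FEATURES 38   /* feature count from the slides */
    #define NUM_CLASSES  8    /* unroll factors 1..8           */

    /* Majority vote among training points within a fixed distance of p.
     * Assumes every feature has been normalized to [0,1]; ties and the
     * "no neighbor in range" case fall back to unroll factor 1. */
    int radius_nn_classify(const double train[][NUM_FEATURES], const int label[],
                           int num_train, const double p[NUM_FEATURES],
                           double radius)
    {
        int votes[NUM_CLASSES] = {0};

        for (int t = 0; t < num_train; t++) {
            double d2 = 0.0;
            for (int f = 0; f < NUM_FEATURES; f++) {
                double diff = train[t][f] - p[f];
                d2 += diff * diff;
            }
            if (sqrt(d2) <= radius)        /* "close" to p, e.g. radius = 0.3 */
                votes[label[t] - 1]++;     /* labels are 1..8                 */
        }

        int best = 1;                      /* default when no neighbor votes  */
        for (int c = 0; c < NUM_CLASSES; c++)
            if (votes[c] > votes[best - 1])
                best = c + 1;
        return best;
    }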

Page 8: Learning A Better Compiler

Nearest Neighbors

[Figure]

Page 9: Learning A Better Compiler

Support Vector Machine

• Assume two classes (easily generalized)

• Transform data
  – Make classes linearly separable

• Find the line that maximizes the separation margin

• For a test point (sketched below):
  – Perform the transformation
  – Classify based on the learned line
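A rough two-class sketch in C, assuming the transformation and the separating line (weights w, bias b) have already been learned; the identity transform is only a stand-in so the sketch is self-contained, not the paper's actual mapping:

    #define RAW_DIM 38   /* raw feature count from the slides          */
    #define DIM     38   /* transformed dimension; illustrative choice */

    /* Stand-in for the learned feature transformation; here it is the
     * identity so the sketch compiles. A real SVM would use a kernel map
     * that makes the classes linearly separable. */
    static void transform(const double *x_raw, double *x)
    {
        for (int i = 0; i < DIM; i++)
            x[i] = x_raw[i];
    }

    /* Two-class decision: which side of the maximal-margin line is x on? */
    int svm_classify(const double w[DIM], double b, const double x_raw[RAW_DIM])
    {
        double x[DIM];
        transform(x_raw, x);              /* 1. perform the transformation  */

        double score = b;
        for (int i = 0; i < DIM; i++)     /* 2. classify by the side of the */
            score += w[i] * x[i];         /*    learned separating line     */

        return score >= 0.0 ? +1 : -1;
    }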

Page 10: Learning A Better Compiler
Page 11: Learning A Better Compiler

Maximal Margin

[Figure]

Page 12: Learning A Better Compiler

Non-Linear SVM

[Figure]

Page 13: Learning A Better Compiler

Some Features

• # operands
• Live range size
• Critical path length
• # operations
• Known trip count
• # floating point ops
• Loop nest level
• # branches
• # memory ops
• Instruction fan-in in DAG
• # instructions
• Language: C or Fortran
• # implicit instructions
• & more (38 total)

Page 14: Learning A Better Compiler

Results: No Software Parallelism

[Figure]

Page 15: Learning A Better Compiler

Results: With Software Parallelism

[Figure]

Page 16: Learning A Better Compiler

Big Idea: Easy Maintenance

• Performance improvements are modest
  – Sometimes worse, sometimes much better
  – Usually little change

• Requires no re-tuning when the compiler changes
  – Gathering data takes ~1 week, with no human time

• General mechanism
  – Can be applied to all parameters
  – No model of the system needed

• Can be applied to new transformations where expert knowledge is unavailable

Page 17: Learning A Better Compiler

Integrated CPU and L2 Cache Voltage Scaling using Machine Learning

Page 18: Learning A Better Compiler

Dynamic Voltage Control

• Monitor system

• When activity is low, reduce power
  – Also reduces computational capacity
  – May need more energy if work takes longer

Page 19: Learning A Better Compiler

Multiple Clock Domains

• Adjust separate components independently

• Better performance/power
  – E.g. a CPU-bound application may be able to decrease power to memory and cache without affecting performance

• More complex DVM policy

Page 20: Learning A Better Compiler

Motivation

• Applications go through phases

• Frequency/voltages should change too

• Focus on the core and L2 cache
  – They consume a large fraction of total power

• Best policy may change over time
  – On battery: conserve power
  – Plugged in: maximize performance

Page 21: Learning A Better Compiler

Learning a DVM Policy

• Compiler automatically instruments code
  – Inserts sampling code to record performance counters
  – Code is instrumented only to gather data

• Use machine learning to create policy

• Implement policy in microcontroller

Page 22: Learning A Better Compiler

ML Parameters

• Features (see the sketch below):
  – Clock cycles per instruction (CPI)
  – L2 accesses per instruction (L2PI)
  – Memory accesses per instruction

• Select voltages to minimize either:
  – Total energy
  – Energy × delay
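A small sketch of how these three features might be derived from raw counter values gathered over one sampling interval (the struct and function names are illustrative, not the paper's interface):

    /* Per-interval features used to learn and drive the DVM policy. */
    struct dvm_sample {
        double cpi;    /* clock cycles per instruction    */
        double l2pi;   /* L2 accesses per instruction     */
        double mpi;    /* memory accesses per instruction */
    };

    /* Turn raw performance-counter deltas for one sampling interval
     * into the per-instruction ratios listed above. */
    struct dvm_sample make_sample(unsigned long cycles, unsigned long insns,
                                  unsigned long l2_accesses,
                                  unsigned long mem_accesses)
    {
        struct dvm_sample s;
        double n = insns ? (double)insns : 1.0;   /* avoid divide-by-zero */
        s.cpi  = (double)cycles       / n;
        s.l2pi = (double)l2_accesses  / n;
        s.mpi  = (double)mem_accesses / n;
        return s;
    }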

Page 23: Learning A Better Compiler

Machine Learning Algorithm

• Automatically learn a set of if-then rules (see the sketch below)
  – E.g.: if (L2PI >= 1) and (CPI <= 0) then f_cache = 1 GHz

• Compact, expressive

• Can be implemented in hardware
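The example rule can be written directly as a tiny decision procedure; everything other than that one rule (the default frequencies and the fall-through structure) is an assumption made for the sketch:

    /* Frequency settings (in GHz) chosen by the policy for the next interval. */
    struct dvm_setting {
        double f_core;
        double f_cache;
    };

    /* Learned if-then rule set, evaluated once per sampling interval.
     * The first rule is the slide's example:
     *   if (L2PI >= 1) and (CPI <= 0) then f_cache = 1 GHz.
     * The 2 GHz defaults used when no rule fires are illustrative,
     * not values from the paper. */
    struct dvm_setting dvm_policy(double cpi, double l2pi)
    {
        struct dvm_setting s = { 2.0, 2.0 };   /* assumed full-speed default */

        if (l2pi >= 1.0 && cpi <= 0.0)
            s.f_cache = 1.0;

        /* ... additional learned rules would follow here ... */

        return s;
    }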

Page 24: Learning A Better Compiler

[Figure]

Page 25: Learning A Better Compiler

Results

• Compared to independently managing the core and L2:
  – Saves 22% on average, 46% at maximum

• Learns effective rules from a few features

• Compiler modifications instrument the code

• Policy is learned offline

• Policy is implemented in a microcontroller

Page 26: Learning A Better Compiler

Conclusion

• Machine learning derives models from data automatically

• Allows easy maintenance of heuristics

• Creates models that are more effective than hand-tuned ones