Learning A Better Compiler

Post on 02-Jan-2016

21 views 0 download

description

Learning A Better Compiler. Predicting Unroll Factors using Supervised Classification And Integrating CPU and L2 Cache Voltage Scaling using Machine Learning. Predicting Unroll Factors. Loop Unrolling sensitive to unroll factor Current solution: expert design - PowerPoint PPT Presentation

Transcript of Learning A Better Compiler

Learning A Better Compiler

Predicting Unroll Factors using Supervised Classification

And

Integrating CPU and L2 Cache Voltage Scaling using Machine Learning

Predicting Unroll Factors

• Loop Unrolling sensitive to unroll factor

• Current solution: expert design– Difficult: Hand-tuned heuristics– Must be rewritten frequently

• Predict parameters with machine learning– Easy: data collection takes ~1wk

• No human time

– Algorithm does not change with compiler

Loop Unrolling

• Combines multiple iterations loop body

• Fewer Iterations Less Branching

• Allows other transformations:– Exposes adjacent memory locations– Allows instruction reordering across

iterations

Unroll Factors

• How many iterations to combine?

• Too few?– Provides little benefit

• Too large– Increased cache pressure– Increase live rangeregister pressure

QuickTime™ and a decompressor

are needed to see this picture.

Optimal Unroll Factors

Classification Problems

• Input a vector of features– E.g. nest depth, # of branches, # of ops

• Output a class– E.g. unroll factor, 1-8

• No prior knowledge required– Meaning of features/classes– Relevance of features– Relationships between features

Nearest Neighbors

• Paper describes Kernel Density Estimator

• All dimensions normalized to [0,1]

• Given a test point p:– Consider training points “close” to p

• Within fixed distance, e.g. 0.3

– Majority vote among qualifying training points

Nearest Neighbors

QuickTime™ and a decompressor

are needed to see this picture.

Support Vector Machine

• Assume two classes, easily generalized

• Transform data– Make classes linearly separable

• Find line to maximize sep. margin

• For test point:– Perform transformation– Classify based on learned line

Maximal Margin

QuickTime™ and a decompressor

are needed to see this picture.

Non-Linear SVM

QuickTime™ and a decompressor

are needed to see this picture.

Some Features

• # operands• Live range size• Critical path length• # operations• Known tripcount• # floating point ops• Loop nest level• # branches

• # memory ops• Instruction fan-in in

DAG• # instructions• Language: C, fortran• # memory ops• # Implicit instructions• & more (38 total)

Results: No Software Parallelism

QuickTime™ and a decompressor

are needed to see this picture.

Results: With Software Parallelism

QuickTime™ and a decompressor

are needed to see this picture.

Big Idea: Easy Maintenance• Performance improvements modest

– Sometimes worse, sometimes much better– Usually little change

• Requires no re-tuning to change compiler– Gathering data takes ~1wk, no human time

• General mechanism– Can be applied to all parameters– No model of system needed

• Can be applied to new transformations where expert knowledge is unavailable

Integrated CPU and L2 Cache Voltage Scaling using Machine

Learning

Dynamic Voltage Control

• Monitor system

• When activity is low, reduce power– Also reduces computational capacity– May need more energy if work takes longer

Multiple Clock Domains

• Adjust separate components independently

• Better performance/power– E.g. CPU-bound application may be able to

decrease power to memory and cache without affecting performance

• More complex DVM policy

Motivation

• Applications go through phases

• Frequency/voltages should change too

• Focus on core, L2 cache– Consume large fraction of total power

• Best policy may change over time– On battery: conserve power– Plugged in: maximize performance

Learning a DVM Policy

• Compiler automatically instruments code– Insert sampling code to record perf. Counters– Instrument code only to gather data

• Use machine learning to create policy

• Implement policy in microcontroller

ML Parameters

• Features– Clock cycles per instruction– L2 accesses per instruction– Memory access per instruction

• Select voltage to minimize:– Total energy– Energy*delay

Machine Learning Algorithm

• Automatically learn set of if-then rules– E.g: If (L2PI >= 1) and (CPI <=0) then

f_cache=1GHz

• Compact, expressive

• Can be implemented in hardware

QuickTime™ and a decompressor

are needed to see this picture.

Results

• Compared to independently managing core and L2:– Saves 22% on average, 46% max

• Learns effective rules from few features• Compiler modifications instrument code• Learned policy offline• Implemented policy in microcontroller

Conclusion

• Machine learning derives models from data automatically

• Allows easy maintenance of heuristics

• Creates models that are more effective than hand-tuned