Framework for Profile-Analysis Data-Layout Optimizations
1
Framework for Profile-Analysis Data-Layout Optimizations
Shai Rubin, Ras Bodik, Trishul Chilimbi
University of Wisconsin, University of Wisconsin, Microsoft Research
2
Data Layout Optimization (What)
[Figure: memory hierarchy with access latencies. CPU to cache: 1 cycle; memory: 10^2 cycles; disk: 10^6 cycles.]
Reference sequence: A.x, B, A.z
[Figure: original data layout versus modified data layout. For each layout, the placement of objects A and B (fields A.x and A.z) is shown over time against cache blocks 1 to 4 and memory pages 1 to 2.]
• DL optimization: increase spatial locality of data to prevent memory faults.
3
Data Layout Optimization (How)
• Flow: Program → reference summary → Data Layout Optimizer → data layout (a point in the layout space) → Enforce layout → Program′
• Scientific (array based) programs:
– Reference summary: array dependence analysis (static).
– Layout: optimal for simple loops.
– Enforced at compile time.
• General purpose (pointer based) programs:
– Reference summary: reference trace (dynamic).
– Layout: heuristic, a “good” layout.
– Enforced at (1) compile time or (2) runtime.
4
Problems with Current Data-Layout Optimization
• Computationally hard to find the optimal layout [Petrank].
• Computationally hard to approximate the optimal layout [Petrank].
• Implication: heuristics are not robust.
– They will not work for all programs.
• From our experience with heuristics:
– Field Reordering [Chilimbi PLDI’99]: no improvement (on perl).
– Custom Memory Allocator [Seidl ASPLOS’98]: degrades performance (on espresso).
• Our approach: replace the heuristic with a feedback-driven search.
5
Searching For a Data Layout
• Problem: perform a search in the data layout space.
• Look for:
– a “good” layout.
– an “easy” to enforce layout.
• Search advantage: robust; for each program it finds a “good” layout.
[Figure: the data layout space, marking the current program data layout, the optimal data layout, the region of “good” layouts, and the region of “good” + “easy” to enforce layouts.]
6
Is Search Practical?
• Not clear:
[Figure: search loop. Reference trace → Optimizer (heuristic) → data layout (one of the possible layouts) → Enforce layout (Edit, Compile, Execute) → Evaluate → Continue? → repeat or End.]
7
Outline
• Background and Problem Definition
• Search is a solution, but may not be practical
– Making the search practical
• Applications
• Summary
8
Making the Search Practical
Framework for Data Layout Optimization
[Figure: a Data Layout Search Engine takes the place of the Edit, Compile, Execute, Evaluate, Continue? loop. Its components:]
• Compress(T) → CST: turns the reference trace T into a compressed symbolic trace.
• Data Object Analysis, DOA(CST, LS) → NLS: narrows the layout space (class splitting, linearization, field reordering) to “good” and enforceable layouts, the narrowed space.
• Layout Selector, LS(NLS, B, CST, SS) → DL: chooses a data layout using the search strategy SS and the benefit B.
• Enforce Layout, AL(DL, CST) → NT: produces a new trace.
• Evaluate, Simulate(NT) → B: computes the benefit.
• Continue(B): decides whether to search further or End.
9
Trace Representation
• Problem: the reference trace cannot be easily manipulated since it is too large (>10GB, >100M references).
• Solution: compressed trace (using modified SEQUITUR).
• Example:
- Trace: acbcbcbcbdbdbdbde
- SEQUITUR representation: S → a c B B B A A e; B → b c; A → C C; C → b d
• Representation advantages:
- Compact; fits into main memory [Chilimbi PLDI’01].
- Exposes repetitions (we use this later).
- Produces a symbolic trace (i.e., a terminal is a data object).
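As a sanity check on the example, the grammar can be expanded back into the trace. The sketch below only expands a grammar; it is not SEQUITUR itself (which builds the grammar), and the dictionary encoding and the names `GRAMMAR` and `expand` are assumptions made for illustration.

```python
# Grammar from the slide: S -> a c B B B A A e; B -> b c; A -> C C; C -> b d
GRAMMAR = {
    "S": ["a", "c", "B", "B", "B", "A", "A", "e"],
    "B": ["b", "c"],
    "A": ["C", "C"],
    "C": ["b", "d"],
}


def expand(symbol, grammar):
    """Recursively expand a symbol; anything without a rule is a terminal."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])


trace = expand("S", GRAMMAR)
grammar_size = sum(len(body) for body in GRAMMAR.values())
```

Here the grammar holds 14 symbols for a 17-reference trace; the saving grows with the amount of repetition in the trace.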
10
Framework for Data-Layout Optimization
[Figure: the framework diagram from the earlier slide, repeated. Enforce Layout is now written EL(DL, CST) → CST′.]
11
Avoid re-compilation
• Problem: evaluating a data layout is expensive (edit + compilation + simulation).
• Solution: “pretend” that the program was edited and compiled.
• Symbolic trace + data layout → concrete address trace.
• Simple, but crucial for an efficient search.
[Figure: a single symbolic trace A.x, B, A.z, B and two layouts, A.x→10, A.z→14, B→20 and A.x→30, A.z→34, B→20. The user path (edit program, compile, run) and the optimizer path (Enforce Layout, simulate) both arrive at the new concrete trace 30, 20, 34, 20.]
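A minimal sketch of the "symbolic trace + data layout → concrete address trace" step, using the addresses from the slide. The function name `enforce_layout` and the dictionary representation of a layout are assumptions for illustration, not the framework's actual interface.

```python
def enforce_layout(symbolic_trace, layout):
    """Map every symbolic reference to the address its layout assigns."""
    return [layout[obj] for obj in symbolic_trace]


symbolic_trace = ["A.x", "B", "A.z", "B"]

# The two layouts that appear on the slide.
layout_before = {"A.x": 10, "A.z": 14, "B": 20}
layout_after = {"A.x": 30, "A.z": 34, "B": 20}

trace_before = enforce_layout(symbolic_trace, layout_before)
trace_after = enforce_layout(symbolic_trace, layout_after)
```

The same symbolic trace is reused for every candidate layout, so no edit, compile, or run step is needed to obtain the new address trace.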
12
Framework for Data-Layout Optimization
[Figure: the framework diagram from the earlier slide, repeated. Enforce Layout is EL(DL, CST) → CST′ and Evaluate is now written Simulate(CST′) → B.]
13
Memoization: Efficient Trace Simulation
• Evaluation using simulation: $\text{MissRate}_T = \text{Simulate}(T)$
• Problem: simulation of the whole trace (T) is too expensive.
• Solution: avoid re-simulation of repeated sub-traces.
• Memoization:
1. Simulate each “low level” rule and compute its memoization value.
− For cache simulation: memoization value = CacheState [CS].
2. Recursively compose memoization values for “higher” rules.
• Example, T: bcbcbcbdbdbdbd
– SEQUITUR representation: S → B B B A A; B → b c; A → C C; C → b d
– $CS_C = \text{Simulate}'(C)$, $CS_B = \text{Simulate}'(B)$
– $CS_A = CS_C \circ CS_C$
– $CS_S = CS_B \circ CS_B \circ CS_B \circ CS_A \circ CS_A$
– $\text{MissRate}_T = \dfrac{CS_S.\text{Misses}}{\text{Length}(T)}$
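One way to make the composition step concrete is sketched below for a direct-mapped cache. The slides do not say what a CacheState contains, so the summary used here (first and last block per cache set, plus state-independent misses) is an assumption, as are the cache size, the block numbers in `BLOCK_OF`, and all function names. Each rule is summarized once and reused at every occurrence; a plain simulation is included only as a cross-check.

```python
from dataclasses import dataclass

NUM_SETS = 2  # assumed direct-mapped cache with two sets, one object per block


@dataclass
class CacheSummary:
    first: dict  # set index -> first block the sub-trace touches in that set
    last: dict   # set index -> block left in that set when the sub-trace ends
    inner: int   # misses that occur whatever the incoming cache state is


def leaf(block):
    s = block % NUM_SETS
    return CacheSummary({s: block}, {s: block}, 0)


def compose(x, y):
    """Summary of sub-trace x followed by sub-trace y."""
    boundary = sum(1 for s, b in y.first.items() if s in x.last and x.last[s] != b)
    return CacheSummary({**y.first, **x.first}, {**x.last, **y.last},
                        x.inner + y.inner + boundary)


def summarize(symbol, grammar, block_of, memo):
    """Simulate each rule once; reuse its summary at every later occurrence."""
    if symbol not in grammar:            # terminal: a single data object
        return leaf(block_of[symbol])
    if symbol not in memo:
        parts = [summarize(s, grammar, block_of, memo) for s in grammar[symbol]]
        total = parts[0]
        for part in parts[1:]:
            total = compose(total, part)
        memo[symbol] = total
    return memo[symbol]


def cold_misses(summary):
    # Starting from an empty cache, the first access to each set also misses.
    return summary.inner + len(summary.first)


def simulate(trace, block_of):
    """Reference: plain simulation of the whole trace."""
    cache, misses = {}, 0
    for obj in trace:
        block = block_of[obj]
        if cache.get(block % NUM_SETS) != block:
            misses += 1
            cache[block % NUM_SETS] = block
    return misses


GRAMMAR = {"S": ["B", "B", "B", "A", "A"], "B": ["b", "c"],
           "A": ["C", "C"], "C": ["b", "d"]}
BLOCK_OF = {"b": 0, "c": 1, "d": 2}   # hypothetical layout
TRACE = "bcbcbcbdbdbdbd"

memo = {}
misses = cold_misses(summarize("S", GRAMMAR, BLOCK_OF, memo))
miss_rate = misses / len(TRACE)
```

With this layout, b and d conflict in the same cache set, and the composed summary reports the same miss count as the full simulation while simulating rule C only once. Set-associative caches would need a richer summary.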
14
Outline
• Background and Problem Definition
• Search is a solution, but may not be feasible
– Making the search practical:
  • Trace representation
  • Avoid recompilation
  • Efficient simulation
• Applications
• Summary
15
Framework Application (1)
• Application: an implementation of the framework that searches in a sub-space of the layout space.
• Field Reordering:
– Objective: reduce number of cache misses.
– Sub-space: all possible (legal) orders of fields in (heap) objects.
– Our search strategy: (almost) exhaustive search.
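A toy version of exhaustive search over field orders is sketched below. The cache model (a single cached block holding two fields), the field names, and the access trace are all hypothetical; the sketch only illustrates enumerating every order and keeping the one with the fewest simulated misses.

```python
from itertools import permutations

FIELDS_PER_BLOCK = 2  # assumed: a cache block holds two fields


def count_misses(order, trace):
    """Simulate a cache that holds one block at a time."""
    block_of = {field: i // FIELDS_PER_BLOCK for i, field in enumerate(order)}
    cached, misses = None, 0
    for field in trace:
        if block_of[field] != cached:
            misses += 1
            cached = block_of[field]
    return misses


def best_order(fields, trace):
    """Try every field order and keep the one with the fewest misses."""
    return min(permutations(fields), key=lambda order: count_misses(order, trace))


fields = ["x", "y", "z"]
trace = ["x", "z", "x", "z", "y"]   # hypothetical field-access trace

original_misses = count_misses(fields, trace)
best = best_order(fields, trace)
best_misses = count_misses(best, trace)
```

In this toy trace the best order places x and z in the same block. The number of orders grows factorially with the field count, which is presumably why the slide describes the search as only "(almost)" exhaustive.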
16
Field Reordering: Exhaustive Search
• We compared:
– Best field order found by our iterative search.
– Field orders produced by existing heuristics:
• Fields Temporal Affinity [Chilimbi PLDI’99].
• Fields Access Frequency [Truong PACT’98].
[Figure: bar chart of miss rate reduction (y-axis from -10% to 50%) for perl, twolf, and boxsim; series: iteration, affinity, frequency.]
Runtime improvement: 0%-4.5%.
17
Custom Memory Allocator (CMA)
• Objective: reduce number of page faults.
• CMA can work well if it has a good placement function: one that assigns dynamically allocated heap objects to memory pages (heaps).
[Figure: address-versus-time plots for the reference trace ABABA, showing where objects A and B fall on Page 1 and Page 2. Allocator 1: poor locality. Allocator 2: good locality.]
18
CMA Placement Function (PF)
malloc(size s){ }
• PF: map objects to heaps. PF(heap object) → int
• How can we find a placement function using our framework?
– A placement function defines a data layout.
– Learn by measuring the benefits of its data layout.
– How: use a learning algorithm.
[Figure: Profile(heap objects) yields profiling information (runtime attributes), which feeds a Learner that outputs PF(Attributes) → int; the framework is used to evaluate the PF. Example output: a decision tree on Size with branches size<24 and size≥24 leading to heaps 1 and 2.]
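The decision tree in the figure can be read as a small placement function. The sketch below assumes that objects smaller than 24 bytes go to heap 1 and all others to heap 2; the slide shows the split point and the two heap numbers, but the branch-to-heap mapping, the function names, and the sample objects are assumptions.

```python
SIZE_THRESHOLD = 24  # split point shown in the decision tree


def placement(size):
    """PF(attributes) -> heap id; the branch-to-heap mapping is assumed."""
    return 1 if size < SIZE_THRESHOLD else 2


def assign_heaps(object_sizes):
    """Group heap objects by the heap the placement function selects."""
    heaps = {}
    for name, size in object_sizes.items():
        heaps.setdefault(placement(size), []).append(name)
    return heaps


# Hypothetical allocation requests: object name -> requested size in bytes.
heaps = assign_heaps({"node": 16, "buffer": 64, "edge": 8})
```

A learned placement function could branch on other runtime attributes as well; size is simply the one attribute visible in the slide's example.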
19
CMA Results
Program Number of heaps
Espresso 2
Boxsim 8
Twolf 5
Perl 5
Ghostscript 10
Lp_solve 6
[Figure: two bar charts of working set (WS) size reduction in %, relative to original working set size, for Espresso, Boxsim, Twolf, Perl, GhostScript, and lp_solve. First chart: test input, y-axis 0 to 18. Second chart: train input and test input, y-axis 0 to 20.]
20
Contributions and Future Work
• Formulate data layout optimization as a search process.
• Build a framework for an efficient search process.
• Improve existing optimizations; enable new optimizations.
• Framework limitations:
– Difficult to handle very large traces (>0.5B references).
– Requires some guidance from the programmer (search strategy).
• Future work:
– Advanced search strategies that combine several optimizations.
– Other, non-data-layout optimizations (prefetching).