Chameleon Automatic Selection of Collections
Ohad Shacham Martin Vechev Eran Yahav
Tel Aviv University IBM T.J. Watson Research Center
Presented by: Yingyi Bu
Collections
Abstract data types
Many implementations
Different space/time tradeoffs
Incompatible selection might lead to
  runtime degradation
  space bloat: wasted space
Set: ArraySet, HashSet, LinkedSet, LazySet
Map: ArrayMap, HashMap, LinkedMap, LazyMap
List: ArrayList, LinkedList, LazyList
Collection Bloat
Collection bloat is unjustified space overhead for storing data in collections
List s = new ArrayList();
s.add(1);
Bloat for s is 9 (the default capacity is 10 slots; 1 is used)
Collection Bloat
Collection bloat is a serious problem in practice
  Observed to occupy 90% of the heap in real-world applications
Hard to detect and fix
  Accumulation: death by a thousand cuts
  Correction: need to correlate bloat to program code
How to pick the right implementation?
  Minimize bloat
  But without degrading running time
Our Vision
Programmer declares the ADT to be used
  Set s = new Set();
Programmer defines what metric to optimize
  e.g. space-time
Runtime automatically selects implementation based on metric
  Online: detect application usage of Set
  Online: select appropriate implementation of Set
Set: ArraySet, HashSet, LinkedSet, …
This Work
Programmer defines the implementation to be used
  Set s = new HashSet();
Programmer defines what metric to optimize
  space-time product
  Space = Bloat
Runtime suggests implementation based on metric
  Online: automatically detect application usage of HashSet()
  Online: automatically suggest alternative to HashSet()
  Offline: programmer modifies program accordingly
    e.g. Set s = new ArraySet();
How Can We Calculate Bloat?
Data structure bloat = Occupied Data - Used Data
Example:
  List s = new ArrayList();
  s.add(1);
  Bloat for s is 9
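The subtraction above can be made concrete with a small sketch. This is not Chameleon's code: SimpleArrayList is a hypothetical stand-in for an array-backed list, and the capacity of 10 and the doubling growth policy are assumptions chosen to match the slide's example.

```java
import java.util.Arrays;

// Hypothetical sketch, not Chameleon's code: an array-backed list that
// exposes how many slots it occupies and how many it actually uses.
final class BloatSketch {
    static final class SimpleArrayList {
        private Object[] elements;
        private int size;

        SimpleArrayList(int initialCapacity) {
            elements = new Object[initialCapacity];
        }

        void add(Object value) {
            if (size == elements.length) {
                // Assumed growth policy: double the backing array (at least 1 slot).
                elements = Arrays.copyOf(elements, Math.max(1, elements.length * 2));
            }
            elements[size++] = value;
        }

        int usedSlots() { return size; }

        int occupiedSlots() { return elements.length; }

        // Bloat = Occupied Data - Used Data, measured here in slots.
        int bloat() { return occupiedSlots() - usedSlots(); }
    }

    public static void main(String[] args) {
        SimpleArrayList s = new SimpleArrayList(10);
        s.add(1);
        System.out.println("Bloat for s is " + s.bloat()); // prints "Bloat for s is 9"
    }
}
```

Measuring in slots keeps the example short; a real measurement would count bytes, including object headers.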
How to Detect Collection Bloat?
Each collection maintains a field for used data
Language runtime can find out actually occupied data
  Bloat = Occupied Data - Used Data
Solution: garbage collector computes bloat online
  Reads used data fields from collections
  Low-overhead: can work online in production
ArrayList fields: … int size … Object[] Array …
Semantic Maps
How collections communicate information to the GC
  Includes size and pointers to actual data fields
  Allows for trivial support of custom collections
[Diagram: the GC consults the ArrayList semantic map and the HashMap semantic map (HashMap fields: … elementCount … elementData …); each semantic map identifies Used Data and Occupied Data]
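One way to picture a semantic map is as a small descriptor that tells a heap walker how to read the used and occupied sizes of a collection type. The sketch below is an illustration under assumptions, not the mechanism inside the JVM: SemanticMap, Registry, and IntBag are hypothetical names, and the "heap" is simply a list of live objects.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the semantic-map idea; all names are illustrative.
final class SemanticMapSketch {
    /** Describes how to read used and occupied slot counts from a collection. */
    interface SemanticMap<T> {
        int used(T collection);
        int occupied(T collection);
    }

    /** A custom collection with a fixed capacity (add() fails when full). */
    static final class IntBag {
        int count;
        int[] data;

        IntBag(int capacity) { data = new int[capacity]; }

        void add(int value) { data[count++] = value; }
    }

    /** Semantic map for IntBag: 'count' is the used data, 'data' the occupied data. */
    static SemanticMap<IntBag> intBagMap() {
        return new SemanticMap<IntBag>() {
            @Override public int used(IntBag bag) { return bag.count; }
            @Override public int occupied(IntBag bag) { return bag.data.length; }
        };
    }

    /** Stands in for the collector: sums bloat over objects that have a map. */
    static final class Registry {
        private final Map<Class<?>, SemanticMap<?>> maps = new HashMap<>();

        <T> void register(Class<T> type, SemanticMap<T> map) {
            maps.put(type, map);
        }

        @SuppressWarnings("unchecked")
        int totalBloat(List<?> liveObjects) {
            int bloat = 0;
            for (Object o : liveObjects) {
                SemanticMap<Object> map = (SemanticMap<Object>) maps.get(o.getClass());
                if (map != null) {
                    bloat += map.occupied(o) - map.used(o);
                }
            }
            return bloat;
        }
    }
}
```

Because the walker only depends on the registered descriptor, supporting a custom collection amounts to registering one more map.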
Example: Collections Bloat in TVLA
[Chart: % Live Data (y-axis, 0 to 80) over GC (x-axis, 1 to 451) for TVLA; series: Live Collections, Used Collections, Core Collections; annotation: Lower bound for bloat]
Fixing Bloat
Must correlate all bloat stats to program point
Need trace information
Remember: do not want to degrade time
Correlating Code and Bloat
public final class ConcreteKAryPredicate extends ConcretePredicate {
    …
    public void modify() {
        …
        values = HashMapFactory.make(this.values);
    }
    …
}

public class GenericBlur extends Blur {
    …
    public void blur(TVS structure) {
        …
        Map invCanonicName = HashMapFactory.make(structure.nodes().size());
        …
    }
}

public class HashMapFactory {
    public static Map make(int size) {
        return new HashMap(size);
    }
}
[Chart: Ctx1 40%, Ctx2 11%, Ctx3 5%, Ctx4 7%, Ctx5 5%, Ctx6 3%, Ctx7 7%, Ctx8 3%]
Aggregate bloat potential per allocation context
  Done by the garbage collector
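The per-context aggregation can be sketched as a running sum keyed by allocation context. This is an illustration only: ContextBloatSketch and its methods are hypothetical names, contexts are plain strings, and expressing potential as a percentage of the live heap is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: sums bloat per allocation context, as a collector
// might do while it visits live collections. Names are illustrative.
final class ContextBloatSketch {
    private final Map<String, Long> bloatByContext = new LinkedHashMap<>();

    /** Adds one collection's bloat (occupied - used) to its allocation context. */
    void record(String allocationContext, long occupiedBytes, long usedBytes) {
        bloatByContext.merge(allocationContext, occupiedBytes - usedBytes, Long::sum);
    }

    /** Bloat of a context as a percentage of the live heap (assumed metric). */
    double potentialPercent(String allocationContext, long liveHeapBytes) {
        long bloat = bloatByContext.getOrDefault(allocationContext, 0L);
        return 100.0 * bloat / liveHeapBytes;
    }
}
```

For example, recording (100, 40) and (50, 10) for the same context against a 1000-byte live heap yields a potential of 10%.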
Trace Information
Track collection usage in library
  Distribution of operations
  Distribution of size
Aggregated per allocation context
  ctx1: Size = 7, Get = 3, Add = 9, …
  ctx2: Size = 1, Contains = 100, Insert = 1, …
  ctx3: Size = 103, Contains = 10041, Insert = 140, Remove = 20, …
  ctxi: …
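A minimal sketch of how a library could keep such per-context counters follows. It assumes contexts are identified by strings and that the library reports each operation together with the collection size afterwards; TraceSketch, ContextStats, and Op are hypothetical names, not Chameleon's API.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-context trace statistics; names are illustrative.
final class TraceSketch {
    enum Op { ADD, GET, CONTAINS, REMOVE }

    /** Operation counts and maximum observed size for one allocation context. */
    static final class ContextStats {
        final Map<Op, Long> opCounts = new EnumMap<>(Op.class);
        int maxSize;

        void recordOp(Op op, int sizeAfter) {
            opCounts.merge(op, 1L, Long::sum);
            maxSize = Math.max(maxSize, sizeAfter);
        }

        long count(Op op) {
            return opCounts.getOrDefault(op, 0L);
        }
    }

    private final Map<String, ContextStats> statsByContext = new HashMap<>();

    /** Returns the statistics record for a context, creating it on first use. */
    ContextStats forContext(String allocationContext) {
        return statsByContext.computeIfAbsent(allocationContext, key -> new ContextStats());
    }
}
```

All collections allocated at the same context share one ContextStats, which is what lets a later suggestion apply to the allocation site rather than to a single object.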
But how to choose the new Collection?
Rule Engine: user defined rules
  Input: heap and trace statistics per context
  Output: suggested collection for that context
Rules based on trace and heap information
  HashMap: #contains < X and CollmaxSize < Y → ArrayMap
  HashMap: #contains < X and CollmaxSize < Y+10 and %liveHeap > Z → ArrayMap
  HashMap: maxSize < X → ArrayMap
  LinkedList: NoListOp → ArrayList
  …
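A rule of this shape can be modeled as a source collection type, a condition over the per-context statistics, and a suggested replacement. The sketch below is only an illustration: RuleEngineSketch and ContextProfile are hypothetical names, the thresholds stand in for the X, Y, and Z placeholders, and returning the first matching rule is an assumed policy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical sketch of a rule engine over per-context statistics.
final class RuleEngineSketch {
    /** Heap and trace statistics for one allocation context (illustrative fields). */
    static final class ContextProfile {
        final String collectionType;
        final long containsCount;
        final int maxSize;
        final double liveHeapPercent;

        ContextProfile(String collectionType, long containsCount, int maxSize, double liveHeapPercent) {
            this.collectionType = collectionType;
            this.containsCount = containsCount;
            this.maxSize = maxSize;
            this.liveHeapPercent = liveHeapPercent;
        }
    }

    private static final class Rule {
        final String sourceType;
        final Predicate<ContextProfile> condition;
        final String suggestion;

        Rule(String sourceType, Predicate<ContextProfile> condition, String suggestion) {
            this.sourceType = sourceType;
            this.condition = condition;
            this.suggestion = suggestion;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    void addRule(String sourceType, Predicate<ContextProfile> condition, String suggestion) {
        rules.add(new Rule(sourceType, condition, suggestion));
    }

    /** Returns the suggestion of the first matching rule, if any (assumed policy). */
    Optional<String> suggest(ContextProfile profile) {
        for (Rule rule : rules) {
            if (rule.sourceType.equals(profile.collectionType) && rule.condition.test(profile)) {
                return Optional.of(rule.suggestion);
            }
        }
        return Optional.empty();
    }
}
```

A rule such as "HashMap with few contains calls and a small maximum size" would then be registered as addRule("HashMap", p -> p.containsCount < 1000 && p.maxSize < 16, "ArrayMap"), where 1000 and 16 are made-up thresholds.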
Overall Picture
[Diagram: the Program, Semantic maps, and Rules are the inputs; the Semantic Profiler gathers per-context statistics (e.g. ctx1: Size = 7, Get = 3, Add = 9, …) for the Rule Engine; the outputs are Recommendations and a Potential report]
Correct Collection Bloat – Typical Usage
Step 1: Profile for bloat without context
  Low-overhead, can run in production
  If problem detected, go to step 2
  Automatic
Step 2: Combine heap information with trace information per context
  Can switch automatically to step 2 from step 1
  Higher-overhead than step 1
  Automatic: prior to Chameleon, a manual step (very hard)
Step 3: Suggest fixes to user based on rules
  Automatic
Step 4: Programmer applies suggested fixes
  Manual
Chameleon on TVLA
1: HashMap:tvla...HashMapFactory:31; tvla.core.base.BaseTVS:50 replace with ArrayMap
…
4: ArrayList:BaseHashTVSSet:112; tvla...base.BaseHashTVSSet:60 set initial capacity

Size     Context 1   Context 2   Context 3   Context 4
Max      15          26          7           7
Avg      11.33       6.31        4.8         4.8
Stddev   1.36        5.05        1.17        1.17

[Chart: rows labeled Potential, Operations, and Size for contexts 1 to 4; y-axis "% Live Data over used", 0 to 12]
Implementation
Built on top of IBM’s JVM
Modifications to Parallel Mark and Sweep GC
  Modular changes, readily applicable to other GCs
Modifications to collection libraries
Runtime overhead
  Detection phase: negligible
  Correction phase: ~2x (due to cost of getting context)
  Can use Probabilistic Calling Context (PCC) by Bond & McKinley
Experimental Results – Memory
[Bar chart: Minimal Heap (%), 0 to 100, for TVLA, FindBugs, PMD, Bloat, Fop, and Soot]
Experimental Results – Time
[Bar chart: Runtime (%), 0 to 100, for TVLA, FindBugs, PMD, Bloat, Fop, and Soot]
Related Work
Large volume of work on SETL
  Automatic data structure selection in SETL [Schonberg et al., POPL'79]
  SETL representation sublanguage [Dewar et al., TOPLAS'79]
  …
Bloat
  The Causes of Bloat, The Limits of Health [Mitchell and Sevitsky, OOPSLA'07]
Summary
Collection selection is a real problem
  Runtime penalty
  Bloat
Chameleon integrates trace and heap information to choose a collection implementation based on predefined rules
Using Chameleon, we reduced the footprint of several applications
  Never degrading running time, often improving it
First step towards automatic collection selection as part of the runtime system