Chameleon Automatic Selection of Collections

Ohad Shacham, Martin Vechev, Eran Yahav

Tel Aviv University, IBM T.J. Watson Research Center

Presented by: Yingyi Bu

Collections

Abstract data types
Many implementations
Different space/time tradeoffs
Incompatible selection might lead to
    Runtime degradation
    Space bloat – wasted space

Set: ArraySet, HashSet, LinkedSet, LazySet
Map: ArrayMap, HashMap, LinkedMap, LazyMap
List: ArrayList, LinkedList, LazyList

Collection Bloat

Collection bloat is unjustified space overhead for storing data in collections

List s = new ArrayList();
s.add(1);

Bloat for s is 9: the backing array holds 10 slots (the default capacity) and only 1 is used

Collection Bloat

Collection bloat is a serious problem in practice
    Observed to occupy 90% of the heap in real-world applications

Hard to detect and fix
    Accumulation: death by a thousand cuts
    Correction: need to correlate bloat to program code

How to pick the right implementation?
    Minimize bloat
    But without degrading running time

Our Vision

Programmer declares the ADT to be used
    Set s = new Set();

Programmer defines what metric to optimize
    e.g. space-time

Runtime automatically selects implementation based on metric
    Online: detect application usage of Set
    Online: select appropriate implementation of Set

Set: ArraySet, HashSet, LinkedSet, …

This Work

Programmer defines the implementation to be used
    Set s = new HashSet();

Programmer defines what metric to optimize
    Space-time product
    Space = Bloat

Runtime suggests implementation based on metric
    Online: automatically detect application usage of HashSet()
    Online: automatically suggest alternative to HashSet()
    Offline: programmer modifies program accordingly
        e.g. Set s = new ArraySet();
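To make the space/time tradeoff behind such a suggestion concrete, here is a minimal sketch of an array-backed set. `TinyArraySet` is a hypothetical class written for this illustration, not the ArraySet the slides refer to; it assumes that growing the array by exactly one slot per insertion is acceptable for small sets.

```java
import java.util.Arrays;
import java.util.Objects;

final class ArraySetDemo {
    public static void main(String[] args) {
        TinyArraySet<String> s = new TinyArraySet<>();
        s.add("a");
        s.add("b");
        s.add("a"); // duplicate, ignored
        System.out.println(s.size() + " elements in " + s.capacity() + " slots");
    }
}

// Hypothetical array-backed set: no hash table, no per-entry nodes.
final class TinyArraySet<E> {
    private Object[] items = new Object[0]; // no slots until the first add
    private int size;

    boolean add(E e) {
        if (contains(e)) {
            return false;
        }
        if (size == items.length) {
            // Grow by exactly one slot: zero bloat, paid for with a copy per add.
            items = Arrays.copyOf(items, size + 1);
        }
        items[size++] = e;
        return true;
    }

    // Linear scan: cheap for a handful of elements, slow for large sets.
    boolean contains(Object o) {
        for (int i = 0; i < size; i++) {
            if (Objects.equals(items[i], o)) {
                return true;
            }
        }
        return false;
    }

    int size() { return size; }
    int capacity() { return items.length; }
}
```

The linear `contains` is why a replacement like this only pays off when the set stays small and lookups are infrequent.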

How Can We Calculate Bloat?

Data structure bloat = Occupied Data – Used Data

Example:

List s = new ArrayList();
s.add(1);

Bloat for s is 9
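The subtraction above can be sketched with a toy array-backed list. `TinyArrayList` is a hypothetical class written for illustration (it is not `java.util.ArrayList`), bloat is measured in array slots rather than bytes, and the default capacity of 10 is an assumption chosen to match the slide's example.

```java
import java.util.Arrays;

final class BloatDemo {
    public static void main(String[] args) {
        TinyArrayList s = new TinyArrayList();
        s.add(1);
        System.out.println("Bloat for s is " + s.bloat()); // prints "Bloat for s is 9"
    }
}

// Hypothetical array-backed list used only to illustrate the bloat formula.
final class TinyArrayList {
    private Object[] elements = new Object[10]; // assumed default capacity
    private int size = 0;                       // number of slots in use

    void add(Object value) {
        if (size == elements.length) {
            // Grow by roughly 50% when the backing array is full.
            elements = Arrays.copyOf(elements, size + (size >> 1) + 1);
        }
        elements[size++] = value;
    }

    int occupiedSlots() { return elements.length; }
    int usedSlots() { return size; }

    // Bloat = Occupied Data - Used Data, counted here in array slots.
    int bloat() { return occupiedSlots() - usedSlots(); }
}
```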

How to Detect Collection Bloat?

Each collection maintains a field for used data

The language runtime can determine the data actually occupied
    Bloat = Occupied Data – Used Data

Solution: the garbage collector computes bloat online
    Reads used data fields from collections
    Low overhead: can work online in production

ArrayList fields: … int size … Object[] Array … …

Semantic Maps

How collections communicate information to the GC
    Includes size and pointers to actual data fields
    Allows for trivial support of custom collections

ArrayList semantic map (read by the GC): Used Data, Occupied Data

HashMap fields: … elementCount … elementData …

HashMap semantic map (read by the GC): Used Data, Occupied Data
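A semantic map can be pictured as a per-class description of where the used and occupied sizes live. The sketch below models that idea in plain Java with accessor functions; `SemanticMaps`, `RingBuffer`, and the slot-based units are all assumptions made for this illustration, and a real collector would read fields directly rather than call lambdas.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToLongFunction;

final class SemanticMapDemo {
    public static void main(String[] args) {
        SemanticMaps maps = new SemanticMaps();
        // A custom collection only has to register its own map.
        maps.register(RingBuffer.class, rb -> rb.count, rb -> rb.slots.length);

        RingBuffer rb = new RingBuffer(8);
        rb.count = 3;
        System.out.println(maps.bloatOf(List.of(rb))); // 8 occupied - 3 used = 5
    }
}

// Hypothetical custom collection with a used-size field and a backing array.
final class RingBuffer {
    final Object[] slots;
    int count;
    RingBuffer(int capacity) { slots = new Object[capacity]; }
}

// Registry of semantic maps, keyed by collection class.
final class SemanticMaps {
    private static final class Entry<T> {
        final ToLongFunction<T> used;
        final ToLongFunction<T> occupied;
        Entry(ToLongFunction<T> used, ToLongFunction<T> occupied) {
            this.used = used;
            this.occupied = occupied;
        }
        @SuppressWarnings("unchecked")
        long bloat(Object o) {
            T t = (T) o; // safe: entries are looked up by the object's exact class
            return occupied.applyAsLong(t) - used.applyAsLong(t);
        }
    }

    private final Map<Class<?>, Entry<?>> byClass = new HashMap<>();

    <T> void register(Class<T> type, ToLongFunction<T> used, ToLongFunction<T> occupied) {
        byClass.put(type, new Entry<>(used, occupied));
    }

    // Stand-in for a collector pass: sums bloat over the live objects it visits.
    long bloatOf(Iterable<?> liveObjects) {
        long total = 0;
        for (Object o : liveObjects) {
            Entry<?> e = byClass.get(o.getClass());
            if (e != null) {
                total += e.bloat(o);
            }
        }
        return total;
    }
}
```

Objects without a registered map are simply skipped, which mirrors the point that only collections need to describe themselves.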

Example: Collections Bloat in TVLA

[Figure: four charts of % live data (y-axis, 0 to 80) per GC cycle (x-axis, 1 to 451). Series shown across the charts: Live Collections, Used Collections, Core Collections. Annotation on the last chart: lower bound for bloat.]


Fixing Bloat

Must correlate all bloat stats to program point

Need trace information
    Remember: do not want to degrade time

Correlating Code and Bloat

public final class ConcreteKAryPredicate extends ConcretePredicate {
    …
    public void modify() {
        …
        values = HashMapFactory.make(this.values);
    }
    …
}

public class GenericBlur extends Blur {
    …
    public void blur(TVS structure) {
        …
        Map invCanonicName = HashMapFactory.make(structure.nodes().size());
        …
    }
}

public class HashMapFactory {
    public static Map make(int size) {
        return new HashMap(size);
    }
}

Ctx1 40%
Ctx2 11%
Ctx3 5%
Ctx4 7%
Ctx5 5%
Ctx6 3%
Ctx7 7%
Ctx8 3%

Aggregate bloat potential per allocation context
    Done by the garbage collector
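Aggregation per allocation context amounts to summing occupied-minus-used over all live collections that share a context. The sketch below shows that bookkeeping only; the `Sample` and `ContextBloat` classes and every number in it are made up for illustration, and in the system described here the work is done inside the garbage collector rather than in application code.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ContextBloatDemo {
    public static void main(String[] args) {
        // Made-up samples: (allocation context, occupied slots, used slots).
        List<Sample> live = List.of(
            new Sample("ctx1", 16, 2),
            new Sample("ctx1", 16, 1),
            new Sample("ctx2", 64, 60));
        System.out.println(ContextBloat.perContext(live)); // {ctx1=29, ctx2=4}
    }
}

// One live collection as a hypothetical collector pass would see it.
final class Sample {
    final String context;
    final long occupied;
    final long used;
    Sample(String context, long occupied, long used) {
        this.context = context;
        this.occupied = occupied;
        this.used = used;
    }
}

final class ContextBloat {
    // Sums bloat (occupied - used) over all samples sharing a context.
    static Map<String, Long> perContext(List<Sample> samples) {
        Map<String, Long> totals = new TreeMap<>();
        for (Sample s : samples) {
            totals.merge(s.context, s.occupied - s.used, Long::sum);
        }
        return totals;
    }
}
```

Two contexts that both go through a shared factory, as in the code above, stay separate as long as the context includes the caller of the factory.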

Trace Information

Track collection usage in library
    Distribution of operations
    Distribution of size

Aggregated per allocation context

ctx1: Size = 7, Get = 3, Add = 9, …
ctx2: Size = 1, Contains = 100, Insert = 1, …
ctx3: Size = 103, Contains = 10041, Insert = 140, Remove = 20, …
ctxi: …
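The per-context statistics listed above could be collected with counters like the following. This is a sketch under assumptions: `ContextStats` and `TraceProfile` are hypothetical names, operations are identified by plain strings, and "Size" is interpreted as the largest size observed.

```java
import java.util.HashMap;
import java.util.Map;

final class TraceDemo {
    public static void main(String[] args) {
        TraceProfile profile = new TraceProfile();
        ContextStats ctx2 = profile.forContext("ctx2");

        // Reproduce the shape of the ctx2 row: one insert, many lookups.
        ctx2.recordOp("Insert", 1);
        for (int i = 0; i < 100; i++) {
            ctx2.recordOp("Contains", 1);
        }
        System.out.println("Size = " + ctx2.maxSize()
            + ", Contains = " + ctx2.count("Contains")
            + ", Insert = " + ctx2.count("Insert"));
    }
}

// Operation counts and size information for one allocation context.
final class ContextStats {
    private final Map<String, Long> opCounts = new HashMap<>();
    private int maxSize;

    // Called by an instrumented collection after each operation.
    void recordOp(String op, int sizeAfter) {
        opCounts.merge(op, 1L, Long::sum);
        maxSize = Math.max(maxSize, sizeAfter);
    }

    long count(String op) { return opCounts.getOrDefault(op, 0L); }
    int maxSize() { return maxSize; }
}

// All statistics, keyed by allocation context.
final class TraceProfile {
    private final Map<String, ContextStats> byContext = new HashMap<>();

    ContextStats forContext(String context) {
        return byContext.computeIfAbsent(context, k -> new ContextStats());
    }
}
```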

But how to choose the new Collection?

Rule Engine: user-defined rules
    Input: heap and trace statistics per context
    Output: suggested collection for that context

Rules based on trace and heap information
    HashMap: #contains < X and CollmaxSize < Y → ArrayMap
    HashMap: #contains < X and CollmaxSize < Y+10 and %liveHeap > Z → ArrayMap

Example rule set given to the Rule Engine:
    HashMap: maxSize < X → ArrayMap
    LinkedList: NoListOp → ArrayList
    HashMap: (#contains < X and CollmaxSize < Y+10 and %liveHeap > Z) → ArrayMap
    …
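Rules of this shape can be modeled as a source type, a predicate over per-context statistics, and a suggested replacement. The sketch below is one possible encoding, not the tool's rule language: the class names are hypothetical, the thresholds `X` and `Y` are arbitrary values picked for the example, and `usesListOps` is an assumed reading of the `NoListOp` condition.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

final class RuleEngineDemo {
    public static void main(String[] args) {
        Stats small = new Stats("HashMap", 3, 7, false);
        System.out.println(RuleEngine.suggest(RuleEngine.EXAMPLE_RULES, small)); // Optional[ArrayMap]
    }
}

// Per-context statistics that the rules inspect.
final class Stats {
    final String type;         // collection type allocated in this context
    final long contains;       // number of contains operations
    final int maxSize;         // largest size observed
    final boolean usesListOps; // assumed meaning of the slide's NoListOp (negated)
    Stats(String type, long contains, int maxSize, boolean usesListOps) {
        this.type = type;
        this.contains = contains;
        this.maxSize = maxSize;
        this.usesListOps = usesListOps;
    }
}

final class Rule {
    final String type;
    final Predicate<Stats> when;
    final String suggestion;
    Rule(String type, Predicate<Stats> when, String suggestion) {
        this.type = type;
        this.when = when;
        this.suggestion = suggestion;
    }
}

final class RuleEngine {
    static final long X = 50; // assumed threshold, not from the slides
    static final int Y = 16;  // assumed threshold, not from the slides

    static final List<Rule> EXAMPLE_RULES = List.of(
        new Rule("HashMap", s -> s.contains < X && s.maxSize < Y, "ArrayMap"),
        new Rule("LinkedList", s -> !s.usesListOps, "ArrayList"));

    // Returns the suggestion of the first rule that matches, if any.
    static Optional<String> suggest(List<Rule> rules, Stats s) {
        for (Rule r : rules) {
            if (r.type.equals(s.type) && r.when.test(s)) {
                return Optional.of(r.suggestion);
            }
        }
        return Optional.empty();
    }
}
```

First-match semantics is a design choice of this sketch; the slides do not say how conflicts between rules are resolved.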

Overall Picture

Inputs: Program, Semantic maps, Rules

Semantic Profiler: produces per-context statistics
    ctx1: Size = 7, Get = 3, Add = 9, …
    ctx2: Size = 1, Contains = 100, Insert = 1, …

Rule Engine: applies the rules
    HashMap: maxSize < X → ArrayMap
    LinkedList: NoListOp → ArrayList
    HashMap: (#contains < X and CollmaxSize < Y+10 and %liveHeap > Z) → ArrayMap
    …

Outputs: Recommendations, Potential report

Correct Collection Bloat – Typical Usage

Step 1: Profile for bloat without context
    Low overhead, can run in production
    If problem detected, go to step 2
    Automatic

Step 2: Combine heap information with trace information per context
    Can switch automatically to step 2 from step 1
    Higher overhead than step 1
    Automatic (prior to Chameleon, this was a very hard manual step)

Step 3: Suggest fixes to user based on rules
    Automatic

Step 4: Programmer applies suggested fixes
    Manual

Chameleon on TVLA

1: HashMap:tvla...HashMapFactory:31;tvla.core.base.BaseTVS:50 replace with ArrayMap
…
4: ArrayList:BaseHashTVSSet:112;tvla...base.BaseHashTVSSet:60 set initial capacity

Size statistics for contexts 1 to 4:

Context    1        2       3       4
Max        15       26      7       7
Avg        11.33    6.31    4.8     4.8
Stddev     1.36     5.05    1.17    1.17

[Figure: per-context panels labeled Potential, Operations, and Size. Chart of % live data overused (y-axis, 0 to 12) per context (x-axis, 1 to 4).]

Implementation

Built on top of IBM's JVM

Modifications to the Parallel Mark and Sweep GC
    Modular changes, readily applicable to other GCs

Modifications to collection libraries

Runtime overhead
    Detection phase: negligible
    Correction phase: ~2x (due to cost of getting context)
        Can use PCC by Bond & McKinley
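The cost of getting context comes from inspecting the call stack at every collection allocation. The sketch below shows one way to capture a short calling context in plain Java using `StackWalker` (Java 9 and later); it is not how the modified JVM does it, `AllocationContext` and `TracingFactory` are hypothetical names, and the `Class:line` format is only modeled on the context strings shown in the recommendations.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

final class AllocationContextDemo {
    public static void main(String[] args) {
        TracingFactory.make();
        // Prints the factory frame followed by its caller, separated by ';'.
        System.out.println(TracingFactory.lastContext);
    }
}

final class AllocationContext {
    // Returns up to `depth` caller frames formatted as "Class:line;Class:line".
    static String capture(int depth) {
        return StackWalker.getInstance().walk(frames -> frames
            .skip(1)       // drop the frame of capture() itself
            .limit(depth)  // keep only a partial context to bound the cost
            .map(f -> f.getClassName() + ":" + f.getLineNumber())
            .collect(Collectors.joining(";")));
    }
}

// Hypothetical factory that records the context of each allocation.
final class TracingFactory {
    static String lastContext;

    static Map<Object, Object> make() {
        lastContext = AllocationContext.capture(2);
        return new HashMap<>();
    }
}
```

A depth of 2 is what distinguishes different callers of a shared factory; deeper contexts are more precise and more expensive.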

Experimental Results – Memory

[Figure: minimal heap (%) for TVLA, FindBugs, PMD, Bloat, Fop, and Soot; y-axis from 0 to 100.]

Experimental Results – Time

[Figure: runtime (%) for TVLA, FindBugs, PMD, Bloat, Fop, and Soot; y-axis from 0 to 100.]

Related Work

Large volume of work on SETL
    Automatic data structure selection in SETL [Schonberg et al., POPL'79]
    SETL representation sublanguage [Dewar et al., TOPLAS'79]
    …

Bloat
    The Causes of Bloat, The Limits of Health [Mitchell and Sevitsky, OOPSLA'07]

Summary

Collection selection is a real problem
    Runtime penalty
    Bloat

Chameleon integrates trace and heap information for choosing a collection implementation based on predefined rules

Using Chameleon, we reduced the footprint of several applications
    Never degrading running time, often improving it

First step towards automatic collection selection as part of the runtime system
