A Study of superinstructions and dynamic mixin optimizations
Transcript of "A Study of superinstructions and dynamic mixin optimizations"
08D37074 Salikh ZAKIROV, Chiba laboratory
Advisors: Etsuya SHIBAYAMA, Shigeru CHIBA
2
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion
Dynamic languages
• Exist from the 70s, but became popular again in the 90s
• Provide the highest productivity
• Performance worse than static languages
• Results in limited applicability
• Hardware progress helps
• However, performance is still low
Performance research is important!
3
Dynamic language implementation
• Design trade-off:
  – performance
  – implementation complexity
  – dynamicity
    • responsiveness to run-time change
4
5
Typical implementation approaches
[Figure: implementation approaches plotted by performance vs. dynamicity: interpreter, inline caching, AOT compiler, JIT compiler, trace compiler]
The problem
• Trade-off of performance with dynamicity
• Popular dynamic languages are slow
  – We use Ruby
• Metaprogramming is essential to high productivity
  – Requires dynamicity
• Existing high-performance techniques have low dynamicity
6
7
Focus of my work
• Ruby language
  – The main implementation is a VM interpreter
• Performance improvement
  – While keeping high dynamicity
Contributions of this work
• Researched the application of superinstructions for Ruby
  – Found a novel approach with higher benefit
    • Arithmetic superinstructions
• Proposed inline caching for dynamic mixin
  – fine-grained state tracking
  – alternate caching
8
Position of superinstructions
• Known for dispatch overhead reduction
  – Low benefit for Ruby on modern hardware
• Arithmetic superinstructions
  – A novel application of superinstructions
  – Benefit on numeric applications
• Response to dynamic updates
  – does not differ from the original interpreter
9
Position of dynamic mixin
• A variant of delegation
  – Getting popular in recent research and practice
  – Very slow with existing techniques
• We proposed a novel optimization scheme
  – Fine-grained state tracking
  – Alternate caching
10
11
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion
Interpreter optimization efforts
• Many techniques to optimize interpreters have been proposed
  – Threaded interpretation
  – Stack top caching
  – Pipelining
  – Superinstructions
• Superinstructions
  – Merge the code of operations executed in sequence
12
Focus of this presentation
Superinstructions (example)

PUSH: // put <imm> argument on stack
    stack[sp++] = *pc++;
    goto **pc++;

ADD: // add two topmost values on stack
    sp--;
    stack[sp-1] += stack[sp];
    goto **pc++;

PUSH_ADD: // add <imm> to stack top (naive concatenation)
    stack[sp++] = *pc++;
    // goto **pc++;   <- dispatch eliminated
    sp--;
    stack[sp-1] += stack[sp];
    goto **pc++;

PUSH_ADD: // add <imm> to stack top (C compiler optimizations applied)
    stack[sp-1] += *pc++;
    goto **pc++;
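The C snippets above can be mirrored by a small Ruby sketch: a toy stack machine where PUSH_ADD is a merged superinstruction. The opcode names and the dispatch counter are ours, purely for illustration of how merging removes dispatches.

```ruby
# Toy stack VM: :push_add is a superinstruction merging :push and :add.
# We count dispatches to show the saving.
def run(program)
  stack, pc, dispatches = [], 0, 0
  while pc < program.length
    dispatches += 1                      # one dispatch per VM op executed
    case program[pc]
    when :push     then stack.push(program[pc + 1]); pc += 2
    when :add      then b = stack.pop; stack[-1] += b; pc += 1
    when :push_add then stack[-1] += program[pc + 1]; pc += 2  # merged op
    when :halt     then break
    end
  end
  [stack.last, dispatches]
end

plain  = [:push, 1, :push, 2, :add, :push, 3, :add, :halt]
merged = [:push, 1, :push_add, 2, :push_add, 3, :halt]
p run(plain)    # => [6, 6]
p run(merged)   # => [6, 4]  same result, fewer dispatches
```

Both programs compute 1 + 2 + 3; the merged one executes two fewer dispatches, which is exactly the effect the C code obtains by deleting the intermediate `goto **pc++`.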
13
Superinstructions (effects)
• Effects
  1. Reduce dispatch overhead
     a. Eliminate some jumps
     b. Provide more context for the indirect branch predictor by replicating indirect jump instructions
  2. Allow more optimizations within a VM op
14
Prior research result: good for reducing dispatch overhead

Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)

Examples of successful use in prior research:
• ANSI C interpreter: 2-3 times improvement (Proebsting 1995)
• OCaml: more than 50% improvement (Piumarta 1998)
• Forth: 20-80% improvement (Ertl 2003)
15
Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)
BUT Ruby does not fit well.
Hardware profiling data on Intel Core 2 Duo:
• 60-140 hardware ops per VM op
• Only 1-3% misprediction overhead on interpreter dispatch
16
Superinstructions for Ruby
• We experimentally evaluated the effect of “naive” superinstructions on Ruby
  – Superinstructions are selected statically
  – Combinations of length 2 occurring frequently in a training run are selected as superinstructions
  – The training run uses the same benchmark
  – Superinstructions are constructed by concatenating C source code; C compiler optimizations are applied
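The static selection step described above can be sketched in a few lines of Ruby: count length-2 opcode sequences in a training-run trace and pick the most frequent ones. The trace and opcode names here are made up for illustration.

```ruby
# Sketch of static superinstruction selection: tally adjacent opcode
# pairs in a (hypothetical) training-run trace, keep the top ones.
trace = %i[push push add push mult push add push add pop]
pairs = trace.each_cons(2).tally
superinstructions = pairs.sort_by { |_, n| -n }.first(2).map(&:first)
p superinstructions   # => [[:push, :add], [:add, :push]]
```

In the real system the selected pairs are then compiled into merged VM operations by concatenating their C source.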
17
Naive superinstructions effect on Ruby
18
[Figure: normalized execution time vs. number of superinstructions used, 4 benchmarks: limited benefit, unpredictable effects]
Branch mispredictions
19
[Figure: branch mispredictions and normalized execution time vs. number of superinstructions used, 2 benchmarks: mandelbrot and spectral_norm]
Branch mispredictions, reordered
20
[Figure: branch mispredictions and normalized execution time vs. number of superinstructions used, reordered by execution time, 2 benchmarks: mandelbrot and spectral_norm]
So why is Ruby slow?
• Profile of numeric benchmarks
21
• Garbage collection takes significant time
• Boxed floating point values dominate allocation
Floating point value boxing
22
Typical Ruby 1.9 VM operation:

OPT_PLUS:
    VALUE a = *(sp-2);
    VALUE b = *(sp-1);
    /* ... */
    if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
        sp--;
        *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
    } else {
        CALL(1/*argnum*/, PLUS, a);
    }
    goto **pc++;

A new “box” object is allocated on each operation.
Proposal: use superinstructions for boxing optimization
• 1 allocation per 2 operations instead of 1 per operation
23
OPT_MULT_OPT_PLUS:
    VALUE a = *(sp-3);
    VALUE b = *(sp-2);
    VALUE c = *(sp-1);
    /* ... */
    if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
        sp -= 2;
        *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
    } else {
        CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
        CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
    }
    goto **pc++;
Boxing of intermediate result eliminated
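The saving can be simulated in Ruby with a wrapper class that counts allocations (a stand-in for NEW_FLOAT; real MRI floats are not implemented this way). Computing a + b*c in two separate ops boxes the intermediate b*c, while the fused op boxes only the final result.

```ruby
# Box: a counting stand-in for a heap-allocated (boxed) float.
class Box
  @@allocs = 0
  def self.allocs; @@allocs; end
  attr_reader :v
  def initialize(v); @@allocs += 1; @v = v; end
end

a, b, c = Box.new(2.0), Box.new(3.0), Box.new(4.0)
base = Box.allocs                     # ignore the operand boxes

# Separate OPT_MULT then OPT_PLUS: the intermediate b*c gets boxed.
t  = Box.new(b.v * c.v)
r1 = Box.new(a.v + t.v)
two_ops = Box.allocs - base           # 2 allocations

# Fused OPT_MULT_OPT_PLUS: a + b*c in one step, one box.
r2 = Box.new(a.v + b.v * c.v)
fused = Box.allocs - base - two_ops   # 1 allocation
puts "#{two_ops} vs #{fused}"         # => "2 vs 1"
p r1.v == r2.v                        # => true
```

Halving the allocation rate on the hot arithmetic path is what later shows up as reduced GC counts in the evaluation.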
Implementation
24
• VM operations that handle floating point values directly:
  – opt_plus
  – opt_minus
  – opt_mult
  – opt_div
  – opt_mod
• We implemented all 25 combinations of length 2
  – Based on Ruby 1.9.1
  – Using the existing Ruby infrastructure for superinstructions, with some modifications
Limitations
• Coding-style sensitive
  – Can be fixed by adding getVariable superinstructions
• Not applicable to other types (e.g. Fixnum, Bignum, String)
  – Fixnum is already unboxed
  – Bignum and String cannot be unboxed
• Sequences of 3 or more arithmetic instructions are virtually non-existent
  – No occurrences in the benchmarks
25
Results
• 0–22% faster on numeric benchmarks (avg 12%)
• No slowdown on other benchmarks
26
Evaluation
27
• Reduction in boxing translates to a reduction in GC count
• GC reduction explains most of the speedup

Example: mandelbrot tweak
28
• A slight modification produces a 20% difference in performance
  – 4 of 9 arithmetic instructions get merged into 2 superinstructions
  – 24% reduction in float allocation

ITER.times do
-  tr = zrzr - zizi + cr
+  tr = cr + (zrzr - zizi)
-  ti = 2.0*zr*zi + ci
+  ti = ci + 2.0*zr*zi

[Figure: normalized execution time before and after the tweak]
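The mandelbrot tweak works because reordering the operands makes the arithmetic VM instructions adjacent in the bytecode, so a length-2 superinstruction can match them. This can be inspected on CRuby (which provides RubyVM::InstructionSequence); the variable names follow the example, and the helper below is ours.

```ruby
# Count how many instructions sit between opt_minus and opt_plus in the
# compiled bytecode of a snippet (assumes MRI/CRuby).
def arith_gap(src)
  ops = RubyVM::InstructionSequence.compile(src).to_a.last
           .select { |i| i.is_a?(Array) }   # keep instruction entries only
           .map(&:first)
  ops.index(:opt_plus) - ops.index(:opt_minus) - 1
end

before = arith_gap("zrzr=1.0; zizi=2.0; cr=0.5; tr = zrzr - zizi + cr")
after  = arith_gap("zrzr=1.0; zizi=2.0; cr=0.5; tr = cr + (zrzr - zizi)")
puts "instructions between -/+: before=#{before}, after=#{after}"
```

In the original form a getlocal for cr separates the minus and the plus; in the reordered form they are adjacent (gap 0), which is what lets opt_minus/opt_plus merge.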
• Alternative solution: introduce OP-getdynamic-OP superinstructions
Discussion of alternative approaches
• Faster GC
  – would reduce the benefit of superinstructions
• Tagged values
  – 64-bit platforms only
• Stack allocation of intermediate results
• Dynamic specialization
• Type inference
29
Summary
• The naive approach to superinstructions does not produce substantial benefit for Ruby
• Floating point value boxing overhead is a problem for Ruby
• Superinstructions provide some help (up to 23%, 12% on average)
  – implementation is 2000 SLOC; regular, thus automatically generatable
• Dynamicity is the same as the original interpreter
30
31
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion
Mixin
32
• A code composition technique
• Temporary change in the class hierarchy
• Available in Ruby, Python, JavaScript
[Figure: mixin use declaration and mixin semantics: AdditionalSecurity is inserted between Server and BaseServer]
Dynamic mixin
33
[Figure: the AdditionalSecurity mixin is dynamically installed between Server and BaseServer, and later removed]
Dynamic mixin (2)
• Powerful technique of dynamic languages
• Enables
  – dynamic patching
  – dynamic monitoring
• Can be used to implement
  – Aspect-oriented programming
  – Context-oriented programming
• Widely used in Ruby, Python
  – e.g. Object-Relational Mapping
34
Dynamic mixin in Ruby
• Ruby has dynamic mixin
  – but only an “install” operation, no “remove”
  – because “remove” semantics are uncertain with transitive module inclusion
• “remove” can be implemented easily
  – 23 lines of code
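The “install” half works in stock Ruby today: including a module at run time immediately changes dispatch for existing instances. A minimal sketch (class and module names are illustrative; stock MRI has no “remove”, which is what the thesis adds to the VM):

```ruby
# Dynamic mixin installation in stock Ruby: `include` at run time
# inserts the module into the class hierarchy above the class.
class BaseServer
  def process; "base"; end
end

class Server < BaseServer
end

module AdditionalSecurity
  def process; "checked+" + super; end   # super delegates up the chain
end

s = Server.new
p s.process                    # => "base"

Server.class_eval { include AdditionalSecurity }
p s.process                    # => "checked+base"  (existing instance affected)
p Server.ancestors.first(3)    # => [Server, AdditionalSecurity, BaseServer]
```

Note that the include takes effect for the already-created instance `s`: dispatch follows the (now longer) ancestor chain, which is exactly why every inline cache caching the old target must be invalidated.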
35
Target application
• Mixin is installed and removed frequently
• Application server with dynamic features
36
class BaseServer
  def process() … end
end

class Server < BaseServer
  def process()
    if request.isSensitive()
      Server.class_eval { include AdditionalSecurity }
    end
    super   # delegate to superclass
    …       # remove mixin
  end
end

module AdditionalSecurity
  def process()
    …     # security check
    super # delegate to superclass
  end
end
Overhead is high
Possible reasons:
• Invalidation granularity
  – clearing the whole method cache
  – invalidating all inline caches
    • subsequent calls require a full method lookup
• Inline caching saves just 1 target
  – which changes with mixin operations
  – even though mixin operations are mostly repeated
37
Our research target
• Improve performance of applications that frequently use dynamic mixin
  – Make invalidation granularity smaller
  – Make dynamic dispatch targets cacheable in the presence of dynamic mixin operations
38
Proposal
• Reduce granularity of inline cache invalidation
  – Fine-grained state tracking
• Cache multiple dispatch targets
  – Polymorphic inline caching
• Enable cache reuse on repeated mixin installation and removal
  – Alternate caching
39
Basics: Inline caching
40
Consider a call site: cat.speak()

Dynamic dispatch implementation (executable code):

    method = lookup(cat, "speak")
    method(cat)

Expensive! But the result is mostly the same.

Inline caching:

    if (cat has type ic.class) {
      ic.method(cat)
    } else {
      ic.method = lookup(cat, "speak")
      ic.class = cat.class
      ic.method(cat)
    }

[Figure: Cat is a subclass of Animal; cat is an instance of Cat; speak() { … } is the method implementation; the inline cache ic stores the class (Cat) and the cached method]
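The inline caching logic can be sketched in Ruby itself (the real cache lives inside the VM; `lookup` below stands in for the VM's method lookup, and the lookup counter is ours):

```ruby
# Sketch of an inline cache: remember (receiver class, method) at the
# call site and skip the full lookup while the class matches.
InlineCache = Struct.new(:klass, :method)

$lookups = 0
def lookup(receiver, name)
  $lookups += 1                           # count full lookups
  receiver.class.instance_method(name)
end

def cached_call(ic, receiver, name)
  if receiver.class != ic.klass           # cache miss: full lookup
    ic.method = lookup(receiver, name)
    ic.klass  = receiver.class
  end
  ic.method.bind(receiver).call           # cache hit path
end

class Cat; def speak; "meow"; end; end

ic  = InlineCache.new
cat = Cat.new
3.times { cached_call(ic, cat, :speak) }
puts $lookups   # => 1  (only the first call needed a full lookup)
```

The cache is correct only as long as the method table does not change, which is exactly the problem the next slides address.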
Inline caching: problem
41
• What if the method has been overridden?

    if (cat has type ic.class) {
      ic.method(cat)
    } else {
      ic.method = lookup(cat, "speak")
      ic.class = cat.class
      ic.method(cat)
    }

[Figure: module Training defines speak() { … }, overriding the cached target; the type check still passes, so the inline cache keeps calling the stale method]
Inline caching: invalidation
42
    if (cat has type ic.class && state == ic.state) {
      ic.method(cat)
    } else {
      ic.method = lookup(cat, "speak")
      ic.class = cat.class; ic.state = state
      ic.method(cat)
    }

• A global state value is incremented on method redefinition, failing the cache check
• Single global state object: too coarse invalidation granularity
[Figure: the global state goes from 1 to 2 when Training overrides speak; the stale inline cache fails the state check and re-looks-up]
Fine-grained state tracking
• Many state objects
  – small invalidation extent
  – share as much as possible
• One state object for each family of methods called from the same call site
• State objects associated with the lookup path
  – links updated during method lookups
43
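The gain from fine-grained state objects can be illustrated with a Ruby sketch (not the thesis implementation: here one state object per method name stands in for the per-method-family state objects, and all names are ours). Redefining `speak` bumps only its own state, so the cache for `eat` survives.

```ruby
# Fine-grained state tracking, sketched: one state object per method
# family instead of a single global counter.
State = Struct.new(:value)
IC    = Struct.new(:klass, :state_obj, :state, :method)

$states  = Hash.new { |h, k| h[k] = State.new(0) }  # name -> state object
$methods = {}                                       # [class, name] -> impl
$full_lookups = 0

def define_method_(klass, name, impl)
  $methods[[klass, name]] = impl
  $states[name].value += 1            # invalidate only this family
end

def cached_call(ic, receiver, name)
  if receiver.class == ic.klass && ic.state_obj&.value == ic.state
    ic.method.call                    # hit: no lookup, no state walk
  else
    $full_lookups += 1                # miss: full lookup, refill cache
    ic.method    = $methods[[receiver.class, name]]
    ic.klass     = receiver.class
    ic.state_obj = $states[name]
    ic.state     = ic.state_obj.value
    ic.method.call
  end
end

class Cat; end
define_method_(Cat, :speak, -> { "meow" })
define_method_(Cat, :eat,   -> { "nom" })

cat = Cat.new
ic_speak, ic_eat = IC.new, IC.new
cached_call(ic_speak, cat, :speak)
cached_call(ic_eat,   cat, :eat)            # 2 full lookups so far

define_method_(Cat, :speak, -> { "MEOW" })  # bumps only speak's state
cached_call(ic_eat, cat, :eat)              # still a cache hit
puts $full_lookups                          # => 2
puts cached_call(ic_speak, cat, :speak)     # speak's cache missed -> "MEOW"
```

With a single global state, the redefinition of `speak` would have invalidated the `eat` cache as well; here it does not.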
44
Lookup procedure
• Look up normally
  – noting the state objects encountered
  – starting from the state object in the inline cache
• Choose the last state object
  – other state objects:
    • mark for one-time invalidation
    • mark as overridden
• Store the pointer to the state object in every class on the lookup path
Invariant 1: lookup path
• After any number of lookups, for any IC = (pstate, state, class, name), either of the following holds:
  – IC is invalidated
  – for any class’ ∊ lookup_path(class, name): (class’, name) ↪ state
• Proof scheme (induction over lookups):
  – the initial state is invalid
  – the lookup procedure establishes the invariant
  – on lookups that update state object links, state is linked transitively or invalidated
45
Invariant 2: validity of IC
• After any number of updates, for any IC = (pstate, state, class, name, target), either of the following holds:
  – IC is invalidated
  – target = lookup(class, name)
• Proof scheme (induction over updates): any update either
  – does not affect the lookup, or
  – invalidates the IC
46
State object allocation
47

    if (cat has type ic.class && ic.pstate.state == ic.state) {
      ic.method(cat)
    } else {
      ic.method, ic.pstate = lookup(cat, "speak", ic.pstate)
      ic.class = cat.class; ic.state = ic.pstate.state
      ic.method(cat)
    }   // inline caching code

[Figure: Animal defines speak() { *1* }; Cat has no implementation of its own; the call site cat.speak() allocates a state object (value 1) for the “speak” method family and caches class = Cat, method = Animal#speak *1*, pstate = the state object]
Mixin installation
48
[Figure: module Training with speak() { *2* } is installed between Cat and Animal; the state object value is bumped from 1 to 2, so the next cat.speak() fails the state check, and the new lookup caches Training#speak *2* with state 2]
Mixin removal
49
[Figure: module Training is removed; the state object value is bumped from 2 to 3, invalidating caches of Training#speak *2*; the next lookup finds Animal#speak *1* again and caches it with state 3]
Alternate caching
50
• Detect repetition
• Conflicts are detected by the state check
• Inline cache contents oscillate between the two targets
[Figure: an alternate cache attached to the state object remembers the alternating lookup results and state values (3, 4) for “speak”, so repeated install/remove of Training reuses them instead of performing full lookups]
Polymorphic caching
51
• Use multiple entries in the inline cache
• The PIC now handles the oscillating state value
[Figure: the polymorphic inline cache at cat.speak() holds two entries, (state 3, Animal#speak *1*) and (state 4, Training#speak *2*), so both phases of the oscillation hit]
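Alternate caching plus a polymorphic inline cache can be sketched together in Ruby (illustrative only; in the real scheme the alternate cache hangs off the class/state object and stores method entries, while here a hash keyed by state value stands in for both):

```ruby
# PIC + alternate caching, sketched: one (state value -> target) entry
# per observed state, so an oscillating state (mixin installed/removed)
# keeps hitting the cache.
State = Struct.new(:value)
PIC   = Struct.new(:entries)       # {state_value => target}

$full_lookups = 0
def full_lookup(targets, state)
  $full_lookups += 1
  targets[state.value]             # illustrative: target decided by state
end

def pic_call(pic, state, targets)
  target = pic.entries[state.value]          # PIC hit?
  unless target
    target = full_lookup(targets, state)     # miss: full lookup
    pic.entries[state.value] = target        # remember for this state
  end
  target.call
end

# state 3 = mixin removed -> *1*, state 4 = mixin installed -> *2*
targets = { 3 => -> { "*1*" }, 4 => -> { "*2*" } }
state = State.new(3)
pic = PIC.new({})

10.times do
  pic_call(pic, state, targets)
  state.value = state.value == 3 ? 4 : 3     # install / remove mixin
end
puts $full_lookups   # => 2  (one full lookup per distinct state value)
```

With a monomorphic cache the same loop would miss on every iteration; here only the first occurrence of each state value pays for a lookup.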
52
Invariant 3: validity of alternate caching
• After any number of updates, for any IC = (pstate, state, class, name, target), either of the following holds:
  – value(pstate) != state (IC is invalid)
  – target = lookup(class, name)
• Proof scheme (induction on updates): any update either
  – introduces a fresh value(pstate) for invalidation, or
  – preserves the correctness of the cached target
State object merge
53
[Figure: call sites cat.speak() and animal.speak() come to share a state object; when one state object is overridden by another, the overridden one is marked for one-time invalidation; the executable code runs a loop — while(true) { … remove mixin } — exercising repeated mixin operations]
54
Alternate caching limitations
• Independent mixins do not conflict
• If two mixins override the same method, their dynamic mixin scopes need to be properly nested
[Figure: state object values (for some method) over time as nested dynamic mixin scopes are entered and exited]
Overheads of proposed scheme
• Increased memory use
  – 1 state object per polymorphic method family
  – additional method entries
  – alternate cache
  – polymorphic inline cache entries
• Some operations become slower
  – Lookup needs to track and update state objects
  – Explicit state object checks on method dispatch
55
Generalizations (beyond Ruby)
• Delegation object model
  – track arbitrary delegation pointer changes
• Thread-local delegation
  – allow thread-local modification of the delegation pointer
  – by having thread-local state object values
56
Evaluation
• Implementation based on Ruby 1.9.2
  – about 1000 lines of code
• Hardware
  – Intel Core i7 860, 2.8 GHz
57
Evaluation: microbenchmarks
Single method call overhead:
• Inline cache hit
  – state checks: 1% overhead
  – polymorphic inline caching: 49% overhead
• Full lookup
  – 2x slowdown
58
Dynamic mixin-heavy microbenchmark
[Figure: normalized execution time, smaller is better: baseline 100% vs. 23%, 17%, 15% for the optimized variants]
59
Evaluation: application
• Application server with dynamic mixin on each request
60
[Figure: normalized execution time, smaller is better: bars for baseline (100%), method cache, state checks, fgst, fgst + PIC, fgst + PIC + alternate caching, with the optimized configurations at 70%, 58%, 60%, 52%]
Evaluation
• Fine-grained state tracking considerably reduces overhead
• Alternate caching brings only a small improvement
  – The number of call sites affected by the mixin is low
  – The lookup cost / inline cache hit cost ratio is low: about 1.6x on Ruby
61
Related work
• Dependency tracking in Self
  – focused on reducing recompilation rather than reducing method lookups
• Inline caching for Objective-C
  – state object associated with the method; no dynamic mixin support
62
Summary
• We proposed a combination of techniques
  – Fine-grained state tracking
  – Alternate caching
  – Polymorphic inline caching
• To increase the efficiency of inline caching
  – with frequent dynamic mixin installation and removal
63
64
Outline
• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion
Evaluation in compiled environment
• Dynamic mixin optimizations
  – are applicable to compiled systems too
• To confirm this hypothesis
  – we implemented a dynamic compiler for the language IO
  – we evaluated the performance of inline caching
65
Idea: efficient dynamic mixin
[Figure: control flow graph of the compiled call site server.f(): state guards if (s == 1) and if (s == 2) dispatch to the inlined method f1 or to f2; a failing guard falls through to inline cache miss handling; repeated dynamic mixin install/removal merely switches between s = 1 and s = 2, then execution continues]
66
Dynamic compiler
• Source language: IO
  – only a subset is implemented
• Compilation uses profiles collected in PICs
• PICs are initialized by interpreted execution
67
Microbenchmark results
• Overhead of state checks is reasonable (less than 16 CPU cycles)
• The most common case has just 1 cycle of overhead
68
Summary and ongoing work
• We evaluated the inline caching optimization for dynamic mixin in a compiled environment
• We verified that the optimization is effective
• The most common case has very low overhead
Future work:
• The dynamic mixin switch operation may be slow; this needs to be addressed
• Full IO language support
69
Conclusion
• We researched two techniques
  – Superinstructions for Ruby
  – Dynamic mixin optimization
• Our techniques improve performance
• While keeping the advantages
  – high dynamicity
  – low implementation complexity
70
Publications
• S. Zakirov, S. Chiba, E. Shibayama. How to select superinstructions, IPSJ PRO 2010.
• S. Zakirov, S. Chiba, E. Shibayama. Optimizing dynamic dispatch with fine-grained state tracking, DLS 2010.
71