A Study of superinstructions and dynamic mixin optimizations

08D37074 Salikh ZAKIROV, Chiba laboratory. Advisors: Etsuya SHIBAYAMA, Shigeru CHIBA


Page 1: A  Study of superinstructions and dynamic mixin optimizations

A Study of superinstructions and dynamic mixin optimizations

08D37074 Salikh ZAKIROV, Chiba laboratory

Advisors: Etsuya SHIBAYAMA, Shigeru CHIBA

Page 2: A  Study of superinstructions and dynamic mixin optimizations

Outline

• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion

Page 3: A  Study of superinstructions and dynamic mixin optimizations

Dynamic languages

Have existed since the 1970s, but became popular again in the 1990s
• Provide the highest productivity
• Performance is worse than that of static languages
• This results in limited applicability
• Hardware progress helps
• However, performance is still low
Performance research is important!

Page 4: A  Study of superinstructions and dynamic mixin optimizations

Dynamic language implementation

• Design trade-off:
  – performance
  – implementation complexity
  – dynamicity
    • responsiveness to run-time change

Page 5: A  Study of superinstructions and dynamic mixin optimizations

Typical implementation approaches

[Figure: implementation approaches placed on Performance vs. Dynamicity axes: Interpreter, Inline caching, AOT compiler, JIT compiler, Trace compiler]

Page 6: A  Study of superinstructions and dynamic mixin optimizations

The problem

Trade-off of performance with dynamicity
• Popular dynamic languages are slow
  – We use Ruby
• Metaprogramming is essential to high productivity
  – Requires dynamicity
• Existing high-performance techniques have low dynamicity

Page 7: A  Study of superinstructions and dynamic mixin optimizations

Focus of my work

• Ruby language
  – Main implementation is a VM interpreter
• Performance improvement
  – While keeping high dynamicity

Page 8: A  Study of superinstructions and dynamic mixin optimizations

Contributions of this work

• Researched application of superinstructions for Ruby
  – Found a novel approach with higher benefit
    • Arithmetic superinstructions
• Proposed inline caching for dynamic mixin
  – fine-grained state tracking
  – alternate caching

Page 9: A  Study of superinstructions and dynamic mixin optimizations

Position of superinstructions

• Known for dispatch overhead reduction
  – Low benefit for Ruby on modern hardware
• Arithmetic superinstructions
  – Novel application of superinstructions
  – Benefit on numeric applications
• Response to dynamic updates
  – does not differ from original interpreter

Page 10: A  Study of superinstructions and dynamic mixin optimizations

Position of dynamic mixin

• A variant of delegation
  – Getting popular in recent research and practice
  – Very slow with existing techniques
• We proposed a novel optimization scheme
  – Fine-grained state tracking
  – Alternate caching

Page 11: A  Study of superinstructions and dynamic mixin optimizations

Outline

• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion

Page 12: A  Study of superinstructions and dynamic mixin optimizations

Interpreter optimization efforts

• Many techniques to optimize interpreters have been proposed
  – Threaded interpretation
  – Stack top caching
  – Pipelining
  – Superinstructions (focus of this presentation)
• Superinstructions
  – Merge code of operations executed in sequence

Page 13: A  Study of superinstructions and dynamic mixin optimizations

Superinstructions (example)

Original operations:

    PUSH: // put <imm> argument on stack
      stack[sp++] = *pc++;
      goto **pc++;

    ADD: // add two topmost values on stack
      sp--;
      stack[sp-1] += stack[sp];
      goto **pc++;

Superinstruction, dispatch eliminated:

    PUSH_ADD: // add <imm> to stack top
      stack[sp++] = *pc++;
      //goto **pc++;
      sp--;
      stack[sp-1] += stack[sp];
      goto **pc++;

Superinstruction, optimizations applied:

    PUSH_ADD: // add <imm> to stack top
      stack[sp-1] += *pc++;
      goto **pc++;
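The fusion shown on this slide can be mimicked in a few lines of Ruby. This is a minimal sketch of a toy stack VM, not the Ruby interpreter: the `run` method, the symbol opcodes and the dispatch counter are assumptions made for illustration. It shows that the fused `:push_add` gives the same result as `:push` followed by `:add`, with one dispatch fewer.

```ruby
# Toy stack VM, written only to illustrate instruction fusion.
# Opcode names follow the slide; everything else is hypothetical.
def run(program)
  stack = []
  pc = 0
  dispatches = 0
  while pc < program.length
    op = program[pc]
    pc += 1
    dispatches += 1              # one dispatch per executed VM operation
    case op
    when :push                   # put <imm> argument on stack
      stack.push(program[pc])
      pc += 1
    when :add                    # add two topmost values on stack
      top = stack.pop
      stack[-1] += top
    when :push_add               # superinstruction: add <imm> to stack top
      stack[-1] += program[pc]
      pc += 1
    else
      raise ArgumentError, "unknown opcode #{op.inspect}"
    end
  end
  [stack.last, dispatches]
end

p run([:push, 1, :push, 2, :add])   # => [3, 3]  (result, dispatches)
p run([:push, 1, :push_add, 2])     # => [3, 2]
```

The saving here is purely in dispatch count; on the slide the second variant also removes the intermediate push and pop, which this toy mirrors by updating the stack top in place.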

Page 14: A  Study of superinstructions and dynamic mixin optimizations

Superinstructions (effects)

• Effects
  1. Reduce dispatch overhead
     a. Eliminate some jumps
     b. Provide more context for the indirect branch predictor by replicating indirect jump instructions
  2. Allow more optimizations within a VM op

Page 15: A  Study of superinstructions and dynamic mixin optimizations

Prior research result: good for reducing dispatch overhead

Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)

Examples of successful use in prior research
• ANSI C interpreter: 2-3 times improvement (Proebsting 1995)
• Ocaml: more than 50% improvement (Piumarta 1998)
• Forth: 20-80% improvement (Ertl 2003)

Page 16: A  Study of superinstructions and dynamic mixin optimizations

Superinstructions help when:
• VM operations are small (~10 hwop/vmop)
• Dispatch overhead is high (~50%)

BUT Ruby does not fit well

Hardware profiling data on Intel Core 2 Duo:
• 60-140 hardware ops per VM op
• Only 1-3% misprediction overhead on interpreter dispatch

Page 17: A  Study of superinstructions and dynamic mixin optimizations

Superinstructions for Ruby

• We experimentally evaluated the effect of “naive” superinstructions on Ruby
  – Superinstructions are selected statically
  – Combinations of length 2 that occur frequently in a training run are selected as superinstructions
  – Training run uses the same benchmark
  – Superinstructions are constructed by concatenating C source code, with C compiler optimizations applied
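The static selection step described on this slide amounts to counting adjacent instruction pairs in a training trace. The following is a minimal sketch under that reading; `select_superinstructions` and the example trace are hypothetical, and the actual selection procedure used in the study may differ in details such as tie-breaking.

```ruby
# Count adjacent opcode pairs in a training trace and keep the most
# frequent ones as superinstruction candidates (hypothetical helper).
def select_superinstructions(trace, count)
  frequency = Hash.new(0)
  trace.each_cons(2) { |pair| frequency[pair] += 1 }
  # Most frequent first; ties broken alphabetically to stay deterministic.
  frequency
    .sort_by { |pair, n| [-n, pair.map(&:to_s)] }
    .first(count)
    .map(&:first)
end

# Made-up trace, for illustration only.
trace = [:getlocal, :opt_mult, :opt_plus, :setlocal,
         :getlocal, :opt_mult, :opt_plus]
p select_superinstructions(trace, 2)
```

With this trace the two pairs that occur twice, `getlocal, opt_mult` and `opt_mult, opt_plus`, are selected.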

Page 18: A  Study of superinstructions and dynamic mixin optimizations

Naive superinstructions effect on Ruby

[Figure: normalized execution time vs. number of superinstructions used, 4 benchmarks. Annotations: limited benefit; unpredictable effects]

Page 19: A  Study of superinstructions and dynamic mixin optimizations

Branch mispredictions

[Figure: plotted against number of superinstructions used, with normalized execution time; 2 benchmarks: mandelbrot and spectral_norm]

Page 20: A  Study of superinstructions and dynamic mixin optimizations

Branch mispredictions, reordered

[Figure: plotted against number of superinstructions used, reordered by execution time, with normalized execution time; 2 benchmarks: mandelbrot and spectral_norm]

Page 21: A  Study of superinstructions and dynamic mixin optimizations

So why is Ruby slow?

• Profile of numeric benchmarks
  – Garbage collection takes significant time
  – Boxed floating point values dominate allocation

Page 22: A  Study of superinstructions and dynamic mixin optimizations

Floating point value boxing

Typical Ruby 1.9 VM operation:

    OPT_PLUS:
      VALUE a = *(sp-2);
      VALUE b = *(sp-1);
      /* ... */
      if (CLASS_OF(a) == Float && CLASS_OF(b) == Float) {
        sp--;
        *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b));
      } else {
        CALL(1/*argnum*/, PLUS, a);
      }
      goto **pc++;

A new “box” object is allocated on each operation

Page 23: A  Study of superinstructions and dynamic mixin optimizations

Proposal: use superinstructions for boxing optimization

• 2 operations per allocation instead of 1

    OPT_MULT_OPT_PLUS:
      VALUE a = *(sp-3);
      VALUE b = *(sp-2);
      VALUE c = *(sp-1);
      /* ... */
      if (CLASS_OF(a) == Float && CLASS_OF(b) == Float && CLASS_OF(c) == Float) {
        sp-=2;
        *(sp-1) = NEW_FLOAT(DOUBLE_VALUE(a) + DOUBLE_VALUE(b)*DOUBLE_VALUE(c));
      } else {
        CALL(1/*argnum*/, MULT/*method*/, b/*receiver*/);
        CALL(1/*argnum*/, PLUS/*method*/, a/*receiver*/);
      }
      goto **pc++;

Boxing of intermediate result eliminated
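The allocation saving can be made visible with a small simulation. This is a sketch only: `Box` is a hypothetical stand-in for a heap-allocated Float object, and the three methods model the VM operations by name rather than reproducing the interpreter code.

```ruby
# Hypothetical stand-in for a boxed Float; counts how many boxes are made.
class Box
  @allocations = 0
  class << self
    attr_accessor :allocations
  end

  attr_reader :value

  def initialize(value)
    @value = value
    Box.allocations += 1
  end
end

# Separate operations: each one boxes its result.
def opt_mult(a, b)
  Box.new(a.value * b.value)
end

def opt_plus(a, b)
  Box.new(a.value + b.value)
end

# Fused operation: the intermediate product is never boxed.
def opt_mult_opt_plus(a, b, c)
  Box.new(a.value + b.value * c.value)
end

# Returns [unboxed result, number of boxes allocated by the block].
def allocations_during
  before = Box.allocations
  result = yield
  [result.value, Box.allocations - before]
end

a, b, c = Box.new(1.0), Box.new(2.0), Box.new(3.0)
p allocations_during { opt_plus(a, opt_mult(b, c)) }   # two boxes
p allocations_during { opt_mult_opt_plus(a, b, c) }    # one box
```

Both variants compute the same value; only the number of allocated boxes differs, which is what reduces garbage collection pressure.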

Page 24: A  Study of superinstructions and dynamic mixin optimizations

Implementation

• VM operations that handle floating point values directly:
  – opt_plus
  – opt_minus
  – opt_mult
  – opt_div
  – opt_mod
• We implemented all 25 combinations of length 2
  – Based on Ruby 1.9.1
  – Using existing Ruby infrastructure for superinstructions with some modifications

Page 25: A  Study of superinstructions and dynamic mixin optimizations

Limitations

• Coding style-sensitive
  – Can be fixed by adding getVariable superinstructions
• Not applicable to other types (e.g. Fixnum, Bignum, String)
  – Fixnum is already unboxed
  – Bignum and String cannot be unboxed
• Sequences of 3 arithmetic instructions or longer are virtually non-existent
  – No occurrences in the benchmarks

Page 26: A  Study of superinstructions and dynamic mixin optimizations

Results

• 0–22% faster on numeric benchmarks (avg 12%)
• No slowdown on other benchmarks

Page 27: A  Study of superinstructions and dynamic mixin optimizations

Evaluation

• Reduction in boxing translates to reduction in GC count
• GC reduction explains most of the speedup

Page 28: A  Study of superinstructions and dynamic mixin optimizations

Example: mandelbrot tweak

• Slight modification produces 20% difference in performance
  – 4 of 9 arithmetic instructions get merged into 2 superinstructions
  – 24% reduction in float allocation

    ITER.times do
    - tr = zrzr - zizi + cr
    + tr = cr + (zrzr - zizi)
    - ti = 2.0*zr*zi + ci
    + ti = ci + 2.0*zr*zi

[Figure: normalized execution time]

• Alternative solution: introduce OP-getdynamic-OP superinstructions

Page 29: A  Study of superinstructions and dynamic mixin optimizations

Discussion of alternative approaches

• Faster GC
  – Superinstructions benefit reduced
• Tagged values
  – 64 bit platforms only
• Stack allocation of intermediate results
• Dynamic specialization
• Type inference

Page 30: A  Study of superinstructions and dynamic mixin optimizations

Summary

• Naive approach to superinstructions does not produce substantial benefit for Ruby
• Boxing overhead of floating point values is a problem in Ruby
• Superinstructions provide some help (up to 23%, 12% on average)
  – implementation of 2000 SLOC, regular and thus automatically generatable
• Dynamicity same as original interpreter

Page 31: A  Study of superinstructions and dynamic mixin optimizations

Outline

• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion

Page 32: A  Study of superinstructions and dynamic mixin optimizations

Mixin

• Code composition technique

[Figure: two class diagrams of Server, BaseServer and AdditionalSecurity, labeled “Mixin use declaration” and “Mixin semantics”]

Page 33: A  Study of superinstructions and dynamic mixin optimizations

Dynamic mixin

• Temporary change in class hierarchy
• Available in Ruby, Python, JavaScript

[Figure: class hierarchy of Server and BaseServer, without and with AdditionalSecurity]

Page 34: A  Study of superinstructions and dynamic mixin optimizations

Dynamic mixin (2)

• Powerful technique of dynamic languages
• Enables
  – dynamic patching
  – dynamic monitoring
• Can be used to implement
  – Aspect-oriented programming
  – Context-oriented programming
• Widely used in Ruby, Python
  – e.g. Object-Relational Mapping

Page 35: A  Study of superinstructions and dynamic mixin optimizations

Dynamic mixin in Ruby

• Ruby has dynamic mixin
  – but only “install”, no “remove” operation
  – because there is uncertainty in “remove” semantics with transitive module inclusion
• “remove” can be implemented easily
  – 23 lines of code

Page 36: A  Study of superinstructions and dynamic mixin optimizations

Target application

• Mixin is installed and removed frequently
• Application server with dynamic features

    class BaseServer
      def process() … end
    end

    class Server < BaseServer
      def process()
        if request.isSensitive()
          Server.class_eval { include AdditionalSecurity }
        end
        super # delegate to superclass
        … # remove mixin
      end
    end

    module AdditionalSecurity
      def process()
        … # security check
        super # delegate to superclass
      end
    end
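The effect of installing the mixin at run time can be observed in stock Ruby. The sketch below reuses the class names from the slide, but the `trace` argument and the `run_request` helper are assumptions added to make the call order visible. It covers installation only, since stock Ruby has no operation to remove an included module.

```ruby
class BaseServer
  def process(trace)
    trace << :base
  end
end

module AdditionalSecurity
  def process(trace)
    trace << :security_check
    super                      # delegate to BaseServer#process
  end
end

class Server < BaseServer
  def process(trace)
    trace << :server
    super                      # delegates to whatever follows Server in the ancestor chain
  end
end

# Hypothetical helper: handle one request and return the call order.
def run_request
  Server.new.process([])
end

p run_request                          # before installation
Server.include(AdditionalSecurity)     # dynamic mixin installation
p run_request                          # the security check now runs in between
p Server.ancestors.take(3)
```

After `include`, `AdditionalSecurity` sits between `Server` and `BaseServer` in the ancestor chain, so the same `super` call in `Server#process` reaches a different target. This is the change that invalidates cached dispatch targets.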

Page 37: A  Study of superinstructions and dynamic mixin optimizations

Overhead is high

Possible reasons
• Invalidation granularity
  – clearing whole method cache
  – invalidating all inline caches
    • next calls require full method lookup
• Inline caching saves just 1 target
  – which changes with mixin operations
  – even though mixin operations are mostly repeated

Page 38: A  Study of superinstructions and dynamic mixin optimizations

Our research target

• Improve performance of applications which frequently use dynamic mixin
  – Make invalidation granularity smaller
  – Make dynamic dispatch target cacheable in presence of dynamic mixin operations

Page 39: A  Study of superinstructions and dynamic mixin optimizations

Proposal

• Reduce granularity of inline cache invalidation
  – Fine-grained state tracking
• Cache multiple dispatch targets
  – Polymorphic inline caching
• Enable cache reuse on repeated mixin installation and removal
  – Alternate caching

Page 40: A  Study of superinstructions and dynamic mixin optimizations

Basics: Inline caching

Consider a call site cat.speak() in executable code.

[Figure: cat is an instance of class Cat, a subclass of Animal; speak() { … } is the method implementation. The inline cache ic of the call site holds class Cat and method speak.]

Dynamic dispatch implementation:

    method = lookup(cat, "speak")
    method(cat)

Expensive! But the result is mostly the same.

Inline caching:

    if (cat has type ic.class) {
      ic.method(cat)
    } else {
      ic.method = lookup(cat, "speak")
      ic.class = cat.class
      ic.method(cat)
    }
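The inline caching pseudocode on this slide can be turned into a runnable sketch. `CallSite` is a hypothetical class standing in for one call site with its cache, and `instance_method` plays the role of the full lookup. The `lookups` counter is added only to show how often the slow path runs.

```ruby
# Hypothetical model of one call site with a monomorphic inline cache.
class CallSite
  attr_reader :lookups

  def initialize(name)
    @name = name
    @cached_class = nil
    @cached_method = nil
    @lookups = 0
  end

  def call(receiver)
    unless receiver.class.equal?(@cached_class)
      # Cache miss: do the expensive full lookup and remember the result.
      @lookups += 1
      @cached_method = receiver.class.instance_method(@name)
      @cached_class = receiver.class
    end
    @cached_method.bind(receiver).call
  end
end

class Animal
  def speak
    "..."
  end
end

class Cat < Animal
  def speak
    "meow"
  end
end

site = CallSite.new(:speak)
cat = Cat.new
3.times { puts site.call(cat) }
puts "full lookups: #{site.lookups}"
```

Three calls on the same receiver class trigger a single lookup. Note that this sketch has no invalidation: if `Cat#speak` were redefined, the cached method would be stale.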

Page 41: A  Study of superinstructions and dynamic mixin optimizations

Inline caching: problem

Inline caching:

    if (cat has type ic.class) {
      ic.method(cat)
    } else {
      ic.method = lookup(cat, "speak")
      ic.class = cat.class
      ic.method(cat)
    }

[Figure: cat is an instance of Cat, whose superclass is Animal; both Training and Animal provide speak() { … }; the inline cache ic still holds class Cat and method speak]

• What if the method has been overridden?

Page 42: A  Study of superinstructions and dynamic mixin optimizations

Inline caching: invalidation

    if (cat has type ic.class && state == ic.state) {
      ic.method(cat)
    } else {
      ic.method = lookup(cat, "speak")
      ic.class = cat.class; ic.state = state
      ic.method(cat)
    }

[Figure: the inline cache ic for cat.speak() stores class Cat, method speak and a state value; the global state value changes from 1 to 2]

Single global state object
• too coarse invalidation granularity
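A runnable sketch of the global-state guard follows. The `VM` module with its `bump_state!` method and the `GuardedCallSite` class are hypothetical names. In a real interpreter the counter would be bumped by the VM itself on every method definition or mixin operation, whereas here it is bumped by hand.

```ruby
# Hypothetical global state counter shared by all call sites.
module VM
  @state = 1
  class << self
    attr_reader :state

    # To be called on any method (re)definition or mixin operation.
    def bump_state!
      @state += 1
    end
  end
end

class GuardedCallSite
  attr_reader :lookups

  def initialize(name)
    @name = name
    @klass = nil
    @method = nil
    @state = nil
    @lookups = 0
  end

  def call(receiver)
    unless receiver.class.equal?(@klass) && VM.state == @state
      @lookups += 1
      @method = receiver.class.instance_method(@name)   # full lookup
      @klass = receiver.class
      @state = VM.state
    end
    @method.bind(receiver).call
  end
end

class Cat
  def speak
    "meow"
  end
end

site = GuardedCallSite.new(:speak)
cat = Cat.new
puts site.call(cat)
Cat.send(:define_method, :speak) { "purr" }   # override the method
VM.bump_state!                                # invalidates every call site
puts site.call(cat)
puts "full lookups: #{site.lookups}"
```

The guard keeps the cache correct, but a single counter means that any change anywhere forces a full lookup at every call site, including sites that the change cannot affect.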

Page 43: A  Study of superinstructions and dynamic mixin optimizations

Fine-grained state tracking

• Many state objects
  – small invalidation extent
  – share as much as possible
• One state object for each family of methods called from the same call site
• State objects associated with lookup path
  – links updated during method lookups

Page 44: A  Study of superinstructions and dynamic mixin optimizations

Lookup procedure

• Lookup normally
  – noting the state objects encountered
  – starting from the state object in the inline cache
• Choose the last state object
  – other state objects
    • mark for one-time invalidation
    • mark as overridden
• Store the pointer to the state object in every class on lookup path
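The benefit of finer invalidation granularity can be sketched in Ruby under a strong simplification: one state object per method name, instead of state objects linked to the classes on the lookup path and merged during lookups as described above. `MethodStates`, `TrackedCallSite` and `install_mixin` are hypothetical names used only for this illustration.

```ruby
# One state object per method name: a deliberate simplification of the
# scheme on the slides, used only to show the invalidation granularity.
StateObject = Struct.new(:value)

class MethodStates
  def initialize
    @by_name = Hash.new { |hash, name| hash[name] = StateObject.new(1) }
  end

  def state_for(name)
    @by_name[name]
  end

  # Bump only the state objects of the methods that changed.
  def invalidate(names)
    names.each { |name| @by_name[name].value += 1 }
  end
end

class TrackedCallSite
  attr_reader :lookups

  def initialize(name, states)
    @name = name
    @pstate = states.state_for(name)   # pointer to the shared state object
    @state = nil                       # snapshot of @pstate.value
    @klass = nil
    @method = nil
    @lookups = 0
  end

  def call(receiver)
    unless receiver.class.equal?(@klass) && @pstate.value == @state
      @lookups += 1
      @method = receiver.class.instance_method(@name)   # full lookup
      @klass = receiver.class
      @state = @pstate.value
    end
    @method.bind(receiver).call
  end
end

# Hypothetical helper: install a mixin, invalidate only what it overrides.
def install_mixin(klass, mixin, states)
  klass.include(mixin)
  states.invalidate(mixin.instance_methods(false))
end

class Animal
  def speak; "generic noise"; end
  def eat; "nom"; end
end
class Cat < Animal; end                # no implementation of speak here
module Training
  def speak; "trained meow"; end
end

states = MethodStates.new
speak_site = TrackedCallSite.new(:speak, states)
eat_site = TrackedCallSite.new(:eat, states)
cat = Cat.new
speak_site.call(cat)
eat_site.call(cat)
install_mixin(Cat, Training, states)
puts speak_site.call(cat)              # resolved through Training now
eat_site.call(cat)
puts "speak lookups: #{speak_site.lookups}, eat lookups: #{eat_site.lookups}"
```

Installing `Training` forces a new lookup only at the `speak` call site; the `eat` call site keeps its cached target because its state object was not touched.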

Page 45: A  Study of superinstructions and dynamic mixin optimizations

Invariant 1: lookup path

• After any number of lookups
  – for any IC = (pstate, state, class, name)
• Either of the following holds:
  – IC is invalidated
  – for any class’ ∊ lookup_path(class, name)
    • (class’, name) ↪ state
• Proof scheme (induction over lookups)
  – initial state is invalid
  – lookup procedure establishes invariant
  – on lookups that update state object links, state is linked transitively or invalidated

Page 46: A  Study of superinstructions and dynamic mixin optimizations

Invariant 2: validity of IC

• After any number of updates
  – for any IC = (pstate, state, class, name, target)
• Either of the following holds
  – IC is invalidated
  – target = lookup(class, name)
• Proof scheme (induction over updates)
  – any update either
    • does not affect lookup
    • invalidates IC

Page 47: A  Study of superinstructions and dynamic mixin optimizations

State object allocation

[Figure: call site cat.speak() with inline cache ic = (pstate, state, class, method). Animal defines speak() { *1* }; Cat has no implementation of speak. A state object for speak holds value 1; ic holds class Cat, method speak *1* and state 1.]

Inline caching code:

    if (cat has type ic.class && ic.pstate.state == ic.state) {
      ic.method(cat)
    } else {
      ic.method, ic.pstate = lookup(cat, "speak", ic.pstate)
      ic.class = cat.class; ic.state = ic.pstate.state
      ic.method(cat)
    }

Page 48: A  Study of superinstructions and dynamic mixin optimizations

Mixin installation

[Figure: mixin Training with speak() { *2* } is installed between Cat and Animal, which defines speak() { *1* }. The state object value changes from 1 to 2; after the next call, ic for cat.speak() holds class Cat, method speak *2* and state 2.]

Inline caching code:

    if (cat has type ic.class && ic.pstate.state == ic.state) {
      ic.method(cat)
    } else {
      ic.method, ic.pstate = lookup(cat, "speak", ic.pstate)
      ic.class = cat.class; ic.state = ic.pstate.state
      ic.method(cat)
    }

Page 49: A  Study of superinstructions and dynamic mixin optimizations

Mixin removal

[Figure: mixin Training with speak() { *2* } is removed from between Cat and Animal, which defines speak() { *1* }. The state object value changes from 2 to 3; after the next call, ic for cat.speak() holds class Cat, method speak *1* and state 3.]

Inline caching code:

    if (cat has type ic.class && ic.pstate.state == ic.state) {
      ic.method(cat)
    } else {
      ic.method, ic.pstate = lookup(cat, "speak", ic.pstate)
      ic.class = cat.class; ic.state = ic.pstate.state
      ic.method(cat)
    }

Page 50: A  Study of superinstructions and dynamic mixin optimizations

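Alternate caching

• Detect repetition
• Conflicts detected by state check

[Figure: mixin Training with speak() { *2* } is repeatedly installed and removed between Cat and Animal, which defines speak() { *1* }. An alternate cache attached to Training records its superclass (Animal) and the state values 3 and 4 for speak, so the state object alternates between 3 and 4 and ic for cat.speak() alternates between speak *1* and speak *2*.]

Inline cache contents oscillate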

Page 51: A  Study of superinstructions and dynamic mixin optimizations

Polymorphic caching

• Use multiple entries in inline cache

[Figure: same hierarchy and alternate cache as on the previous slide; ic for cat.speak() now has two entries: (class Cat, method *1*, state 3) and (class Cat, method *2*, state 4).]

Now PIC handles the oscillating state value
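The interplay of alternate caching and polymorphic inline caching can be sketched with a toy model, since stock Ruby cannot remove a mixin. All names below (`AlternatingState`, `ToggledFamily`, `PolymorphicCallSite`) are hypothetical. The sketch keys cache entries on the state value only and omits the class check and the conflict detection that the full scheme needs when other updates interleave.

```ruby
# State object whose value can be set to a fresh number or restored.
class AlternatingState
  attr_reader :value

  def initialize
    @value = 1
    @next_fresh = 2
  end

  def fresh!
    @value = @next_fresh
    @next_fresh += 1
    @value
  end

  def restore(value)
    @value = value
  end
end

# Toy model of one method family with a mixin that is toggled on and off.
class ToggledFamily
  attr_reader :state

  def initialize(base_impl, mixin_impl)
    @impls = { false => base_impl, true => mixin_impl }
    @installed = false
    @state = AlternatingState.new
    # Alternate cache: configuration => state value used for it before.
    @alternate = { false => @state.value }
  end

  def toggle_mixin!
    @installed = !@installed
    if @alternate.key?(@installed)
      @state.restore(@alternate[@installed])    # repetition detected
    else
      @alternate[@installed] = @state.fresh!    # first time: fresh value
    end
  end

  def lookup
    @impls[@installed]
  end
end

# Call site with a polymorphic inline cache keyed by state value.
class PolymorphicCallSite
  attr_reader :misses

  def initialize(family)
    @family = family
    @entries = {}
    @misses = 0
  end

  def call
    target = @entries[@family.state.value]
    if target.nil?
      @misses += 1
      target = @entries[@family.state.value] = @family.lookup
    end
    target.call
  end
end

family = ToggledFamily.new(-> { "*1*" }, -> { "*2*" })
site = PolymorphicCallSite.new(family)
results = Array.new(6) do
  result = site.call
  family.toggle_mixin!
  result
end
puts results.join(" ")
puts "cache misses: #{site.misses}"
```

Because the state value returns to a previously used number on each repetition, the call site misses only twice, once per configuration, no matter how often the mixin is toggled.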

Page 52: A  Study of superinstructions and dynamic mixin optimizations

Invariant 3: validity of alternate caching

• After any number of updates
  – for any IC = (pstate, state, class, name, target)
• Either of the following holds
  – value(pstate) != state (IC is invalid)
  – target = lookup(class, name)
• Proof scheme (induction on updates)
  – any update either
    • introduces fresh value(pstate) for invalidation
    • preserves correctness of cached target

Page 53: A  Study of superinstructions and dynamic mixin optimizations

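State object merge

[Figure: call sites cat.speak() and animal.speak() in executable code, with instances cat and animal; Training defines speak() { *2* } and Animal defines speak() { *1* }; a loop while(true) { remove mixin } is shown. The state objects of the two call sites are merged: one is marked “overridden by” the other and receives a one-time invalidation.]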

Page 54: A  Study of superinstructions and dynamic mixin optimizations

Alternate caching limitations

• Independent mixins do not conflict
• If two mixins override the same method
  – scopes need to be properly nested

[Figure: state object value (for some method) plotted against dynamic mixin scope]

Page 55: A  Study of superinstructions and dynamic mixin optimizations

Overheads of proposed scheme

• Increased memory use
  – 1 state object per polymorphic method family
  – additional method entries
  – alternate cache
  – polymorphic inline cache entries
• Some operations become slower
  – Lookup needs to track and update state objects
  – Explicit state object checks on method dispatch

Page 56: A  Study of superinstructions and dynamic mixin optimizations

Generalizations (beyond Ruby)

• Delegation object model
  – track arbitrary delegation pointer change
• Thread-local delegation
  – allow for thread-local modification of delegation pointer
  – by having thread-local state object values

Page 57: A  Study of superinstructions and dynamic mixin optimizations

Evaluation

• Implementation based on Ruby 1.9.2
  – about 1000 lines of code
• Hardware
  – Intel Core i7 860 2.8 GHz

Page 58: A  Study of superinstructions and dynamic mixin optimizations

Evaluation: microbenchmarks

Single method call overhead

• Inline cache hit
  – state checks 1%
  – polymorphic inline caching 49% overhead
• Full lookup
  – 2x slowdown

Page 59: A  Study of superinstructions and dynamic mixin optimizations

Dynamic mixin-heavy microbenchmark

[Figure: normalized execution time (smaller is better): 100%, 23%, 17%, 15%]

Page 60: A  Study of superinstructions and dynamic mixin optimizations

Evaluation: application

• Application server with dynamic mixin on each request

[Figure: normalized execution time (smaller is better). Bar labels: baseline, method cache state checks, fgst, fgst + PIC, fgst + PIC + altern. Values: 100%, 70%, 58%, 60%, 52%.]

Page 61: A  Study of superinstructions and dynamic mixin optimizations

Evaluation

• Fine-grained state tracking considerably reduces overhead
• Alternate caching brings only small improvement
  – Number of call sites affected by mixin is low
  – Lookup cost / inline cache hit cost is low
    • about 1.6x on Ruby

Page 62: A  Study of superinstructions and dynamic mixin optimizations

Related work

• Dependency tracking in Self
  – focused on reducing recompilation, rather than reducing method lookups
• Inline caching for Objective-C
  – state object associated with method, no dynamic mixin support

Page 63: A  Study of superinstructions and dynamic mixin optimizations

Summary

• We proposed a combination of techniques
  – Fine-grained state tracking
  – Alternate caching
  – Polymorphic inline caching
• To increase efficiency of inline caching
  – with frequent dynamic mixin installation and removal

Page 64: A  Study of superinstructions and dynamic mixin optimizations

Outline

• Introduction
• Superinstructions
• Dynamic mixin optimization
  – Evaluation in compiled environment
• Conclusion

Page 65: A  Study of superinstructions and dynamic mixin optimizations

Evaluation in compiled environment

• Dynamic mixin optimizations
  – applicable to compiled systems too
• In order to confirm this hypothesis
  – we implemented a dynamic compiler for the language IO
  – we evaluated the performance of inline caching

Page 66: A  Study of superinstructions and dynamic mixin optimizations

Idea: efficient dynamic mixin

[Figure: control flow graph of the compiled call site server.f() under repeated dynamic mixin install / removal. State guards “if (s == 1)” and “if (s == 2)” select the inlined method f1 or f2; otherwise the inline cache miss is handled; execution then continues.]

Page 67: A  Study of superinstructions and dynamic mixin optimizations

Dynamic compiler

• Source language: IO
  – only subset implemented
• Compilation uses profile collected in PICs
• PICs are initialized by interpreted execution

Page 68: A  Study of superinstructions and dynamic mixin optimizations

Microbenchmark results

• Overhead of state checks is reasonable (less than 16 CPU cycles)
• The most common case has just 1 cycle of overhead

Page 69: A  Study of superinstructions and dynamic mixin optimizations

Summary and ongoing work

• We evaluated inline caching optimization for dynamic mixin in compiled environment
• We verified that the optimization is effective
• The most common case has very low overhead

Future work
• The dynamic mixin switch operation may be slow; this needs to be addressed in future work
• Full IO language support

Page 70: A  Study of superinstructions and dynamic mixin optimizations

Conclusion

• We researched two techniques
  – Superinstructions for Ruby
  – Dynamic mixin optimization
• Our techniques improve performance
• While keeping the advantages
  – high dynamicity
  – low implementation complexity

Page 71: A  Study of superinstructions and dynamic mixin optimizations

Publications

• S. Zakirov, S. Chiba, E. Shibayama. How to select superinstructions, IPSJ PRO 2010.

• S. Zakirov, S. Chiba, E. Shibayama. Optimizing dynamic dispatch with fine-grained state tracking, DLS 2010.
