Context Threading: A flexible and efficient dispatch …Research supported by IBM CAS, NSERC, CITO 1...
Transcript of Context Threading: A flexible and efficient dispatch …Research supported by IBM CAS, NSERC, CITO 1...
1Research supported by IBM CAS, NSERC, CITO
Context Threading: A flexible and efficient dispatch technique for
virtual machine interpreters
Marc BerndlBenjamin VitaleMathew Zaleski
Angela Demke Brown
Context Threading 2
Interpreter performance
• Why not just JIT?• High performance JITs still interpret• People use interpreted languages that don’t
yet have JITs• They still want performance!
• 30-40% of execution time is due to branch misprediction
• Our technique eliminates 95% of branch mispredictions
Context Threading 3
Overview
Motivation•Background: The Context Problem•Existing Solutions•Our Approach•Inlining•Results
Context Threading 4
A Tale of Two Machines
LoadedProgram
VirtualProgram load
Return Address
Wayness(Conditional)
Execution Cycle
BytecodeBodies
Pipeline
Target Address(Indirect)
Pre
dict
ors
Execution Cycle
Virtual Machine Interpreter
Real MachineCPU
Context Threading 5
Interpreter
LoadedProgram
Bytecodebodies
Internal Representation
fetch
dispatchLoadParms
execute
Execution Cycle
Context Threading 6
Running Java Example
void foo(){int i=1;do{i+=i;} while(i<64);
}
0: iconst_01: istore_12: iload_13: iload_14: iadd5: istore_16: iload_17: bipush 649: if_icmplt 212: return
Java Source Java Bytecode
Javac compiler
Context Threading 7
Switched Interpreter…iload_1iload_1iaddistore_1iload_1bipush 64if_icmplt 2…
VirtualProgram
while(1){switch(*vPC++){
}};
Switched BodyImplementation
case iload_1:..
break;
case iadd:..
break;
-7
<if_icmplt>64<bipush><iload_1><istore_1><iadd><iload_1><iload_1>
InternalRepresentation
vPC
Simple, portable and extremely slow
Context Threading 8
Direct Threaded Interpreter
-7
&&if_icmplt64&&bipush&&iload_1&&istore_1&&iadd&&iload_1&&iload_1
…iload_1iload_1iaddistore_1iload_1bipush 64if_icmplt 2…
DTT - DirectThreading Table
VirtualProgram
vPC
iload_1:..
goto *vPC++;
iadd:..
goto *vPC++;
Target of computed goto is data-driven
Context Threading 9
Context Problem
-7
&&if_icmplt
64
&&bipush
&&iload_1
&&istore_1
&&iadd
&&iload_1
&&iload_1
DTT - DirectThreading Table
iload_1:..
goto *vPC++;
<empty>
<empty>
<empty>
Indirect Branch Predictors
vPCiload_1
iadd
bipush
-7
&&if_icmplt
64
&&bipush
&&iload_1
&&istore_1
&&iadd
&&iload_1
&&iload_1
&&iload_1&&iadd&&bipush
-7
&&if_icmplt
64
&&bipush
&&iload_1
&&istore_1
&&iadd
&&iload_1
&&iload_1
Context Threading 10
Existing Solutions
BodyBodyBodyBodyBody
GOTO *PC
????
Piumarta & Ricardi :Bodies Replicated
Super InstructionReplicate
iload_1GOTO *PC
1
iload_1GOTO *PC
2
1
1
2
2
Ertl & Gregg:Bodies and Dispatch
Replicated
Limited to relocatable virtual instructions
Context Threading 11
Overview
MotivationBackground: The Context ProblemExisting Solutions
•Our Approach•Inlining•Results
Context Threading 12
Key Observation
• Virtual and native control flow have same branch types• Linear (not really a branch)• Conditional• Calls and Returns• Indirect
• Hardware has predictors for each typeSolution: Leverage hardware predictors
Context Threading 13
Essence of our Solution
iload_1:..
ret;
iadd:..
ret;..call iload_1call istore_1call iaddcall iload_1call iload_1
CTT - ContextThreading Table (generated code)
Bytecode bodies (ret terminated)
Return Branch Predictor Stack
…iload_1iload_1iaddistore_1iload_1bipush 64if_icmplt 2…
Package bodies as subroutines and call them
Context Threading 14
Context Threading
iload_1:…
ret;
iadd:…
ret;
call bipushcall if_icmplt
call iload_1call istore_1call iaddcall iload_1call iload_1
CTT load timegenerated code
Bytecode bodies (ret terminated)
if_cmplt:…
goto *vPC++;
64
-7
…iload_1iload_1iaddistore_1iload_1bipush 64if_icmplt 2…
DTT containsaddresses in CTT
vPC
Generate calls into the CTT at load time
Context Threading 15
The Context Threading Table
• A sequence of calls• A sequence of generated instructions• An internal representation of the
program’s control flow• Virtual branches are also control flow
Can virtual branches go into the CTT?
Context Threading 16
Virtual Branches
target:
……
…
…call …call iload_1call if_icmplt…
Context Threading Table
5
DTT
vPCif(icmplt)pc=target;
goto *vPC++;
VirtualBranch
Context problem is worse for virtual branches
Context Threading 17
Specialized Branch Inlining
Conditional Branch Predictor Entry
……target:
…call …call iload_1
if(icmplt)goto target:
…
Branch Inlined Into the CTT
5
DTT
vPC
target:…
…
Inlining conditional branches provides context
Context Threading 18
Tiny Inlining
• Context Threading is a dispatch technique • But, we inline branches
• Some non-branching bodies are very small• Why not inline those?
Inline all tiny linear bodies into the CTT
Context Threading 19
What can go in the CTT?
• Calls to bodies• Inlined bodies• Mixed-Mode virtual machine?• Partially Inlined bodies• Calls to subroutines that aren’t
bytecodes• Generated code
Performance?
Context Threading 20
Overview
MotivationBackground: The Context ProblemExisting SolutionsOur ApproachInlining
•Results
Context Threading 21
Experimental Setup
• Two Virtual Machines on two hardware architectures.• VM: Java/SableVM, OCaml interpreter
• Arch: P4, PPC
• Branch Misprediction
• Execution Time
Is our technique effective and general?
Context Threading 22
Mispredicted Taken Branches
0.00
0.25
0.50
0.75
1.00
compre
ss dbjackjava
cjess
mpeg
mtrt raysc
imarkso
otSubroutine Branch Inlining Tiny Inlining
Nor
mal
ized
to
Dir
ect
Thr
eadi
ng
95% mispredictions eliminated on average
Context Threading 23
Execution time
0.00
0.25
0.50
0.75
1.00
compre
ss db jack
javac jes
smpe
gmtrt ray
scim
ark soot
Subroutine Branch Inlining Tiny Inlining
Nor
mal
ized
to
Dir
ect
Thr
eadi
ng
27% average reduction in execution time
Context Threading 24
Execution Time (geomean)
0.00
0.25
0.50
0.75
1.00
java/p4
java/ppc
ocaml/p
4
ocaml/p
pc
Subroutine Branch Inlining Tiny Inlining
Nor
mal
ized
to
Dir
ect
Thr
eadi
ng
Our technique is effective and general
Context Threading 25
Conclusions
•Context Problem: branch mispredictions due to mismatch between native and virtual control flow
•Solution: Generate control flow code into the Context Threading Table
•Results•Eliminate 95% of branch mispredictions•Reduce execution time by 30-40%