1 Java Bytecode Optimization Optimizing Java Bytecode for Embedded Systems Stefan Hepp.

1

Java Bytecode Optimization

Optimizing Java Bytecode

for Embedded Systems

Stefan Hepp

2

Overview

■ Toolchain JOP, JIT vs. ahead-of-time compilation

■ Existing open source tools

■ JOPtimizer framework and code representations

■ Inlining

■ Results

3

Toolchain Overview

■ Source code is compiled with javac to Java bytecode

■ Optimization is deferred to the JVM, which uses profiling information and a JIT compiler

■ Not feasible on embedded processors like JOP

4

Toolchain Overview

■ Ahead-of-time optimization needed

■ Optimization of bytecode for target platform

■ Output is Java bytecode

■ Profiling vs. static WCET

5

Toolchain Overview

■ Advantages over JIT

Runtime of the optimizer is not critical

No warm-up phase to gather profiling information and to do JIT compilation

■ Disadvantages

Less accurate or no profiling information available at design time; the class hierarchy may change dynamically

Target platform must be known

6

Existing Tools

                 ProGuard   Soot (-O)   Soot (-w -W)   Purdue Bloat   Jopt, others
Bytecode (%)     -26,2%     -2,4%       6,3%
Ext. RAM (%)     -26,1%     -1,2%       3,5%
Sieve (%)        1,1%       -2,7%       -2,8%
Kfl (%)          0,1%       0,0%        11,2%
UdpIp (%)        2,1%       3,8%        15,7%
Lift (%)         1,0%       -5,3%       7,8%
Exec Time (s)    3          62          61             165

Pro

ProGuard: very fast, reduces memory
Soot (-O): none
Soot (-w -W): large speedup
Purdue Bloat: easy to use

Contra

ProGuard: nearly no speedup
Soot (-O): very slow, no speedup
Soot (-w -W): slow, more memory, not always optimal
Purdue Bloat: very slow, bytecode not readable by BCEL
Jopt, others: not available, commercial or only obfuscators

■ Soot framework looks promising, but not designed for embedded systems and very complex

■ Other open source tools usually only remove unused methods and obfuscate code

7

JOPtimizer

■ JOPtimizer: a new framework for optimizations

Intermediate code representations

Inlining which respects method size restrictions

8

Assumptions

■ Assumptions about embedded applications

No dynamic class loading or class modifications at runtime

Reflection is not used

All class files are available at compile-time (except native classes)

■ Allows more optimizations (but the assumptions can be disabled)

■ Exclude “library” code (like java.*)

Define: library classes must not extend/reference application classes

9

Java Class Files

■ A class consists of:

Constant pool (CP): indexed table of constants (numbers, strings, class names, method names, signatures, ...)

Classname, super-class, interfaces (references to CP)

Fields, methods: name, signature, flags

Method code as attribute of methods

Stack architecture with variable length encoding

■ Parsing and compiling of class files is done by existing libraries (BCEL, ASM, ...)
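To make the layout concrete, the following minimal sketch reads only the fixed-size start of a class file by hand: the magic number, the version, and the constant pool count. It is an illustration of the format, not a replacement for BCEL or ASM; the class name `ClassFileHeaderSketch` and its fields are hypothetical.

```java
import java.nio.ByteBuffer;

/** Hypothetical helper: decodes the first 10 bytes of a class file. */
class ClassFileHeaderSketch {
    final int minorVersion;
    final int majorVersion;
    /** Number of constant pool entries plus one, as stored in the file. */
    final int constantPoolCount;

    private ClassFileHeaderSketch(int minor, int major, int cpCount) {
        this.minorVersion = minor;
        this.majorVersion = major;
        this.constantPoolCount = cpCount;
    }

    static ClassFileHeaderSketch read(byte[] classBytes) {
        // ByteBuffer is big-endian by default, which matches the class file format.
        ByteBuffer in = ByteBuffer.wrap(classBytes);
        if (in.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("missing 0xCAFEBABE magic number");
        }
        int minor = in.getShort() & 0xFFFF;   // u2 minor_version
        int major = in.getShort() & 0xFFFF;   // u2 major_version
        int cpCount = in.getShort() & 0xFFFF; // u2 constant_pool_count
        return new ClassFileHeaderSketch(minor, major, cpCount);
    }
}
```

The constant pool entries, class name, fields, and methods follow this header; decoding them is the part the libraries take over.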

10

The JVM Instruction Set

■ (partially) typed stack instructions

■ 32-bit (int, float, reference, byte, short, ...) and 64-bit (long, double) variables

■ Exception handling, synchronization, subroutines

■ Stack and variable table entries are always 32 bits wide

■ No indirect jumps, stack size must be static

private Map m;

private void test(int i) {
    int j[] = new int[2];
    float a = 2.0f;
    j[0] = i * (int) a;
    m.put(this, j);
}

private test(I)V
    ICONST_2
    NEWARRAY T_INT
    ASTORE 2
    FCONST_2
    FSTORE 3
    ALOAD 2
    ICONST_0
    ILOAD 1
    FLOAD 3
    F2I
    IMUL
    IASTORE
    ALOAD 0
    GETFIELD #4
    ALOAD 0
    ALOAD 2
    INVOKEINTERFACE #7
    POP
    RETURN
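Because there are no indirect jumps and the stack size is static, the maximum operand stack depth of a method can be computed without running it. The sketch below does this for straight-line code only, using a hand-written table of net stack effects for just the instructions of the `test(I)V` example. The class `StackDepthSketch` and its table are hypothetical; in particular, the effects of `GETFIELD` and `INVOKEINTERFACE` are hard-coded for this example (a 32-bit field and `Map.put(Object, Object)`), whereas a real tool would derive them from the constant pool.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StackDepthSketch {
    /** Net change of the stack depth (pushes minus pops), in 32-bit slots. */
    private static final Map<String, Integer> NET_EFFECT = new HashMap<>();
    static {
        NET_EFFECT.put("ICONST_0", 1);
        NET_EFFECT.put("ICONST_2", 1);
        NET_EFFECT.put("FCONST_2", 1);
        NET_EFFECT.put("ALOAD", 1);
        NET_EFFECT.put("ILOAD", 1);
        NET_EFFECT.put("FLOAD", 1);
        NET_EFFECT.put("ASTORE", -1);
        NET_EFFECT.put("FSTORE", -1);
        NET_EFFECT.put("NEWARRAY", 0);        // pops the length, pushes the array
        NET_EFFECT.put("F2I", 0);             // pops a float, pushes an int
        NET_EFFECT.put("IMUL", -1);           // pops two ints, pushes one
        NET_EFFECT.put("IASTORE", -3);        // pops array, index, value
        NET_EFFECT.put("GETFIELD", 0);        // assumption: 32-bit field
        NET_EFFECT.put("INVOKEINTERFACE", -2);// assumption: Map.put(Object, Object)
        NET_EFFECT.put("POP", -1);
        NET_EFFECT.put("RETURN", 0);
    }

    /** Returns the maximum stack depth of a straight-line instruction sequence. */
    static int maxDepth(List<String> instructions) {
        int depth = 0;
        int max = 0;
        for (String insn : instructions) {
            String mnemonic = insn.split(" ")[0];
            Integer delta = NET_EFFECT.get(mnemonic);
            if (delta == null) {
                throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
            depth += delta;
            max = Math.max(max, depth);
        }
        return max;
    }
}
```

Code with branches would need the same bookkeeping per basic block, with the depth at each block entry propagated along the control-flow edges.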

11

Stackcode Representation

■ Internal representation (“stackcode”)

Types and constant values as parameters of instructions to reduce the number of different instructions (~40 stackcode instructions)

Stack emulation to determine operand types for all instructions (swap, dup, ..)

Variables and types instead of 32-bit slots

Constant values instead of references into CP

Basic blocks are also split at exception handler ranges

■ Still a stack architecture

■ Stackcode can be mapped directly to bytecode (allows analysis of code size and execution time)

12

Quadcode Representation

■ The stack creates implicit dependencies between instructions and blocks, which makes optimizations more complex

■ Quadruple form of code (“quadcode”)

Create local variable per stack slot, emulate stack to determine the arguments of instructions

Instructions with types and constants as parameters

Instructions to manipulate the stack are not needed (pop) or are replaced with copy instructions (load, swap, dup, ...)
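The stack emulation described above can be sketched in a few lines. The code below is a simplified illustration and not JOPtimizer's implementation: it handles only a handful of textual stackcode instructions (`load`, `push`, `store`, `getfield`, `binop`, `return`, `pop`), assumes every value occupies one stack slot, and names the variable for stack slot n `sn`. The class name `QuadcodeSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class QuadcodeSketch {
    /** Translates straight-line stackcode into quadruple form by emulating the stack depth. */
    static List<String> toQuadcode(List<String> stackcode) {
        List<String> quads = new ArrayList<>();
        int depth = 0; // number of values currently on the emulated stack
        for (String insn : stackcode) {
            int space = insn.indexOf(' ');
            String op = space < 0 ? insn : insn.substring(0, space);
            String arg = space < 0 ? "" : insn.substring(space + 1);
            if (op.startsWith("load.") || op.startsWith("push.")) {
                // Pushing a variable or constant becomes a copy into the next stack variable.
                String type = op.substring(op.indexOf('.') + 1);
                quads.add("copy." + type + " s" + depth + ", " + arg);
                depth++;
            } else if (op.startsWith("store.")) {
                depth--;
                quads.add("copy." + op.substring("store.".length()) + " " + arg + ", s" + depth);
            } else if (op.equals("getfield")) {
                // Pops the object reference and pushes the field value into the same slot.
                int top = depth - 1;
                quads.add("getfield." + arg + " s" + top + ", s" + top);
            } else if (op.startsWith("binop.")) {
                depth--; // two operands are popped, one result is pushed
                int left = depth - 1;
                quads.add(op + " s" + left + ", s" + left + ", s" + depth);
            } else if (op.startsWith("return.")) {
                depth--;
                quads.add(op + " s" + depth);
            } else if (op.equals("pop")) {
                depth--; // no quadcode instruction needed
            } else {
                throw new IllegalArgumentException("unsupported instruction: " + insn);
            }
        }
        return quads;
    }
}
```

Since the stack depth at every instruction is static, each stack slot maps to exactly one variable name, which is what makes the dependencies between instructions explicit.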

13

Quadcode Representation

■ Quadcode representation enables simpler implementation of optimizations, but code cannot be mapped to bytecode directly

■ Stackcode and Quadcode similar to Soot internal representations (Baf, Jimple, Shimple)

public int calc(int a, int b) {
    // quadcode                        // stackcode                 // bytecode
    copy.ref s0, l0                    // load.ref l0               // aload_0
    getfield.'Test.fField' s0, s0      // getfield 'Test.fField'    // getfield #3
    copy.float s1, 2.0f                // push.float 2.0f           // fconst_2
    binop.float.div s0, s0, s1         // binop.float.div           // fdiv
    copy.float l3, s0                  // store.float l3            // fstore_3
    copy.int s0, l1                    // load.int l1               // iload_1
    return.int s0                      // return.int                // ireturn
}

14

Creation of Bytecode

■ Transformation back from quadcode to bytecode

Create complete expressions from instructions (“Decompile” code), compile expression trees to JVM instructions like javac (Soot does this (Grimp))

Create stack form of quadruple instructions, compile to bytecode (JOPtimizer does this, optional in Soot)

Per quadcode instruction: load parameters on stack, execute operation and store result back

■ load/store elimination and local variable allocation for stackcode needed before bytecode can be created

■ Decompilation method of Soot gets slightly better results
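The per-instruction scheme (load parameters, execute, store the result) can be sketched as follows. This is a simplified illustration under stated assumptions, not JOPtimizer's code: the `Quad` class and the textual instruction names are hypothetical, and every operand is assumed to have the same type as the result.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QuadToStackSketch {
    /** One quadruple: result = op(operands); all names refer to local variables. */
    static final class Quad {
        final String op;
        final String type;
        final String result; // null if the instruction produces no value
        final List<String> operands;

        Quad(String op, String type, String result, String... operands) {
            this.op = op;
            this.type = type;
            this.result = result;
            this.operands = Arrays.asList(operands);
        }
    }

    /** Naive translation: load every operand, run the operation, store the result. */
    static List<String> toStackcode(List<Quad> quads) {
        List<String> out = new ArrayList<>();
        for (Quad q : quads) {
            for (String operand : q.operands) {
                out.add("load." + q.type + " " + operand);
            }
            out.add(q.op);
            if (q.result != null) {
                out.add("store." + q.type + " " + q.result);
            }
        }
        return out;
    }
}
```

For example, `binop.float.div s0, s0, s1` becomes two loads, the division, and a store. A value that the next instruction consumes immediately is stored and reloaded, which is why load/store elimination is needed afterwards.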

15

Inlining

■ Invocations are expensive on JOP

■ Inline methods to eliminate invocation overhead

■ Inlining is not always possible

Callee code restrictions

Code size and variable table size restrictions of JOP

■ Inlining comes at a price

Caller code size increases, which makes a cache miss of the caller more costly

Overall program size increases if the callee cannot be removed (e.g. because it is called somewhere else)

16

Inlining methods

1. Traverse callgraph bottom-up (leaves first)

2. Find and devirtualize invocations

Invocations of static, final, and private methods are not virtual

Check the class hierarchy for overriding methods

3. Check if inlining is admissible

4. Estimate gain

5. Replace invocation with copy of callee

Insert a null-pointer check for the callee object reference

Map local variables of the callee above the caller's variables
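Step 1 of the list above, the bottom-up traversal, amounts to a post-order walk of the call graph. The sketch below assumes the call graph is given as a map from method name to the names of its callees; the class `CallGraphOrderSketch` and this representation are hypothetical, not JOPtimizer's data structures.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CallGraphOrderSketch {
    /** Returns the methods reachable from root, callees before callers (leaves first). */
    static List<String> bottomUp(Map<String, List<String>> callees, String root) {
        List<String> order = new ArrayList<>();
        visit(root, callees, new HashSet<>(), order);
        return order;
    }

    private static void visit(String method, Map<String, List<String>> callees,
                              Set<String> seen, List<String> order) {
        if (!seen.add(method)) {
            return; // already visited; this also breaks cycles caused by recursion
        }
        for (String callee : callees.getOrDefault(method, Collections.emptyList())) {
            visit(callee, callees, seen, order);
        }
        order.add(method); // appended only after all of its callees
    }
}
```

Processing leaves first means that when a caller is examined, its callees already have their final size after their own inlining.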

17

Inlining Checks

■ Inlining is not possible if

the new code size or variable table size of the caller exceeds platform limits

the callee uses exception handlers or synchronized code

Throwing an exception clears the stack

The stack of the caller would need to be saved and restored if an exception is handled within the inlined method (not yet implemented in JOPtimizer)

the method or class is excluded from inlining by configuration (caller or callee, e.g. Native class)
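These checks can be combined into a single predicate. The following is a minimal sketch with hypothetical names (`InlineCheckSketch`, `MethodInfo`); the limits are passed in as parameters because the slides do not state JOP's actual values. Adding the two code sizes slightly overestimates the result, since the removed invoke instruction is not subtracted.

```java
class InlineCheckSketch {
    /** Hypothetical summary of the properties the checks need. */
    static final class MethodInfo {
        final int codeSize;       // bytes of bytecode
        final int localSlots;     // entries in the variable table
        final boolean hasExceptionHandlers;
        final boolean usesSynchronization;
        final boolean excluded;   // excluded from inlining by configuration

        MethodInfo(int codeSize, int localSlots, boolean hasExceptionHandlers,
                   boolean usesSynchronization, boolean excluded) {
            this.codeSize = codeSize;
            this.localSlots = localSlots;
            this.hasExceptionHandlers = hasExceptionHandlers;
            this.usesSynchronization = usesSynchronization;
            this.excluded = excluded;
        }
    }

    static boolean admissible(MethodInfo caller, MethodInfo callee,
                              int maxCodeSize, int maxLocalSlots) {
        if (caller.excluded || callee.excluded) {
            return false;
        }
        if (callee.hasExceptionHandlers || callee.usesSynchronization) {
            return false;
        }
        // Callee locals are mapped above the caller's, so the slot counts add up.
        int newCodeSize = caller.codeSize + callee.codeSize;
        int newLocalSlots = caller.localSlots + callee.localSlots;
        return newCodeSize <= maxCodeSize && newLocalSlots <= maxLocalSlots;
    }
}
```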

18

Inlining Checks (cont.)

■ Check field and method references in callee code

Must be accessible from the caller

Otherwise make the field or method public if possible

Always possible for fields as they are not virtual in Java

All overriding methods must be made public too

If a private method is made public, all invocations have to be changed from invokespecial to invokevirtual (luckily only methods of callee class have to be searched)

Naming conflicts or dynamic class loading can prevent changes, thus preventing inlining

class A
    public a()
        tmp = new C()
        invoke tmp.b()

class B
    public b()
        if (v == null) invoke B.c()
    private c()

class C extends B
    private c()
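One reading of this example: inlining `b()` into `A.a()` requires `B.c()` to become public, but `C` declares its own private `c()`, which would then clash with the widened method. The sketch below checks for exactly this kind of naming conflict in the subclasses. The class `AccessWideningSketch` and the map-based hierarchy are hypothetical, and the check covers only the private-method case.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

class AccessWideningSketch {
    /**
     * Returns true if no direct or indirect subclass of owner declares a method
     * with the given signature, so a private method can be made public safely.
     */
    static boolean canMakePublic(String owner, String signature,
                                 Map<String, List<String>> directSubclasses,
                                 Map<String, Set<String>> declaredMethods) {
        for (String sub : directSubclasses.getOrDefault(owner, Collections.emptyList())) {
            if (declaredMethods.getOrDefault(sub, Collections.emptySet()).contains(signature)) {
                return false; // naming conflict in a subclass
            }
            if (!canMakePublic(sub, signature, directSubclasses, declaredMethods)) {
                return false;
            }
        }
        return true;
    }
}
```

With the hierarchy of the example, the check rejects widening `B.c()` because `C` declares `c()`, so `b()` cannot be inlined into `A.a()`.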

19

Inlining Checks (cont.)

■ Estimate gain of inlining

Depends on cache state

Possible degradation of performance if the inlined method is seldom invoked

Calculate gain based on invocation frequency and cache state estimations

Decrease weight of callees with multiple call sites to limit the increase in application code size

Select method with highest (positive) weight for inlining

■ Add inlined invocations to inlining candidate list, repeat inlining (check with new codesize)
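The selection step can be illustrated with a toy weight function. The formula below is purely an assumption made for this sketch and is not taken from JOPtimizer: it takes the cycles saved by removing the invocations, subtracts an estimated extra cache cost, and divides by the number of call sites. All names (`InlineGainSketch`, `Candidate`, the field names) are hypothetical.

```java
import java.util.List;

class InlineGainSketch {
    static final class Candidate {
        final String name;
        final double invocations;        // estimated invocation frequency
        final double cyclesSavedPerCall; // invocation overhead removed by inlining
        final double extraCacheCycles;   // estimated cost of the larger caller
        final int callSites;             // call sites of the callee in the application

        Candidate(String name, double invocations, double cyclesSavedPerCall,
                  double extraCacheCycles, int callSites) {
            this.name = name;
            this.invocations = invocations;
            this.cyclesSavedPerCall = cyclesSavedPerCall;
            this.extraCacheCycles = extraCacheCycles;
            this.callSites = callSites;
        }

        /** Illustrative weight only; callees with many call sites are weighted down. */
        double weight() {
            return (invocations * cyclesSavedPerCall - extraCacheCycles) / callSites;
        }
    }

    /** Returns the candidate with the highest positive weight, or null if there is none. */
    static Candidate best(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate c : candidates) {
            if (c.weight() > 0 && (best == null || c.weight() > best.weight())) {
                best = c;
            }
        }
        return best;
    }
}
```

A seldom-invoked callee gets a negative weight under this formula and is never selected, which mirrors the possible performance degradation noted above.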

20

Benchmark Results

■ Inlining of stackcode, JBE @ 60 MHz

■ Inlining limited by the maximal code size imposed by JOP's memory cache

■ Removal of unused code should be implemented

                 ProGuard   Soot (-O)   Soot (-w -W)   JOPtimizer
Bytecode (%)     -26,2%     -2,4%       6,3%           12,4%
Ext. RAM (%)     -26,1%     -1,2%       3,5%           6,9%
Exec Time (s)    3          62          61             5,5
Sieve (%)        1,1%       -2,7%       -2,8%          0,0%
Kfl (%)          0,1%       0,0%        11,2%          13,3%
UdpIp (%)        2,1%       3,8%        15,7%          8,9%
Lift (%)         1,0%       -5,3%       7,8%           14,0%

21

Inlining Improvements

■ Many improvements possible

Type analysis/callgraph thinning for better devirtualization

Better cache state and invocation frequency estimation (WCET-driven?)

Run optimizations to reduce code size prior to inlining

Allow inlining of synchronized code/exception handlers

Try to find invocations with highest gain application-wide

...

22

Summary

■ Optimizing code at runtime is not feasible for (real-time) embedded systems

■ Existing open source tools are not designed for embedded systems

■ Inlining implemented in JOPtimizer takes the target platform into account (code restrictions, caching, ...); up to 14% speedup in the JBE benchmark

■ Load/store elimination and local variable allocation are needed before further optimizations can be implemented

■ Still many improvements possible ...

23

Q&A

Thanks for your attention!

Questions?

24

Transformations