Stack Based Allocation in the Azul JVM (cliffc.org/blog/wp-content/uploads/2017/12/2004_SBA.pdf)
www.azulsystems.com
©2005 Azul Systems, Inc.
Stack Based Allocation in the Azul JVM™
Dr. Cliff [email protected]
Background
• The Azul JVM™ is based on Sun HotSpot
  ─ a State-of-the-Art Java VM
• Java is a GC'd language
• HotSpot uses a fast Generational GC
  ─ with “test and bump” allocation
• Test and bump allocation streams through memory
• Streaming memory is hard on caches
• Azul is looking at using Stack Based Allocation
Cost of using a Generational GC
• Typical allocation is pointer compare & bump
• Deallocation is “free”
  ─ Requires scanning the live set
  ─ Live set is much smaller than what was allocated
• Allocation streams through young generation
  ─ Similar to streaming array access in Fortran
• Young generation >> L1 cache size
  ─ Implies caches are flushed repeatedly
• Cache miss cost is spread out
  ─ Hard to spot
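The "pointer compare & bump" step can be sketched as a small model. This is an illustrative Python sketch, not HotSpot's code; the class name `BumpAllocator` and its fields are hypothetical, and addresses are plain integers.

```python
class BumpAllocator:
    """Hands out consecutive addresses from one contiguous region."""

    def __init__(self, base, size):
        self.top = base          # next free address (the allocation point)
        self.end = base + size   # first address past the region

    def allocate(self, nbytes):
        new_top = self.top + nbytes
        if new_top > self.end:   # the "test": does the object fit?
            return None          # a real VM would run a collection here
        addr = self.top
        self.top = new_top       # the "bump"
        return addr


young = BumpAllocator(base=0, size=64)
first = young.allocate(16)
second = young.allocate(16)
too_big = young.allocate(40)     # does not fit in the remaining 32 bytes
```

Every successful call returns a fresh, higher address, which is why allocation streams through the young generation rather than reusing recently touched memory.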
Cost of using a Generational GC
• Allocation streams through young generation
• Cache miss for each new object
  ─ Can cover with prefetch
• Read data is dead
  ─ Object must be initialized
  ─ Can prevent read with “cache-line-zero” ops
• Must evict for each new object
  ─ Evicted data is mostly dead
  ─ Occasional live object
• Result: wasted read AND wasted write
Generational GC
[Figure sequence: memory usage over time as the allocation point advances through the heap, shown against the L1 cache. Regions: fresh heap (hot, cached memory); free heap (cold, uncached memory); old, cold heap (mostly dead). Annotations: reading dead memory, filling cache lines; writing mostly dead cache lines.]
Stack Based Allocation
• Objects are allocated on the stack
  ─ Stack space reserved on function entry
• Deallocate is “free” on function return
• Has similar direct costs to Generational GC
  ─ Orthogonal to Generational GC
• Recycles memory instead of streams memory
  ─ Has a smaller cache footprint
  ─ Fewer misses, stalls
• Requires knowledge about object lifetime
  ─ Error if object “escapes” function return
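The "recycles instead of streams" point can be illustrated by counting the distinct cache lines each pattern touches. This is a rough sketch under assumed numbers (64-byte lines, 32-byte objects, four objects per call); the function names are hypothetical and no real cache is modelled.

```python
LINE = 64  # assumed cache line size in bytes


def lines_touched(addresses, size):
    """Count distinct cache lines covered by objects of `size` bytes."""
    lines = set()
    for addr in addresses:
        lines.update(range(addr // LINE, (addr + size - 1) // LINE + 1))
    return len(lines)


def streaming_addresses(n, size):
    """Bump allocation: every object gets a fresh, higher address."""
    return [i * size for i in range(n)]


def recycling_addresses(n, size, per_call):
    """Stack allocation: each call reuses the same frame addresses."""
    return [(i % per_call) * size for i in range(n)]


streamed = lines_touched(streaming_addresses(1000, 32), 32)
recycled = lines_touched(recycling_addresses(1000, 32, per_call=4), 32)
```

In this toy setup the streaming pattern touches hundreds of lines while the recycling pattern keeps returning to the same two, which is the smaller cache footprint the slide refers to.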
Stack Based Allocation
[Figure sequence: the stack (Sam's, Ted's and Uli's frames, free stack, SP and allocation point) and the L1 cache contents while the code below runs.]
• Call Sam
• Allocate object a
• Allocate object b
• Call Ted
• Allocate object c
• Exit Ted, free c
• Call Uli
• Allocate object d
• Exit Uli, return d

Sam { a = new; b = new; Ted(); bad = Uli(); bad.crash; }
Ted { c = new; return; }
Uli { d = new; return d; }
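The Sam/Ted/Uli walk-through can be replayed with a toy stack model. This is only a sketch: the `Stack` class and its methods are hypothetical, objects are represented by their names, and a "reference" is just a slot index.

```python
class Stack:
    """Toy model: objects live in slots, frames remember where they start."""

    def __init__(self):
        self.slots = []    # stack-allocated objects, oldest first
        self.frames = []   # saved allocation point for each active call

    def enter(self):
        self.frames.append(len(self.slots))

    def new(self, name):
        self.slots.append(name)
        return len(self.slots) - 1   # "reference" = slot index

    def exit(self):
        # Returning frees every object the frame allocated, at no extra cost.
        del self.slots[self.frames.pop():]

    def is_live(self, ref):
        return ref < len(self.slots)


def ted(stack):
    stack.enter()
    stack.new("c")
    stack.exit()                     # c is freed here


def uli(stack):
    stack.enter()
    d = stack.new("d")
    stack.exit()                     # d is freed here too...
    return d                         # ...but its reference escapes


def sam(stack):
    stack.enter()
    stack.new("a")
    stack.new("b")
    ted(stack)
    after_ted = list(stack.slots)
    bad = uli(stack)
    dangling = not stack.is_live(bad)
    stack.exit()
    return after_ted, dangling


after_ted, dangling = sam(Stack())
```

After Ted returns, only a and b remain, so freeing c was safe. The reference returned by Uli points at a slot that no longer exists, which is the "bad.crash" case in the slides.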
Stack Based Allocation
• Fast allocate and free
• Better cache behavior
• Only works for objects with stack lifetime
• How do we know object lifetime?
• Escape Analysis
  ─ Precise, conservative
  ─ Expensive at compile time
• Escape Detection
  ─ Imprecise, optimistic
  ─ Expensive at run time (requires write barrier)
Escape Detection
• Escape Analysis
  ─ Conservatively correct
  ─ Objects known to never escape
  ─ Allows objects to be en-registered
  ─ Fairly expensive (for a JIT) except in trivial cases
• Escape Detection
  ─ Optimistic; some objects may escape
  ─ Requires a complex write barrier and fix-up
• Write barrier can be done in hardware
• Requires minor hardware change
  ─ Azul is implementing a Java™ VM with custom hardware
Escape Detection Hardware
• Allocate objects in stack frames
• Steal some address bits for frame depth (FID)
• FID bits stored in object pointer
• Hardware ignores FID on a load (getfield)
• Hardware does range check on a store (putfield)
  ─ Storing old FID into new FID object is OK
  ─ Storing new FID into old FID object TRAPS
• Trap requires expensive fixup
  ─ Crawl only new frames on this thread's stack
  ─ Move object, adjust FIDs
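The store check can be modelled in software to make the rule concrete. This sketch is not the Azul hardware: the field widths, the choice that a larger FID means a newer (deeper) frame, and the use of FID 0 for heap objects are all assumptions, and every name is hypothetical.

```python
FID_BITS = 8                      # assumed width of the stolen field
ADDR_BITS = 56                    # assumed width of the remaining address
ADDR_MASK = (1 << ADDR_BITS) - 1


class EscapeTrap(Exception):
    """Raised where the hardware would trap to the fixup code."""


def make_ref(addr, fid):
    """Pack a frame depth into the upper bits of a pointer."""
    assert 0 <= fid < (1 << FID_BITS) and 0 <= addr <= ADDR_MASK
    return (fid << ADDR_BITS) | addr


def fid_of(ref):
    return ref >> ADDR_BITS


def load_address(ref):
    """Loads ignore the FID bits and use only the address."""
    return ref & ADDR_MASK


def store_check(target_ref, value_ref):
    """Range check on putfield: target.field = value."""
    if fid_of(value_ref) > fid_of(target_ref):
        # A newer-frame object is about to become reachable from an
        # older one, so it could outlive its frame.
        raise EscapeTrap("newer FID stored into older FID object")


heap_obj = make_ref(0x1000, fid=0)   # assumed: heap objects carry FID 0
a = make_ref(0x2000, fid=1)          # allocated in Sam's frame
c = make_ref(0x3000, fid=2)          # allocated in Ted's frame

store_check(c, a)                    # old into new: OK
```

Storing `c` into `a`, or `a` into `heap_obj`, raises `EscapeTrap` in this model, which corresponds to the point where the expensive fixup would run.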
Fixup is Expensive
• Fixup must be rare
  ─ Object factories always escape object!
• Tag allocation sites with relative allocation depth
• Allocate object in frames with longer lifetimes
  ─ Or directly in heap if needed
• Start small
• Adjust tag upwards with each failure
• Rapidly converges in practice!
• No expensive analysis
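The "start small, adjust upwards" policy can be sketched as follows. The slides do not say how far a tag moves per failure or when a site falls back to the heap, so the step of one frame and the `max_depth` cut-off are assumptions, as are all names.

```python
class AllocationSite:
    """One allocation site with its relative depth tag."""

    def __init__(self, max_depth=4):
        self.tag = 0                 # 0 = allocate in the current frame
        self.max_depth = max_depth   # assumed limit before using the heap
        self.fixups = 0

    def target(self):
        """Where the next object from this site is allocated."""
        return "heap" if self.tag > self.max_depth else self.tag

    def record_escape(self):
        """Called after a trap and fixup: be less optimistic next time."""
        self.fixups += 1
        self.tag += 1


def run(site, escape_distance, allocations):
    """Objects from this site always escape `escape_distance` frames up."""
    for _ in range(allocations):
        where = site.target()
        if where != "heap" and where < escape_distance:
            site.record_escape()


factory = AllocationSite()
run(factory, escape_distance=1, allocations=1000)   # object factory

long_lived = AllocationSite()
run(long_lived, escape_distance=10, allocations=1000)
```

In this model the factory site pays for one fixup and then allocates in its caller's frame, while the long-lived site gives up after a handful of failures and goes straight to the heap.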
How Big are Frames?
• Check for frame overflow on allocation
  ─ Same as test-and-bump allocation
• If it fits, fine. If not...
• Allocate an overflow area on the side
  ─ Stamp same FID into objects
  ─ Hardware checks FID, not actual stack address
• Adjust call prolog code
  ─ Make a larger frame for next time
• Rapidly converges in practice!
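The overflow path can be sketched with a frame that spills to a side area and a method that remembers a larger size. The slides do not say when the prolog is adjusted; resizing at frame exit is an assumption here, and `Method` and `Frame` are hypothetical names.

```python
class Method:
    """Holds the frame size its call prolog currently reserves."""

    def __init__(self, frame_bytes):
        self.frame_bytes = frame_bytes


class Frame:
    def __init__(self, method, fid):
        self.method = method
        self.fid = fid
        self.capacity = method.frame_bytes
        self.used = 0
        self.overflow_used = 0

    def allocate(self, nbytes):
        """Return (where, fid) for the new object."""
        if self.used + nbytes <= self.capacity:   # same test-and-bump check
            self.used += nbytes
            return ("frame", self.fid)
        # Does not fit: place it in the side area, stamped with the same
        # FID so the store check treats it like any object of this frame.
        self.overflow_used += nbytes
        return ("overflow", self.fid)

    def exit(self):
        needed = self.used + self.overflow_used
        if needed > self.method.frame_bytes:
            self.method.frame_bytes = needed      # larger frame next time


m = Method(frame_bytes=32)
f = Frame(m, fid=1)
placements = [f.allocate(24), f.allocate(24)]
f.exit()
```

A second call of the same method then reserves enough space for both objects, which is one way to read "rapidly converges in practice".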
Never-Exit Frames
• Some top-level frames never exit
  ─ Also loops, and JIT may inline
• But will accumulate dead objects
  ─ Objects only freed on a frame exit!
• Periodically need a thread-local GC
  ─ Toss out dead objects
  ─ Compact overflow areas
  ─ Can adjust frame sizes
Inlining helps Site Tagging
• Allocation site behavior may depend on call path
• Same site may want different depth tags depending on caller
• Hot code gets compiled, with inlining
• Inlining clones allocation sites
• Each inlined clone gets its own depth tag
Reverse: Tags Guide Inlining
• Depth tags tell empirical lifetime
  ─ Of objects allocated here
• Likely can prove actual = empirical with Escape Analysis
• Inline up to tagged depth
• Run Escape Analysis on inlined code
  ─ E.A. is cheap in a single compilation unit
  ─ E.A., if successful, is better than Detection
  ─ Can en-register whole objects
• Avoid E.A. if allocation site escapes
Preliminary Results
• Modest sized runs of SpecJVM98, SpecJBB, SJAS
• About ½ of all objects stack allocated
• With 4 MB overflow areas
  ─ Typical stack GC needed every 6-8 sec
  ─ Stack GC takes <1 msec
• Some anomalies observed
• Lots of stack-allocated 2K and 4K arrays
  ─ May be faster to heap allocate
  ─ Because of Azul fast zeroing hardware
  ─ Frames forced too large too quickly
• JIT inlines allocation into loops
Software Escape Barrier
• Azul has precise per-frame E.D. hardware
• Possible to make a software E.D. barrier
• Compare raw stack addresses & trap
  ─ Requires all stacks be on same side of heap
• Store FID just prior to object header in stack
• Trap checks FID before declaring an escape
  ─ load, load, compare, branch
• Trap can fail repeatedly for non-escaping
  ─ Two objects in same frame but wrong order
  ─ No good overflow solution
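The two-stage software barrier can be sketched as below. The layout is assumed, not taken from the slides: all stacks sit below the heap and newer objects have lower addresses, so a lower address stands in for "newer"; a dictionary stands in for the FID word stored before each object header; heap objects default to FID 0. All names are hypothetical.

```python
def software_store_barrier(fid_at, target_addr, value_addr):
    """Classify the store target.field = value.

    Returns "fast" when the raw address compare passes, "escape" when the
    slow path confirms a newer-frame object is stored into an older one,
    and "false alarm" when the slow path finds no escape.
    """
    if value_addr >= target_addr:
        return "fast"                       # value is at least as old
    # Slow path: load both FIDs, compare, branch.
    value_fid = fid_at.get(value_addr, 0)   # assumed: heap objects are FID 0
    target_fid = fid_at.get(target_addr, 0)
    if value_fid > target_fid:
        return "escape"
    return "false alarm"


# Two objects in frame 1 and one in the newer frame 2.
fid_at = {900: 1, 880: 1, 700: 2}

same_frame_wrong_order = software_store_barrier(fid_at, 900, 880)
same_frame_right_order = software_store_barrier(fid_at, 880, 900)
real_escape = software_store_barrier(fid_at, 900, 700)
```

The first store takes the slow path on every execution even though nothing escapes, which is the "same frame but wrong order" cost listed above.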
Summary
• First GC-specific hardware in a long time
• Looks good on paper
• Looks good in preliminary results
• Real Thing is a little ways off yet
  ─ Needs tuning
  ─ Performance-proof in large runs