Effective Software - cvut.cz · 2018. 2. 25.
David Šišlák
Effective Software
Lecture 2: Virtual machine, JVM, bytecode, (de-)compilers, disassembler, profiling
26th February 2018, ESW – Lecture 2

Introduction – Virtual Machine
» Virtual machine model (.NET, JVM – Scala, Jython, JRuby, Clojure, …)
  • source code
  • compiled into VM bytecode
  • hybrid run-time environment (platform-dependent VM implementation)
    – interpreted bytecode
    – compiled assembly code (native CPU code)
    – automated platform capability optimizations (e.g. use of SIMD)
» comparison of bytecode to assembly code
  • (+) platform independence (portable) – architecture (RISC/CISC, bits), OS
  • (+) reflection – observe and modify own structure at run-time
  • (+) small size
  • (-) slower execution – interpreted mode, compilation latencies
  • (-) less control over assembly code – fewer options for custom optimization

JAVA Virtual Machine – Memory Layout
[Figure: JVM memory layout; memory areas are marked as either thread specific or shared by many threads]

JAVA Virtual Machine – Stack-oriented Machine
» stack-oriented – stack machine model for passing parameters to instructions and receiving their output
  • example expression: (2+3)×11+1
» JVM bytecode – sequence of instructions, each composed of
  • opcode – operation code, what should be done
  • opcode-specific parameters – some opcodes have none, some have multiple
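The stack model can be illustrated with a minimal sketch that evaluates the slide's expression (2+3)×11+1. The instruction set below (PUSH, ADD, MUL) and all class and method names are hypothetical and only loosely modeled on JVM integer opcodes; it is not how the JVM interpreter is implemented.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class StackMachineSketch {
    // Hypothetical mini instruction set, loosely modeled on JVM int opcodes.
    enum Op { PUSH, ADD, MUL }

    // One instruction = opcode + optional opcode-specific parameter.
    static final class Insn {
        final Op op;
        final int arg; // only used by PUSH
        Insn(Op op, int arg) { this.op = op; this.arg = arg; }
    }

    // Executes the program; operands and results travel through the operand stack only.
    static int run(Insn[] program) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (Insn insn : program) {
            switch (insn.op) {
                case PUSH: stack.push(insn.arg); break;
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
            }
        }
        return stack.pop();
    }

    // (2+3)×11+1 written as a stack program.
    static int example() {
        Insn[] program = {
            new Insn(Op.PUSH, 2), new Insn(Op.PUSH, 3), new Insn(Op.ADD, 0),
            new Insn(Op.PUSH, 11), new Insn(Op.MUL, 0),
            new Insn(Op.PUSH, 1), new Insn(Op.ADD, 0)
        };
        return run(program);
    }

    public static void main(String[] args) {
        System.out.println(example()); // prints 56
    }
}
```

Note that ADD and MUL carry no parameters at all: their inputs are implied by the stack, which is what keeps the instruction encoding small.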

JAVA Virtual Machine – Frame
» frame
  • each thread has a stack with frames (outside of heap, fixed length) – StackOverflowError vs. OutOfMemoryError
  • a frame is created each time a method is invoked (destroyed after return)
    – interpreted frame per exactly one method
    – compiled frame includes all in-lined methods
  • frame size determined at compile-time (in class file for interpreted)
» variables (any type)
  • {this} – instance call only!
  • {method parameters}
  • {local variables}
» operand stack (any type)
  • LIFO
» reference to run-time constant pool (class def)
  • method + class is associated
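As a rough illustration of frames living on a fixed-length thread stack, the sketch below recurses until the stack is exhausted and reports how many frames fit. The class and method names are made up for this example, and the measured depth depends on the JVM, the stack size (for instance the -Xss option) and whether the method is interpreted or compiled.

```java
public class FrameDepthSketch {
    static int depth;

    // Each invocation creates a new frame on the calling thread's stack.
    static void recurse() {
        depth++;
        recurse();
    }

    // Returns how many frames were created before the thread stack was exhausted.
    static int measureDepth() {
        depth = 0;
        try {
            recurse();
        } catch (StackOverflowError e) {
            // The thread stack ran out of space; the heap is not involved,
            // so this is not an OutOfMemoryError.
        }
        return depth;
    }

    public static void main(String[] args) {
        System.out.println("frames before StackOverflowError: " + measureDepth());
    }
}
```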

JAVA Virtual Machine – Opcodes
» JVM opcode (always 1 byte only):
  • load and store (aload_0, istore, aconst_null, …)
  • arithmetic and logic (ladd, fcmpl, …)
  • type conversion (i2b, d2i, …)
  • object manipulation (new, putfield, getfield, …)
  • stack management (swap, dup2, …)
  • control transfer (ifeq, goto, …)
  • method invocation (invokespecial, areturn, …) – frame manipulation
  • exceptions and monitor concurrency (athrow, monitorenter, …)
» prefix/suffix – i, l, s, b, c, f, d and a (reference)
» variables as registers – e.g. istore_1 (variable 0 is this for instance method)
[Figure: CPU assembly-code vs. JVM bytecode]
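To make the type prefixes and the "variables as registers" convention concrete, here is a small sketch with the bytecode that javac typically emits shown in comments. The class is hypothetical; the exact output can be checked with javap -c.

```java
public class OpcodeSketch {
    // Static method: the parameters occupy local variables 0 and 1.
    // javac typically emits:
    //   iload_0   // push int from local variable 0 (a)
    //   iload_1   // push int from local variable 1 (b)
    //   iadd      // pop two ints, push their sum
    //   ireturn   // pop the int and return it to the caller
    static int add(int a, int b) {
        return a + b;
    }

    // Instance method: local variable 0 holds `this`, so x is variable 1.
    // javac typically emits:
    //   iload_1   // push int from local variable 1 (x)
    //   i2l       // type conversion int -> long
    //   lreturn   // pop the long and return it to the caller
    long widen(int x) {
        return x;
    }

    public static void main(String[] args) {
        System.out.println(add(2, 3));
        System.out.println(new OpcodeSketch().widen(7));
    }
}
```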

JAVA Virtual Machine
» JVM is also used to implement languages other than JAVA
  • Erlang -> Erjang
  • JavaScript -> Rhino
  • Python -> Jython
  • Ruby -> JRuby
  • Scala, Clojure – functional programming
  • others
» bytecode is verified before it is executed:
  • branches (jumps) are always to valid locations – only within the method
  • any instruction operates on a fixed stack location (helps JIT with register mapping)
  • data is always initialized and references are always type-safe
  • access to private and package members is controlled

JAVA Virtual Machine – Object Oriented Language
» Class file – product of source code compilation
  • one per each class
  • method bytecode is included

JAVA Virtual Machine – Example 1 – Source Code

JAVA Virtual Machine – Example 1 – Class File Content

JAVA Virtual Machine – Example 1 – Disassembled Constants
» javap – JAVA disassembler included in JDK

JAVA Virtual Machine – Example 1 – Disassembled Fields

JAVA Virtual Machine – Example 1 – Disassembled Method
» getfield
  • takes 1 ref from the stack
  • its operand is an index into the run-time constant pool of the class; the field it names is read from the instance referenced by the popped ref (here this) and pushed onto the stack
» areturn
  • takes 1 ref from the stack
  • pushes it onto the operand stack of the calling method
[Figure annotations: opcode; offset in bytecode for the method employeeData]
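The getfield / areturn pair can be reproduced with a small stand-in class. The lecture's original Example 1 listing is not preserved here, so the class name Employee and the field name below are assumptions; only the method name employeeData comes from the slide. The disassembly in the comment is what `javap -c` typically prints for such a getter, with the constant pool index left as a placeholder.

```java
public class Employee {
    private final String name; // hypothetical field, not taken from the lecture

    public Employee(String name) {
        this.name = name;
    }

    // javap -c typically shows for this method (offset: opcode):
    //   0: aload_0              // push `this` from local variable 0
    //   1: getfield #<index>    // pop the ref, push the value of the field
    //   4: areturn              // pop the ref, push it onto the caller's operand stack
    // getfield occupies offsets 1-3: 1 byte opcode + 2 bytes constant pool index.
    public String employeeData() {
        return name;
    }

    public static void main(String[] args) {
        System.out.println(new Employee("Alice").employeeData());
    }
}
```

Compiling with javac and running `javap -c -p Employee` shows the method bytecode; adding `-v` also prints the constant pool.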

JAVA Virtual Machine – Example 1 – Disassembled Constructor

JAVA Virtual Machine – Example 1 – Decompiler
» procyon – open-source JAVA decompiler
[Figure: original source code vs. de-compiled source code]

JAVA Virtual Machine – Example 2 – Source Code

JAVA Virtual Machine – Example 2 – daysInMonth Bytecode

JAVA Virtual Machine – Example 2 – compute Bytecode
No optimization during source code compilation!

JAVA Virtual Machine – Source Code Compilation
» source code compilation (source code => bytecode)
  • bytecode is not better than your source code
  • invariants in loops are not removed
  • no optimizations like
    – loop unrolling
    – algebraic simplification
    – strength reduction
» optionally, bytecode can be modified before execution by the JVM
  • e.g. ProGuard – obfuscator including bytecode optimizations
    – shrinker – compact code, remove dead code
    – optimizer
      • modify access pattern (private, static, final)
      • inline bytecode
    – obfuscator – renaming, layout changes
    – preverifier – ensure class loading
  • test yourself: compute method is simplified – faster interpretation – better JIT output
  • obfuscation = make code difficult for humans to understand while keeping the same functionality
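Since javac leaves loop invariants in place, hoisting them is left to the JIT or to the programmer. The following sketch (hypothetical class and method names) shows the same computation with the invariant left in the loop and hoisted by hand; both return the same result, only the bytecode executed per iteration differs.

```java
public class HoistingSketch {
    // The invariant scale * scale stays inside the loop body in the bytecode:
    // javac does not move it out, so the interpreter recomputes it every iteration.
    static long sumNaive(int[] data, int scale) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += (long) data[i] * (scale * scale);
        }
        return sum;
    }

    // Manually hoisted variant: the invariant is computed once before the loop.
    static long sumHoisted(int[] data, int scale) {
        final int factor = scale * scale;
        long sum = 0;
        for (int value : data) {
            sum += (long) value * factor;
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3};
        System.out.println(sumNaive(data, 4));   // prints 96
        System.out.println(sumHoisted(data, 4)); // prints 96
    }
}
```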

JAVA Virtual Machine – Bytecode Compilation in run-time
» Just-in-time (JIT)
  • converts bytecode into assembly code at run-time
  • check OpenJDK sources for very detailed information: http://openjdk.java.net
» JIT includes adaptive optimization (adaptive tiered compilation since version 7)
  • balances the trade-off between JIT compilation and interpreting instructions
  • monitors frequently executed parts ("hot spots"), including data on the caller-callee relationship for virtual method invocation
  • triggers dynamic re-compilation based on the current execution profile
  • inline expansion to remove context switching
  • optimize branches
  • can make risky assumptions (e.g. skip code) ->
    – unwind to valid state
    – deoptimize previously JITed code even if the code is already being executed
» Ahead-of-Time Compilation (AOT) – remove warm-up phase
  • compile into assembly code prior to launching the virtual machine

JAVA Virtual Machine – JIT Compilation
» Just-in-time (JIT) compilers – asynchronous (3 C1, 7 C2 threads for 32 cores)
  • C1 compiler – compiles much faster than C2
    – simplified inlining, using CPU registers
    – window-based optimization over a small set of instructions
    – intrinsic functions with vector operations (Math, arraycopy, …)
  • C2 compiler – high-end fully optimizing compiler
    – dead code elimination, loop unrolling, loop invariant hoisting, common sub-expression elimination, constant propagation
    – full inlining, full deoptimization (back to level 0)
    – escape analysis, null check elimination
    – pattern-based loop vectorization and super word packing (SIMD)
» JIT compilation tiers
» on-stack replacement (OSR) – optimization during execution of a method
  • starts at bytecode jump targets (goto, if_)
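A minimal way to provoke OSR is a method that is invoked only once but contains a long-running loop, so the back edge of the loop gets hot while the invocation counter does not. The class below is a hypothetical sketch for experimenting; whether and when OSR actually happens depends on the JVM version and its thresholds.

```java
public class OsrSketch {
    // Invoked a single time, so only the loop back edge can become hot.
    // The JVM may then compile the method and switch to the compiled code
    // in the middle of the running loop (on-stack replacement).
    static long sumTo(int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumTo(100_000_000));
    }
}
```

Running it with `java -XX:+PrintCompilation OsrSketch` should list `OsrSketch::sumTo` with an OSR marker (a `%` flag and an `@` bytecode index) on HotSpot.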

Assembly Code
» reasons to study assembly code (both Java and C/C++)
  • educational reasons
    – predict efficient coding techniques
  • debugging and verification
    – how good the generated code looks
  • optimize code
    – for speed
      • avoid poorly compiled patterns
      • data fits into cache
      • predictable branches or no branches
      • use vector programming if possible (SIMD)
        » 256-bit registers with AVX since Intel Sandy Bridge (AVX2 since Haswell)
        » 512-bit AVX-512 since Intel Knights Landing (Xeon Phi)
    – for size
      • primarily code cache efficiency
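The "predictable branches or no branches" point can be sketched by counting array elements at or above a threshold in two ways. The names are hypothetical, and the branch-free form is only valid while x - threshold does not overflow. The JIT may already turn the branchy version into a conditional move, so any speed difference has to be measured and checked in the generated assembly rather than assumed.

```java
public class BranchlessSketch {
    // Branchy version: one conditional jump per element, costly when mispredicted.
    static int countAtLeastBranchy(int[] data, int threshold) {
        int count = 0;
        for (int x : data) {
            if (x >= threshold) {
                count++;
            }
        }
        return count;
    }

    // Branch-free version: (x - threshold) >> 31 is -1 when x < threshold and 0 otherwise,
    // so the added term is 0 or 1. Assumes x - threshold does not overflow.
    static int countAtLeastBranchless(int[] data, int threshold) {
        int count = 0;
        for (int x : data) {
            count += 1 + ((x - threshold) >> 31);
        }
        return count;
    }

    public static void main(String[] args) {
        int[] data = {1, 5, 9, 3, 7};
        System.out.println(countAtLeastBranchy(data, 5));    // prints 3
        System.out.println(countAtLeastBranchless(data, 5)); // prints 3
    }
}
```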

JAVA Virtual Machine – Example 2 – Tiered Compilation
» -XX:+PrintCompilation (-XX:+PrintInlining)
  • output format: {millis from start} {compilation_task_id} {flags} {tier} {class:method} (bytecode size) @OSR {made not entrant / zombie}
  • notice the standard compilation path 0 -> 3 -> 4
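The original Example 2 listing is not preserved in this transcript. The sketch below is a hypothetical stand-in written only from the hints on the slides (a daysInMonth(month, year) method returning an Integer object, an IllegalArgumentException in the default case, and a compute loop of 1_000_000 iterations summing to 30_000_000); it should not be read as the lecture's actual code.

```java
public class TieredSketch {
    // Hypothetical stand-in for the lecture's daysInMonth; returns a boxed Integer.
    static Integer daysInMonth(int month, int year) {
        switch (month) {
            case 1: case 3: case 5: case 7: case 8: case 10: case 12:
                return 31;
            case 4: case 6: case 9: case 11:
                return 30;
            case 2:
                boolean leap = (year % 4 == 0 && year % 100 != 0) || year % 400 == 0;
                return leap ? 29 : 28;
            default:
                throw new IllegalArgumentException("invalid month: " + month);
        }
    }

    // Hot loop: one million calls for April, accumulated into a primitive int.
    static int compute() {
        int sum = 0;
        for (int i = 0; i < 1_000_000; i++) {
            sum += daysInMonth(4, 2018);
        }
        return sum;
    }

    public static void main(String[] args) {
        int result = 0;
        // Repeated rounds give the invocation counters of compute a chance to grow as well.
        for (int round = 0; round < 20; round++) {
            result = compute();
        }
        System.out.println(result); // prints 30000000
    }
}
```

Running `java -XX:+PrintCompilation TieredSketch` prints one line per compilation event, which makes it possible to follow a method through the tiers.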

JVM – Example 2 – daysInMonth Assembly Code – Tier 3
» -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly
» all examples are in JVM 8 64-bit, Intel Haswell CPU, AT&T syntax
tier 3 – C1 with invocation & backedge counters + MethodDataOop counter
because: count="256" iicount="256" hot_count="256"
stack initialization, invocation counter in MDO (0xDC) + trigger C2 (tier 4)
  • (0x1ff8 >> 3) + 1 = 1024 invocations trigger tier 4 (C2)
  • month, year
  • stack banging technique, StackOverflowError
  • stack allocation, saving registers
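The arithmetic behind the 0x1ff8 mask can be checked with a small simulation. It assumes that the compiled code adds 8 to the counter on every invocation (the count sits above three low bits) and calls into the runtime whenever counter & 0x1ff8 is zero; that reading is an assumption, since the assembly listing itself is not preserved here. All names are hypothetical.

```java
public class InvocationCounterSketch {
    static final int INCREMENT = 8;  // assumption: the count is stored shifted left by 3 bits
    static final int MASK = 0x1ff8;  // mask value quoted on the slide

    // Counts how often the masked counter wraps to zero during the given number of invocations.
    static int notifications(int invocations) {
        int counter = 0;
        int hits = 0;
        for (int i = 0; i < invocations; i++) {
            counter += INCREMENT;
            if ((counter & MASK) == 0) {
                hits++; // here the compiled code would call into the runtime
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(MASK >> 3);           // prints 1023
        System.out.println(notifications(1023)); // prints 0
        System.out.println(notifications(1024)); // prints 1
    }
}
```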

JVM – Example 2 – daysInMonth Assembly Code – Tier 3
  • ESI is month input
  • default jump

JVM – Example 2 – daysInMonth Assembly Code – Tier 3
target for month=4, backedge counter tracking in MDO (0x290): jump target, inlined TLAB allocation of Integer object
  • no space in TLAB -> new TLAB + external allocation with header init; returns after the inlined allocation
  • EBX=30 is retVal
  • RAX – Integer instance address
  • 0x10 – Integer instance size
  • object initialization, header filled with prototype mark
Object structure (64-bit JVM):
  • header 12 or 16 Bytes
    – 8B – mark word
    – 4B / 8B – Klass ref.
  • object data – super class first, type grouped
Array object structure (64-bit JVM):
  • header 16 or 20 Bytes
    – 8B – mark word
    – 4B / 8B – Klass ref.
    – 4B – array length
  • sequence of array values
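The 0x10 instance size follows from the header layout: 8 B mark word + 4 B compressed Klass reference + 4 B int value, rounded up to the object alignment. The sketch below redoes that arithmetic; it assumes the default 8-byte object alignment and ignores field ordering and padding rules, so it is only a back-of-the-envelope check with hypothetical names.

```java
public class ObjectLayoutSketch {
    static final int MARK_WORD_BYTES = 8;
    static final int ALIGNMENT = 8; // assumption: default ObjectAlignmentInBytes

    // Rounds size up to the next multiple of alignment.
    static int alignUp(int size, int alignment) {
        return (size + alignment - 1) / alignment * alignment;
    }

    // Instance size for a given Klass reference width and total size of the fields.
    static int instanceSize(int klassRefBytes, int fieldBytes) {
        return alignUp(MARK_WORD_BYTES + klassRefBytes + fieldBytes, ALIGNMENT);
    }

    public static void main(String[] args) {
        System.out.println(instanceSize(4, 4)); // Integer, 4B Klass ref: prints 16 (0x10)
        System.out.println(instanceSize(8, 4)); // Integer, 8B Klass ref: prints 24
    }
}
```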

JVM – Example 2 – daysInMonth Assembly Code – Tier 3
inlined Integer constructor with supers, invocation counts in MDOs (0xDC): Integer::<init>, Number::<init>, Object::<init> – currently in tier 3 (C1 counters in MDO)
  • invocation cnt of Integer::<init> in daysInMonth for inline
  • invocation cnt in Integer::<init> + trigger its C2 (tier 4)
  • invocation cnt of Number::<init> in Integer::<init> for inline
  • invocation cnt in Number::<init> + trigger its C2 (tier 4)
  • invocation cnt of Object::<init> in Number::<init> for inline
  • invocation cnt in Object::<init> + trigger its C2 (tier 4)
  • RAX.value = EBX (retVal)

JVM – Example 2 – daysInMonth Assembly Code – Tier 3
final cleanup and return, RAX contains return value (pointer to Integer instance)
  • stack deallocation, reload register
  • safepoint poll check
» Ordinary Object Pointer (Oop) – flexible reference to an object
» safepoint – Oops in perfectly described state by OopMap (GC maps)
  • Oop can be safely manipulated externally while the thread is suspended
  • in interpreted mode – between any 2 bytecodes
  • in C1/C2 compiled – end of all methods (not in-lined), non-counted loop back edge, during JVM run-time call
  • parked, blocked on IO, monitor or lock
  • while running JNI (does not need thread suspension)
  • global safepoint (all threads) – stop the world
    – GC, print threads, thread dumps, heap dump, get all stack trace
    – enableBiasedLocking, RevokeBias
    – class redefinition (e.g. instrumentation), debug
  • local safepoint (just executing thread)
    – de-optimization, enable/revoke bias locking, OSR

JVM – Time To Safe Point
» Time To Safe Point (TTSP) – how long it takes to enter a safepoint
  • -XX:+PrintSafepointStatistics -XX:+PrintGCApplicationStoppedTime -XX:PrintSafepointStatisticsCount=1
[Figure: TTSP overhead in profiler while calling GetStackTrace, example with 5 threads]

JVM – Example 2 – daysInMonth Assembly Code – Tier 4
tier 4 – C2 compiler – no profile counters
because: count="5376" iicount="5376" hot_count="5376"
stack initialization, use lookup table jump for tableswitch
  • default (>=12)
  • month, year

JVM – Example 2 – daysInMonth Assembly Code – Tier 4
target for month=4
Integer.<init>, Number.<init>, Object.<init> – iicount="5376" -> Inline (hot)
optimized branching, inlined TLAB allocation, inlined constructors, no nulling, caching optimization
  • EBP=30 is retVal
  • TLAB Integer object allocation, ref in RAX
  • MarkWord fetch from class and then store compressed OOP to Integer class
  • RAX.value = EBX (retVal)
  • final cleanup
  • RAX contains return value (pointer to Integer instance)
  • cache optimization 3 cache lines ahead

JVM – Example 2 – daysInMonth Assembly Code – Tier 4
target for default
class IllegalArgumentException: no profile -> uncommon -> reinterpret
remap inputs, return back to the interpreter, then discard the tier 3 version

JVM – Example 2 – compute Assembly Code – Tier 4 OSR
OSR @10 – On Stack Replacement at bytecode 10
tier 4 – C2 (before there was tier 3 OSR @10 because of 60416 loops, and tier 3)
because: backedge_count="101376" hot_count="101376"
copy 4 locals on stack from tier 3 OSR @10 to regs
  • RSI – compiled stack of tier 3 OSR @10

JVM – Example 2 – compute Assembly Code – Tier 4 OSR
loop criteria, then inlined tier 4 daysInMonth (lookup jump) because the call is hot, ending with addition into the accumulator, or reinterpret on the end-of-cycle jump (unstable if_ bytecode), saving 3 locals to the stack
  • EBX is local i; 0xF4240 = 1_000_000

JVM – Example 2 – compute Assembly Code – Tier 4
tier 4 – C2
because: count="2" backedge_count="150528"
uses a combination of full inline, dead code elimination, object escape, loop invariant hoisting, strength reduction
  • 30_000_000
  • RAX contains return value (primitive int)

Java Virtual Machine – Performance
» requires warm-up to utilize benefits of C2 (or C1)
» compilers cannot do all the magic -> write better algorithms
» 32-bit vs. 64-bit JVMs
  • 32-bit (max ~3GB heap)
    – smaller memory footprint
    – slower long & double operations
  • 64-bit, max 32GB virtual memory (with default ObjectAlignmentInBytes)
    – faster performance for long & double
    – slight increase of memory footprint
    – compressed OOPs are slightly slower for references upon usage
    – compressed OOPs use less memory -> less frequent GC -> faster program
  • 64-bit, > 32GB virtual memory (large heap)
    – fast reference usage
    – wasting a lot of memory (48GB holds roughly what 32GB holds with compressed OOPs)
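The 32GB limit is plain arithmetic: a compressed reference has 32 bits and addresses memory in units of the object alignment, so the reachable heap is 2^32 × ObjectAlignmentInBytes. A minimal check, assuming the default alignment of 8 bytes (the class and method names are hypothetical):

```java
public class CompressedOopsSketch {
    // Largest heap addressable by a 32-bit compressed reference
    // that counts in units of the object alignment.
    static long maxHeapBytes(int objectAlignmentInBytes) {
        return (1L << 32) * objectAlignmentInBytes;
    }

    public static void main(String[] args) {
        long gib = 1L << 30;
        System.out.println(maxHeapBytes(8) / gib);  // default alignment: prints 32
        System.out.println(maxHeapBytes(16) / gib); // larger alignment: prints 64
    }
}
```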

Java Virtual Machine – CPU and Memory Profiling
» profiling
  • CPU – time spent in methods
  • memory – usage, allocations
» modes
  • sampling
    – periodic sampling of stacks of running threads to estimate the slowest parts
    – no invocation counts, no 100% accuracy (various sampling errors)
    – no bytecode (& assembly code) modifications
    – 1-2% impact on standard performance (TTSP, thread dumps, analysis)
  • tracing (instrumentation) – method entry, exit, traceObjAllocations
    – instrumented bytecode -> affected performance -> affected compiler optimizations
» jvisualvm
  • JVM monitoring, troubleshooting and profiling tool
  • included in all JDKs
  • profiled thread limit 32
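The sampling mode can be imitated in a few lines: a sampler periodically takes the stack trace of a running thread and counts which method is on top. This is only a toy sketch with hypothetical names, not how jvisualvm is implemented; it shows why sampling yields estimates instead of exact invocation counts.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

public class SamplingSketch {
    static volatile long sink; // keeps the busy loop's result observable

    // Workload to be profiled: spins until asked to stop.
    static void busyWork(AtomicBoolean running) {
        long x = 0;
        while (running.get()) {
            x++;
        }
        sink = x;
    }

    // Takes `samples` stack snapshots of a worker thread, one every `intervalMillis`,
    // and counts how often each method was found on top of the stack.
    static Map<String, Integer> sample(int samples, long intervalMillis) {
        Map<String, Integer> hits = new HashMap<>();
        AtomicBoolean running = new AtomicBoolean(true);
        Thread worker = new Thread(() -> busyWork(running), "worker");
        worker.start();
        try {
            for (int i = 0; i < samples; i++) {
                Thread.sleep(intervalMillis);
                StackTraceElement[] stack = worker.getStackTrace();
                if (stack.length > 0) {
                    String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hits.merge(top, 1, Integer::sum);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop sampling early
        } finally {
            running.set(false); // let the worker finish
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(sample(50, 10));
    }
}
```

Each stack snapshot requires the sampled thread to reach a safepoint, which is where the TTSP overhead mentioned in the lecture comes from.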

JVM – Example 2 – CPU Tracing of daysInMonth
assembly code of tier 4 – C2 (before there was very complex tier 3)
inlined daysInMonth rootMethodEntry tracking
  • 749 Bytes of assembly code for each rootMethodEntry

JVM – Example 2 – CPU Tracing of daysInMonth
additional rootMethodEntry and rootMethodExit trackings for Integer::<init> and Number::<init>
inlined rootMethodExit after Integer instance.value = retVal
  • 313 Bytes of assembly code for each rootMethodEntry

JVM – Example 2 – CPU Tracing Outcome

JVM – Example 2 – Profiling Performance
» CPU tracing of compute results in much slower code
  • no object escape from daysInMonth call
  • no invariant hoisting
  • no strength reduction (full loop remains there)
» object allocation is similar with traceObjAlloc injected calls
» recommended approach
  • do sampling first
  • identify performance bottlenecks (where most time is spent)
    – it could be outside of the JVM (e.g. latency of external DB, file system)
  • focus tracing just on the identified parts

JVM – Java Mission Control
» jmc – JRockit JVM, included in commercial JDKs, sampling in Flight Recorder

Approach to Performance Testing
» test real application – ideally the way it is used
  • microbenchmarks – measure very small units
    – warm-up – to measure real code, not the compilers themselves, biased locks
      • keep in mind caching
    – beware of compilers – use results, reordering of operations
    – synchronization – multi-threaded benchmarks
    – vary pre-calculated right parameters affecting complexity – different optimization in reality
  • macrobenchmarks – measure application input/output
    – least performing component affects the whole application
» understand throughput, elapsed and response time
  • outliers can occur – e.g. GC
  • use existing generators rather than writing your own
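The microbenchmark advice (warm up first, use the results, vary the parameters) can be sketched as follows. The workload and all names are hypothetical, and a hand-rolled loop like this remains fragile; an existing harness such as JMH is usually the safer choice for real measurements.

```java
public class MicrobenchmarkSketch {
    static volatile long sink; // results are stored so the JIT cannot discard the work

    // Small unit under test: sum of i % 7 for i = 1..n.
    static long workload(int n) {
        long sum = 0;
        for (int i = 1; i <= n; i++) {
            sum += i % 7;
        }
        return sum;
    }

    // Returns the average nanoseconds per call, measured only after a warm-up phase.
    static double measure(int warmupCalls, int measuredCalls, int n) {
        for (int i = 0; i < warmupCalls; i++) {
            sink = workload(n + (i & 1)); // warm-up: let the JIT compile the hot code
        }
        long start = System.nanoTime();
        for (int i = 0; i < measuredCalls; i++) {
            sink = workload(n + (i & 1)); // vary the parameter slightly between calls
        }
        long elapsed = System.nanoTime() - start;
        return (double) elapsed / measuredCalls;
    }

    public static void main(String[] args) {
        System.out.println("ns per call: " + measure(20_000, 20_000, 1_000));
    }
}
```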

Approach to Performance Testing
» understand variability – changes over time
  • internal state
  • background effects – load, network
  • probabilistic analysis – works with uncertainty
» test early, test often – ideally part of the development cycle
  • ideally some properly repeated mesobenchmarking
  • automate tests – scripted
  • proper test coverage of functionality and inputs
  • test on target system – different code on different systems