HAL Id: inria-00534748, https://hal.inria.fr/inria-00534748. [Research Report] 2010, pp.10. Submitted on 24 Feb 2011.

VMAD: a Virtual Machine for Advanced Dynamic Analysis of Programs

Alexandra Jimborean, Matthieu Herrmann, Vincent Loechner and Philippe Clauss
INRIA Nancy-Grand Est (CAMUS), LSIIT Lab., CNRS

University of Strasbourg, France
{alexandra.jimborean, vincent.loechner, philippe.clauss}@inria.fr, [email protected]

    Abstract

In this paper, we present a virtual machine, VMAD (Virtual Machine for Advanced Dynamic analysis), enabling an efficient implementation of advanced profiling and analysis of programs. VMAD is organized as a sequence of basic operations where external modules associated with specific profiling strategies are dynamically loaded when required. The program binary files handled by VMAD are previously instrumented at compile time to include the necessary data, instrumentation instructions and callbacks to the VM. Dynamic information, such as the memory locations of the loaded modules, is patched at startup in the binary file. The LLVM compiler has been extended to automatically instrument programs according to both VMAD and the handled profiling strategies. VMAD's potential is illustrated by presenting a profiling strategy dedicated to loop nests. It collects all memory addresses that are accessed during a selected number of successive iterations of each loop. The collected addresses are consumed by an analysis process trying to interpolate the addresses successively accessed through each memory reference as a linear function of some virtual loop indices. This profiling strategy has been run with VMAD on some of the SPEC2006 and Pointer-Intensive benchmark suites, showing a very low time overhead in most cases.

    1. Introduction

Runtime code analysis and optimization is becoming a main strategy to face the ever extending and changing variety of processor architectures and execution environments that an application can meet. Unlike static compilers, which have to take conservative decisions based on the restricted information available in the source code, runtime profilers and optimizers rely on information extracted at execution time. While today's processors provide more and more computing resources at the price of ever greater usage complexity – particularly with the advent of multi/many-core processors – efficient program optimizations require accurate and advanced runtime analyses. However, such analyses inevitably incur a time overhead that has to be minimized.

Many tools for the instrumentation of programs exist: Valgrind [10], DynamoRIO [3], PIN [9], Dyninst [4], PEBIL [6], Strata [5], but they have strong limitations in the kinds of instrumentations and analyses that can be implemented without incurring a huge runtime overhead. This is mainly because all these tools, except PEBIL, use software dynamic translation (SDT) to handle the insertion of instrumenting instructions, an approach that inevitably introduces runtime overhead. These tools are discussed further in the related work section.

In this paper, we present VMAD, a virtual machine (VM) handling x86-64 binary files specially tailored at compile time to include instructions and data related to the VM's operation. VMAD provides the following features:

• instrumentation management and analysis phases are separate modules that are loaded only if requested;

• these modules can be either generated offline or at runtime, thanks to a dedicated just-in-time compiler module;

• they can be activated or deactivated at any time during the target application's run;

• several instances of the same module can be run simultaneously while targeting different parts of the application code;

• instrumentations and analyses can include any kind of task, and their time and space scopes can be managed accurately;

• the VM is able to switch between several versions of the same code extract in order to avoid any overhead due to unnecessary instrumentation instruction runs.

For the binary code to be tailored for the VM, the compiler handles specific pragmas that have to be inserted in the source code to delimit the code regions of interest and to specify the required analysis. We have extended the LLVM compiler [7] to handle such pragmas. This extension allows the developer to initiate dynamic low-level analyses that focus on specific parts of the source code in order to explain runtime bottlenecks. To our knowledge, VMAD is the first proposal providing low-level instrumentation initiated from the source code with almost negligible runtime overhead.
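As an illustration of this source-level interface, the sketch below guards a pointer-chasing loop nest with such a pragma; the surrounding code is invented, but the pragma name instrument_mem_add is the one that appears in our experiments (figure 6):

#include <stddef.h>

struct node { int len; int *data; struct node *next; };

/* Hypothetical usage sketch: the pragma delimits the loop nest whose
   memory accesses should be profiled; the nest itself is invented. */
long walk(struct node *list_head)
{
    long sum = 0;
    #pragma instrument_mem_add
    {
        for (struct node *n = list_head; n != NULL; n = n->next)
            for (int k = 0; k < n->len; k++)
                sum += n->data[k];   /* accesses to be interpolated */
    }
    return sum;
}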

To show VMAD's potential, we present one advanced instrumentation process and one analysis process that have been implemented. The instrumentation, dedicated to loop nests, consists in collecting all memory addresses that are accessed during a selected number of successive iterations of each loop of the nest. This instrumentation is quite specific since it has to occur only during some non-contiguous phases of the loop nest execution, i.e., only when the memory accesses occur while the enclosing loops are all executing iterations that have been selected for instrumentation.

The analysis process makes use of the accessed addresses collected through the instrumentation. It tries to interpolate the addresses successively accessed through each memory reference as a linear function. For performance reasons, it runs while the instrumentation occurs, consuming each newly collected address and verifying that it fits a linear function of some virtual loop indices.

The implementation of such instrumentation and analysis processes involves extending the LLVM compiler to handle a new pragma, by developing a number of dedicated program transformation passes. It also involves developing the modules to be loaded by the VM: mostly defining the control and management of an instrumentation, or defining the computations of some behaviour modelling strategy when the module implements an analysis.

The remainder of the paper is organized as follows. In section 2, we present an overview of our static-dynamic framework. VMAD is detailed in section 3, while the static preparation phase of the code is presented in section 4. The instrumentation and analysis processes used to show our system's abilities are described in section 5. Finally, we give some benchmark results in section 6, related work in section 7, and we conclude in section 8.

Figure 1: Framework overview. (The application source code is compiled into a binary file containing instructions, code labels, callbacks and data for the VM; the Virtual Machine reads the headers, loads the relevant modules, and drives the module executions: initialization, activation, deactivation, termination, reinitialization and module-specific operations.)

    2. Framework overview

We consider as an “instrumentation process” a process attached to a particular kind of original instruction in order to monitor its execution – for instance, a process dedicated to collecting all addresses accessed through any memory instruction of a code extract. An “analysis process” uses the information collected by an instrumentation process for some computation – for instance, to model the program behavior as a graph. In the following, we use the term “profiling process” to denote either an instrumentation or an analysis process.

VMAD has been built with great care for performance and runtime overhead. Hence it does not use any software dynamic translation, which would delay the run of the input program as in Valgrind [10], DynamoRIO [3], Pin [9], Dyninst [4] or Strata [5]. Further, instrumentation instructions are not inserted on-the-fly by replacing NOP instructions previously inserted at compile time, as done by PEBIL [6]. Rather, several copies of the same code extracts are built at compile time, each copy corresponding to a phase of the whole program run in which the profiling processes either operate fully, partially, or are completely deactivated. The price to pay with this approach is a larger program binary file. However, great care can also be taken to minimize the size of the copies by inserting branches to the original code whenever possible. Besides performance, another noticeable benefit of this approach is the opportunity to create advanced instrumentations whose instrumented copies can be far different from the original code.

The static-dynamic collaborative framework is depicted in figure 1. At compile time, the C/C++ source code with some dedicated pragmas is translated into the LLVM intermediate representation (IR) with additional specific metadata. An LLVM pass creates copies of the targeted code extracts in which instrumentation instructions are inserted, but only if convenient information can be found in the IR. This is not always possible. For example, in the LLVM IR, all data accesses are made from an infinite set of registers, and memory accesses can only be actually identified in the final assembly/binary code. Further, the number and kind of memory accesses obviously depend on the optimizations applied by the compiler – particularly register allocation. Hence some instrumentations can only be inserted in the final assembly code during a dedicated phase of the compiler backend. During this phase, some additional code and data are inserted which allow the VM to take control of the profiling processes and to pass the requested parameters to the relevant modules.

Besides instrumentation instructions and labels attached to some targeted instructions and control structures – e.g., memory accesses, loops – the additional code can consist of decision blocks providing a way to toggle between instrumented and non-instrumented code snippets, following some conditionals. Such blocks can also contain updates of variables used by the current profiling process, for instance a counter providing the number of times an instrumentation has already occurred. Some callbacks are also inserted in the code in order to invoke VMAD and its related modules when necessary.
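To make the role of decision blocks concrete, here is a C-level sketch of the toggling logic; VMAD actually emits it as x86-64 assembly and patches the callback pointer at startup, and all names below are invented:

#include <stdio.h>

/* Hypothetical C rendering of a decision block guarding one loop. */
static void instrumented_body(void) { puts("instrumented copy runs"); }
static void original_body(void)     { puts("original copy runs"); }
static void vm_operation(void)      { puts("callback into VM module"); }

static void (*vmad_cb)(void) = vm_operation;  /* patched by the VM */
static int virtual_iter;                      /* times the loop already ran */

static void decision_block(void)
{
    if (virtual_iter < 3) {       /* e.g. first three iterations profiled */
        vmad_cb();                /* invoke the VM while instrumenting */
        instrumented_body();
    } else {
        original_body();          /* avoid all instrumentation overhead */
    }
    virtual_iter++;
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        decision_block();
    return 0;
}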

The inserted data, organized as headers, informs the VM about the kind of profiling process the input program has to be managed for. It also provides all the necessary information: addresses in the code where instrumentations occur, values and addresses to be patched, pointers to the relevant parameters, etc. Common symbols are used by both the VM and the binary file so that the VM can recover all the necessary entry points in the binary file thanks to the dynamic linker.

At runtime, we use LD_PRELOAD to load VMAD's dynamic shared library at the startup of the handled program, providing its own version of the C-library entry point __libc_start_main. This allows VMAD to first set itself up by reading all the relevant information in the input program binary file.
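The interposition follows the usual glibc LD_PRELOAD pattern; a minimal sketch, in which vmad_setup stands in for VMAD's actual initialization:

#define _GNU_SOURCE
#include <dlfcn.h>

/* Sketch of __libc_start_main interposition via LD_PRELOAD (glibc-specific);
   vmad_setup() is a hypothetical stand-in for VMAD's setup phase. */
static void vmad_setup(void) { /* read headers, load modules, patch code */ }

typedef int (*start_main_t)(int (*main)(int, char **, char **),
                            int argc, char **ubp_av,
                            void (*init)(void), void (*fini)(void),
                            void (*rtld_fini)(void), void *stack_end);

int __libc_start_main(int (*main)(int, char **, char **),
                      int argc, char **ubp_av,
                      void (*init)(void), void (*fini)(void),
                      void (*rtld_fini)(void), void *stack_end)
{
    /* locate the real C-library entry point */
    start_main_t real = (start_main_t)dlsym(RTLD_NEXT, "__libc_start_main");

    vmad_setup();   /* set up the VM before the program starts */

    return real(main, argc, ubp_av, init, fini, rtld_fini, stack_end);
}

Built with gcc -shared -fPIC -o libvmad.so vmad.c -ldl, such a library is activated by running LD_PRELOAD=./libvmad.so ./application.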

VMAD first scans all the headers found in the input binary file. Then it loads the required modules and makes some updates by patching the binary code. Each module grants itself write permissions before patching. At this step, the input program starts its run. When needed, VMAD is activated through the callbacks that were inserted in the code at compile time.

// backup of the stack red zone:
sub    $0x80,%rsp
// backup of the scratch registers:
...
// stack adjustment (x86-64 convention):
mov    %rsp,%rbp
mov    $0xfffffffffffffff0,%rsi
and    %rsi,%rsp
// move 0 to %rax (AMD x86-64 convention):
mov    $0x0,%rax
// registers for the call; $0x0 will be patched:
mov    $0x0,%rdi    // address of the module
mov    $0x0,%rsi    // address of the operation
// function call:
// 1st parameter = rdi (convention), 2nd parameter = rsi
callq  *%rsi
mov    %rbp,%rsp    // stack readjustment
// restoration of the scratch registers:
...
// restoration of the stack red zone:
add    $0x80,%rsp

Figure 2: callback in x86-64 assembly code.

    3. The virtual machine VMAD

As described in the previous section, VMAD makes use of three kinds of specific information inserted at compile time in the program binary file:

• instrumentation code that has been inserted before or after original instructions in order to monitor their run;

• blocks of instructions dedicated to controlling the global behavior of the profiling process, organized as decision blocks and containing callbacks to invoke some VM operations defined in modules;

• some data corresponding to module parameters, pointers and memory addresses.

Each instrumentation collaborates with a dynamic module. Both share information using a fixed-size header previously inserted in the binary file. At startup, as soon as the headers and associated parameters have been read from the input binary file, the VM loads the relevant modules and instantiates them. Parameters fetched at this point are static data, and are accessed either by the VM – for instance to know which modules are required – or by modules that have been loaded by the VM – for instance to get memory addresses that have to be patched – or by instrumenting instructions – for instance to allocate memory for register savings. On the other hand, dynamic data is accessed through pointers set up by the relevant VM modules, which allocate the memory they require.

Each module is structured around at least five main entry points: init, whose role is to instantiate a profiling process; quit, to kill such an instance; and on, off and reset, to activate, deactivate and reset a profiling process. Additional operations can also be provided by a module. These operations are invoked through callbacks previously inserted in the input binary code and patched by init so that they point to the relevant instance of the module and to the relevant operation. These callbacks are inserted at some control points in the program binary file and have the common form shown in figure 2.
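A plausible C rendering of such a module descriptor (the exact interface is not shown here, so the field types are hypothetical):

/* Hypothetical descriptor for a VMAD module; the five entry points are
   the ones named in the text (init, quit, on, off, reset). */
struct vmad_module {
    void *(*init)(void *params);   /* instantiate a profiling process */
    void  (*quit)(void *inst);     /* kill such an instance           */
    void  (*on)(void *inst);       /* activate                        */
    void  (*off)(void *inst);      /* deactivate                      */
    void  (*reset)(void *inst);    /* reset                           */
    /* additional, module-specific operations may follow */
};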

As soon as the instrumented program has been launched, VMAD performs the following operations.

1. Thanks to a commonly defined symbol, it reads all the headers in the program binary file. For each header:

(a) VMAD gets the type of instrumentation it will have to manage;

(b) it gets all the necessary parameters;

(c) it loads the module related to the type of instrumentation;

(d) it updates the module pointer inside the header.

2. Each loaded module performs the following operations:

(a) it allocates the memory it requires and updates the related pointer inside the header;

(b) since, by default, the program is able to run without the VM, it patches the code in order to branch to the instrumented code;

(c) it patches all the callbacks in the code to point to the convenient functions of the module.

3. VMAD calls the original __libc_start_main.

After this last point, the program runs, and VM operations are only invoked through callbacks. Hence deactivation – branching to non-instrumented code – and the end of profiling are decided from the decision blocks and the invoked VM operations. Reactivation can be decided from a delayed signal, whose associated handler patches the code in order to branch back to the instrumented code.

4. Preparing the code at compile time using LLVM

Code analysis and optimization starts, in our approach, at the level of the original source code. The programmer may guard regions of code with specific pragmas, aimed at identifying the code sections to be processed.

Our work relies on the LLVM compiler and its Clang front-end, which has to be extended to handle the newly introduced pragmas. Making the Clang front-end aware of a new pragma first implies augmenting the parser by defining a new class for the pragma. The class contains a specification of the pragma handler, which describes the action to be taken once the compiler encounters the pragma in the source code. In short, the handler verifies the syntax of the pragma and calls the corresponding action, which has to be available in the list of actions taken by the compiler and to contain the semantics of the pragma.

For instance, a pragma may be attached to a specific structure such as a compound statement, a function or a loop. In this case, the action consists in marking this structure with a symbol indicating the existence of the pragma. In this respect, we have defined a new data structure, PragmaCollector, which keeps track of all the associated pragmas. The next step is code generation. As we are extending the LLVM compiler, the source code is converted into LLVM IR. The decision to be taken at this point is whether to create a new instruction for the pragma, to use LLVM intrinsics, annotated attributes, or LLVM metadata¹. Due to various disadvantages presented by the first three options, we selected the last one: metadata is inserted in the generated code to mark the presence of the pragma.

In what follows, we add an LLVM pass taking the corresponding actions based on the metadata. In the case of instrumentation, we want to create different versions of the code, namely original and instrumented. The approach is to select the code carrying the metadata information and to duplicate it. One of the versions remains in the original form, with minimal alterations, while the second version is enhanced with instrumentation code. Depending on the type of instrumentation, the LLVM IR may not contain sufficient information. Should one aim to detect dependencies between data structures or to model the code, for example to interchange loops, the LLVM IR offers great support and malleability; processing the code represented in LLVM IR is advisable when higher-level information is required.

Although the LLVM IR provides interesting information, low-level instrumentation remains impossible for several performance-related mechanisms, such as tracing memory behaviour. This is due to the fact that the LLVM IR uses an infinite number of virtual registers, which are only later mapped either to physical registers or to memory locations. As registers are not yet allocated in the LLVM IR, one cannot distinguish whether the LLVM “load” and “store” instructions represent memory or register accesses. Additionally, the LLVM IR is in static single assignment (SSA) form, which simplifies the analysis of the control flow graph but introduces a number of unnecessary “load” and “copy” instructions. These are eliminated when generating the code for a specific target architecture. Also, SSA PHI instructions are not supported by traditional instruction sets, hence the compiler replaces them with instructions preserving their semantics, but which are not present in the LLVM bytecode [8]. Under these circumstances, one needs to convert the LLVM bytecode to be instrumented into assembly code.

¹ Metadata can be used since LLVM version 2.6.

The code which is not instrumented is kept in the LLVM IR, conserving the higher-level information for further compiler passes. On the other hand, regions marked for instrumentation are cloned and extracted into new functions. They are analyzed instruction-wise, and a specific macro is inserted before each of the tackled instructions. These macros are later expanded into instrumenting code.

Once the various versions of the code have been generated, a mechanism to choose one or another for execution is compulsory. In this respect, the code has to be prepared to interact with the VM, as the execution flow is dynamically guided. Versions of code are preceded by a decision block containing calls to the VM with parameters describing the code structure. At runtime, the VM selects one code version, based on the current values of the parameters, and patches the code accordingly. During the execution, additional calls to the VM are performed to provide information about the current status. These are specific to the targeted type of instrumentation and are strategically placed at the key points of the program. The callbacks to the VM represent a compromise between performance and generality, as pointers to the adequate module functions are patched at start-up. In return, this ensures the genericity of our framework, allowing the VM to handle a multitude of actions.

As the code is patched dynamically, at compile time one has to mark points of high importance, such as the beginning and end of cloned and instrumented regions, or decision blocks. Also, the original flow is modified statically to include the newly inserted structures and calls. Our strategy is to insert these code snippets as x86-64 assembly code, to ensure that no modifications appear in the last phases of code generation. Furthermore, specific instructions are coded in hexadecimal representation, on the precise number of bits required for patching. In figure 2, the two lines marked as address of the module and address of the operation are shown, for clarity purposes, as written in x86-64 assembly code, whereas they are actually inserted in the semantically equivalent hexadecimal form, as shown in figure ??. The latter representation allows accurate control of the exact number of generated bits, which is crucial for dynamic code patching. In our example, the two instructions mov $0x0,%rdi and mov $0x0,%rsi would each encode $0x0 on only 8 bits. However, VMAD requires 64 bits to be allocated for patching the addresses, since they can be 64-bit pointers. Therefore, the compiler must insert these instructions in hexadecimal form.

Figure 3: Headers and parameters.

In addition to preparing the code, a number of headers and parameters are annexed to the generated binary code. The list of headers is specific to the type of instrumentation, as they determine the modules to be loaded in the VM. Moreover, they are linked to the corresponding parameters, containing higher-level information that is statically available but would be prohibitively time-expensive to identify in the binary representation. At the same time, the compiler transmits as parameters instrumentation-specific information, such as the addresses of the code snippets added to the original code.

As illustrated in figure 3, the list of headers is introduced by its length, given by vmad_headers_nb. Each header has a fixed size containing three entries: the unique ID of the instrumentation, the type of the instrumentation, and the address of the associated parameters.
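A plausible C rendering of this layout (apart from vmad_headers_nb, the names and field types are hypothetical):

#include <stdint.h>

/* Plausible shape of the header area; the text specifies only the
   leading count and the three entries per header. */
extern uint64_t vmad_headers_nb;    /* number of headers that follow */

struct vmad_header {
    uint64_t id;       /* unique ID of the instrumentation       */
    uint64_t type;     /* type of the instrumentation            */
    void    *params;   /* address of the associated parameters   */
};

extern struct vmad_header vmad_headers[];

/* At startup the VM would iterate over vmad_headers[0..vmad_headers_nb-1],
   load the module matching each type, then patch the module pointers. */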

The list of parameters may have a varying length and include information such as labels indicating the sections of code to be patched by the VM, or specific details, for instance the characteristics of a loop: its depth or its parent loop. The parameters may differ depending on the instrumentation type.

All in all, any compiler can be adapted to interact with the VM, following the steps presented above, to provide the binary code in a specific format, accompanied by the list of headers and parameters.

5. Instrumenting loop nests and analyzing memory accesses

We underline VMAD's features by creating a framework tailored to emphasize its capabilities. It consists in instrumenting and profiling the memory accesses of critical pieces of code – loop nests – with a minimal time overhead. We focus on loop nests, as they represent the major part of the whole execution time of compute-intensive applications. The goal of our instrumentation framework is to collect the memory addresses accessed during samples of the executed iterations and, if possible, to compute a linear function interpolating their values. Additionally, the loop trip counts of each loop, except the outermost, are also collected for a few runs to be linearly interpolated. Such profiling is particularly useful for nested while-loops accessing memory through indirect references or pointers. If a linear model of the memory accesses and of the loop trip counts is obtained, such loop nests can then be optimized and parallelized using approaches dedicated to for-loops, such as the polyhedral model [1, 2].

    5.1. Loop nest instrumentation

Efficient and advanced loop nest instrumentation consists in profiling only a subset of the executed iterations of each loop. The complexity of the method is outlined in the case of nested loops, as instrumentation depends not only on the iteration of the current loop, but also on those of the parent loops. For a thorough understanding, consider the loop nest in figure 4. The first three iterations of each loop are instrumented. One may easily notice that instrumented and non-instrumented iterations alternate, hence the execution has to switch from one code version to another at runtime. Once the outermost loop profile has been completed, the execution can continue with a non-profiled version of the loop nest, thus inducing no overhead for the remaining iterations.

For linear interpolation of memory accesses, we have to profile the first three iterations of each loop. The first two are dedicated to collecting data and computing the affine function coefficients, while the third iteration provides data to verify the correctness of the interpolation. As an extension, one may consider it beneficial to set the instrumentation on/off instruction-wise. In the case of a loop containing an if-then-else structure, sufficient information may be acquired in three iterations for a subset of the monitored instructions, but this might not be enough for some conditionally executed instructions. Therefore, one may consider instrumenting only these instructions in subsequent iterations, avoiding unnecessary overhead. Similarly, once the instrumentation phase has ended, it may be later restarted by the VM, if required.

Statically, our LLVM pass creates copies of the critical loop nests, extracts them into new functions and converts them to x86-64 assembly code. A second pass analyses these functions and precedes each instruction accessing memory with instrumentation code that computes the actual memory location being accessed and makes a call to the VM to transmit the collected data. Figure 5 illustrates the structure of the code of figure 4 and the links between the different versions. Blocks Oi, Oj and Ok represent the original version of the code, while Ii, Ij and Ik represent the instrumented bodies of each loop. The instrumented and original versions are connected together at their entry point, where a choice is made at runtime deciding which version to run, based on the values of the virtual iterators. One decision block is associated with each loop, represented by Di, Dj and Dk, correspondingly. More precisely, if i = 1, the value of iterator i allows instrumentation, therefore its body will be instrumented (block Ii). But if j = 3, the body of the second loop will be non-instrumented (Oj), as will the body of its subloop (Ok). The same strategy is applied to the other iterations as well. As a general rule, if a parent loop is non-instrumented, all its subloops will be non-instrumented. On the other hand, if the parent loop is instrumented, its subloops may or may not be instrumented, depending on the values of their own iterators. To handle the trip counts of the considered loops uniformly – either while-loops or for-loops with more than one exit point – we introduce “virtual iterators”, which are maintained to mirror the actual number of executed iterations.
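A hypothetical C rendering of this versioning for one loop of the nest (VMAD implements it with patched assembly; all names below are invented):

#include <stdio.h>

/* One loop of the nest: a virtual iterator mirrors the executed iterations
   and the decision block picks the instrumented (I) or original (O) body. */
enum { PROFILED_ITERS = 3 };    /* first three iterations instrumented */

static int  more_work(int vi)        { return vi < 6; }  /* stand-in exit test */
static void instrumented_body(int i) { printf("I, iteration %d\n", i); }
static void original_body(void)      { puts("O"); }

static void run_loop(int parent_instrumented)
{
    int vi = 0;                              /* virtual iterator */
    while (more_work(vi)) {                  /* works for while-loops too */
        if (parent_instrumented && vi < PROFILED_ITERS)
            instrumented_body(vi);           /* block I, reports to the VM */
        else
            original_body();                 /* block O, no overhead */
        vi++;                                /* mirrors executed iterations */
    }
}

int main(void) { run_loop(1); return 0; }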

At compile time, we mark the beginning and the end of the original and instrumented versions of the loop nests with labels, which are appended to the list of parameters given to the VM. Additionally, callbacks to the VM are performed:

• in each decision block – to decide the version to execute;

• at the end of each instrumented iteration – to send the acquired data to the VM for processing;

• at the end of the loop nest – to inform the VM that its execution has finished.

Lastly, a list of headers and parameters is prepared, notifying the VM of the modules required for this instrumentation: vmad_loop, vmad_gather_memory_addresses and vmad_interpolation. One instance of each of these modules is created per loop, at runtime. The first module encloses the mechanisms necessary for handling loops, the second one collects the memory accesses performed inside the loops, while the last one performs the interpolation. Each module depends on the previous one to complete its task; at the same time, their implementations remain modular. The profiling process can easily be extended with various other instrumentations, as the loop framework may be reused for new purposes, independently of the other modules. Likewise, the memory locations acquired by vmad_gather_memory_addresses may be used as the foundation for new types of analysis.

Figure 4: Loop nest instrumentation.

Figure 5: Code structure (the INIT block selects between the versions with and without the runtime system).

The list of parameters contains specific information, such as the addresses of the code to be patched at startup, the structure of the loop nest, or the memory locations accessed from the loop body.

At runtime, the VM parses the list of headers, loads the solicited modules and patches the code to enable instrumentation. During execution, it is triggered by the callbacks to select among the versions of code and to process the information collected by the instrumentation.

    5.2. Analyzing memory accesses

Our instrumentation tracks the memory accesses performed inside loops and computes associated affine functions of the loop bounds, interpolating the accessed addresses when possible.

Since the number of accessed locations can be very high for a memory-intensive loop nest, the acquired data is processed immediately by the interpolation process, rather than stored for later use.

For each instrumented loop, a buffer is created at compile time to (re)store the state of the machine before the interpolation process. At runtime, the VM allocates space to be populated dynamically with the accessed memory locations and, for each loop, it patches the corresponding address to store the collected information. Additionally, it creates a stack per loop nest and pushes a structure to accommodate the function coefficients together with the depth of the embedding loop, for each of the instructions accessing memory and for the loop bounds (see the sketch after the following list). Hence, each loop pushes on the stack sufficient space for:

• the coefficients of the functions interpolating the subloops' upper bounds,

• the coefficients of the interpolation functions of the memory locations accessed by the instructions contained in the loop nest; for each instruction, these coefficients correspond to the indices of the enclosing loops, plus a constant.

As the instrumented iterations of a subloop are executed, the VM reads the values of the memory locations from the designated buffer, and the corresponding function coefficients are computed and stored in the associated positions. As soon as the execution of the subloop ends, its structure is popped, and the values of the coefficients, as well as the total number of iterations of the subloop, are propagated into the structure of the parent loop.

As a new iteration of the parent loop begins, the VM pushes a new structure on the stack and computes the new memory access function. Communication with the VM is achieved by means of a dirty flag, which indicates that a new memory location is available in the buffer.

In the same manner, the process is repeated until all loops finish their execution. At this point, all the coefficients have been computed and the functions have been verified for linearity.


6. Experiments

In this section, we present the results of our experiments running the interpolation of memory accesses in loops. We targeted all the C codes from the SPEC CPU 2006 benchmark suite [12] and four codes from the Pointer-Intensive benchmarks [11]. LLVM with the Clang front-end cannot handle Fortran codes, and all the C++ codes are Objective-C codes that we failed to compile with LLVM. We also failed to compile the gcc benchmark using our own Makefile, since SPEC requires some specific flags to be set. We added our pragma in the source codes for some loop nests in some of the most time-consuming functions [13]. We ran the benchmarks using the ref input files to compute VMAD's runtime overhead, and using the test input files to get output files with the interpolation results, since runs using the ref files would have produced a huge amount of data for this instrumentation. The execution platform is a 3.4 GHz AMD Phenom II X4 965 micro-processor with 4GB of RAM running Linux 2.6.32.

Our measurements are shown in table 1. We ran each program in its original form and in its instrumented form to check the runtime overhead induced by using VMAD. For each instrumented loop nest, the dynamic profiling is activated each time its enclosing function is invoked. For each program, the second column shows VMAD's runtime overhead, the third column shows the functions in which loop nests were instrumented, the fourth column shows the total number of instrumented loops per function, the fifth column shows the total number of instrumented memory instructions in the program, the sixth column shows the total number of times instrumented memory accesses effectively ran, and the last column shows the number of memory accesses that were identified as linear.

For most programs, VMAD induces a very low runtime overhead, which is even negligible for perlbench, bzip2, milc, hmmer, h264ref and lbm. For the programs sjeng and sphinx3, the significant overheads are mainly due to the fact that the instrumented loops execute only a few iterations, but they are enclosed by functions that are called many times. Thus all iterations are run fully instrumented, since each call represents a very short execution time. However, the profiling strategy could be improved to handle such cases specifically by deactivating the instrumentation after a few calls. Program milc shows the opposite behavior, since a few memory instructions are run a great number of times; in such a case the runtime overhead is quite low. For the Pointer-Intensive benchmarks, the execution times are too small – on the order of milliseconds – to get relevant overhead measures: either a large runtime overhead is obtained, since VMAD inevitably induces a fixed minimum overhead (bc), or even a speedup is obtained (ft).

#pragma instrument_mem_add
{
for (mrA = groupA.head, mrPrevA = NULL; ...) {
    ....
    for (mrB = groupB.head, mrPrevB = NULL;
         mrB != NULL;
         mrPrevB = mrB, mrB = (*mrB).next) {
        ...
        gp = D[(*mrA).module] + D[(*mrB).module]
             - CAiBj(mrA, mrB);
        ....
        if (gp > gpMax) {
            gpMax = gp;
            maxA = mrA; maxPrevA = mrPrevA;
            maxB = mrB; maxPrevB = mrPrevB;
        }

0 | 0 | 140736985202616 |
0 | 64 | 24881880 |
0 | 140736985202632 |
64 | 0 | 24881848 |
4 | 0 | 6345856 |
4 | 6345896 |
0 | 0 | 140736985202244 |
0 | 0 | 4234836 |
0 | 0 | 140736985202244 |
0 | 0 | 140736985202572 |
0 | 0 | 140736985202568 |
0 | 64 | 24881872 |
0 | 0 | 140736985202616 |

Figure 6: Code extract from ks and its corresponding interpolation functions.

We also noticed that this particular instrumentation process makes a program's binary file larger than the original version, by 400 bytes per instrumented memory instruction on average. For instance, the instrumented version of program milc is about 267 kbytes, versus 191 kbytes for the non-instrumented version. Figure 6 shows an extract of program ks together with some of the interpolation functions computed by our profiling process. For instance, one of the memory accesses can be modeled as 64i + 24,881,848, where i denotes a virtual outer loop index.

    7. Related work

Most of the tools providing dynamic profiling, such as Pin [9], Dyninst [4], Strata [5], DynamoRIO [3] or Valgrind [10], use software dynamic translation (SDT) to handle the insertion of instrumenting instructions. This approach inevitably introduces runtime overhead. To our knowledge, VMAD is the first proposal providing low-level instrumentation initiated from the source code with negligible runtime overhead in most cases.

One of the most popular tools is Pin [9], a software system that performs runtime binary instrumentation of Linux and Windows applications. Pin's aim is to provide an instrumentation platform for building a wide variety of program analysis tools, called pintools. A pintool consists of instrumentation, analysis, and callback routines. A just-in-time (JIT) compiler is used to insert instrumentation into a running application: it recompiles and instruments small chunks of binary instructions immediately prior to executing them. Pin overwrites the entry point of procedures with jumps to dynamically generated instrumentation.

Program | Runtime overhead | Instrumented functions (# instrumented loops) | # instrumented instructions | # instrumented memory accesses | # linear memory accesses
perlbench | 0.073% | S_regmatch (17), S_find_byclass (36) | 3,873 | 404,388 | 8,420
bzip2 | 0.24% | mainsort (7), fallbackSort (18) | 502 | 1,053 | 608
mcf | 20.76% | primal_bea_mpp (2), replace_weaker_arc (1), refresh_potential (3) | 138 | 4,054,863 | 2,848,589
milc | 0.081% | mult_su3_na (2), mult_su3_nn (2), main (3), gaugefix (1), imp_gauge_action (4), setup_layout (4) | 195 | 1,988,256,195 | 1,988,256,000
hmmer | 0.062% | P7Viterbi (1), P7SmallViterbi (4), P7ParsingViterbi (1), ShadowTrace (3), P7Forward (3), ReadSELEX (5), P7Fastmodelmaker (2), printivec (1), seqdecode (2) | 742 | 845 | 0
sjeng | 182% | std_eval (2), setup_attackers (5) | 662 | 1,155,459,440 | 1,032,148,267
libquantum | 3.88% | quantum_toffoli (1), quantum_sigma_x (1), quantum_cnot (1) | 42 | 203,581 | 203,078
h264ref | 0.49% | SetupFastFullPelSearch (1), FastFullPelBlockMotionSearch (1), FullPelBlockMotionSearch (4), BPredPartitionCost (2) | 349 | 32,452,013 | 30,707,102
lbm | 0% | LBM_compareVelocityField (3), LBM_storeVelocityField (3), storeValue (1) | 136 | 358 | 0
sphinx3 | 172% | mgau_eval (5), vector_gautbl_eval_logs3 | 194 | 78,437,958 | 51,566,707
anagram | -5.37% | BuildMask (3) | 53 | 159 | 134
bc | 183% | bc_divide (3), YY_DECL (1) | 142 | 302,034 | 243,785
ft | -8.46% | MST (4) | 36 | 36 | 22
ks | 29.7% | SwapSubsetandReset (1), FindMaxandSwap (2), UpdateDS (2) | 102 | 42,298 | 29,524

Table 1: Measures made on some of the C programs of the SPEC CPU2006 (first part) and Pointer-Intensive (second part) benchmark suites.

Although Pin is similar to VMAD in the sense that it provides a way to implement some advanced profiling strategies through pintools, it focuses exclusively on runtime instrumentation of binaries. Thus it does not require the source code to be available. On the other hand, some advanced dynamic analyses, like the interpolation of memory accesses in loops presented in this paper, would be very difficult to implement with Pin. Of course, the compile-time phase of our framework plays an important role in providing such a wider scope of analysis opportunities. This phase is also important in minimizing the runtime overhead, whereas Pin needs to parse the binary code, insert instructions and compile some code snippets at runtime. VMAD can be seen as a high-level version of Pin, where low-level instrumentations are initiated from the source code.

The PEBIL toolkit [6] is closer to VMAD since it does not use SDT but static binary instrumentation. However, the instrumentation strategy is different, since PEBIL uses function relocation to acquire enough space at instrumentation points to insert full-length branch instructions at runtime. We use two different strategies to transfer control from the application to the instrumentation code: at compile time, we insert branch instructions that initially branch to their following instructions and are patched at runtime by VMAD; we also insert callbacks in the instrumented code snippets that are patched at runtime to point to the proper loaded module functions of VMAD.

    8. Conclusion

In this paper, we presented VMAD, a dynamic profiling infrastructure in which advanced analyses can be implemented with almost negligible runtime overhead, since it does not use software dynamic translation, unlike most dynamic profiling tools. We extended the LLVM compiler to handle specific pragmas allowing the developer to initiate low-level instrumentation from specific parts of the source code. Dedicated LLVM passes duplicate the targeted code snippets into instrumented and non-instrumented versions. In the instrumented versions, instrumentation instructions, data, and callbacks are inserted. At runtime, the virtual machine VMAD loads the necessary profiling modules and patches the application code such that all address references to VMAD's functions and data are correct. VMAD's potential has been shown by implementing a profiling strategy interpolating memory accesses in loops. Almost negligible runtime overhead has been measured by running VMAD on some SPEC2006 and Pointer-Intensive benchmark programs. To our knowledge, VMAD is the first proposal allowing developers to initiate low-level analyses from selected parts of the source code.

For our next developments, we plan to focus on modules that perform code transformations and to extend VMAD accordingly. For instance, memory access interpolation can be followed by modules performing data dependence analysis and loop parallelization. Another goal is to generate analysis and transformation modules on-the-fly, since VMAD has also been tailored to support this feature.

    References

[1] U. Banerjee. Loop Transformations for Restructuring Compilers – The Foundations. Kluwer Academic Publishers, 1993. ISBN 0-7923-9318-X.

[2] C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT'13: IEEE International Conference on Parallel Architecture and Compilation Techniques, pages 7–16, Juan-les-Pins, France, September 2004.

[3] D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dynamic optimization. In CGO '03: Proc. of the Int. Symposium on Code Generation and Optimization, pages 265–275, 2003.

[4] B. Buck and J. K. Hollingsworth. An API for runtime code patching. Int. J. High Perform. Comput. Appl., 14(4):317–329, 2000.

[5] N. Kumar, B. R. Childers, and M. L. Soffa. Low overhead program monitoring and profiling. In PASTE '05: Proc. of the 6th ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering, pages 28–34. ACM, 2005.

[6] M. Laurenzano, M. Tikir, L. Carrington, and A. Snavely. PEBIL: Efficient static binary instrumentation for Linux. In ISPASS-2010: IEEE Int. Symposium on Performance Analysis of Systems and Software, 2010.

[7] LLVM compiler infrastructure. http://llvm.org.

[8] Register allocation in the LLVM compiler. http://llvm.org/docs/CodeGenerator.html#regalloc.

[9] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. In PLDI '05: Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 190–200, 2005.

[10] N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proc. of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2007.

[11] Pointer-Intensive benchmark suite. http://pages.cs.wisc.edu/~austin/ptr-dist.html.

[12] SPEC CPU2006. http://www.spec.org/cpu2006.

[13] R. P. Weicker and J. L. Henning. Subroutine profiling results for the CPU2006 benchmarks. SIGARCH Comput. Archit. News, 35(1):102–111, 2007.
