
Optimizing real-world applications with GCC Link Time Optimization

Taras Glek, Mozilla Corporation, [email protected]

Jan Hubička, SuSE ČR s.r.o., [email protected]

Abstract

GCC has a new infrastructure to support link time optimization (LTO). The infrastructure is designed to allow linking of large applications using a special mode (WHOPR) which supports parallelization of the compilation process. In this paper we present an overview of the design and implementation of WHOPR and present test results of its behavior when optimizing large applications. We give numbers on compile time, memory usage and code quality in comparison to the classical file-by-file optimization model. In particular we focus on the Firefox web browser. We show the main problems seen only when compiling a large application, such as startup time and code size growth.

1 Introduction

Link Time Optimization (LTO) is a compilation mode in which an intermediate language (an IL) is written to the object files and the optimizer is invoked during the linking stage. This allows the compiler to extend the scope of inter-procedural analysis and optimization to encompass the whole program visible at link-time. This gives the compiler more freedom than the file-by-file compilation mode, in which each compilation unit is optimized independently, without any knowledge of the rest of the program being constructed.

Development of the LTO infrastructure in the GNU Compiler Collection (GCC) started in 2005 [LTOproposal] and the initial implementation was first included in GCC 4.5.0, released in 2010. The inter-procedural optimization framework was independently developed starting in 2003 [Hubička04], and was designed to be used both independently and in tandem with LTO.

The LTO infrastructure represents an important change to the compiler, as well as the whole tool-chain. It consists of the following components:

1. A middle-end (the part of the GCC back-end independent of the target architecture) extension that supports streaming an intermediate language representing the program to disk,

2. A new compiler front-end (the LTO front-end), which is able to read back the intermediate language, merge multiple units together, and process them in the compiler's optimizer and code generation backend,

3. A linker plugin integrated into the Gold linker, which is able to call back into the LTO front-end during linking [Plugin],

(The plugin interface is designed to be independent of both the Gold linker and the rest of GCC's LTO infrastructure; thus the effort to extend the tool-chain for plugin support can be shared with other compilers with LTO support. Currently it is also used by LLVM [Lattner].)

4. Modifications to the GCC driver (collect2) to support linking of LTO object files using either the linker plugin or direct invocation of the LTO front-end,

5. Various infrastructure updates, including a new symbol table representation and support for merging of declarations and types within the middle-end, and

6. Support for using the linker plugin for other components of the tool-chain—such as ar and nm. (Libtool was also updated to support LTO.)

The inter-procedural optimization infrastructure consists of the following major components:

1. Callgraph and varpool data structures representing the program in an optimizer-friendly form,

2. Inter-procedural dataflow support,


3. A pass manager capable of executing inter-procedural and local passes, and

4. A number of different inter-procedural optimization passes.

Sections 2 and 3 contain an overview with more details on the most essential components of both infrastructures.

GCC is the third free software C/C++ compiler with LTO support (LLVM and Open64 both supported LTO in their initial respective public releases). GCC 4.5.0's LTO support was sufficient to compile small- to medium-sized C, C++ and Fortran programs, but had several deficiencies, including incompatibilities with various language extensions, issues mixing multiple languages, and the inability to output debug information. In Section 4 we describe an ongoing effort to make LTO useful for large real-world applications, discuss existing problems and present early benchmarks. We focus on two applications as a running example throughout the paper: the GCC compiler itself and the Mozilla Firefox browser.

2 Design and implementation of the Link Time Optimization in GCC

The link-time optimization in GCC is implemented by storing the intermediate language into object files. Instead of producing "fake" object files of a custom format, GCC produces standard object files in the target format (such as ELF) with extra sections containing the intermediate language, which is used for LTO. This "fat" object format makes it easier to integrate LTO into existing build systems, as one can, for instance, produce archives of the files. Additionally, one might be able to ship one set of fat objects which could be used both for development and the production of optimized builds, although this isn't currently feasible, for reasons detailed below. As a surprising side-effect, any mistake in the tool chain that leads to the LTO code generation not being used (e.g. an older libtool calling ld directly) leads to the silent skipping of LTO. This is both an advantage, as the system is more robust, and a disadvantage, as the user isn't informed that the optimization has been disabled.

The current implementation is limited in that it only produces "fat" objects, effectively doubling compilation time. This hides the problem that some tools, such as ar and nm, need to understand the symbol tables of LTO sections. These tools were extended to use the plugin infrastructure, and with these problems solved, GCC will also support "slim" objects consisting of the intermediate code alone.

The GCC intermediate code is stored in several sections:

• Command line options (.gnu.lto_.opts)

This section contains the command line options used to generate the object files. This is used at link-time to determine the optimization level and other settings when they are not explicitly specified on the linker command line.

At the time of writing this paper, GCC does not support combining LTO object files compiled with different sets of command line options into a single binary.

• The symbol table (.gnu.lto_.symtab)

This table replaces the ELF symbol table for functions and variables represented in the LTO IL. Symbols used and exported by the optimized assembly code of "fat" objects might not match the ones used and exported by the intermediate code. The intermediate code is less optimized and thus requires a separate symbol table.

There is also the possibility that the binary code in the "fat" object will lack a call to a function, since the call was optimized out at compilation time after the intermediate language was streamed out. In some special cases, the same optimization may not happen during the link-time optimization. This would lead to an undefined symbol if only one symbol table were used.

• Global declarations and types (.gnu.lto_.decls).

This section contains an intermediate language dump of all declarations and types required to represent the callgraph, static variables and top-level debug info.

• The callgraph (.gnu.lto_.cgraph).

This section contains the basic data structure used by the GCC inter-procedural optimization infrastructure (see Section 2.2). This section stores an annotated multi-graph which represents the functions and call sites as well as the variables, aliases and top-level asm statements.


• IPA references (.gnu.lto_.refs).

This section contains references between functions and static variables.

• Function bodies

This section contains function bodies in the intermediate language representation. Every function body is in a separate section to allow copying of the section independently to different object files or reading the function on demand.

• Static variable initializers (.gnu.lto_.vars).

• Summaries and optimization summaries used by IPA passes.

See Section 2.2.

The intermediate language (IL) is the on-disk representation of GCC GIMPLE [Henderson]. It is used for high-level optimization in SSA form [Cytron]. The actual file formats for the individual sections are still in a relatively early stage of development. It is expected that in future releases the representation will be re-engineered to be more stable and allow redistribution of object files containing LTO sections. Stabilizing the intermediate language format will require a more formal definition of the GIMPLE language itself. This was mentioned as one of the main requirements in the original proposal for LTO [LTOproposal], yet five years later we have to admit that work on this has made almost no progress.

2.1 Fast and Scalable Whole Program Optimizations — WHOPR

One of the main goals of the GCC link-time infrastructure was to allow effective compilation of large programs. For this reason GCC implements two link-time compilation modes.

1. LTO mode, in which the whole program is read into the compiler at link-time and optimized in a similar way as if it were a single source-level compilation unit.

2. WHOPR¹ mode, which was designed to utilize multiple CPUs and/or a distributed compilation environment to quickly link large applications² [Briggs].

¹An acronym for "scalable WHole PRogram optimizer", not to be confused with the -fwhole-program concept described later.

WHOPR employs three main stages:

1. Local generation (LGEN)

This stage executes in parallel. Every file in the program is compiled into the intermediate language and packaged together with the local callgraph and summary information. This stage is the same for both the LTO and WHOPR compilation modes.

2. Whole Program Analysis (WPA)

WPA is performed sequentially. The global callgraph is generated, and a global analysis procedure makes transformation decisions. The global callgraph is partitioned to facilitate parallel optimization during phase 3. The results of the WPA stage are stored into new object files which contain the partitions of the program expressed in the intermediate language and the optimization decisions.

3. Local transformations (LTRANS)

This stage executes in parallel. All the decisions made during phase 2 are implemented locally in each partitioned object file, and the final object code is generated. Optimizations which cannot be decided efficiently during phase 2 may be performed on the local callgraph partitions.

WHOPR can be seen as an extension of the usual LTO mode of compilation. In LTO mode, WPA and LTRANS are executed within a single execution of the compiler, after the whole program has been read into memory.

When compiling in WHOPR mode the callgraph partitioning is done during the WPA stage. The whole program is split into a given number of partitions of about the same size, with the compiler attempting to minimize the number of references which cross partition boundaries. The main advantage of WHOPR is to allow the parallel execution of the LTRANS stages, which are the most time-consuming part of the compilation process. Additionally, it avoids the need to load the whole program into memory.

²Distributed compilation is not implemented yet, but since the parallelism is facilitated by generating a Makefile, it would be easy to implement.


The WHOPR compilation mode is broken in GCC 4.5.x. GCC 4.6.0 will be the first release with a usable WHOPR implementation, which will by default replace the LTO mode. In this paper we concentrate on the real-world behavior of WHOPR.

2.2 Inter-procedural optimization infrastructure

The program is represented in a callgraph (a multi-graph where nodes are functions and edges are call sites) and the varpool (a list of static and external variables in the program) [Hubička04].

The inter-procedural optimization is organized as a sequence of individual passes, which operate on the callgraph and the varpool. To make the implementation of WHOPR possible, every inter-procedural optimization pass is split into several stages that are executed at different times of the WHOPR compilation:

• LGEN time:

1. Generate summary
Every function body and variable initializer is examined and the relevant information is stored into a pass-local data structure.

2. Write summary
Pass-specific information is written into an object file.

• WPA time:

3. Read summary
The pass-specific information is read back into a pass-local data structure in memory.

4. Execute
The pass performs the inter-procedural propagation. This must be done without actual access to the individual function bodies or variable initializers. In the future we plan to implement functionality to bring a function body into memory on demand, but this should be used with care to avoid memory usage problems.

5. Write optimization summary
The result of the inter-procedural propagation is stored into the object file.

• LTRANS time:

6. Read optimization summary
Inter-procedural optimization decisions are read from an object file.

7. Transform
The actual function bodies and variable initializers are updated based on the information passed down from the Execute stage.

The implementation of the inter-procedural passes is shared between LTO, WHOPR and classic non-LTO compilation. In the file-by-file mode every pass executes its own Generate summary, Execute, and Transform stages within the single execution context of the compiler. In LTO compilation mode every pass uses the Generate summary and Write summary stages at compilation time, while the Read summary, Execute, and Transform stages are executed at link time. In WHOPR mode all stages are used.

One of the main challenges of introducing the WHOPR compilation mode was solving interactions between the optimization passes. In LTO compilation mode, the passes are executed in a sequence, each of which consists of analysis (or Generate summary), propagation (or Execute) and Transform stages. Once the work of one pass is finished, the next pass sees the updated program representation and can execute. This makes the individual passes independent of each other.

In WHOPR mode all passes first execute their Generate summary stage. Then summary writing ends LGEN. At WPA time the summaries are read back into memory and all passes run their Execute stage. Optimization summaries are streamed and shipped to LTRANS. Finally all passes execute their Transform stage.

Most optimization passes split naturally into analysis, propagation and transformation stages. The main problem arises when one pass performs changes and the following pass gets confused by seeing different callgraphs at the Transform stage than at the Generate summary or the Execute stage. This means that the passes are required to communicate their decisions with each other. Introducing an interface between each pair of optimization passes would quickly make the compiler unmaintainable.

For this reason, the GCC callgraph infrastructure implements a method of representing the changes performed by the optimization passes in the callgraph without needing to update function bodies.


A virtual clone in the callgraph is a function that has no associated body, just a description of how to create its body based on a different function (which itself may be a virtual clone) [Hubička07].

The description of function modifications includes adjustments to the function's signature (which allows, for example, removing or adding function arguments), substitutions to perform on the function body, and, for inlined functions, a pointer to the function it will be inlined into.

It is also possible to redirect any edge of the callgraph from a function to its virtual clone. This implies updating the call site to adjust for the new function signature.

Most of the transformations performed by inter-procedural optimizations can be represented via virtual clones: for instance, a constant propagation pass can produce a virtual clone of a function which replaces one of its arguments with a constant. The inliner can represent its decisions by producing a clone of a function whose body will later be integrated into a given function.

Using virtual clones, the program can be easily updated at the Execute stage, solving most of the pass interaction problems that would otherwise occur at the Transform stage. Virtual clones are later materialized in the LTRANS stage and turned into real functions. Passes executed after a virtual clone was introduced also perform their Transform stages on the new functions, so for a pass there is no significant difference between operating on a real function or a virtual clone introduced before its Execute stage.

Optimization passes then work on virtual clones introduced before their Execute stage as if they were real functions. The only difference is that clones are not visible at the Generate summary stage.

To keep the function summaries updated, the callgraph interface allows an optimizer to register a callback that is called every time a new clone is introduced, as well as when the actual function or variable is generated or when a function or variable is removed. These hooks are registered at the Generate summary stage and allow the pass to keep its information intact until the Execute stage. The same hooks can also be registered at the Execute stage to keep the optimization summaries updated for the Transform stage.

To simplify the task of generating summaries, several data structures in addition to the callgraph are constructed. These are used by several passes.

We represent IPA references in the callgraph. For a function or variable A, the IPA reference is a list of all locations where the address of A is taken and, when A is a variable, a list of all direct stores and reads to/from A. References represent an oriented multi-graph on the union of the nodes of the callgraph and the varpool.

Finally, we implement a common infrastructure for jump functions. Suppose that an optimization pass sees a function A and knows the values of (some of) its arguments. The jump function [Callahan, Ladelsky05] describes the value of a parameter of a given function call in function A based on this knowledge (when doing so is easily possible). Jump functions are used by several optimizations, such as the inter-procedural constant propagation pass and the devirtualization pass. The inliner also uses jump functions to perform inlining of callbacks.

For easier development, the GCC pass manager differentiates between normal inter-procedural passes and small inter-procedural passes. A small inter-procedural pass is a pass that does everything at once and thus cannot be executed at WPA time. It defines only the Execute stage, and during this stage it accesses and modifies the function bodies. Such passes are useful for optimization at LGEN or LTRANS time and are used, for example, to implement early optimization before writing object files. The simple inter-procedural passes can also be used for easier prototyping and development of a new inter-procedural pass.

2.3 Whole program assumptions, linker plugin and symbol visibilities

Link-time optimization gives relatively minor benefits when used alone. The problem is that propagation of inter-procedural information does not work well across functions and variables that are called or referenced by other compilation units (such as from a dynamically linked library). We say that such functions and variables are externally visible.

To make the situation even more difficult, many applications organize themselves as a set of shared libraries, and the default ELF visibility rules allow one to overwrite any externally visible symbol with a different symbol at runtime. This basically disables any optimizations across such functions and variables, because the compiler cannot be sure that the function body it is seeing is the same function body that will be used at runtime. Any function or variable not declared static in the sources degrades the quality of inter-procedural optimization.

To avoid this problem the compiler must assume that it sees the whole program when doing link-time optimization. Strictly speaking, the whole program is rarely visible even at link-time. Standard system libraries are usually linked dynamically or not provided with the link-time information. In GCC, the whole program option (-fwhole-program) declares that every function and variable defined in the current compilation unit³ is static, except for the main function. Since some functions and variables need to be referenced externally, for example by another DSO or from an assembler file, GCC also provides the function and variable attribute externally_visible, which can be used to disable the effect of -fwhole-program on a specific symbol.

The whole program mode assumptions are slightly more complex in C++, where inline functions in headers are put into COMDAT. COMDAT functions and variables can be defined by multiple object files and their bodies are unified at link-time and dynamic link-time. COMDAT functions are changed to local only when their address is not taken, and thus un-sharing them with a library is not harmful. COMDAT variables always remain externally visible; however, for read-only variables it is assumed that their initializers cannot be overwritten by a different value.

The whole program mode assumptions do not fit well when shared libraries are compiled with link-time optimization. The fact that the ELF specification allows overwriting symbols at runtime causes common problems with increased dynamic linking time, and for this reason common mechanisms to solve this problem are already available [Drepper].

GCC provides the function and variable attribute visibility that can be used to specify the visibility of externally visible symbols (or alternatively the -fvisibility command line option).

³At link-time optimization the current unit is the union of all objects compiled with LTO.

ELF defines the default, protected, hidden and internal visibilities. Most commonly used is the hidden visibility. It specifies that the symbol cannot be referenced from outside of the current shared library.

Sadly this information cannot be used directly by the link-time optimization in the compiler, since the whole shared library might also contain non-LTO objects and those are not visible to the compiler.

GCC solves this with the linker plugin. The linker plugin [Plugin] is an interface to the linker that allows an external program to claim ownership of a given object file. The linker then performs the linking procedure by querying the plugin about the symbol table of the claimed objects, and once the linking decisions are complete, the plugin is allowed to provide the final object file before the actual linking is done. The linker plugin obtains the symbol resolution information, which specifies which symbols provided by the claimed objects are bound from the rest of the binary being linked.

At the current time, the linker plugin works only in com-bination with the Gold linker, but a GNU ld implemen-tation is under development.

GCC is designed to be independent of the rest of the tool-chain and aims to support linkers without plugin support. For this reason it does not use the linker plugin by default. Instead the object files are examined before being passed to the linker, and objects found to have LTO sections are passed through the link-time optimizer first. This mode does not work for library archives. The decision on which object files from the archive are needed depends on the actual linking, and thus GCC would have to implement the linker itself. The resolution information is missing too, and thus GCC needs to make an educated guess based on -fwhole-program. Without the linker plugin, GCC also assumes by default that symbols declared hidden are not referred to by non-LTO code.

The current behavior on object archives is suboptimal, since the LTO information is silently ignored and the LTO optimization is skipped without any report. The user can then be easily disappointed by not seeing any benefits at all from the LTO optimization.

The linker plugin is enabled via the command line option -fuse-linker-plugin. We hope that this becomes the standard behavior in the near future. Many optimizations are not possible without linker plugin support.


3 Inter-procedural optimizations performed

GCC implements several inter-procedural optimization passes. In this section we provide a quick overview and discuss their effectiveness on the test cases.

3.1 Early optimization passes

Before the actual inter-procedural optimization is performed, the functions are early optimized. Early optimization is a combination of the lowering passes (where the SSA form is constructed), the scalar optimization passes, and the simple inter-procedural passes. The functions are sorted in reverse postorder (making them topologically ordered for acyclic callgraphs) and all the passes are executed sequentially on the individual functions. Functions are optimized in this order, and only after all passes have finished on a given function is the next function processed.

Early optimization is performed at LGEN time to reduce the abstraction penalty before the real inter-procedural optimization is done. Since this work is done at LGEN time (rather than at link-time), it reduces the size of object files as well as the linking time, because the work is not re-done each time the object file is linked. An object file is often compiled once and used many times. It is consequently beneficial to keep as much of the work as possible in LGEN rather than doing link-time optimization on unoptimized output from the front-end.

The following optimizations are performed:

• Early inlining

Functions that have already been optimized earlier and are very small are inlined into the current function. Because the early inliner lacks any global knowledge of the program, the inlining decisions are driven by code size growth, and only very small code size growth is allowed.

• Scalar optimization

GCC currently performs constant propagation, copy propagation, dead code elimination, and scalar replacement.

• Inter-procedural scalar replacement

For static functions whose address is not taken, dead arguments are eliminated and calling conventions are updated by promoting small arguments passed by reference to arguments passed by value. Also, when a whole aggregate is not needed, only the fields which are used are passed, when this transformation is expected to simplify the resulting code [Jambor].

The main motivation for this pass is to make object-oriented programs easier to analyze locally by avoiding the need to pass the this pointer to simple methods.

• Tail recursion elimination

• Exception handling optimizations

This pass reduces the number of exception handling regions in the program, primarily by removing cleanup actions that were proved to be empty and regions that contain no code which can possibly throw.

• Static profile estimation

When profile feedback is not available, GCC attempts to guess the function profile based on a set of simple heuristics [Ball, Hubicka05]. Based on the profile, cold parts of the function body are identified (such as parts reachable only from exception handling or leading to a function call that never returns). The static profile estimation can be controlled by the user via the __builtin_expect builtin and the cold function attribute.

• Function attributes discovery

GCC has C and C++ language extensions that allow the programmer to specify several function attributes as optimization hints. In this pass some of those attributes can be auto-detected. In particular we detect functions that cannot throw (nothrow), functions that never return (noreturn), and const and pure functions. Const functions in GCC terminology are functions that only return their value, and their return value depends only on the function arguments. For many optimization passes const functions behave like a simple expression (allowing dead code removal, common subexpression elimination, etc.). Pure functions are like const functions but are allowed to read global memory. See [GCCmanual] for details.

• Function splitting

This pass splits functions into headers and tails to aid partial inlining.


Early optimization is very effective for reducing the abstraction penalty. It is essential for benchmarks that, for instance, use the Pooma library: early inlining on the Tramp3d benchmark [Günther] causes an order of magnitude improvement [Hubicka07].

The discovery of function attributes also improves the control flow graph accuracy, and the discovery of noreturn functions significantly improves the effectiveness of the static profile estimation. Error handling and sanity checking code is often discovered as cold.

Early optimization is of lesser importance on code bases that do not have significant abstraction or which are highly hand-optimized, such as the Linux kernel. The drawback is that most of the work done by the scalar optimizers needs to be re-done after inter-procedural optimization: inlining, improved aliasing, and other facts derived from the whole program make the scalar passes operate more effectively. On such code bases early inlining leads to slowdowns in compile time while yielding small benefits at best.

After early optimization, unreachable functions and variables are removed.

3.2 The whole program visibility pass

This is the first optimization pass run at WPA, when the callgraph of the whole program is visible. The pass decides which symbols are externally visible in the current unit (entry points). The decisions are rather tricky in detail and are briefly described in Section 2.3. The pass also identifies functions which are not externally visible and are only called directly (local functions). Local functions are later subject to more optimizations; for example, on i386 they can use register passing conventions.

Unreachable functions and variables are removed from the program. The removal of unreachable functions is re-done after each pass that might render more functions unreachable.

3.3 IPA profile propagation (ipa-profile)

Every function in GCC can be hot (when it is declared with the hot attribute or when profile feedback is present), normal, executed once (such as static constructors, destructors, main, or functions that never return), or unlikely executed.

This information is then used to decide whether to optimize for speed or code size. If optimization for size is not enabled, hot and normal functions are optimized for speed (except for their cold regions), while functions executed once are optimized for speed only inside loops. Unlikely executed functions are always optimized for size.

When profile feedback is not available, this pass attempts to promote the static knowledge based on the callers of the function.

• When all calls of a given function are unlikely (that is, either the caller is unlikely executed or the function profile says that the particular call is cold), the function is unlikely executed, too.

• When all callers are executed once or unlikely executed, and call the function just once, the function is executed once, too. This allows the number of executions of a function executed once to be bounded by a known constant.

• When all callers are executed only at startup, the function is also marked as executed only at startup. This helps to optimize the code layout of static constructors.

To optimize the program layout, the hot functions are placed in a separate subsection of the text segment (.text.hot). Unlikely functions are placed in the subsection .text.unlikely. For GCC 4.6.0 we will also place functions used only at startup into the subsection .text.startup.

While the pass provides an easy and cheap way to use the profile driven compilation infrastructure to save some code size, its benefits on large programs are small. The decisions on which calls are cold are too conservative to give substantial improvements on a large program.

For example, on Firefox, only slightly over 1000 functions are identified as cold, accounting for fewer than 1% of the functions in the whole program. The pass seems to yield substantial improvements only on small benchmarks, where code size is not much of a concern.

A more important effect of the pass is the identification of functions executed at startup, which we will discuss in Sections 3.5 and 4.4.


3.4 Constant propagation (ipa-cp)

GCC’s inter-procedural constant propagation pass [Ladelsky05] implements a standard algorithm using basic jump functions [Callahan]. Unlike the classical formulation, the GCC pass does not implement return functions yet. The pass also makes no attempt to propagate constants passed in static variables.

As an extension, the pass also collects a list of types of objects passed to arguments to allow inter-procedural devirtualization when all types are known and the virtual method pointers in their respective virtual tables match.

Finally, the pass performs cloning to allow propagation across functions which are externally visible. Cloning happens when all calls to a function are determined to pass the same constant, but the function can be called externally too. This makes the assumption that the user forgot the static keyword and all the calls actually come from the current compilation unit. The original function remains in the program as a fallback in case an external call happens.

More cloning would be possible: when a function is known to be used with two different constant arguments, it would make sense to produce two clones. This however does not fit the standard inter-procedural constant propagation formulation and is planned for a future function cloning pass.

A simple cost model is employed which estimates the code size effect of cloning. Cloning which reduces the overall program size (by reducing the sizes of call sequences) is always performed. Cloning that increases the overall code size is performed only at the -O3 compilation level and is bound by the function size and overall unit growth parameters.

Constant propagation is important for Fortran benchmarks, where it is often possible to propagate arguments used to specify loop bounds and to enable further optimization, such as auto-vectorization. SPECfp2006 has several benchmarks that benefit from this optimization. The pass is also useful to propagate symbolic constants, in particular this pointers when the method is only used on a single static instance of the object. Often various strings used for error handling are constant propagated as well.

So far relatively disappointing results are observed on the devirtualization component of the pass. Firefox has many virtual calls, and only about 200 calls are devirtualized this way. Note that prior to this pass, devirtualization is performed on a local basis during the early optimizations.

3.5 Constructor and destructor merging

In this simple pass we collect all static constructors and destructors of a given priority and produce a single function calling them all, which serves as a new static constructor or destructor. Inlining will later most likely produce a single function initializing the whole program.

This optimization was implemented after examining the disk access patterns at the startup of Firefox; see Section 4.4.

3.6 Inlining (ipa-inline)

The inliner, unlike the early inliner, has information about the current unit and its profile, either statically estimated or read as profile feedback. As a result it can make better global decisions than the early inliner.

The inliner is implemented as a pass which tries to do as much useful inlining as possible within the bounds given by several parameters: the unit growth limits the code size expansion of a whole compilation unit, the function growth limits the expansion of a single function (to avoid problems with non-linear algorithms in the compiler), and the stack frame growth limits the stack usage. Since the relative growth limits do not work well for very small units, they apply only when the unit, function, or stack frame is considered to be already large. All these parameters are user-controllable [GCCmanual].

The inliner performs several steps:

1. Functions marked with the always_inline attribute and all callees of functions marked by the flatten attribute [GCCmanual] are inlined.

2. Compilation unit size is computed.

3. Small functions are inlined. Functions are considered small until a specified bound on function body size is met. The bound differs for functions declared inline and functions that are auto-inlined. Unless -O3 or -finline-functions is in effect, auto-inlining is done only when doing so is expected to reduce the code size.

9

Page 10: Optimizing real-world applications with GCC Link Time ...sciencewise.info/media/pdf/1010.2196v2.pdf · the program to disk, 2. A new compiler front-end (the LTO front-end), which

This step is driven by a simple greedy algorithm which tries to inline functions in the order specified by the estimated badness until the limits are hit. After it is decided that a given function should be inlined, the badness of its callers and callees is recomputed. The badness is computed as:

• The estimated code size growth after inliningthe function into all callers, when this growthis negative,

• The estimated code size growth divided bythe number of calls, when profile feedback isavailable, or

• Otherwise it is computed by the formula

c · growth / benefit + growth_for_all

where benefit is the estimated speedup of inlining the call, growth is the estimated code size growth caused by inlining this particular call, growth_for_all is the estimated growth caused by inlining all calls of the function, and c is a sufficiently large magical constant.

The idea is to inline the most beneficial calls first, but also give some importance to the information on how hard it is to inline all calls of the given function.

Calls marked as cold by the profile information areinlined only when doing so is expected to reducethe overall code size.

At this step the inliner also performs inlining of recursive functions into themselves. For functions that have a large probability of (non-tail) self-recursion this brings benefits similar to loop unrolling.

4. Functions called once are inlined unless the function body or stack frame growth limit is reached, or the function is not inlinable for another reason.

The inliner is the most important inter-procedural optimization pass, and it is traditionally difficult to tune. The main challenge is to tune the inline limits to get reasonable benefits for the code size growth, and to specify the correct priorities for inlining. The requirements on inliner behavior depend on the particular coding style and the type of application being compiled.

GCC has a relatively simple cost metric compared to other compilers [Chakrabarti06, Zhao]. Some other compilers attempt to estimate, for example, the effect on instruction cache pollution and other parameters, combining them into a single badness value. We believe that these ideas have serious problems in handling programs with a large abstraction penalty. The code seen by the inliner is very different from the final code, and thus it is very difficult to get reasonable estimates. For example, the Tramp3d benchmark has over 200 function calls in the program before inlining for every operation performed at execution time by the optimized binary. As a result, inline heuristics have serious garbage-in garbage-out problems.

We try to limit the number of metrics we use for inlining in GCC. At the moment we use only code size growth and time estimates. We employ several heuristics predicting what code will be optimized out, and plan to extend them further in the future. For example, we could infer that functions optimize better when some of their operands are a known constant. We combine early inlining to reduce the abstraction penalty with careful estimates of the overall code size growth and dynamic updating of priorities in the queue. GCC takes into account when offline copies of the function will be eliminated. Dynamic updating of the queue improves the inliner’s ability to solve more complex scenarios over algorithms processing functions in a pre-defined order. On the other hand, this is a source of scalability problems, as the number of callers of a given function can be very large, so the badness computation and priority queue maintenance have to be efficient. The GCC inliner seems to perform well when compared with implementations in other compilers, especially on benchmarks with a large C++ abstraction penalty.

3.7 Function attributes (ipa-pure-const)

This is equivalent to the pass described in Section 3.1, but it propagates across the callgraph and is thus able to propagate across the boundaries of the original source files as well as handle non-trivial recursion.

3.8 MOD/REF analysis (ipa-reference)

This pass [Berlin] first identifies static variables whichare never written to as read-only and static variableswhose address is never taken by a simple analysis ofthe IPA reference information.


For static variables that do not have their address taken (non-escaping variables), the pass then collects information on which functions read or modify them. This is done by simple propagation across the callgraph, with strongly connected regions being reduced; the final information is stored in bitmaps which are later used by the alias analysis oracle.

To improve the quality of the information collected, a new function attribute, leaf, was introduced. This attribute specifies that a call to an external leaf function may return to the current module only by returning or via exception handling. As a result, calls to leaf functions can be considered as not accessing any of the non-escaping variables.

This pass has several limitations. First, the bitmaps tend to be quadratic and should be replaced by a different data structure. There are also a number of possible extensions to the granularity of the information collected: the pass should not work at the level of variables, but instead analyze fields of structures independently. Also, the pass should not give up when the address of a variable is passed to, e.g., a memset call.

It is expected that the pass will be replaced by the inter-procedural points-to analysis once it matures.

MOD/REF is effective for some Fortran benchmarks. On programs written in a modern paradigm it suffers from the lack of static variables. The code quality effect on Firefox and GCC is minimal: enabling the optimization saves about 0.1% of the GCC binary size.

3.9 Function reordering (ipa-reorder)

This is an experimental pass we implemented while analyzing Firefox startup problems. It specifies the order of the functions in the final binary by concatenating the callgraph functions in priority order, where the priority is given by the likelihood that one function will call another. The pass increases code locality and thus reduces the number of pages that need to be read at program startup.

During typical Firefox startup, poor code locality causes 84% of the .text section to be paged in by the Linux kernel, while only 19% is actually needed for program execution. Thus there is room for up to a 3× improvement in library loading speed and memory usage [Glek]. At the time of writing the paper we cannot demonstrate consistent improvements in Firefox. When starting the GCC binary, the operating system needs to read about 3% fewer pages. It is not decided yet whether the pass will be included in the GCC 4.6.0 release.

3.10 Other experimental passes

GCC also implements several optimization passes that are not yet ready for compiling larger applications, in particular inter-procedural points-to analysis, a structure reordering pass [Golovanevsky], and a matrix reorganization pass [Ladelsky07].

3.11 Late local and inter-procedural optimization

At the LTRANS stage, the functions of a given callgraph partition are compiled in the reverse postorder of the callgraph (unless the function reordering pass is enabled). This order allows GCC to pass down certain information from callees to callers. In particular, we re-do function attribute discovery and propagate the stack frame alignment information. This reduces the resulting size of the binary by an additional 1% (on both the Firefox and GCC binaries).

4 Compiling large applications

In this section we describe major problems observed while building large applications. We discuss the performance of the GCC LTO implementation and its effects on the compiled applications.

Ease of use is a critical consideration for compiler features. For example, although compiling with profile feedback can yield large performance improvements in many applications [Hubicka05], it is used by few software packages. Even programs which might easily be made to benefit from profile-guided optimizations, such as scripting language interpreters, have not widely adopted this feature. One of the main design goals of the LTO infrastructure was to integrate as easily as possible into existing build setups [LTOproposal, Briggs]. This has been partially met. In many cases it is enough to add -flto -fwhole-program as command line options.

In more complex packages, however, the user still needs to understand the use of -fuse-linker-plugin (to enable the linker plugin supporting object archives and better optimization) as well as -fwhole-program to enable the whole program assumptions. GCC 4.6.0 will enable the WHOPR mode via the -flto command line option. The user then needs to specify the parallelism via -flto=n. We plan to add auto-detection of GNU Make that allows parallel compilation via its job server (controlled by the well-known -j command line option). The job server can be detected using an environment variable, but it requires the user to add + at the beginning of the Makefile rule.
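A minimal build sketch under these assumptions (the file names are hypothetical; the flags are the ones described above): the IL is streamed into the object files at compile time, and the link step re-optimizes the whole program with 8 parallel LTRANS jobs.

```shell
# Compile with GIMPLE streamed into the object files.
gcc -O2 -flto -c main.c
gcc -O2 -flto -c util.c

# Link with the linker plugin and whole-program assumptions,
# running 8 parallel LTRANS compilations.
gcc -O2 -flto=8 -fuse-linker-plugin -fwhole-program -o app main.o util.o
```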

In this section we concentrate mostly on our experiences building Firefox and GCC itself with link-time optimization.

Firefox is a complex application. It consists of dozens of libraries totaling about 6 million lines of C and C++ code. In addition to being a large application, it is used heavily by desktop users.

Many libraries built as part of Firefox are developed by third parties with independent coding standards. As such they stress areas of the link-time optimization infrastructure not exercised at all by portable C and C++ programs such as the ones present in the SPEC2006 benchmark suite. For this reason we chose Firefox as a good test of the quality and practical usability of GCC LTO support. We believe that by fixing the numerous issues arising during the Firefox build we also enabled GCC to build many other large applications.

On the other hand, the GCC compiler itself is a portable C application. The implementation of its main module, the compiler binary itself, consists of about 800,000 lines of hand-written C code and about 500,000 lines of code auto-generated from the machine description. We test the effect of link-time optimization on GCC itself especially because the second author is very familiar with the code base.

At the time of writing this paper, both GCC and Firefox compile and work with LTO. Firefox requires minor updates to the source code; in particular, we had to annotate the variables and functions used by asm statements. This is done with the attribute used [GCCmanual].

4.1 Compilation times

Compilation time increases are always noticeable when switching from normal compilation to the link-time optimizing environment. Linking is more difficult to distribute and parallelize than the compilation itself. Moreover, during development, the program is re-linked many times after modifications to some of the source files. In the file-by-file compilation mode only modified files are re-compiled, and re-linking is relatively fast. With LTO most of the optimization work is lost and all optimizations at link time have to be redone.

With the current GCC implementation of LTO, the overall build time is expected to at least double. This is because of the use of “fat” object files. This problem will be solved soon by the introduction of “slim” object files.

The actual expense of streaming the IL to object files is minor during the compilation stage, so we focus on actual link times. We use a 24-core AMD workstation for our testing. The “fat” object files are about 70% larger than object files which contain assembler code only. (GCC uses zlib to compress the LTO sections.)

4.1.1 GCC

Linking GCC in single-CPU LTO mode needs 6 minutes and 31 seconds. This is similar to the time needed for the whole non-LTO compilation (8 minutes and 12 seconds). Consequently the time needed to build the main GCC binary from scratch is about 15 minutes.

The overall increase of the build time of the GCC package is bigger. Several compiler binaries are built during the process and they are all linked with a common backend library. With link-time optimization the backend library is thus re-optimized several times. We do not count this and instead measure the time needed to build only one compiler binary.

The most time-consuming steps of the link-time compilation are the following:

• Reading the intermediate language from the object files into the compiler: 3% of the overall compilation time.

• Merging of declarations: 1%.

• Outputting of the assembly file: 2%.

• Debug information generation (var-tracking and symout): 8%.


• Garbage collection: 2%.

• Local optimizations consume the majority of the compilation time.

The most expensive components are: partial redundancy elimination (5%), GIMPLE to RTL expansion (8%), RTL level dataflow analysis (11%), instruction combining (3%), register allocation (6%), scheduling (5%).

The actual inter-procedural optimizations are very fast, the slowest being the inline heuristics. It still accounts for less than 1% of the compilation time (1.5 seconds).

In WHOPR mode, using all 24 cores of our testing workstation, we reduce the link time to 48 seconds. This is also faster than the parallelized non-LTO compilation, which needs 56 seconds. As a result, even in a parallel build setup, LTO accounts for a two-fold slowdown. The slower non-LTO build time is partly caused by the existence of large, auto-generated source files, such as insn-attrtab.c, which reduce the overall parallelism. WHOPR linking has the advantage of partitioning insn-attrtab.c into multiple pieces.

The serial WPA stage takes 19 seconds. The most ex-pensive steps are:

• Reading global declarations and types: 28% of theoverall time taken by the WPA stage.

• Merging declarations: 6%.

• Inter-procedural optimization: 9%.

• Streaming of object files to be passed to LTRANS:42%.

The rest of the compilation process, including all parallel LTRANS stages and the actual linking, consumes 29 seconds.

4.1.2 Firefox

Linking Firefox in single-CPU LTO mode needs 19 minutes and 29 seconds (compared to 39 minutes needed to build Firefox from scratch in non-LTO mode). The time is distributed as follows:

• Reading the intermediate language from the object files into the compiler: 7%.

• Merging of declarations: 4%.

• Output of the assembly file: 3%.

• Debug information generation is disabled in ourbuilds.

• Garbage collection: 2%.

• Local optimizations consume the majority of the compilation time.

The most expensive components are: operand scan (5%), partial redundancy elimination (5%), GIMPLE to RTL expansion (13%), RTL level dataflow analysis (5%), instruction combining (3%), register allocation (9%), scheduling (3%).

The WHOPR mode reduces the overall link time to 5 minutes and 30 seconds, of which the WPA stage takes 4 minutes and 24 seconds. This compares favorably to the non-LTO parallel compilation, which takes 9 minutes and 38 seconds. The source code of Firefox is organized into multiple directories, leading to less parallelism exposed to Make. The most expensive steps are:

• Reading global declarations and types: 24%.

• Merging declarations: 20%.

• Inter-procedural optimization: 8%.

• Streaming of object files to be passed to LTRANS:28%.

• Callgraph and WPA overhead (callgraph mergingand partitioning): 12%.

The fact that link-time optimization seems to scale linearly and maintain a two-fold slowdown can be seen as a success. The WHOPR mode successfully enables GCC to use parallelism to noticeably reduce the build time. It is however obvious that the intermediate language input and output is a bottleneck. We can address this problem in two ways. First, we can reduce the number of global types streamed, by separating debug information and analyzing the reason why so many types and declarations are needed. Second, we can optimize the on-disk representation. By solving these problems, the WPA stage can become several times faster, since the actual inter-procedural optimization seems to scale very well. Shortly before finishing the paper, Richard Günther submitted a first patch reducing the number of declarations at the WPA stage to about one quarter. This demonstrates that there is quite a lot of room for improvement left here.

Because WHOPR makes linking faster than the time needed to build the non-LTO application, there is hope that with the introduction of “slim” objects, LTO build times will actually be shorter than non-LTO ones for many applications. This is because the slow code generation is better distributed to multiple CPUs with WHOPR than with the average parallel build machinery.

A linking time comparable to the time needed to rebuild the whole application is still very negative for the edit/recompile/link experience of developers. It is not expected that developers will use LTO optimization at all stages of development, but optimizing for quick re-linking after a local modification is still important. We plan to address this by the introduction of an incremental WHOPR mode, as discussed in Section 5.

4.2 Compile time memory usage

A traditional problem in GCC is the memory usage of its intermediate languages. Both GIMPLE and RTL require an order of magnitude more memory than the size of the final binary. While the intermediate languages need to keep more information than the actual machine code, other compilers achieve this with much slimmer ILs [Lattner]. In addition to several projects to reduce the memory usage of GCC (see, for example, [Henderson]), this was also one of the motivations for designing WHOPR to avoid the need to load the whole program into memory at once.

In LTO mode memory usage peaks at 2GB for the GCCcompilation and 8.5GB on the Firefox compilation.

In WHOPR mode, the WPA stage operates only on the callgraph and optimization summaries, while the LTRANS stage sees only parts of the program at a time. In our tests we configure WHOPR to split the program into 32 partitions. Compiling large programs is currently dominated by the memory usage of the WPA stage: in a 64-bit environment the memory usage of the compilation of the GCC binary peaks at 415MB, while the compilation of Firefox peaks slightly over 4GB. The actual LTRANS compilations do not consume more than 400MB, averaging 120MB for the GCC compilation and 300MB for Firefox.

The main binary of the GCC compiler (cc1) is 10MB, so the GCC compile-time memory consumption is still about 50 times larger than the size of the program. Compiling Firefox uses about 130 times more memory than the size of the resulting binary. This is a serious problem especially for a 32-bit environment, where the memory usage of a single process is bound by the address space size. In 32-bit mode, GCC barely fits in the address space when compiling Firefox!

The main sources of the problem are the following:

• GCC maps the source object files into memory. This is fast, but careless of address space limits in 32-bit environments. For GCC this accounts for about 170MB of the address space.

It is not effective to open one file at a time, since thefiles are read in several stages, each stage accessingall files. Clearly a more effective scheme is stillpossible.

• A large amount of memory is occupied by the representation of types and declarations. Many of these are not really needed at the WPA stage and should be streamed independently. For GCC this accounts for 260MB of memory.

• The MOD/REF pass has a tendency to create overly large bitmaps. This is not a problem when building GCC or Firefox, but it can be observed on some benchmarks in the SPEC2006 test-suite, where over 100MB of bitmaps are needed.

Note that a relatively small amount of memory (about 52MB in the compilation of GCC) is used by the actual callgraph, varpool, and other data structures used for the inter-procedural optimization.

The memory distribution of the Firefox compilation is very similar to the one seen for the GCC compilation. The percentage of memory used by declarations and types is even higher, about 3.7GB, because the C++ language implies more types and longer identifiers.

Similarly to the compilation time analysis, we can iden-tify declarations and types as being a major problem forscalability of GCC LTO.


benchmark name    speedup
dromaeo css        1.83%
tdhtml            -0.54%
tp_dist            0.50%
tsvg               0.07%

Figure 1: Firefox performance.

4.3 Code size and quality

Link time optimization promises both performance improvements and code size reductions. It is not difficult to demonstrate benchmarks where cross-module inlining and constant propagation cause substantial performance improvements. However, in applications that have been profiled and hand-optimized for a single-file compilation model (as is the case for both Firefox and GCC), the actual performance improvements are limited. This is because the authors of the software have already done by hand most of the work the inter-procedural optimizer would otherwise do.

In the short term, we expect the value of link-time optimization on such applications to lie primarily in the reduction of the code size. It is harder to use hand optimization to reduce the overall size of an application than to increase its performance. Most programs spend most of their time in a rather small portion of the code, so one can optimize only the hot code for speed. But to decrease the overall program size, one must tune the whole application.

Once more widely adopted, link-time optimization will simplify the task of developers by reducing the amount of hand-tuning needed. For example, with LTO it is not necessary to place short functions into headers for better inlining. This allows a cleaner cut between interface and implementation.

4.3.1 Firefox

When compiled with link-time optimization (using the -O3 optimization level), the size of the Firefox main module is reduced from 33.5MB to 31.5MB, a reduction of 6%.

Runtime performance tests comparing Firefox built without LTO to Firefox built with LTO using the -O3 optimization level are shown in Figure 1.

When optimizing for size (-Os), early reports on building Firefox with the LLVM compiler and LTO enabled claim a 13% code size saving compared to a GCC non-LTO build with the same settings [Espindola]. Comparing the GCC non-LTO build (28.2MB) with the GCC LTO build (25.3MB) at -Os, we get an 11% smaller binary.

We also observed that further reductions of the code size are possible by limiting the overall unit growth (the --param inline-unit-growth parameter). We found that limiting overall growth to 5% gives considerable code size savings (an additional 12%), while keeping most of the performance benefits of -O3. Since this generally applies to other big compilation units too, we plan to re-tune the inliner to automatically cut the code size growth with an increasing unit size.

The non-LTO -O3 build is 18% bigger than the non-LTO -Os build. Consequently, enabling LTO (and tweaking the overall program growth) has a code size effect comparable to switching from aggressive optimization for speed to aggressive optimization for size. LTO, however, has positive performance effects, while -Os is reported to be about 10%–17% slower.

4.3.2 GCC

GCC by default uses the -O2 optimization level to build itself. When compiled with link-time optimization, the GCC binary shrinks from 10MB to 9.3MB, a reduction of 7%. The actual speedups are small (within noise level) because during the work on link-time optimization we carefully examined the possibilities for cross-module inlining and reorganized the sources to make them possible in the single-file compilation mode, too.

Some improvements are seen when GCC is compiled with -O3. The non-optimizing compilation of C programs is then 4% faster. The binary size is 11MB.

4.3.3 SPEC2006

For reference we include SPEC2006 results on an AMD64 machine, comparing the options -O3 -fpeel-loops -ffast-math -march=native


             speedup   size
perlbench     +1.4%    +4%
bzip2         +2.6%   -45%
gcc           -0.3%   +1.2%
mcf           +1.9%   -33%
gobmk         +3.4%   +1.8%
hmmer         +0.8%   -55%
sjeng         +1.2%   -11%
libquantum    -0.5%   -61%
h264ref       +7.0%    -9%
omnetpp       -0.8%   -11%
astar         -1.3%   -20%

Figure 2: SPECint 2006.

with the same options plus -flto -fwhole-program.

In Figures 2 and 3 the first column compares SPEC rates (bigger is better), and the second column compares executable sizes (smaller is better).

The results show significant code size savings derived from improved inlining decisions and the whole program assumption (especially in code bases that do not use the static keyword in declarations consistently). The performance is also generally improved. The bwaves and leslie3d benchmarks demonstrate a problem in the GCC static profile estimation algorithm where inlining too many loops together causes a hot part of the program to be predicted cold. The results in parentheses show the results with hot/cold decisions disabled.

We did not analyze the regressions in astar or zeusmp yet. We also excluded the xalancbmk and dealII benchmarks, since they do not work with the current GCC build.⁴

4.4 Startup time problems

One of Firefox's goals is to start quickly. We devote a special section to this problem because it is often overlooked by tool-chain developers. Startup time issues are not visible in common benchmarks, which generally consist of small- to medium-sized applications.

⁴ Both benchmarks were working with GCC LTO in the past.

            speedup       size
bwaves       0% (+15%)   -27%
gamess      -0.7%        -50%
milc        +2.2%        -26%
zeusmp      +0.4%        -27%
gromacs      0%          -18%
cactusADM   -0.8%        -42%
leslie3d    -2.1% (0%)   +0.6%
namd         0%          -40%
soplex      +1.5%        -50%
povray      +5%          -2.3%
calculix     1.1%        -38%
GemsFDTD     0%          -70%
tonto       -0.2%        -25%
lbm         +3.2%         0%
wrf          0%          -36%
sphinx3     +2.9%        -32%

Figure 3: SPECfp 2006.

Unfortunately, it turns out that the current Linux + GNU tool-chain is ill-suited for starting large applications efficiently. Many other open source programs of Firefox's size (e.g. OpenOffice, Chromium, Evolution) suffer from various degrees of slow startup.

There are various ways to measure startup speed. In this paper we focus on cold startup as a worst-case scenario.

4.4.1 Overview of Firefox startup

The firefox-bin "stub" calls into libxul.so, which is a large library that implements most of Firefox's functionality. For historical reasons the rest of Firefox is broken up into 14 other smaller libraries (e.g. the nspr portability library and the nss security libraries). Additionally, Firefox depends on a large number of X/GTK/GNOME libraries.

Components of Firefox startup

The Firefox startup can be categorized into the following phases:

1. Kernel loads the executable.

2. The dynamic library loader then loads all of the prerequisite libraries.


3. For every library loaded, initializers (static constructors) are run.

4. Firefox's main() and the rest of the application code is run.

5. The dynamic library loader loads additional li-braries.

6. Finally, a browser window is shown.

It may come as a surprise that steps 2–3 currently dominate Firefox startup on Linux. Since this is a cold startup, IO dominates. There are two kinds of IO: explicit IO done via read() calls, and implicit IO facilitated by mmap().

Most of the overhead comes from not using mmap() carefully. This IO is triggered by page faults, which are essentially random IO. Typically a page fault causes 128KB of IO around the faulted page. For example, it takes 162 page faults (20MB/128KB) to page in the .text section of libxul.so, Firefox's main library. Each page fault incurs a seek followed by a read. Hard drive manufacturers specify disk seeks ranging from 5ms to 14ms (7200 to 4200 RPM).⁵ In practice a missed seek seems to cost 20-50ms.

Modern storage media excels at bulky IO, but randomIO in small chunks keeps devices from performing attheir best.

Figure 4 illustrates the page faults which occur while loading libxul.so from disk.

We now examine each of the stages of loading Firefox in more detail:

Loading firefox-bin

This is cheap because firefox-bin is a small executable, weighing in at only 48K. This is smaller than Linux's readahead, so loading firefox-bin only requires a single read from disk. The dynamic loader then proceeds to load the various libraries that firefox-bin depends on.

Dynamic linker (ld.so)

⁵ SSDs do not suffer from disk seek latency. However, there are still IO delays ranging from 0.1ms to 2s depending on the types of flash and controllers used [Anandtech].

Figure 4: Page faults (file offset vs. time, over the 21MB file) that occur while loading libxul.so from disk: 15% relocations, 46% static initializers, 6% misc runtime linker.

In the shared-lib-happy Linux world, the runtime linker, ld.so, loads a total of 105 libraries on firefox-bin's behalf. It is essential that ld.so carefully avoids performing unnecessary page faults and that the compile-time linker arranges the binary to facilitate that.

ld.so loads dependent libraries, sets up the memory mappings, zeroes .bss,⁶ etc. As seen in Figure 4, relocations are the most expensive part of library loading. Prelink addresses some of the relocation overhead on distributions like Fedora.

⁶ .bss follows .data, which usually does not end on a page boundary.


Static Initializers

Once a library is loaded by ld.so, the static initializers for that library are enumerated from the .ctors section and executed.

In Firefox, static initializers arise from initializing C++ globals with non-primitives. Most of the time they are unintentional, and can even be caused by including standard C++ headers (e.g. iostream). Better inlining of simple constructors into POD initialization could alleviate this problem.

Compounding the problem of unintentional static initializers is the fact that GCC treats them inefficiently. GCC creates a static initializer function and a corresponding .ctors entry for every compilation unit with static initializers. When the linker concatenates the resulting object files, this has the effect of evenly spreading the static initializers across the entire executable.

The GCC runtime then reads the .ctors entries and executes them in reverse order. The order is reversed to ensure that any statically linked libraries are initialized before code that depends on them.

The combined problem of the abundance of C++ static initializers, their layout in the program, and the lack of special treatment by the linker means that executing the .ctors section of a large C++ program likely causes its executable to be paged in backwards!

We noticed that, for example, the Microsoft C++ compiler and linker group static initializers together to avoid this problem.

Application Execution

A nice property of static initializers is that the order of their execution is known at compile time. Once the application code starts executing, .text is paged in in an essentially random pattern. During file-by-file compilation, functions are laid out based on the source file implementing them, with no relation to their callgraph relationships. This results in poor page cache locality.

4.4.2 Startup time improvements within reach of the LTO infrastructure

While working on enabling LTO compilation of Firefox, we also experimented with several simple optimizations targeted at improving startup time. The most obvious transformation is to merge the static initializers into a single function for better code locality. This is implemented as a special constructor merging pass; see Section 3.5.

This transformation almost eliminates the part of the disk access graph attributed to the execution of static initializers. Ironically, for Firefox itself this only delays most of the disk accesses to a later stage of the startup, because a lot of code is needed for the rest of the startup process. Other C++ applications, however, suffer from the same problem. If an application does less work during the rest of its startup, its startup time will benefit noticeably.

To further improve the code locality of the startup procedure, we implemented a function reordering pass; see Section 3.9. Unfortunately, we were not yet able to show consistent improvements from this pass on Firefox itself. The problem is that the compiler, beyond the execution of static initializers, has little information about the rest of the startup process, and just grouping functions based on their relative references seems not to interact well with the kernel's readahead strategy.

To improve the static analysis, further work will be needed to track virtual calls. The design of the Firefox APIs allows a lot of devirtualization which GCC is not currently capable of. When devirtualization fails, we can produce speculative callgraph edges from each virtual call to every virtual method of a compatible type (may edges) and use them as hints for ordering. Clearly, may edges are one of the main missing parts of the GCC inter-procedural optimization infrastructure. It is, however, not clear how much benefit these methods will have in practice, as the actual problem of lacking knowledge of the startup procedure remains. Improving GCC's devirtualization capabilities alone would, however, be an important improvement in its own right.

Locality is also improved by aggressive inlining of functions which are called once.

In the near future we plan to experiment more with profile-feedback-directed optimization in combination with LTO. With profile feedback available, the actual problem of ordering functions for better startup time is a lot easier: all we need is to extend the program instrumentation to record the time when a given function was invoked first and order functions according to their invocation times.


Finally, data layout can be optimized. Data structures with static constructors can be placed at a common location to reduce the amount of paging as well.

5 Conclusion

The link-time optimization infrastructure in GCC is mature enough to compile large real-world applications. The code quality improvements are comparable with other compilers.

Unlike heavily benchmark-optimized compilers, GCC usually produces smaller binaries when compiling with LTO support than without. We expect the code size effect to be one of the main selling points of LTO support in GCC in the near future. Code size and code locality are both very important factors affecting the performance of large, real-world applications. The link-time optimization model allows the compiler to make substantial improvements in this area. The size of the Firefox binary built with link time optimization for speed is comparable to the size of the Firefox binary built with file-by-file optimization for size.

LTO also brings runtime performance improvements. The magnitude of these improvements largely depends on how much the benchmarked application was profiled and hand-tuned for the single-file compilation model. Many of the major software packages today went through this tuning. Consequently, the immediate benefits of classical inter-procedural optimizations are limited. Both GCC and Firefox show a runtime improvement of less than 1% with LTO.

GCC also exposes a lack of tuning for the new environment, as seen in two larger regressions in the SPECfp 2006 benchmark. We hope to re-tune the inliner and profile estimation before GCC 4.6.0 is released. More benefits are possible when the compiler enables some of its more aggressive optimizations (such as code-size-expanding auto-inlining) by default. This can be done because the compiler is significantly more aware of the global tradeoffs at link time than at compile time. Enabling automatic inlining with a small overall program growth limit (such as 5%) improves GCC performance by 4%. This suggests that in the future GCC should enable some of those code expanding optimizations by default during link-time optimization, with the overall code size growth bounded to reasonable settings.

GCC also lacks some of the more advanced inter-procedural optimizations available in other compilers. To make our inter-procedural optimizer complete, we should introduce more aggressive devirtualization, points-to analysis, a function specialization pass, and a pass merging functions with identical bodies. The callgraph module also lacks support for may edges representing possible targets of indirect calls. There is also a lot of potential in implementing data structure layout optimizations [Chakrabarti08], such as structure field reordering for better locality. Some of these passes already exist in the form of experimental passes [Golovanevsky, Ladelsky07]. Getting these passes into production quality will involve a lot of work.

More performance improvements are possible by a combination of link time optimization and profile-feedback-directed optimization [Li]: link-time optimization gives the compiler more freedom, while profile feedback tells the compiler more precisely which transformations are beneficial. Immediate benefits can be observed, for example, in the quality of inlining decisions, code layout, speculative devirtualization, and code size. GCC has profile feedback support [Hubicka05] and it is tested to work with link time optimization. A more detailed study of the benefits is, however, out of the scope of this paper.

There are a number of remaining problems. The on-disk representation of the intermediate language is not standardized at all. This implies that all files need to be recompiled when the compiler or compiler options change, which limits the possibilities of distributing LTO-compiled libraries. The memory usage is comparable with other compilers (such as MSVC), yet a number of improvements are possible, especially by reducing the number of declarations and types streamed.

A GCC-specific feature, the WHOPR mode, allows parallel compilation. Unlike a multi-threaded compilation model, it allows distributed compilation, too, although it is questionable how valuable this is, since today's and tomorrow's workstation machines will likely have a good deal of parallelism available locally. Increasing the number of parallel compilations past the 24 we used in our testing would probably have few benefits, since the serial WPA stage would likely dominate the compilation.

The WHOPR mode makes a clean cut between inter-procedural propagation and local compilation using optimization summaries. This invites the implementation of an incremental mode, where not all of the work is re-done during recompilation: only the inter-procedural passes would be re-run, and the assembly code of functions whose body and summary did not change would be reused from the previous run. This is planned for the next GCC releases.

Still, before GCC 4.6.0 is released (and probably even for GCC 4.7.0), there is a lot of work left to do on the correctness and feature completeness of the basic link-time optimization infrastructure. The main area lacking is debugging information, which is a lot worse than in file-by-file compilation. At the time of writing this paper, enabling debug information also leads to a compiler crash when building Firefox. Clearly this is an important problem, as no major project will use a compiler that does not produce usable debug information for building its official binaries.

6 Acknowledgment

We thank Andi Kleen and Justin Lebar for several remarks and corrections which significantly improved the quality of this paper.

References

[Ball] T. Ball, J. Larus, Branch prediction for free, Proceedings of the ACM SIGPLAN 1993 Conference on Programming Language Design and Implementation, June 1993.

[Briggs] P. Briggs, D. Evans, B. Grant, R. Hundt, W. Maddox, D. Novillo, S. Park, D. Sehr, I. Taylor, O. Wild, WHOPR — Fast and Scalable Whole Program Optimizations in GCC, initial draft, Dec 2007.

[Berlin] D. Berlin, K. Zadeck, Compilation wide alias analysis, Proceedings of the 2005 GCC Developers' Summit.

[Callahan] D. Callahan, K. D. Cooper, K. Kennedy, Interprocedural constant propagation, ACM SIGPLAN Notices, Volume 21, Issue 3, 1986.

[Chakrabarti08] G. Chakrabarti, F. Chow, Structure Layout Optimizations in the Open64 Compiler: Design, Implementation and Measurements, Open64 Workshop at CGO, 2008.

[Chakrabarti06] D. R. Chakrabarti, S. M. Liu, Inline analysis: Beyond selection heuristics, Proceedings of the International Symposium on Code Generation and Optimization, 2006.

[Cytron] R. Cytron, J. Ferrante, B. K. Rosen, M. N. Wegman, F. K. Zadeck, Efficiently computing static single assignment form and the control dependence graph, ACM Transactions on Programming Languages and Systems, Volume 13, 1991.

[Drepper] U. Drepper, How to write shared libraries.

[Espindola] R. Espindola, LTO, plugins and binary sizes, email sent to the LLVM Developers Mailing List, 9 Oct 2010.

[Glek] T. Glek, libxul.so startup coverage, blog entry http://glandium.org/blog/?p=1105.

[Golovanevsky] O. Golovanevsky, A. Zaks, Struct-reorg: current status and future perspectives, Proceedings of the 2007 GCC Developers' Summit.

[Günther] R. Günther, Three-dimensional Parallel Hydrodynamics and Astrophysical Applications, PhD thesis, Eberhard-Karls-Universität, 2005.

[Henderson] R. Henderson, A. Hernandez, A. Macleod, D. Novillo, Tuple Representation for GIMPLE.

[Hubicka04] J. Hubicka, Call graph module in GCC, Proceedings of the 2004 GCC Developers' Summit.

[Hubicka05] J. Hubicka, Profile driven optimisations in GCC, Proceedings of the 2005 GCC Developers' Summit.

[Hubicka06] J. Hubicka, Interprocedural optimization on function local SSA form, Proceedings of the 2006 GCC Developers' Summit.

[Hubicka07] J. Hubicka, Interprocedural optimization framework in GCC, Proceedings of the 2007 GCC Developers' Summit.

[Jambor] M. Jambor, Interprocedural optimizations of function parameters, Proceedings of the 2009 GCC Developers' Summit, 53–64.


[Ladelsky05] R. Ladelsky, M. Namolaru, Interprocedural constant propagation and method versioning in GCC, Proceedings of the 2005 GCC Developers' Summit.

[Ladelsky07] R. Ladelsky, Matrix flattening and transposing in GCC, Proceedings of the 2007 GCC Developers' Summit.

[Lattner] C. Lattner, V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, International Symposium on Code Generation and Optimization (CGO'04), 2004.

[Li] D. X. Li, R. Ashok, R. Hundt, Lightweight feedback-directed cross-module optimization, Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2010.

[Novillo] D. Novillo, Design and implementation of Tree SSA, Proceedings of the 2004 GCC Developers' Summit.

[Zhao] P. Zhao, J. N. Amaral, To inline or not to inline? Enhanced inlining decisions, Languages and Compilers for Parallel Computing, 2004.

[Plugin] http://gcc.gnu.org/wiki/LinkTimeOptimization

[Anandtech] http://www.anandtech.com/show/2738/17

[LTOproposal] Link-Time Optimization in GCC: Requirements and High-Level Design, November 15, 2005. http://gcc.gnu.org/projects/lto/lto.pdf

[GCCmanual] Using the GNU Compiler Collection (GCC).
