
Structure Layout Optimization for Multithreaded Programs

Easwaran Raman∗
Department of Computer Science, Princeton University
[email protected]

Robert Hundt, Sandya Mannarswamy
Java, Compilers, and Tools Laboratory, Hewlett-Packard Company
{robert.hundt, sandya.s.mannarswamy}@hp.com

Abstract

Structure layout optimizations seek to improve runtime performance by improving data locality and reuse. The structure layout heuristics for single-threaded benchmarks differ from those for multi-threaded applications running on multiprocessor machines, where the effects of false sharing need to be taken into account. In this paper we propose a technique for structure layout transformations for multi-threaded applications that optimizes simultaneously for improved spatial locality and reduced false sharing. We develop a semi-automatic tool that produces actual structure layouts for multi-threaded programs and outputs the key factors contributing to the layout decisions. We apply this tool to the HP-UX kernel and demonstrate the effects of these transformations on a variety of already highly hand-tuned key structures with different sets of properties. We show that naïve heuristics can result in massive performance degradations on such a highly tuned application, while our technique generally avoids those pitfalls. The improved structures produced by our tool improve performance by up to 3.2% over a highly tuned baseline.

1 Introduction

The importance of optimizations that specifically target multithreaded programs is increasing, particularly with the emergence of multicore processors. The execution time of a multithreaded program is highly influenced by its memory performance. While memory accesses are often a major source of performance bottlenecks even in uniprocessor (UP) systems, additional complexities are introduced for multiprocessor (MP) systems. In such systems, the data processed by a thread or process may be in the local caches, in remote caches, or in the main memory, which itself may be distributed. This causes a large variability in memory access latencies. As the number of processors in an MP system increases, the variability also increases. Because of this added complexity, solutions that apply to the uniprocessor case may not work on MP systems and may even aggravate the problem.

∗Work done while the author was an intern at HP

Data layout transformations [2, 3, 4, 8, 11, 14, 18, 19] are a class of optimizations that seek to improve the memory performance of applications by controlling the way data is arranged in memory. Data layout transformations include global variable layout, stack layout, heap layout, and structure layout optimizations, the focus of this work. The layout of fields within a structure/record often has a significant effect on the memory performance of a program. In the case of single-threaded programs, the layout of a structure can increase or decrease spatial locality. There are many techniques that optimize the placement of fields within a structure. These include structure splitting, structure peeling, field reordering, dead field removal, etc. These techniques use various heuristics to improve locality. For instance, a common heuristic is to simply separate hot and cold fields so that cold fields do not pollute the cache lines containing hot fields. Some of them use the notion of affinity between field accesses. Fields f1 and f2 have a strong affinity to each other if they are often accessed close to each other. Consider the code fragment in Figure 1. The loop traverses an array of type S and accesses the fields f1 and f2, resulting in a strong affinity between f1 and f2. Placing affine fields in the same cache line would improve spatial locality. In this example, if type S spans multiple cache lines, placing f1 and f2 in the same cache line is beneficial.

For multithreaded programs running on multiprocessor systems, a new dimension to this problem is introduced in the form of false sharing. On multiprocessor systems, shared data are usually replicated across the local caches of all processors to improve locality. The replication leads to the issue of cache coherence, which is the problem of ensuring that whenever shared data in a local cache is modified, the modification is made visible to all the other processors. This is ensured by cache coherence protocols such as MESI, MSI, MOSI, or MOESI [15]. These protocols operate at cache line granularity; the Intel Itanium processors implement hardware cache coherence with a coherence block being the size of an L2 cache line (128 bytes). The coherence protocol does not distinguish between individual bytes within a coherence block. Thus, a modification of even a single byte of the block causes all other copies of the block to be invalidated or updated. When multiple processors reference disjoint parts of a coherence block it leads to unnecessary coherence operations, an effect commonly known as false sharing [1]. The costs of false sharing can be quite substantial, depending on the machine configuration. The cost can vary from the order of an L2 miss latency on a small bus-based system with 2-4 processors to more than 1000 cycles on large servers with 64 or 128 processors, such as the HP Superdome.

In order to avoid false sharing of structures, the heuristics for automatic structure layout in programs running on multiprocessor environments differ significantly from those that run in uniprocessor environments. If there are two threads T1 and T2 that execute concurrently such that T1 writes to the field f1 and T2 reads from f2, then there is false sharing between f1 and f2. Decreasing false sharing requires placing f1 and f2 in separate cache lines. If the same program also executes the function foo in Figure 1, which accesses f1 and f2 together, then placing f1 and f2 in separate cache lines hurts foo since spatial locality is lost. Thus, the challenge is to simultaneously optimize for both spatial locality and false sharing.
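To make the tension concrete, the following minimal sketch (not from the paper; the struct definitions, field names, and the 128-byte line size are assumptions for illustration) contrasts the two layouts: in struct hot, a writer thread updating f1 keeps invalidating the line that a reader thread needs for f2, while in struct padded the two fields sit on separate lines and the threads no longer interfere. Conversely, packing the fields onto one line would be the right choice if a single thread accessed them together, as in Figure 1.

/* Sketch only: two POSIX threads touching different fields of one
 * struct.  With struct hot the fields share a cache line, so the
 * writer's stores invalidate the reader's copy (false sharing); with
 * struct padded each field occupies its own 128-byte line. */
#include <pthread.h>
#include <stdio.h>

#define LINE  128                 /* assumed coherence-block size */
#define ITERS 10000000L

struct hot    { long f1; long f2; };
struct padded { _Alignas(LINE) long f1;
                char pad[LINE - sizeof(long)];
                long f2; };

static struct hot    h;
static struct padded p;

static void *writer(void *arg)    /* T1: writes f1 */
{
    long *f1 = arg;
    for (long i = 0; i < ITERS; i++)
        *f1 = i;
    return NULL;
}

static void *reader(void *arg)    /* T2: reads f2 */
{
    volatile long *f2 = arg;
    long sum = 0;
    for (long i = 0; i < ITERS; i++)
        sum += *f2;
    (void)sum;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    /* false-sharing layout: f1 and f2 on the same line */
    pthread_create(&t1, NULL, writer, &h.f1);
    pthread_create(&t2, NULL, reader, &h.f2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* padded layout: f1 and f2 on separate lines */
    pthread_create(&t1, NULL, writer, &p.f1);
    pthread_create(&t2, NULL, reader, &p.f2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    puts("done");
    return 0;
}

Timing the two phases on a multiprocessor would typically show the hot phase running noticeably slower; that gap is the penalty a layout heuristic must weigh against the locality gain of co-locating fields.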

In this work, we propose an optimization technique to improve the layout of structures in multithreaded programs running on multiprocessor systems. The technique relies on a mix of static, profile-based, and sampling-based methods. While the ideal solution is to develop a compiler transformation that automatically transforms the layout of these structures, it is difficult to do so, particularly in the case of applications such as an OS kernel. An automatic transformation has to guarantee that the transformed layout is safe. This is difficult in the presence of code which takes the address of a field and contains C-style type cast operations, uses assembly code to access certain structure fields, or makes implicit assumptions on the relative offsets between certain structure fields. The compiler can choose to be conservative and simply not transform any structure whose fields are manipulated in the above manner, but our experience shows that this is overly restrictive and many important structures are discarded. So, we instead use the technique to develop a semi-automatic tool that suggests better layouts for structures. Along with the layout, the tool also provides some useful information on the suggested layout. This information is in the form of inter-cluster and intra-cluster edge weights, and a list of edges having a large negative or positive weight. A programmer can use the suggested layout or use the additional information to alter the existing layout to improve the performance manually.

void foo()
{
    struct S s[N];
    for (i = 0; i < N; i++) {
        sum += s[i].f1 * s[i].f2;
    }
    ...
}

Figure 1. Spatial locality

In this paper, we focus primarily on the HP-UX kernel, but the technique is applicable to many classes of multithreaded applications.

The rest of the paper is organized as follows. In Section 2, we specify an idealized model of the problem and show the approximations needed to make it practical in Section 3. In Section 4, we describe our implementation framework. In Section 5 we evaluate the performance of our tool on the HP-UX kernel. We then discuss related work in Section 6 and conclude with Section 7.

1.1 Contributions

This work makes the following main contributions:

• We develop an optimization technique for structure layout in multithreaded applications that simultaneously takes into account both spatial locality and false sharing.

• We propose a novel lightweight technique that uses hardware performance counters to estimate potential false sharing between two fields of a structure if they are placed in the same cache line. This technique is also applicable to other related problem domains, such as global variable layout or memory allocation in distributed shared memory multiprocessor systems.

• We build a semi-automatic tool based on our technique to target important structures in the HP-UX kernel. The tool produces a new layout for a given structure and also outputs the factors that favored that layout decision. Use of the suggested layouts for certain key structures has resulted in observable speedups on a 128-processor machine running the SDET benchmark from the SPEC SDM suite.

2 Model

We model the locality and false sharing effects on the fields of a structure in the program by a Field Layout Graph (FLG). The FLG is a weighted undirected graph whose nodes represent the fields of the struct and whose edge weights between a pair of nodes represent the potential gain or penalty in placing those fields in the same cache line; a positive weight indicates gain due to locality and a negative weight indicates loss due to false sharing. The goal is to partition the FLG and assign the fields from each partition to a separate cache line such that the net gain is maximized. This can be realized by using graph clustering algorithms that maximize the intra-cluster edge weights and minimize the inter-cluster edge weights.

Figure 2 outlines our solution approach to the problem. First we create the graph and assign weights to it. Then we partition it to fill up clusters / cache lines with structure fields. This simple method assumes that record types are placed or dynamically allocated at cache line boundaries. While this is not typically the case for general-purpose allocators, it is true for the HP-UX kernel-specific arena allocator. There are various ways to look at this for general-purpose applications: dynamic memory could be allocated in a special way, record types could be padded at their start address, or one could simply accept the fact that an improved layout only increases the probability of improved memory behavior.
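One hedged way to realize the first of these options, allocating dynamic memory "in a special way", is to over-align each instance at allocation time. The sketch below is not part of the paper's tool; the helper name, posix_memalign, and the 128-byte line size are assumptions about the target platform.

#include <stdlib.h>

#define CACHE_LINE 128   /* assumed coherence-block size in bytes */

/* Allocate one record instance starting on a cache-line boundary so
 * that the cluster-to-line assignment chosen for its type actually
 * holds at run time. */
void *alloc_aligned_record(size_t size)
{
    void *p = NULL;
    if (posix_memalign(&p, CACHE_LINE, size) != 0)
        return NULL;
    return p;
}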

Formally, the FLG for a record type struct S is defined as follows:

$FLG = (F, E, w)$, where

$F = \{ \text{fields in } S \}$

$E = \{ (f_1, f_2) \mid f_1, f_2 \in F \}$

$w$ = edge weight function, defined as:

$w(f_1, f_2) = CycleGain(f_1, f_2) - CycleLoss(f_1, f_2)$

The edge weight is made up of two components. The first component, CycleGain (CG), denotes the potential gain that accrues due to spatial locality when f1 and f2 are placed in the same cache line. The second component, CycleLoss (CL), denotes the potential degradation caused by false sharing when f1 and f2 are placed in the same cache line.

CycleGain: When two fields f1 and f2 are placed "close" to each other spatially, the memory performance is improved if they are accessed together. We define two fields to be close to each other if they lie on the same L2 cache line. Consider a path P in the control flow graph containing instructions i1 and i2, in that order, such that i1 accesses f1 and i2 does a read of f2. The read of f2 is more likely to hit in the cache if f2 is in the same cache line as f1 and the instructions i1 and i2 are close to each other. We make the simplifying assumption that if i2 writes to f2 instead of reading it, then there is little benefit in placing f1 and f2 together. This is because store misses, unlike load misses, are mostly harmless since they typically do not stall the pipeline. This assumption is correct for most of the benchmarks we analyzed, but it can be incorrect in the general case.

All paths in the control flow graph that access the field f1 and then read f2 contribute towards CycleGain(f1, f2). Similarly, all paths that access f2 and then read f1 also contribute to CycleGain(f1, f2). Thus,

$CycleGain(f_1, f_2) = \sum_{i_1, i_2} CycleGain(f_1, i_1, f_2, i_2) + \sum_{i_2, i_1} CycleGain(f_2, i_2, f_1, i_1)$

Here CycleGain(f1, i1, f2, i2) is the contribution of a path starting from instruction i1 to instruction i2, where i1 accesses f1 and i2 accesses f2. As mentioned above, this value is non-zero only when i2 reads f2. Moreover, even when i2 is a load, we define this value to be 0 when there are a lot of other memory accesses separating i1 and i2. For instance, if i1 and i2 are separated by a loop that sweeps through a large array, then i2 is unlikely to find f2 in the cache, even if f1 and f2 are placed in the same cache line. To capture this, we define MemoryDistance(i1, i2) (MD), which represents the number of unique memory locations touched by the program between i1 and i2. If MemoryDistance(i1, i2) is below a certain threshold, then CycleGain gets a positive weight, which we set to the execution frequency of this path. Otherwise, we define CycleGain(f1, i1, f2, i2) to be 0. Thus,

$CG_{f_1 i_1 f_2 i_2} = \begin{cases} 0 & \text{if } i_2 \text{ is a store} \\ k_1 \cdot Freq(P(i_1, i_2)) & \text{if } MD(i_1, i_2) < T \\ 0 & \text{otherwise} \end{cases}$
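As a small worked example (the numbers are illustrative, not measured), take the loop of Figure 1 and let i1 be the load of s[i].f1 and i2 the load of s[i].f2 in the same iteration. Only the two field accesses themselves separate i1 and i2, so MD(i1, i2) is far below any reasonable threshold T, and the path from i1 to i2 executes once per iteration of the N-iteration loop:

$CycleGain(f_1, i_1, f_2, i_2) = k_1 \cdot Freq(P(i_1, i_2)) = k_1 N$

Had i2 been a store to f2 instead of a load, the contribution would have been 0.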

CycleLoss: CycleLoss(f1, f2) denotes the increase in memory latency when f1 and f2 are placed close to each other. As discussed earlier, this increase happens due to false sharing. Unlike CycleGain, it is difficult to precisely specify a formula for CycleLoss. Instead, we give an approximation in the next section.

3 Approximations

While the FLG accurately models the original problem, constructing the FLG for a structure using the above definitions is difficult for many reasons. CycleGain computation involves identifying all pairs of instructions i1 and i2 such that i1 and i2 access fields of the same structure. Identifying all such paths, which could span multiple procedures, is intractable. CycleGain also requires MemoryDistance to be computed. In a commercial compiler, without the use of detailed memory traces, it is virtually impossible to identify the set of unique locations accessed between two given instructions.


Figure 2. Solution outline

CycleLoss computation poses even more challenges. First, there is no easy way to measure how many cycles are lost due to false sharing on a native execution. One possibility is to use hardware performance counters to classify a cache miss as due to false sharing, based on its average latency. But in order to map such a miss to a field of a structure instance, we need some form of instrumentation. Since this instrumentation perturbs the execution, it is likely to affect the amount of false sharing.

But even if we are able to measure the cycles lost due to false sharing, that does not help much in CycleLoss computation. CycleLoss is defined for every pair of fields, assuming they are placed in the same cache line. Obviously, in any layout whose size is more than a cache line, there is a pair of fields f1 and f2 that are not on the same cache line. No false sharing would be observed between f1 and f2 if that layout is used. But this does not mean that CycleLoss(f1, f2) is 0, for one can come up with a new layout that places f1 and f2 together, which might cause false sharing. One possible approach is to create multiple layouts such that there is at least one layout in which f1 and f2 are in the same cache line. Then one could measure the false sharing between f1 and f2 in that layout and use that value for CycleLoss(f1, f2). Clearly, the complexity of this approach increases tremendously as the number of fields in the structure increases.

In order to practically obtain these two key values, we approximate these quantities and measure them as described in the rest of this section.

3.1 CycleGain Measurement

In order to keep CycleGain measurement tractable, we make the following approximations.

• We consider only intra-procedural paths in the CFG. This reduces the problem complexity significantly. While this would result in some undercounting of CycleGain, an aggressive inlining phase before this analysis would alleviate this problem.

• Within a procedure, we form affinity groups [8], which are groups of fields accessed at the same level of granularity. This removes the overhead of tracking all pairs of field accesses.

• We ignore the computation of MemoryDistance based on the assumption that, at the granularity of loops, the MemoryDistance between most pairs of instructions is likely to be small. In other words, we assume that the MemoryDistance between fields of the same affinity group is always below the threshold T.

Based on these approximations, we compute CycleGain(f1, i1, f2, i2) (CG) as follows:

$CG_{f_1 i_1 f_2 i_2} = \begin{cases} 0 & \text{if } i_1 \text{ and } i_2 \text{ do not belong to the same procedure} \\ EC(L) & \text{if } i_1 \text{ and } i_2 \text{ belong to loop } L \\ Freq(P) & \text{if } i_1 \text{ and } i_2 \text{ are in straight-line code in procedure } P \end{cases}$

The ExecutionCount (EC) of a loop is the number of times the loop body is executed.
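A minimal sketch of how this approximated CycleGain could be accumulated into the gain side of the FLG from per-procedure affinity groups is shown below. The data structures and the MAX_FIELDS bound are hypothetical; the paper's tool performs this aggregation inside the SYZYGY optimizer rather than in standalone code.

/* Sketch (hypothetical types): every pair of fields referenced in the
 * same affinity group is credited with the group's execution count,
 * i.e., EC(L) for a loop group or Freq(P) for a straight-line group. */
#include <stddef.h>

#define MAX_FIELDS 64

struct affinity_group {
    size_t nfields;
    int    field[MAX_FIELDS];   /* indices of fields referenced in the group  */
    long   count;               /* EC(L) or Freq(P) of the group's granularity */
};

void accumulate_cycle_gain(const struct affinity_group *g, size_t ngroups,
                           long gain[MAX_FIELDS][MAX_FIELDS])
{
    for (size_t k = 0; k < ngroups; k++)
        for (size_t a = 0; a < g[k].nfields; a++)
            for (size_t b = a + 1; b < g[k].nfields; b++) {
                gain[g[k].field[a]][g[k].field[b]] += g[k].count;
                gain[g[k].field[b]][g[k].field[a]] += g[k].count;
            }
}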

3.2 CycleLoss Measurement

Since there is no practical method to estimate potential false sharing, we define a related quantity called CodeConcurrency (CC). Let I be some time interval during the execution of the program. Let $F_I(P_k, B_i)$ denote the frequency of execution of the basic block $B_i$ on processor $P_k$ during the interval I. Then,

$CC_I(B_i, B_j) = \sum_{P_m, P_n} \min\big(F_I(P_m, B_i),\, F_I(P_n, B_j)\big)$

$CC(B_i, B_j) = \sum_{I} CC_I(B_i, B_j)$

We divide the execution of the program into fixed size time intervals. In each interval, we consider processors that execute the program in pairs. For each such pair Pm and Pn, we compute the minimum of the execution frequencies of Bi on Pm and Bj on Pn. Summing up these values for all pairs of processors gives the CodeConcurrency value for Bi and Bj in that interval. Finally, we again take the sum of these values over all intervals to compute CodeConcurrency(Bi, Bj).
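A direct transcription of these two formulas, assuming the per-interval, per-processor block frequencies have already been tabulated from the samples, could look as follows. This is a sketch with hypothetical names; the paper's scripts operate on Caliper output instead. It assumes ordered pairs of distinct processors.

/* Sketch: CodeConcurrency of blocks bi and bj from sampled frequencies.
 * freq[(i * cpus + p) * nblocks + b] = frequency of block b on
 * processor p during interval i. */
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

long code_concurrency(const long *freq, size_t intervals, size_t cpus,
                      size_t nblocks, size_t bi, size_t bj)
{
    long cc = 0;
    for (size_t i = 0; i < intervals; i++)          /* sum over intervals I  */
        for (size_t m = 0; m < cpus; m++)           /* processor running Bi  */
            for (size_t n = 0; n < cpus; n++) {     /* processor running Bj  */
                if (m == n)
                    continue;                       /* assume distinct CPUs  */
                long fi = freq[(i * cpus + m) * nblocks + bi];
                long fj = freq[(i * cpus + n) * nblocks + bj];
                cc += MIN(fi, fj);                  /* CC_I term for (Pm,Pn) */
            }
    return cc;
}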

A high value of CodeConcurrency indicates that during the program's execution, whenever some processor executes Bi, some other processor executes Bj at roughly the same time with high probability. If some field f1 of a struct S is accessed in Bi and some other field f2 of S is accessed in Bj, and one of these accesses is a write, there is likely to be false sharing if f1 and f2 are placed in the same cache line. Thus,

$CycleLoss(f_1, f_2) = k_2 \sum_{B_1, B_2} CC(B_1, B_2)$

where the field f1 is accessed in basic block B1, f2 is accessed in B2, at least one of these two accesses is a write, and k2 is a tunable constant. This equation for CycleLoss is an over-approximation of the actual false sharing because it does not differentiate between different instances of the same structure type. Thus, when the field f1 from the instance s1 of struct S is accessed concurrently with the field f2 from a different instance s2 of the same struct S, it will contribute to CycleLoss even though there is no actual false sharing between the fields. To some extent, this problem can be mitigated using powerful alias analysis techniques. Whenever alias analysis determines that the addresses of two structure instances do not alias, we can conclude that there is no false sharing between the fields of those structures even though the basic blocks containing them are highly concurrent.

4 Implementation

Our layout tool is built using the HP-UX compiler for the Intel Itanium processor and scripts that process the data produced by the compiler. The compiler infrastructure, in particular the high-level inter-procedural optimizer SYZYGY, has been described extensively in [13]. In short, when the option -ipo is specified, the optimizer front-end emits object files containing an intermediate representation of the input (IELF files). The actual compilation happens at link time, where all input files are consumed by the inter-procedural optimizer (IPO). Multiple parallel processes of the back-end, containing the loop optimizer, scalar optimizer, code generator, and low-level optimizer, are spawned to generate final object files. The object files are passed back to the linker, which generates the final executable. All this happens transparently to the user.

The compiler also supports profile-based optimization. In a profile collect phase the executable is instrumented by the compiler. During runtime, the performance analysis tool Caliper is run to additionally collect data from the performance monitoring unit (PMU) on the Itanium processor. The resulting feedback file contains both precise edge counts and sampled data. The compiler uses the profile in a use phase to guide many optimization phases. This infrastructure has been designed to scale to large applications consisting of many millions of lines of code, both in terms of compile time and memory consumption.

Figure 3 shows the block diagram of our implementation. In the rest of this section, we describe the components in the block diagram in detail.

Figure 3. Block diagram of the implementation (components: Sources, SYZYGY, Affinity Graph, FMF, Layout Tool, New Layout, Info; make, a.out, Caliper, PMU Trace, Concurrency Generator, Concurrency Data)

4.1 Static Affinity Information

As part of the inter-procedural optimizations, the compiler performs automatic data layout transformations of record types [8]. Even when the automatic transformation fails because of legality violations or profitability constraints, the analysis performed by the compiler is still valuable for manual performance tuning. The compiler uses static analysis together with run-time measurements to produce a detailed report on record types, fields, and inter-field relations. The report contains the analysis results for each record type and is in a simple and easily parseable format. This serves as input to a variety of scripts which process and output this data in different ways, highlighting various aspects of the data. In addition to standard information for fields, such as name, size, offset from the start of the structure, and alignment, the compiler also computes the key attributes hotness and affinity. We informally define these terms as:

Hotness A field is hotter than others if it is referenced more often.

Affinity Two fields are affine to each other if they are referenced together at the same level of granularity, for example, at the loop level, or in straight-line code. The execution frequency of that granularity determines the value of the affinity.

The affinity is collected in the front-end, after loop recognition, and the data is aggregated in IPO to build an affinity graph, whose nodes represent fields and whose edges are weighted with the affinity between fields. For profile-based compilations, the affinities correspond to the aggregated edge weights over the control flow graphs of all procedures. Additionally, the number of read and write references of each field is computed at the basic block level, and the profile data is used to provide data cache statistics such as miss counts and average access latencies.

The example in Figure 4 illustrates these concepts. Assume a type S has three fields f1, f2, and f3.

/* entry PBO count: n */
S.f1 = ;
S.f2 = ;
for (int i = 0; i < N; i++) {
    /* PBO count: N */
    S.f3 = ;
    = S.f3 + S.f1;
    = S.f3;
}

Figure 4. Affinity and hotness

Figure 5. Affinity graph for the code in Figure 4: nodes f1, f2, and f3; edge f1–f2 with weight n and edge f1–f3 with weight N; hotness of f1 is h = N + n; per-field reference counts are R=N, W=n for f1, R=0, W=n for f2, and R=2N, W=N for f3.

The two fields f1 and f2 are assigned outside a loop; f1 and f3 are referenced within a loop. The incoming edge count for this code snippet is n, and the loop body has an execution count of N. The straight-line code and the loop form separate affinity groups, and the code snippet results in the affinity graph shown in Figure 5 (shown without attributed d-cache information).

Field hotness is computed at graph construction time. For each pair of fields (fi, fj), an edge is formed between fi and fj with the edge weight being the weight of the affinity group. This method does not, unfortunately, fully account for control flow, since both hot and cold basic blocks inside the loop are weighted equally. This simple heuristic, however, was found to be effective by Hundt et al. [8] for the purposes of automatic data layout optimization for single-threaded programs.

In our implementation, we use a refined version of this heuristic which we refer to as the Minimum Heuristic. We compute the sum of read and write counts for the fields fi and fj within the loop. We take the minimum of these counts as the affinity between fi and fj. The rationale is that, within a loop, the dynamic weight of the acyclic control flow path containing fi and fj is upper bounded by the minimum of these two values.
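A sketch of the Minimum Heuristic for a single loop-level affinity group follows; the struct and function names are hypothetical, not the compiler's.

/* Affinity credited to a pair (fi, fj) inside one loop: the smaller of
 * their total reference counts, which bounds the dynamic weight of any
 * acyclic path in the loop body that touches both fields. */
struct field_counts { long reads, writes; };

static long min_heuristic_affinity(struct field_counts fi,
                                   struct field_counts fj)
{
    long ci = fi.reads + fi.writes;
    long cj = fj.reads + fj.writes;
    return ci < cj ? ci : cj;
}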


4.2 Code Concurrency

Code Concurrency was defined in the model in the previous section. In our implementation, instead of measuring the actual frequency of execution, we use a sampled value obtained from the Itanium PMU. We use the performance analysis tool HP Caliper [7] with a modified measurement mode to sample over the run of an application. Caliper supports a whole-system mode where the data is collected for the entire system and tagged with the CPU id. Each sample contains the instruction pointer (IP), as well as the value of the Itanium-specific Interval Timer Counter (ITC). The ITC counts upward at a fixed relationship to the processor clock frequency and can be used as a high-resolution timing device. The distance between two counts depends on the clock frequency, and the actual value for our systems is on the order of 1 ns. The ITCs of all processors on a multiprocessor machine are synchronized, with only a few ticks of drift across CPUs.

The sampling interval has been set to 100,000 CPU cycles, which is 4 or 5 orders of magnitude higher than the synchronization offsets in the ITCs of all processors. We divide the samples taken on all processors into intervals of 1 ms. With a processor clock frequency of 1.2 GHz we get about 12 samples per time slice per processor. These parameters have been selected not only to minimize the total amount of data that needs to be processed later, but also to avoid losing too many samples, which can happen at high sampling frequencies on heavily loaded machines.
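The quoted sample density follows directly from these parameters (a back-of-the-envelope check, assuming a uniform 1.2 GHz clock):

$\frac{1.2 \times 10^{9}\ \text{cycles/s} \times 10^{-3}\ \text{s}}{10^{5}\ \text{cycles/sample}} = 12\ \text{samples per 1 ms interval per processor}$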

4.3 Constructing Field Layout Graph

An external script processes Caliper's output files. Sampled IP addresses are mapped to source information using the source correlation information stored along with the binary. Obviously, the quality of this data mapping depends directly on the quality of the source information maintained by the compiler for highly optimized code. Applying the formula for CodeConcurrency, the script produces a Concurrency Map (CM) that maps a pair of source lines to their code concurrency value.

To find which structure fields are referenced in a given basic block, we added a new component to the compiler to emit a Field Mapping File (FMF). This component takes the source lines that are found in the CM, finds the basic blocks corresponding to those source lines, and collects all fields accessed in those basic blocks. The field mapping file is a map from source lines to the fields accessed in the basic blocks corresponding to those source lines. It also records whether an access is a read or a write. A final processing step matches the concurrency map with the field mapping file to produce CycleLoss information between the fields.
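The final matching step can be pictured as the following sketch (the data layout and names are hypothetical; the real processing is done by scripts over the CM and FMF text files): for every concurrent pair of source lines, every field accessed on the first line is paired with every field accessed on the second, and if at least one of the two accesses is a write, the pair is charged k2 times the lines' CodeConcurrency.

#include <stddef.h>

#define MAX_FIELDS 64                                /* hypothetical bound */

struct cm_entry  { int line_a, line_b; long cc; };   /* concurrency map    */
struct field_ref { int line, field, is_write; };     /* field mapping file */

void accumulate_cycle_loss(const struct cm_entry *cm, size_t ncm,
                           const struct field_ref *fm, size_t nfm,
                           double k2, double loss[MAX_FIELDS][MAX_FIELDS])
{
    for (size_t c = 0; c < ncm; c++)
        for (size_t i = 0; i < nfm; i++) {
            if (fm[i].line != cm[c].line_a)
                continue;
            for (size_t j = 0; j < nfm; j++) {
                if (fm[j].line != cm[c].line_b)
                    continue;
                if (!fm[i].is_write && !fm[j].is_write)
                    continue;                        /* need at least one write */
                loss[fm[i].field][fm[j].field] += k2 * cm[c].cc;
            }
        }
}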

Concurrency data can be collected on a 4-way or on a higher-way machine.

Input: FLG

Sort FLG nodes by hotness
Unassigned = FLG
Clusters = []
while size(Unassigned) > 0 do
    seed = head(Unassigned)
    CurrentCluster = {seed}
    best_match = find_best_match(CurrentCluster, Unassigned)
    while best_match != NULL do
        add best_match to CurrentCluster
        best_match = find_best_match(CurrentCluster, Unassigned)
    end while
    Add CurrentCluster to Clusters
end while

Figure 6. Clustering algorithm

Clearly, the concurrency information may change as other effects, such as the locking and spinning behavior in the kernel, become more dominant. To achieve the best accuracy, the data should be collected on the machine for which an optimized layout is to be generated, but the size of the data collected increases linearly with the number of processors. We collected data on a 4-way and a 16-way machine. We find that for our main benchmark, the HP-UX kernel, source line pairs with high concurrency values remain more or less the same on both the 4-way and 16-way machines. Obviously, the actual values differ because of the way we compute CodeConcurrency. Other applications may exhibit different behavior depending on how they are written. For our experiments we use the data from the 16-way machine.

The final step is to generate a layout of the structure based on the FLG. In the layout, we want two fields f1 and f2 to be close to each other if the edge connecting them has a large positive weight, and to keep them separate if the edge weight is negative. A common approach to solving such problems is graph clustering. We want to partition the graph into clusters such that the intra-cluster weights are maximized and the inter-cluster weights are minimized. We also want to ensure that each of these field clusters fits within a cache line, as both CycleGain and CycleLoss are observed at cache line granularity.

4.4 Clustering Algorithm

We use a simple greedy graph clustering algorithm, shown in Figure 6. The routine find_best_match is shown in Figure 7. The algorithm sorts the nodes of the FLG by hotness.


Inputs: CurrentCluster, Unassigned

best_weight = 0
best_match = NULL
for f1 in Unassigned do
    if adding f1 to CurrentCluster requires a new cache line then
        continue
    end if
    weight = 0
    for f2 in CurrentCluster do
        weight += w(f1, f2)
    end for
    if weight > best_weight then
        best_weight = weight
        best_match = f1
    end if
end for
return best_match

Figure 7. Algorithm for find_best_match

Among the nodes that are not yet assigned to any cluster, it picks the hottest node and assigns it to a new cluster. It then grows this cluster by finding the most profitable unassigned node and adding that node to it. The most profitable node is the one which, when added to the current cluster, maximizes the intra-cluster edge weight. In other words, the sum of edge weights between this node and all other nodes already in the cluster is the maximum over all unassigned nodes. This process is repeated until one of the following conditions is met:

• The sum of edge weights between an unassigned node and the nodes in the current cluster is negative for all unassigned nodes.

• No new node can be added to the current cluster without increasing the number of cache lines required by the current cluster.

Then the next cluster is formed by picking the hottest unassigned node, and the process is repeated until all nodes are assigned to some cluster.
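The "requires a new cache line" test in Figure 7 is not spelled out further; one plausible reading, assuming fields are appended to the cluster in order with their natural alignment, is sketched below (the helper name, packing rule, and 128-byte line size are assumptions, not the tool's actual logic).

#include <stddef.h>

#define CACHE_LINE 128          /* assumed coherence-block size in bytes */

/* Would appending a field of the given size and (power-of-two)
 * alignment push the cluster beyond a single cache line? */
static int needs_new_cache_line(size_t cluster_bytes,
                                size_t field_size, size_t field_align)
{
    size_t start = (cluster_bytes + field_align - 1) & ~(field_align - 1);
    return start + field_size > CACHE_LINE;
}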

5 Experimental Evaluation

We evaluate the potential of our tool by applying it to some important structures in the HP-UX kernel. HP-UX is a commercial OS whose kernel has been tuned for performance over many years by a team of engineers. Hence, we assume the current layout of the structures to be near-optimal. Even if our tool merely matches the performance of the current kernel or comes close to it, that shows the potential of this technique.

We run our experiments primarily on a 128-processor HP Superdome server. This server contains 64 mx2 chips, each containing two Itanium 2 processors. The server has a total of 384 GB of RAM, distributed across processors, and each processor has a 6 MB L3 cache. The 64 chips are organized into cells. A cell contains two buses, each supporting two chips. Four such cells are connected via a crossbar, and four such crossbars are connected together. The cost of accessing data varies depending on where the data is found. Intra-cell latencies are smaller than intra-crossbar latencies, which are smaller than inter-crossbar latencies. Accessing data from a cache on a processor that is in a different crossbar takes around 1000 cycles. For comparison, we also show results on a smaller 4-processor machine where the 4 processors are connected together by a bus. On this machine, the cost of accessing remote caches is only slightly higher than an L2 miss.

To evaluate the performance of the kernel, we use the benchmark 057.sdet from the SPEC Software Development Multi-Tasking (SDM) benchmark suite. SDET simulates the scenario of multiple concurrent users executing several programs, each with small running times. We chose this benchmark because, since multiple concurrent processes are executed, it spends a large fraction of its time in kernel code and stresses the kernel data structures. The throughput of the system, measured in scripts per hour, is used as the metric for comparison. In all our experiments, we first did a warmup SDET run followed by 10 runs. After removing the outliers, we computed the mean of those 10 runs and used it for comparison.

5.1 Evaluation of the Automatic Layout

We first evaluate the performance of the layouts produced by the tool. If it is safe to transform a structure, this is the performance that would be obtained by a compiler transformation that automatically transforms the layout. To do the evaluation, we identify certain key structures in the kernel based on their hotness and transform their layouts individually. We only consider those structures whose layout after transformation spans multiple cache lines, as otherwise there is no qualitative difference between the original and the transformed layouts.

Figure 8 shows the performance difference of the new layout over the baseline. The results are shown for five structs, shown as A, B, C, D, and E here due to the proprietary nature of the HP-UX code. The first bar shows the performance difference compared to the baseline when the original layout for that type is replaced by the layout generated by the tool. For three of the structures, there is a small speedup over the baseline. In the case of struct A, there is around a 5% slowdown with respect to the baseline.


Figure 8. Performance of the automatic layout on a 128-way machine


Figure 9. Performance on a 4-way machine

As mentioned earlier, the structures in the HP-UX kernel have been heavily optimized by a team of kernel engineers. Hence, we consider the baseline layout to be near optimal. If the modified layout results in performance that is close to the baseline, we consider it to be a step in the right direction. Since very few programmers invest such effort in improving the layout of structures, the benefit of the tool is likely to be more pronounced in those cases.

To show the efficiency of our heuristics, we compare them with a layout produced by a simple heuristic. This heuristic first divides the fields into groups based on their alignment requirements. It then sorts each group by hotness and places the fields in that order. This results in a highly packed layout with hot fields placed close to each other. The second bar shows the performance of this heuristic relative to the baseline. For two structures, this heuristic does marginally better than the clustering of the FLG. For another structure it comes very close to the clustering approach. While it might be tempting to conclude that this very simple heuristic would suffice, the result for struct A shows the problem with this approach. If we use the sort-by-hotness heuristic for this structure, the performance degradation compared to the baseline is more than 2X, while it is only 5% for our proposed technique. The reason for this significant degradation is that there is a significant amount of false sharing for this structure, while the false sharing among the fields of the other structures is minimal at best. This suggests that while sort-by-hotness is good enough in many cases, it is unsuitable in the presence of false sharing and cannot be used in an automatic transformation for multithreaded programs with potential for false sharing.

False sharing is more pronounced on a large multiprocessor than on a machine with a small number of processors. A structure layout optimization that optimizes for heavy false sharing should not cause performance degradation in the absence of false sharing. To see how our technique performs in such cases, we evaluated it on a machine with 4 processors. Figure 9 shows the speedup numbers for the same five structures as in Figure 8, using the same layouts. The new layouts show marginal speedups over the baseline in all five cases. This is because, while most of these structures have a large number of fields, only a few of them cause heavy false sharing. Hence, those fields can be placed in separate cache lines without affecting spatial locality significantly.

5.2 Best Performance

While the goal is to develop a tool that automatically produces the best layout, we can also use the tool to incrementally improve an existing layout. Toward this purpose, we modified the algorithm to obtain the important clustering requirements.


Figure 10. Best performance

We identified the "important" edges of the FLG and removed the rest. We treat all negative-weight edges and the top 20 positive-weight edges as important. After removing the other edges, we also remove all nodes with zero degree. Thus we are left with a subgraph of the FLG. We then apply our clustering algorithm to this subgraph. The result is a partition of a subset of the fields in the original graph. These clusters specify constraints on the layout: if two fields are in the same cluster, they must be placed together in the final layout, and if two fields are in different clusters, they must be separated. We then alter the original layout so that these constraints are met.

For two of these structures, this incremental change resulted in better performance than the fully automatic layout. We believe that the main reason for this is that the greedy clustering algorithm is not optimal. Hence, when there are a large number of fields, the layout can be poor. But when we filter the edges and consider only a few positive-weight edges, the job of the clustering algorithm becomes easier since the clusters in this subgraph are small. This approach is promising if the original layout is already well tuned.

In Figure 10, we show the performance of the best layout for each of the four structures. For structures C and D, the best layout we obtained was the automatic layout. For struct B, the layout obtained by modifying the original layout by applying clustering on the subgraph was slightly better than the automatic layout. In the case of struct A, the hand-modified layout resulted in a 2.65% speedup over the baseline, while the automatic layout actually resulted in a 5.29% slowdown. Since this structure has more than one hundred fields, the simple greedy clustering algorithm on the entire set of fields performs sub-optimally.

Note that these improvements are not cumulative. This can be explained by the highly tuned nature of the HP-UX kernel. Improving the performance of one piece of functionality can result in degraded performance somewhere else, for example, in the locking behavior. To make the data structure layout improvements cumulative, the whole kernel needs to be tuned simultaneously. However, we don't expect this to be a problem for less tuned applications.

6 Related Work

Many previous studies have investigated and implemented data layout optimizations, including [2, 3, 4, 8, 11, 14, 18, 19] and, recently, [5]. All of these works appear to focus on locality improvements for single-threaded programs, and there is no explicit mention of heuristics or methods to model effects from multi-threading or to reduce false sharing.

The single-threaded framework described in this paper is based on previous work presented in Hundt et al. [8]. That work proposes a practical structure layout optimization and advisory tool based on field affinity and hotness for uniprocessors running single-threaded code. The affinity representation is almost identical to the one described in this paper.

Bolosky and Scott [1] discuss the impact of false sharing on multiprocessor performance and show that it is a serious problem that needs to be tackled.

Data transformations for eliminating false sharing misses in explicitly parallel programs have been discussed in several previous works, such as Torrellas et al. [17] and Jeremiassen et al. [9]. The former uses techniques based on trace analysis, and the latter uses static analysis for explicitly parallel programs in which the synchronization and parallelization constructs are explicitly exposed to the compiler. Our technique is applicable to all general-purpose multithreaded code, and has been employed in a production compiler and tools system in order to improve the layout of a multiprocessor operating system kernel.

Torrellas et al. [17] also discuss data placement techniques for reducing false sharing, such as structure padding, record alignment, co-location of a lock with the accessed data, and separating out hot shared scalars, but do not deal with reordering of structure fields in order to minimize false sharing. Jeremiassen [9] discusses a set of data transformations applied in a source-to-source restructurer in order to minimize false sharing. Their purely compile-time non-concurrency analysis identifies shared vs. non-shared data by using the barrier synchronization points present in the explicitly parallel code to determine which portions of the program can execute in parallel and which cannot. Hence their technique may not be applicable to general-purpose multithreaded code which does not use explicit parallel programming directives.


There has also been considerable work on studying the trade-off between data aggregation and false sharing, since increasing the cache line size improves spatial locality but can also increase false sharing. Kadiyala [10] proposed a dynamic cache design to reduce false sharing. Ours is orthogonal to the cache-line resizing approach, and attempts to prevent false sharing by separating data that is accessed by different threads into different cache lines.

Dubois [6] discusses hardware-software approaches to reduce the performance impact caused by false sharing. He proposes techniques such as delayed invalidations and performing invalidations at word boundaries instead of at the granularity of a cache line. Our work is orthogonal to such schemes, and does not require any hardware modifications.

McIntosh et al. [12] mention as future work doing global variable layout (GVL) for multithreaded code in order to avoid false sharing misses. We plan to integrate code concurrency information into the compiler's GVL framework as part of our future work.

There has been considerable research into developing tools that study the amount of false sharing and identify false sharing hotspots for the programmer. For example, Talbot et al. [16] discuss CLARISSA, a trace analysis tool which tries to determine potential contention instead of predicted contention by simulating a particular memory system design. Ours is a hybrid approach which uses both compiler analysis and performance counters to provide false sharing estimates for making structure layout decisions.

7 Conclusion and Future Work

In this paper we developed a model for structure layout optimization for multi-threaded applications. It not only aims to improve locality, but also takes the effects of false sharing into account. We describe a practical approximation and implementation of the model and combine static compiler analysis with runtime measurements in a semi-automatic tool to generate good layouts for structures. The concept of code concurrency is introduced, which is based on synchronized sampling, a powerful technique that has applications in other domains as well. We use the HP-UX kernel as our benchmark and show that the produced layouts are good and in some cases even better than the existing, manually tuned layouts.

We believe that the layouts can be improved further by mainly two factors: improved input to the model and a better clustering algorithm. We will seek to improve the affinity information by a variety of means, in particular via post-inline computation to better capture the effects of inter-procedural paths, and by lock analysis. We will also evaluate the use of lightweight instrumentation to get more accurate false sharing information. Finally, we plan to use Code Concurrency information in other parts of our compiler, such as the GVL framework.

References

[1] W. J. Bolosky and M. L. Scott. False sharing and its effect on shared memory performance. In Proceedings of the USENIX SEDMS IV Conference, 1993.

[2] B. Calder, C. Krintz, S. John, and T. Austin. Cache-conscious data placement. In ASPLOS-VIII: Proceedings of the Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 139–149, New York, NY, USA, 1998. ACM Press.

[3] T. M. Chilimbi, B. Davidson, and J. R. Larus. Cache-conscious structure definition. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 13–24, New York, NY, USA, 1999. ACM Press.

[4] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-conscious structure layout. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 Conference on Programming Language Design and Implementation, pages 1–12, New York, NY, USA, 1999. ACM Press.

[5] T. M. Chilimbi and R. Shaham. Cache-conscious coallocation of hot data streams. SIGPLAN Not., 41(6):252–262, 2006.

[6] M. Dubois, J. Skeppstedt, L. Ricciulli, K. Ramamurthy, and P. Stenstroem. The detection and elimination of useless misses in multiprocessors. In ISCA '93: Proceedings of the 20th Annual International Symposium on Computer Architecture, pages 88–97, New York, NY, USA, 1993. ACM Press.

[7] R. Hundt. HP Caliper: A framework for performance analysis tools. IEEE Concurrency, 8(4):64–71, 2000.

[8] R. Hundt, S. Mannarswamy, and D. Chakrabarti. Practical structure layout optimization and advice. In CGO '06: Proceedings of the International Symposium on Code Generation and Optimization, pages 233–244, Washington, DC, USA, 2006. IEEE Computer Society.

[9] T. E. Jeremiassen and S. J. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In PPOPP '95: Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 179–188, New York, NY, USA, 1995. ACM Press.

[10] M. Kadiyala and L. Bhuyan. A dynamic cache sub-block design to reduce false sharing. In ICCD '95, page 313, 1995.

[11] T. Kistler and M. Franz. Automated data-member layout of heap objects to improve memory-hierarchy performance. ACM Trans. Program. Lang. Syst., 22(3):490–505, 2000.

[12] N. McIntosh, S. Mannarswamy, and R. Hundt. Whole program optimization of global variable layout. In PACT '06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, Seattle, WA, USA, 2006. IEEE Computer Society.

[13] S. Moon, X. D. Li, R. Hundt, D. R. Chakrabarti, L. A. Lozano, U. Srinivasan, and S.-M. Liu. SYZYGY - a framework for scalable cross-module IPO. In CGO '04: Proceedings of the International Symposium on Code Generation and Optimization, page 65, Washington, DC, USA, 2004. IEEE Computer Society.


[14] S. Rubin, R. Bodik, and T. Chilimbi. An efficient profile-analysis framework for data-layout optimizations. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 140–153, New York, NY, USA, 2002. ACM Press.

[15] P. Stenstrom. A survey of cache coherence schemes for multiprocessors. Computer, 23(6):12–24, 1990.

[16] S. A. M. Talbot, A. J. Bennett, and P. H. J. Kelly. Cautious, machine-independent performance tuning for shared-memory multiprocessors. In Euro-Par '96: Proceedings of the Second International Euro-Par Conference on Parallel Processing, pages 106–113, London, UK, 1996. Springer-Verlag.

[17] J. Torrellas, H. S. Lam, and J. L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Trans. Comput., 43(6):651–663, 1994.

[18] D. N. Truong, F. Bodin, and A. Seznec. Improving cache behavior of dynamically allocated data structures. In PACT '98: Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques, page 322, Washington, DC, USA, 1998. IEEE Computer Society.

[19] Y. Zhong, M. Orlovich, X. Shen, and C. Ding. Array regrouping and structure splitting using whole-program reference affinity. In PLDI '04: Proceedings of the ACM SIGPLAN 2004 Conference on Programming Language Design and Implementation, pages 255–266, New York, NY, USA, 2004. ACM Press.
