
STATISTICAL MODELS FOR EMPIRICAL SEARCH-BASED PERFORMANCE TUNING

Richard Vuduc (1), James W. Demmel (2), Jeff A. Bilmes (3)

(1) Computer Science Division, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley, CA 94720, USA
(2) Computer Science Division, Department of Electrical Engineering and Computer Sciences, and Department of Mathematics, University of California at Berkeley, Berkeley, CA 94720, USA
(3) Department of Electrical Engineering, University of Washington, Seattle, WA, USA

The International Journal of High Performance Computing Applications, Volume 18, No. 1, Spring 2004, pp. 65–94. DOI: 10.1177/1094342004041293. © 2004 Sage Publications. http://www.sagepublications.com/

    Abstract

Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e. actually running the code). This paper presents quantitative data that motivate the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Secondly, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations when the space of inputs can be described by continuously varying features. We address both problems by using statistical modeling techniques that exploit the large amount of performance data collected during the search. We demonstrate these methods on actual performance data collected by the PHiPAC tuning system for dense matrix multiply. We close with a survey of recent projects that use or otherwise advocate an empirical search-based approach to code generation and algorithm selection, whether at the level of computational kernels, compiler and run-time systems, or problem-solving environments. Collectively, these efforts suggest a number of possible software architectures for constructing platform-adapted libraries and applications.

Key words: algorithm selection, automatic performance tuning, early stopping, feedback-directed optimization, matrix multiplication, performance distribution, performance optimization, software engineering, support vector method

    1 Introduction

In this paper we present quantitative data that motivate the use of the platform-specific, empirical search-based approach to code generation being adopted by a number of current automatic performance tuning systems. Such systems address the problem of how to generate highly efficient code for a number of basic computational kernels that dominate the performance of applications in science, engineering, and information retrieval, among others. The driving idea behind these systems is the use of empirical search techniques, i.e. actually running and timing candidate implementations. Familiar examples of kernels include dense and sparse linear algebra routines, the fast Fourier transform (FFT) and related signal processing kernels, and sorting, to name a few. Using dense matrix multiply as a case study, we show that performance can be surprisingly difficult to model on modern cache-based superscalar architectures, suggesting the use of an empirical approach to code generation. Moreover, we use these data to show how statistical techniques can be applied to help solve common, general problems arising in search-based tuning systems.

The construction and machine-specific hand-tuning of computational kernels can be tedious and time-consuming tasks because performance is a complex function of many factors, including the platform (i.e. machine and compiler), kernel, and input data which may be known only at run-time. First, note that modern machines employ deep memory hierarchies and microprocessors having long pipelines and intricate out-of-order execution schemes; understanding and accurately modeling performance on these machines can be extremely difficult. Secondly, modern compilers vary widely in the range and scope of analyses, transformations, and optimizations that they can perform, further complicating the task of developers who might wish to rely on the compiler either to perform particular optimizations or even to know which transformations will be most profitable on a particular kernel. Thirdly, any given kernel could have many possible reasonable implementations which (a) are difficult to model because of the complexities of machine architectures and compilers, and (b) correspond to algorithmic variations that are beyond the scope of transformations compilers can presently apply. Finally, performance may depend strongly on the program inputs. If such inputs are known only at run-time, purely static techniques to code generation may be limited. To make matters worse, the process of performance tuning, and hence of understanding the complex interactions among these various factors, must be repeated for every new platform.

Fast implementations of some kernels are available via hand-tuned libraries provided by hardware vendors. A number of standard and widely used library interfaces exist, including the Basic Linear Algebra Subroutines (BLAS; Dongarra et al., 1990), and the Message Passing Interface (MPI) for distributed parallel communications (Geist et al., 1996). However, the number of such interfaces is growing to cover additional application areas, such as the Sparse BLAS standard for operations on sparse matrices (Blackford et al., 2001), and the Vector and Signal Image Processing Library API (Schwartz et al., 2000). The cost of providing hand-tuned implementations of all these kernels for all hardware platforms, given the potential complexity of tuning each kernel, is likely only to increase with time.

In response, several recent research efforts have sought to automate the tuning process using the following two-step methodology. First, rather than code a given kernel by hand for each computing platform of interest, these systems contain parametrized code generators that (a) encapsulate possible tuning strategies for the kernel, and (b) output an implementation, usually in a high-level language (such as C or Fortran), in order to leverage existing compiler instruction-scheduling technology. By tuning strategies we mean that the generators can output implementations which vary by low-level machine-specific characteristics (e.g. different instruction mixes and schedules), optimization techniques (e.g. loop unrolling, cache blocking, the use of alternative data structures), run-time data (e.g. problem size), and kernel-specific transformations and algorithmic variants. Secondly, the systems tune for a particular platform by searching the space of implementations defined by the generator. Typical tuning systems search using a combination of heuristic performance modeling and empirical evaluation (i.e. actually running code for particular implementations). In many cases it is possible to perform the potentially lengthy search process only once per platform. However, even the cost of more frequent compile-time, run-time, or hybrid compile-time/run-time searches can often be amortized over many uses of the kernel.

In this paper we focus on the search task itself and argue that searching is an effective means by which to achieve near-peak performance, and is therefore an important area for research. Indeed, search-based methods have proliferated in a variety of computing contexts, including applications, compilers, and run-time systems, as we discuss in Section 5. We begin by showing empirically the difficulty of identifying the best implementation, even within a space of reasonable implementations for the well-studied kernel, dense matrix multiply (Section 2). While a variety of sophisticated transformations and static models have been developed for matrix multiply, the data we present motivate searching, and furthermore suggest the necessity of exhaustive search to guarantee the best possible implementation.

However, exhaustive searches are frequently infeasible, and moreover, performance can depend critically on input data that may only be known at run-time. We address these two search-related problems in this paper using statistical modeling techniques. Specifically, we propose solutions to the early stopping problem and the problem of run-time implementation selection. Our techniques are designed to complement existing methods developed for these problems.

The early stopping problem arises when it is not possible to perform an exhaustive search (Section 3). Existing tuning systems use a combination of kernel-specific, heuristic performance modeling and empirical search techniques to avoid exhaustive search. We present a complementary technique, based on a simple statistical analysis of the data gathered while a search is on-going. Our method performs only a partial search while still providing an estimate on the performance of the best implementation found.

The run-time selection problem for computational kernels was first posed by Rice (1976), and again more recently by Brewer (1995) (Section 4). Informally, suppose we are given a small number of implementations, each of which is fastest on some class of inputs. We assume we do not know the classes precisely ahead of time, but that we are allowed to collect a sample of performance of the implementations on a subset of all possible inputs. We then address the problem of automatically constructing a set of decision rules which can be applied at run-time to select the best implementation on any given input. We formulate the problem as a statistical classification task and illustrate the variety of models and techniques that can be applied within our framework.

Our analyses are based on data collected from an existing tuning system, PHiPAC (Bilmes et al., 1997, 1998). PHiPAC generates highly-tuned, BLAS compatible dense matrix multiply implementations, and a more detailed overview of PHiPAC appears in Section 2. Our use of PHiPAC is primarily to supply sample performance data on which we can demonstrate the statistical methods of this paper. (For complete implementations of the BLAS, we recommend the use of either the ATLAS tuning system or any number of existing hand-tuned libraries when available. In particular, ATLAS improves on PHiPAC ideas, extending their applicability to the entire BLAS standard (Whaley et al., 2001).)

The ideas and timing of this paper are indicative of a general trend in the use of search-based methods at various stages of application development. We review the diverse body of related research projects in Section 5. While the present study adopts the perspective of tuning specific computational kernels, for which it is possible to obtain a significant fraction of peak performance by exploiting all kernel properties, the larger body of related work seeks to apply the idea of searching more generally: within the compiler, within the run-time system, and within specialized applications or problem-solving environments. Collectively, these studies imply a variety of general software architectures for empirical search-based tuning.

    2 The Case for Searching

We motivate the need for empirical search methods using matrix multiply performance data as a case study. We show that, within a particular space of performance optimization (tuning) parameters, (1) performance can be a surprisingly complex function of the parameters, (2) performance behavior in these spaces varies markedly from architecture to architecture, and (3) the very best implementation in this space can be hard to find. Taken together, these observations suggest that a purely static modeling approach will be insufficient to find the best choice of parameters.

2.1 FACTORS INFLUENCING MATRIX MULTIPLY PERFORMANCE

We briefly review the classical optimization strategies for matrix multiply, and make a number of observations that justify some of the assumptions of this paper (Section 2.2 in particular). Roughly speaking, the optimization techniques fall into two broad categories: (1) cache- and TLB-level optimizations, such as cache tiling (blocking) and copy optimization (e.g. as described by Lam et al. (1991) or by Goto with respect to TLB considerations (Goto and van de Geijn, 2002)), and (2) register-level and instruction-level optimizations, such as register-level tiling, loop unrolling, software pipelining, and prefetching. Our argument motivating search is based on the surprisingly complex performance behavior observed within the space of register- and instruction-level optimizations, so it is important to understand what role such optimizations play in overall performance.

For cache optimizations, a variety of sophisticated static models have been developed for kernels like matrix multiply to help understand cache behavior, to predict optimal tile sizes, and to transform loops to improve temporal locality (Lam et al., 1991; Wolf and Lam, 1991; Carr and Kennedy, 1992; McKinley et al., 1996; Fraguela et al., 1999; Ghosh et al., 1999; Chatterjee et al., 2001). Some of these models are expensive to evaluate due to the complexity of accurately modeling interactions between the processor and various levels of the memory hierarchy (Mitchell et al., 1998).1 Moreover, the pay-off due to tiling, although significant, may ultimately account for only a fraction of performance improvement in a well-tuned code. Recently, Parello et al. (2002) showed that cache-level optimizations accounted for 12–20% of the possible performance improvement in a well-tuned dense matrix multiply implementation on an Alpha 21264 processor based machine, and the remainder of the performance improvement came from register- and instruction-level optimizations.

To give some additional intuition for how these two classes of optimizations contribute to overall performance, consider the following experiment comparing matrix multiply performance for a sequence of n × n matrices. Figure 1 shows examples of the cumulative contribution to performance (Mflop/s) for matrix multiply implementations in which (1) only cache tiling and copy optimization have been applied (shown by solid squares), and (2) register-level tiling, software pipelining, and prefetching have been applied in conjunction with these cache optimizations (shown by triangles). These implementations were generated with PHiPAC, discussed below in more detail (Section 2.2). In addition, we show the performance of a reference implementation consisting of three nested loops coded in C and compiled with full optimizations using a vendor compiler (solid line), and a hand-tuned implementation provided by the hardware vendor (solid circles). The platform used in Figure 1 (top) is a workstation based on a 333 MHz Sun Ultra 2i processor with a 2 MB L2 cache and the Sun v6 C compiler, and in Figure 1 (bottom) is an 800 MHz Intel Mobile Pentium III processor with a 256 KB L2 cache and the Intel C compiler. On the Pentium III, we also show the performance of the hand-tuned, assembly-coded library by Goto and van de Geijn (2002), shown by asterisks.

On the Ultra 2i, the cache-only implementation is 1.7× faster than the reference implementation for large n, but only 42% as fast as the automatically generated implementation with both cache- and register-level optimizations. On the Pentium III, the cache-only implementation is 3.9× faster than the reference, and about 55–60% of the speed of the register- and cache-optimized code. Furthermore, the PHiPAC-generated code matches or closely approaches that of the hand-tuned codes. On the Pentium III, the PHiPAC routine is within 5–10% of the performance of the assembly-coded routine by Goto at large n (Goto and van de Geijn, 2002). Thus, while cache-level optimizations significantly increase performance over the reference implementation, applying them together with register- and instruction-level optimizations is critical to approaching the performance of hand-tuned code.

Fig. 1 Performance (Mflop/s) of n × n matrix multiply for a workstation based on the Sun Ultra 2i processor (top) and an 800 MHz Mobile Pentium III processor (bottom). The theoretical peaks are 667 and 800 Mflop/s, respectively. We include values of n that are powers of 2. While copy optimization (shown by cyan squares) improves performance significantly compared to the reference (purple solid line), register- and instruction-level optimizations (red triangles) are critical to approaching the performance of hand-tuned code.

These observations are an important part of our argument (Section 2.2) motivating empirical search-based methods. First, we focus exclusively on performance in the space of register- and instruction-level optimizations on in-cache matrix workloads. The justification is that this class of optimizations is essential to achieving high performance. Even if we extend the estimate by Parello et al. (2002), from the observation that 12–20% of overall performance is due to cache-level optimizations to a range of 12–60% based on Figure 1, there is still a considerable margin for further performance improvements from register- and instruction-level optimizations. Secondly, we explore this space using the PHiPAC generator. Because PHiPAC-generated code can achieve good performance in practice, we claim this generator is a reasonable one to use.

2.2 A NEEDLE IN A HAYSTACK: THE NEED FOR SEARCH

To demonstrate the necessity of search-based methods, we next examine performance within the space of register-tiled implementations. The automatically generated implementations of Figure 1 were created using the parametrized code generator provided by the PHiPAC matrix multiply tuning system (Bilmes et al., 1997, 1998). (Although PHiPAC is no longer actively maintained, for this paper the PHiPAC generator has been modified to include some software pipelining styles and prefetching options developed for the ATLAS system (Whaley et al., 2001).) This generator implements register- and instruction-level optimizations including (1) register tiling where non-square tile sizes are allowed, (2) loop unrolling, and (3) a choice of software pipelining strategies and insertion of prefetch instructions. The output of the generator is an implementation in either C or Fortran in which the register-tiled code fragment is fully unrolled; thus, the system relies on an existing compiler to perform the instruction scheduling.

PHiPAC searches the combinatorially large space defined by possible optimizations in building its implementation. To limit search time, machine parameters (such as the number of registers available and cache sizes) are used to restrict tile sizes. In spite of this and other search-space pruning heuristics, searches can generally take many hours or even a day depending on the user-selectable thoroughness of the search. Nevertheless, as we suggest in Figure 1, performance can be comparable to hand-tuned implementations.

Consider the following experiment in which we fixed a particular software pipelining strategy and explored the space of possible register tile sizes on 11 different platforms. As it happens, this space is three-dimensional (3D) and we index it by integer triplets (m0, k0, n0).2 Using heuristics based on the maximum number of registers available, this space was pruned to contain between 500 and 10000 reasonable implementations per platform.

Figure 2 (top) shows what fraction of implementations (y-axis) achieved at least a given fraction of machine peak (x-axis), on a workload in which all matrix operands fit within the largest available cache. On two machines, a relatively large fraction of implementations achieve close to machine peak: 10% of implementations on the Power2/133 and 3% on the Itanium 2/900 are within 90% of machine peak. By contrast, only 1.7% on a uniprocessor Cray T3E node, 0.2% on the Pentium III-M/800, and fewer than 4% on a Sun Ultra 2i/333 achieved more than 80% of machine peak. Also, on a majority of the platforms, fewer than 1% of implementations were within 5% of the best. Worse still, nearly 30% of implementations on the Cray T3E ran at less than 15% of machine peak. Two important ideas emerge from these observations: (1) different machines can display widely different characteristics, making generalization of search properties across them difficult; (2) finding the very best implementations is akin to finding a needle in a haystack.

The latter difficulty is illustrated more clearly in Figure 2 (bottom), which shows a 2D slice (k0 = 1) of the 3D tile space on the Ultra 2i/333. The plot is color coded from dark blue = 66 Mflop/s to red = 615 Mflop/s, and the lone red square at (m0 = 2, n0 = 3) was the fastest. The black region in the upper-right of Figure 2 (bottom) was pruned (i.e. not searched) based on the number of registers. We see that performance is not a smooth function of algorithmic details as we might have expected. Thus, accurate sampling, interpolation, or other modeling of this space is difficult. Like Figure 2 (top), this motivates empirical search.
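To make the size and pruning of such a register-tile search space concrete, the following Python sketch enumerates integer triplets (m0, k0, n0) and discards those that would exceed a hypothetical register budget. The specific bound used here (an m0 × n0 tile of C plus a column of A and a row of B must fit in the register file) and the limits max_dim and num_registers are illustrative assumptions, not PHiPAC's actual pruning heuristic.

    # Sketch: enumerate a 3D register-tile space and prune it by a simple
    # register-pressure rule. The rule and the limits below are assumptions
    # for illustration only, not PHiPAC's exact heuristic.
    from itertools import product

    def candidate_tiles(max_dim=12, num_registers=32):
        tiles = []
        for m0, k0, n0 in product(range(1, max_dim + 1), repeat=3):
            regs_needed = m0 * n0 + m0 + n0   # C tile + column of A + row of B
            if regs_needed <= num_registers:
                tiles.append((m0, k0, n0))
        return tiles

    # Example: len(candidate_tiles()) is the number of implementations the
    # generator would have to benchmark if this space were searched exhaustively.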

    3 A Statistical Early Stopping Criterion

While an exhaustive search can guarantee finding the best implementation within the space of implementations considered, such searches can be demanding, requiring dedicated machine time for long periods. If we assume that search will be performed only once per platform, then an exhaustive search may be justified. However, users today are more frequently running tuning systems themselves, or may wish to build kernels that are customized for their particular application or non-standard hardware configuration. Furthermore, the notion of run-time searching, as pursued in dynamic optimization systems (Section 5), demands extensive search-space pruning.

Thus far, tuning systems have sought to prune the search spaces using heuristics and performance models specific to their code generators. Here, we consider a complementary method for stopping a search early based only on performance data gathered during the search. In particular, Figure 2 (top), described in the previous section, suggests that even when we cannot otherwise model the space, we do have access to the statistical distribution of performance. On-line estimation of this distribution is the key idea behind the following early stopping criterion. This criterion allows a user to ask that the search stop as soon as it is sufficiently probable that the performance of the best implementation observed so far is within some specified fraction of the best possible within the space.

Fig. 2 (Top) The fraction of implementations (y-axis) attaining at least a given level of peak machine speed (x-axis) on six platforms. The distributions of performance vary dramatically across platforms. (Bottom) A two-dimensional (2D) slice of the 3D register tile space on the Sun Ultra 2i/333 platform. Each square represents an implementation (an m0 × 1 × n0 tile) shaded by performance (color scale in Mflop/s). The fastest occurs at (m0 = 2, n0 = 3), having achieved 615 Mflop/s out of a 667 Mflop/s peak. The dark region extending to the upper-right has been pruned from the search. Finding the optimal point in these highly irregular spaces can be like looking for a needle in a haystack.

3.1 A FORMAL MODEL AND STOPPING CRITERION

The following is a formal model of the search process. Suppose there are N possible implementations. When we generate implementation i, we measure its performance x_i. Assume that each x_i is normalized so that max_i x_i = 1. (We discuss the issue of normalization further in Section 3.1.2.) Define the space of implementations as S = {x_1, ..., x_N}. Let X be a random variable corresponding to the value of an element drawn uniformly at random from S, and let n(x) be the number of elements of S less than or equal to x. Then X has a cumulative distribution function (cdf) F(x) = Pr[X ≤ x] = n(x)/N. At time t, where t is an integer between 1 and N inclusive, suppose we generate an implementation at random without replacement. Let X_t be a random variable corresponding to the observed performance, and furthermore let M_t = max_{1 ≤ i ≤ t} X_i be the random variable corresponding to the maximum observed performance up to time t.

We can now ask the following question at each time t: what is the probability that M_t is at least 1 − ε, where ε is chosen by the user or library developer based on performance requirements? When this probability exceeds some desired threshold 1 − α, also specified by the user, then we stop the search. Formally, this stopping criterion can be expressed as

    \Pr[ M_t > 1 - \epsilon ] > 1 - \alpha

or, equivalently,

    \Pr[ M_t \le 1 - \epsilon ] \le \alpha.    (1)

Let G_t(x) = Pr[M_t ≤ x] be the cdf for M_t. We refer to G_t(x) as the max-distribution. Given F(x), the max-distribution, and thus the left-hand side of Equation (1), can be computed exactly as we show in Section 3.1.1. However, because F(x) cannot be known until an entire search has been completed, we must approximate the max-distribution. We use the standard approximation for F(x) based on the current sampled performance data up to time t, the so-called empirical cdf (ecdf) for X. In Section 3.1.2 we present our early stopping procedure based on these ideas, and we discuss the issues that arise in practice.

3.1.1 Computing the Max-Distribution Exactly and Approximately. We can explicitly compute the max-distribution as follows. First, observe that

    G_t(x) = \Pr[M_t \le x] = \Pr[X_1 \le x, \ldots, X_t \le x].

Recall that the search proceeds by choosing implementations uniformly at random without replacement. We can look at the calculation of the max-distribution as a counting problem. At time t, there are \binom{N}{t} ways to have selected t implementations. Of these, the number of ways to choose t implementations, all with performance at most x, is \binom{n(x)}{t}, provided n(x) ≥ t. To cover the case when n(x) < t, let \binom{a}{b} = 0 when a < b for notational ease. Thus,

    G_t(x) = \binom{n(x)}{t} \Big/ \binom{N}{t} = \binom{N F(x)}{t} \Big/ \binom{N}{t}    (2)

where the latter equality follows from the definition of F(x). We cannot evaluate the max-distribution after t < N samples because of its dependence on F(x). However, we can use the t observed samples to approximate F(x) using the ecdf \hat{F}_t(x) based on the t samples,

    \hat{F}_t(x) = \frac{n_t(x)}{t}    (3)

where n_t(x) is the number of observed samples that are at most x at time t. We can now approximate G_t(x) by the following \hat{G}_t(x):

    \hat{G}_t(x) = \binom{\lceil N \hat{F}_t(x) \rceil}{t} \Big/ \binom{N}{t}.    (4)

The ceiling ensures that we evaluate the binomial coefficient in the numerator using an integer. Thus, our empirical stopping criterion is to stop the search at the first time t at which

    \hat{G}_t(1 - \epsilon) \le \alpha.    (5)
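As a concrete illustration, the following minimal Python sketch evaluates the approximate max-distribution of equation (4) and the empirical stopping test of equation (5). It assumes the observed samples have already been rescaled so that the current best is 1, and that the total space size N is known; the function names are illustrative.

    # Sketch: evaluate Ghat_t(x) of equation (4) and the stopping test (5).
    # `samples` are performance values rescaled so the current best is 1.0;
    # N is the total number of implementations in the (pruned) space.
    from math import ceil

    def ghat(samples, N, x):
        """Approximate Pr[M_t <= x]: binom(ceil(N*Fhat_t(x)), t) / binom(N, t)."""
        t = len(samples)
        n_t = sum(1 for s in samples if s <= x)   # n_t(x)
        m = ceil(N * n_t / t)                      # ceil(N * Fhat_t(x))
        ratio = 1.0
        for i in range(t):                         # binom(m, t) / binom(N, t)
            ratio *= (m - i) / (N - i)
            if ratio == 0.0:
                break
        return ratio

    def should_stop(samples, N, eps=0.05, alpha=0.10):
        """Stop when Pr[M_t <= 1 - eps] <= alpha (equation (5))."""
        return ghat(samples, N, 1.0 - eps) <= alpha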

3.1.2 Using the Stopping Criterion in Practice. In practice, the best performance within the space, which equation (5) needs for normalization, is not known until the search has completed. We therefore rescale the observed samples by the current maximum, so that the ecdf is always computed relative to the best implementation found so far, and we apply the stopping test only after some minimum number of samples t_min has been drawn. This rescaling can overestimate the fraction of implementations near the true maximum. As an artificial example, suppose every one of the first t_min samples happens to run at exactly half the speed of the true best implementation. After the first t_min samples, under our rescaling policy, all samples will be renormalized to one, the ecdf evaluated at 1 − ε will be zero for any ε > 0, and the stopping condition will be immediately satisfied; but the realized performance will be only 1/2. While this artificial example might seem unrepresentative of distributions arising in practice (as we verify in Section 3.2), it is important to note the potential pitfalls.

3.2 RESULTS AND DISCUSSION USING PHIPAC DATA

We applied the above model to the register tile space data for the platforms shown in Figure 2 (top). Specifically, on each platform, we simulated 300 searches using a random permutation of the exhaustive search data collected for Figure 2 (top). For various values of ε and α, we measured (1) the average stopping time over all searches, and (2) the average proximity in performance of the implementation found to the best found by exhaustive search.
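The simulation itself is straightforward to express. The sketch below shuffles a list of measured performance values, applies the rescaled stopping test after each new sample, and reports the stopping time t/N and the proximity 1 − M̄_t. The performance list perf, the minimum sample count t_min, and the inline criterion are illustrative assumptions rather than the exact experimental harness used for the figures.

    # Sketch: simulate one early-stopping search over exhaustively measured data.
    # `perf` is a list of measured Mflop/s values, one per implementation.
    import random
    from math import ceil

    def run_search(perf, eps=0.05, alpha=0.10, t_min=10):
        N = len(perf)
        best_true = max(perf)
        order = random.sample(perf, N)           # draw without replacement
        seen = []
        for t, x in enumerate(order, start=1):
            seen.append(x)
            if t < t_min:
                continue
            cur_max = max(seen)                  # rescale by the current maximum
            n_t = sum(1 for s in seen if s / cur_max <= 1.0 - eps)
            m = ceil(N * n_t / t)
            g = 1.0
            for i in range(t):                   # binom(m, t) / binom(N, t)
                g *= (m - i) / (N - i)
                if g == 0.0:
                    break
            if g <= alpha:                       # stopping criterion (5)
                return t / N, 1.0 - cur_max / best_true
        return 1.0, 0.0                          # searched the whole space

    # Example: average over 300 simulated searches, as in the text.
    # results = [run_search(perf) for _ in range(300)]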

Figures 3–6 show the results for the Intel Itanium 2, Alpha 21164 (Cray T3E node), Sun Ultra 2i, and Intel Mobile Pentium III platforms, respectively. The top half of Figures 3–6 shows the average stopping time as a fraction of the search space size for various values of ε and α. That is, each plot shows at what value of t/N the empirical stopping criterion, Equation (5), was satisfied. Because our rescaling procedure will tend to overestimate the fraction of implementations near the maximum (as discussed in Section 3.1.2), we need to check that the performance of the implementation chosen is indeed close to (if not well within) the specified tolerance ε when α is small, and moreover what constitutes a small α. Therefore, the bottom half of Figures 3–6 shows the average proximity to the best performance when the search stopped. More specifically, for each (ε, α) we show 1 − M̄_t, where M̄_t is the average observed maximum at the time t when Equation (5) was satisfied. (Note that M̄_t is the true performance where the maximum performance is taken to be 1.)

Suppose the user selects ε = 0.05 and α = 0.1, and then begins the search. These particular parameter values can be interpreted as the request, "stop the search when we find an implementation within 5% of the best with less than 10% uncertainty." Were this search conducted on the Itanium 2, for which many samples exhibit performance near the best within the space, we would observe that the search ends after sampling just under 10.2% of the full space on average (Figure 3 (top)), having found an implementation whose performance was within 2.55% of the best (Figure 3 (bottom)). Note that we requested an implementation within 5% (ε = 0.05), and indeed the distribution of performance on the Itanium 2 is such that we could do even slightly better (2.55%) on average.

From the perspective of stopping a search as soon as possible, the Alpha 21164 T3E node presents a particularly challenging distribution. According to Figure 2 (top), the Alpha 21164 distribution has a relatively long tail, meaning very few implementations are fast. At ε = 0.05 and α = 0.1, Figure 4 (top) shows that indeed we must sample about 70% of the full space. Still, we do find an implementation within about 3% of the best on average. Indeed, for ε = 0.05, we find implementations within 5% of the best for all of the values of α considered. Estimates of Pr[M_t > 1 − ε] could also be computed for various values of ε at the end of the search and reported to the user. A user could stop and resume searches, using these estimates to gauge the likely difficulty of tuning on her particular architecture.

Fig. 3 Average stopping time (top), as a fraction of the total search space, and proximity to the best performance (bottom), as the difference between normalized performance scores, on the Intel Itanium 2/900 platform as functions of the tolerance parameters ε (x-axis) and α (y-axis).

Fig. 4 Average stopping time (top), as a fraction of the total search space, and proximity to the best performance (bottom), as the difference between normalized performance scores, on the DEC Alpha 21164/450 (Cray T3E node) platform as functions of the tolerance parameters ε (x-axis) and α (y-axis).

Fig. 5 Average stopping time (top), as a fraction of the total search space, and proximity to the best performance (bottom), as the difference between normalized performance scores, on the Sun Ultra 2i/333 platform as functions of the tolerance parameters ε (x-axis) and α (y-axis).

Fig. 6 Average stopping time (top), as a fraction of the total search space, and proximity to the best performance (bottom), as the difference between normalized performance scores, on the Intel Mobile Pentium III/800 platform as functions of the tolerance parameters ε (x-axis) and α (y-axis).

Finally, the stopping condition we have presented complements existing pruning techniques; a random search with our stopping criterion can always be applied to any space after pruning by other heuristics or methods.

4 Statistical Classifiers for Run-Time Selection

In the previous sections we assume that a single optimal implementation exists. For some applications, however, several implementations may be optimal depending on the run-time inputs. In this section, we consider the run-time implementation selection problem (Rice, 1976; Brewer, 1995): how can we automatically build decision rules to select the best implementation for a given input?

Below, we treat this problem as a statistical classification task. We show how the problem might be tackled from this perspective by applying three types of statistical models to a matrix multiply example. In this example, given the dimensions of the input matrices, we must choose at run-time one implementation from among three, where each of the three implementations has been tuned for matrices that fit in a particular cache level.

    4.1 A FORMAL FRAMEWORK

We can pose the selection problem as the following classification task. Suppose we are given

1. a set of m "good" implementations of an algorithm, A = {a_1, ..., a_m}, which all give the same output when presented with the same input;

2. a set of n samples S0 = {s_1, s_2, ..., s_n} from the space S of all possible inputs (i.e. S0 ⊆ S), where each s_i is a d-dimensional real vector;

3. the execution time T(a, s) of algorithm a on input s, where a ∈ A and s ∈ S.

Our goal is to find a decision function f(s) that maps an input s to the best implementation in A, i.e. f: S → A. The idea is to construct f(s) using the performance of the implementations in A on a sample of the inputs S0. We refer to S0 as the training set, and we refer to the execution time data T(a, s) for a ∈ A, s ∈ S0 as the training data. In geometric terms, we would like to partition the input space by implementation, as shown in Figure 7 (left). This partitioning would occur at compile (or "build") time. At run-time, the user calls a single routine which, when given an input s, evaluates f(s) to select an implementation.
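In code, this framework amounts to a build-time fitting step over the training data T(a, s) and a run-time evaluation of f(s). The following minimal Python sketch fixes that interface; the class and function names (Selector, fit, choose, best_labels) are hypothetical and are only meant to make the roles of the training set and the decision function explicit.

    # Sketch of the selection framework's interface. All names are illustrative.
    from typing import Dict, Tuple

    Input = Tuple[float, ...]          # an input s, a d-dimensional vector

    class Selector:
        def fit(self, timings: Dict[str, Dict[Input, float]]) -> None:
            """Learn f from the training data: timings[a][s] = T(a, s), s in S0."""
            raise NotImplementedError

        def choose(self, s: Input) -> str:
            """Evaluate f(s): return the implementation to call on input s."""
            raise NotImplementedError

    def best_labels(timings: Dict[str, Dict[Input, float]]) -> Dict[Input, str]:
        """Label each training input with the implementation that ran fastest."""
        inputs = next(iter(timings.values())).keys()
        return {s: min(timings, key=lambda a: timings[a][s]) for s in inputs}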


The decision function f models the relative performance of the implementations in A. Here, we consider three types of statistical models that trade off classification accuracy against the cost of building f and the cost of executing f at run-time. Roughly speaking, we can summarize these models as follows.

1. Parametric data modeling. We can build a parametric statistical model of the execution time data directly. For each implementation, we posit a parametrized model of execution time and use the training data to estimate the parameters of this model (e.g. by linear regression for a linear model). At run-time, we simply evaluate the models to predict the execution time of each implementation. This method has been explored in prior work on run-time selection by Brewer (1995). Because we choose the model of execution time, we can control the cost of evaluating f by varying the complexity of the model (i.e. the number of model parameters).

2. Parametric geometric modeling. Rather than model the execution time directly, we can also model the shape of the partitions in the input space parametrically, by, say, assuming that the boundaries between partitions can be described concisely by parametrized functions. For example, if the input space is 2D, we might posit that each boundary is a straight line, which can of course be described concisely by specifying its slope and intercept. Our task is to estimate the parameters (e.g. slope and intercept) of all boundaries using the training data. Such a model might be appropriate if a sufficiently accurate model of execution time is not known but the boundaries can be modeled. Like parametric data modeling methods, we can control the cost of evaluating f by our choice of functions that represent the boundaries.

3. Non-parametric geometric modeling. Rather than assume that the partition boundaries have a particular shape, we can also construct implicit models of the boundaries in terms of the actual data points. In statistical terms, this type of representation of the boundaries is called non-parametric. In the example of this paper, we use the support vector method to construct just such a non-parametric model (Vapnik, 1998). The advantage of the non-parametric approach is that we do not have to make any explicit assumptions about the input distributions, running times, or geometry of the partitions. However, we will need to store at least some subset of the data points which make up the implicit boundary representation. Thus, the reduction in assumptions comes at the price of more expensive evaluation and storage of f compared to a parametric method.

This categorization of models implies a fourth method: non-parametric data modeling. Such models are certainly possible, for example, by the use of support vector regression to construct a non-parametric model of the data (Smola and Schölkopf, 1998). We do not consider these models in this paper.

To illustrate the classification framework, we apply the above three models to a matrix multiply example. Specifically, consider the operation C ← C + AB, where A, B, and C are dense matrices of size M × K, K × N, and M × N, respectively, as shown in Figure 7 (right). Note that these three parameters make the input space S three-dimensional. In PHiPAC, it is possible to generate different implementations tuned on different matrix workloads (Bilmes et al., 1998). Essentially, this involves conducting a search where the size of the matrices on which the implementations are benchmarked is specified so that the matrices fit within a particular cache level. For instance, we could have three implementations: one tuned for the matrix sizes that fit approximately within L1 cache, those that fit within L2, and all larger sizes.

We compare the accuracy of the above modeling methods using two metrics. First, we use the average misclassification rate, i.e. the fraction of test samples mispredicted. We always choose the test set S' to exclude the training data S0, that is, S' ⊆ (S − S0). However, if the performance difference between two implementations is small, a misprediction may still be acceptable. Thus, our second comparison metric is the slow-down of the predicted implementation relative to the true best. That is, for each point in the test set, we compute the relative slow-down Δ = t_selected / t_best − 1, where t_selected and t_best are the execution times of the predicted and best algorithms for a given input, respectively. For a given modeling technique, we consider the distribution of slow-downs for points in the test set.
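Both metrics are easy to compute from the timing data; the short Python sketch below does so for a dictionary of timings and a dictionary of predictions. The variable names (predictions, timings) are illustrative.

    # Sketch: misclassification rate and slow-down distribution on a test set.
    # predictions[s] = implementation chosen by f(s); timings[a][s] = T(a, s).
    def evaluate(predictions, timings):
        errors = 0
        slowdowns = []
        for s, chosen in predictions.items():
            best = min(timings, key=lambda a: timings[a][s])
            if chosen != best:
                errors += 1
            slowdowns.append(timings[chosen][s] / timings[best][s] - 1.0)
        return errors / len(predictions), sorted(slowdowns)

    # Example: fraction of predictions with more than a 5% slow-down.
    # rate, dist = evaluate(predictions, timings)
    # frac = sum(d > 0.05 for d in dist) / len(dist)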

4.2 PARAMETRIC DATA MODEL: LINEAR REGRESSION MODELING

In our first approach, we postulate a parametric model for the running time of each implementation. Then at run-time, we choose the fastest implementation based on the execution time predicted by the models. This approach was adopted by Brewer (1995). For matrix multiply on matrices of size N × N, we might guess that the running time of implementation a will have the form

    T_a(N) = \beta_3 N^3 + \beta_2 N^2 + \beta_1 N + \beta_0



where we can use standard regression techniques to determine the coefficients β_k, given the running times on some sample inputs S0. The decision function is just f(s) = argmin_{a ∈ A} T_a(s).

One strength of this approach is that the models, and thus the accuracy of prediction as well as the cost of making a prediction, can be as simple or as complicated as desired. For example, for matrices of more general sizes (M, K, N), we might hypothesize a model T_a(M, K, N) with linear coefficients and the terms MKN, MK, KN, MN, M, K, N, and 1:

    T_a(M, K, N) = \beta_7 MKN + \beta_6 MK + \beta_5 KN + \beta_4 MN + \beta_3 M + \beta_2 K + \beta_1 N + \beta_0.    (6)

We can even eliminate terms whose coefficients are small to reduce the run-time cost of generating a prediction. For matrix multiply, a simple model of this form could be automatically derived by an analysis of the three-nested-loops structure. However, in general it might be difficult to determine a sufficiently precise parametric form that captures the interaction effects between the processor and all levels of the memory hierarchy. Moreover, for other more complicated kernels or algorithms having, say, more complicated control flow such as recursion or conditional branches, such a model may be more difficult to derive.
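As a concrete sketch of this approach, the Python code below fits the coefficients of equation (6) by ordinary least squares for each implementation and then selects the implementation with the smallest predicted time. NumPy's lstsq stands in for whatever regression routine one prefers, and the function names are illustrative.

    # Sketch: parametric data model of Section 4.2 (equation (6)).
    import numpy as np

    def features(M, K, N):
        # terms MKN, MK, KN, MN, M, K, N, 1 of equation (6)
        return np.array([M * K * N, M * K, K * N, M * N, M, K, N, 1.0])

    def fit_time_model(inputs, times):
        """inputs: list of (M, K, N); times: measured T(a, s) for one implementation."""
        X = np.array([features(*s) for s in inputs])
        beta, *_ = np.linalg.lstsq(X, np.array(times), rcond=None)
        return beta

    def select(models, s):
        """models: dict name -> fitted coefficients; return argmin of predicted time."""
        phi = features(*s)
        return min(models, key=lambda a: float(phi @ models[a]))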

4.3 PARAMETRIC GEOMETRIC MODEL: SEPARATING HYPERPLANES

One geometric approach is to first assume that there are some number of boundaries, each described parametrically, that divide the implementations, and then find best-fitting boundaries with respect to an appropriate cost function.

Formally, associate with each implementation a a weight function w_a(s), parametrized by θ_a, which returns a value between 0 and 1 for some input value s. Furthermore, let the weights satisfy the property Σ_{a ∈ A} w_a(s) = 1. Our decision function selects the algorithm with the highest weight on input s, f(s) = argmax_{a ∈ A} {w_a(s)}. We can compute the parameters θ_{a_1}, ..., θ_{a_m} (and thus the weights) so as to minimize the following weighted execution time over the training set:

    C(\theta_{a_1}, \ldots, \theta_{a_m}) = \frac{1}{|S_0|} \sum_{s \in S_0} \sum_{a \in A} w_a(s) \, T(a, s).    (7)

Intuitively, if we view w_a(s) as a probability of selecting algorithm a on input s, then C is a measure of the expected execution time if we first choose an input uniformly at random from S0, and then choose an implementation with the probabilities given by the weights on input s.

In this formulation, inputs s with large execution times T(a, s) will tend to dominate the optimization. Thus, if all inputs are considered to be equally important, it may be desirable to use some form of normalized execution time. We defer a more detailed discussion of this issue to Section 4.5.

Fig. 7 (Left) Geometric interpretation of the run-time selection problem: a hypothetical 2D input space in which one of three algorithms runs fastest in some region of the space. Our goal is to partition the input space by algorithm. (Right) The matrix multiply operation C ← C + AB is specified by three dimensions, M, K, and N.

Of the many possible choices for w_a(·), we choose the logistic function

    w_a(s) = \frac{\exp(\theta_a^T s + \theta_{a,0})}{\sum_{b \in A} \exp(\theta_b^T s + \theta_{b,0})}    (8)

where θ_a has the same dimensions as s, and θ_{a,0} is an additional parameter to estimate. Note that the denominator ensures that Σ_{a ∈ A} w_a(s) = 1. While there is some statistical motivation for choosing the logistic function (Jordan, 1995), in this case it also turns out that the derivatives of the weights are particularly easy to compute. Thus, we can estimate θ_a and θ_{a,0} by minimizing Equation (7) numerically using Newton's method.

A nice property of the weight function is that f is cheap to evaluate at run-time; the linear form θ_a^T s + θ_{a,0} costs O(d) operations to evaluate, where d is the dimension of the space. However, the primary disadvantage of this approach is that the same linear form makes this formulation equivalent to asking for hyperplane boundaries to partition the space. Hyperplanes may not be a good way to separate the input space, as we shall see below. Of course, other forms are certainly possible, but positing their precise form a priori might not be obvious, and more complicated forms could also complicate the numerical optimization.
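A small numerical sketch of this model follows: softmax ("logistic") weights over a linear score per implementation, with the parameters chosen to minimize the cost of equation (7) on the training set. The paper minimizes with Newton's method; the sketch below leans on SciPy's general-purpose BFGS optimizer as a stand-in, and all names are illustrative.

    # Sketch: parametric geometric model of Section 4.3 (equations (7) and (8)).
    import numpy as np
    from scipy.optimize import minimize

    def fit_weights(S0, T):
        """S0: (n, d) array of inputs; T: (n, m) execution times (or 0/1 losses)."""
        n, d = S0.shape
        m = T.shape[1]
        X = np.hstack([S0, np.ones((n, 1))])      # append the bias term theta_{a,0}

        def cost(theta_flat):
            theta = theta_flat.reshape(m, d + 1)
            z = X @ theta.T                        # (n, m) linear scores
            z -= z.max(axis=1, keepdims=True)      # numerically stable softmax
            w = np.exp(z)
            w /= w.sum(axis=1, keepdims=True)      # w_a(s) of equation (8)
            return float(np.mean(np.sum(w * T, axis=1)))   # equation (7)

        res = minimize(cost, np.zeros(m * (d + 1)), method="BFGS")  # Newton in the paper
        return res.x.reshape(m, d + 1)

    def select(theta, s):
        """f(s) = argmax_a w_a(s); equivalent to argmax of the linear scores."""
        return int(np.argmax(theta @ np.append(np.asarray(s, float), 1.0)))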

4.4 NON-PARAMETRIC GEOMETRIC MODEL: SUPPORT VECTORS

Techniques exist to model the partition boundaries non-parametrically. The support vector (SV) method is one way to construct just such a non-parametric model, given a labeled sample of points in the space (Vapnik, 1998). Specifically, each training sample s_i ∈ S0 is given a label l_i ∈ A to indicate which implementation was fastest on input s_i. That is, the training points are assigned to classes by implementation. The SV method then computes a partitioning by selecting a subset of training points that best represents the location of the boundaries, where by "best" we mean that the minimum geometric distance between classes is maximized.4 The resulting decision function f(s) is essentially a linear combination of terms with the factor K(s_i, s), where only s_i in the selected subset are used, and K is some symmetric positive definite function. Ideally, K is chosen to suit the data, but there are also a variety of standard choices for K as well. We refer the reader to the description by Vapnik (1998) for more details on the theory and implementation of the method.

The SV method is regarded as a state-of-the-art method for the task of statistical classification on many kinds of data, and we include it in our discussion as a kind of practical upper bound on prediction accuracy. However, the time to compute f(s) is up to a factor of |S0| greater than that of the other methods because some fraction of the training points must be retained to evaluate f. Thus, evaluation of f(s) is possibly much more expensive to calculate at run-time than either of the other two methods.
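A brief sketch of building such a classifier from labeled training points follows. The experiments in Section 4.5 use Platt's SMO algorithm with a Gaussian kernel and C = 100; here scikit-learn's SVC (an RBF-kernel SVM that handles the multiclass case internally) is used as a convenient stand-in, so the library choice and names are assumptions.

    # Sketch: non-parametric geometric model of Section 4.4 (support vectors).
    import numpy as np
    from sklearn.svm import SVC

    def fit_selector(S0, labels):
        """S0: (n, d) array of inputs; labels[i] = index of the fastest implementation."""
        clf = SVC(kernel="rbf", C=100.0)   # Gaussian kernel, C = 100 as in the text
        clf.fit(S0, labels)
        return clf

    def select(clf, s):
        """f(s): predict which implementation to run on input s."""
        return int(clf.predict(np.asarray(s, dtype=float).reshape(1, -1))[0])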

4.5 RESULTS AND DISCUSSION WITH PHIPAC DATA

We offer a brief comparison of the three methods on the matrix multiply example described in Section 4.1, using PHiPAC to generate the implementations on a Sun Ultra 1/170 workstation with a 16 KB L1 cache and a 512 KB L2 cache.

4.5.1 Experimental Setup. To evaluate the prediction accuracy of the three run-time selection algorithms, we conducted the following experiment. First, we built three matrix multiply implementations using PHiPAC: (a) one with only register-level tiling, (b) one with register + L1 tiling, and (c) one with register, L1, and L2 tiling. We considered the performance of these implementations within a 2D cross-section of the full 3D input space in which M = N and 1 ≤ M, K, N ≤ 800. We selected ten disjoint subsets of points in this space, where each subset contained 1936 points chosen at random.5 Then we further divided each subset into 500 testing points and 1436 training points. We trained and tested the three statistical models (details below), measuring the prediction accuracy on each test set.

In Figure 8, we show an example of a 500-point test set from this space, where each point is color-coded by the implementation which ran fastest. The implementation which was fastest on the majority of inputs is the default implementation generated by PHiPAC containing full tiling optimizations, and is shown by a blue "x". Thus, a useful reference is a baseline predictor which always chooses this implementation: the misclassification rate of this predictor was 24%. The implementation using only register tiling makes up the banana-shaped region in the center of Figure 8, shown by a red "o". The register and L1 tiled implementation, shown by a green asterisk ("*"), was fastest on a minority of points in the lower left-hand corner of the space. Observe that the space has complicated boundaries and is not obviously cleanly separable.



The three statistical models were implemented as follows.

- We implemented the linear least-squares regression method as described in Section 4.2, Equation (6). Because the least-squares fit is based on choosing the fit parameters to minimize the total square error between the execution time data and the model predictions, errors in the larger problem sizes will contribute more significantly to the total squared error than smaller sizes, and therefore tend to dominate the fit. This could be adjusted by using weighted least-squares methods, or by normalizing execution time differently. We do not pursue these variations here.

- For the separating hyperplane method outlined in Section 4.3, we built a model using six hyperplanes in order to try to better capture the central region in which the register-only implementation was fastest. Furthermore, we replaced the execution time T(a, s) in Equation (7) by a binary execution time T'(a, s) such that T'(a, s) = 0 if a was the fastest on input s, and otherwise T'(a, s) = 1. (We also compared this binary scheme to a variety of other notions of execution time, including normalizing each T(a, s) by MKN to put all execution time data on a similar scale. However, we found the binary notion of time gave the best results in terms of the average misclassification rate on this particular data set.)

- For the support vector method of Section 4.4, we used the sequential minimal optimization algorithm of Platt (1999) with a Gaussian kernel for the function K(·, ·). In Platt's algorithm, we set the tuning parameter C = 100. We built multiclass classifiers from ensembles of binary classifiers, as described by Vapnik (1998).

Below, we report on the overall misclassification rate for each model as the average over all of the ten test sets.

4.5.2 Results and Discussion. Qualitative examples of the predictions made by the three models on a sample test set are shown in Figures 9–11. The regression method captures the boundaries roughly but does not correctly model one of the implementations (upper-left of Figure 9). The separating hyperplane method is a poor qualitative fit to the data. The SV method appears to produce the best predictions. Quantitatively, the misclassification rates, averaged over the ten test sets, were 34% for the regression predictor, 31% for the separating hyperplanes predictor, and 12% for the support vector predictor. Only the support vector predictor significantly outperformed the baseline predictor.

Fig. 8 A "truth map" showing the regions in which particular implementations are fastest. The points shown represent a 500-point sample of a 2D slice (specifically, M = N) of the input space. An implementation with only register tiling is shown with a red "o"; one with L1 and register tiling is shown with a green "*"; one with register, L1, and L2 tiling is shown with a blue "x". The baseline predictor always chooses the blue algorithm. The average misclassification rate for this baseline predictor is 24.5%.

Fig. 9 Sample classification results for the regression predictor on the same 500-point sample shown in Figure 8. The average misclassification rate for this predictor was 34%.

However, misclassification rate seems too strict a measure of prediction performance, because we may be willing to tolerate some penalties to obtain a fast prediction. Therefore, we also show the distribution of slow-downs due to mispredictions in Figure 12. Each curve depicts this distribution for one of the four predictors. The distribution shown is for one of the ten trials, which yielded the lowest misclassification rate. Slow-down appears on the x-axis, and the fraction of predictions on all 1936 points (including both testing and training points) exceeding a given slow-down is shown on the y-axis.

Consider the baseline predictor (solid blue line with "+" markers). Note that only 5–6% of predictions led to slow-downs of more than 5%, and that only about 0.4% of predictions led to slow-downs of more than 10%. Noting the discretization, evidently only one out of the 1936 cases led to a slow-down of more than 47%, with no implementations being between 18–47% slower. These data indicate that the baseline predictor performs fairly well, and that furthermore the performance of the three tuned implementations is fairly similar. Therefore, we do not expect to improve upon the baseline predictor by much. This hypothesis is borne out by observing the slow-down distributions of the separating hyperplane and regression predictors (green circles and red "x" markers, respectively), neither of which improves significantly (if at all) over the baseline.

However, we also see that for slow-downs of up to 5% (and, to a lesser extent, up to 10%), the support vector predictor (cyan "*" markers) shows a significant improvement over the baseline predictor. It is possible that this difference would be significant in some applications with very strict performance requirements, thereby justifying the use of the more complex statistical model. Furthermore, had the differences in execution time between implementations been larger, the support vector predictor would have appeared even more attractive.

There are a number of cross-over points in Figure 12. For instance, comparing the regression and separating hyperplanes methods, we see that even though the overall misclassification rate for the separating hyperplanes predictor is lower than the regression predictor, the tail of the distribution for the regression predictor becomes much smaller. A similar cross-over exists between the baseline and support vector predictors. These cross-overs suggest the possibility of hybrid schemes that combine predictors or take different actions on inputs in the tails of these distributions, provided these inputs could somehow be identified or otherwise isolated.

In terms of prediction times (i.e. the time to evaluate f(s)), both the regression and separating hyperplane methods lead to reasonably fast predictors. Prediction times were roughly equivalent to the execution time of a 3 × 3 matrix multiply. By contrast, the prediction cost of the SVM is about that of a 64 × 64 matrix multiply, which would prohibit its use when small sizes occur often. Again, it may be possible to reduce this run-time overhead by a simple conditional test of the input dimensions, or perhaps a hybrid predictor.

Fig. 10 Sample classification results for the separating hyperplanes predictor on the same 500-point sample shown in Figure 8. The average misclassification rate for this predictor was 31%.

Fig. 11 Sample classification results for the support vector predictor on the same 500-point sample shown in Figure 8. The average misclassification rate for this predictor was 12%.

Fig. 12 Each line corresponds to the distribution of slow-downs due to mispredictions on a 1936-point sample for a particular predictor. A point on a given line indicates what fraction of predictions (y-axis) resulted in more than a particular slow-down (x-axis). Note the logarithmic scale on the y-axis.

However, this analysis is not intended to be definitive. For instance, we cannot fairly report on specific training costs due to differences in the implementations in our experimental setting.6 Also, matrix multiply is only one possible application, and we see that it does not stress all of the strengths and weaknesses of the three methods. Furthermore, a user or application might care about only a particular region of the full input space which is different from the one used in our example. Instead, our primary aim is simply to present the general framework and illustrate the issues on actual data. There are many possible models; the examples presented here offer a flavor of the role that statistical modeling of performance data can play.

    5 Related Work: A Survey of Empirical Search-Based Tuning

    There has been a flurry of research activity in the use of empirical search-based approaches to platform-specific code generation and tuning. The primary motivation, as this paper demonstrates for matrix multiply, is the difficulty of instantiating purely static models that predict performance with sufficient accuracy to decide among possible code and data structure transformations. Augmenting such models with observed performance appears to yield viable and promising ways to make these decisions.

    In our review of the diverse body of related work, we note how each study or project addresses the following.

    1. What is the unit of optimization? In a recent position paper on feedback-directed optimization, Smith (2000) argues that a useful way to classify dynamic optimization methods is by the size and semantics of the piece of the program being optimized. Traditional static compilation applies optimizations in units that follow programming language conventions, e.g. within a basic block, within a loop nest, within a procedure, or within a module. By contrast, dynamic (run-time) techniques optimize across units relevant to run-time behavior, e.g. along a sequence of consecutively executed basic blocks (a trace or path).

    Following this classification, we divide the related work on empirical search-based tuning primarily into two high-level categories: kernel-centric tuning and compiler-centric tuning. This paper adopts a kernel-centric perspective in which the unit of optimization is the kernel itself. The code generator, and hence the implementation space, is specific to the kernel. We would expect that a generator specialized to a particular kernel might best exploit mathematical structure or other structure in the data (possibly known only at run-time) relevant to performance. As we discuss below, this approach has been very successful in the domains of linear algebra and signal processing, where understanding problem-specific structure leads to new, tunable algorithms and data structures.

    In the compiler-centric view, the implementation space is defined by the space of possible compiler transformations that can be applied to any program expressed in a general-purpose programming language. In fact, the usual suite of optimizations for matrix multiply can all be expressed as compiler transformations on the standard three-nested loop implementation, and thus it is possible in principle for a compiler to generate the same high-performance implementation that can be generated by hand. However, what makes a specialized generator useful in this instance is that the expert who writes the generator identifies the precise transformations which are hypothesized to be most relevant to improving performance. Moreover, we could not reasonably expect a general purpose compiler to know about all of the possible mathematical transformations or alternative algorithms and data structures for a given kernel; it is precisely these kinds of transformations that have yielded the highest performance for other important computational kernels such as the discrete Fourier transform (DFT) or operations on sparse matrices.

    We view these approaches as complementary, because hybrid approaches are also possible. For instance, in this paper we consider the use of a matrix multiply-specific generator that outputs C or Fortran code, thus leaving aspects of the code generation task (namely, scheduling) to the compiler. What these approaches share is that their respective implementation spaces can be very large and difficult to model. It is the challenge of choosing an implementation that motivates empirical search-based tuning.

    Fig. 12 Each line corresponds to the distribution of slow-downs due to mispredictions on a 1936 point sample for a particular predictor. A point on a given line indicates what fraction of predictions (y-axis) resulted in more than a particular slow-down (x-axis). Note the logarithmic scale on the y-axis.

    2. How should the implementation space be searched? Empirical search-based approaches typically use some combination of modeling and experimentation (i.e. actually running the code) to predict performance and thereby choose implementations. In Section 2 we argue that performance can be a complex function of algorithmic parameters, and therefore may be difficult to model accurately using only static models in practice. This paper explores the use of statistical models, constructed from empirical data, to model performance within the space of implementations. The related work demonstrates that a variety of additional kinds of models are possible. For instance, one idea that has been explored in several projects is the use of evolutionary (genetic) algorithms to model and search the space of implementations.

    3. When to search? The process of searching an implementation space could happen at any time, whether it be strictly off-line (e.g. once per architecture or once per application), strictly at run-time, or in some combination. The cost of an off-line search can presumably be amortized over many uses, while a run-time search can maximally use the information only available at run-time. Again, hybrid approaches are common in practice.

    The question of when to search has implications for software system support. For instance, a strictly off-line approach requires only that a user make calls to a special library or a special search-based compiler. Searching at run-time could also be hidden in a library call, but might also require changes to the run-time system to support dynamic code generation, or dynamic instrumentation or trap-handling to support certain types of profiling. This survey mentions a number of examples.

    Our survey summarizes how various projects and studies have approached these questions, with a primary emphasis on the kernel-centric versus compiler-centric approaches, although again we see these two points of view as complementary. Collectively, these questions imply a variety of possible software architectures for generating code adapted to a particular hardware platform and run-time workload.

    5.1 KERNEL-CENTRIC EMPIRICAL SEARCH-BASED TUNING

    Typical kernel-centric tuning systems contain specialized code generators that exploit specific mathematical properties of the kernel or properties of the data. The target performance goal of these systems is to achieve the performance of hand-tuned code. Most research has focused on tuning in the domains of dense and sparse linear algebra, and signal processing. In these areas, there is a rich mathematical structure relevant to performance to exploit. We review recent developments in these and other areas below. (For alternative views of some of this work, we refer the reader to recent position papers on the notion of active libraries (Veldhuizen and Gannon, 1998) and self-adapting numerical software (Dongarra and Eijkhout, 2003).)

    5.1.1 Dense and Sparse Linear Algebra Dense matrix multiply is among the most important of the computational kernels in dense linear algebra, both because a large fraction (say, 75% or more) of peak speed can be achieved on most machines with proper tuning, and also because many other dense kernels can be expressed as calls to matrix multiply (Kagstrom et al., 1998). The prototype PHiPAC system was an early system for generating automatically tuned implementations of this kernel with cache tiling, register tiling, and a variety of unrolling and software pipelining options (Bilmes et al., 1997). The notion of automatically generating tiled matrix multiply implementations from a concise specification, with the possibility of searching the space of tile sizes for matrix multiply, also appeared in early work by McCalpin and Smotherman (1995). The ATLAS project has since extended the applicability of the PHiPAC prototype to all of the other dense matrix kernels that constitute the BLAS (Whaley et al., 2001). These systems contain specialized, kernel-specific code generators, as discussed in Section 2. Furthermore, most of the search process can be performed completely off-line, once per machine architecture. The final output of these systems is a library implementing the BLAS against which a user can link his/her application.
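    As a rough illustration of the kind of search these systems automate, the following sketch times a cache-tiled matrix multiply for a handful of candidate tile sizes and keeps the fastest. In a real tuning system the candidates would be generated C or Fortran code with register tiling and unrolling as well, so the routine names and parameters here are purely illustrative.

import time
import numpy as np

def tiled_matmul(A, B, tile):
    # Cache-tiled matrix multiply; real generators emit tuned C/Fortran.
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, tile):
        for k in range(0, n, tile):
            for j in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

def search_tile_size(n=512, candidates=(16, 32, 64, 128)):
    # Empirical search: run each candidate implementation, keep the fastest.
    A, B = np.random.rand(n, n), np.random.rand(n, n)
    best_tile, best_time = None, float("inf")
    for t in candidates:
        start = time.perf_counter()
        tiled_matmul(A, B, t)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = t, elapsed
    return best_tile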

    A recent and promising avenue of research relates to the construction of sophisticated new generators. Veldhuizen (1998) and Siek and Lumsdaine (1998) have developed C++ language-based techniques for cleanly expressing dense linear algebra kernels. More recent work by Gunnels et al. (2001) in the FLAME project demonstrates the feasibility of systematic derivation of algorithmic variations for a variety of dense matrix kernels. These variants would be suitable implementations a_i in our run-time selection framework (Section 4). In addition, FLAME provides a new methodology by which one can cleanly generate implementations of kernels that exploit caches. However, these implementations still rely on highly-tuned inner matrix multiply code, which in turn requires register- and instruction-level tuning. Therefore, we view all of these approaches to code generation as complementing empirical search-based register- and instruction-level tuning.

    Another complementary research area is the study of so-called cache-oblivious algorithms, which claim to eliminate the need for cache-level tuning to some extent for a number of computational kernels. Like traditional tiling techniques (Hong and Kung, 1981; Savage, 1995), cache-oblivious algorithms for matrix multiply and LU factorization have been shown to asymptotically minimize data movement among various levels of the memory hierarchy, under certain cache modeling assumptions (Frens and Wise, 1997; Toledo, 1997; Andersen et al., 1999; Frigo et al., 1999). Unlike tiling, cache-oblivious algorithms do not make explicit reference to a tile size tuning parameter, and thus appear to eliminate the need to search for optimal cache tile sizes either by modeling or by empirical search. Furthermore, language-level support now exists both to convert loop nests to recursion automatically (Yi et al., 2000) and also to convert linear array data structures and indexing to recursive formats (Wise et al., 2001). However, we note in Section 2 that at least for matrix multiply, cache-level optimizations account for only a part (perhaps 12-60%, depending on the platform) of the total performance improvement possible, and therefore complement additional register- and instruction-level tuning. The nature of performance in these spaces, as shown in Figure 2, together with recent results showing that even carefully constructed models of the register- and instruction-level implementation space can mispredict (Yotov et al., 2003), imply that empirical search is still necessary for tuning.7
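    For concreteness, the following is a minimal sketch of a cache-oblivious matrix multiply: the recursion halves the largest of the three dimensions until a small base case is reached, so no cache tile size appears anywhere. The base-case cutoff shown here only limits recursion overhead and stands in for the highly-tuned base-case code such algorithms still require.

import numpy as np

def co_matmul(A, B, C, cutoff=32):
    # Cache-oblivious multiply: C += A @ B by recursively splitting the
    # largest of the three dimensions in half.
    # Usage: C = np.zeros((m, n)); co_matmul(A, B, C)
    m, k = A.shape
    n = B.shape[1]
    if max(m, k, n) <= cutoff:
        C += A @ B                       # stands in for a tuned base case
    elif m >= k and m >= n:
        h = m // 2
        co_matmul(A[:h], B, C[:h], cutoff)
        co_matmul(A[h:], B, C[h:], cutoff)
    elif k >= n:
        h = k // 2
        co_matmul(A[:, :h], B[:h], C, cutoff)
        co_matmul(A[:, h:], B[h:], C, cutoff)
    else:
        h = n // 2
        co_matmul(A, B[:, :h], C[:, :h], cutoff)
        co_matmul(A, B[:, h:], C[:, h:], cutoff)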

    For matrix multiply, algorithmic variations that require fewer than O(n^3) flops for n × n matrices, such as Strassen's algorithm, are certainly beyond the kind of transformations we expect general purpose compilers to be able to derive. Furthermore, like cache-oblivious algorithms, practical and highly efficient implementations of Strassen's algorithm still depend on highly-tuned base-case implementations in which register- and instruction-level tuning is critical (Huss-Lederman et al., 1996; Thottethodi et al., 1998).

    The BLAS-tuning ideas have been applied to higher-level, parallel dense linear algebra libraries. In the context of cluster computing in the Grid, Chen et al. (2003) have designed a self-tuning version of the LAPACK library for Clusters (LFC). LFC preserves LAPACK's serial library interface, and decides at run-time whether and how to parallelize a call to a dense linear solve routine, based on the current cluster load. In a similar spirit, Liniker et al. (2002) have applied the idea of run-time selection to the selection of data layout in their distributed parallel version of the BLAS library (Beckmann and Kelley, 1997). Their library, called DESOBLAS, is based on the idea of delayed evaluation; all calls to DESOBLAS library routines return immediately, and are not executed until either a result is explicitly accessed by the user or the user forces an evaluation of all unexecuted calls. At evaluation time, DESOBLAS uses information about the entire sequence of operations that need to be performed to make decisions about how to distribute data. Both LFC and DESOBLAS adopt the library interface approach, but defer optimization until run-time.

    Kernels arising in sparse linear algebra, such as sparse matrix-vector multiply (SpMV), complicate tuning compared to their dense counterparts because performance depends on the non-zero structure of the sparse matrix. For sparse kernels, the user must choose a data structure that minimizes storage of the matrix while still allowing efficient mapping of the kernel to the target architecture. Worse still, the matrix structure may not be known until run-time. Prototype systems exist which allow a user to specify separately both the kernel and the data structure, while a specialized generator (or restructuring compiler) combines the two specifications to generate an actual implementation (Bik and Wijshoff, 1995; Pugh and Shpeisman, 1998; Stodghill, 1997). At present, such systems do not explicitly address the register- and instruction-level tuning issues, nor do they adequately address the run-time problem of choosing a data structure given a sparse matrix. Automatic tuning with respect to these low-level tuning and data structure selection issues has been taken up by recent work on the Sparsity system (Im and Yelick, 1999; Vuduc et al., 2002).

    5.1.2 Digital Signal Processing Recent interest in automatic tuning of digital signal processing (DSP) applications is driven both by the rich mathematical structure of DSP kernels and by the variety of target hardware platforms. One of the best-studied kernels, the DFT, admits derivation of many FFT algorithms. The fast algorithms require significantly fewer flops than a naive DFT implementation, but because different algorithms have different memory access patterns, strictly minimizing flops does not necessarily minimize execution time. The problem of tuning is further complicated by the fact that the target architectures for DSP kernels range widely from general purpose microprocessors and vector architectures to special-purpose DSP chips.

    FFTW was the first tuning system for various flavors of the DFT (Frigo and Johnson, 1998). FFTW is notable for its use of a high-level, symbolic representation of the FFT algorithm, as well as its run-time search which saves and uses performance history information. Search boils down to selecting the best fully-unrolled base case implementations, or equivalently, the base cases with the best instruction scheduling. The search process occurs only at run-time because that is when the problem size is assumed to be known. There have since been additional efforts in signal processing which build on the FFTW ideas. The SPIRAL system is built on top of a symbolic algebra system, allows users to enter customized transforms in an interpreted environment using a high-level tensor notation, and uses a novel search method based on genetic algorithms (Püschel et al., 2001). The performance of the implementations generated by these systems is largely comparable both to one another and to vendor-supplied routines. One distinction between the two systems is that SPIRAL's search is off-line, and carried out for a specific kernel of a given size, whereas FFTW chooses the algorithm at run-time. The most recent FFT tuning system has been the UHFFT system, which is essentially an alternative implementation of FFTW that includes a different implementation of the code generator (Mirkovic et al., 2000). In all three systems, the output of the code generator is either C or Fortran code, and the user interface to a tuned routine is via a library or subroutine call.

    5.1.3 Other Kernel Domains In the area of parallel distributed communications, Vadhiyar et al. (2000) propose techniques to tune automatically the Message Passing Interface (MPI) collective operations. The most efficient implementations of these kernels, which include broadcast, scatter/gather, and reduce, depend on characteristics of the network hardware. Like its tuning system predecessors in dense linear algebra, this prototype for MPI kernels targets the implementation of a standard library interface. Achieved performance meets or exceeds that of vendor-supplied implementations on several platforms. The search for an optimal implementation is conducted entirely off-line, using heuristics to prune the space and a benchmarking workload that stresses message size and number of participating processors, among other features.

    Empirical search-based tuning systems for sorting have shown some promise. Recent work by Arge et al. (2001) demonstrates that algorithms which minimize cache misses under simple but reasonable cache models lead to sorting implementations which are suboptimal in practice. They furthermore stress the importance of register- and instruction-level tuning, and use all of these ideas to propose a new sorting algorithm space with machine-dependent tuning parameters. A preliminary study by Darcy (2002) shows that even for the well-studied quicksort algorithm, an extensive implementation space exists and exhibits distributions of performance like those shown in Figure 2 (top). Lagoudakis and Littman (2000) have shown how the selection problem for sorting can be tackled using statistical methods not considered in this paper, namely, by reinforcement learning techniques. Together, these studies suggest the applicability of search-based methods to non-numerical computational kernels.

    Recently, Baumgartner et al. (2002) have proposed a system to generate entire parallel applications for a class of quantum chemistry computations. Like SPIRAL, this system provides a way for chemists to specify their computation in a high-level notation, and carries out a symbolic search to determine a memory and flop efficient implementation. The authors note that the best implementation depends ultimately on machine-specific parameters. Some heuristics tied to machine parameters (e.g. available memory) guide the search.

    Dolan and Moré (2002) have identified empirical distributions of performance as a mechanism for comparing various mathematical optimization solvers. Specifically, the distributions estimate the probability that the performance of a given solver will be within a given factor of the best performance of all solvers considered. Their data were collected using the on-line optimization server, NEOS, in which users submit optimization jobs to be executed on NEOS-hosted computational servers. The primary aim of their study was to propose a new metric (namely, the distributions themselves) as a way of comparing different optimization solvers. However, what these distributions also show is that Grid-like computing environments can be used to generate a considerable amount of performance data, possibly to be exploited in run-time selection contexts as described in Section 4.
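    A minimal sketch of how such distributions can be computed from raw timings is given below; the function name and the (problems x solvers) layout of the timing matrix are assumptions for illustration only.

import numpy as np

def performance_profile(times, factors):
    # times: array of shape (n_problems, n_solvers) of measured run times.
    # For each solver, return the fraction of problems on which it is
    # within a factor tau of the best solver for that problem.
    ratios = times / times.min(axis=1, keepdims=True)   # ratios >= 1
    return {tau: (ratios <= tau).mean(axis=0) for tau in factors}

# Example: profile = performance_profile(times, factors=[1.0, 1.5, 2.0, 4.0])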

    A key problem in the run-time selection framework we present in Section 4 is the classical statistical learning problem of feature selection. In our case, features are the attributes that define the input space. The matrix multiply example assumes the input matrix dimensions constitute the best features. Can features be identified automatically in a general setting? A number of recent projects have proposed methods, in the context of performance analysis and algorithm selection, which we view as possible solutions. Santiago et al. (2002) apply statistical experimental design methods to program tuning. These methods essentially provide a systematic way to analyze how much hypothesized factors contribute to performance. The most significant contributors identified could constitute suitable features for classification. A different approach has been to codify expert knowledge in the form of a database, recommender, or expert system in particular domains, such as a partial differential equation (PDE) solver (Houstis et al., 2000; Ramakrishnan and Valdés-Pérez, 2000; Lucks and Gladwell, 1992), or a molecular dynamics simulation (Ko and Izaguirre, 2003). In both cases, each algorithmic variation is categorized by manually identified features which would be suitable for statistical modeling.

    What is common to most of the preceding projects is a library-based approach, whether tuning occurs off-line or at run-time. The Active Harmony project seeks to provide a general library API and run-time system that supports run-time selection and run-time parameter tuning in the setting of the Grid (Tapus et al., 2002). This work, although in its early stages, highlights the need for search in new computational environments.

    5.2 COMPILER-CENTRIC EMPIRICAL SEARCH-BASED TUNING

    The idea of using data gathered during program execution to aid compilation has previously appeared in the compiler literature under the broad term feedback-directed optimization (FDO). A recent survey and position paper by Smith (2000) reviewed developments in subareas of FDO including profile-guided compilation (Section 5.2.2) and dynamic optimization (Section 5.2.4). FDO methods are applied to a variety of program representations: source code in a general-purpose high-level language (e.g. C or Java), compiler intermediate form, or even a binary executable. These representations enable transformations to improve performance on general applications, either off-line or at run-time. Binary representations enable optimizations on applications that have shipped or on applications that are delivered as mobile code. The underlying philosophy of FDO is the notion that optimization without reference to actual program behavior is insufficient to generate optimal or near-optimal code.

    In our view, the developments in FDO join renewed efforts in superoptimizers (Section 5.2.1) and the new notion of self-tuning compilers (Section 5.2.3) in an important trend in compilation systems toward the use of empirically-derived models of the underlying machines and programs.

    5.2.1 Superoptimizers Massalin (1987) coined the term superoptimizer for his exhaustive search-based instruction generator. Given a short program, represented as a sequence of (six or so) machine language instructions, the superoptimizer exhaustively searched all possible equivalent instruction sequences for a shorter (and, equivalently at the time, faster) program. Although extremely expensive compared to the usual cost of compilation, the intent of the system was to superoptimize particular bottlenecks off-line. The overall approach represents a noble effort to generate truly optimal code.8

    Joshi et al. (2001) substitute exhaustive search in Massalin's superoptimizer with an automated theorem prover in their Denali superoptimizer. One can think of the prover as acting as a modeler of program performance. Given a sequence of expressions in a C-like notation, Denali uses the automated prover to generate a machine instruction sequence that is provably the fastest implementation possible. However, to make such a proof-based code generation system practical, Denali's authors necessarily had to assume (a) a certain model of the machine (e.g. multiple issue with pipeline dependences specified but fixed instruction latencies), and (b) a particular class of acceptable constructive proofs (i.e. matching proofs). Nevertheless, Denali is able to generate extremely good code for short instruction sequences (roughly 16 instructions in a day's worth of time) representing ALU-bound operations on the Alpha EV6. As the Denali authors note, it might be possible to apply their approach more broadly by refining the instruction latency estimates, particularly for memory operations, with measured data from actual runs, again suggesting a combined modeling and empirical search approach.

    5.2.2 Profile-Guided Compilation and Iterative Compilation The idea behind profile-guided compilation (PGC) is to carry out compiler transformations using information gathered during actual execution runs (Knuth, 1971; Graham et al., 1982). Compilers can instrument code to gather execution frequency statistics at the level of subroutines, basic blocks, or paths. On subsequent compiles, these statistics can be used to enable more aggressive use of classical compiler optimizations (e.g. constant propagation, copy propagation, common subexpression elimination, dead code removal, loop invariant code removal, loop induction variable elimination, global variable migration) along frequent execution paths (Chang et al., 1991; Ball and Larus, 1996). The PGC approach has been extended to help guide prefetch instruction placement on x86 architectures (Barnes, 1999). PGC can be viewed as a form of empirical search in which the implementation space is implicitly defined to be the space of all possible compiler transformations over all inputs, and the user (programmer) directs the search by repeatedly compiling and executing the program.

    The search process of PGC can be automated by replacing the user-driven compile/execute sequence with a compiler-driven one. The term iterative compilation has been coined to refer to such a compiler process (van der Mark et al., 1999; Kisuki et al., 2000). Users annotate their program source with a list of which transformations (e.g. loop unrolling, tiling, software pipelining) should be tried on a particular segment of code, along with any relevant parametric ranges (e.g. a range of loop unrolling depths). The compiler then benchmarks the code fragment under the specified transformations. In a similar vein, Pike and Hilfinger (2002) built tile-size search using simulated annealing into the Titanium compiler, with application to a multigrid solver. The Genetic Algorithm Parallelization System (GAPS) by Nisbet (1998) addressed the problem of compile-time selection of an optimal sequence of serial and parallel loop transformations for scientific applications. GAPS uses a genetic algorithms approach to direct search over the space of possible transformations, with the initial population seeded by a transformation chosen by conventional compiler techniques. The costs in all of these examples are significantly longer compile cycles (i.e. including the costs of running the executable and re-optimizing), but the approach is off-line because the costs are incurred before the application ships. Furthermore, the compile-time costs can be reduced by restricting the iterative compilation process to only known application bottlenecks. In short, what all of these iterative compilation examples demonstrate is the utility of a search-based approach for tuning general codes that requires minimal user intervention.

    5.2.3 Self-Tuning Compilers We use the term self-tuning compiler to refer to recent work in which the compiler itself (e.g. the compiler's internal models for selecting transformations, or the optimization phase ordering) is adapted to the machine architecture. The goal of this class of methods is to avoid significantly increasing compile-times (as occurs in iterative compilation) while still adapting the generated code to the underlying architecture.

    Mitchell et al. (2001) proposed a scheme in which models of various types of memory access patterns are measured for a given machine when the compiler is installed. At analysis time, memory references within loop nests are decomposed and modeled by functions of these canonical patterns. An execution time model is then automatically derived. Instantiating and comparing these models allows the compiler to compare different transformations of the loop nest. Although the predicted execution times are not always accurate in an absolute sense, the early experimental evidence suggests that they may be sufficiently accurate to predict the relative ranking of candidate loop transformations.

    The Meta Optimization project proposes automatic tuning of the compiler's internal priority (or cost) functions (Stephenson et al., 2003). The compiler uses these functions to choose a code generation action based on known characteristics of the program. For example, in deciding whether or not to prefetch a particular memory reference within a loop, the compiler evaluates a binary priority function that considers the current loop trip count estimates, cache parameters, and estimated prefetch latency,9 among other factors. The precise function is usually tuned by the compiler writer. In the Meta Optimization scheme, the compiler implementer specifies these factors, their ranges, and a hypothesized form of the function, and Meta Optimization uses a genetic programming approach to determine (i.e. to evolve) a better form for the function. The candidate functions are evaluated on a benchmark or suite of benchmark programs to choose one. Thus, priority functions can be tuned once for all applications, or for a particular application or class of applications.

    In addition to internal models, another aspect of the compiler subject to heuristics and tuning is the optimization phase ordering, i.e. the order in which optimizations are applied. While this ordering is usually fixed through experimentation by a compiler writer, Cooper et al. (2002a, 2002b) have proposed the notion of an adaptive compiler which experimentally determines the ordering for a given machine. Their compiler uses genetic algorithms to search the space of possible transformation orders. Each transformation order is evaluated against some metric (e.g. execution time or code size) on a predefined set of benchmark programs.
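    To illustrate what a priority function of this kind might look like, the following sketch is a hypothetical binary prefetch heuristic. The particular factors, weights, and threshold are invented for illustration; it is exactly this sort of hand-chosen form and its constants that Meta Optimization would instead evolve automatically.

def should_prefetch(trip_count, stride_bytes, cache_line_bytes,
                    prefetch_latency_cycles, cycles_per_iter,
                    weights=(1.0, 1.0, 1.0), threshold=2.0):
    # Hypothetical priority function: combine a few hand-picked factors
    # into a score and compare it against a threshold.
    long_loop = 1.0 if trip_count > 8 else 0.0
    crosses_lines = min(1.0, stride_bytes / cache_line_bytes)
    can_hide_latency = 1.0 if trip_count * cycles_per_iter > prefetch_latency_cycles else 0.0
    score = (weights[0] * long_loop
             + weights[1] * crosses_lines
             + weights[2] * can_hide_latency)
    return score > threshold   # True: insert a prefetch instruction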

    The Liberty compiler research group has proposed an automated scheme to organize the space of optimization configurations into a small decision tree that can be quickly traversed at compile-time (Triantafyllis et al., 2003). Roughly speaking, their study starts with the Intel IA-64 compiler and identifies the equivalent of k internal binary flags that control optimization. This defines a space of possible configurations of size 2^k. This space is systematically pruned, and a final, significantly smaller set of configurations is selected.10 (In a traditional compiler implementation, a compiler writer would manually choose just one such configuration based on intuition and experimentation.) The final configurations are organized into a decision tree. At compile-time, this tree is traversed and each configuration visited is applied to the code. The effect of the configuration is predicted by a static model, and used to decide which paths to traverse and what final configuration to select. This work combines the model-tuning of the other self-tuning compiler projects and the idea of iterative compilation (except that in this instance, performance is predicted by a static model instead of by running the code).

    5.2.4 Dynamic (Run-Time) Optimization Dynamic optimization refers to the idea of applying compiler optimizations and code gene