Parametric Timing Estimation With Newton-Gregory Formulae


Robert van Engelen, Kyle Gallivan, and Burt Walsh

Department of Computer Science and School of Computational Science and Information Technology

Florida State University

Tallahassee, FL 32306-4530

{engelen,gallivan,walsh}@cs.fsu.edu

Abstract

To determine safe and tight worst-case execution time (WCET) estimates of scientific and multimedia codes that spend most of their execution time in loop iterations, efficient and accurate loop iteration count estimation methods are required. To support dynamic scheduling decisions based on WCET estimates, an effective loop iteration count estimation method should generate parametric formulae that can be evaluated at runtime. Therefore, the loop iteration count estimation methods utilized for WCET estimation must be effective in analyzing loops with symbolic bounds, non-rectangular loops, zero-trip loops, loops with multiple critical paths, and loops with non-unit strides. In this paper we present a novel approach to parametric WCET estimation that handles loops with both affine and non-affine loop bounds efficiently, using a formulation based on Newton-Gregory interpolating polynomials.

1 Introduction and Related Work

Static worst-case execution time (WCET) estimates are used in real-time system scheduling and dynamic voltage scheduling (DVS) [27] [18] [33] [34]. In real-time scheduling, WCET estimates are used to determine a static schedule for a task based on timing constraints. WCET estimates are also used to determine whether a dynamic schedule, such as an earliest deadline first (EDF) [22] schedule, will meet its schedulability requirements. In DVS methods, WCET estimates are used to ensure that a task, or group of tasks, can run as slowly as possible, thus saving energy, while still meeting the required target deadline [39]. Determining a tight WCET for a task from static analysis of the task's code is difficult, because the tightness of a WCET estimate depends on both high-level and low-level factors. High-level factors include program flow information, such as loop structures, conditional flow, and procedure invocation. Low-level factors include cache and pipeline effects. An effective WCET calculation method uses both the high-level and low-level information to determine a WCET estimate.

Timing estimation methods for determining the execution time of code involving loop structures require the combination of precise modeling of low-level loop code costs with a safe and tight loop iteration count estimation method. However, in the current approaches there is a division between methods that model low-level code behavior using cache models and pipeline models, and methods that estimate the number of iterations of a loop nest; see for example [17] for an overview. In general, most analysis methods loosely couple the loop iteration count estimation with the results of the low-level estimators on the cost of the execution of the loop body. More specifically, the WCET calculation methods differ mostly in how loop bounds are formulated and how execution path analysis is applied.

∗Supported in part by NSF grants CCR-0105422, CCR-0208892, EIA-0072043, and DOE grant DEFG02-02ER25543.

Tree-based methods [26] [21] compute the cost of a simple loop by using a formula representing the loop cost as a sum of the cost of its loop body over the loop's iterations. These methods can incorporate cache and pipeline effects, but require user annotations specifying path information and maximum loop counts; they cannot determine infeasible paths.

Path-based methods [16] [36] [24] [2] [17] determine the most expensive path in a program by using abstract interpretation or symbolic execution. This execution can potentially determine loop bounds and infeasible paths, and can support cache and pipeline effects. The most complex of these methods [17] can handle non-rectangular and parametric loops but requires a constant bound for the outermost loop of a loop nest. Methods based on the Implicit Path Enumeration Technique (IPET) [9] [19] [20] compute the WCET estimate from a system of program structural constraints (control flow graph) and program functional constraints (loop bounds, path information), which might be determined by analysis or provided by a user. Most IPET systems use integer linear programming (ILP), which is NP-complete [25], to solve the resulting system. The WCET estimate, which is the objective function of such systems, is the sum of the products of the cost of a basic block and the number of times the block is executed.

Parametric WCET analysis [38] [5] generates formulae that represent the cost of executing code. These parametric formulae can be evaluated at runtime when the values of their parameters are known. The parametric methods are often based upon the latter types of calculation methods. Vivancos [38] uses the path analysis of Healy [16]. The work of Chapman [7] uses symbolic execution of the program and user annotations to describe code structure. Lisper [5] uses a symbolic IPET approach for calculation.

With respect to loop iteration count estimation methods, the number of points in a nested loop with affine bounds is equivalent to the number of points inside the polyhedron bounded by the affine constraints. A variety of techniques exist to count the number of points in a polyhedron to determine the size of the iteration space of a nested loop. The current techniques differ mostly in complexity and accuracy.

Haghihat [12] evaluates nested sums with affine bounds by relying on a compiler with the capacity to recognize generalized induction variables (GIVs). The main concern of his work is accurately summing loops that are possibly zero-trip, but that do not include non-unit strides. A truth value function is used to guard against summing up zero-trip loops. The use of the truth value function can result in min and max operators in the resulting closed-form expression for the sum.

Tawbi [37] uses Bernoulli's formula to evaluate nested sums. She applies polyhedron splitting to ensure that points in the polyhedron are not counted twice. Only affine bounds are handled. Sums of loops involving strides are evaluated by taking an average of the ceiling and floor of the stride expression.

Sakellariou [31] also uses Bernoulli formulas and other algebraic rules to evaluate nested sums. His algorithm generates parametric formulas involving min and max bounds when the index of the innermost sum appears in the guard for a sum or when such a formula is required to guard against zero-trip loops. The method removes the min and max terms by introducing more guards. The method can handle loops with affine bounds and ceiling and floor functions applied to an affine bound. For loops with strides he evaluates the bound as a floor operator and replaces this expression with an expression involving a mod operator. This yields the accuracy of splintering [28] the sum without the complexity. However, when multiple loops use non-unit strides, a closed-form parametric formula cannot be derived.

Pugh [28] uses Presburger formulas, the Omega test [29], and standard math tables to evaluate nested sums. The method can handle sums with affine bounds that contain floor and ceiling operators and can also handle non-unit strides. For loops with ceilings and floors he uses either splintering to break the sum up into many guarded sums or the Omega test to approximate the sum; the method chosen is based upon the accuracy required. Pugh explicitly looks at summing a general function, which can be expressed as a general Presburger formula, over a nested loop. He also has a method for putting the problem in disjunctive normal form using the Omega test.

Clauss [8] uses Ehrhart polynomials to generate an enumerator that represents the number of points inside a convex polyhedron. Clauss uses Loechner's [23] technique for determining the vertices of a polyhedron. This vertex enumeration method can result in polyhedron splitting, like the technique of Tawbi [37]. The method then determines an Ehrhart polynomial representing the number of points inside the polyhedron based upon the dimension of the polyhedron described by the vertices. The resulting polynomial can have periodic coefficients. This occurs in the same cases where Pugh's [28] method would have to create splinters to handle non-unit strides.

In this paper we present a parametric WCET estimation approach that is unique in a number of ways. Firstly, a loop is treated as a weighted sum where each iteration of the loop can have a unique time cost (weight). In general, the method makes no assumptions on the underlying methods used to determine the cost of the statements that span the loop body during the iterations. Secondly, no assumptions are made about the structure of a loop body; a loop body can contain multiple critical paths and can itself be a group of non-perfectly nested loops. The advantage is that the parametric bounding functions can occur at any level of the loop nest and can be determined using a variety of methods, which gives flexibility in adjusting the complexity of the loop cost expressions to the need for tightness of the bounds. Thirdly, because the low-level WCET model is itself an approximation in the presence of cache and pipeline effects, there is no reason to require that the loop iteration count estimation method always be exact for any complicated loop nest, as long as the approximation is sufficiently tight. Therefore, an efficient approximate method for loop iteration count estimation is used. The method described in this paper provides an inexact upper bound on the number of loop iterations for loop nests containing loops with non-unit strides. By modeling loop strides more loosely, the method avoids costly splintering operations that would otherwise cause an exponential growth of the iteration count problem. Fourthly, the method requires only simple numeric/symbolic manipulation capabilities on standardized representations of polynomials. The resulting method produces parametric formulae for the WCET of rectangular and non-rectangular loops, loops with symbolic bounds, zero-trip loops, and loops with non-unit strides. Lastly, the method supports a more general class of codes than methods based upon other current parametric counting methods [32] [28] [12] [8], because the WCET estimation method allows for certain forms of polynomial loop bounds.

The remainder of this paper is organized as follows. Section 2 introduces loop timing estimation using the Newton-Gregory formulae. A method for loop bounds analysis based on Newton series is presented in Section 3. Section 4 describes a technique to estimate the timing of critical loop iteration paths. Finally, some concluding remarks are given in Section 5.

2 Loop Timing Estimation

Most of the execution time of a typical program is spent in executing loops. Therefore, tight timing estimation of loop nests is crucial to estimate the total execution time of a program or time-critical task. This section introduces our approach to treating the execution time cost of a loop as a weighted sum.

2.1 Bounding the Execution Time of Loops

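In the sequel, it is assumed that loops (and loop nests) are normalized. That is:

$$\text{for } i = a \text{ to } b \text{ step } s \text{ do } S \quad\Rightarrow\quad \text{for } i = 0 \text{ to } \left\lfloor \frac{b-a}{s} \right\rfloor \text{ do } S'$$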

where the loop body statements $S$ are adjusted to $S'$ such that all in-scope occurrences of $i$ are replaced by $s * i + a$. In the presented framework, $S$ is not restricted to a simple loop body; rather, $S$ may contain a non-perfectly nested loop nest with function calls and multiple critical paths whose execution behavior depends on the current value of the iteration counter.
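The normalization can be illustrated with a minimal Python sketch. It assumes integer bounds and a positive integer stride; the function names are illustrative and do not come from the paper. The sketch enumerates the values of the original loop counter and the values $s * i + a$ produced by the normalized loop, which should coincide.

```python
def original_iterations(a, b, s):
    """Values taken by i in 'for i = a to b step s' (inclusive bound, s > 0 assumed)."""
    values = []
    i = a
    while i <= b:
        values.append(i)
        i += s
    return values


def normalized_iterations(a, b, s):
    """Values s*i + a for the normalized loop 'for i = 0 to floor((b - a) / s)'."""
    upper = (b - a) // s  # Python floor division, also correct for negative b - a
    # range(upper + 1) is empty when upper < 0, which models a zero-trip loop
    return [s * i + a for i in range(upper + 1)]


if __name__ == "__main__":
    print(original_iterations(3, 20, 4))
    print(normalized_iterations(3, 20, 4))
```

Both functions return an empty list for a zero-trip loop such as `a = 5, b = 1`, which is the case the later sections guard with the max and unit step constructions.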

The total real execution time $\omega$ of a loop can be expressed as an accumulation of the real execution times of the individual operations in the loop, which consist of the loop header $H$ (not to be confused with a natural loop header) and the body $S'$ operations

$$\omega = \omega(H(0)) + \sum_{i=0}^{\lfloor \frac{b-a}{s} \rfloor} \bigl( \omega(S'(i)) + \omega(H(i+1)) \bigr) , \qquad (1)$$

where $\omega(H(0))$ denotes the real start-up time of the loop header (normally the cost of initializing the loop control variable) at iteration $i = 0$, $\omega(S'(i))$ denotes the real execution time of the loop body $S'$ at iteration $i$, and $\omega(H(i+1))$ denotes the real execution time of the loop header operations (normally the incrementing of the loop control variable, the evaluation of the loop test condition, and the branch) required to commence the next iteration or to terminate the loop when $i+1$ exceeds the loop bound.

The total real execution time $\omega$ of the loop can be bounded given safe upper bounds on the real execution times of the loop operations $H$ and $S'$

$$\omega(H(0)) \le c_0$$

$$\omega(S'(i)) + \omega(H(i+1)) \le p(i) ,$$

where $c_0$ is a constant and $p(i) \ge 0$ for $i = 0, \ldots, \lfloor \frac{b-a}{s} \rfloor$ is the bounding function. Given a safe bounding constant $c_0$ and safe bounding function $p$, the real total execution time $\omega$ of the loop is bounded by

$$\omega \le c_0 + \sum_{i=0}^{\lfloor \frac{b-a}{s} \rfloor} p(i) , \qquad (2)$$

where the bounding functions and bounding constants may or may not take pipeline and cache effects into account. The tightness of the computed WCET estimate for the loop depends on the tightness of these bounding functions as well as on the accuracy of the summation.

The summations in (1) and (2) represent an approximation to the cost of the loop over its iteration space. If the upper bound of a summation is negative, the corresponding loop is zero-trip and should have a cost of zero. Due to the semantics of summations, however, the summation will not necessarily have a value of zero when it represents such a zero-trip loop. This can be seen in the following example, which is taken from Haghihat [12]. The sum $\sum_{i=1}^{n} \sum_{j=10}^{i} 1$ evaluates to $n(n-17)/2$, which for $n = 10$ should evaluate to $1$ but instead evaluates to $10(10-17)/2 = -35$. Although sums provide a simple and attractive notation for loop iteration count problems, their semantics provide a poor vehicle to accurately model such problems.
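The mismatch between sum semantics and loop semantics in this example can be reproduced with a short Python sketch. The function names are illustrative only; `closed_form` is the symbolic value $n(n-17)/2$ and `loop_count` simply executes the loop nest.

```python
def closed_form(n):
    """Symbolic value n(n - 17)/2 of the double sum, which ignores empty inner ranges."""
    # n and n - 17 always have opposite parity, so the product is even and // is exact
    return n * (n - 17) // 2


def loop_count(n):
    """Number of iterations actually executed by the corresponding loop nest."""
    count = 0
    for i in range(1, n + 1):
        for j in range(10, i + 1):  # zero-trip whenever i < 10
            count += 1
    return count


if __name__ == "__main__":
    print(closed_form(10), loop_count(10))
```

For `n = 10` the closed form is negative while the loop nest executes exactly one iteration, because the inner loop is zero-trip for every `i < 10` and the symbolic sum counts those iterations as negative contributions.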

2.2 Newton-Gregory Interpolating Polynomials

In this section we present the Newton-Gregory formalism and demonstrate how it is used to determine the execution time cost of a single loop or a more complicated loop nest.

The Newton-Gregory forward formula can be used to model an interpolating polynomial over an interval. Any polynomial of order $k$ has unique Newton-Gregory coefficients $\phi_j$, $j = 0, \ldots, k$, such that

$$p(i) = \sum_{j=0}^{k} \phi_j \binom{i}{j} . \qquad (3)$$
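As an illustration of Eq. (3), the following Python sketch computes Newton-Gregory coefficients with the classical forward-difference table, $\phi_j = \Delta^j p(0)$. This is one standard way to obtain the coefficients and is not necessarily the construction used later in the paper; the function names and the sample polynomial are assumptions made for the example.

```python
from math import comb


def newton_coefficients(p, k):
    """Return [phi_0, ..., phi_k] with phi_j = (Delta^j p)(0) for a polynomial p of order k."""
    values = [p(i) for i in range(k + 1)]
    phi = []
    for _ in range(k + 1):
        phi.append(values[0])
        # replace the table row by its forward differences
        values = [values[t + 1] - values[t] for t in range(len(values) - 1)]
    return phi


def evaluate_newton(phi, i):
    """Evaluate sum_j phi_j * binomial(i, j), the right-hand side of Eq. (3)."""
    return sum(c * comb(i, j) for j, c in enumerate(phi))


def sample_polynomial(i):
    """Hypothetical bounding polynomial p(i) = 3 i^2 + 2 i + 1 used only for illustration."""
    return 3 * i * i + 2 * i + 1


if __name__ == "__main__":
    phi = newton_coefficients(sample_polynomial, 2)
    print(phi, [evaluate_newton(phi, i) for i in range(5)])
```

The Newton form reproduces the polynomial at every non-negative integer argument, not only at the sample points used to build the difference table.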

The Newton series $\phi_j$, $j = 0, \ldots, k$, of a polynomial proves to be particularly useful in dealing with the inconsistency between sum semantics and loop iteration counts. The bound (2) requires a sum over an iteration space that is potentially empty. Given the Newton series of polynomial $p$, it follows from (3) that the sum can be written as

$$\sum_{i=0}^{n-1} p(i) = \sum_{j=0}^{k} \phi_j \binom{n}{j+1} \qquad (4)$$

for all $n > 0$, since

$$\sum_{i=0}^{n-1} \sum_{j=0}^{k} \phi_j \binom{i}{j} = \sum_{j=0}^{k} \phi_j \sum_{i=0}^{n-1} \binom{i}{j} = \sum_{j=0}^{k} \phi_j \binom{n}{j+1} .$$

It is assumed that $p$ represents a bounding function on the cost of a loop iteration. Therefore, a loop iteration cost problem for a loop with symbolic bounds can be translated into a simpler summation problem with constant bounds, i.e. the constant degree $k$ of the polynomial $p$. More importantly, the formulation is applicable to zero-trip loops, because

$$\sum_{j=0}^{k} \phi_j \binom{0}{j+1} = 0 \qquad (5)$$

by the properties of the binomial coefficients. Equations (4) and (5) provide the basic requirements to define a loop cost function.

Definition 2.1 (Interpolating Sum) Let $\Phi(i) = \langle \phi_0, \ldots, \phi_k \rangle_i$ denote the Newton series of a polynomial in $i$. The interpolating sum $\sigma(\Phi(i), n)$ of $\Phi(i)$ over a domain $i = 0, \ldots, n-1$ of size $n \ge 0$ is the polynomial defined by

$$\sigma(\Phi(i), n) \overset{\mathrm{def}}{=} \sum_{j=0}^{k} \phi_j \binom{n}{j+1} .$$
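A minimal Python sketch of the interpolating sum for integer domain sizes is shown below. The Newton series $\langle 1, 5, 6 \rangle$ is assumed to belong to the illustrative polynomial $p(i) = 3i^2 + 2i + 1$ (worked out by hand for this example, not taken from the paper), and the function names are hypothetical.

```python
from math import comb


def interpolating_sum(phi, n):
    """sigma(Phi, n) = sum_j phi_j * binomial(n, j + 1) for an integer n >= 0."""
    return sum(c * comb(n, j + 1) for j, c in enumerate(phi))


def brute_force_sum(p, n):
    """Reference value: p(0) + ... + p(n - 1), which is 0 for n = 0."""
    return sum(p(i) for i in range(n))


def sample_polynomial(i):
    """Illustrative polynomial whose Newton series is assumed to be <1, 5, 6>."""
    return 3 * i * i + 2 * i + 1


if __name__ == "__main__":
    phi = [1, 5, 6]
    for n in range(6):
        print(n, interpolating_sum(phi, n), brute_force_sum(sample_polynomial, n))
```

Because `comb(0, j + 1)` is zero for every `j >= 0`, the function returns zero for an empty domain without any special case, which is the zero-trip property stated in Eq. (5).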

The interpolating sum $\sigma(\Phi(i), n)$ of a Newton series $\Phi(i)$ is a polynomial in $n$. This important property is exploited to compose interpolating sums to determine the (parametric) size of the multi-dimensional iteration space of a nested loop. The interpolating sum of the Newton series of a polynomial is related to the Riemann approximation of the integral of the polynomial

$$\sigma(\Phi(i), x) \approx \int_{0}^{x} p(y)\, dy$$

However, the interpolating sum determines the exact size of the iteration space of a loop by exploiting the discrete formulation of Newton-Gregory interpolating polynomials.

Note that $\sigma(\Phi(i), x)$ can be evaluated for any (symbolic) real value $x \ge 0$, because the binomial coefficients are defined by

$$\binom{x}{j+1} = \frac{1}{(j+1)!}\, x(x-1)(x-2) \cdots (x-j) .$$
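The product form of the binomial coefficient translates directly into code. The sketch below uses Python's `fractions.Fraction` so that rational arguments are handled exactly; the function names are illustrative, and the Newton series $\langle 0, 1 \rangle$ of $p(i) = i$ is used only as a small example.

```python
from fractions import Fraction
from math import comb, factorial


def gen_binomial(x, m):
    """binomial(x, m) = x (x - 1) ... (x - m + 1) / m! for a rational x."""
    product = Fraction(1)
    for t in range(m):
        product *= (x - t)
    return product / factorial(m)


def interpolating_sum_real(phi, x):
    """sigma(Phi, x) = sum_j phi_j * binomial(x, j + 1), evaluated for rational x."""
    return sum(c * gen_binomial(x, j + 1) for j, c in enumerate(phi))


if __name__ == "__main__":
    # integer arguments agree with the ordinary binomial coefficient
    print(gen_binomial(Fraction(7), 3), comb(7, 3))
    # a non-integer domain size for p(i) = i, whose Newton series is <0, 1>
    print(interpolating_sum_real([0, 1], Fraction(5, 2)))
```

Evaluating at a non-integer argument is what allows the stride approximation below to replace $\lfloor (b-a)/s \rfloor + 1$ by $(b-a)/s + 1$.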

Non-unit loop strides can be incorporated in the $\sigma$ function by means of approximation, as is noted by the following lemma.

Lemma 2.2 For any polynomial $p(i) \ge 0$ for $i \ge 0$ with corresponding Newton series $\Phi(i)$, the interpolating sum is bounded as follows

$$\sigma(\Phi(i), \lfloor \tfrac{b-a}{s} \rfloor + 1) \le \sigma(\Phi(i), \tfrac{b-a}{s} + 1)$$

if $(b-a)/s + 1 \ge 0$ and $s \neq 0$, with equality when $s$ evenly divides $b - a$.

Proof. Because $p(i) \ge 0$ for $i \ge 0$, the summation $\sigma(\Phi(i), n) = \sum_{j=0}^{k} \phi_j \binom{n}{j+1}$ is monotonically increasing with increasing iteration space size $n$. Therefore, $\sigma(\Phi(i), \lfloor \tfrac{b-a}{s} \rfloor + 1) \le \sigma(\Phi(i), \tfrac{b-a}{s} + 1)$ because $\lfloor \tfrac{b-a}{s} \rfloor + 1 \le \tfrac{b-a}{s} + 1$ when $(b-a)/s + 1 \ge 0$. □

Definition 2.3 (Parametric WCET) The parametric worst-case execution time (WCET) bound of a (possibly zero-trip) loop with iteration $i = a, \ldots, b$ and stride $s \neq 0$ is

$$\mathit{WCET}(a, b, s) = c_0 + \sigma(\Phi(i), \max(0, \tfrac{b-a}{s} + 1))$$

which bounds the real execution time $\omega$ of the loop, where $\Phi(i)$ is the Newton series representation of the polynomial $p(i)$ over $i = 0, \ldots, \lfloor \tfrac{b-a}{s} \rfloor + 1$ that bounds the execution time of the normalized loop header and body operations $\omega(S'(i)) + \omega(H(i+1)) \le p(i)$, and $c_0$ bounds the initial loop header execution time $\omega(H(0)) \le c_0$.
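The definition can be sketched in Python for the simplest case of a constant per-iteration bound $p(i) = c$, whose Newton series is $\langle c \rangle$ and for which $\sigma(\Phi(i), n) = c\,n$. The sketch assumes integer bounds and a positive stride; the function names and the way the reference cost is accumulated are assumptions made for illustration.

```python
from fractions import Fraction


def wcet_constant_body(c0, c, a, b, s):
    """Definition 2.3 specialised to p(i) = c, where sigma(<c>, n) = c * n."""
    n = max(Fraction(0), Fraction(b - a, s) + 1)  # max(0, (b - a)/s + 1)
    return c0 + c * n


def accumulated_bound(c0, c, a, b, s):
    """Reference cost: c0 once, plus c for every iteration that is really executed."""
    cost = c0
    i = a
    while i <= b:
        cost += c
        i += s
    return cost


if __name__ == "__main__":
    print(wcet_constant_body(1, 2, 0, 10, 3), accumulated_bound(1, 2, 0, 10, 3))
    print(wcet_constant_body(1, 2, 5, 1, 1), accumulated_bound(1, 2, 5, 1, 1))
```

When the stride divides $b - a$ the two values coincide; otherwise the parametric value overestimates by less than one iteration cost, and for a zero-trip loop only $c_0$ remains.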

When the loop bounds $a$ and/or $b$ are symbolic, the WCET estimation is parameterized.

The complexity of the resulting loop cost function depends upon the complexity of both the loop bounds, which can be non-affine, and the polynomial representation of the loop body cost. When the loop body $S'$ includes non-constant time operations such as inner loops with bounds that can be expressed as polynomials in $i$, then the bounding function $p$ on the execution time of the outer loop header and body operations is polynomial. It should be noted that this method is general and can compute sums of weights that are bounded by polynomials. For simple architectures, such as DSP machines with simple pipelines and no caching, where the inner loops are bounded by (symbolic) constants or affine functions of the outer loop counter variables, $p$ is a polynomial of small finite order $k$.

2.3 Newton Series Conversions

The Newton series of a polynomial is unique and can be efficiently computed. It is worth noting that the coefficients of the Chains of Recurrences (CR) pure-sum representation of a polynomial are equivalent to the coefficients of the Newton series of the polynomial [3, 40]. Therefore, the CR algebra construction rules for polynomials can be applied to, e.g., the Horner form of the polynomial. However, a more efficient construction method exists that is based on Newton's triangle [3]. In this paper, the triangle is represented as a matrix to apply series conversions via matrix-vector products.

Given the coefficients $p_j$, $j = 0, \ldots, k$, of polynomial¹ $p(i) = p_0 + p_1 i + p_2 i^2 + \cdots + p_k i^k$, the $k+1$ Newton coefficients $\phi_j$, $j = 0, \ldots, k$, can be directly obtained from the coefficients $p_j$ of the polynomial using Newton's triangle. For $k = 3$, the Newton triangle in matrix form is

$$N_3 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 1 \\ 0 & 0 & 2 & 6 \\ 0 & 0 & 0 & 6 \end{bmatrix}$$

¹The polynomial representation $p(i) = p_0 + p_1 i + p_2 i^2 + \cdots + p_k i^k$ and its vector form $p(i) = [p_0, p_1, \ldots, p_k]_i$ will be used interchangeably throughout this paper.

a. Algorithm to compute Φ = N_k p

    Input: p[0 : k]
    Output: φ[0 : k]
    Declare array m[0 : k] of integer
    for j = 0 to k do
        φ[j] := 0
        m[j] := 0
    enddo
    φ[0] := p[0]
    if k > 0 then
        φ[1] := p[1]
        m[1] := 1
        for i = 2 to k do
            for j = i to 1 step −1 do
                m[j] := j * (m[j − 1] + m[j])
                φ[j] += m[j] * p[i]
            enddo
        enddo
    endif

b. Algorithm to compute p = N_k^{-1} Φ

    Input: φ[0 : k]
    Output: p[0 : k]
    Declare array m[0 : k] of rational
    for j = 0 to k do
        p[j] := 0
        m[j] := 0
    enddo
    p[0] := φ[0]
    if k > 0 then
        p[1] := φ[1]
        m[1] := 1
        for i = 2 to k do
            for j = i to 1 step −1 do
                m[j] := (m[j − 1] − (i − 1) * m[j]) / i
                p[j] += m[j] * φ[i]
            enddo
        enddo
    endif

Figure 1: Newton Series Conversion Algorithms
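The two algorithms of Figure 1 can be transcribed almost line by line into Python. The sketch below keeps the loop structure of the pseudocode and uses `fractions.Fraction` for the rational array of the inverse conversion; the function names `to_newton` and `from_newton` are chosen here and do not appear in the paper.

```python
from fractions import Fraction


def to_newton(p):
    """Figure 1a: Newton coefficients phi = N_k p of the polynomial with coefficients p."""
    k = len(p) - 1
    phi = [0] * (k + 1)
    m = [0] * (k + 1)  # current column of the Newton triangle
    phi[0] = p[0]
    if k > 0:
        phi[1] = p[1]
        m[1] = 1
        for i in range(2, k + 1):
            for j in range(i, 0, -1):  # descending, so m[j - 1] is still the old column
                m[j] = j * (m[j - 1] + m[j])
                phi[j] += m[j] * p[i]
    return phi


def from_newton(phi):
    """Figure 1b: polynomial coefficients p = N_k^{-1} phi of a Newton series."""
    k = len(phi) - 1
    p = [Fraction(0)] * (k + 1)
    m = [Fraction(0)] * (k + 1)  # current column of the inverse Newton triangle
    p[0] = Fraction(phi[0])
    if k > 0:
        p[1] = Fraction(phi[1])
        m[1] = Fraction(1)
        for i in range(2, k + 1):
            for j in range(i, 0, -1):
                m[j] = (m[j - 1] - (i - 1) * m[j]) / i
                p[j] += m[j] * phi[i]
    return p


if __name__ == "__main__":
    print(to_newton([0, 0, 0, 1]))    # Newton series of i^3, last column of N_3
    print(from_newton([0, 1, 6, 6]))  # converts back to the coefficients of i^3
```

Only one column of each triangle is held in `m` at a time, which is how the pseudocode achieves linear temporary storage while still visiting a quadratic number of entries.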

The Newton series $\Phi(i)$ of a polynomial is obtained by the matrix-vector product $N_k p(i)$ of the Newton triangle $N_k$ and the vector of polynomial coefficients $p(i) = [p_0, \ldots, p_k]_i$.

The Newton triangle can be formed by a two-term recurrence [3]. Using $O(k^2)$ operations while requiring $O(k)$ temporary storage space, the algorithm shown in Figure 1a computes the coefficients $\phi_j$ from the coefficients $p_j$, that is, the matrix-vector product $\Phi(i) = N_k p(i)$, where $N_k$ is the Newton triangle, $p(i) = [p_0, \ldots, p_k]_i$ is a vector of polynomial coefficients, and $\Phi(i) = \langle \phi_0, \ldots, \phi_k \rangle_i$ is a Newton series represented by a vector of Newton coefficients $\phi_j$, $j = 0, \ldots, k$.

A similar two-term recurrence exists for the inverse Newton triangle $N_k^{-1}$ required to compute the polynomial $p(i) = N_k^{-1} \Phi(i)$ of a Newton series $\Phi(i)$. For $k = 3$, the inverse Newton triangle matrix is

$$N_3^{-1} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & -\frac{1}{2} & \frac{1}{3} \\ 0 & 0 & \frac{1}{2} & -\frac{1}{2} \\ 0 & 0 & 0 & \frac{1}{6} \end{bmatrix}$$

An algorithm to compute the polynomial of a Newton series is shown in Figure 1b. The Newton triangles are used to sum the values of polynomials over (parametric) domains. To this end, the interpolating sum can be represented in matrix notation as follows.

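Definition 2.4 Let $N_k$ be the Newton triangle of order $k$ and $N_{k+1}^{-1}$ be the inverse Newton triangle of order $k + 1$. The sum matrix is defined as

$$\Sigma_{k+1} = N_{k+1}^{-1} Z_{k+1} N_k$$

where

$$Z_{k+1} = \begin{bmatrix} 0 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}$$

is the $(k+2) \times (k+1)$ unit shift matrix.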

The following notation is used to denote the application of a sum matrix to a vector of polynomial coefficients.

Definition 2.5 Let $p(i) = [p_0, \ldots, p_k]_i$ be polynomial and let $\Sigma_{k+1}$ be the sum matrix as defined in Definition 2.4. Then, the sum of $p(i)$ over domain $i = 0, \ldots, n-1$ is a polynomial defined as

$$s(n) = [s_0, \ldots, s_{k+1}]_n = [\Sigma_{k+1}\, p(i)]_n$$

where $s_0 = 0$.

The sum $s(n) = [\Sigma_{k+1}\, p(i)]_n$ of a polynomial $p(i)$ of order $k$ is a polynomial of at most order $k + 1$. The polynomial $s(n)$ is identical to the interpolating sum of the Newton series $\Phi(i)$ of $p(i)$ over $i = 0, \ldots, n-1$, as is stated more formally in the following lemma.

Lemma 2.6 Let $p(i) = [p_0, \ldots, p_k]_i$ be polynomial over domain $i = 0, \ldots, n-1$ and let $\Phi(i) = N_k\, p(i)$ be the Newton series of $p(i)$. Then,

$$\sigma(\Phi(i), n) = [\Sigma_{k+1}\, p(i)]_n$$
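The sum matrix of Definition 2.4 can be built explicitly and checked against direct summation. The following Python sketch constructs both triangles column by column from their two-term recurrences, forms $N_{k+1}^{-1} Z_{k+1} N_k$, and applies it to a coefficient vector. All function names are illustrative, and exact rational arithmetic via `fractions.Fraction` is an implementation choice made here.

```python
from fractions import Fraction


def newton_triangle(k):
    """N_k: column n holds the Newton coefficients of the monomial i**n."""
    N = [[0] * (k + 1) for _ in range(k + 1)]
    N[0][0] = 1
    for n in range(1, k + 1):
        for j in range(1, n + 1):
            N[j][n] = j * (N[j - 1][n - 1] + N[j][n - 1])
    return N


def inverse_newton_triangle(k):
    """Inverse of N_k: column j holds the power coefficients of binomial(i, j)."""
    M = [[Fraction(0)] * (k + 1) for _ in range(k + 1)]
    M[0][0] = Fraction(1)
    for j in range(1, k + 1):
        for n in range(1, j + 1):
            M[n][j] = (M[n - 1][j - 1] - (j - 1) * M[n][j - 1]) / j
    return M


def sum_matrix(k):
    """Sigma_{k+1} = N_{k+1}^{-1} Z_{k+1} N_k, a (k+2) x (k+1) matrix."""
    shifted = [[0] * (k + 1)] + newton_triangle(k)  # Z_{k+1} N_k: prepend a zero row
    n_inv = inverse_newton_triangle(k + 1)
    return [[sum(n_inv[r][t] * shifted[t][c] for t in range(k + 2))
             for c in range(k + 1)]
            for r in range(k + 2)]


def sum_polynomial(p):
    """Coefficients [s_0, ..., s_{k+1}] of s(n) = p(0) + ... + p(n - 1)."""
    sigma = sum_matrix(len(p) - 1)
    return [sum(row[c] * p[c] for c in range(len(p))) for row in sigma]


def evaluate(coeffs, n):
    """Evaluate a polynomial given in vector form at n."""
    return sum(c * n ** e for e, c in enumerate(coeffs))


if __name__ == "__main__":
    print(sum_polynomial([3, -1]))  # sum of 3 - j over j = 0, ..., n - 1
```

Multiplying by $Z_{k+1}$ only shifts the Newton coefficients down by one position, which corresponds to replacing each $\binom{i}{j}$ by $\binom{n}{j+1}$ in the interpolating sum.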

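For the sake of convenience, we will use $[\Sigma\, p(i)]_n$ to denote the sum $[\Sigma_{k+1}\, p(i)]_n$ of a $k$-order polynomial $p(i)$. Because $\Sigma$ is a linear operation (for any $k$), the following identities hold

$$[\Sigma\, c\, x]_n = c\, [\Sigma\, x]_n \quad \text{for any (symbolic) constant } c \qquad (6)$$

$$[\Sigma (x + y)]_n = [\Sigma\, x]_n + [\Sigma\, y]_n \qquad (7)$$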

To incorporate the constraint $n \ge 0$ on the number of loop iterations $n$, the sum $[\Sigma\, p(i)]_{\max(0,n)}$ is used (see also Definition 2.3). To manipulate max bounds with summations, unit step functions [13] are used. Unit step functions are based on truth-functions, attributed to Kenneth Iverson who introduced the notation in APL (see also [11] p. 24 and [32]). Other methods such as conditional expressions are also applicable. However, unit step functions appear to be particularly effective, as shown below.

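Definition 2.7 The unit step function $\mu$ is defined as

$$\mu(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

and the complement unit step function $\bar\mu$ is defined as

$$\bar\mu(x) = 1 - \mu(x) = \begin{cases} 1 & \text{if } x \le 0 \\ 0 & \text{otherwise} \end{cases}$$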

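The unit step function has the following properties.

Lemma 2.8 For any $x, y \in \mathbb{R}$ the following hold:

$$\mu(x)\,\bar\mu(x) = 0 \qquad (8)$$

$$\mu(cx) = \mu(\mathrm{sign}(c)\, x) \qquad (9)$$

$$\mu(xy) = \mu(x)\mu(y) + \mu(-x)\mu(-y) \qquad (10)$$

$$\mu(x)\, x = \max(0, x) \qquad (11)$$

$$(\mu(x))^n = \mu(x) \quad \text{for } n \in \mathbb{N},\ n > 0 \qquad (12)$$

$$\mu(\min(x, y)) = \mu(x)\mu(y) \qquad (13)$$

$$\mu(\max(x, y)) = \mu(x) + \mu(y) - \mu(x)\mu(y) \qquad (14)$$

Proof. See [13]. □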
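The properties can be spot-checked numerically. The Python sketch below defines the unit step and its complement as in Definition 2.7 and tests identities (8) to (14) on a small set of sample values; it is a sanity check under the chosen samples, not a proof, and the helper names are illustrative.

```python
def mu(x):
    """Unit step: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def mu_bar(x):
    """Complement unit step: 1 if x <= 0, otherwise 0."""
    return 1 - mu(x)


def sign(c):
    """Sign of c as -1, 0 or 1."""
    return (c > 0) - (c < 0)


def properties_hold(samples):
    """Check identities (8)-(14) for every pair drawn from the sample values."""
    for x in samples:
        if mu(x) * mu_bar(x) != 0:                            # (8)
            return False
        if mu(x) * x != max(0, x):                            # (11)
            return False
        if any(mu(x) ** n != mu(x) for n in (1, 2, 3)):       # (12)
            return False
        for y in samples:
            if mu(y * x) != mu(sign(y) * x):                  # (9) with c = y
                return False
            if mu(x * y) != mu(x) * mu(y) + mu(-x) * mu(-y):  # (10)
                return False
            if mu(min(x, y)) != mu(x) * mu(y):                # (13)
                return False
            if mu(max(x, y)) != mu(x) + mu(y) - mu(x) * mu(y):  # (14)
                return False
    return True


if __name__ == "__main__":
    print(properties_hold([-3, -2.5, -1, 0, 0.5, 1, 2, 5]))
```

Identity (11) is the one used next to turn a $\max(0, n)$ bound into a product with $\mu(n)$.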

The following property is used to translate the max term into a simpler expression based on Eq. (11).

Corollary 2.9 Let $p(i)$ be polynomial. Then,

$$[\Sigma\, p(i)]_{\max(0,n)} = \mu(n)\, [\Sigma\, p(i)]_n$$

Proof. By Eq. (11), $[\Sigma\, p(i)]_{\max(0,n)} = [\Sigma\, p(i)]_{\mu(n)n} = s(\mu(n)n) = [s_0, \ldots, s_{k+1}]_{\mu(n)n}$, where $s(\mu(n)n)$ is polynomial by Definition 2.5 and Lemma 2.6 with $s_0 = 0$. Thus,

$$\begin{aligned}
s(\mu(n)n) &= s_0 + s_1 \mu(n) n + s_2 (\mu(n) n)^2 + \cdots + s_{k+1} (\mu(n) n)^{k+1} \\
&= s_0 + s_1 \mu(n) n + s_2 \mu(n) n^2 + \cdots + s_{k+1} \mu(n) n^{k+1} \\
&= s_0 + \mu(n) (s_1 n + s_2 n^2 + \cdots + s_{k+1} n^{k+1}) \\
&\overset{(s_0 = 0)}{=} \mu(n) (s_0 + s_1 n + s_2 n^2 + \cdots + s_{k+1} n^{k+1}) \\
&= \mu(n)\, s(n) = \mu(n)\, [\Sigma\, p(i)]_n
\end{aligned}$$

using Eq. (12). □

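The composition of sums guarded by unit step functions requires a splitting technique. Sums are split on unit step functions as follows.

Lemma 2.10 Let $p(i)$ be polynomial. Then,

$$[\Sigma\, \mu(x-i)\, p(i)]_n = \bar\mu(n-x-1) [\Sigma\, p(i)]_n + \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1} \qquad (15)$$

$$[\Sigma\, \bar\mu(x-i)\, p(i)]_n = \mu(n-x-1) ([\Sigma\, p(i)]_n - \mu(x) [\Sigma\, p(i)]_{x+1}) \qquad (16)$$

$$[\Sigma\, \mu(i-x)\, p(i)]_n = \mu(n-x-1) ([\Sigma\, p(i)]_n - \mu(x) [\Sigma\, p(i)]_{x+1}) \qquad (17)$$

$$[\Sigma\, \bar\mu(i-x)\, p(i)]_n = \bar\mu(n-x-1) [\Sigma\, p(i)]_n + \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1} \qquad (18)$$

Proof. The unit step function $\mu(x-i)$ intersects the domain when $0 < x < n-1$. Thus,

$$[\Sigma\, \mu(x-i)\, p(i)]_n = \begin{cases} [\Sigma\, p(i)]_n & \text{if } x \ge n-1 \\ [\Sigma\, p(i)]_{x+1} & \text{if } 0 < x < n-1 \\ 0 & \text{otherwise} \end{cases} = \bar\mu(n-x-1) [\Sigma\, p(i)]_n + \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1}$$

Similarly,

$$[\Sigma\, \mu(i-x)\, p(i)]_n = \begin{cases} [\Sigma\, p(i)]_n - [\Sigma\, p(i)]_{x+1} & \text{if } 0 < x < n-1 \\ [\Sigma\, p(i)]_n & \text{if } x < 0 < n-1 \\ 0 & \text{otherwise} \end{cases} = \mu(n-x-1) ([\Sigma\, p(i)]_n - \mu(x) [\Sigma\, p(i)]_{x+1})$$

The other identities are shown to hold as follows

$$\begin{aligned}
[\Sigma\, \bar\mu(x-i)\, p(i)]_n &= [\Sigma\, p(i)]_n - [\Sigma\, \mu(x-i)\, p(i)]_n \\
&= [\Sigma\, p(i)]_n - \bar\mu(n-x-1) [\Sigma\, p(i)]_n - \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1} \\
&= [\Sigma\, p(i)]_n - [\Sigma\, p(i)]_n + \mu(n-x-1) [\Sigma\, p(i)]_n - \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1} \\
&= \mu(n-x-1) ([\Sigma\, p(i)]_n - \mu(x) [\Sigma\, p(i)]_{x+1}) \\
[\Sigma\, \bar\mu(i-x)\, p(i)]_n &= [\Sigma\, p(i)]_n - [\Sigma\, \mu(i-x)\, p(i)]_n \\
&= [\Sigma\, p(i)]_n - \mu(n-x-1) [\Sigma\, p(i)]_n + \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1} \\
&= \bar\mu(n-x-1) [\Sigma\, p(i)]_n + \mu(n-x-1) \mu(x) [\Sigma\, p(i)]_{x+1}
\end{aligned}$$

□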

Note that the splitting is performed in the continuous domain, because the splitting point $x$ may not be integer. When introducing the step functions to guard zero-trip loops, this is not a problem because the upper bound estimate introduced by Lemma 2.2 and Corollary 2.9 is sound. However, the explicit use of step functions with affine bounds to partition loop iteration spaces may require adjustments to ensure that the upper bound property of the WCET is preserved (see the discussion of Lemma 4.1).

Because the manipulation of symbolic expressions is mainly performed using numeric elementary matrix operations with symbolic expressions, the approach requires a significantly lower number of operations compared to techniques based on guarded sums [32], Ehrhart polynomials [8] or Presburger formulae [28]. However, as a consequence of the reduced cost, the approach may yield an approximate iteration count for certain non-rectangular loop nests and loop nests with strides. For WCET estimation based on conservative approximate cost models, approximations are acceptable. Considering the fact that low-level execution time analysis is imprecise, the overall execution time estimation is imprecise even when exact counting methods are used.

The following example illustrates the application of sums and unit step functions to determine the size of the iteration space of a loop nest.

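Example 2.11 Suppose $s(n) = \sum_{i=1}^{n} \sum_{j=3}^{i} \sum_{k=j}^{5} 1$. In normalized form $s(n) = \sum_{i=0}^{n-1} \sum_{j=0}^{i-2} \sum_{k=0}^{2-j} 1$, which leads to

$$\begin{aligned}
s(n) &= [\Sigma\, \mu(i-1) [\Sigma\, \mu(3-j) [\Sigma\, 1]_{3-j}]_{i-1}]_n \\
&= [\Sigma\, \mu(i-1) [\Sigma\, \mu(3-j) [3, -1]_j]_{i-1}]_n \\
&= [\Sigma\, \mu(i-1) (\bar\mu(i-5) [\Sigma [3, -1]_j]_{i-1} + \mu(i-5) \mu(3) [\Sigma [3, -1]_j]_4)]_n \\
&= [\Sigma\, \mu(i-1) (\bar\mu(i-5) [-4, \tfrac{9}{2}, -\tfrac{1}{2}]_i + \mu(i-5) [-4, \tfrac{9}{2}, -\tfrac{1}{2}]_4)]_n \\
&= [\Sigma\, \mu(i-1) (\bar\mu(i-5) [-4, \tfrac{9}{2}, -\tfrac{1}{2}]_i + \mu(i-5)\, 6)]_n \\
&= \mu(n-2) (X_n - \mu(1) X_2) \quad \text{where } X_n = [\Sigma (\bar\mu(i-5) [-4, \tfrac{9}{2}, -\tfrac{1}{2}]_i + \mu(i-5)\, 6)]_n \\
&= \mu(n-2) (X_n - X_2) \quad \text{where } X_n = [\Sigma\, \bar\mu(i-5) [-4, \tfrac{9}{2}, -\tfrac{1}{2}]_i]_n + [\Sigma\, \mu(i-5)\, 6]_n \\
&= \mu(n-2) (X_n - X_2) \quad \text{where } X_n = \bar\mu(n-6) [0, -\tfrac{19}{3}, \tfrac{5}{2}, -\tfrac{1}{6}]_n + \mu(n-6)\, 16 + [\Sigma\, \mu(i-5)\, 6]_n \\
&= \mu(n-2) (X_n - X_2) \quad \text{where } X_n = \bar\mu(n-6) [0, -\tfrac{19}{3}, \tfrac{5}{2}, -\tfrac{1}{6}]_n + \mu(n-6) [-20, 6]_n \\
&= \mu(n-2) (\bar\mu(n-6) [0, -\tfrac{19}{3}, \tfrac{5}{2}, -\tfrac{1}{6}]_n + \mu(n-6) [-20, 6]_n + 4) \\
&= \mu(n-2) \left( \bar\mu(n-6) \left( \tfrac{-38n + 15n^2 - n^3}{6} \right) + \mu(n-6) (6n - 20) + 4 \right) \\
&= \begin{cases} \frac{-38n + 15n^2 - n^3}{6} + 4 & \text{if } n > 2 \wedge n \le 6 \\ 6n - 16 & \text{if } n > 6 \\ 0 & \text{if } n \le 2 \end{cases}
\end{aligned}$$

◇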
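The piecewise result of Example 2.11 can be checked against a direct enumeration of the triple loop. The Python sketch below is such a check; the function names are illustrative, and `fractions.Fraction` keeps the cubic branch exact.

```python
from fractions import Fraction


def brute_force(n):
    """Count the iterations of: for i = 1..n, for j = 3..i, for k = j..5."""
    return sum(1
               for i in range(1, n + 1)
               for j in range(3, i + 1)
               for k in range(j, 6))


def parametric(n):
    """Piecewise parametric formula obtained at the end of Example 2.11."""
    if n <= 2:
        return 0
    if n <= 6:
        return Fraction(-38 * n + 15 * n ** 2 - n ** 3, 6) + 4
    return 6 * n - 16


if __name__ == "__main__":
    for n in range(0, 10):
        print(n, brute_force(n), parametric(n))
```

For every `n` tried in the loop above the two values agree, including the boundary at `n = 6`, where the cubic and the linear branch meet.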

The example produces a parametric formula that is similar to [8] and [13]. However, in contrast to other methods, the derivation of the formula is significantly simpler and based on matrix-vector multiplication and arithmetic with simple symbolic manipulation techniques. The method produces an upper-bound approximation for loop nests with non-unit strides, as illustrated in the following example.

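Example 2.12 Consider the following loop nest

    for i = 0 to n − 1 do
        for j = 0 to 2 ∗ i step k do
            S

The exact size of the iteration space is given by the non-closed formula

$$s(n, k) = \sum_{i=0}^{n-1} \sum_{j=0}^{\lfloor \frac{2i}{k} \rfloor} 1 = \sum_{i=0}^{n-1} \left\lfloor \tfrac{2i}{k} + 1 \right\rfloor$$

where $k \ge 1$ is assumed. Current iteration count methods [32, 13, 8, 28] cannot produce a closed-form formula with symbolic stride $k$ due to various issues in the splintering process. The derivation of the parametric bound using our approach proceeds as follows.

$$\begin{aligned}
s(n, k) &= [\Sigma\, \mu(i + \tfrac{k}{2}) [\Sigma\, 1]_{\lfloor \frac{2i}{k} \rfloor + 1}]_n \\
&\le [\Sigma\, \mu(i + \tfrac{k}{2}) [\Sigma\, 1]_{\frac{2i}{k} + 1}]_n \\
&= [\Sigma\, \mu(i + \tfrac{k}{2}) [1, \tfrac{2}{k}]_i]_n \\
&= \mu(n + \tfrac{k}{2} - 1) \left( [\Sigma [1, \tfrac{2}{k}]_i]_n - \mu(-\tfrac{k}{2}) [\Sigma [1, \tfrac{2}{k}]_i]_{-\frac{k}{2}} \right) \\
&= \mu(n + \tfrac{k}{2} - 1) \left( [0, 1 - \tfrac{1}{k}, \tfrac{1}{k}]_n - 0 \right) \\
&= \mu(n + \tfrac{k}{2} - 1) \left( \tfrac{(k-1)n + n^2}{k} \right) \\
&= \begin{cases} \frac{(k-1)n + n^2}{k} & \text{if } n + \frac{k}{2} > 1 \\ 0 & \text{otherwise} \end{cases}
\end{aligned}$$

Because the resulting parametric iteration count value is an upper bound and rational, the fraction in the formula can be truncated to obtain a tighter bound:

$$s(n, k) = \begin{cases} \left\lfloor \frac{(k-1)n + n^2}{k} \right\rfloor & \text{if } n + \frac{k}{2} > 1 \\ 0 & \text{otherwise} \end{cases}$$

In fact, the result is exact for $k = 1$ and $k = 2$. Figure 2 shows the bound and exact iteration space size for $n = 1, \ldots, 10$ and $k = 1, \ldots, 10$. Figure 3 shows the error of the bound for $n = 1, \ldots, 10$ and $k = 1, \ldots, 10$. The overestimation is at most four iterations. ◇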
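The claims made for this example can be reproduced with a few lines of Python. The sketch below enumerates the loop nest to obtain the exact iteration space size and compares it with the truncated parametric bound over $n, k = 1, \ldots, 10$; the function names are illustrative, and integer arithmetic is used so that the floor is exact.

```python
def exact_size(n, k):
    """Enumerate: for i = 0..n-1, for j = 0..2*i step k."""
    return sum(1 for i in range(n) for j in range(0, 2 * i + 1, k))


def parametric_bound(n, k):
    """Truncated bound floor(((k - 1) n + n^2) / k), guarded by n + k/2 > 1."""
    if 2 * n + k <= 2:  # same test as n + k/2 > 1, kept in integers
        return 0
    return ((k - 1) * n + n * n) // k


def max_overestimation(limit):
    """Largest difference between bound and exact size for 1 <= n, k <= limit."""
    return max(parametric_bound(n, k) - exact_size(n, k)
               for n in range(1, limit + 1)
               for k in range(1, limit + 1))


if __name__ == "__main__":
    print(max_overestimation(10))
    print(parametric_bound(10, 5), exact_size(10, 5))
```

Over this range the bound never falls below the exact size, and the largest gap is the four iterations stated in the example.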

For parametric WCET estimation of codes with loop nests, the low-level statement execution cost model is integrated into the summation by Definition 2.3. However, rather than evaluating the expensive symbolic binomial sum of the Newton-Gregory formula, matrix-vector products with the sum matrix are used. This is demonstrated by the following two examples.

[Figure 2: Parametric Bound and Exact Size of the Iteration Space for n = 1, ..., 10 of the Example Loop Nest with Stride k = 1, ..., 10. Axes: n, k, iteration space size; series: bound, exact.]

[Figure 3: Error of the Parametric Bound on the Exact Size of the Iteration Space for n = 1, ..., 10 of the Example Loop Nest with Stride k = 1, ..., 10. Axes: n, k, error.]

Example 2.13 Consider the following loop nest with triangular iteration space:

    for I = 1 to N do
        for J = I to M do
            S

The normalized iteration space is

$$i = 0, \ldots, N-1 \qquad j = 0, \ldots, M-1-i .$$

Suppose the following execution time bounds are obtained

$$\omega(H_1(0)) \le c_0$$

$$\omega(H_2'(i, 0)) \le c_1$$

$$\omega(S'(i, j)) + \omega(H_2'(i, j+1)) \le c_2 ,$$

where $c_0$, $c_1$, and $c_2$ are constants bounding the time of the loop header $H_1$ of the outer loop, the normalized loop header $H_2'$ of the inner loop, and the normalized loop body $S'$, respectively. For simple architectures, such as DSPs without caches and with fixed instruction costs, constant bounds $c_0$, $c_1$, and $c_2$ can easily be established. The evaluation of this parametric WCET estimation expression proceeds as follows

$$\begin{aligned}
\mathit{WCET}(N, M) &= c_0 + [\Sigma (c_1 + [\Sigma\, c_2]_{\max(0, M-i)})]_{\max(0, N)} \\
&= c_0 + \mu(N) [\Sigma (c_1 + \mu(M-i) [\Sigma\, c_2]_{M-i})]_N \\
&= c_0 + \mu(N) [\Sigma (c_1 + \mu(M-i) [c_2 M, -c_2]_i)]_N \\
&= c_0 + \mu(N) ([\Sigma\, c_1]_N + [\Sigma\, \mu(M-i) [c_2 M, -c_2]_i]_N) \\
&= c_0 + \mu(N) (c_1 N + \bar\mu(N-M-1) [\Sigma [c_2 M, -c_2]_i]_N + \mu(N-M-1) \mu(M) [\Sigma [c_2 M, -c_2]_i]_{M+1}) \\
&= c_0 + \mu(N) (c_1 N + \bar\mu(N-M-1) (c_2 (M + \tfrac{1}{2}) N - \tfrac{1}{2} c_2 N^2) + \mu(N-M-1) \mu(M) (\tfrac{1}{2} c_2 (M + M^2))) \\
&= \begin{cases} c_0 + c_1 N + c_2 (M + \frac{1}{2}) N - \frac{1}{2} c_2 N^2 & \text{if } 0 < N \le M+1 \\ c_0 + c_1 N + \frac{1}{2} c_2 (M + M^2) & \text{if } 0 < M \wedge M+1 < N \\ c_0 + c_1 N & \text{if } M \le 0 \wedge 0 < N \wedge M+1 < N \\ c_0 & \text{otherwise} \end{cases}
\end{aligned}$$

◇
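The unit-step form of the WCET expression in this example can be evaluated directly and compared with a cost accumulated over the loop nest. In the Python sketch below, `accumulated_bound` charges $c_0$ once, $c_1$ per outer iteration and $c_2$ per inner iteration, which is an assumed reading of the three bounds; `parametric_wcet` evaluates the last unit-step line of the derivation before the case split. All names are illustrative.

```python
from fractions import Fraction


def mu(x):
    """Unit step: 1 if x > 0, otherwise 0."""
    return 1 if x > 0 else 0


def mu_bar(x):
    """Complement unit step: 1 if x <= 0, otherwise 0."""
    return 1 - mu(x)


def accumulated_bound(N, M, c0, c1, c2):
    """Charge c0 once, c1 per outer iteration and c2 per inner iteration."""
    cost = c0
    for I in range(1, N + 1):
        cost += c1
        for J in range(I, M + 1):
            cost += c2
    return cost


def parametric_wcet(N, M, c0, c1, c2):
    """Unit-step form: c0 + mu(N) (c1 N + mu_bar(N-M-1) A + mu(N-M-1) mu(M) B)."""
    full = c2 * (M + Fraction(1, 2)) * N - Fraction(1, 2) * c2 * N * N  # A
    capped = Fraction(1, 2) * c2 * (M + M * M)                           # B
    return c0 + mu(N) * (c1 * N
                         + mu_bar(N - M - 1) * full
                         + mu(N - M - 1) * mu(M) * capped)


if __name__ == "__main__":
    print(parametric_wcet(4, 6, 1, 1, 2), accumulated_bound(4, 6, 1, 1, 2))
    print(parametric_wcet(9, 3, 1, 1, 2), accumulated_bound(9, 3, 1, 1, 2))
```

With constant bounds the parametric expression matches the accumulated cost exactly, including the zero-trip cases for the outer and the inner loop.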

Nonlinear loop bounds can be incorporated in the analysis if polynomial root finding is not required for splitting sums during the process of simplification, as illustrated in the following example. The example also illustrates the elimination of max bounds based on value range analysis, which will be further discussed in Section 3.

Example 2.14 Consider the following loop nest with a non-linear iteration space and a non-unit stride:

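    for I = 1 to N do
        for J = I to I ∗ I − 2 step 2 do
            S

The normalized iteration space is

$$i = 0, \ldots, N-1, \qquad j = 0, \ldots, \left\lfloor \tfrac{i+i^2}{2} \right\rfloor - 1 .$$

Suppose the following execution time bounds are obtained

$$\omega(H_1(0)) \le c_0, \qquad \omega(H'_2(i,0)) \le c_1, \qquad \omega(S'(i,j)) + \omega(H'_2(i,j+1)) \le c_2 ,$$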

where c0, c1, and c2 are constants.

The worst-case execution time of the two-dimensional loop nest is expressed by the composition of two σ functions and two Newton series conversions. The evaluation of this WCET estimation expression proceeds as follows

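$$
\begin{aligned}
\mathrm{WCET}_1(N) &= c_0 + [\Sigma(c_1 + [\Sigma c_2]_{\max(0,\frac{i+i^2}{2})})]_{\max(0,N)} \\
&= c_0 + \mu(N)[\Sigma(c_1 + [\Sigma c_2]_{\frac{i+i^2}{2}})]_N \qquad \text{(eliminate inner max because } i+i^2 \ge 0 \text{ for } i \ge 0\text{)} \\
&= c_0 + \mu(N)[\Sigma(c_1 + [0, \tfrac{1}{2}c_2, \tfrac{1}{2}c_2]_i)]_N \\
&= c_0 + \mu(N)[\Sigma[c_1, \tfrac{1}{2}c_2, \tfrac{1}{2}c_2]_i]_N \\
&= c_0 + \mu(N)[0, c_1 - \tfrac{1}{6}c_2, 0, \tfrac{1}{6}c_2]_N \\
&= \begin{cases}
c_0 + (c_1 - \tfrac{1}{6}c_2)N + \tfrac{1}{6}c_2 N^3 & \text{if } N > 0 \\
c_0 & \text{otherwise}
\end{cases}
\end{aligned}
$$

[Figure 4: Comparison of WCET Estimates (curves WCET_2 and WCET_1)]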

Conventional worst-case execution time estimation techniques based on guarded sums [32], Ehrhart polynomials [8] or Presburger formulas [28] cannot handle this loop nest accurately because the inner loop bound contains a non-linear expression inside of the floor operator. These methods have to resort to computing overestimations by establishing the maximum size of the inner loop iteration space using the bounds 1 ≤ I ≤ N. As a result, the inner loop's iteration space size is overestimated by $\lfloor \frac{N^2-N}{2} \rfloor$. Therefore, the resulting overestimated parameterized WCET expression is

$$\mathrm{WCET}_2 = c_0 + c_1 N + c_2 N \left\lfloor \tfrac{N^2-N}{2} \right\rfloor \qquad (19)$$

for N ≥ 1. This approach can result in very loose bounds for triangular loop nests, because the inner loop is bounded by a non-negative constant over the entire iteration space of the inner loop.

Figure 4 shows the worst-case execution time estimation bounds using Newton-Gregory ($\mathrm{WCET}_1$) compared to the simple WCET (19) approach ($\mathrm{WCET}_2$) for increasing values of N (shown on the x-axis). The sample bounds c0 = 1, c1 = 1, and c2 = 2 are used in the graph. ◇
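The gap between the two estimates can be reproduced numerically. The Python sketch below is an illustration only: the function names `wcet1_direct` and `wcet2` are hypothetical. It sums the bounds directly over the normalized iteration space of Example 2.14, which gives c0 + c1 N + c2 (N^3 − N)/6 for N > 0, and compares the result with the decoupled bound of equation (19) for the sample bounds c0 = 1, c1 = 1, and c2 = 2.

```python
def wcet1_direct(N, c0, c1, c2):
    """Sum c1 + c2 * floor((i + i^2) / 2) over i = 0..N-1, plus c0."""
    total = c0
    for i in range(max(0, N)):
        total += c1 + c2 * ((i + i * i) // 2)
    return total

def wcet2(N, c0, c1, c2):
    """Decoupled bound of equation (19), stated for N >= 1."""
    return c0 + c1 * N + c2 * N * ((N * N - N) // 2)

c0, c1, c2 = 1, 1, 2                      # sample bounds used for Figure 4
for N in range(1, 11):
    exact = wcet1_direct(N, c0, c1, c2)
    # Direct summation agrees with c0 + c1*N + c2*(N^3 - N)/6.
    assert 6 * exact == 6 * c0 + 6 * c1 * N + c2 * (N**3 - N)
    # The decoupled bound is never below the summed bound.
    assert exact <= wcet2(N, c0, c1, c2)
    print(N, exact, wcet2(N, c0, c1, c2))
```

At N = 10 the summed bound is 341 while equation (19) gives 911, which is consistent with the range of the y-axis in Figure 4.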

In the above examples the WCET is tight under the assumption that low-level execution costs are constant, which is the case for DSP architectures without caches and with fixed instruction costs. On such architectures constants can easily be determined to bound the cost of the loop bodies. However, the presented approach is also applicable to more accurate cost models, as long as the low-level cost can be tightly bounded by a non-negative multi-variate polynomial in the loop iteration variables, possibly in combination with unit step functions to incorporate piece-wise polynomials. When the bounding functions are multi-variate higher order polynomials, the method is still applicable.

3 Bounding Functions

In many cases, the limiting max operation from the WCET formulation given in Definition 2.3 can be eliminated, as was shown in Example 2.14. The elimination avoids the introduction of unit step functions in the sum composition, which requires splitting. This can result in a significant speedup in the evaluation of the sums. To eliminate the guard, the value must be shown to be non-negative using value range analysis.

Value range analysis lies at the heart of tight timing analysis. The value range of a symbolic expression E is denoted by [L(E), U(E)], where L(E) denotes the lower bound and U(E) denotes the upper bound of E, respectively. The value range of an expression is derived by the application of rules such as L(E1 + E2) = L(E1) + L(E2). Rules for arithmetic operators are mostly straightforward, see e.g. [6, 10, 15, 13, 30].

The range of values of a symbolic expression evaluated on an interval is of particular interest. In general, the tightness of a range analyzer is dependent on how the analyzed expression is formed [1] and if monotonicity can be exploited [6]. It is well known that when a function f is monotonically increasing on an interval [a, b], the full range of values of f on the interval is given by [f(a), f(b)]. Similarly, when f is monotonically decreasing on [a, b], then the range of f is [f(b), f(a)].

Determining monotonicity can be difficult. Haghighat [13] uses difference methods which are both symbolic and iterative and must be applied to every (index) expression to be bounded without reusing intermediate results. Blume [6] determines monotonicity using a replacement method. This method is not exact in certain cases due to considerations inside of his algorithm that guard against non-termination of the algorithm.

In contrast, the Newton series representation of a polynomial provides a means to accurately determine the value range of the polynomial on a given interval. The reason is that the Newton series provides a form from which the monotonicity of the polynomial on an interval [0, n − 1] is evident.

Lemma 3.1 Let $\Phi(i) = \langle\phi_0, \phi_1, \ldots, \phi_k\rangle_i$ denote the Newton series of a polynomial p(i) in i and let $h(i) = \mathcal{N}^{-1}_{k-1}\langle\phi_1, \ldots, \phi_k\rangle_i$. If h(i) ≥ 0 for all i = 0, ..., n − 2, n ≥ 0, then p(i) is monotonically increasing over the integers in the interval [0, n − 1]. Similarly, if h(i) ≤ 0 for all i = 0, ..., n − 2, then p(i) is monotonically decreasing over the integers in the interval [0, n − 1].

Proof. The polynomial h(i) is constructed from the Newton series $\langle\phi_1, \ldots, \phi_k\rangle_i$ starting with the second coefficient of Φ(i). Therefore, by the Newton-Gregory formula we have

$$h(i) = \sum_{j=1}^{k} \phi_j \binom{i}{j-1}$$

which leads to

$$
\begin{aligned}
p(i) &= \sum_{j=0}^{k} \phi_j \binom{i}{j} \\
&= \phi_0 + \sum_{j=1}^{k} \phi_j \binom{i}{j} \\
&= \phi_0 + \sum_{j=1}^{k} \phi_j \sum_{\ell=0}^{i-1} \binom{\ell}{j-1} \\
&= \phi_0 + \sum_{\ell=0}^{i-1} \sum_{j=1}^{k} \phi_j \binom{\ell}{j-1} \\
&= \phi_0 + \sum_{\ell=0}^{i-1} h(\ell)
\end{aligned}
$$

for i ≥ 1 with p(0) = φ0. Because h(i) ≥ 0 for all integers in the range i = 0, ..., n − 2, we observe that $\sum_{\ell=0}^{i-1} h(\ell)$ is monotonically increasing with increasing i = 1, ..., n − 1 and therefore p(i) is monotonically increasing with increasing i = 0, ..., n − 1. The proof of the monotonically decreasing case is similar. □
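The relation between a polynomial, its Newton series, and the function h(i) of Lemma 3.1 can be illustrated numerically. The following Python sketch is not taken from the paper: the helper names `newton_series`, `evaluate`, and `h` are hypothetical, and the conversion uses forward differences at 0, which is one standard way to obtain Newton-Gregory coefficients (the paper's symbolic conversion may be implemented differently).

```python
from fractions import Fraction
from math import comb

def newton_series(coeffs):
    """Newton series <phi_0, ..., phi_k> of p(i) = sum(coeffs[d] * i**d).

    phi_j is the j-th forward difference of p at 0.
    """
    k = len(coeffs) - 1
    values = [sum(Fraction(c) * i**d for d, c in enumerate(coeffs))
              for i in range(k + 1)]
    phi = []
    while values:
        phi.append(values[0])
        values = [b - a for a, b in zip(values, values[1:])]
    return phi

def evaluate(phi, i):
    """Evaluate a Newton series at an integer i >= 0."""
    return sum(f * comb(i, j) for j, f in enumerate(phi))

def h(phi, i):
    """h(i) of Lemma 3.1: the series <phi_1, ..., phi_k> evaluated at i."""
    return evaluate(phi[1:], i)

# (i^2 - i)/2 has coefficients [0, -1/2, 1/2] and Newton series <0, 0, 1>.
phi = newton_series([0, Fraction(-1, 2), Fraction(1, 2)])
assert phi == [0, 0, 1]

# h(i) equals the step p(i+1) - p(i), so h >= 0 means p does not decrease.
assert all(h(phi, i) == evaluate(phi, i + 1) - evaluate(phi, i) for i in range(8))
assert all(h(phi, i) >= 0 for i in range(8))
```

The identity h(i) = p(i+1) − p(i) is the telescoping form of the last line of the proof, so checking the sign of h on i = 0, ..., n − 2 is enough to establish monotonicity on [0, n − 1].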

The fact that monotonicity of a polynomial can be observed from its Newton series combined with the fact that the range of a monotonic function can be easily bounded leads to the following recursive definition of a lower and upper bound on the range of values of a polynomial.

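Definition 3.2 Let $\Phi(i) = \langle\phi_0, \ldots, \phi_k\rangle_i$ denote the Newton series of a polynomial evaluated on the integers in the interval i = 0, ..., n − 1 with n ≥ 0. The lower bound of the polynomial defined by Φ(i) is

$$
L(\Phi(i)) = \begin{cases}
L(\phi_0) & \text{if } L(\langle\phi_1, \ldots, \phi_k\rangle_i) \ge 0 \\
L(\mathcal{N}^{-1}_k \Phi(n-1)) & \text{if } U(\langle\phi_1, \ldots, \phi_k\rangle_i) \le 0 \\
L(\mathcal{N}^{-1}_k \Phi(i)) & \text{otherwise}
\end{cases}
$$

and the upper bound of the polynomial defined by Φ(i) is

$$
U(\Phi(i)) = \begin{cases}
U(\phi_0) & \text{if } U(\langle\phi_1, \ldots, \phi_k\rangle_i) \le 0 \\
U(\mathcal{N}^{-1}_k \Phi(n-1)) & \text{if } L(\langle\phi_1, \ldots, \phi_k\rangle_i) \ge 0 \\
U(\mathcal{N}^{-1}_k \Phi(i)) & \text{otherwise}
\end{cases}
$$

where $\mathcal{N}^{-1}_k \Phi(n-1)$ denotes the expression obtained by converting Φ to a polynomial that is evaluated at i = n − 1, that is, by replacing i with n − 1.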

The conditions in the guards in this definition exploit the fact that when L(Φ(i)) ≥ 0 on an interval of integers, the polynomial of Φ(i) is non-negative over the integers in the interval, and when U(Φ(i)) ≤ 0 the polynomial is non-positive at integer points in the interval.

Lemma 3.3 Let Φ(i) denote the Newton series of a polynomial p(i) evaluated on i = 0, ..., n − 1 with n ≥ 0. Then,

$$L(\Phi(i)) \le p(i) \le U(\Phi(i))$$

for all i = 0, ..., n − 1.
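A numeric sketch of the recursion in Definition 3.2 may help to see how the guards interact. It is written under loud assumptions: coefficients are plain numbers rather than symbolic expressions, the names `lower`, `upper`, and `evaluate` are hypothetical, and the "otherwise" case is replaced by a simple term-wise interval bound (using 0 ≤ C(i, j) ≤ C(n−1, j) on the domain) instead of the conversion back to polynomial form described in the definition.

```python
from math import comb

def evaluate(phi, i):
    """Evaluate the Newton series phi at an integer i >= 0."""
    return sum(f * comb(i, j) for j, f in enumerate(phi))

def lower(phi, n):
    """Lower bound of phi on i = 0..n-1 (n >= 1, numeric coefficients)."""
    if len(phi) == 1:
        return phi[0]
    tail = phi[1:]
    if lower(tail, n) >= 0:            # non-decreasing: minimum at i = 0
        return phi[0]
    if upper(tail, n) <= 0:            # non-increasing: minimum at i = n-1
        return evaluate(phi, n - 1)
    # Fallback (assumption of this sketch): bound every term separately.
    return phi[0] + sum(min(0, f) * comb(n - 1, j)
                        for j, f in enumerate(phi) if j > 0)

def upper(phi, n):
    """Upper bound of phi on i = 0..n-1 (n >= 1, numeric coefficients)."""
    if len(phi) == 1:
        return phi[0]
    tail = phi[1:]
    if upper(tail, n) <= 0:            # non-increasing: maximum at i = 0
        return phi[0]
    if lower(tail, n) >= 0:            # non-decreasing: maximum at i = n-1
        return evaluate(phi, n - 1)
    return phi[0] + sum(max(0, f) * comb(n - 1, j)
                        for j, f in enumerate(phi) if j > 0)

# Brute-force check of the enclosure stated in Lemma 3.3 on i = 0..5.
for phi in ([-1, -3, -2], [0, -3, 2], [5, 0, 1], [4, 2]):
    values = [evaluate(phi, i) for i in range(6)]
    assert lower(phi, 6) <= min(values) and max(values) <= upper(phi, 6)
```

For the monotone series ⟨−1, −3, −2⟩ the bounds are exact (−36 and −1 on i = 0, ..., 5), whereas for the non-monotone series ⟨0, −3, 2⟩ the fallback is safe but loose.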

Example 3.4 Consider the following example loop nest with nonlinear bounds.

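    for i = 0 to n − 1 do
        for j = 0 to (i ∗ i − i)/2 − 1 do
            S

The iteration space size is determined by

$$
\begin{aligned}
s(n) &= [\Sigma[\Sigma 1]_{\max(0,\frac{i^2-i}{2})}]_{\max(0,n)} \\
&= \mu(n)[\Sigma[\Sigma 1]_{\frac{i^2-i}{2}}]_n \qquad \text{(eliminate inner max and rewrite outer max)} \\
&= \mu(n)[\Sigma[0, -\tfrac{1}{2}, \tfrac{1}{2}]_i]_n \\
&= \mu(n)[0, \tfrac{1}{3}, -\tfrac{1}{2}, \tfrac{1}{6}]_n \\
&= \begin{cases}
\frac{2n - 3n^2 + n^3}{6} & \text{if } n > 0 \\
0 & \text{otherwise}
\end{cases}
\end{aligned}
$$

Because the Newton series of $\frac{i^2-i}{2}$ is $\Phi(i) = \mathcal{N}_2[0, -\tfrac{1}{2}, \tfrac{1}{2}]_i = \langle 0, 0, 1\rangle_i$ and L(Φ(i)) ≥ 0, the inner max guard can be eliminated. ◇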


4 Loop Iteration Path Timing Estimation

The objective is to find the critical paths in the body of a loop, i.e. the paths that are the most time consuming across the iterations of a loop. The following loop fragment generalizes this concept:

    for i = a to b step s do
        if C then
            S1
        else
            S2

If condition C partitions the iteration space of i, e.g. C is a condition such as i > c, then loop timing analysis can be applied by splitting the loop into partitioned loops which can be separately analyzed. Such an approach is used by Sakellariou [32], Bik [4] and Haghighat [14] for partitioning loops with affine bounds. For example the above loop could be represented as follows:

    for i = max(a, c + 1) to b step s do
        S1
    for i = a to min(b, c) step s do
        S2

The number of iterations executed by the two loops is the same as the number of iterations executed by the original unsplit loop. Of course, it is possible that the computed WCET estimate for the two-loop version of the code would be different from that of the first version with just one loop if cache effects and pipeline effects are taken into account; this is due to alternations between the two branches of the loop body in successive iterations of the loop. The sum of a loop over a conditional execution path can be split using the following.

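Lemma 4.1 For any $x, y, z \in \mathbb{R}$ the following hold, where $\bar{\mu}(x) = 1 - \mu(x)$ denotes the complement of the unit step function:

$$
\begin{aligned}
{[\Sigma x]}_{\min(y,z)} &= \bar{\mu}(y-z)[\Sigma x]_y + \mu(y-z)[\Sigma x]_z \\
{[\Sigma x]}_{\max(y,z)} &= \bar{\mu}(z-y)[\Sigma x]_y + \mu(z-y)[\Sigma x]_z
\end{aligned}
$$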
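The loop splitting described before Lemma 4.1 can be checked by counting iterations. The Python sketch below is an illustration with hypothetical function names, and it is restricted to unit stride (s = 1), which is an assumption of the sketch: with a non-unit stride the start max(a, c + 1) of the first split loop would presumably also have to stay aligned with the stride of the original loop, which is not attempted here.

```python
def count_original(a, b, c):
    """Count executions of S1 and S2 in the unsplit loop (unit stride)."""
    s1 = s2 = 0
    for i in range(a, b + 1):
        if i > c:
            s1 += 1
        else:
            s2 += 1
    return s1, s2

def count_split(a, b, c):
    """Trip counts of the two split loops (unit stride)."""
    s1 = len(range(max(a, c + 1), b + 1))   # for i = max(a, c+1) to b
    s2 = len(range(a, min(b, c) + 1))       # for i = a to min(b, c)
    return s1, s2

# Exhaustive check on a small grid, including empty and zero-trip cases.
for a in range(-3, 4):
    for b in range(-3, 6):
        for c in range(-5, 8):
            assert count_original(a, b, c) == count_split(a, b, c)
```

The grid deliberately includes cases where c lies outside [a, b] and where b < a, so that zero-trip loops are exercised as well.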

In cases where the branch condition C exhibits switching behavior that is unknown at compile time the above splitting technique cannot be applied. In such cases we attempt to determine if one path is always more costly than another path. When we can identify these critical paths, we can compute a bound on the WCET estimate of the loop by assuming the loop body has a cost equal to the critical path. This WCET estimate will be safe when cache and pipeline effects are not considered. In the presence of cache effects and pipeline effects, the WCET estimate can be made safe, when the cost of alternation between branches is relatively negligible compared to the cost of the body of the critical path of the loop, by adding a small constant to the cost of the WCET estimate for the critical path.

It is possible that one path is not the critical path over the entire iteration space. The critical path at the beginning of the loop iteration space could have decreasing cost over the entire iteration space. A non-critical path, with increasing cost over the loop iteration space, might overtake the initial critical path and become the critical path for the rest of the iteration space. In such a case, the cost of the loop body can be approximated by a bounding curve, which bounds both of the cost curves for each path in the loop. Finding such a curve might require root finding, and this approach will be complicated when many paths assume the critical path role over the iteration space.

A special case arises when the conditional, representing which path a loop will follow in its body, is loop invariant. In our framework, the WCET estimate of a loop is the weighted sum of the time cost for each iteration of the loop. When a bounding function for one path is greater, over the whole iteration space, than the other path we can take the bounding function for this more expensive path (in the absence of cache effects and pipeline effects) as the cost of the loop body over all iterations. In the case where one path's cost is represented by an ascending bounding function and the other's cost is represented by a descending bounding function we can reverse the iteration space for the descending bounding function and treat it as an ascending bounding function which will have the same sum cost, over the new reversed iteration space, as the original descending curve over the original iteration space: the area under the original descending curve and its ascending reverse iteration space version will be the same. Since the loop condition is invariant we know that only one of the loop paths will be executed during the entire execution of the loop. If one of the ascending bounding functions is greater than the other we can take this bounding function to be the cost estimate for the loop body. When this is not the case, root finding or some appropriate approximation is required.

We present an approach to analyze two paths in the body of a loop with a loop-invariant condition. The extension to multiple paths is straightforward. If the worst-case execution time bounds on S1 and S2 are polynomial in i (including constant), then the

To this end, the timing of the paths is compared by computing the difference series of the Newton series of the polynomials that bound the execution time of the paths.

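Definition 4.2 Let $\Phi(i) = \langle\phi_0, \ldots, \phi_k\rangle_i$ and $\Psi(i) = \langle\psi_0, \ldots, \psi_\ell\rangle_i$ denote Newton series. Then, the difference series $\Delta(i) = \Phi(i) - \Psi(i) = \langle\delta_0, \ldots, \delta_{\max(k,\ell)}\rangle_i$ is defined by

$$
\delta_j = \begin{cases}
\phi_j - \psi_j & \text{if } j \le k \text{ and } j \le \ell \\
\phi_j & \text{if } j \le k \text{ and } j > \ell \\
-\psi_j & \text{if } j > k \text{ and } j \le \ell
\end{cases}
$$

for all j = 0, ..., max(k, ℓ).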

The difference series of the Newton series of two polynomials gives the necessary information to determine which of the two polynomials is the largest on an interval.

Lemma 4.3 Let Φ(i) be the Newton series of a polynomial p(i) and let Ψ(i) be the Newton series of a polynomial q(i). If L(Φ(i) − Ψ(i)) ≥ 0 then p(i) ≥ q(i) for all i = 0, ..., n − 1, n ≥ 0, and if U(Φ(i) − Ψ(i)) ≤ 0 then p(i) ≤ q(i) for all i = 0, ..., n − 1.
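The difference series of Definition 4.2 amounts to a coefficient-wise subtraction with zero padding. The Python sketch below illustrates this with hypothetical helper names; the sample series ⟨4, 2⟩ and ⟨5, 5, 2⟩ are the numeric series Φ(i) and Ψ(N − i) that appear later in Example 4.5, and N = 5 is an arbitrary choice for the check.

```python
from itertools import zip_longest
from math import comb

def difference_series(phi, psi):
    """Coefficient-wise difference; missing coefficients count as 0."""
    return [f - g for f, g in zip_longest(phi, psi, fillvalue=0)]

def evaluate(phi, i):
    """Evaluate a Newton series at an integer i >= 0."""
    return sum(f * comb(i, j) for j, f in enumerate(phi))

N = 5                        # arbitrary sample value (assumption of this sketch)
phi = [4, 2]                 # <4, 2>_i
psi_reversed = [5, 5, 2]     # <5, 5, 2>_i

delta = difference_series(phi, psi_reversed)
assert delta == [-1, -3, -2]

# The difference series evaluates to the pointwise difference of the polynomials.
assert all(evaluate(delta, i) == evaluate(phi, i) - evaluate(psi_reversed, i)
           for i in range(N + 1))

# Its values stay non-positive on i = 0..N, so the second polynomial dominates.
assert max(evaluate(delta, i) for i in range(N + 1)) <= 0
```

Padding with zeros reproduces the second and third cases of the definition, where only one of the two series has a coefficient at position j.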

A maximizing operator on Newton series is introduced that exploits iteration reversal. The maximizing operator returns the larger of two series. If neither of the two series dominates, a new series is constructed that bounds both polynomials of the series.

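Definition 4.4 Let i = 0, ..., n − 1, n ≥ 0, be the iteration space of a (normalized) loop with two execution paths through the loop body based on a condition that is loop invariant. Let $\Phi(i) = \langle\phi_0, \ldots, \phi_k\rangle_i$ denote the Newton series of the parameterized WCET of the first path and let $\Psi(i) = \langle\psi_0, \ldots, \psi_\ell\rangle_i$ denote the Newton series of the parameterized WCET of the second path. We define the maximizing operator ↑ of the two series by

$$
\Phi(i) \uparrow \Psi(i) = \begin{cases}
\Phi(i) & \text{if } L(\Delta_1(i)) \ge 0 \text{ or } L(\Delta_2(i)) \ge 0 \text{ or } L(\Delta_3(i)) \ge 0 \text{ or } L(\Delta_4(i)) \ge 0 \\
\Psi(i) & \text{if } U(\Delta_1(i)) \le 0 \text{ or } U(\Delta_2(i)) \le 0 \text{ or } U(\Delta_3(i)) \le 0 \text{ or } U(\Delta_4(i)) \le 0 \\
\Theta(i) & \text{otherwise}
\end{cases}
$$

where

$$
\begin{aligned}
\Delta_1(i) &= \Phi(i) - \Psi(i) \\
\Delta_2(i) &= \Phi(i) - \Psi(n-i-1) \\
\Delta_3(i) &= \Phi(n-i-1) - \Psi(i) \\
\Delta_4(i) &= \Phi(n-i-1) - \Psi(n-i-1)
\end{aligned}
$$

and $\Theta(i) = \langle\theta_0, \ldots, \theta_{\max(k,\ell)}\rangle_i$ with coefficients

$$
\theta_j = \begin{cases}
\max(\phi_j, \psi_j) & \text{if } j \le k \text{ and } j \le \ell \\
\max(0, \phi_j) & \text{if } j \le k \text{ and } j > \ell \\
\max(0, \psi_j) & \text{if } j > k \text{ and } j \le \ell
\end{cases}
$$

for all j = 0, ..., max(k, ℓ).

$$
\begin{array}{ll}
\text{Difference Series} & \text{Value Range on Domain } i = 0, \ldots, N \\
\hline
\Delta_1(i) = \Phi(i) - \Psi(i) = \langle -1-4N-N^2,\ 5+2N,\ -2\rangle_i & [L(\Delta_1), U(\Delta_1)] = [-1-4N-N^2,\ -1+2N] \\
\Delta_2(i) = \Phi(i) - \Psi(N-i) = \langle -1,\ -3,\ -2\rangle_i & [L(\Delta_2), U(\Delta_2)] = [-1-2N-N^2,\ -1] \\
\Delta_3(i) = \Phi(N-i) - \Psi(i) = \langle -1-2N-N^2,\ 1+2N,\ -2\rangle_i & [L(\Delta_3), U(\Delta_3)] = [-1-2N-N^2,\ 1] \\
\Delta_4(i) = \Phi(N-i) - \Psi(N-i) = \langle -1+2N,\ -7,\ -2\rangle_i & [L(\Delta_4), U(\Delta_4)] = [-1-4N-N^2,\ -1+2N]
\end{array}
$$

Table 1: Difference Series of Φ(i) and Ψ(i) and Their Value Range for Example 4.5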

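The differences Θ(i) − Φ(i) and Θ(i) − Ψ(i) are always non-negative on the iteration space, because each coefficient of Θ is at least as large as the corresponding coefficients of Φ and Ψ and the binomial coefficients are non-negative for i ≥ 0. Therefore Θ(i) bounds both Φ(i) and Ψ(i).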
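The construction of Θ(i) is a coefficient-wise maximum. The following Python sketch uses hypothetical helper names and, as an assumption for illustration, the numeric series ⟨4, 2⟩ and ⟨50, −13, 2⟩, which are Φ(i) and Ψ(i) of Example 4.5 instantiated with N = 5.

```python
from itertools import zip_longest
from math import comb

def theta(phi, psi):
    """Coefficient-wise maximum of two Newton series.

    Padding the shorter series with 0 reproduces the max(0, .) cases.
    """
    return [max(f, g) for f, g in zip_longest(phi, psi, fillvalue=0)]

def evaluate(phi, i):
    """Evaluate a Newton series at an integer i >= 0."""
    return sum(f * comb(i, j) for j, f in enumerate(phi))

phi = [4, 2]
psi = [50, -13, 2]
bound = theta(phi, psi)
assert bound == [50, 2, 2]

# Theta bounds both series at every non-negative integer point checked.
for i in range(20):
    assert evaluate(bound, i) >= evaluate(phi, i)
    assert evaluate(bound, i) >= evaluate(psi, i)
```

The bound is safe but can be loose: here Θ(0) = 50 equals Ψ(0), while for larger i it exceeds both series, which is why the definition tries the four difference series first.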

The Newton series Φ(n − i − 1) is obtained from the series Φ(i) by reversing the evaluation of the polynomial on the interval [0, n − 1]. There are several ways this can be accomplished. A simple implementation is to convert the Newton series back to a polynomial using $p(i) = \mathcal{N}^{-1}_k \Phi(i)$, which gives the coefficients of p(i), or with $p(i) = \sum_{j=0}^{k} \phi_j \binom{i}{j}$, which gives a closed-form polynomial expression. Then replace i with n − i − 1 in p(i), simplify the polynomial, and convert to Newton series with $\Phi(n-i-1) = \mathcal{N}_k\, p(n-i-1)$.

The generalization of this approach to multiple paths and multiple loops follows from the fact that Φ(i) ↑ Ψ(i) is closed under polynomial formation. Therefore it can be applied to multiple paths to maximize the Newton series representation of the timing of the paths.
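For numeric coefficients the reversal can also be carried out without symbolic substitution. The Python sketch below is a swapped-in technique rather than the conversion described above: it samples the reversed polynomial p(n − 1 − i) at i = 0, ..., k and takes forward differences. The helper names are hypothetical, and the sample series are Φ and Ψ of Example 4.5 instantiated with N = 5 (so n = N + 1 = 6), an assumption made for illustration.

```python
from math import factorial

def binom(x, j):
    """Polynomial binomial coefficient x(x-1)...(x-j+1)/j!, for any integer x."""
    numerator = 1
    for t in range(j):
        numerator *= x - t
    return numerator // factorial(j)    # exact: j consecutive integers

def evaluate(phi, x):
    """Evaluate a Newton series at any integer x (also negative)."""
    return sum(f * binom(x, j) for j, f in enumerate(phi))

def reverse_series(phi, n):
    """Newton series of p(n - 1 - i) for the polynomial p given by phi."""
    k = len(phi) - 1
    values = [evaluate(phi, n - 1 - i) for i in range(k + 1)]
    reversed_phi = []
    while values:
        reversed_phi.append(values[0])
        values = [b - a for a, b in zip(values, values[1:])]
    return reversed_phi

n = 6                                              # iteration space i = 0..5
assert reverse_series([4, 2], n) == [14, -2]       # <4 + 2N, -2> with N = 5
assert reverse_series([50, -13, 2], n) == [5, 5, 2]
```

Using the polynomial form of the binomial coefficient keeps the sampling valid even when n − 1 − i becomes negative for short iteration spaces.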


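Example 4.5 Consider the following loop with a loop body that contains a conditional expression with a loop invariant condition.

    1: for i = 0 to N do
    2:   if a > 0 then
    3:     for j = 0 to i do
    4:       S4
    5:   else
    6:     for j = 0 to N − i do
    7:       for k = 0 to j do
    8:         S8

Suppose the following execution time bounds are obtained

$$
\begin{aligned}
&\omega(H_1(0)) \le 1, \qquad \omega(C_2(i)) \le 1, \\
&\omega(H_3(i,0)) \le 1, \qquad \omega(S_4(i,j)) + \omega(H_3(i,j+1)) \le 2, \\
&\omega(H_6(i,0)) \le 1, \qquad \omega(H_7(i,0)) \le 1, \\
&\omega(S_8(i,j)) + \omega(H_7(i,j+1)) \le 2 .
\end{aligned}
$$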

The analysis begins with the inner loops and proceeds to the outer loop. Loop bounds analysis revealed that none of the iteration space sizes of the loops is negative and therefore the max operations can be eliminated from the σ functions. The first path spans statements 2 to 4 and the second path spans statements 2 and 6 to 8. The Newton series corresponding to the two paths are the following:

$$
\begin{aligned}
\Phi(i) &= \mathcal{N}_1(2 + [\Sigma 2]_{i+1}) = \mathcal{N}_1[4, 2]_i = \langle 4, 2\rangle_i \\
\Psi(i) &= \mathcal{N}_2(2 + [\Sigma(1 + [\Sigma 2]_{j+1})]_{N-i+1}) = \mathcal{N}_2[5+4N+N^2, -2N-4, 1]_i = \langle 5+4N+N^2, -3-2N, 2\rangle_i
\end{aligned}
$$

and the reversed series are

$$
\begin{aligned}
\Phi(N-i) &= \mathcal{N}_1(2 + [\Sigma 2]_{N-i+1}) = \mathcal{N}_1[4+2N, -2]_i = \langle 4+2N, -2\rangle_i \\
\Psi(N-i) &= \mathcal{N}_2(2 + [\Sigma(1 + [\Sigma 2]_{j+1})]_{i+1}) = \mathcal{N}_2[5, 4, 1]_i = \langle 5, 5, 2\rangle_i .
\end{aligned}
$$

The difference series and their value ranges on the domain i = 0, ..., N are shown in Table 1. The value range of ∆2(i) is negative. Therefore, it can be concluded that the polynomial of the series Ψ(N − i) bounds the polynomial of the series Φ(i). That is, Ψ(i) (and the reversed Ψ(N − i)) represents the execution time of the critical path through the loop body. The total worst-case execution time of the loop fragment is given by

$$\mathrm{WCET}_1(N) = 1 + \sigma(\Psi(i), N+1) = 7 + \tfrac{37}{6}N + \tfrac{5}{2}N^2 + \tfrac{1}{3}N^3$$

[Figure 5: Comparison of WCET Estimates (curves WCET_3, WCET_2, and WCET_1)]

Now consider a conventional WCET estimation that is determined by decoupling the outer loop from the inner loops. This WCET is the best estimation that current methods can achieve. This estimation is obtained by simplifying the problem by assuming that the inner loops are bounded by the maximum value of i, which is N. Effectively, this WCET is computed from the Newton series of the inner loops by replacing i by N, giving $\mathcal{N}^{-1}_1 \Phi(N+1) = 4 + 2N$ and $\mathcal{N}^{-1}_2 \Psi(N+1) = 5 + 4N + N^2$. Because the outer loop iterates N + 1 times and the loop header execution time is 1, the simplified worst-case execution time estimation is

$$\mathrm{WCET}_2(N) = 1 + (N+1)\max(4+2N,\ 5+4N+N^2)$$

Even simpler estimation methods are commonly used in practice. They bound the iteration space of loops by using value range analysis that provides only very simple bounding expressions on the loop iteration space sizes. These methods cannot accurately estimate the timing of triangular loops, because inner loops are decoupled from outer loops. For example, loop statement 3, with a loop body cost of 2, iterates at most N + 1 times. This gives a cost for executing loop 3 of 2(N + 1) + 3, where 3 is the cost of the header of loop statement 1, plus the cost of the conditional statement 2, plus the cost of the header of loop statement 3. Loop statement 7, with a loop body cost of 2, iterates at most N + 1 times. Loop statement 6, with a loop body cost of 2(N + 1) + 1, iterates at most N + 1 times. Therefore, the cost of the loop nest statements 6 to 8 is 3 + (N + 1)(2(N + 1) + 1). The total estimated execution time using this simple approach is

$$\mathrm{WCET}_3(N) = 1 + (N+1)\max(5+2N,\ 6+5N+2N^2)$$

Figure 5 shows the worst-case execution time estimation bounds using Newton-Gregory estimation ($\mathrm{WCET}_1$), decoupled estimation ($\mathrm{WCET}_2$), and simple estimation ($\mathrm{WCET}_3$), for increasing values of N (shown on the x-axis). ◇

The approach is also applicable to loops with multiple paths with conditions that are not loop invariant by using the Θ(i) = Φ(i) ↑ Ψ(i) function to bound the timings Φ(i) ≤ Θ(i) and Ψ(i) ≤ Θ(i) of the paths when neither Ψ(i) nor Φ(i) dominates the execution time.

[Figure 6: CFG of the Test Code (nodes L3 to L12 and exit)]

5 Experiments

For the experiments a Texas Instruments TMS320C54x DSP was used with a cycle counter to obtain the number of cycles to execute code. This 16-bit fixed-point DSP is a VLIW machine with a six-stage pipeline. It does not feature a cache but has ROM, dual-access RAM and single-access RAM. The code was written, compiled and tested on the DSP using C54x Code Composer Studio. The following test code fragment was run on the TMS320C54x DSP.

    1: for i = 0 to N do
    2:   if a[i] > 0 then
    3:     for j = 0 to i do
    4:       x = x + 3
    5:   else
    6:     for j = 0 to i do
    7:       for k = 0 to j do
    8:         x = worker(x)

The outer for loop has two possible paths in its body. The first path spans statements 2 to 4 and the second path spans statements 2 and 6 to 8. Figure 6 shows the CFG of the Test Code.

This example test code was analyzed using instruction counts from the DSP reference manual [35] to obtain the cycle counts of individual instructions for our parametric timing technique. The following times were produced for the various parts of the test code.

$$
\begin{aligned}
&\omega(H_1(0)) \le 11, \qquad \omega(H_1(i+1)) \le 9, \\
&\omega(C_2(i)) \le 13, \qquad \omega(H_3(i,0)) \le 9, \\
&\omega(S_4(i,j)) + \omega(H_3(i,j+1)) \le 13, \\
&\omega(H_6(i,0)) \le 9, \qquad \omega(H_6(i,j+1)) \le 9, \qquad \omega(H_7(i,j,0)) \le 11, \\
&\omega(S_8(i,j)) + \omega(H_7(i,j,k+1)) \le 42 .
\end{aligned}
$$

[Figure 7: Parametric Bound and Range of the Measured Execution Times of the Test Code for N = 1, ..., 10 (series: WCET, Samples)]

The TMS320C54x DSP has different costs for taken branches and non-taken branches. Taken branches cost 5 cycles and non-taken branches cost 3 cycles. With this information the WCET bound can be made tighter on the cost of the ω(H6(i, j + 1)) estimate. This cost is only incurred when the ω(S8(i, j)) + ω(H7(i, j, k + 1)) path has a false branch condition. In such a case the ω(S8(i, j)) + ω(H7(i, j, k + 1)) estimate is bounded by 40 and this saving can be incorporated into the ω(H6(i, j + 1)) estimate.

The Newton series corresponding to the two paths are the following:

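$$
\begin{aligned}
\Phi(i) &= \mathcal{N}_1(9 + \mu(i+1)[\Sigma 13]_{i+1}) \\
&= \langle 22, 13\rangle_i \quad \text{if } i+1 \ge 0 \\
\Psi(i) &= \mathcal{N}_2(9 + \mu(i+1)[\Sigma((9-2) + 11 + \mu(j+1)[\Sigma 42]_{j+1})]_{i+1}) \\
&= 9 + [0, 39, 21]_{i+1} \quad \text{if } i+1 \ge 0 \\
&= [69, 81, 21]_i \quad \text{if } i+1 \ge 0 \\
&= \langle 69, 102, 42\rangle_i \quad \text{if } i+1 \ge 0
\end{aligned}
$$

By an application of Definition 3.2 and Lemma 4.3 we see that the polynomial which represents the branch associated with Ψ dominates the polynomial which represents the branch associated with Φ:

$$L(\Psi(i) - \Phi(i)) = L(\langle 69, 102, 42\rangle_i - \langle 22, 13\rangle_i) = L(\langle 47, 89, 42\rangle_i) = L(47) \ge 0$$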

Therefore, in this example, the cost of the path spanning statements 2 and 6 to 8 dominates the cost of the statements between 2 and 4, inclusive. Thus, we can take the WCET to be the following:

$$\mathrm{WCET}_1(N) = 11 + \mu(N+1)[\Sigma((9+13+9) + [60, 81, 21]_i)]_{N+1} = 7N^3 + 51N^2 + 135N + 102 \quad \text{if } N+1 > 0$$

The approach places no restriction on the structure of the loop nest when the accuracy of the analysis is increased to handle differences in branch costs through alternative execution paths: the sums can be split and different weights applied to the parts. Unit step functions can therefore be used to model these differences more accurately for particular (sets of) loop iterations, such as the first and last iteration.

The WCET estimate is improved by noting that the bound for ω(H1(i + 1)) can be tightened by taking execution path information into account. When we follow the first branch, which encompasses statements 3 and 4, the cost of the branch to the statements for ω(H1(i + 1)) is actually 1 cycle less than we are assuming in our estimate for statements 3 and 4. The estimated cost of the ω(H1(i + 1)) statement, when preceded by statements 6 through 8, is 2 cycles more than the actual cycles. Taking the minimum between these two branches results in an overestimation of 1 cycle for ω(H1(i + 1)).

With this adjustment the WCET estimate is:

$$\mathrm{WCET}_2(N) = 11 + \mu(N+1)[\Sigma((9+13+8) + [60, 81, 21]_i)]_{N+1} = 7N^3 + 51N^2 + 134N + 101 \quad \text{if } N+1 > 0$$

The actual running time will depend upon the contents of the array a[i]. The most costly execution path will occur when a[i] ≤ 0. The $\mathrm{WCET}_2$ estimate for N = 10 was 13541 cycles. The actual number of cycles executed while running the code was 13372 cycles. Our analysis overestimated the true cost of executing the code by 1%. This demonstrates that our method can be very accurate assuming that the architectural characteristics of the processor can be accurately modeled.
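The reported figures can be reproduced from the cycle bounds. The Python sketch below is an illustration with hypothetical function names: it sums the per-iteration cost that appears in the $\mathrm{WCET}_2$ expression, compares the total with the closed-form polynomial, and computes the overestimation relative to the measured 13372 cycles.

```python
def wcet2_direct(N):
    """Sum the WCET_2 expression term by term over i = 0..N."""
    total = 11                                # omega(H1(0))
    for i in range(N + 1):
        total += 9 + 13 + 8                   # constant part of the WCET_2 sum
        total += 60 + 81 * i + 21 * i * i     # polynomial [60, 81, 21]_i
    return total

def wcet2_closed(N):
    """Closed form 7N^3 + 51N^2 + 134N + 101, stated for N + 1 > 0."""
    return 7 * N**3 + 51 * N**2 + 134 * N + 101

assert all(wcet2_direct(N) == wcet2_closed(N) for N in range(0, 21))

estimate = wcet2_closed(10)
measured = 13372                              # cycle count reported in the text
overestimate_pct = 100 * (estimate - measured) / measured
print(estimate, round(overestimate_pct, 2))
```

For N = 10 the closed form gives 13541 cycles, an overestimation of about 1.3% over the measured run, in line with the roughly 1% stated above.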

These experiments give some insight into how our method might be both extended and applied to code.The inaccuracy in the method is dependent upon the degree of looseness of the statement bounds. Thesebounds can be tightened when the interaction between statements is taken into account. The percentageof inaccuracy in the estimate is also based upon the size of loop bodies. These two ideas will guide thechoice of statement cost estimate methods and at which loop level we should start to apply our summationtechnique.

6 Conclusions and Future Work

This paper presented a novel efficient counting method which can be applied to parametric WCET estimationfor rectangular and non-rectangular loop nests including those with zero-trip loops, non-unit strides, andmultiple critical iteration paths. The approach unifies low-level loop body statement WCET estimation witha loop iteration count estimation method by treating a loop nest as a weighted sum where each iteration ofthe loop body can have a specific cost. The cost of the loop body is bounded by a bounding function whichis summed over the iteration space in order to determine a WCET estimate for the loop. The approach isefficient and can handle affine and certain classes of non-affine loop bounds. Future work includes improvingthe bounding polynomialΘ used by the maximizing operator↑. We will also investigate which methods ofloop body WCET estimation are most appropriate to combine with our method. How to choose the looplevel, at which to construct the bounding functions, will also be determined by balancing the complexityof the resulting loop cost expression and the required tightness of the WCET estimate for the particularproblem being addressed.


References

[1] G. Alefeld. Interval arithmetic tools for range approximation and inclusion of zeros. In H. Bulgak and C. Zenger, editors, Error Control and Adaptivity in Scientific Computing, pages 1–21. Kluwer Academic Publishers, 1999.

[2] P. Altenbernd. The false path problem in hard real-time programs. In Proceedings of the 8th Euromicro Workshop on Real-Time Systems, pages 102–107, 1996.

[3] O. Bachmann. Chains of Recurrences. PhD thesis, Kent State University, College of Arts and Sciences, 1996.

[4] A. Bik and H. Wijshoff. Iteration space partitioning. Future Generation Computer Systems, 12:421–429, 1997.

[5] L. Bjorn. Fully automatic, parametric worst-case execution time analysis. In 3rd Intl Workshop on Worst-Case Execution Time (WCET) Analysis, 2003.

[6] W. Blume and R. Eigenmann. Demand-driven, symbolic range propagation. In 8th International Workshop on Languages and Compilers for Parallel Computing, pages 141–160, Columbus, Ohio, USA, Aug. 1995.

[7] R. Chapman. Worst-case timing analysis via finding longest paths in SPARK Ada basic-path graphs. Technical report YCS-94-246, Department of Computer Science, York University, 1994.

[8] P. Clauss. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs. In Proceedings of the 1996 International Conference on Supercomputing, pages 278–285. ACM Press, 1996.

[9] J. Engblom, A. Ermedahl, M. Sjoedin, J. Gustafsson, and H. Hansson. Worst-case execution-time analysis for embedded real-time systems. Journal of Software Tools for Technology Transfer, 14, 2001.

[10] T. Fahringer. Efficient symbolic analysis for parallelizing compilers and performance estimators. Supercomputing, 12(3):227–252, May 1998.

[11] R. Graham, D. Knuth, and O. Patashnik. Concrete Mathematics. Addison-Wesley, 1991.

[12] M. Haghighat and C. Polychronopoulos. Symbolic program analysis and optimization for parallelizing compilers. In 5th Annual Workshop on Languages and Compilers for Parallel Computing, LNCS 757, pages 538–562, New Haven, Connecticut, 1992. Springer Verlag.

[13] M. R. Haghighat. Symbolic Analysis for Parallelizing Compilers. Kluwer Academic Publishers, 1995.

[14] M. R. Haghighat and C. D. Polychronopoulos. Symbolic analysis: A basis for parallelization, optimization and scheduling of programs. In 1993 Workshop on Languages and Compilers for Parallel Computing, number 768, pages 567–585, Portland, Ore., 1993. Berlin: Springer Verlag.

[15] W. H. Harrison. Compiler analysis of the value ranges of variables. IEEE Transactions on Software Engineering, 3(3):243–250, May 1977.


[16] C. Healy, R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding pipeline and instruction cache performance. IEEE Transactions on Computers, 48(1):53–70, January 1999.

[17] C. Healy, M. Sjodin, V. Rustagi, D. Whalley, and R. van Engelen. Supporting timing analysis by automatic bounding of loop iterations. Real-Time Systems, pages 121–148, May 2000.

[18] S. Lee and T. Sakurai. Run-time voltage hopping for low-power real-time systems. In Design Automation Conference, pages 806–809, 2000.

[19] Y.-T. S. Li and S. Malik. Performance analysis of embedded software using implicit path enumeration. In Workshop on Languages, Compilers, and Tools for Real-Time Systems, pages 88–98, 1995.

[20] Y.-T. S. Li, S. Malik, and A. Wolfe. Efficient microarchitecture modeling and path analysis for real-time software. In IEEE Real-Time Systems Symposium, pages 298–307, 1995.

[21] S.-S. Lim, Y. H. Bae, G. T. Jang, B.-D. Rhee, S. L. Min, C. Y. Park, H. Shin, K. Park, S.-M. Moon, and C.-S. Kim. An accurate worst case timing analysis for RISC processors. Software Engineering, 21(7):593–604, 1995.

[22] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard real-time environment. Journal of the ACM, 20(1):46–61, January 1973.

[23] V. Loechner and D. K. Wilde. Parameterized polyhedra and their vertices. International Journal of Parallel Programming, 25(6):525–549, 1997.

[24] T. Lundqvist and P. Stenstrom. Integrating path and timing analysis using instruction-level simulation techniques. Lecture Notes in Computer Science, 1474, 1998.

[25] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, Englewood Cliffs, NJ, 1982.

[26] C. Y. Park. Predicting program execution times by analyzing static and dynamic program paths. Real-Time Systems, 5(1):31–61, March 1993.

[27] P. Pillai and K. G. Shin. Real-time dynamic voltage scaling for low-power embedded operating systems. In 18th ACM Symposium on Operating Systems Principles, 2001.

[28] W. Pugh. Counting solutions to Presburger formulas: How and why. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 121–134, Orlando, FL, June 1994.

[29] W. Pugh and D. Wonnacott. Eliminating false data dependences using the Omega test. In ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 140–151, San Francisco, CA, June 1992.

[30] H. Ratschek and J. Rokne. Computer Methods for the Range of Functions. Ellis Horwood Limited, Chichester, West Sussex, PO19 1EB England, 1984.

[31] R. Sakellariou. On the Quest for Perfect Load Balance in Loop-Based Parallel Computations. PhD thesis, Department of Computer Science, University of Manchester, 1996.

[32] R. Sakellariou. Symbolic evaluation of sums for parallelising compilers. In IMACS World Congress on Scientific Computation, Modelling and Applied Mathematics. Wissenschaft & Technik Verlag, 1997.


[33] D. Shin, J. Kim, and S. Lee. Low-energy intra-task voltage scheduling using static timing analysis. In Design Automation Conference, pages 438–443, 2001.

[34] Y. Shin and K. Choi. Power conscious fixed priority scheduling for hard real-time systems. In Design Automation Conference, pages 134–139, 1999.

[35] L. N. SPRU172C. TMS320C54x Reference Set. Texas Instruments Incorporated, Dallas, Texas, 2001.

[36] F. Stappert, A. Ermedahl, and J. Engblom. Efficient longest executable path search for programs with complex flows and pipeline effects. In CASES, pages 132–140, 2001.

[37] N. Tawbi. Estimation of nested loops execution time by integer arithmetic in convex polyhedra. In Proceedings of the 8th International Parallel Processing Symposium, pages 217–221. IEEE Computer Society, 1994.

[38] E. Vivancos, C. Healy, F. Mueller, and D. Whalley. Parametric timing analysis. In LCTES 2001, pages 88–93, 2001.

[39] B. Walsh, R. van Engelen, K. Gallivan, J. Birch, and Y. Shou. Parametric intra-task dynamic voltage scheduling. In Proceedings of COLP 2003, 2003.

[40] E. Zima. Simplification and optimization transformations of chains of recurrences. In Proc. of the International Symposium on Symbolic and Algebraic Computing, Montreal, Canada, 1995. ACM.
