Optimizing Tunable WCET with Shared Resource Allocation ...Optimizing Tunable WCET with Shared...

Optimizing Tunable WCET with Shared Resource Allocation and Arbitration inHard Real-Time Multicore Systems

Man-Ki Yoon, Jung-Eun Kim, and Lui ShaDepartment of Computer Science, University of Illinois at Urbana-Champaign

{mkyoon, jekim314, lrs}@illinois.edu

Abstract—The unpredictable worst-case timing behavior ofmulticore architectures has been the biggest stumbling block fora widespread use of multicores in hard real-time systems. A greatdeal of research effort has been devoted to address the issue.Among others, the development of a new multicore architecturehas emerged as an attractive solution because it can eliminate theunpredictable interference sources in the first place. This opensa new possibility of system-level optimizations with multicore-based hard real-time systems. To address this issue, we propose anew perspective of WCET model called tunable WCET, in whichthe WCETs of tasks are elastically adjusted according to theoptimal shared resource allocation and arbitration methods. Forthis, we propose novel WCET-aware harmonic round-robin busscheduling and two-level cache partitioning method. We presenta mixed integer linear programming formulation as the solutionto the optimization of tunable WCETs. Our experimental resultsshow that the proposed methods can significantly lower overallsystem utilizations.

I. INTRODUCTION

Multicore processors are receiving wide attention fromavionic and automotive industries as the demand for high-endreal-time applications is rapidly growing [1], [2]. However,the major obstacle in applying multicore processors to suchdomains is that the execution time of applications can varynoticeably depending on how physical resources, such asshared cache [3]–[5] and shared bus [6]–[8], are contendedbetween tasks on multiple cores. This unpredictable inter-core interferences in current multicore architecture are a hugebarrier, especially for safety-critical systems in which the pre-dictability of the worst-case temporal behavior is of primaryimportance. One possible solution to this kind of problem liesin analytic methods that can precisely estimate the worst-caseexecution times (WCETs) of applications in the presence ofshared resource contentions [9]–[11]. The assumptions madein the existing analyses are commonly restrictive, however,and thus the results are often very pessimistic or not evenapplicable directly to the current multicore architectures.The more serious problem is that, as multicore architecturesbecome more complex, the correlation among the inter-coreinterferences sources becomes much higher than before.

Due to such fundamental limitations of the analytic meth-ods, hardware modification on multicore architectures hasemerged as an attractive and viable solution [12]–[14]. Whileanalytic methods try to analyze inter-core resource con-tentions, the new multicore architectures focus on eliminat-ing such interferences in the first place for higher timing

predictability. This line of research opens a new possibilityof system-level optimizations with multicore-based hard real-time systems, as we will explore throughout this paper.

A. Motivating Hard Real-Time Multicore Architecture

In [14], Paolieri et al. proposed a new hard real-timemulticore architecture in which accesses to shared bus andcache are controlled by hierarchical arbiters. The architectureemploys round-robin as the shared bus arbitration policy, thusthe maximum bus access delay is bounded by the numberof cores in a system. They also analyzed shared cacheinterference with regard to two factors - bank conflict andstorage conflict. The maximum bank conflict delay is similarlybounded as the bus access conflict is. They addressed cachepartitioning techniques that can eliminate storage conflicts bysplitting a cache space into private banks or columns.

This architecture provides a good architectural foundationfor future hard real-time multicore systems. First of all, sinceresource contentions are resolved by hardware arbiters, it doesnot require any modification on applications’ source codeand also does not impose any restrictions on programminglanguage or OS. Furthermore, the WCET of a task can beobtained without the knowledge of other tasks.

B. Tunable WCET and Its Optimization

Paolieri’s multicore architecture provides a high degreeof temporal predictability of the applications’ WCET; eachcore or task has its exclusive spatial and temporal partitionsfor shared resource accesses. While this makes the WCETanalysis much easier by eliminating the potential sourcesof resource contentions, one possible limitation is that theresources may have limited capacities to accommodate a givenworkload. Recall that every task is assigned to private cachebanks or columns in order to avoid storage conflicts. Bank-level partitioning requires as many banks as the number oftasks in the system. Column-level partitioning may resolvethe capacity problem, however tasks may experience bankconflict delays when accessing the banks. Therefore a properpartitioning method which can efficiently utilize the sharedcache space while minimizing such interferences is needed.

Another possible way of improvement is the use ofapplication-aware bus scheduling. While the pure round-robin scheduling can easily bound the worst-case bus accessdelay, it may be inefficient in that every bus access has towait for the same amount of worst-case delay regardless of

fixed exec time tunable delay(A)

wceti

bound

fixed exec time tunable delay(C)

fixed exec timetunable

delay(B)

configuration A

configuration B

configuration C

Figure 1: Tunable WCET model.

application characteristics; memory-bound tasks are likely toaccess the shared bus more intensively than others. If we givemore frequent chances to such tasks by lengthening others’worst-case access times, we can achieve an enhanced overallefficiency, e.g., lower system utilization. This advantage canbe magnified, especially if the system mainly consists of asubclass of numerical real-time tasks, such as signal or imageprocessing applications, that has few execution branches andwhose cache footprints rarely change from period to period.

In order to address the above problems, we propose a newperspective of WCET model called tunable WCET, in whichthe WCET of a task is partitioned into two components - fixedexecution time and tunable delay, as illustrated in Fig. 1. Inthis model, the tunable delay of a task is a function of systemconfiguration, and thus it enables system-level optimizationfor certain purposes by elastically adjusting shared resourceconfigurations. In particular, we focus on the two major inter-core interference sources - shared bus and shared cache. Inthis paper, we investigate how different configurations of busarbitration and cache partition could affect tasks’ tunable de-lay. In order to achieve this goal, we adopt Paolieri’s multicorearchitecture [14] and propose novel bus arbitration and cachepartitioning methods called harmonic round-robin and two-level cache partitioning, respectively. Our harmonic round-robin arbitration realizes application-aware bus schedulingby varying the worst-case bus access delays of differentcores. Similarly, our two-level cache partitioning schememaps banks to cores and columns to tasks in a way that bankaccess conflicts are minimized with the help of a harmonicround-robin bus schedule. As the solution to the optimizationproblem of tunable WCETs, we present a mixed integer linearprogramming (MILP) formulation. As will be discussed later,our experimental results show that the proposed methods cansignificantly lower overall system utilization.

The rest of this paper is organized as follows: Sec. IIintroduces our HRR bus arbitration and two-level cache par-titioning method and defines the tunable WCET optimizationproblem. Then, Sec. III describes in detail the tunable WCETmodel and its analysis. In Sec. IV, we present the MILPformulation for the tunable WCET optimization problem. Theexperimental results are given in Sec. V. Sec. VI summarizesthe related work, and then Sec. VII concludes this paper.

II. OPTIMIZATION OF TUNABLE WCET

In this section, we introduce our harmonic round-robin busarbitration and two-level cache partitioning method, and thendescribe how these affect our tunable WCET model in theperspective of system-level optimization.

1 2 3 4 1 2 3 4 1 2 3 4 . . .

Task A’s request

from core 1

4·LB 4·LB

Task B’s request

from core 3

1 2 1 3 1 2 1 4 1 2 1 3 . . .

2·LB 8·LB

Task A’s request

from core 1

Task B’s request

from core 3

(a) Pure round-robin (4,4,4,4).

(b) Harmonic round-robin (2,4,8,8).

Figure 2: Pure round-robin vs. Harmonic round-robin.

A. System Model

We consider a multicore system that consists of NC

homogeneous cores, C = {C1, C2, · · · , CNC}. The systemhas a shared cache B which is partitioned into NB banks{B1, B2, · · · , BNB}, each of which is subdivided into NW

columns. That is, the cache has total NX = NB·NW columns,i.e., X = {X1, X2, · · · , XNX}. On that system, we assume aset of NΓ real-time tasks Γ = {τ1, τ2, · · · , τNΓ}, each ofwhich is represented by τi = (ei, pi, N

Xi , N

Mi ); τi executes

on Cj with the execution time of ei and the period of pi,and it accesses NX

i cache columns NMi times to store/load its

instructions and data. The cache is partitioned by the two-levelpartitioning method, and the bus access is arbitrated by anHRR schedule, both of which are introduced in the followingsubsections. With these constraints, we further assume thatboth NX

i and NMi are pre-profiled by a static analysis.

B. Harmonic Round-Robin Bus Arbitration

In pure round-robin bus scheduling, the bus access delaysof every task are upper-bounded by NC · LB , where NC isthe number of cores and LB is the bus access latency 1. Asmentioned before, this is inefficient in that the same amountof bus access delay of different tasks affects a certain per-formance metric, e.g., overall system utilization, differently.For example, suppose that τA in C1 and τB in C3 havethe same worst-case bus access delay of 4 · LB as shownin Fig. 2(a). If their period is 50 but the total numbers ofcache accesses, NM

A and NMB , are 500 and 100, respectively,

then the contributions of their bus access delays to the systemutilization are ubusA =(500·4·LB)/50 and ubusB =(100·4·LB)/50,respectively. Similarly, suppose now that pA is 10 and NM

A is100. In both cases, ubusA /ubusB is 5, meaning that τA affects thesystem utilization five times more than τB . Now let us supposethat C1 and C3 are guaranteed to be able to access the busevery 2 and 8 slots, respectively, as shown in Fig. 2(b). Then,ubusA and ubusB become (100 · 2 ·LB)/10 and (100 · 8 ·LB)/50,respectively. Accordingly, the net contribution, i.e., ubusA +ubusB ,is reduced from (2400 · LB)/50 to (1800 · LB)/50.

As can be observed from this example, by giving morefrequent slots to cores on which memory-intensive or high-utilization tasks run, we can lower the overall system uti-

1We assume that a bus request should arrive at the bus before eachdesignated time slot to be granted to send the request at the bus slot.

Mk' Mk' Mk' Mk'

B B

1 2 1 3 1 2 1 4 1 2

Mk MkMkMk

B B Mk MkMkMk

1 5

B B Mk MkMkMk

B B Mk MkMkMk

B B Mk MkMkMk

...

B B Mk MkMkMk

B B Mk MkMkMk

B B Mk MkMkMk

B B Mk MkMkMk

Mk

Mk

B B Mk MkMkMk Mk

Mk

Mk

1 2 1 3 1 2 1 4 1 2 1 5 ...

(c) HRR (2,4,24,24,24,24,24,24). Core 2, 3, and 4 share Bank k. LM=2·LB. The bank conflict delays can be eliminated by scheduling the bus with a harmonic round-robin arbitration.

(d) HRR (2,4,24,24,24,24,24,24). Core 2, 3, and 4 share Bank k. LM=(5/2)·LB. An increase of LM causes bank conflict delays to Core 2, 3, and 4.

B B

1 2 3 4 5 6 7 8 1 2 ...

Mk MkMkMk

B B Mk MkMkMk

B B Mk MkMkMk

B B

1 2 3 4 5 6 7 8 1 2 …

Mk MkMkMk

B

B B Mk MkMkMk

(a) Pure RR. Core 2, 3, and 4 share Bank k. Core 3 and 4 suffer bank conflict delays.

(b) Pure RR. Core 2 and 4 share Bank k, and Core 3 uses Bank k’. By assigning a different bank to Core 3, all bank conflict delays become eliminated.

TRR

Bus AccessB B Bank Conflict Delay Mk MkMkMk Access to Bank k Access to Bank k’

Mk' Mk' Mk' Mk'B

Figure 3: Bank conflict delays with different bus and cache configurations.

lization, especially if the cache access patterns of the tasksrarely change during their executions. If the tasks are notpre-assigned to cores, we can further reduce it by groupingsuch tasks and assigning them to the more prioritized cores.Thus, as a method of core prioritization in bus scheduling,we propose Harmonic Round-Robin arbitration policy, whichcan be defined as follows:

HRR � (NC, Tmin, Tmax, TRR,P),

where P is a set of HRR periods {T1, T2, · · · , TNC},∀j Tmin ≤ Tj ≤ Tmax, and TRR is the hyper period of P,i.e., the length of one round. The HRR periods harmonize witheach other if and only if they satisfy the following conditions:

∀1�j�NC−1

Tj+1

Tj∈ N and

NC∑j=1

1

Tj= 1. (1)

Because {Tj} are bounded within [Tmin, Tmax], only a finitenumber of harmonic sets can be made within the given range.Once a set of HRR periods {T1, T2, · · · , TNC} satisfying Con-dition (1) is obtained, we can create the unique correspondingHRR table of length TRR = TNC by constraints C9–C11in Sec. IV. For example, Fig. 2 shows the scheduling tablescorresponding to (4, 4, 4, 4) and (2, 4, 8, 8), respectively.

With our harmonic round-robin arbitration, we can alsoachieve the same argument of [14] that the worst-case busaccess delay of a task can be obtained without the knowledgeof other real-time tasks. Furthermore, more importantly, ithelps our two-level cache partitioning method reduce bankconflicts, as will be described in the following subsection.

C. Two-Level Cache Partitioning

In [14], the authors consider a column-level cache partition-ing method called columnization [15] as a way to eliminate

3 1 2 1

B B

B B

B B

B B

1 2 1 3 1 2 1 4 1 2

M M M M M

M M M M M

M M M M M

M M M M M

B B M M M M M

1

B B M

B B

Bank conflict delays increase

without bounds.TRR

Figure 4: An example of unbounded bank conflict delay.

storage conflict delays. In columnization, however, a task maysuffer a bank conflict delay if another task is already accessingthe same bank. To avoid such interference, we may allocate aset of private banks to each task, which is called bankization[14]. However, this is restrictive in that the shared cache hasto have at least as many banks as the number of tasks.

However we can reduce or even eliminate bank conflictswithout giving private banks to every task by consideringthe bus schedule, as illustrated in Fig. 3(a) and (b). In theexample, C2 and C4 can share bank Bk without suffering anybank conflict delays since the request from C4 begins afterC2 completes its request. On the other hand, C3 may suffer abank conflict delay since its requests can be overlapped withthe one from C2. For the same reason, C3 and C4 cannotshare any bank without suffering delays. However, as shownin (b), if we assign another bank Bk′ to C3, none of the coresexperiences bank conflict delays.

However one may encounter a situation where no morebanks are available for C3 in Fig. 3(a). In that case, we canfurther reduce or eliminate the bank conflict delays with thehelp of a harmonic round-robin schedule, as illustrated in Fig3(c). The HRR schedule in the example enables the bankaccess requests from C2, C3, and C4 to be serial, that is tonot overlap.

Another important factor that influences on the worst-casebank conflict delay and bank-sharing is the ratio of bankaccess latency, LM , to the bus access latency, LB. To make

Core i Core j

Task A Task B

Bank kBank k-1 Bank k+1 Bank k+2

Core i uses banks [k-1,k], and

Task A occupies five columns.

Core j uses banks [k,k+2],

Task B occupies four columns, and

Task C occupies three columns.

Task C

Shared BankNon-Shared BanksShared Bank

jsB j

eB

Figure 5: Two-level cache partitioning.

the point clear, let us consider Fig. 3(c) again. As mentionedbefore, the bank conflicts among C2, C3, and C4 could beavoided since their slots are far enough apart from each other.However once LM becomes 2.5 · LB , then the cores startexperiencing bank conflict delays, as shown in Fig. 3(d).Furthermore, the delays can be accumulated beyond oneround, which we call unbounded bank conflict delay problem(see Fig. 4). In order to prevent it, the busy period of bankaccesses should be bounded by the length of one round. Sinceeach Cj can generate at most TRR

Tjbank requests within TRR,

the constraint that prevents such unbounded bank conflictdelays for a shared bank, Bk, can be expressed as follows:

∑∀ core j using bank k

LMLB · Tj ≤ 1. (2)

Thus, in Fig. 4, one of the cores should use a separate bank.As has been described above, arbitrary bank allocation may

introduce unnecessary bank conflicts, and the problem canbe aggravated by insufficient banks. One efficient way topartition a shared cache and thus to minimize bank-sharing isto allocate a contiguous subset of banks to each core and toallow any two cores to share at most one bank - the leftmostor the rightmost bank allocated to each core. Sharing onlyone bank between two cores is sufficient and better thansharing multiple banks in that the latter only increases thechance of bank conflict delays. Assume that CA and CBshare two banks, e.g., Bk and Bk+1, each of which has10 columns. Suppose that CA uses 7 columns of Bk and5 columns of Bk+1, and CB uses the rest. This is, however,equivalent to giving all columns of Bk and 2 columns ofBk+1 to CA and then letting only Bk+1 be shared betweenCA and CB . That is, sharing of m banks between any twocores can be transformed to single sharing. By this core-level partitioning, the chance of bank conflict delay can bereduced and moreover the memory address mappings canbe simplified. In addition to the core-level partitioning, wesubdivide each bank into several columns and then map eachtask to a set of contiguous private columns of the banksallocated to the core where the task runs on, as shown inFig. 5. By this task-level subpartitioning, we can eliminatestorage conflict interferences among tasks even in the samecore. Moreover, we can prioritize the tasks in allocating their

fixed exec time tunable delay

bus latencybus access delay bank latencybank conflict delay

wceti

bound

Figure 6: Delay components of tunable WCET model.

Cj

B M

LMLB

Cj Cj

diB di

M

HRR schedule

request from core j

Figure 7: Worst-case tunable delay of a cache access.

columns. Let us consider an example shown in Fig. 5, whereBk is shared between Ci and Cj . Since task τB shares Bkwith τA, it may suffer bank conflict delays. However, Bk+1

of Cj cannot be shared with any other cores by our core-level partitioning, and hence the bank accesses from τC arefree from any bank conflicts. We call such banks BCD-freebanks - a set of banks in which tasks mapped to a subsetof their columns cannot experience any bank conflict delays.Accordingly, a more memory-intensive or high-utilization taskcan benefit from using such BCD-free banks.

D. Tunable WCET

Fig. 6 illustrates the rationale behind the tunable WCETmodel proposed in this paper; the WCET of a task is parti-tioned into fixed execution time and tunable delay. While theformer is the maximum time duration that a task could take toexecute the instructions over its critical path, the latter is thesum of the delays incurred for all of its cache accesses overthe same path. In particular, we model the variable delaysas the function of bus and cache configurations. Accordingly,the tunable part is subdivided into bus access delays and bankaccess delays. Now let dBi and dMi are the upper-bounds ofbus access and bank conflict delays for each cache access, asshown in Fig. 7. Then, the WCET of τi can be defined asfollows:

wceti = ei +NMi · {L+ (dBi + dMi )}, (3)

where NMi is the number of τi’s cache accesses, L = 2·LB+

LM is the sum of fixed latencies 2.One may argue that the critical path of, and thus NM

i of,τi can be changed according to the variable delays. That is,the critical path cannot be derived without knowing HRRschedule and bank assignments in advance. While this is truein general, the analysis of tunable WCET and its optimizationwill become significantly more complex if we take a variablecritical path into account. Thus, we assume in this paper thatthere exists an execution path whose fixed execution time, ei,is so long enough that other paths cannot be longer than theobtained critical path even if they would experience maximumpossible delays.

2We assume the bus is full-duplex as was assumed by [14]. Thus, onlythe core-to-cache requests can be delayed. For the simplicity of illustrations,cache-to-core ones are not shown in any figure of this paper.

88

B B

B B

B B

B B

1 2 3 4 1 2 5 6 1 2 7

M M M M

B B

M M M M

M M M M

M M M M

M M M M

1 2 3 4 1 2 5 6 1 2 7

B B M M M M

B B

B B

B B

B B

M M M M

M M M M

M M M M

M M M M

B B M M M M B B M M M M

. . . 1 2 3 4

slot # 24121 2 3 4 5 6 7 8 9 10 11 13 14 15 16 17 18 19 20 21 22 23

2,22w

2,22sd

1st round 2nd round

of C2 at slot 22Worst-case slot delay

2,22 (22 ) 2B B Bw L L L= − ⋅ + = ⋅

that delays C2 at slot 22Worst-case busy period

Figure 8: The worst-case bank access scenario of C2, C5, C6, and C8 which share Bk. HRR : (4, 4, 12, 12, 12, 12, 12, 12).

E. Problem Description

For a given multicore system, our problem is to find theoptimal task assignments, harmonic round-robin schedule, andcore-to-banks and task-to-columns mappings that minimizethe overall system utilization, i.e.,

MinimizeNΓ∑i=1

wcetipi

. (4)

Low system utilization is generally preferred in system devel-opment since 1) a lower-utilized system can be more utilizedby accommodating additional tasks or 2) the same task set canbe implemented with lower-speed cores, which can reduce theunit cost of production. In Sec. IV, we will present the MILPformulation for this optimization problem.

III. TUNABLE WCET ANALYSIS

In this section, we explain in detail how to find the worst-case bus access, dBi , and bank conflict delays, dMi , in Eq. (3).

A. Bus Access Delay dBi

The worst-case bus access delay is defined as the maximumlength of time that a bus request should wait until it is granted.A pure round-robin schedule can bound it by NC · LB , andthus dBi is independent of which core τi is allocated to. In anHRR schedule, on the other hand, task allocation is a delayfactor since cores may have different HRR periods. Thus, ifτi runs on Cj whose HRR period is Tj , a bus access fromτi can be delayed at most Tj bus slots in the worst-case.Accordingly,

dBi = Tj · LB. (5)

B. Bank Conflict Delay dMi

Let us suppose that τi runs on Cj . By the system modelassumed in Sec. II-A, Cj uses a contiguous subset of banks,Bj = {Bjs, Bjs+1, · · · , Bjs+nB

j −1}, where nBj is the number

of banks required by Cj , which depends on the total numberof columns required by all tasks in Cj . Recall that only theleftmost and rightmost banks, i.e., Bjs and Bj

s+nBj −1

, can be

shared with others as illustrated in Fig. 5. For simplicity ofnotation, let us denote Bj

s+nBj −1

by Bje .

To identify which banks τi uses, let us denote its cachecolumns as Xi = {X i

k, Xik+1, · · · , X i

k+NX

i −1}, where NX

i

is the number of columns required by τi. Then, one of thefollowing cases holds:

Case 1. a subset or all of Xi reside in Bjs , but not in Bje ,Case 2. a subset or all of Xi reside in Bje , but not in Bjs ,Case 3. none of Xi reside in either Bjs or Bje , orCase 4. Xi stretch from Bjs to Bje .

The upper-bound of bank conflict delays can vary in differentcases. For example, in Fig. 5, τC is free from bank conflictinterferences since all of its columns are in Cj’s BCD-freebanks (Case 3). On the other hand, τA and τB use the sharedbank k (Case 1–2), and thus could experience bank conflictdelays. Meanwhile, the columns of τi may stretch from Bjsto Bje (Case 4). In this case, all accesses of τi are assumedto experience the worst-case delay, which will depend on thebank conflict delays of Bjs and Bje

3.Now let Dj

s and Dje be the upper-bounds of bank conflict

delays that a task on Cj could experience when accessingBjs and Bje , respectively. If Case 1 holds for τi, the worst-case bank conflict delay that τi could suffer, i.e., dMi , is Dj

s.Similarly, dMi for Case 2 is Dj

e. For Case 3, dMi is always0. Lastly, dMi for Case 4 is the maximum of Dj

s and Dje.

Accordingly, dMi can be expressed by the following equation:

dMi = max(δij,s ·Djs, δ

ij,e ·Dj

e), (6)

where δij,s (δij,e) is 1 if τi uses Bjs (Bje) and 0 otherwise.

1) Computation of Djk: Let Dj

k be the worst-case bankconflict delay that any task in Cj could suffer due to usingBk. To help to understand the analysis in this subsection, letus first consider an example shown in Fig. 8. We assume herethat τi runs on C2, and a subset of Xi resides in Bk.

We can first observe from Fig. 8 that the different bankaccesses from C2 experience different bank conflict delays.This is because the cores have different HRR periods; whileC3, C4, and C1 do not interfere with C2 at slot 6, C5 andC6 delay C2 at slot 10, and so on. Thus, we need to computeeach delay that τi could suffer at each slot of C2. Now let uscall such delay slot delay and define it as dsj,ϕ, where j and ϕare the indices of core and of slot, respectively. For example,in Fig. 8, ds2,6 is 0, ds2,14 is 2 ·LB , and so on. If Cj does not

3We assume the target address (column index) of an access is unknown inanalyzing the WCET of a task. Thus, if τi uses both shared and non-sharedbanks, e.g., Task B in Fig. 5, for the tractability of the analysis, we assumethat every access of τi goes to the shared bank in the worst-case.

use slot ϕ, dsj,ϕ is 0. Although the slot delays of a core canbe different with each other, we can find the maximum slotdelay in the second HRR round due to the following lemma:

Lemma 1. Slot delay dsj,ϕ+TRR is always equal to dsj,ϕ exceptfor ϕ = φj , where φj is the first slot index of Cj in a givenHRR table. For ϕ = φj , dsj,φj

≤ dsj,φj+TRR always holds.

Proof: If Cj does not use slot ϕ, then dsj,ϕ+i·TRR isalways 0 for all i = 0, 1, 2, . . .. Thus, in what follows, letus consider the case when Cj uses slot ϕ. We will prove i)dsj,ϕ ≥ dsj,ϕ+TRR and ii) dsj,ϕ ≤ dsj,ϕ+TRR . For the simplicityof notations, let us denote dsj,ϕ and dsj,ϕ+TRR by d1 and d2,respectively, as shown in Fig. 9.i) d1 ≥ d2 : Let us assume that d1 < d2. Then, d2 > 0 sinced1 ≥ 0. Because d2 is non-zero, there must exist slot ϕx(ϕ < ϕx < ϕ + TRR), where the most recent accumulationof bank accesses begins. Now let nx be the number of bankaccesses initiated in [ϕx, ϕ+ TRR − 1]. Then,

d2=ϕx+LB+nx·LM−(ϕ+TRR+LB)=ϕx+nx·LM−ϕ−TRR.Now let us consider slot ϕx − TRR and denote its slot delayby dy . Then, ϕx−TRR+LB+dy+ny ·LM is the time instantwhen the last bank access initiated before ϕ completes. Here,ny is the number of bank accesses initiated in [ϕx−TRR, ϕ−1], which is equal to nx because of the periodicity of HRRschedule. Now, suppose that d1 > 0. Then,

d1 = ϕx − TRR + LB + dy + ny · LM − (ϕ+ LB)

= ϕx − TRR + nx · LM − ϕ+ dy = d2 + dy.

The above equality results in d1 ≥ d2 because dy ≥ 0, whichcontradicts the assumption that d1 < d2. Let us now considerthe case where d1 = 0. Then,

ϕx − TRR + LB + dy + ny · LM − (ϕ+ LB) ≤ 0

⇒ϕx − TRR + nx · LM − ϕ+ dy ≤ 0 ⇒ d2 ≤ −dy,which results in d2 = 0 since dy ≥ 0. This contradicts theassumption that d1 < d2. Therefore, d1 < d2 never holds,concluding that d1 ≥ d2.ii) d1 ≤ d2 : This can be proved similarly with the abovearguments. Let us assume that d1 > d2. Then, there mustexist ϕx(< ϕ) that satisfies the following:

d1=ϕx+LB+nx · LM − (ϕ+LB)=ϕx+nx · LM −ϕ>0.

Now let us consider slot ϕx + TRR and denote its slot delayby dy . In this case also ny = nx. Then,

d2 = ϕx + TRR + LB + dy + ny · LM − (ϕ+ TRR + LB)

= ϕx + nx · LM − ϕ+ dy = d1 + dy > 0

because d1 > 0 and dy ≥ 0. Accordingly, d1 ≤ d2, whichcontradicts our assumption that d1 > d2. Thus, d1 ≤ d2.

By both i) d1 ≥ d2 and ii) d1 ≤ d2, we can thereforeconclude that d1 = d2, i.e., dsj,ϕ = dsj,ϕ+TRR , always holds.However, Case i) may not hold for the case where ϕ = φjsince slot ϕx − TRR may not exist. In Case ii), on the other

y Mn Li

Cx CxCj Cj

LB LM

ϕ

2d

RRTϕ +xϕ

. . .

LB x Mn Li

RRx Tϕ −

LB

. . .

yd LB

1d

RRT

Cx CxCj Cj

LB LM

2d

. . .

RRx Tϕ +

LB

y Mn Li

. . .

LB LM

1d

x Mn Li

LM

. . .

. . .

LB

yd

Case i)

Case ii)

1 2d d≥

1 2d d≤

ϕxϕ RRTϕ +

RRT

RRTRRT

. . .

. . .

Figure 9: Proof of Lemma 1. i) d1 ≥ d2 and ii) d1 ≤ d2.

hand, ϕx (< φj) always exists if d1 > 0, and d1 ≤ d2 alwaysholds if d1 = 0. Thus, we can conclude that only dsj,ϕ ≤dsj,ϕ+TRR holds for ϕ = φj .

By Lemma 1, we can therefore find Djk by considering only

the slots in the second round. That is,

Djk = max(dsj,ϕ), (7)

for ϕ = φj + TRR, φj + TRR + Tj, . . . , φj + 2 · TRR − Tj .However, dsj,ϕ for the slots in the first round also need to becalculated since dsj,ϕ−Tj

is used when computing dsj,ϕ, as willbe described shortly. Note that Lemma 1 does not hold in thepresence of unbounded bank conflict delays.

2) Computation of dsj,ϕ: Let us first denote the ϕth busslot as σϕ. Each slot delay dsj,ϕ is affected by how muchunfinished bank accesses have been accumulated until to σϕ.To model such busy period, let us define wj,ϕ as the timeinstant at which the most recent access to the shared bankcompletes before σϕ. To help to understand how to find wj,ϕ,let us consider the example in Fig. 8 again, and suppose thatwe want to compute w2,22. The busy period begins from thebank access of C2 at σ18 where there is no backlog of accessesto Bk. Thus, the initial busy period, w0

2,22, is 18+LB+LM =21 ·LB. It delays the access from σ19 by w0

2,22 − (19+LB),and which results in

w12,22=19+LB+(w0

2,22−(19+LB))+LM =w02,22+LM =23·LB.

Likewise, the access from C6 at the next slot is delayed bythe accumulated delay, which makes the busy period grow to

w22,22=20+LB+(w1

2,22−(20+LB))+LM =w12,22+LM =25·LB.

The busy period stops growing at σ21 because C1 does notuse Bk, thus the final value of w2,22 ends up being 25 · LB.As has been seen, the busy period grows by LM for everynew access. However, when it is discontinued, a new busyperiod continues from a new access.

Lemma 2. The worst-case busy period that can delay Cj atσϕ accessing Bk, wj,ϕ, can be found by the following iterative

Cj Cj

LB LM

. . .

LMLB . . .

, jj Tw ϕ −

,jw ϕ

jTϕ − ϕjT

, j

sj Td ϕ−

,sjd ϕ0

,jw ϕ

. . . . . .

Figure 10: Calculation of wij,ϕ and dsj,ϕ.

procedure:

w0j,ϕ =

{0 if ϕ = φj ,ϕ− Tj+LB+dsj,ϕ−Tj

+LM otherwise. (8)

If Cj at σψ does not use Bk, wi+1j,ϕ = wij,ϕ. Otherwise,

wi+1j,ϕ =

{wij,ϕ + LM if ψ + LB < wij,ϕ,ψ + LB + LM otherwise.

(9)

The procedure loops from ψ = ϕ − Tj to ϕ − 1, and thuswj,ϕ = w

Tj−1j,ϕ . If ϕ = φj , the procedure loops from ψ = 1

to φj − 1, and wj,ϕ = wφj−1j,ϕ .

Proof: We first show that the sequence of wij,ϕ isnon-decreasing. Let us consider slot σψ . If Cj at σψ doesnot access Bk, the busy period remains unchanged, i.e.,wi+1j,ϕ = wij,ϕ. Otherwise, wi+1

j,ϕ ≥ wij,ϕ + LM holds due tothe following:

i) If the new access from σψ is initiated before the busy periodwij,ϕ ends, it grows by LM . Thus, wi+1

j,ϕ = wij,ϕ + LM .

ii) If the new access from σψ is initiated at or after the endof busy period wij,ϕ, i.e., ψ + LB ≥ wij,ϕ, then,

wi+1j,ϕ = ψ + LB + LM ≥ wij,ϕ + LM .

Therefore, wi+1j,ϕ ≥ wij,ϕ always holds.

Now we will show that wij,ϕ ≤ wj,ϕ holds for all i. Firstof all, as described in Fig. 10, the initial busy period, w0

j,ϕ,can be derived from dsj,ϕ−Tj

by the following equality:

w0j,ϕ=wj,ϕ−Tj +LM =ϕ− Tj+LB+d

sj,ϕ−Tj

+LM .

That is, it is sufficient to consider only the slots in [ϕ −Tj, ϕ−1]. Accordingly, the iterative procedure loops Tj times,computing (w0

j,ϕ,w1j,ϕ,. . . , wTj−1

j,ϕ ). Since the sequence ofwij,ϕ is non-decreasing,

w0j,ϕ ≤ w1

j,ϕ ≤ · · · ≤ wTj−2j,ϕ ≤ w

Tj−1j,ϕ = wj,ϕ.

Thus, wij,ϕ ≤ wj,ϕ holds for all i = 0, 1, . . . , Tj−1. Similarly,if ϕ = φj ,

w0j,ϕ ≤ w1

j,ϕ ≤ · · · ≤ wφj−2j,ϕ ≤ w

φj−1j,ϕ = wj,ϕ.

Therefore, wj,ϕ calculated by the above procedure is theupper-bound of the busy period that can delay Cj at σϕ.

Algorithm 1 summarizes the computation process of wj,ϕ.

Algorithm 1 CALC wj,ϕ (j, ϕ, k,HRR)

1: HRR[ψ]: the idx of core using slot ψ in the given HRR.2: Bk[C]: true if core C uses bank k and false otherwise.3: φj : the idx of the 1st slot of core j in the given HRR schedule.4: if ϕ = φj then5: w0

j,ϕ := 0; ψ := 16: else7: w0

j,ϕ := ϕ− Tj + LB + dsj,ϕ−Tj+ LM ; ψ := ϕ− Tj

8: end if9: i := 0

10: while ψ < ϕ do11: if Bk[HRR[ψ]] =true then12: if ψ + LB < wi

j,ϕ then13: wi+1

j,ϕ ← wij,ϕ + LM

14: else15: wi+1

j,ϕ ← ψ + LB + LM

16: end if17: else18: wi+1

j,ϕ ← wij,ϕ

19: end if20: i← i+ 1; ψ ← ψ + 121: end while22: wj,ϕ ← wi−1

j,ϕ

Once wj,ϕ is obtained, dsj,ϕ can be computed by

dsj,ϕ = max(wj,ϕ − (ϕ+ LB), 0), (10)

as illustrated in Fig. 10.

Theorem 1. dMi computed by Eq. (6)–(10) is the worst-casebank conflict delay of task τi.

Proof: The theorem is an immediate application ofLemma 1 and Lemma 2. In summary, the slot delays of Cjfor each bank Bk in Bj are first computed by Eq. (8)–(10).Then, by Eq. (7), we can find the worst-case bank conflictdelay Dj

k that any task in Cj could suffer due to using Bk.Finally, dMi can be found by Eq. (6), of which value dependson whether τi uses any shared bank of Cj , Bjs and/or Bje .

IV. MILP FORMULATION

In this section, we present a mixed integer linear program-ming formulation for the tunable WCET optimization problemdescribed in Sec. II. Due to the space constraint, the detailedformulation is provided in [16].

A. Parameters and Variables

1) System parametersTable I shows the list of system parameters given as input.

2) Decision (zero-one) variables• αi,j : 1 if τi is allocated to Cj .• βj,k : 1 if Cj uses Bk.• σj,s : 1 if Cj uses slot s of an HRR table.• λj,p : 1 if Tj has the value of p+ Tmin − 1.• λ′j,q : 1 if Tj+1/Tj has the value of q.

3) Range variables• [bjs, b

je] : the range of banks mapped to Cj .

• [xis, xie] : the range of columns mapped to τi.

Table I: List of system parameters.

Parameter Description

NΓ number of tasksNC number of coresNB number of banks of the cacheNW number of columns in a bankNX number of columns of the cacheNX

i number of columns required for task iNM

i number of cache accesses of task iTmin, Tmax lower- and upper-limit of HRR periodsLB , LM bus and bank access latency

B. Objective Function

The optimization objective we consider in this paper is tominimize the overall system utilization, that is,

MinimizeNΓ∑i=1

wcetipi

.

C. Constraints

1) Harmonic round-robin: Firstly, Tj has to be an integerin the range of [Tmin, Tmax], thus it is expressed as follows:

C1. ∀ core j, Tj =

Tmax∑p=Tmin

p · λj,(p−Tmin+1),

where λj,p is an indicator variable that is 1 if Tj has the valueof p+ Tmin− 1 and 0 otherwise. Meanwhile, the sum of allλj,p should be equal to 1, that is,

C2. ∀ core j,Tmax−Tmin+1∑

p=1

λj,p = 1.

Secondly, each Tj+1 is a positive integer multiple of Tj ,i.e., mj =

Tj+1

Tj∈ N, where 1 ≤ mj ≤ Tmax

Tmin . Accordingly,

C3–C4. ∀ core j, mj =

mmax∑q=1

q · λ′j,q ,mmax∑q=1

λ′j,q = 1,

where mmax is Tmax

Tmin , and λ′j,q is an indicator variable thatis 1 if mj has the value of q and 0 otherwise. Now if bothλj,p and λ′j,q are 1, the value of Tj+1 should be Tj ·mj =(p+Tmin− 1) · q, but not exceed Tmax. Thus, the followingconditional constraints are needed:C5–C6. ∀ core j, ∀1≤p≤Tmax−Tmin+1, ∀1≤q≤mmax ,

λj,p = λ′j,q = 1 ⇒ λj+1,(p+Tmin−1)·q−Tmin+1 = 1,

(p+ Tmin − 1) · q ≥ Tmax + 1 ⇒ λ′j,q = 0.

Lastly, the sum of the reciprocals of HRR periods should be1. Now let Oj be the reciprocal of Tj . Then, by substitutingOj for Tj in C1, Oj can be expressed as follows:

C7. ∀ core j, Oj =

Tmax∑p=Tmin

1

p· λj,(p−Tmin+1).

We can therefore simply substitute Oj for 1Tj

in the originalcondition, which results in

0

1 2 1 3 1 2 1 4HRR

Schedule

1 2 3 4 5 6 7 8s

1

1

1 1 1

1

1

1

0 0

0 0 0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

s,1σs,2σ

s,3σs,4σ

2,1 2,2 2,3 2,4

2,2 2,3 2,4 2,5

2,3 2,4 2,5 2,6

2,4 2,5 2,6 2,7

2,5 2,6 2,7 2,8

1

1

1

1

1

σ σ σ σσ σ σ σσ σ σ σσ σ σ σσ σ σ σ

+ + + =+ + + =+ + + =+ + + =+ + + =

Figure 11: The assignments of σj,s for HRR of (2, 4, 8, 8).

C8.NC∑j=1

Oj = 1.

2) Bus scheduling table: Building an HRR table is equiv-alent to assigning a set of σj,s to each Cj . An example ofσj,s assignment is shown in Fig. 11.

First of all, a slot can be assigned to only one core. Thus,

C9. ∀ slot s,NC∑j=1

σj,s = 1.

The first slot of Cj can appear only after at least one slotis assigned to each core C1, . . . , Cj−1. Thus,

C10. ∀ core j ∀ slot s, s ≤ j − 1 ⇒ σj,s = 0.

Lastly, the periodicity of the slots of Cj can be ensuredby checking the sum of every p(= Tj) consecutive σj,s ofCj . For example, in Fig. 11, the sum of four consecutive σ2,sshould be 1 as T2 is four. Accordingly,

C11. ∀ core j, ∀Tmin≤p≤Tmax , and ∀1≤s≤Tmax−p+1,

Tj = p⇒s+p−1∑t=s

σj,t = 1.

3) Task to core mapping: Every task should be allocatedto one of NC cores, thus,

C12. ∀ task i,NC∑j=1

αi,j = 1.

4) Core to bank mapping: The minimum number of cachebanks required by Cj is nBj =

∑NΓ

i=1(αi,j · (NXi /N

W )). IfnBj = n+ 1

NW , n+1 banks are sufficient. However, if nBj ≥n+ 2

NW , n+ 2 banks can be required by Cj :

C13. ∀ core j, nBj ≤ (bje − bjs + 1) ≤ nBj + 2− 2

NW.

If nBj is 0, however, no banks should be allocated to Cj :

C14. ∀ core j, nBj = 0 ⇒ bje = bjs = NB + 1.

If Cj uses Bk, βj,k should be set to 1:

C15. ∀ core j ∀ bank k, bjs ≤ k ≤ bje ⇒ βj,k = 1.

Any two cores can share at most one bank, which is eitherthe first or the last bank of each core (see Fig. 5): C16. Foreach core 1 ≤ j ≤ NC−1, and j + 1 ≤ j′ ≤ NC,

bj′e ≤ bjs or bje ≤ bj

′s .

81 2 3 4 1 2 5 6 1 2

1st round

7 1 2 3

slot # 121 2 3 4 5 6 7 8 9 10 11 13 14 15

...

...

w1,14,k = -LB

w2,14,k = 0

w6,14,k = 2·LB

w7,14,k = LB

w8,14,k = 0

w12,14,k = 0w13,14,k = -LB

Figure 12: An illustration of ws′,s,k for s = 14, LM = 2 ·LB.

Finally, the unbounded bank conflict delay problem definedby Condition (2) in Sec. II-C should be prevented:

C17. ∀ bank k,LMLB

·NC∑j=1

βj,k · Oj ≤ 1.

Here, the product of βj,k and Oj can be linearized by addingthe following four constraints and then by replacing βj,k ·Ojin C17 with a new variable, fj,k :

C17∗. ∀ core j ∀ bank k,fj,k ≤ UOj · βj,k, fj,k ≤ Oj ,

fj,k ≥ Oj − UOj · (1− βj,k), fj,k ≥ 0,

where UOj is the upper-bound of Oj , which is 1Tmin

. If βj,k is1, fj,k has to have the value of Oj to satisfy all the constraintsin C17∗. For the detail, please refer to [17].

5) Task to column mapping: τi requires NXi columns,

C18. ∀ task i, xie − xis + 1 = NX

i .

No column can be shared between any two tasks:

C19. For each task 1 ≤ i ≤ NΓ−1, and i+ 1 ≤ i′ ≤ NΓ,

xi′e ≤ xis − 1 or xie + 1 ≤ xi

′s .

τi on Cj can occupy only a contiguous subset of thecolumns belonging to Cj ’s banks, i.e., [bjs, b

je]. Thus,

C20. ∀ task i ∀ core j,

αi,j = 1 ⇒ (bjs − 1) ·NW + 1 ≤ xis and xie ≤ bje ·NW .

6) WCET calculation: The worst-case execution time ofτi, i.e., Eq. (3) in Sec. II-D, is formulated as follows:

C21. ∀ task i,

wceti = ei +NMi · {(2 · LB + LM ) + (dBi + dMi )}.

7) Bus Access Delay dBi : Eq. (5) in Sec. III-A can beformulated as follows:

C22. ∀ task i, dBi = LB ·( NC∑j=1

αi,j · Tj).

Note that C22 can be similarly linearized as in C17∗, but nowUTj is Tmax.

8) Bank Conflict Delay dMi : Let ws′,s,k be the residualworkload generated in [s′, s−1] that could delay a bank accessfrom slot s to Bk, which can be represented as follows:

C23. ∀ bank k, ∀TRR+1≤s≤2·TRR , ∀1≤s′≤s−1,

ws′,s,k = LM ·( s−1∑t=s′

NC∑j=1

σj,t · βj,k)− LB(s− s′).

Note that σj,t · βj,k is 1 if and only if Cj at slot t usesBk. The product of two decision variables, σj,t · βj,k, can belinearized by adding the following three constraints and thenby replacing σj,t · βj,k with a new binary variable, gj,k,t:

C23∗. ∀ core j, ∀ bank k, ∀1≤t≤2·TRR−1,

gj,k,t ≤ σj,t, gj,k,t ≤ βj,k, gj,k,t ≥ σj,t + βj,k − 1.

gj,k,t is 1 only when both σj,t and βj,k have the value of1. The rationale behind this constraint, C23, is based on thatevery slot accessing Bk generates LM workload and LB isconsumed by each slot afterward. The example in Fig. 12shows possible busy periods that could delay the bank accessfrom C2 at slot 14. Among others, w6,14,k is the uniqueand exact residual workload that maximally delays the slot,since it counts from a no-backlogged slot and also there is nodiscontinuity in its busy period. Also note that we do not needto compute the slot delays in the first round, as explained inSec. III-B1.

Now let us,k be the worst-case slot delay that slot s couldexperience when accessing Bk, which is simply the maximumof ws′,s,k for all 1 ≤ s′ ≤ s− 1. Thus,C24. ∀ bank k, ∀TRR+1≤s≤2·TRR , ∀1≤s′≤s−1,

us,k ≥ ws′,s,k.

us,k should be lower-bounded by 0 since the core at slot smay not use Bk, or not share it with others.

Now let us define by zj,k the maximum of slot delays ofCj using Bk, i.e., Dj

k. The following constraint is equivalentto Eq. (7) in Sec. III-B1:C25. ∀ core j ∀ bank k, ∀TRR+1≤s≤2·TRR ,

zj,k ≥ σj,s · us,k.Note that C25 can be similarly linearized as in C17∗, but nowUus,k

is (LM − LB) · (s− 1).As explained in Sec. III-B, in order to find dMi , we need

to compute Djs and Dj

e first. The following constraints canbe used to find Dj

s and Dje from zj,k:

C26–27. ∀ core j ∀ bank k,

bjs = k ⇒ Djs = zj,k , bje = k ⇒ Dj

e = zj,k.

τi on Cj could experience bank conflict delays if it usesCj’s shared banks. The following constraints can finally findthe worst-case bank conflict delay that τi on Cj can suffer:

C28–29. ∀ task i ∀ core j ∀ bank k,αi,j = 1 and xis ≤ bjs ·NW ⇒ dMi ≥ Dj

s,

αi,j = 1 and (bje − 1) ·NW + 1 ≤ xie ⇒ dMi ≥ Dje.

Note here that if [xis, xie] stretch across [bjs, b

je], d

Mi results in

max(Djs, D

je) by these constraints.

9) Task and core utilization: To bound the WCET of τi,wceti should be restricted within its period, i.e., pi:

C30. ∀ task i, wceti ≤ pi.

Table II: Experimental parameters.

Parameter ValueNC {4, 6, 8} coresNB ×NW {4× 16, 8× 32} columnsLB , LM 2.5 ns, 5.0 nsNΓ {20, 30, 40} tasksei uniform from [10, 250] mspi uniform from [500, 10000] msNX

i uniform from [1, 5] columnsNM

i uniform from [105, 106 · {1, 3, 5, 7, 10}] times

Likewise, we need to limit each core utilization to 1, or toa specific bound, e.g., Liu and Layland’s bound [18].

C31. ∀ core j,NΓ∑i=1

wcetipi

· αi,j ≤ 1.

Similar to C17∗, the above constraint also can be linearizedwith Uwceti = pi (by C30).

V. EVALUATION

In this section, we evaluate the proposed tunable WCEToptimization problem formulated in Sec. IV in terms of theminimum achievable system utilization by using IBM ILOGCPLEX 12.1 [19]. The detailed results can be found in [16].

A. Evaluation Method

Table II summarizes the experimental parameters used forthe experiments. With these parameters, we compare thefollowing three methods:

PureRR BFD Proposed

Task allocation flexible pre-allocated flexibleBus schedule PRR HRR HRR

Cache partition two-level and flexible

• PureRR : The bus is scheduled by a pure round-robin;every core has the same slot period, i.e., NC, as in [14].• BFD : Each task is pre-assigned to a core by Best-Fit De-

creasing heuristic. For this, tasks are first sorted in decreasingorder by estimated task utilization, which is defined as

wceti/pi = [ei +NMi · {2 · LB + LM + (NC · LB)}]/pi.

Each task is then allocated in the sorted order to the corewhich after accommodating the task will have the least re-maining utilization. Note that wceti is computed by assumingτi does not experience any bank conflict delay and the bus isscheduled by a pure round-robin. However, the bus schedulemay change to a harmonic one during optimization.• Proposed : This is the proposed method in this paper.

Note that in all the methods, the shared cache is not pre-partitioned since the unbounded bank conflict delay problemmay arise with a random or fixed pre-partitioning.

1) Evaluation metric: We compare the above methods interms of minimum achievable system utilization, UPureRR,UBFD, and UProposed. Note that different task sets mayhave different baseline system utilizations. Thus, for faircomparisons, we normalize UBFD and UProposed to UPureRRfor each task set and then take the average of 20 random sets.Each error bar indicates the standard deviation of the normals.

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

1.05

4 cores20 tasks

6 cores30 tasks

8 cores40 tasks

Avg

. of m

inim

um s

yste

m u

tiliz

atio

n no

rmal

ized

to U

Pur

eRR

PureRR BFD Proposed

Figure 13: Minimum system utilization with different core counts.

B. Evaluation Result

In this section, we present the evaluation results for theabove methods obtained with different 1) core counts, 2)cache access intensities, and 3) cache configurations.

1) Impact of core count: Fig. 13 compares the minimumsystem utilization as increasing the number of cores. In thisexperiment, the cache is configured as 8 banks × 16 columns,and NM

i of tasks are randomly chosen within [105, 107].Also, in order to maintain average load for the cores, weproportionally increase NΓ as the core count increases.

As the result shows, our proposed method can achievelower system utilization than PureRR by 10%–20% in aver-age. We can see the improvement of proposed method com-pared to PureRR increases with core count. This is becausethe number of bank-sharing among cores increases with thecore and task counts due to the fixed capacity of the sharedcache. Another important factor is that with more cores anHRR schedule can be more flexible in prioritizing the cores.Thus, high-utilization tasks can benefit from being allocated tosuch cores whose HRR periods are short. Meanwhile, the im-provement gap between Proposed and BFD can be explainedin a similar manner, that is, pre-task assignments preventfurther optimization by tightening the constraints on the busschedule and bank-sharing. Another interesting observation inthe result is that BFD outperforms PureRR. This implies thateven if there is no flexibility in assigning tasks to cores, it islikely that the system utilization can significantly be loweredby employing our HRR bus scheduling, which in turn helpsreduce bank conflict delays as described in Sec. II-C.

2) Impact of cache access intensity: In order to see the im-pact of cache access intensity on the utilization improvement,we perform another experiment by increasing the upper-limitnumber of cache accesses, NM

i . In this experiment, we fixNC = 8, NΓ = 40, and NB ×NW = 8× 16, and vary NM

i

from 1 million to 10 million. With these parameters, NMi for

each task is chosen randomly between 0.1 million and NMi .

Fig. 14 shows that the utilization improvement of Proposedover PureRR increases with the cache access intensity. Thiscan be explained by the underlying rationale behind ourtunable WCET model. That is, the possibility of further

1 million 3 million 5 million 7 million 10 million0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

1.05

The upper−limit number of cache accesses

Avg

. of m

inim

um s

yste

m u

tiliz

atio

n no

rmal

ized

to U

Pur

eRR

PureRR BFD Proposed

Figure 14: Minimum system utilization with different cacheaccess intensities.

optimization grows with the ratio of tunable delay to thefixed execution time. Thus if tasks access the bus and cachemore intensively, it is more likely that the system utilizationcan be further reduced by a similar argument explained inthe previous discussion. However, this does not necessarilyimply that higher cache access intensity would always leadto lower system utilization; UProposed

UPureRRand UBFD

UPureRRconverge

to certain levels (around 0.8 and 0.85, respectively) as thetasks become more memory-intensive. This is due to the factthat as the proportion of tunable delay grows, the sensitivityof WCET variation to changes in bus schedule and cachepartition also increases. Recall that, by our tunable WCETmodel, a decrease in one’s delay naturally leads to increasesin the delays of the rest of the tasks. Therefore, if most tasksare sensitive to WCET variation, the overall improvement canbe limited because it is more likely for some tasks or coresto exceed the utilization bound, i.e., C30–C31 in Sec. IV.

3) Impact of cache configuration: Fig. 15 shows the impactof different cache configurations on the utilization improve-ment. For this, we fix NC = 8, NΓ = 40, and NM

i = 10million, and vary the shared cache configurations: 4 banks ×32 columns and 8 banks × 16 columns.

The result shows that the utilization improvements ofProposed and BFD over PureRR with 8×16 cache are slightlyhigher than those with 4 × 32 one, even though the totalnumber of columns required by all tasks is similar betweentwo cases. This difference arises mainly due to the differentgranularity of core-to-bank mappings. To put it clearly, let usconsider a core that requires 35 columns. With the 8 × 16cache, the core can fit into 3 out of 8 banks. With the4 × 32 cache, on the other hand, the core needs at least 2but possibly 3 out of 4 banks, which is equivalent to 4 or 6banks of the 8 × 16 cache. Accordingly, each core is likelyto take more banks than it actually needs with a fewer-bankscache, which results in an increased number of bank-sharing.Another factor that influences the difference is that with morecores bank conflict delays can further be reduced by the helpof a harmonic round-robin schedule, as previously discussed.

0.5

0.55

0.6

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

1.05

4 banks32 columns/bank

8 banks16 columns/bank

Avg

. of m

inim

um s

yste

m u

tiliz

atio

n no

rmal

ized

to U

Pur

eRR

PureRR BFD Proposed

Figure 15: Minimum system utilization with different cacheconfigurations.

VI. RELATED WORK

A great deal of research effort has been devoted to ad-dress the optimization of shared resource allocation andarbitration in multicore architectures. For on-chip memorypartitioning, Suhendra et al. [20] proposed an ILP formula-tion that finds the optimal scratchpad memory partition andtask allocation/scheduling which minimize tasks’ executiontimes. In [21], the authors examined the impacts of differentcombinations of cache locking and partitioning schemes onthe system utilization. In [22], Bui et al. proposed a geneticalgorithm that can find near optimal cache partition and task-to-partition assignments that minimize the system utilization.

Another line of research has focused on shared bus arbitra-tion methods. Rosen et al. [6] and Andrei et al. [23] addressedTDMA-based bus access policies that is tightly coupled withthe worst-case execution paths of tasks. They proposed anoptimization problem that finds the optimal TDMA schedulewhich minimizes the global delay of tasks, and extended it todeal with average-case delays [8]. Additionally, Schranzhoferet al. [11] analyzed the worst-case response time of real-timetasks under different cache access models for TDMA-basedbus arbitration policies.

Although it is not addressed in this paper, the issue ofshared memory contention is also receiving increasing atten-tion [24], [25].

VII. CONCLUSION AND FUTURE WORK

In this paper, we have proposed a novel perspective ofWCET model called tunable WCET, which enables system-level optimization for hard real-time multicore system. In thismodel, the WCETs of tasks are no longer dependent upona system configuration, but rather decide how to configurethe shared bus and cache of the system. As the WCET-aware shared resource arbitration and allocation methods, wehave introduced harmonic round-robin bus scheduling andtwo-level cache partitioning method. We have formulated anMILP-based optimization problem, and the experimental re-sults have shown that our proposed methods can significantlylower overall system utilizations.

In the future, we will investigate how to extend our resourceallocation methods to support soft real-time tasks as well.One possible direction is to allow soft real-time tasks toshare a few banks of the shared cache, and then take thestorage interference due to column-sharing [26] into accountin the tunable WCET model. Additionally, we plan to developa heuristic algorithm that can efficiently solve our tunableWCET optimization problem. Also, we have assumed in thispaper that the critical path does not change with the changein tunable delays. This is a clear limitation, thus, as in [6],we will investigate the possibility of combining control flowanalysis with our WCET analysis, in order to evaluate thepractical applicability of the proposed approach.

ACKNOWLEDGMENT

The authors would like to thank the reviewers for theirvaluable comments and suggestions and Sungjin Im for hishelp in the MILP formulation. This work is supported in partby a grant from Rockwell Collins, by a grant from LockheedMartin, and by ONR N00014-08-1-0896. Any opinions, find-ings, and conclusions or recommendations expressed in thispublication are those of the authors and do not necessarilyreflect the views of sponsors.

REFERENCES

[1] “Freescale QorIQ P4080 Processor,” http://www.freescale.com/webapp/sps/site/prod summary.jsp?code=P4080.

[2] “ARM11MPCore Processor,” http://www.arm.com/products/processors/classic/arm11/arm11-mpcore.php.

[3] A. Fedorova, S. Blagodurov, and S. Zhuravlev, “Managingcontention for shared resources on multicore processors,” Com-munications of the ACM, vol. 53, no. 2, pp. 49–57, 2010.

[4] S. Zhuravlev, S. Blagodurov, and A. Fedorova, “Addressingshared resource contention in multicore processors via schedul-ing,” in Proc. of Int’l Conference on Architectural Support forProgramming Languages and Operating Systems, 2010.

[5] M. Kandemir, S. P. Muralidhara, S. H. K. Narayanan, Y. Zhang,and O. Ozturk, “Optimizing shared cache behavior of chipmultiprocessors,” in Proc. of IEEE/ACM Int’l Symposium onMicroarchitecture, 2009, pp. 505–516.

[6] J. Rosen, A. Andrei, P. Eles, and Z. Peng, “Bus accessoptimization for predictable implementation of real-time appli-cations on multiprocessor systems-on-chip,” in Proc. of IEEEReal-Time Systems Symposium, 2007, pp. 49–60.

[7] S. Chattopadhyay, A. Roychoudhury, and T. Mitra, “Modelingshared cache and bus in multi-cores for timing analysis,”in Proc. of Int’l Workshop on Software and Compilers forEmbedded Systems, 2010, pp. 1–10.

[8] J. Rosen, C.-F. Neikter, P. Eles, Z. Peng, P. Burgio, andL. Benini, “Bus access design for combined worst and aver-age case execution time optimization of predictable real-timeapplications on multiprocessor systems-on-chip,” in Proc. ofIEEE Real-Time and Embedded Technology and ApplicationsSymposium, 2011.

[9] J. Yan and W. Zhang, “WCET analysis for multi-core proces-sors with shared L2 instruction caches,” in Proc. of IEEE Real-Time and Embedded Technology and Applications Symposium,2008, pp. 80–89.

[10] Y. Li, V. Suhendra, Y. Liang, T. Mitra, and A. Roychoudhury,“Timing analysis of concurrent programs running on sharedcache multi-cores,” in Proc. of IEEE Real-Time Systems Sym-posium, 2009, pp. 57–67.

[11] A. Schranzhofer, J.-J. Chen, and L. Thiele, “Timing analysisfor TDMA arbitration in resource sharing systems,” in Proc. ofIEEE Real-Time and Embedded Technology and ApplicationsSymposium, 2010, pp. 215–224.

[12] A. El-Haj-Mahmoud, A. S. AL-Zawawi, A. Anantaraman, andE. Rotenberg, “Virtual Multiprocessor: an analyzable, high-performance architecture for real-time computing,” in Proc.of Int’l Conference on Compilers, Architecture, and Synthesisfrom Embedded Systems, 2005.

[13] B. Lickly, I. Liu, S. Kim, H. D. Patel, S. A. Edwards, and E. A.Lee, “Predictable programming on a precision timed architec-ture,” in Proc. of Int’l Conference on Compilers, Architecture,and Synthesis from Embedded Systems, 2008.

[14] M. Paolieri, E. Quinones, F. J. Cazorla, G. Bernat, andM. Valero, “Hardware support for WCET analysis of hardreal-time multicore systems,” in Proc. of IEEE/ACM Int’lSymposium on Computer Architecture, 2009, pp. 57–68.

[15] D. Chiou, P. Jain, S. Devadas, and L. Rudolph, “Dynamic cachepartitioning via columnization,” in Proc. of Design AutomationConference, 2000.

[16] M.-K. Yoon, J.-E. Kim, and L. Sha, “WCET-Aware opti-mization of shared cache partition and bus arbitration forhard real-time multicore systems,” Dept. of Computer Science,University of Illinois at Urbana-Champaign, Technical report,May 2011, http://www.ideals.illinois.edu/handle/2142/25909.

[17] J. Bisschop, AIMMS optimization modeling, Paragon DecisionTechnology, 2011.

[18] C. L. Liu and J. W. Layland, “Scheduling algorithms formultiprogramming in a hard real-time environment,” Journalof the ACM, vol. 20, no. 1, pp. 46–61, 1973.

[19] “IBM ILOG CPLEX Optimizer,” http://www-01.ibm.com/software/integration/optimization/cplex-optimizer.

[20] V. Suhendra, C. Raghavan, and T. Mitra, “Integrated scratchpadmemory optimization and task scheduling for MPSoC architec-tures,” in Proc. of Int’l Conference on Compilers, Architectureand Synthesis for Embedded Systems, 2006, pp. 401–410.

[21] V. Suhendra and T. Mitra, “Exploring locking & partitioning forpredictable shared caches on multi-cores,” in Proc. of DesignAutomation Conference, 2008.

[22] B. D. Bui, M. Caccamo, L. Sha, and J. Martinez, “Impactof cache partitioning on multi-tasking real time embeddedsystems,” in Proc. of IEEE Conference on Embedded and Real-Time Computing Systems and Applications, 2008.

[23] A. Andrei, P. Eles, Z. Peng, and J. Rosen, “Predictable imple-mentation of real-time applications on multiprocessor systems-on-chip,” in Proc. of Int’l Conference on VLSI Design, 2008,pp. 103–110.

[24] O. Mutlu and T. Moscibroda, “Stall-time fair memory accessscheduling for chip multiprocessors,” in Proc. of IEEE/ACMInt’l Symposium on Microarchitecture, 2007.

[25] M. Paolieri, E. Quinones, F. J. Cazorla, and M. Valero, “Ananalyzable memory controller for hard real-time CMPs,” IEEEEmbedded Systems Letters, vol. 1, no. 4, 2009.

[26] H. Ramaprasad and F. Mueller, “Bounding preemption delaywithin data cache reference patterns for real-time tasks,” inProc. of IEEE Real-Time and Embedded Technology and Ap-plications Symposium, 2006.

Optimizing Tunable WCET with Shared Resource Allocation ...Optimizing Tunable WCET with Shared...

Documents

Transcript of Optimizing Tunable WCET with Shared Resource Allocation ...Optimizing Tunable WCET with Shared...