
Optimal Code and Data Layout in Embedded Systems

T.S. Rajesh Kumar† R. Govindarajan‡ C. P. Ravi Kumar†

† Texas Instruments India Ltd., Wind Tunnel Road, Bangalore, 560 017, India. {tsrk,ravikumar}@ti.com

‡ Supercomputer Education & Research Centre, Indian Institute of Science, Bangalore, 560 012, India. [email protected]

Abstract

Efficient layout of code and data sections in the various types and levels of memory in an embedded system is critical not only for achieving real-time performance, but also for reducing cost and power consumption. In this paper we formulate the optimal code and data section layout problem as an integer linear programming (ILP) problem. The proposed formulation can handle: (i) on-chip and off-chip memory, (ii) multiple on-chip memory banks, (iii) single and dual ported on-chip RAMs, (iv) overlay of data sections, and (v) swapping of code and data (from/to external memory). Our experiments demonstrate that, for a moderately complex embedded system, our formulation produces the optimal layout in only a few minutes on a PC, and that this layout matches, in terms of performance and on-chip memory size, a hand-optimized code/data layout which took 1 man-month to develop.

1 Introduction

Many of today's embedded systems consist of one or more embedded microcontrollers, digital signal processors, application-specific circuits and read-only memory, all integrated into a single system-on-chip package. These systems are customized to run specific applications and are not user programmable. Power consumption, real-time performance, code size, and cost are key design considerations for embedded systems. To optimize system cost and power, embedded systems are designed with a custom memory architecture.

To get real-time performance, software developers use many optimizations, such as memory overlay and swapping, to pack the application onto the custom memory architecture. However, the optimization process can be complicated by IP reuse, where the reused software modules are developed by multiple vendors and each IP block is optimized independently.

Thus an application developer spends a significant amount of time analyzing the application to obtain a cost- and performance-optimal mapping of code and data sections to the custom memory architecture. It is important to note that code and data section layout optimization is often tackled manually. It is not uncommon for a significant amount of time (typically 1 or 2 man-months) to be spent on this hand-optimization. The task is further complicated by the fact that each variant of an embedded system may have a different memory size, and possibly even a different memory architecture. Tight time-to-market requirements and the competitive cost-performance benefits offered by other vendors necessitate obtaining an optimal or near-optimal solution to this problem in at least a semi-automated manner, and fairly quickly.

This paper addresses the code and data section layout problem by formulating it as an integer linear programming (ILP) problem. The proposed formulation can handle: (i) on-chip and off-chip memory, (ii) multiple on-chip memory banks, (iii) single and dual access RAMs, (iv) overlay of data sections with non-overlapping lifetimes, and (v) swapping of code and data (from/to off-chip memory). We have developed a framework which automatically generates the ILP formulation for an embedded application. The ILP formulation is solved using the public domain solver lp_solve. When applied to moderately complex applications, the framework was able to obtain the optimal solution within a few minutes on a workstation. The optimal solutions matched exactly the hand-optimized solutions, which took 1 to 2 man-months to obtain.

The following section presents the necessary background. The ILP formulation is presented in Section 3. We report the experimental results in Section 4. Sections 5 and 6 provide a discussion of related work and concluding remarks, respectively.

2 Background and Problem Statement

Embedded systems have an on-chip or scratch-pad memory which has a single-cycle access time. Typically the scratch-pad memory is organized into multiple memory banks to facilitate multiple simultaneous data accesses. Further, each on-chip memory bank can be organized as a single-access RAM (SARAM) or a dual-access RAM (DARAM), providing one or two accesses to the same memory bank in a single cycle. Providing dual or multiple accesses is very important from a performance viewpoint, as concurrent accesses to the same array are common in DSP applications.

Embedded applications, on the other hand, have a built-in hierarchy. An application is composed of several modules, where each module consists of one or more code and data sections. Each section consists of a set of data variables, data arrays, or functions; alternatively, each data array can be named as a separate section. Software developers spend considerable time on a careful layout of code and data sections to get maximum performance from the scratch-pad memory.

Several software optimization techniques have been proposed in the literature [7], including:

• Allocating data arrays that are accessed simultaneously to different on-chip memory banks to achieve single-cycle access.

• Mapping a data array which is required to support multiple simultaneous accesses to a DARAM.

• Overlay of data structures, typically arrays, to share the same on-chip memory space. These arrays, referred to as scratch buffers, have non-overlapping lifetimes.

• Swapping essential code and data sections from off-chip memory to on-chip memory before the execution of the appropriate code segment.

Taking the above optimizations into consideration, a code and data section layout can be defined as a mapping which specifies where (i.e., in which memory type) the various code and data sections reside, the memory bank(s) in which the sections reside, the type of memory access (single or dual) supported by the memory bank, whether or not certain code (or data) sections are overlaid, and whether or not certain code (or data) sections are swapped.

The optimal code and data section layout problem can be stated as:

Given an application A with a set of code sections P and a set of data sections D, and a memory architecture M, derive a data layout L such that the application A, when run on memory architecture M with the data layout L, incurs the least number of memory stalls.

In the above definition, we consider optimality only from the point of view of memory stall cycles. This directly relates to performance in terms of execution time in embedded systems.

3 ILP Formulation

In this section we present our code and data layout formulation in an incremental way. We start with the simplest problem and add the different optimizations one by one.

The ILP formulation for the optimal code and data section layout problem requires a number of application-related parameters, which we describe first. An embedded application is composed of N modules. Let Npj and Ndj represent the number of code and data sections in module j. In our discussion, we assume that each data section refers to a single array. Let the size of code (or data) section s in module j be denoted by Pjs (or Djs, respectively). The number of times a code section s (in module j) is accessed is denoted by APjs. The access count for a data section is denoted by ADjs. The values of some of the above parameters, e.g., the access counts, can be obtained by profiling the application. In our framework the profile data is collected using an Instruction Set Simulator.

For the ILP formulation we also require memory architecture parameters. The sizes of the internal (on-chip) and external (off-chip) memory are denoted by Mi and Me respectively. The number of stall cycles incurred in accessing external memory is denoted by We.
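To make the notation concrete, the sketch below shows one possible way to hold these parameters in code. It is purely illustrative and not part of the paper's framework; all names and values are hypothetical.

```python
# Illustrative containers for the application and memory parameters defined
# above; field names and values are hypothetical, chosen only to mirror the notation.
from dataclasses import dataclass, field

@dataclass
class Section:
    size: int            # Pjs or Djs (words)
    accesses: int        # APjs or ADjs (from profiling)

@dataclass
class Module:
    code: list[Section] = field(default_factory=list)   # Npj code sections
    data: list[Section] = field(default_factory=list)   # Ndj data sections

@dataclass
class MemoryArch:
    internal_size: int   # Mi (words)
    external_size: int   # Me (words)
    ext_stall: int       # We, stall cycles per external access

# A one-module toy application and a toy memory architecture.
app = [Module(code=[Section(1200, 50_000)], data=[Section(2048, 200_000)])]
mem = MemoryArch(internal_size=4096, external_size=65536, ext_stall=10)
```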

3.1 Basic Formulation

A code section s in module j placed in external memory incurs APjs · We stall cycles. To indicate whether a code section is placed in internal or external memory, we use a 0-1 integer variable IPjs; IPjs is 1 if the code section is placed in on-chip memory and 0 otherwise. Thus the number of stall cycles due to code section s is APjs · We · (1 − IPjs). Similarly, for a data section, ADjs · We · (1 − IDjs) is the number of stall cycles. The objective of the formulation is to minimize the total number of memory stalls. That is,

$$\min\left(\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} AP_{js}\cdot W_e\cdot(1-IP_{js}) \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} AD_{js}\cdot W_e\cdot(1-ID_{js})\right) \quad (1)$$

Constraints (2) and (3) enforce that the total size of the code and data sections placed in external and internal memory does not exceed the available external and internal memory, respectively.

$$\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} P_{js}\cdot(1-IP_{js}) \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} D_{js}\cdot(1-ID_{js}) \;\le\; M_e \quad (2)$$

$$\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} P_{js}\cdot IP_{js} \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} D_{js}\cdot ID_{js} \;\le\; M_i \quad (3)$$

Lastly, we add the constraint that IPjs and IDjs are 0-1 integer variables.
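The basic formulation is small enough to state directly in code. The sketch below builds objective (1) and constraints (2)-(3) with the PuLP Python library; the paper's framework instead emits the model for lp_solve, so this is only an illustration, and all sizes and access counts are made up.

```python
# Illustrative sketch of the basic formulation (objective (1), constraints (2)-(3))
# using PuLP; the paper's framework generates an lp_solve model instead.
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

We, Mi, Me = 10, 4096, 65536          # hypothetical stall cost and memory sizes (words)
P  = {(0, 0): 1200, (0, 1): 800}      # code-section sizes  P[j, s]
D  = {(0, 0): 2048, (0, 1): 1024}     # data-section sizes  D[j, s]
AP = {(0, 0): 5e4,  (0, 1): 1e3}      # code access counts  AP[j, s]
AD = {(0, 0): 2e5,  (0, 1): 4e4}      # data access counts  AD[j, s]

prob = LpProblem("basic_layout", LpMinimize)
IP = {k: LpVariable(f"IP_{k[0]}_{k[1]}", cat=LpBinary) for k in P}   # 1 = code section on-chip
ID = {k: LpVariable(f"ID_{k[0]}_{k[1]}", cat=LpBinary) for k in D}   # 1 = data section on-chip

# Objective (1): stall cycles of everything left in external memory.
prob += lpSum(AP[k] * We * (1 - IP[k]) for k in P) + \
        lpSum(AD[k] * We * (1 - ID[k]) for k in D)

# Constraint (2): external (off-chip) memory capacity.
prob += lpSum(P[k] * (1 - IP[k]) for k in P) + lpSum(D[k] * (1 - ID[k]) for k in D) <= Me
# Constraint (3): internal (on-chip) memory capacity.
prob += lpSum(P[k] * IP[k] for k in P) + lpSum(D[k] * ID[k] for k in D) <= Mi

prob.solve()
for var in prob.variables():
    print(var.name, var.value())
```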

3.2 Handling Multiple Memory Banks

In embedded DSP applications, two to three data sections may be accessed simultaneously in a cycle. To handle this, data variables that are accessed simultaneously need to be placed in different memory banks. In this section, we illustrate the handling of two simultaneous accesses in our formulation. The number of simultaneous accesses to two different data sections s and t in a module j is represented by $B^j_{st}$. For example, if two data sections s and t, each of size 100 elements, are accessed simultaneously (as in s[i]+t[i]), then $B^j_{st} = 100$. $B^j_{ss}$ refers to the number of simultaneous accesses to the same data section s; we consider this in the next subsection and, for the time being, assume $B^j_{ss} = 0$. The elements of the $B^j$ matrix are fixed (constants), and can be obtained by profiling the application.

Let the number of internal memory banks be $N_b$, and the size of the kth memory bank be $M_{i_k}$. The total size of the internal memory is $M_i = \sum_{k=1}^{N_b} M_{i_k}$. Further, let $ID_{kjs}$ (or $IP_{kjs}$) represent whether data (code) section s of module j resides in the kth internal bank. Lastly, we use a (derived) 0-1 variable $Z_{kjst}$ to represent whether data sections s and t of module j are both placed in internal bank k. $Z_{kjst}$ is 1 if and only if $ID_{kjs} = 1$ and $ID_{kjt} = 1$. This can be expressed by the linear inequality

$$Z_{kjst} \;\ge\; ID_{kjs} + ID_{kjt} - 1, \qquad \forall\, k, j, s, t : s \neq t \quad (4)$$

Since $Z_{kjst}$ appears in the objective with a non-negative coefficient, minimization drives it to 0 whenever this inequality does not force it to 1.

We replace Constraints (2) and (3) in the basic formulation by:

$$\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} P_{js}\cdot EP_{js} \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} D_{js}\cdot ED_{js} \;\le\; M_e \quad (5)$$

$$\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} P_{js}\cdot IP_{kjs} \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} D_{js}\cdot ID_{kjs} \;\le\; M_{i_k}, \qquad \forall\, k \quad (6)$$

where $EP_{js}$ (or $ED_{js}$) is 1 if the $s$th code (data) section of module $j$ resides in off-chip memory. These variables can be expressed in terms of $IP_{kjs}$ and $ID_{kjs}$ as:

$$EP_{js} = 1 - \sum_{k=1}^{N_b} IP_{kjs} \qquad \text{and} \qquad ED_{js} = 1 - \sum_{k=1}^{N_b} ID_{kjs} \quad (7)$$

To enforce that each data (or code) section resides in at most one internal memory bank, we add:

$$\sum_{k=1}^{N_b} IP_{kjs} \le 1 \qquad \text{and} \qquad \sum_{k=1}^{N_b} ID_{kjs} \le 1, \qquad \forall\, j, s \quad (8)$$

Lastly, the objective function in this formulation is:

$$\min\left(\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} AP_{js}\cdot W_e\cdot EP_{js} \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} AD_{js}\cdot W_e\cdot ED_{js} \;+\; \sum_{j=1}^{N}\sum_{k=1}^{N_b}\sum_{s=1}^{N_{d_j}}\sum_{t=s+1}^{N_{d_j}} B^j_{st}\cdot Z_{kjst}\right) \quad (9)$$

subject to constraints (4) to (8). While the first two terms account for the stall cycles incurred by code and data sections placed in external memory, the third term in the objective function accounts for the stalls incurred when two simultaneously accessed data sections s and t are placed in the same memory bank.
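Extending the earlier sketch to the multi-bank formulation mainly adds per-bank placement variables, the linearization (4), and the conflict term of objective (9). The PuLP sketch below is again illustrative only: the sizes, access counts and the B matrix are invented, and code sections (and their IPkjs variables) are omitted for brevity, so only the data-side terms appear.

```python
# Illustrative sketch of the multi-bank formulation of Section 3.2
# (constraints (4)-(8), objective (9)); data terms only, made-up numbers.
from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

We  = 10                                            # external-access stall cycles
Mik = [4096, 4096, 4096]                            # per-bank on-chip sizes (hypothetical)
Nb  = len(Mik)
Me  = 65536                                         # off-chip size
D   = {(0, 0): 2048, (0, 1): 1024, (0, 2): 512}     # data-section sizes D[j, s]
AD  = {(0, 0): 2e5,  (0, 1): 4e4,  (0, 2): 1e4}     # data access counts AD[j, s]
B   = {(0, 0, 1): 100}                              # simultaneous accesses B[j, s, t], s < t

prob = LpProblem("multibank_layout", LpMinimize)
ID = {(k,) + js: LpVariable(f"ID_{k}_{js[0]}_{js[1]}", cat=LpBinary)
      for k in range(Nb) for js in D}               # section (j, s) placed in bank k
Z  = {(k,) + jst: LpVariable(f"Z_{k}_{jst[0]}_{jst[1]}_{jst[2]}", cat=LpBinary)
      for k in range(Nb) for jst in B}              # (j, s) and (j, t) both in bank k
ED = {js: 1 - lpSum(ID[(k,) + js] for k in range(Nb)) for js in D}   # eq. (7), data part

# Objective (9), data terms only: external-memory stalls plus same-bank conflicts.
prob += lpSum(AD[js] * We * ED[js] for js in D) + \
        lpSum(B[j, s, t] * Z[k, j, s, t] for (k, j, s, t) in Z)

for (k, j, s, t) in Z:                              # linearization (4)
    prob += Z[k, j, s, t] >= ID[k, j, s] + ID[k, j, t] - 1
for js in D:                                        # each section in at most one bank (8)
    prob += lpSum(ID[(k,) + js] for k in range(Nb)) <= 1
for k in range(Nb):                                 # per-bank on-chip capacity (6)
    prob += lpSum(D[js] * ID[(k,) + js] for js in D) <= Mik[k]
prob += lpSum(D[js] * ED[js] for js in D) <= Me     # off-chip capacity (5)

prob.solve()
```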

3.3 Handling SARAM and DARAM

In this formulation we account for the cost of simultaneous accesses to the same data section. Let $B^j_{ss}$ denote the number of such accesses. These accesses will incur an additional stall cycle if the data section s does not reside in a dual-access memory bank. Likewise, a simultaneous access to data sections s and t will incur an additional stall cycle when they are both in a memory bank k which is single ported. Let $DP_k = 1$ denote that memory bank k is dual ported and $DP_k = 0$ otherwise. Note that for a given memory architecture $DP_k$ is a constant (0 or 1) and known a priori.

$$\min\left(\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} AP_{js}\cdot W_e\cdot EP_{js} \;+\; \sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} AD_{js}\cdot W_e\cdot ED_{js} \;+\; \sum_{j=1}^{N}\sum_{k=1}^{N_b}\sum_{s=1}^{N_{d_j}}\sum_{t=s}^{N_{d_j}} B^j_{st}\cdot(1-DP_k)\cdot Z_{kjst}\right) \quad (10)$$

subject to constraints (4) to (8).
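Continuing the multi-bank sketch from Section 3.2 (which defines B, Nb, ID and Z), only the conflict term of the objective changes. The fragment below is one possible adaptation; treating self-conflicts (t = s) by identifying $Z_{kjss}$ with $ID_{kjs}$ is our reading, since the paper leaves that case implicit, and the DP values are hypothetical.

```python
# Adaptation of the conflict term for SARAM/DARAM (objective (10)), continuing
# the multi-bank sketch above. DP[k] = 1 marks a dual-ported bank; such banks
# incur no same-bank conflict cost. conflict_stalls would replace the last
# term of the objective built earlier.
from pulp import lpSum

DP = [1, 0, 0]                      # hypothetical: bank 0 is DARAM, banks 1-2 are SARAM
conflict_stalls = lpSum(B[j, s, t] * (1 - DP[k]) *
                        (ID[k, j, s] if s == t else Z[k, j, s, t])
                        for (j, s, t) in B for k in range(Nb))
```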


3.4 Overlay of Data Sections

Scratch buffers are data sections that have non-overlapping lifetimes and can share the same on-chip memory space. In our discussion, we assume that scratch buffers are identified by the application developer. Let Sjs = 1 denote that data section s is a scratch buffer; Sjs = 0 otherwise. The memory used by a scratch buffer can be reused across different modules, but not within the same module.

For each module j, we compute SBkj, the sum of the sizes of the scratch buffers in module j that are stored in the kth internal memory bank. The memory required for scratch buffers in the kth internal bank corresponds to the maximum of SBkj over all modules. That is:

$$SB_k \;\ge\; \sum_{s=1}^{N_{d_j}} D_{js}\cdot ID_{kjs}\cdot S_{js}, \qquad \forall\, j, k \quad (11)$$

Further, the individual memory requirement of each scratch buffer stored in the kth internal memory bank can be excluded from the internal-memory constraint (Inequality (6)). Thus Inequality (6) is replaced by

$$\left(\sum_{j=1}^{N}\sum_{s=1}^{N_{p_j}} P_{js}\cdot IP_{kjs}\right) + SB_k + \left(\sum_{j=1}^{N}\sum_{s=1}^{N_{d_j}} D_{js}\cdot(1-S_{js})\cdot ID_{kjs}\right) \;\le\; M_{i_k}, \qquad \forall\, k \quad (12)$$

The constraint for external memory remains the same (Inequality (5)). Thus the ILP formulation in this case has the same objective function (Equation (10)), subject to constraints (4), (5), (7), (8), (11), and (12).
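The overlay constraints translate directly into the running PuLP sketch. The fragment below continues the multi-bank sketch from Section 3.2 (D, ID, Mik, Nb and prob are defined there); the scratch-buffer flags are hypothetical, and in a full model constraint (12) would replace the per-bank constraint (6) added earlier.

```python
# Sketch of the overlay constraints (11)-(12), continuing the multi-bank sketch.
from pulp import LpVariable, lpSum

S  = {(0, 0): 0, (0, 1): 1, (0, 2): 1}   # S[j, s] = 1 if data section (j, s) is a scratch buffer
SB = {k: LpVariable(f"SB_{k}", lowBound=0) for k in range(Nb)}
modules = {j for (j, s) in D}

for k in range(Nb):
    for j in modules:
        # (11): SB_k is bounded below by the scratch-buffer space module j uses
        # in bank k, hence by the maximum of that quantity over all modules.
        prob += SB[k] >= lpSum(D[jj, s] * S[jj, s] * ID[k, jj, s]
                               for (jj, s) in D if jj == j)
    # (12): bank capacity = non-scratch sections counted individually + shared SB_k.
    prob += lpSum(D[js] * (1 - S[js]) * ID[(k,) + js] for js in D) + SB[k] <= Mik[k]
```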

3.5 Swapping of Code and Data

Swapping of a code or data section is commonly applied in embedded DSP systems. The code/data identified for swapping resides in external memory and is copied into the internal memory (on-chip RAM) only for the duration of execution/access of the section. To model swapping, we assume that one common swap memory space SWk is allocated in the kth internal memory bank. The size of SWk is the maximum of the total size of all swapped sections in a module, where the maximum is taken across all modules. The formulation for swapping proceeds in a manner similar to that for scratch buffers, where swapped sections share the same memory area in the on-chip memory bank. Additionally, we have to account for the off-chip memory requirement of all swapped sections ($\sum_{k=1}^{N_b} SW_k$). Lastly, the objective function should account for the cost of swapping. Due to space limitations, we skip the details here.
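By analogy with the scratch-buffer area, the swap space SWk can be sized with one constraint per module and bank. The fragment below, continuing the sketches above, covers only that sizing rule; the off-chip bookkeeping and the swap-cost term in the objective are omitted here, as they are in the paper, and the swap flags are hypothetical.

```python
# Hedged sketch of swap-area sizing (Section 3.5), by analogy with the
# scratch-buffer area above; the remaining swapping constraints are not modelled.
from pulp import LpVariable, lpSum

W_swap = {(0, 0): 1, (0, 1): 0, (0, 2): 0}   # W_swap[j, s] = 1 if section (j, s) is swapped
SW = {k: LpVariable(f"SW_{k}", lowBound=0) for k in range(Nb)}
modules = {j for (j, s) in D}

for k in range(Nb):
    for j in modules:
        # SW_k must cover the total size of module j's swapped sections that are
        # assigned to bank k, i.e. the per-module maximum across all modules.
        prob += SW[k] >= lpSum(D[jj, s] * W_swap[jj, s] * ID[k, jj, s]
                               for (jj, s) in D if jj == j)
```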

4 Experimental Results

4.1 Methodology

We have developed a framework which automatically generates the ILP formulation for an embedded application. The inputs to the ILP formulation generator, specified in XML format, are the application parameters, the memory configuration parameters, and the profile data parameters. To obtain profile information, we developed a profiler and integrated it with the C5510 Instruction Set Simulator (ISS). The target memory architecture used is that of the TMS320C5510 processor with 10K words of external memory. The size and configuration of the internal memory vary across the different experiments and are reported below.

In order to evaluate the proposed approach, we used four complex and widely used DSP and multimedia applications in our experiments. For reasons of intellectual property, we refer to these applications as Appln. 1 to Appln. 4. Of these, Appln. 3 and Appln. 4 are large kernels embedded in wrapper 'C' code, which are not fully hand-optimized. Appln. 2 consists of 3 instances of the same application, while all other applications are run as single instances. The generated ILP formulation is solved using a public domain LP solver, viz., lp_solve.

4.2 Results

To compare the performance of the different layouts, rather than using the number of memory stall cycles, we use a more realistic performance metric, the MIPS-consumed, which is commonly used in embedded systems design. The MIPS-consumed refers to the processing capability required to guarantee real-time performance for a given application. Thus the lower the MIPS-consumed, the higher the performance of the layout, and the more applications, or the more instances of the same application, can be run on the embedded device. The MIPS-consumed for a given data layout is obtained by running the application on the simulator with the given layout and the given memory architecture.

First, we report the MIPS-consumed by the optimal solution obtained for each of the following formulations:

Basic Formulation: The internal memory is considered as a single SARAM bank of size 12K words. We use this as the baseline model. In Table 1, we report the normalized MIPS-consumed, the number of variables and the number of constraints in each ILP formulation, and the time taken on a 900 MHz Pentium III machine to solve the ILP problem.


Handling Multiple Memory Banks: In this formulation, the on-chip memory is split into multiple SARAM banks (three 4K word SARAM banks). The results for this case (refer to Table 1) show 14%, 16%, 3.8%, and 7.1% performance improvement over the baseline case for the four applications.

Handling SARAM and DARAM: For this formulation we assumed that the internal memory consists of 2 banks: one of 8K words supporting single access (SARAM) and another of 4K words organized as DARAM. This optimization gives 16%, 18%, 4.8% and 7.4% improvement over the baseline case for the four applications.

Handling Multiple Banks and DARAM: The memory configuration considered here consists of two 4K word SARAM banks and one 4K word DARAM bank. This optimization exploits both the multiple memory banks and the dual-access capability of the scratch-pad memory, and gives a significant reduction (28%, 30%, 9% and 13.8%) in MIPS consumption over the baseline case.

We remark that the somewhat lower performance improvement in Appln. 3 and Appln. 4 could be due to the fact that multiple simultaneous memory accesses are not fully exploited in these kernel codes.

Next we compare our solution with the hand-optimized data layout. For the first two applications (Appln. 1 and Appln. 2), we obtained the hand-optimized memory placement done by the application developers. The application developers had performed hand-optimization to come up with a number of different code and data layouts offering different levels of performance at varying costs. Among these, they picked the one which maximizes performance without incurring excessive cost. This process took approximately 1 man-month for each of the two applications. To compare against the hand-optimized solution, we used the best-case memory configuration, different for different applications, and obtained the optimal layout using our ILP formulation. When compared to the hand-optimized layout, our optimal solution performs marginally better in terms of MIPS consumption for the same cost. It should be noted that our approach can be automated to solve the optimal code and data layout problem for different memory configurations and pick the most appropriate one.


Application  Parameter                     Basic        Handling        Handling   Handling multiple
                                           Formulation  multiple banks  DARAM      banks and DARAM
Appln. 1     Normalized MIPS-consumed      1.0          0.86            0.84       0.70
             ILP: number of variables      22           60              60         60
             ILP: number of constraints    13           37              37         37
             Time to solve (sec.)          2            10              10         10
Appln. 2     Normalized MIPS-consumed      1.0          0.84            0.82       0.72
             ILP: number of variables      60           152             152        152
             ILP: number of constraints    32           88              88         88
             Time to solve (sec.)          90           480             480        480
Appln. 3     Normalized MIPS-consumed      1.0          0.96            0.95       0.91
             ILP: number of variables      16           38              38         38
             ILP: number of constraints    10           24              24         24
             Time to solve (sec.)          110          260             260        260
Appln. 4     Normalized MIPS-consumed      1.0          0.93            0.93       0.86
             ILP: number of variables      14           29              29         29
             ILP: number of constraints    9            12              12         12
             Time to solve (sec.)          1            1               1          1

Table 1: Experimental Results

Last, we tested whether the time taken to obtain an optimal solution in our approach increases with the complexity of the application. For this purpose, we considered Application 2 (3 instances of a standard DSP application) running simultaneously on the embedded device. We considered the same memory architecture with 80K of off-chip memory, one 4K DARAM bank and two 4K SARAM banks. The complete formulation, which handles multiple memory banks, DARAM, overlay and swapping, is considered in this case. The number of variables and the number of constraints in the formulation increased to 152 and 88 respectively, and the time for obtaining an optimal solution remained within a few minutes (8 minutes on a 900 MHz Pentium III PC and 3 minutes on a 300 MHz UltraSPARC-2). Even after including the one-time simulation and profiling cost, the total time taken by our approach is within 30 minutes.

5 Related Work

In embedded systems, the minimization of data accesses in the higher levels of the memory hierarchy has been an ongoing research area [2]. Scratch-pad memory usage is very critical for overall system performance. In [8], a memory exploration strategy is presented for determining an efficient on-chip memory architecture using an analytical model. In [12], the memory exploration problem is extended to include energy consumption as a cost factor. Kulkarni et al. [5] present formal and heuristic algorithms to organize the data in main memory with the objective of reducing cache conflict misses.

In [9], a data partitioning technique is presented that places data into on-chip SRAM and data cache with the objective of maximizing performance. Based on the lifetimes and access frequencies of array variables, the most conflicting arrays are identified and placed in scratch-pad RAM to reduce the conflict misses in the data cache. Leupers et al. [6] present an interference-graph based approach for allocating variables that are simultaneously accessed to different on-chip memory banks. To avoid the cycle costs from self-conflicts (multiple simultaneous accesses to the same array), [10] suggests partial duplication of data in different memory banks.

In [3], Kandemir et al. present an approach to manage the on-chip scratch-pad memory dynamically. During execution, parts of larger arrays are copied in and out of the scratch pad; this is similar to the swapping technique discussed in Section 2 but performed at a finer level. In [11], Sundaram and Pande present an efficient compile-time data partitioning approach for mapping data arrays to local and remote memory systems using a 0/1 knapsack algorithm. There the data partitioning is performed at a finer granularity and requires the modification of address computation for functional correctness. In contrast, in our work, the data partitioning is performed at the data array level and requires no additional address computation.

6 Conclusion

Code and data layout is an important optimization that directly addresses the efficient utilization of the on-chip scratch-pad RAM. We presented a unified integer linear programming (ILP) formulation for the code and data layout problem that can handle optimizations such as multiple data banks, single and dual access RAMs, overlay of data sections, and swapping of code and data (from/to external memory). Our experiments on four widely used embedded DSP applications show that the optimal solution obtained by our formulation matches the hand-optimized ones in terms of MIPS consumed and on-chip memory usage. However, our solutions were obtained in only a fraction of the time spent on manual optimization.

References

[1] F. Balasa, F. Catthoor, and H. De Man. Background memory area estimation for multidimensional signal processing systems. IEEE Trans. VLSI Systems, 3:157–172, June 1995.

[2] F. Catthoor, N. D. Dutt, and C. E. Kozyrakis. How to solve the current memory access and data transfer bottlenecks: at the processor architecture or at the compiler level? In Design, Automation and Test in Europe Conf. and Exhibition 2000, pages 426–433, 2000.

[3] M. Kandemir, J. Ramanujam, M. J. Irwin, N. Vijaykrishnan, I. Kadayaif, and A. Parikh. Dynamic management of scratch-pad memory space. In Design Automation Conf., pages 690–695, 2001.

[4] C. Kulkarni, F. Catthoor, and H. De Man. Cache transformations for low power caching in embedded multimedia processors. In Proc. Intl. Parallel Processing Symp. (IPPS), pages 292–297, April 1998.

[5] C. Kulkarni, C. Ghez, M. Miranda, F. Catthoor, and H. De Man. Cache conscious data layout organization for embedded multimedia applications. In Proc. Design, Automation and Test in Europe Conf. and Exhibition, pages 686–691, 2001.

[6] R. Leupers and D. Kotte. Variable partitioning for dual memory bank DSPs. In Intl. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Salt Lake City, USA, May 2001.

[7] P. R. Panda, N. D. Dutt, and A. Nicolau. Memory Issues in Embedded Systems-on-Chip: Optimizations and Exploration. Kluwer Academic Publishers, Norwell, Mass., 1998.

[8] P. R. Panda, N. D. Dutt, and A. Nicolau. Local memory exploration and optimization in embedded systems. IEEE Trans. Computer-Aided Design, 18(1):3–13, Jan. 1999.

[9] P. R. Panda, N. D. Dutt, and A. Nicolau. On-chip vs. off-chip memory: The data partitioning problem in embedded processor-based systems. ACM Trans. Design Automation of Electronic Systems, 5(3):682–704, July 2000.

[10] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting dual data-memory banks in digital signal processors. In Proc. 7th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, pages 234–243, October 1996.

[11] A. Sundaram and S. Pande. An efficient data partitioning method for limited memory embedded systems. In ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (in conjunction with PLDI '98), pages 205–218, 1998.

[12] W. T. Shiue and C. Chakrabarti. Memory exploration for low power, embedded systems. In Proc. Design Automation Conf., pages 140–145. ACM Press, New York, 1999.
