[IEEE 2012 IEEE 15th International Conference on Computational Science and Engineering (CSE) -...

Resource-Efficient Designs using an Aspect-Oriented Approach

Jose G. F. Coutinho1, Sujit Bhattacharya2, Wayne Luk1, George A. Constantinides2, Joao M. P. Cardoso3, Tiago Carvalho3, Pedro C. Diniz4, Zlatko Petrov5

1 Dept. of Computing, Imperial College London, UK, Email: {gabriel.figueiredo, w.luk}@imperial.ac.uk

2 Dept. of Electrical and Electronic Eng., Imperial College London, UK, Email: {sujit.bhattacharya, g.constantinides}@imperial.ac.uk

3 Universidade do Porto, Dep. de Engenharia Informatica, Portugal, Email: jmpc@acm.org, [email protected]

4 INESC-ID Lisboa, Portugal, Email: [email protected]

5 Honeywell International, Bulgaria, Email: [email protected]

Abstract—The increasing capability and flexibility of reconfigurable hardware, such as Field-Programmable Gate Arrays (FPGAs), give developers a wide range of architectural choices that can satisfy various non-functional requirements, such as those involving performance, resource and energy efficiency. This paper describes a novel approach, based on an aspect-oriented language called LARA, that enables systematic coding and re-use of optimisation strategies that address such non-functional requirements. Our approach will be presented in three steps. First, this approach is shown to support design space exploration (DSE) which makes use of various compilation and optimisation tools, through the deployment of a master weaver and multiple slave weavers. Second, we present three compilation and synthesis strategies for word-length optimisation based on this approach, which involve three tools: the WLOT word-length optimiser deploying a combination of analytical methods; the AutoESL tool compiling C-based descriptions into hardware; and the ISE tool targeting Xilinx devices. Third, the effectiveness of the approach is evaluated. In addition to promoting design re-use, our approach can be used to automatically produce a range of designs with different trade-offs in resource usage and numerical accuracy according to a given LARA-based strategy. For example, one implementation for a subband filter in an MPEG encoder provides 31% savings in area using non-uniform quantizers when compared to a floating-point description with a similar error specification at the output. Another fixed-point implementation for the gridIterate kernel used by a 3D path planning application consumed 25% fewer resources when the error specification is increased from 1e-6 to 1e-4.

I. INTRODUCTION

The flexibility of reconfigurable fabrics can lead to a large number of architectural decisions and solutions that can satisfy different non-functional requirements, such as performance, resource and energy efficiency. These concerns, as well as others, such as safety and fault-tolerance, can be satisfied through a number of techniques that can be applied individually or combined to form a strategy to target specific application and platform domains. A strategy can include techniques such as hardware/software partitioning, code specialisation, source-code transformations, data-representation optimisation, insertion of modules to monitor runtime behaviour, among others.

However, strategies derived from user knowledge and expertise cannot be coded using current design tools and toolchains. As a result, strategies need to be manually applied by the user, usually involving a limited set of optimisations and code transformations available in the toolchain. After a few design iterations, the resulting transformed source-code often becomes more specialised, less portable, more polluted with source annotations and with multiple versions to manage, making it difficult to understand and thus to maintain. Another source of complexity arises from the need to manage and guide multiple tools, often with incompatible interfaces that prevent their use in a coherent and cohesive fashion.

As architectures become increasingly dynamic and heterogeneous, the corresponding tools need to be more sophisticated as they expose to the developers an increasing degree of tuning and configuration control. To meet the myriad of requirements, developers have to engage in sophisticated design space exploration (DSE) strategies to manage underlying trade-offs between performance, precision and even reliability. Without powerful design abstractions and correspondingly adequate tools, the development of designs may become sub-optimal and error-prone.

To address these issues we developed LARA, a domain-specific language that supports design strategies [1]. Design strategies define how applications are mapped to platforms such that the final solution satisfies non-functional requirements, including performance, resource efficiency, energy efficiency or a trade-off between these factors. LARA descriptions are completely decoupled from the behavioural description of the application, and both description types (functional and non-functional) can be combined through an automated process called weaving [1], which allows each functional and non-functional concern to be maintained independently.

In our approach, design strategies capture how and when tools are executed, and how sources are transformed and optimised using the same language and mechanisms. For instance, a strategy can direct the execution of some of the components of the toolchain in multiple passes, where source-level transformations and profiling are performed in each iteration until a certain condition is met, followed by invoking the backend components of the toolchain on a chosen design generated by the previous step. Using a single mechanism to control toolchain execution and source analysis and manipulation simplifies design space exploration, while allowing the creation of powerful strategies for optimisations that can satisfy non-functional requirements.

While existing DSE approaches [2]–[4] offer scripting and tuning mechanisms that allow customisation, composition and parameterisation of compiler transformations, they do not allow more comprehensive forms of user-defined analysis that involve capturing fine-grained source-code properties such as variables, expressions and statements, which can be an important part of the design-exploration process. We should note, however, that our approach can complement and extend existing DSE frameworks, as it has been designed to operate with a variety of tools.

978-0-7695-4914-9/12 $26.00 © 2012 IEEE

DOI 10.1109/ICCSE.2012.62

In this paper, we demonstrate how our aspect-oriented approach can be used to generate resource-efficient hardware designs by applying customised strategies within our current design-flow, which includes commercial and non-commercial tools. In particular, this paper makes the following contributions:

1) An approach that supports design space exploration which makes use of various compilation and optimisation tools, through the deployment of a master weaver and multiple slave weavers.

2) Three strategies for word-length optimisation based on this approach, which involve three tools: the WLOT word-length optimiser deploying a combination of analytical methods; the AutoESL tool compiling C-based descriptions into hardware; and the ISE tool targeting Xilinx devices.

3) An evaluation of the effectiveness of the proposed approach. In addition to promoting design re-use, our approach can be used to automatically produce a range of designs with different trade-offs in resource usage and numerical accuracy according to a given LARA-based strategy.

This paper is organized as follows. Section II provides a survey of related work. Section III describes our aspect-oriented approach, showing how multiple weavers can be accommodated. Section IV illustrates our approach with three word-length optimisation strategies to generate resource-efficient designs. Section V covers experimental results based on the proposed strategies. Section VI presents concluding remarks.

II. RELATED WORK

In this paper, we describe a framework that combines scripting, aspect-oriented design and word-length optimisation to produce a design-flow that is capable of generating resource-efficient designs according to a given strategy. Next we report related work in these areas.

A. Scripting Languages

Scripting languages, like Python [5] and Tcl [6], included in popular products from Mentor, Synopsys, Xilinx and Altera provide a simple mechanism to automate and customise FPGA-based design-flows. Scripting languages provide an efficient mechanism for glue-logic, allowing easy integration and parameterisation of coarse-grained components and tools. However, they have limited scope when it comes to the complex task of dealing with application sources and manipulating their properties. To overcome this limitation, the LARA language combines the power of the JavaScript [7] language with aspect-oriented mechanisms, which we explain next.

B. Aspect-Oriented Programming

Aspect-Oriented programming provides powerful mechanisms to intercept program execution points, query its properties, and alter its behaviour. Existing aspect-oriented languages include AspectJ [8] and AspectC++ [9], which share similar capabilities and target Java and C++ respectively. The main motivation behind these approaches is to increase modularity by having each functional concern described independently in the form of an aspect, and then have these concerns woven (merged) at run-time. The LARA approach [1], on the other hand, focuses on the problem of capturing non-functional concerns. In this case, the weaving process is performed at compile-time where compilation tools and source-code are controlled and manipulated to address, for instance, performance and resource efficiency. Existing work in LARA includes developing high-level synthesis strategies [10], strategies involving monitoring, specialisation and hardware/software partitioning [11], and using LARA to capture safety requirements [12].

C. Data Representation Optimisation

Word-length optimisation is an important technique in high-level synthesis to reduce the size of variables and expressions, which often leads to a simultaneous reduction in area and power consumption while maintaining the required accuracy. Automatic word-length optimisation techniques allow a trade-off between the number of bits representing variables and algorithm accuracy. Authors have reported improvements of up to 80% in hardware area from word-length selection techniques [13]. However, it has been demonstrated in [14] that determining the optimal word-length under area, speed, and error constraints is an NP-hard problem. Several published techniques aim to find optimal solutions within this large design space in reasonable amounts of time; they can be divided into two main approaches [15]: static (or analytical) and dynamic (or simulation). Dynamic analysis relies on the use of stimuli input signals, whereas static analysis techniques model the worst-case error as a function of operand word-lengths. The analytical approaches are mainly based on Interval Arithmetic (IA), Affine Arithmetic (AA) [16], SAT and Handelman representation [17]. It is shown in [18] that static analysis is capable of providing guaranteed analytical bounds on the output error. A comparative study of the performance of these techniques is presented in [19] and various approaches to the data representation problem are addressed in [20].

In this paper, we use existing static analysis techniques to optimise both range and precision for the variables in a fixed-point design. Our word-length optimisation process is designed to be fast but is conservative and tends to overestimate the bounds of variables. This is because interval arithmetic is unable to compute tight bounds whenever dependencies exist between operations, while affine arithmetic loses information when approximating non-affine operations, such as general multiplication or division. Techniques such as Handelman representation are being integrated to minimise this limitation. In particular, this approach constructs rational functions which retain all the information about the operations in the code and bounds this rational function using Handelman representations. Over a range of benchmarks, this approach has been shown to result in substantially less overestimation error [17].
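The dependency problem mentioned above can be illustrated with a minimal Python sketch. It is not part of WLOT; the `Interval` and `Affine` classes are hypothetical helpers written only for this illustration, and only subtraction is implemented. For x in [0, 1], interval arithmetic bounds x - x by [-1, 1] because it forgets that both operands are the same variable, whereas affine arithmetic tracks the shared noise symbol and returns the exact result.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)


_fresh_symbol = count()  # source of unique noise-symbol ids


@dataclass
class Affine:
    """centre + sum(coeff_i * eps_i), with every eps_i in [-1, 1]."""
    centre: float
    terms: dict = field(default_factory=dict)

    @staticmethod
    def from_range(lo, hi):
        # One fresh noise symbol per independent input.
        return Affine((lo + hi) / 2, {next(_fresh_symbol): (hi - lo) / 2})

    def __sub__(self, other):
        keys = set(self.terms) | set(other.terms)
        terms = {k: self.terms.get(k, 0.0) - other.terms.get(k, 0.0) for k in keys}
        return Affine(self.centre - other.centre, terms)

    def to_interval(self):
        radius = sum(abs(c) for c in self.terms.values())
        return Interval(self.centre - radius, self.centre + radius)


x_ia = Interval(0.0, 1.0)
ia_result = x_ia - x_ia                    # dependency lost: [-1, 1]

x_aa = Affine.from_range(0.0, 1.0)
aa_result = (x_aa - x_aa).to_interval()    # shared symbol cancels: [0, 0]
```

Affine forms stay exact only for affine operations; a multiplication would need an additional approximation term, which is the information loss the paragraph refers to.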

III. ASPECT-ORIENTED DESIGN-FLOW

In this section we describe our design-flow approach, which is illustrated in Fig. 1. It consists of a master weaver, in this case the DSE weaver, managing several slave weavers, such as the C weaver and the tool weavers. We have developed a hardware synthesis toolchain which is capable of deriving efficient FPGA designs from a C kernel description using a number of tools, including our word-length optimisation tool (WLOT), Xilinx AutoESL [21] and Xilinx ISE [22].

A. Overview

The basic component of our design-flow is a weaver, which operates on a script called an aspect. In particular, a weaver applies one or more aspects to application sources at compile-time, which can lead to one or more optimised implementations.


Fig. 1: A hardware synthesis design-flow using our aspect-oriented approach. (a) The main inputs to the design-flow are a set of C sources, one or more LARA aspects that define a strategy, and arguments for parameterisable strategies. A strategy description coordinates toolchain execution and source analysis and transformations. (b) The DSE weaver, which acts as a master, interfaces with other weavers using special aspects called coordinators. In this design-flow, we include (c) WLOT (Word-Length Optimisation Toolkit) which acts as a C weaver, and (d) Xilinx AutoESL and Xilinx ISE using tool weaving interfaces. C weavers allow analysis and manipulation of C sources, whereas tool weavers connect with third-party tools. When a DSE weaver dispatches a weaver to perform a certain task, the associated coordinator aspect is invoked, and arguments are passed down automatically from a global registry called log to the weaver. Results are stored back to the log once the weaver terminates, so that subsequent weavers can use them. Each log object is associated with a (e) design object, which represents an implementation at a specific point in the design process.

These aspects are described using the LARA language [1] and define a compilation strategy (Fig. 1(a)). In our current design-flow framework, we classify two types of weavers: (i) a master weaver which, in this case, is the DSE (Design Space Exploration) weaver, and (ii) one or more slave weavers, such as C weavers and tool weavers. A design-flow will usually include a single master weaver which acts as a manager, and is the entry point for processing a strategy (Fig. 1(b)). Other types of weavers in the design-flow process are deployed by the master weaver to process a subset of the strategies.

Having different types of weavers allows us to configure the design-flow with different components to address new requirements. Since weavers in the design-flow operate on a strategy, they must all be capable of processing LARA descriptions. Associated with each slave weaver is a special LARA aspect called a coordinator, which provides an interface between the master and the slave weavers (Fig. 1(c)). An example of a coordinator aspect is shown in Listing 2. Coordinator aspects must first be registered with the master weaver, so that they are automatically invoked when the corresponding weaver is dispatched. In this case, input arguments are first loaded from a global registry inside the master weaver to the coordinator. Once the slave weaver terminates, the results are stored back to the global registry, and are thus available to subsequent weaver invocations. This process of storing and retrieving results to and from the global registry is hidden from users by default during the dispatch process. Next, we explain each design-flow component in more detail.

B. Design-Space Exploration

Design-space exploration is performed by the DSE weaver (Fig. 1(b)), which takes as input: (1) a C application, (2) one or more aspects that compose a strategy, and (3) a set of arguments for parameterised strategies (Fig. 1(a)). The output of the DSE weaver, on the other hand, is defined by the strategy itself. For instance, it could be the result of executing a single iteration of all available tools or, alternatively, running multiple executions of a subset of these tools. Strategies, therefore, define how arguments are passed through different components, and which tools are executed and when, including the analysis and manipulation of tool reports and application sources.

At the heart of the DSE weaver is the design object (Fig. 1(e)), which represents an implementation at a specific point in the design process. Objects in LARA are similar to those found in object-oriented languages, where methods and properties are used to interface enclosed data. The design object provides an interface which allows users to manipulate all weavers in the design-flow and build a particular implementation. Multiple implementations require instantiating multiple design objects.

The DSE weaver dispatches other weavers in the design-flow to execute part of the strategy. For instance, aspects that deal with C source manipulation are processed by the C weavers, whereas aspects that deal with specific third-party tools are processed by tool weavers. In order to dispatch a weaver in the design-flow, the DSE weaver must have registered a list of coordinator aspects associated with their respective weavers. Each coordinator, which is written in LARA, has an aspect definition with an interface specification that enumerates all input (lines 2–5 in Listing 2) and output (line 6 in Listing 2) parameters. In the dispatch process, all data produced and consumed by weavers are stored automatically in a global registry, which is a hash map object called log. The log object resides inside each design object, and can be manipulated directly by the DSE weaver to manage data manually if necessary. Other weavers do not access the log object directly, as arguments and results are dispatched automatically to and from the weavers through the input/output interface of the aspect definition.
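The dispatch mechanism described above can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the DSE weaver's implementation: the `Design` and `Toolchain` classes, the `register` method and the two stand-in weavers (`fake_wlot`, `fake_synth`) are all hypothetical. Each registered coordinator declares its input and output names; on dispatch, inputs are read from the design's log and outputs are written back, so a later weaver sees the results of an earlier one.

```python
class Design:
    """One implementation; 'log' plays the role of the global registry."""
    def __init__(self, name):
        self.name = name
        self.log = {}


class Toolchain:
    def __init__(self):
        self._coordinators = {}

    def register(self, weaver, inputs, outputs, run):
        # A coordinator is modelled as (input names, output names, callable).
        self._coordinators[weaver] = (inputs, outputs, run)

    def execute(self, design, weavers):
        for weaver in weavers:
            inputs, outputs, run = self._coordinators[weaver]
            args = {name: design.log[name] for name in inputs}   # log -> weaver
            results = run(**args)
            for name in outputs:                                 # weaver -> log
                design.log[name] = results[name]


# Stand-in weavers with made-up behaviour, for illustration only.
def fake_wlot(target_error, sources):
    return {"sources": sources + ["fixed_point.c"],
            "achieved_error": target_error / 2}


def fake_synth(sources):
    return {"slice_util": 0.1 * len(sources)}


toolchain = Toolchain()
toolchain.register("wlot", ["target_error", "sources"],
                   ["sources", "achieved_error"], fake_wlot)
toolchain.register("synth", ["sources"], ["slice_util"], fake_synth)

design = Design("demo")
design.log.update(target_error=1e-4, sources=["kernel.c"])
toolchain.execute(design, ["wlot", "synth"])
```

After the call, `design.log` holds the sources rewritten by the first weaver and the utilisation reported by the second, without the strategy passing either value explicitly.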

A strategy can instantiate one or more design objects, which can evolve differently to generate different implementations depending on how the design-flow is executed (Listing 1). When a design object is instantiated, a new workspace (folder) is created so that implementations can be built independently. Design objects can also be saved offline, and then retrieved in later DSE executions, for instance, serving as an implementation cache (Listing 3).

C. Source Weaving

The LARA language has unique features that enable precise analysis and manipulation of source-code constructs and properties at compile-time. Consider the following example:

     1  select function.loop end
     2  apply
     3    $loop.unroll("full");
     4    $function.map("processor", "VIRTEX5");
     5  end
     6  condition
     7    $loop.is_innermost &&
     8    $function.hwcost < $function.swcost &&
     9    $function.is_hwsynth
    10  end

In this example, the select statement (line 1) finds loops in functions for a given application, and places the results in a table with two columns (in this case function and loop as defined in the select expression). Next, the apply statement (line 2) executes all enclosed actions (lines 3–4) for every row in this table, as long as the condition (lines 7–9) is satisfied. Note that for each apply iteration, the elements of the table row are referenced by $function and $loop objects. Each of these elements references a join point, which intercepts a specific point in the program at compile-time. Associated with each join point we have attributes which provide information about it. For instance, $loop.is_innermost returns true when the corresponding join point is an innermost loop. In addition, actions can target join points and modify them at compile-time as shown in lines 3–4. More precisely, our example requests that all innermost loops be unrolled and functions that enclose these loops be mapped to hardware if their estimated mapping cost to hardware is lower than the mapping cost in software. In this paper, we use LARA's select-apply mechanism to apply word-length optimisation for a C function (Listing 2). Other features such as loop transformations and hardware-software partitioning are discussed elsewhere [10] [11].
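The select/apply/condition semantics described above can be mimicked with a small Python sketch. It is only a model of the behaviour explained in the text, not the LARA weaver: the `Function` and `Loop` classes, their fields and the cost values are hypothetical stand-ins for join points and their attributes.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    name: str
    is_innermost: bool
    unrolled: bool = False


@dataclass
class Function:
    name: str
    hwcost: float
    swcost: float
    is_hwsynth: bool
    loops: list = field(default_factory=list)
    target: str = "software"


def select(functions):
    """Join-point table: one (function, loop) row per loop found."""
    return [(fn, loop) for fn in functions for loop in fn.loops]


def weave(functions):
    for fn, loop in select(functions):
        # condition block: all three attributes must hold for this row
        if loop.is_innermost and fn.hwcost < fn.swcost and fn.is_hwsynth:
            # apply block: the two actions of the example
            loop.unrolled = True
            fn.target = "VIRTEX5"


# Made-up application model, for illustration only.
app = [
    Function("filter", hwcost=10, swcost=40, is_hwsynth=True,
             loops=[Loop("outer", False), Loop("inner", True)]),
    Function("report", hwcost=50, swcost=5, is_hwsynth=True,
             loops=[Loop("print_loop", True)]),
]
weave(app)
```

Only the innermost loop of `filter` is unrolled and only `filter` is mapped to hardware; `report` is left untouched because its hardware cost exceeds its software cost.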

Fig. 2: Word-Length Optimisation Toolkit (WLOT). The word-length optimisation toolkit is divided into two tasks, namely: range analysis and precision analysis. Range analysis involves determining the integer word-length (IWL), whereas the precision analysis method determines the required fractional word-lengths (FWL) of fixed-point variables. The error function captures the output error as a function of the word-length of the signals of the design. The cost function returns the area cost as a function of the signal word-length and their arithmetic operators. WLOT takes as input a representation of the C kernel, input ranges and output error requirements. Results from the word-length optimisation process can be translated into an aspect description or directly into a transformed (woven) C code with the appropriate fixed-type representation.

D. Tool Integration

A tool weaver, like other types of weavers in the design-flow, processes LARA descriptions. However, it also incorporates specific functionalities for managing third-party tools (Fig. 1(d)). In particular, the tool weaver interpreter contains facilities for executing tools, passing down parameters through the command-line, generating arbitrary scripts, and instrumenting sources. The instrumentation facility is useful, for instance, to insert pragmas on top of functions or loops. It uses the same select-apply mechanisms used in source weavers; however, it does not support other actions related to code transformations and optimisations. Once tools complete their execution, tool weavers have facilities to parse tool report files.
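As a rough illustration of two of these facilities, the Python sketch below builds a command line from a set of flags and parses a plain-text report into key/value results. Everything here is an assumption made for the example: the tool name `hypothetical_synth`, the `--key value` flag style and the `name: value` report format are invented and do not correspond to the AutoESL or ISE interfaces.

```python
import re


def build_command(tool, flags):
    """Turn a flag dictionary into an argument list for a command-line tool."""
    args = [tool]
    for key, value in flags.items():
        args += [f"--{key}", str(value)]
    return args


# Matches lines such as "slices: 1200" or "clock_ns = 2.87" (made-up format).
_REPORT_LINE = re.compile(
    r"^\s*(\w+)\s*[:=]\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\s*$"
)


def parse_report(text):
    """Collect numeric 'name: value' entries; other lines are ignored."""
    results = {}
    for line in text.splitlines():
        match = _REPORT_LINE.match(line)
        if match:
            results[match.group(1)] = float(match.group(2))
    return results


command = build_command("hypothetical_synth", {"clock": 3, "top": "subband"})
report = parse_report("slices: 1200\nclock_ns = 2.87\n# free-form comment\n")
```

In the design-flow described here, values such as those in `report` would be the results a tool weaver stores back for subsequent weavers.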

Adding a new tool in the design-flow (such as profilers and backend compilers) involves writing a coordinator aspect and using a tool weaver to control it, and then registering the coordinator aspect in the DSE weaver. The coordinator aspect must specify all the necessary parameters to perform the required tasks and to output the results (lines 2–6 in Listing 2) to other weavers in the design-flow.

E. Word-length Optimisation

Our design-flow includes a word-length optimisation tool called WLOT (Fig. 2) which interfaces with the rest of the components using the WLOT coordinator aspect (Listing 2). This tool takes as input a C function and optimises the sizes of data variables to generate efficient fixed-point designs based on given ranges for input variables and required output accuracy. WLOT supports multiple outputs depending on compilation flags. For instance, when interfacing with other C weavers, it outputs an aspect that captures the word-length specification for all variables in the kernel. Alternatively, when interfacing with tool weavers, such as Xilinx AutoESL, the original floating-point description is transformed to a fixed-point design with the appropriate word-length specification embedded in the source-code.

Our word-length approach uses static analysis techniques to provide word-length estimates along with the analytical error models to optimise both ranges and precisions for variables in a kernel. Range analysis involves determining the integer word-length (IWL) via standard IA and AA techniques [16]. The precision analysis method determines the required fractional word-lengths (FWL) of fixed-point variables, and is based on examining the worst-case error bounds at each stage using a multi-interval arithmetic method to calculate the smallest absolute values that must be represented in each variable.
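To make the roles of IWL and FWL concrete, the following Python sketch derives both for a single variable. It is a simplified illustration and not WLOT's algorithm: it assumes a signed two's-complement format whose IWL includes the sign bit, a single truncating quantizer whose error is bounded by 2^-FWL, and it ignores how errors propagate to the output, which is what the error function in Fig. 2 captures. The function names are hypothetical.

```python
def integer_word_length(lo, hi):
    """Smallest signed IWL (sign bit included) whose range covers [lo, hi]."""
    if lo > hi:
        raise ValueError("empty range")
    iwl = 1
    # A signed format with iwl integer bits represents [-2^(iwl-1), 2^(iwl-1)).
    while not (-(2 ** (iwl - 1)) <= lo and hi < 2 ** (iwl - 1)):
        iwl += 1
    return iwl


def fractional_word_length(max_abs_err):
    """Smallest FWL whose truncation step 2^-FWL does not exceed the bound."""
    if max_abs_err <= 0:
        raise ValueError("error bound must be positive")
    fwl = 0
    while 2.0 ** -fwl > max_abs_err:
        fwl += 1
    return fwl


iwl_unit = integer_word_length(0.0, 1.0)      # input range used in Listings 1-4
fwl_loose = fractional_word_length(1e-4)
fwl_tight = fractional_word_length(1e-6)
```

Under these assumptions, tightening the error bound from 1e-4 to 1e-6 costs six additional fractional bits for this one quantizer, which gives a feel for why the error specification affects area.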

In the next section, we cover three design strategies that exploit this word-length optimisation process.

IV. OPTIMISATION STRATEGIES

In general, the word-length optimisation process can benefit from strategies that provide an iterative mechanism to search for the optimal solution in a large design space. This is achieved by controlling non-functional parameters and transformations that affect the final design through constraints and directives of various tools, in particular: loop unrolling factors to balance resource utilization and throughput; the use of specialised blocks like DSP48, dedicated multipliers and IP cores; the use of uniform or non-uniform quantization; the level of resource sharing to improve resource utilization; and latency and throughput constraints, which affect scheduling and resource allocation. In addition, strategies can improve word-length optimisation time by grouping variables, accepting user-specified word-lengths for a subset of variables, and using a combination of analytical methods.

In this section we present three design strategies that exploit the design-flow shown in Fig. 1 concerning resource efficiency. With these case studies, we wish to demonstrate that design exploration is simplified and enhanced when tool execution and source-code analysis are not specified separately, but instead are described as part of the same strategy. Moreover, we wish to show that:

• strategies can solve different problems using the same design-flow components,

• strategies can be parameterised which helps design-exploration, for instance to select architectural details or to scale the search space,

• strategies can be reused and applied to different applications.

A. Strategy 1

With this strategy, we wish to generate four designs according to two architectural decisions: the use of DSP48 resources and the use of a uniform representation where all variables have the same fixed-point specification. Listing 1 presents the aspect description used to implement this strategy. In this code, rather than specifying the arguments for each slave weaver, we store them in the log object (lines 14–19 in Listing 1), and these arguments are automatically dispatched to the appropriate weaver according to the coordinator aspect interface. Conversely, once each slave weaver completes execution, the results are automatically stored back to the log object. The invocation of all slave weavers in the design-flow is performed in lines 20–22 in Listing 1.

     1  aspectdef Strategy1
     2    input max_abs_error, fn, sources, flags end
     3    output designs end
     4
     5    config =
     6      [ c1: { dsp48_mode: false,
     7              uniform_mode: false },
     8        c2: {..}, c3: {..}, c4: {..}
     9      ];
    10
    11    designs = {};
    12    for (c in config) {
    13      var design = toolchain.new(c);
    14      design.log.input_range = {default: [0,1]};
    15      design.log.target_error = max_abs_error;
    16      design.log.uniform_mode =
    17        config[c].uniform_mode;
    18      design.log.dsp48_mode = config[c].dsp48_mode;
    19      ...
    20      design.execute(['wlot',
    21                      'autoESL',
    22                      'ise']);
    23
    24      designs[c] = design;
    25    }
    26  end

Listing 1: LARA aspect definition for Strategy 1 (Section IV-A). This strategy generates four designs based on two parameters (lines 5–9): (1) whether to use DSP48 slices, and (2) whether to use uniform or non-uniform modes. For each configuration, we instantiate a design object (line 13), set the parameters on the design log (lines 14–19) and execute all three slave weavers in sequence (lines 20–22): word-length optimiser (wlot), Xilinx AutoESL and Xilinx ISE. Each of the four designs is kept in a hash map (line 24) which is defined as an output parameter (line 3). Listing 2 refers to the aspect used when wlot is invoked.

     1  aspectdef WLOT_coordinator
     2    input uniform_mode, input_range,
     3          target_error, fn,
     4          sources, flags
     5    end
     6    output sources, achieved_error end
     7
     8    select function{fn}.var{type=="float",
     9                            type=="double"}
    10    end
    11    apply $var.def({type: "real"}); end
    12
    13
    14    select function{name==fn}.var end
    15    apply
    16      if ($var.is_in)
    17        $var.def({"in_range": input_range.default});
    18      if ($var.is_out)
    19        $var.def({"max_abs_err": target_error});
    20    end
    21
    22    select function{name==fn}
    23    apply
    24      var res =
    25        $function.wlot(setUniform = uniform_mode);
    26      achieved_error = res.achieved_error;
    27      sources = res.woven_sources;
    28    end
    29  end

Listing 2: A LARA coordinator aspect which controls the WLOT weaver (Fig. 1(c)). To configure the WLOT tool, we mark floating-point variables as targets for optimisation (lines 8–11), specify the ranges for input variables (lines 16–17), and set the maximum absolute error for output variables (lines 18–19). We run the WLOT process in lines 24–25, and set the output parameters in lines 26–27.

This example can support additional architectural choices by extending the configuration vector (lines 5–9 in Listing 1) to automatically generate a new pool of implementations.
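One way to see how the configuration vector scales is to generate it from the list of architectural choices rather than writing each entry by hand. The Python sketch below does this with a Cartesian product; the helper `build_configs` is hypothetical, and the ordering of `c2` to `c4` is an assumption, since Listing 1 elides those entries.

```python
from itertools import product


def build_configs(**choices):
    """Enumerate every combination of the given choices as c1, c2, ..."""
    names = list(choices)
    configs = {}
    for index, values in enumerate(product(*choices.values()), start=1):
        configs[f"c{index}"] = dict(zip(names, values))
    return configs


# Two binary decisions give the four designs of Strategy 1.
configs = build_configs(dsp48_mode=[False, True], uniform_mode=[False, True])

# Adding a third, hypothetical choice with three values scales the pool to 12.
larger = build_configs(dsp48_mode=[False, True],
                       uniform_mode=[False, True],
                       unroll_factor=[1, 2, 4])
```

Each resulting entry would then be copied into a design log and executed, as in lines 12–25 of Listing 1.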

B. Strategy 2

We now adopt a different strategy (Listing 3) using the same design-flow components. In this case, we wish to generate a pool of Pareto-optimal implementations with two objective measures, maximum absolute error and minimum area, for a given C function. This strategy has two steps. The first step is optional, and is executed when the user specifies a vector with maximum absolute errors. In this case, the toolchain derives for each element of this vector the associated dominant value for the Pareto-optimal front. Once these designs are generated, they are stored and can be referenced in future executions of this strategy.

     1  aspectdef Strategy2
     2    input
     3      err_vector, fn,
     4      mode_op, target_error
     5    end
     6    output
     7      best_design
     8    end
     9
    10    var designs = { };
    11
    12    for (err in err_vector) {
    13      var design_name = fn + "_" + err;
    14      design = toolchain.load(design_name)
    15      if (design == null) {
    16        design = toolchain.new(design_name);
    17        design.log.input_range = {default: [0,1]};
    18        design.log.target_error = err;
    19        design.log.uniform_mode = false;
    20        design.log.dsp48_mode = false;
    21        design.execute(['wlot','autoESL','ise']);
    22        design.save();
    23      }
    24      designs[design_name] = design;
    25    }
    26
    27    if (mode_op == "accuracy") {
    28      best_design = design_by_acc(designs, target_error);
    29    } else if (mode_op == "resource") {
    30      best_design = design_by_area(designs, target_error);
    31    }
    32  end

Listing 3: LARA aspect definition for Strategy 2 (Section IV-B). This strategy has four parameters (lines 2–5): a vector with a set of maximum absolute errors, the name of the function which we wish to optimise, an operation mode and a target value. This strategy has two steps. The first step (lines 12–25) uses a list of maximum absolute errors provided by the user to generate implementations from the Pareto-optimal front under minimum resource utilisation. These designs are cached and restored in future uses of this strategy. In a second step, users can select from this implementation cache the most accurate or efficient design according to a mode of operation (lines 27–31).

In the second step, this strategy allows users to select from cached implementations (generated in the first step) the most accurate or efficient design in area that satisfies the maximum absolute error defined in the input target parameter. Because designs are cached, this strategy can produce fast results at the expense of optimality.
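The selection step can be sketched in Python as follows. The source names the helpers `design_by_acc` and `design_by_area` but does not show their bodies, so the versions below are one plausible reading rather than the authors' implementation; the `Candidate` record, the `pareto_front` helper and all error/area figures are invented for the example and are not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    error: float  # achieved maximum absolute error
    area: int     # resource usage, e.g. occupied slices


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both error and area."""
    def dominated(c):
        return any(o.error <= c.error and o.area <= c.area
                   and (o.error < c.error or o.area < c.area)
                   for o in candidates)
    return [c for c in candidates if not dominated(c)]


def design_by_area(candidates, target_error):
    """Smallest design among those meeting the error target, or None."""
    feasible = [c for c in candidates if c.error <= target_error]
    return min(feasible, key=lambda c: c.area) if feasible else None


def design_by_acc(candidates, target_error):
    """Most accurate design among those meeting the error target, or None."""
    feasible = [c for c in candidates if c.error <= target_error]
    return min(feasible, key=lambda c: c.error) if feasible else None


# Made-up cache contents, for illustration only.
cache = [
    Candidate("k_1e-3", 1e-3, 100),
    Candidate("k_1e-4", 1e-4, 140),
    Candidate("k_1e-5", 1e-5, 200),
    Candidate("k_alt", 1e-4, 180),   # same error as k_1e-4 but larger
]
front = pareto_front(cache)
smallest = design_by_area(cache, 1e-4)
most_accurate = design_by_acc(cache, 1e-4)
```

Because the selection only considers cached candidates, a target that falls between two cached error values is served by the next tighter design, which is the loss of optimality noted above.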

C. Strategy 3

In this strategy, we wish to explore how much more accuracy we can achieve for a given area limit (Listing 4). For this purpose, we specify the minimum accuracy for our design and a maximum area limit, and the strategy increases the accuracy by a given step for each iteration until this limit is reached. All tools are executed in each iteration. Since this strategy may not terminate, we put a limit on the number of iterations. This strategy can be extended to support other problems, such as finding the accuracy of a design that matches an area target, or to incorporate optimised searching techniques such as binary search.

 1 aspectdef Strategy3
 2   input
 3     initial_err, fn, step, max_util
 4   end
 5   output
 6     design
 7   end
 8
 9   var err = initial_err;
10   var prev_design, design;
11   var num_iter = 0;
12   do {
13     design = toolchain.new(fn + "_" + err);
14     design.log.input_range = {default: [0,1]};
15     design.log.target_error = err;
16     design.log.uniform_mode = false;
17     design.log.dsp48_mode = false;
18
19     design.execute(['wlot','autoESL','ise']);
20
21     prev_design = design;
22     err = err * step;
23     num_iter++;
24   } while(design.log.slice_util < max_util
25           && num_iter < 100);
26   design = prev_design;
27 end

Listing 4: LARA aspect definition for Strategy 3 (Section IV-C). This strategy generates designs using an iterative process starting with a minimum accuracy provided by the user (line 3), and then gradually increasing the accuracy by a given step (line 22) until the area reaches the limit of maximum utilization or the maximum number of iterations allowed (lines 24–25).
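The iterative search can also be sketched in Python. This is an illustration under stated assumptions rather than a translation of the LARA aspect: `explore_accuracy` and its `evaluate` callback are hypothetical, with `evaluate` standing in for a full WLOT, AutoESL and ISE run and returning the LUT utilisation as a fraction of the device. The sketch explicitly keeps the last design that stayed within the limit, which is the selection behaviour described for this strategy in Section V-C. The replay function uses the utilisation figures reported for the subband kernel in Table V.

```python
import math


def explore_accuracy(evaluate, initial_err, step, max_util, max_iter=100):
    """Tighten the error target until the utilisation limit is reached.

    evaluate(err) is a hypothetical stand-in for the tool flow and returns
    the utilisation (0..1) of the design generated for error target err.
    Returns (err, util) for the last design within the limit, or None if
    even the first design exceeds it.
    """
    best = None
    err = initial_err
    for _ in range(max_iter):        # bound the search, as in Listing 4
        util = evaluate(err)
        if util >= max_util:         # area limit reached: stop searching
            break
        best = (err, util)           # remember the last feasible design
        err *= step                  # step < 1 tightens the error target
    return best


# Utilisation per error exponent for the subband kernel (Table V).
TABLE_V_UTIL = {-1: 0.124, -2: 0.181, -3: 0.241, -4: 0.323, -5: 0.417, -6: 0.516}


def replay_table_v(err):
    """Look up the reported utilisation instead of running the tools."""
    return TABLE_V_UTIL[round(math.log10(err))]


result = explore_accuracy(replay_table_v, initial_err=1e-1, step=0.1, max_util=0.50)
```

With these inputs the search stops when the 1e-6 design reports 51.6% utilisation and returns the 1e-5 design at 41.7%.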

V. EVALUATION

In this section, we evaluate the three strategies described in Section IV for two application kernels: the subband filter used by an MPEG encoder implementation and the gridIterate kernel used by the 3D path planning application [10]. Both implementations are written in C. Our design flow includes Xilinx AutoESL 2011.2 and Xilinx ISE 13.2, and targets a Xilinx Virtex-5 XC5VSX240t:ff1738-2 device. We consider a 3 ns clock period constraint with 1.25 ns uncertainty for all the designs.

The subband kernel consists of a pre-summer followed by a 64-tap FIR filter along with four quantizers Q1 to Q4 (Fig. 3). For the gridIterate kernel, we use two quantizers Q1 and Q2 (Fig. 4). Each of these quantizers appears in the result tables (Tables I–IV) in the form (IWL, FWL), where IWL is the integer word-length and FWL is the fractional word-length.
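To make the (IWL, FWL) notation concrete, the following Python sketch quantises a real value to a signed fixed-point format. It relies on assumptions that the text does not state: round-to-nearest, saturation on overflow, and an IWL that includes the sign bit. The `quantize` helper is hypothetical and is not part of WLOT.

```python
def quantize(x, iwl, fwl):
    """Round x to a signed fixed-point value with iwl integer bits
    (assumed to include the sign bit) and fwl fractional bits."""
    scale = 1 << fwl                     # 2**fwl codes per unit
    lo = -(1 << (iwl - 1)) * scale       # most negative representable code
    hi = (1 << (iwl - 1)) * scale - 1    # most positive representable code
    code = round(x * scale)              # round to nearest (ties to even)
    code = max(lo, min(hi, code))        # saturate on overflow
    return code / scale


# Quantizer Q1 of the subband kernel for the 1e-4 specification is (3, 21).
q = quantize(0.3, iwl=3, fwl=21)
rounding_error = abs(q - 0.3)
```

Under these assumptions, an in-range value is off by at most 2^-(FWL+1) after rounding, so each extra fractional bit halves the worst-case rounding error of that quantizer.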

Fig. 3: Block diagram for the subband kernel using four quantizers.

Fig. 4: Block diagram for the gridIterate kernel.


TABLE I: Results for the subband kernel using Strategy 1

Error Spec.  Q1..Q4                        App. Error  LUT (no DSP48)  With DSP48: DSP  With DSP48: LUT
1e-4         (3,21),(2,23),(10,15),(1,24)  9.19e-5     48076           128              2310
1e-5         (3,24),(2,26),(10,18),(1,27)  9.59e-6     62019           256              2571
1e-6         (3,27),(2,30),(10,21),(1,31)  9.57e-7     76772           256              4909
–            float                         6.88e-5     89896           320              5728

TABLE II: Results for the gridIterate kernel using Strategy 1

Error Spec.  Q1..Q2         App. Error  LUT (no DSP48)  With DSP48: DSP  With DSP48: LUT
1e-4         (3,14),(1,16)  7.63e-5     18661           14               4204
1e-5         (3,17),(1,19)  9.53e-6     22524           28               4730
1e-6         (3,20),(1,23)  8.34e-7     24982           28               5340
–            float          4.17e-7     34289           60               7079

A. Strategy 1

Table I shows the results obtained for the subband kernel with different output accuracies and using non-uniform word-length mode. The input data range is defined as [-0.16, 0.51] and the filter coefficient in the range of [-1, 1]. The resource utilisation (LUT) for implementations without DSP48 components is shown in the fourth column. The last two columns represent the resource utilisation using DSP48 components. The last row in Table I is the result from the floating point implementation when the word-length optimisation process is not applied.

If we compare the fixed-point design with an error specification of 1e-5 against the floating-point description with an output error specification of 9.50e-6, we get approximately 31% savings in area (LUTs) using non-uniform quantizers and no DSP48 components. When using DSP48 components, the resource savings are 55% in LUTs and 20% in DSP48 components.

Table II shows the results obtained for the gridIterate kernel using the non-uniform word-length optimisation mode with an input range of [0,1] and a constant coefficient multiplier. The last row of Table II shows the result of the floating point implementation of the gridIterate kernel when the word-length optimisation is not used. In this case, the fixed-point design with an error specification of 1e-6 and non-uniform quantizers saves around 27% in resources when compared with the floating-point design, which achieves an error of 4.17e-7.

From the above, Strategy 1 can be seen to automatically derive designs that offer different tradeoffs between accuracy and resource usage according to various parameters (DSP48 mode, non-uniform mode, target error, input ranges). For instance, in Table II, the fixed-point implementation with an error specification of 1e-4 and no DSP48 components consumes 25% less resources than a similar design with an error specification of 1e-6. When using DSP48 components, the reduction is 21% in LUTs and 50% in DSP48 components.
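The savings quoted for Strategy 1 follow directly from the entries of Tables I and II. The short Python sketch below recomputes them; the `saving` helper and the labels are introduced here for illustration only.

```python
def saving(reference, optimised):
    """Fractional reduction of `optimised` relative to `reference`."""
    return 1.0 - optimised / reference


claims = {
    # Table I: fixed-point 1e-5 design against the floating-point design.
    "subband LUT, no DSP48": saving(89896, 62019),
    "subband LUT, with DSP48": saving(5728, 2571),
    "subband DSP48": saving(320, 256),
    # Table II: fixed-point 1e-6 design against the floating-point design.
    "gridIterate LUT, no DSP48, vs float": saving(34289, 24982),
    # Table II: 1e-4 design against the 1e-6 design.
    "gridIterate LUT, no DSP48": saving(24982, 18661),
    "gridIterate LUT, with DSP48": saving(5340, 4204),
    "gridIterate DSP48": saving(28, 14),
}

for label, value in claims.items():
    print(f"{label}: {value:.0%}")
```

Rounded to whole percentages, the results are 31%, 55%, 20%, 27%, 25%, 21% and 50%, matching the figures given in the text.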

New implementations and tradeoffs can be derived by revising the LARA code in Listing 1, in particular the configuration vector in lines 5–9, and by providing different input arguments to this strategy.

B. Strategy 2

This strategy specifies the maximum absolute error range between 1e-4 and 1e-15 for output variables as shown in the first column of Table III. In this case, the toolchain generates Pareto-optimal solutions for two conflicting objectives, namely,

TABLE III: Results for the subband kernel using Strategy 2

Error Spec.  Q1..Q4                        App. Error  LUT
1e-4         (3,21),(2,23),(10,15),(1,24)  9.19e-5     48076
1e-5         (3,24),(2,26),(10,18),(1,27)  9.59e-6     62019
1e-6         (3,27),(2,30),(10,21),(1,31)  9.57e-7     76772
1e-8         (3,34),(2,36),(10,28),(1,37)  9.37e-9     111510
1e-10        (3,41),(2,43),(10,35),(1,44)  8.77e-10    149833
1e-12        (3,47),(2,50),(10,41),(1,51)  9.13e-13    202650
1e-15        (3,57),(2,60),(10,51),(1,61)  8.91e-16    272034

Fig. 5: Pareto-optimal front for the subband kernel derived from Strategy 2. (Plot of area in LUTs, from 0 to 3 × 10^5, against application error, from 10^-16 to 10^-4.)

TABLE IV: Results for the gridIterate kernel using Strategy 2

Error Spec.  Q1..Q2         App. Error  LUT
1e-4         (3,14),(1,16)  7.63e-5     18661
1e-5         (3,17),(1,19)  9.53e-6     22524
1e-6         (3,20),(1,23)  8.34e-7     24982
1e-8         (3,27),(1,29)  9.31e-9     33028
1e-10        (3,34),(1,36)  7.28e-11    46665
1e-12        (3,40),(1,43)  7.95e-13    52139
1e-15        (3,50),(1,53)  7.74e-16    66225

minimization of resource utilisation and error (Fig. 5). For the gridIterate kernel the data are tabulated in Table IV. This data gives the designer flexibility in choosing designs when the amount of available resources, or the amount of output noise that can be tolerated, is uncertain. For example, if we request a design with an accuracy of 1e-11 or 1e-12 for the subband kernel (Table III), then we can easily obtain from the generated pool of implementations a design with a maximum absolute error of 9.13e-13 corresponding to a total area of 202,650 LUTs. Moreover, since the implementations are cached, the LARA-based design strategy can derive the results instantly rather than having to use the full synthesis flow.
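A design is Pareto-optimal when no other design is at least as good in both objectives and strictly better in one. The Python sketch below filters a list of (error, area) pairs accordingly. The `pareto_front` helper is hypothetical and is not how the toolchain derives the front; the extra point added in the usage example is made up purely to show a dominated design being discarded.

```python
def pareto_front(points):
    """Return the points that no other point dominates.

    Each point is an (error, area) pair and both objectives are minimised.
    A distinct point that is no worse in both coordinates is necessarily
    strictly better in at least one, so it dominates.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


# (application error, LUTs) for the subband kernel, from Table III.
TABLE_III = [
    (9.19e-5, 48076),
    (9.59e-6, 62019),
    (9.57e-7, 76772),
    (9.37e-9, 111510),
    (8.77e-10, 149833),
    (9.13e-13, 202650),
    (8.91e-16, 272034),
]

# Hypothetical extra design: less accurate and larger than the 1e-5 entry.
candidates = TABLE_III + [(1e-5, 80000)]
front = pareto_front(candidates)
```

Every entry of Table III survives the filter, since each step towards a smaller error costs additional LUTs, while the made-up point is removed.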

C. Strategy 3

This strategy, shown in Listing 4, specifies for the subband kernel a minimum accuracy of 1e-1 (initial_err) and a maximum LUT utilisation (max_util) of 50% of the total


TABLE V: Results for the subband kernel using Strategy 3

Error Spec.  Q1..Q4                        App. Error  LUT    % Util.
1e-1         (3,11),(2,13),(10,5),(1,14)   9.42e-2     18551  12.4
1e-2         (3,14),(2,16),(10,8),(1,17)   9.82e-3     26919  18.1
1e-3         (3,17),(2,20),(10,11),(1,21)  9.80e-4     35843  24.1
1e-4         (3,21),(2,23),(10,15),(1,24)  9.19e-5     48076  32.3
1e-5         (3,24),(2,26),(10,18),(1,27)  9.59e-6     62019  41.7
1e-6         (3,27),(2,30),(10,21),(1,31)  9.57e-7     76772  51.6

TABLE VI: Results for the gridIterate kernel using Strategy 3

Error Spec.  Q1..Q2         App. Error  LUT    % Util.
1e-1         (3,4),(1,6)    7.81e-2     12251  8.23
1e-2         (3,7),(1,9)    9.76e-3     14037  9.43
1e-3         (3,10),(1,13)  8.54e-4     15994  10.8
1e-4         (3,14),(1,16)  7.63e-5     18661  12.5
1e-5         (3,17),(1,19)  9.53e-6     22524  15.1

148,760 LUTs available. Here we consider a non-uniform quantizer with no DSP48 usage. As shown in Table V, the design flow starts with a minimum accuracy of 1e-1, and finds that the percentage of LUTs used is 12.4%, which is well within the target limit of 50%. The accuracy is then increased by an order of magnitude, and the process is repeated until resource usage exceeds 50%, which in this case occurs at an accuracy of 1e-6, as shown in Table V. This iterative strategy selects the design using 41.7% of the FPGA LUT resources with an accuracy of 1e-5. Smaller step sizes can be specified, possibly allowing the same iterative design approach to arrive at a better design result at the expense of time to solution.

For the gridIterate kernel, we set the maximum LUT usage to 15% of the total LUTs available. As in the previous kernel, we consider non-uniform quantizers for the variables and no usage of DSP48. Table VI shows the iteration steps followed to meet the LUT usage limit of 15%. The design with 1e-4 accuracy, occupying 12.5% of the LUTs, meets the required user constraint of 15%, whereas the 1e-5 design exceeds it at 15.1%.

This strategy exemplifies how design-space exploration can be customised with our approach to invoke a design-flow iteratively until an implementation is generated that satisfies given requirements.

VI. CONCLUSION

This paper presents an approach where developers control a wide range of compilation and hardware synthesis tools using an aspect-oriented domain-specific language named LARA. LARA allows developers to define and automate the execution of strategies to derive efficient hardware/software designs by exploiting the combined capabilities of available tools in the context of design-space exploration. We demonstrate the effectiveness of this approach by presenting a complete design-flow that synthesizes C functions into resource-efficient hardware designs exploiting word-length optimisation. Our word-length optimiser takes a C floating point kernel and uses static analysis techniques to optimise range and precision for variables to produce efficient fixed-point design implementations. In addition, we highlight the flexibility of this approach by describing three strategies that allow the tool to quickly and automatically derive designs with different resource and application accuracies using the same design-flow components.

Future work includes designing strategies that exploit source-level transformations such as pipelining and unrolling in addition to word-length optimisation. We also wish to extend our word-length optimisation process to target C kernels that exhibit more complex control-flow, and support Handelman representation [17] to overcome our current limitation of overestimation.

ACKNOWLEDGEMENTS

This work was partially supported by the European Community’s Framework Programme 7 (FP7) under contract No. 248976, 257906 and 287804, and UK EPSRC. The authors are grateful to the members of the REFLECT project for their support.

REFERENCES

[1] J.M.P. Cardoso, T. Carvalho, J.G.F. Coutinho, W. Luk, R. Nobre, P.C. Diniz and Z. Petrov, “LARA: an Aspect-Oriented Programming Language for Embedded Systems,” in ACM Proc. of the 11th Annual Intl. Conf. on Aspect-oriented Software Development, 2012, pp. 179–190.

[2] A. Tiwari, C. Chen, J. Chame, M. Hall, and J. Hollingsworth, “A Scalable Auto-Tuning Framework for Compiler Optimization,” in IEEE Intl. Symp. on Parallel & Distributed Processing (IPDPS), 2009, pp. 1–12.

[3] J. Xiong, J. Johnson, R. Johnson, and D. Padua, “SPL: A Language and Compiler for DSP Algorithms,” in ACM SIGPLAN Notices, vol. 36, no. 5, 2001, pp. 298–308.

[4] Q. Liu, T. Todman, J.G.F. Coutinho, W. Luk and G.A. Constantinides, “Optimising Designs by Combining Model-based and Pattern-based Transformations,” in IEEE Intl. Conf. on Field Programmable Logic and Applications (FPL), 2009, pp. 308–313.

[5] M. Lutz, Programming Python. O’Reilly Media, Inc., 2006.

[6] J.K. Ousterhout and K. Jones, Tcl and the Tk toolkit. Second Edition. Addison-Wesley, 2000.

[7] D. Flanagan, JavaScript: the Definitive Guide. O’Reilly Media, 2006.

[8] R. Laddad, “AspectJ in Action: Practical Aspect-Oriented Programming,” Recherche, vol. 67, 2003.

[9] O. Spinczyk, D. Lohmann, and M. Urban, “Advances in AOP with AspectC++,” in Proc. of Conf. on New Trends in Software Methodologies, Tools and Techniques (IOS Press), 2005, pp. 33–53.

[10] J.M.P. Cardoso, J. Teixeira, J.C. Alves, R. Nobre, P.C. Diniz, J.G.F. Coutinho, W. Luk, “Specifying Compiler Strategies for FPGA-based Systems,” in IEEE 20th Intl. Symp. on Field-Programmable Custom Computing Machines, 2012, pp. 192–199.

[11] J.G.F. Coutinho, T. Carvalho, S. Durand, J.M.P. Cardoso, R. Nobre, P.C. Diniz, W. Luk, “Experiments with the LARA Aspect-Oriented Approach,” in ACM Proc. of the 11th Annual Intl. Conf. on Aspect-Oriented Software Development Companion, 2012, pp. 27–30.

[12] Z. Petrov, K. Kratky, J.M.P. Cardoso, P.C. Diniz, “Programming Safety Requirements in the REFLECT Design Flow,” in 9th IEEE Intl. Conf. on Industrial Informatics (INDIN), 2011, pp. 841–847.

[13] G. A. Constantinides, “Perturbation Analysis for Word-length Optimization,” in 11th Annual IEEE Symp. on Field-Programmable Custom Computing Machines (FCCM), 2003, pp. 81–90.

[14] G. A. Constantinides, “The Complexity of Multiple Wordlength Assignment,” Applied Mathematics Letters, vol. 15, no. 2, pp. 137–140, Feb. 2002.

[15] S. Roy and P. Banerjee, “An Algorithm for Trading Off Quantization Error with Hardware Resources for MATLAB-Based FPGA Design,” IEEE Trans. on Computers, vol. 54, no. 7, pp. 886–896, Jul. 2005.

[16] J. Stolfi, “Self-Validated Numerical Methods and Applications,” Monograph for 21st Brazilian Mathematics, 1997.

[17] D. Boland and G. A. Constantinides, “Automated Precision Analysis: A Polynomial Algebraic Approach,” in 18th IEEE Annual Intl. Symp. on Field-Programmable Custom Computing Machines, 2010, pp. 157–164.

[18] D. U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, “Accuracy Guaranteed Bit-Width Optimization,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 10, pp. 1990–2000, 2006.

[19] M. Cantin, Y. Savaria, and P. Lavoie, “A Comparison of Automatic Word Length Optimization Procedures,” in IEEE Intl. Symp. on Circuits and Systems, vol. 2, 2002, pp. 612–615.

[20] G. Constantinides, A. Kinsman, and N. Nicolici, “Numerical Data Representations for FPGA-Based Scientific Computing,” IEEE Design & Test of Computers, vol. 28, pp. 8–17, 2011.

[21] J. Cong, B. Liu, S. Neuendorffer, J. Noguera, K. Vissers and Z. Zhang, “High-Level Synthesis for FPGAs: From Prototyping to Deployment,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 30, no. 4, pp. 473–491, 2011.

[22] Xilinx, ISE, “ISE Design Suite Software Manuals and Help,” 2010.
