
HiPART: A New Hierarchical Semi-Interactive HW-/SW Partitioning Approach with Fast Debugging for Real-Time Embedded Systems

Thomas Hollstein, Jürgen Becker, Andreas Kirschbaum, Manfred Glesner
Darmstadt University of Technology
Institute of Microelectronic Systems
Karlstr. 15, 64291 Darmstadt
{thomas|becker|glesner}@mes.tu-darmstadt.de

Abstract

In this contribution we present a new system-level hardware/software partitioning approach (HiPART), which is run within an integrated hardware/software design methodology for embedded system design. The benefits of the approach result from a hierarchical partitioning algorithm consisting of three phases of constructive and iterative methods. The main advantage of the system is a freely selectable degree of user interaction and manual partitioning. Permanent observation of timing constraint violations during partitioning guarantees applicability to real-time systems.

1. Introduction

The hardware/software codesign community has introduced a number of hardware/software partitioning approaches to speed up performance, to optimize hardware/software trade-offs, and to reduce total design time [6], [3], [11], [4], [12], [8], [7], [1], among others. These approaches operate at different partitioning granularities, ranging from fine-grain [6], over medium-grain [3], [11], to coarse-grain [12], [7]. The automated hardware/software partitioners use a variety of partitioning heuristics: an extended greedy algorithm [6], clustering [2], [9], simulated annealing [3], [7], dynamic programming [11], and a modified Kernighan/Lin heuristic [12]. Input programs are partitioned among software and custom (reconfigurable) hardware parts (processors), and optimized embedded systems are designed.

As part of DICE (Darmstadt Interactive Codesign of Embedded Systems), the proposed HiPART hierarchical partitioning approach implements a set of communicating C and VHDL processes on a heterogeneous target hardware platform consisting of different types of hardware modules (microprocessors, DSPs, FPGAs, ASICs), which are connected in an application-dependent manner by an optimized, synthesized, reconfigurable communication architecture [5], [10]. This library-based communication synthesis step improves overall system partitioning and design, since the corresponding information is reflected in the cost function to be optimized during the partitioning and debugging process.

1092-6100/98 $10.00 © 1998 IEEE

The introduced hierarchical partitioning method clusters closely related objects (instructions) in order to minimize communication and to handle increasing complexity. Such data-dependent clusters are analyzed with respect to their hardware and software performance using pre-computed hardware and software performance (cost) values of all objects. The analyzed clusters are partitioned by a coarse-grained heuristic based on simulated annealing. The resulting hardware/software partition is finally optimized by applying an extended Fiduccia/Mattheyses algorithm, which performs a "fine-tuning" at instruction-level granularity. Thus, the proposed hierarchical partitioning approach realizes a new hybrid fine-/coarse-grained partitioning strategy, which also incorporates optimization with respect to the subsequent synthesis of communication architectures.

Since the user has the option to move clusters from hardware to software and vice versa during the partitioning process, this approach makes it possible to integrate the system designer's and/or application knowledge. Because the cost function to be optimized is updated incrementally, alternative partitions are generated rapidly. Thus, the introduced method (including its graphical user interface) represents a step towards fast debugging tools for hardware/software systems, comparable to software debugging. The development of such debugging tools seems necessary, because hardware/software codesign approaches should also include application information in the codesign process in order to obtain efficient results.

This paper first gives a brief description of the hardware/software partitioning problem (Section 2). Section 3 briefly presents the codesign system DICE and its design flow. Section 4 describes the HiPART partitioning approach in detail. Section 5 discusses a computation-intensive application example, followed by a summary in Section 6.


2. Problem Definition

A promising design methodology for real-time embedded systems has to cope with several tasks in order to achieve an implementation that satisfies the system constraints. Mixed hardware/software systems are required if the targeted system performance cannot be achieved by a pure software solution. The main design tasks in hardware/software codesign are hardware/software partitioning and the synthesis of communication structures. To validate the functionality of the system at the current design state, hardware/software cosimulation is applied at different levels of abstraction, and rapid prototyping is used for the final implementation.

The system-level functional partitioning problem can be defined as follows:

Definition 2.1 (partitioning problem)

Instance: Given a partitioning graph $G = (V, E)$ consisting of a set $V = \{v_1, v_2, \ldots, v_{n_V}\}$ of vertices and a set $E = \{e_1, e_2, \ldots, e_{n_E}\}$, $E \subseteq V \times V$, of edges, and an edge cost function $c : E \to \mathbb{R}^+$. Nodes $v_i$ are functional objects (operations, processes or operation clusters) of a heterogeneous system specification. Edges $e_i$ represent interrelations of nodes (control flow dependencies, data dependencies, common properties and relations of interest). Given a set $U = \{u_1, u_2, \ldots, u_{n_P}\}$ of target units ($u_1$ is a software module, all other units are hardware modules), a set $AC = \{ac_2, \ldots, ac_{n_P}\}$ of unit hardware area constraints and a set $TC = \{tc_1, \ldots, tc_{n_{TC}}\}$ of timing constraints.

Configurations: All $n_P$-way partitionings $P = \{P_1, P_2, \ldots, P_{n_P}\}$ with $P_i \subseteq V$, $\bigcup_{i=1}^{n_P} P_i = V$, and $P_i \cap P_j = \emptyset$ for $i, j \in \{1, \ldots, n_P\}$, $i \neq j$.

Solutions: All partitionings $P = \{P_1, P_2, \ldots, P_{n_P}\}$ such that all $ac_i \in AC$ (with $i \in \{2, \ldots, n_P\}$) and all $tc_j \in TC$ (with $j \in \{1, \ldots, n_{TC}\}$) are satisfied.

Minimize: $c(P) = \sum_{i} c(e_i)$
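To illustrate Definition 2.1, the following minimal Python sketch checks whether a candidate set of parts is a valid configuration (the parts cover $V$ and are pairwise disjoint) and evaluates an edge-based cost. It assumes that only edges whose endpoints lie in different parts contribute to the cost, which the definition does not state explicitly. The function names `is_valid_partitioning` and `cut_cost` and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, List, Set, Tuple

Node = Hashable
Edge = Tuple[Node, Node]


def is_valid_partitioning(nodes: Set[Node], parts: List[Set[Node]]) -> bool:
    """True if the parts are pairwise disjoint and their union equals nodes."""
    seen: Set[Node] = set()
    for part in parts:
        if seen & part:              # two parts share a node
            return False
        seen |= part
    return seen == nodes


def cut_cost(parts: List[Set[Node]], edge_cost: Dict[Edge, float]) -> float:
    """Sum c(e) over edges crossing parts (assumption, see lead-in)."""
    part_of = {v: idx for idx, part in enumerate(parts) for v in part}
    return sum(c for (u, v), c in edge_cost.items()
               if part_of[u] != part_of[v])


# Hypothetical toy instance: parts[0] plays the role of the software unit.
nodes = {"a", "b", "c", "d"}
edge_cost = {("a", "b"): 3.0, ("b", "c"): 1.0, ("c", "d"): 2.0}
parts = [{"a", "b"}, {"c", "d"}]
print(is_valid_partitioning(nodes, parts), cut_cost(parts, edge_cost))
```

Area and timing constraints are deliberately left out here; they would be checked separately to decide whether a configuration is also a solution.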

Constructive and iterative algorithms can be applied to perform partitioning. In Section 4 we introduce a combined partitioning method with close interactive designer involvement. The main challenge of hardware/software partitioning is to find a trade-off between algorithm runtime and appropriate cost functions. A new feature of the HiPART algorithm is its capability to guarantee that solutions are found within a space bounded by predefined timing constraints.

3. DICE: HW/SW Codesign Environment

The design flow of the DICE system is shown in Fig. 1. A system specification consisting of any number of concurrent VHDL and C processes is converted to a concurrent CDFG (CCDFG).

[Figure 1, diagram not reproduced. Recoverable labels: VHDL description (processes P1 ... Pn), C description (processes C1 ... Ck), architecture, hardware and software estimation, database, conversion, CCDFG, final system (one or more ASICs, one or more µ-controllers, communication structure adapted to requirements).]

Figure 1. Design Flow in the DICE Codesign Environment

Communication between processes is done by abstract synchronous and asynchronous send and receive functions. During HW/SW cosimulation, profiling information is collected to be used for hardware and software performance estimations during partitioning. HW/SW partitioning is performed on the CCDFG using a semi-interactive approach, which the designer influences through the HiPART graphical user interface. After HW/SW repartitioning, code transformations C ↔ VHDL are performed for code pieces with modified implementation attributes. During communication synthesis, abstract communication operators are replaced by connections using buses, buses with shared memory, or FIFO buffers. In a subsequent cosimulation step, the resulting functionality and the compliance with constraints are validated. If some constraints are violated, a HW/SW repartitioning with stronger constraints is necessary. If not, the code can be synthesized/compiled to target architectures such as ASICs, FPGAs, µC, or the DICE rapid prototyping platform REPLICA.

4. HW/SW Partitioning

HiPART is a new hierarchical hardware/software partitioning algorithm. An initial system specification, consisting of any number of concurrent C and VHDL processes, is converted into a CCDFG (concurrent control/dataflow graph). This graph is visualized by the HiPART graphical user interface (GUI) CCDFGView, which provides all functionality for interactive partitioning control. The structure of the HiPART partitioning environment is depicted in Fig. 2.

Figure 2. HiPART Interactive Partitioning Methodology

The HW/SW partitioning strategy is to keep as much as possible in software. Timing constraints may be attached by the designer using the GUI. The partitioning takes these constraints into account, and hardware processes are generated if this is necessary to satisfy them. The basic partitioning granularity is fixed in a pre-clustering step. For CCDFGs with a large number of nodes, the complexity is reduced to a parametrizable number of nodes $n_{cl}$ by clustering. The resulting clusters (or CCDFG nodes) are partitioned by simulated annealing (s.a.). Since clustering reduces the optimization space, the s.a. may not find the absolute global cost minimum. Therefore additional optimization potential is explored in a refinement step, in which clusters are iteratively resolved and post-partitioned using the Fiduccia-Mattheyses heuristic (still observing the constraints). Between any two automatic partitioning steps, manual partitioning may be applied via the GUI. Absolute and differential cost monitoring of each step allows the designer to explore dependencies and to tune/debug the setting of algorithmic parameters efficiently.
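The simulated annealing step in this flow can be sketched in Python as follows. This is a minimal two-way (software/hardware) sketch under stated assumptions, not the HiPART implementation: the geometric cooling schedule, the parameter names (`t_start`, `t_end`, `alpha`, `moves_per_temp`), and the toy cost function with its numbers are all hypothetical. The toy cost mimics the stated strategy of keeping as much as possible in software while penalizing a violated deadline.

```python
import math
import random
from typing import Callable, Dict

Assignment = Dict[str, int]          # node -> 0 (software) or 1 (hardware)


def anneal(initial: Assignment,
           cost: Callable[[Assignment], float],
           t_start: float = 10.0,
           t_end: float = 0.01,
           alpha: float = 0.95,
           moves_per_temp: int = 50,
           seed: int = 0) -> Assignment:
    """Two-way simulated annealing; returns the best assignment seen."""
    rng = random.Random(seed)
    current = dict(initial)
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    nodes = list(current)
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            v = rng.choice(nodes)
            current[v] ^= 1                      # move v to the other side
            new_cost = cost(current)
            delta = new_cost - current_cost
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current_cost = new_cost          # accept (possibly uphill)
                if current_cost < best_cost:
                    best, best_cost = dict(current), current_cost
            else:
                current[v] ^= 1                  # reject: undo the move
        t *= alpha                               # geometric cooling (assumption)
    return best


# Hypothetical toy problem: software run times, hardware areas, one deadline.
sw_time = {"f": 5.0, "g": 1.0, "h": 1.0}
hw_area = {"f": 2.0, "g": 3.0, "h": 3.0}
deadline = 3.0


def toy_cost(a: Assignment) -> float:
    time = sum(sw_time[v] for v in a if a[v] == 0)
    area = sum(hw_area[v] for v in a if a[v] == 1)
    return area + 100.0 * max(0.0, time - deadline)


start = {"f": 0, "g": 0, "h": 0}
result = anneal(start, toy_cost)
print(result, toy_cost(result))
```

Accepting uphill moves with probability $e^{-\Delta/t}$ is what lets the search leave local minima; tracking the best assignment seen keeps the returned result from being worse than the starting point.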

4.1 Pre-Clustering

In the pre-clustering step the granularity of the basic partitioning entities is fixed. In classical partitioning approaches, single operations or basic blocks are used as atomic elements for partitioning. Since operations may differ significantly with respect to hardware area and execution performance (e.g. simple logic operators compared to fully parallel multiplication), a general selection of basic block granularity will not lead to an optimal initial graph configuration. Furthermore, designer knowledge cannot be included in systems with fixed granularity. Therefore we selected a heterogeneous granularity approach. First, an automatic pre-clustering related to operations or basic blocks is executed. In an additional interactive step, the user may pre-cluster sets of nodes that have to be mapped onto the same partition. In addition, operations that can be realized in software only (e.g. pointer operations) are pre-clustered automatically into one common partition.

4.2 Clustering

Clustering is applied in order to reduce the complexity, in terms of nodes, of the subsequent simulated annealing partitioning step. The maximum number of nodes is limited to a user-defined number $n_{cl}$ of cluster trees. This keeps the partitioning time within reasonable limits. If the number of CDFG nodes is less than $n_{cl}$, the clustering step need not be performed.

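The closeness function between two clusters is defined as follows:

$$
\begin{aligned}
d'(v_i, v_j) ={}& k_{comm} \cdot d_{comm}(v_i, v_j) \\
&+ k_{share} \cdot d_{share}(v_i, v_j)\,(1 - d_{flex}(v_i, v_j)) \\
&+ k_{flex} \cdot d_{flex}(v_i, v_j) \\
&+ k_{initial} \cdot d_{initial}(v_i, v_j)
\end{aligned}
$$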

$k_{comm}$, $k_{share}$, $k_{flex}$ and $k_{initial}$ are user-adjustable coefficients. $d_{comm}$ is the communication closeness, which contains all data dependencies between $v_i$ and $v_j$ multiplied by their profiling execution count and bitwidth. In order to enable hardware sharing, $d_{share}$ is a measure of the similarity of the operations contained in both clusters:

$$
d_{share}(v_i, v_j) = \frac{\sum_{ot \in SharedOp} size(ot, \min\{w_{max}(ot, v_i), w_{max}(ot, v_j)\})}{\sum_{ot \in SharedOp} size(ot, w_{max}(ot, v_i \cup v_j))}
$$

(with denominator zero check), where $w_{max}(ot, v)$ is:

$$
w_{max}(ot, v) = \max\{bitwidth(ot, n) \mid n \in v\}
$$

$d_{flex}$ is a user-defined measure of the probability of later specification changes of the code (no changes expected: 0, probably changed: 1). Nodes with high flexibility should be partitioned into software. $d_{initial}$ is a measure of the degree of coincidence with the initial specification (HW or SW).
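The closeness terms above can be made concrete with a short Python sketch. It is an illustration under assumptions rather than the authors' code: a cluster is modelled as a list of (operator type, bitwidth) pairs, an operator type absent from a cluster is treated as having width 0, and the area estimate `size` (quadratic for multipliers, linear otherwise) is purely hypothetical. The values of $d_{comm}$, $d_{flex}$ and $d_{initial}$ are taken as given inputs.

```python
from typing import Callable, List, Set, Tuple

Op = Tuple[str, int]                 # (operator type, bitwidth) of one node
Cluster = List[Op]


def w_max(ot: str, cluster: Cluster) -> int:
    """Largest bitwidth of operator type ot in the cluster (0 if absent)."""
    return max((w for t, w in cluster if t == ot), default=0)


def d_share(vi: Cluster, vj: Cluster, shared_ops: Set[str],
            size: Callable[[str, int], float]) -> float:
    """Sharing closeness of two clusters over the shareable operator types."""
    num = sum(size(ot, min(w_max(ot, vi), w_max(ot, vj))) for ot in shared_ops)
    den = sum(size(ot, w_max(ot, vi + vj)) for ot in shared_ops)
    return num / den if den > 0 else 0.0         # denominator zero check


def closeness(d_comm: float, d_sh: float, d_flex: float, d_initial: float,
              k_comm: float = 1.0, k_share: float = 1.0,
              k_flex: float = 1.0, k_initial: float = 1.0) -> float:
    """Weighted sum d' of the four closeness terms."""
    return (k_comm * d_comm
            + k_share * d_sh * (1.0 - d_flex)
            + k_flex * d_flex
            + k_initial * d_initial)


def size(ot: str, w: int) -> float:
    # Hypothetical area model: multipliers grow quadratically with width.
    return float(w * w) if ot == "mul" else float(w)


vi = [("mul", 16), ("add", 8)]
vj = [("mul", 8), ("add", 8)]
print(d_share(vi, vj, {"mul", "add"}, size))
print(closeness(d_comm=2.0, d_sh=0.5, d_flex=0.5, d_initial=1.0))
```

The factor $(1 - d_{flex})$ suppresses the sharing term for clusters that are likely to change, which matches the stated preference for keeping flexible nodes in software.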

Algorithm 1 Clustering

    initialize all nodes v_i of G_cl(V, E) by CDFG nodes
    initialize edge heap: H = ∅
    for all pairs of nodes v_i, v_j ∈ V do
        d(v_i, v_j) = ComputeCloseness(v_i, v_j)
        if d(v_i, v_j) > 0 then
            insert edge {v_i, v_j} into heap H
        end if
    end for
    while |V| > n_cl and H not empty do
        select from H edge {v_i, v_j} with maximum d(v_i, v_j)
        for all v_k ∈ adj(v_i), k ≠ i, k ∈ {1, ..., |adj(v_i)|} do
            H := H \ {v_i, v_k}
        end for
        for all v_k ∈ adj(v_j), k ≠ j, k ∈ {1, ..., |adj(v_j)|} do
            H := H \ {v_j, v_k}
        end for
        v_new := v_i ∪ v_j
        properties(v_new) = MergeProperties(v_i, v_j)
        V := (V \ {v_i, v_j}) ∪ {v_new}
        for all v_k ∈ (adj(v_i) ∪ adj(v_j)) \ {v_i, v_j, v_new} do
            d(v_new, v_k) = ComputeCloseness(v_new, v_k)
            if d(v_new, v_k) > 0 then
                insert edge {v_new, v_k} into H
            end if
        end for
    end while

The value of the closeness measure $d'(v_i, v_j)$ increases with the size of $v_i$ and $v_j$. In order to achieve an equalized cluster size distribution, the closeness is reduced as follows:

$$
d(v_i, v_j) =
\begin{cases}
\dfrac{d'(v_i, v_j)}{attenuation \cdot (|v_i| + |v_j|)} & |v_i| + |v_j| > N \\[2ex]
d'(v_i, v_j) & \text{else}
\end{cases}
$$

with $N = threshold \cdot |V_{CDFG}| / n_{cl}$. The values of $threshold$ and $attenuation$ can be parametrized.

The clustering algorithm itself is straightforward (Alg. 1) and follows the strategy of selecting the clusters to be fused by closeness priority. In each clustering step, the cost of all clusters adjacent to the fused ones has to be updated.
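The closeness-priority fusion strategy can be sketched in Python with the standard `heapq` module. This is a simplified variant, not a transcription of Alg. 1: stale heap entries are skipped when popped (lazy deletion) instead of being removed explicitly, and the closeness to every remaining cluster is recomputed after a merge, relying on the closeness function returning 0 for non-adjacent clusters. The function name `cluster_graph`, the toy edge weights, and the communication-only closeness are hypothetical.

```python
import heapq
import itertools
from typing import Callable, FrozenSet, Iterable, List, Set, Tuple

Cluster = FrozenSet[str]


def cluster_graph(nodes: Iterable[str],
                  closeness: Callable[[Cluster, Cluster], float],
                  n_cl: int) -> Set[Cluster]:
    """Fuse the closest clusters until at most n_cl clusters remain."""
    clusters: Set[Cluster] = {frozenset([n]) for n in nodes}
    tie = itertools.count()          # tie-breaker: heap never compares clusters
    heap: List[Tuple[float, int, Cluster, Cluster]] = []
    for a, b in itertools.combinations(clusters, 2):
        d = closeness(a, b)
        if d > 0:
            heapq.heappush(heap, (-d, next(tie), a, b))   # max-heap via -d
    while len(clusters) > n_cl and heap:
        _, _, a, b = heapq.heappop(heap)
        if a not in clusters or b not in clusters:
            continue                 # stale entry: a or b was already fused
        merged = a | b
        clusters -= {a, b}
        for other in clusters:       # update closeness to the fused cluster
            d = closeness(merged, other)
            if d > 0:
                heapq.heappush(heap, (-d, next(tie), merged, other))
        clusters.add(merged)
    return clusters


# Hypothetical toy graph: closeness = total weight of edges between clusters.
weights = {("a", "b"): 5.0, ("b", "c"): 1.0, ("c", "d"): 4.0}


def comm_closeness(x: Cluster, y: Cluster) -> float:
    return sum(w for (u, v), w in weights.items()
               if (u in x and v in y) or (u in y and v in x))


result = cluster_graph("abcd", comm_closeness, n_cl=2)
print(sorted(sorted(c) for c in result))
```

Lazy deletion trades a somewhat larger heap for simpler bookkeeping; an implementation that follows Alg. 1 literally would remove the edges incident to the fused clusters before inserting the new ones.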

4.3 Partitioning

The main partitioning step is done by simulated annealing (s.a.). Advantages of s.a. are its ability to overcome local minima of the cost function and its robustness with respect to modifications of the cost function. Given a partitioning graph $G_P(V, E)$ (resulting from clustering or manual prepartitioning), a set $TC$ of timing constraints and a set $AC$ of hardware area constraints, a cost function $c(P, G_P(V, E), TC, AC)$, which assigns a cost value to each possible partitioning, is minimized during partitioning. In contrast to Definition 2.1, constraints are included in the cost function. Constraint violations are punished by a very high cost increase. Benefits of this concept are a unified cost function and the possibility of temporarily entering forbidden areas during optimization. The cost function is defined as follows:

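$$
\begin{aligned}
c(P, G_P(V, E), TC, AC) ={}& \sum_{p = P_2}^{P_{n_P}} \left( c_{area}(E, p) + k_{share}\, c_{share}(p) \right) \\
&+ \frac{1}{|V|} \sum_{v \in V \setminus P_1} k_{flex}\, c_{flex}(v) \\
&+ \frac{1}{|V|} \sum_{v \in V} k_{initial}\, c_{initial}(v) \\
&+ \sum_{e \in E} k_{comm}\, c_{comm}(e) \\
&+ \sum_{tc \in TC} c_{constr}(tc)
\end{aligned}
$$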

$k_{area}$, $k_{share}$, $k_{flex}$, $k_{initial}$ and $k_{comm}$ are coefficients that allow the designer to adjust the cost function. The area costs of hardware partitions are calculated as

$$
c_{area}(E, p) = k_{area}\, area(E, p) + c_{area\_violation}(E, p)
$$

with

$$
area(E, p) = area\_op(p) + area\_wire(E, p)
$$

$$
c_{area\_violation}(E, p) =
\begin{cases}
0 & area(E, p) < ac(p) \\
e^{20\,(area(E, p) - ac(p))} - 1 & \text{else}
\end{cases}
$$
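A minimal Python sketch of the area cost term follows. It assumes that areas and the constraint $ac(p)$ are given in normalized units, so that the exponent stays small; the function names, the `steepness` parameter (set to the constant 20 from the formula), and the numeric values are hypothetical.

```python
import math


def area_violation_cost(area: float, ac: float, steepness: float = 20.0) -> float:
    """0 while area < ac, exp(steepness * (area - ac)) - 1 otherwise."""
    if area < ac:
        return 0.0
    return math.exp(steepness * (area - ac)) - 1.0


def area_cost(area_op: float, area_wire: float, ac: float,
              k_area: float = 1.0) -> float:
    """Weighted area of a hardware partition plus its violation penalty."""
    area = area_op + area_wire
    return k_area * area + area_violation_cost(area, ac)


# Normalized toy values (assumption): constraint ac = 1.0.
print(area_cost(0.5, 0.1, ac=1.0))   # constraint met: plain weighted area
print(area_cost(0.9, 0.2, ac=1.0))   # constraint exceeded: penalty dominates
```

Because the penalty is 0 exactly at the constraint boundary, the cost is continuous there, so an annealing run can cross briefly into the infeasible region and return.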

The area $area\_op(p)$ required for functional operators consists of separate contributions for shared operators (with a rough multiplexer estimation) and non-shared operators. The wiring area is derived from the total hardware area by a constant factor. The uniformity of shared operations in hardware partitions is targeted by inclusion of the factor

$$
c_{share}(p) = \frac{n_{types}(p)}{n_{types}}
$$

where $n_{types}$ is the number of shared operator types in the whole system specification. This contribution forces operation types that have to be shared due to designer selection to be drawn into the same partition. The flexibility measure generates costs for preliminary hardware clusters only and is computed based on the nodes $n$ contained in a cluster $v$:

$$
c_{flex}(v) = \frac{1}{|v|} \sum_{n \in v} flex(n)
$$

Costs for deviation from the initial specification type are included for clusters implemented in hardware or software:

$$
c_{initial}(v) = \frac{1}{|v|} \sum_{n \in v} initial(n)
$$
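The three per-partition and per-cluster terms above are simple ratios and averages, as the following Python sketch shows. Two points are assumptions: $n_{types}(p)$ is read as the number of shared operator types present in partition $p$, and $initial(n)$ is read as 1 for a node whose current implementation deviates from its initially specified one and 0 otherwise. The helper names are hypothetical.

```python
from typing import Iterable, Set


def c_share(shared_types_in_p: Set[str], n_types: int) -> float:
    """Fraction of the system's shared operator types present in partition p."""
    return len(shared_types_in_p) / n_types if n_types > 0 else 0.0


def cluster_mean(values: Iterable[float]) -> float:
    """Average of a per-node measure over the nodes of one cluster."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0


# c_flex: per-node flexibility values flex(n) of a hardware cluster.
c_flex = cluster_mean([0.0, 1.0, 0.5, 0.5])
# c_initial: one of four nodes deviates from its initial HW/SW type (assumed 0/1).
c_initial = cluster_mean([1.0, 0.0, 0.0, 0.0])
print(c_share({"mul"}, 4), c_flex, c_initial)
```

A partition that hosts fewer distinct shared operator types gets a lower $c_{share}$, which is how the term favours collecting operations of the same type in one partition.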

Communication costs are considered for inter-partition edges $e_{ext} = (v_i, v_j)$ by inclusion of:

$$
c_{comm}(e) =
\begin{cases}
0 & part(v_i) = part(v_j) \\
bitwidth(e) \cdot profiling(e) & \text{else}
\end{cases}
$$
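The communication term lends itself to incremental evaluation, because moving one node only changes the cost of its incident edges. The Python sketch below shows both the full sum and such a differential update. It is an illustration under assumptions: the `Edge` record, the function names, and the toy values are hypothetical, and the paper does not describe how its incremental update is implemented.

```python
from typing import Dict, List, NamedTuple


class Edge(NamedTuple):
    src: str
    dst: str
    bitwidth: int
    profiling: int                   # execution count from profiling


def c_comm(e: Edge, part: Dict[str, int]) -> int:
    """0 for an edge inside one partition, bitwidth * profiling otherwise."""
    if part[e.src] == part[e.dst]:
        return 0
    return e.bitwidth * e.profiling


def total_comm(edges: List[Edge], part: Dict[str, int],
               k_comm: float = 1.0) -> float:
    return sum(k_comm * c_comm(e, part) for e in edges)


def delta_comm_on_move(edges: List[Edge], part: Dict[str, int],
                       node: str, target: int, k_comm: float = 1.0) -> float:
    """Cost change if node moves to partition target (incident edges only)."""
    incident = [e for e in edges if node in (e.src, e.dst)]
    before = total_comm(incident, part, k_comm)
    moved = dict(part)
    moved[node] = target
    return total_comm(incident, moved, k_comm) - before


edges = [Edge("a", "b", 16, 100), Edge("b", "c", 8, 10)]
part = {"a": 0, "b": 0, "c": 1}
print(total_comm(edges, part), delta_comm_on_move(edges, part, "b", 1))
```

Moving `b` next to `c` removes the cheap cut edge but cuts the heavily used 16-bit edge, so the differential cost is strongly positive and the move would be unattractive.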

Constraints are included with strong violation punishment by an exponential function ($oc$ = over-constraining):

$$
c_{constr}(tc) =
\begin{cases}
0 & \text{if } oc(tc) \cdot d(path(tc)) \le t_{max}(tc) \\
e^{k_{constr}\,(oc(tc) \cdot d(path(tc)) - t_{max}(tc))} - 1 & \text{else}
\end{cases}
$$

In the real implementation, the exponential function is defined piecewise for a certain interval and then continued by a linear function with a very large gradient (the gradient at the interval edge is continued as a constant).
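The piecewise definition described above can be sketched in Python as follows: exponential up to an interval edge, then a straight line that continues with the slope reached at that edge. This is a sketch under assumptions. The interval edge `exp_limit` (expressed in units of the exponent) and its default value are hypothetical, since the paper does not give the interval used.

```python
import math


def constraint_cost(delay: float, t_max: float, oc: float = 1.0,
                    k_constr: float = 1.0, exp_limit: float = 5.0) -> float:
    """Timing penalty: 0 if met, exponential, then linear beyond exp_limit."""
    violation = oc * delay - t_max
    if violation <= 0:
        return 0.0
    z = k_constr * violation                     # exponent of the penalty
    if z <= exp_limit:
        return math.exp(z) - 1.0
    # Linear continuation: same value and same slope as exp at z = exp_limit.
    edge_value = math.exp(exp_limit) - 1.0
    return edge_value + math.exp(exp_limit) * (z - exp_limit)


print(constraint_cost(1.0, 2.0))     # constraint met
print(constraint_cost(3.0, 2.0))     # exponential region
print(constraint_cost(1e6, 0.0))     # linear region: large but finite
```

The linear tail keeps the penalty finite and its gradient bounded for gross violations, where a pure exponential would overflow floating-point arithmetic.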

4.4 Refinement

During refinement, initial clustering stages may be resolved iteratively, combined with partitioning using a modified Fiduccia-Mattheyses heuristic (with the complex cost function). This achieves a further local minimization of the cost function.
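A single refinement pass in the spirit of Fiduccia-Mattheyses can be sketched in Python as follows. It is a deliberately simplified illustration, not the modified heuristic used in HiPART: the full cost is recomputed for each tentative move instead of maintaining gain buckets, and the toy cost function with its numbers is hypothetical. What it keeps from the original heuristic is that every node is moved at most once per pass, the best available move is committed even if it raises the cost, and the best intermediate assignment is returned.

```python
from typing import Callable, Dict

Assignment = Dict[str, int]          # node -> 0 (software) or 1 (hardware)


def fm_pass(initial: Assignment,
            cost: Callable[[Assignment], float]) -> Assignment:
    """One pass: move each node once, keep the best assignment seen."""
    current = dict(initial)
    best, best_cost = dict(current), cost(current)
    unlocked = set(current)

    def cost_if_moved(v: str) -> float:
        current[v] ^= 1              # tentatively move v
        c = cost(current)
        current[v] ^= 1              # undo
        return c

    while unlocked:
        v = min(sorted(unlocked), key=cost_if_moved)   # best single move
        current[v] ^= 1              # commit, even if the cost rises
        unlocked.remove(v)           # lock v for the rest of the pass
        c = cost(current)
        if c < best_cost:
            best, best_cost = dict(current), c
    return best


# Hypothetical toy problem: software run times, hardware areas, one deadline.
sw_time = {"f": 5.0, "g": 1.0, "h": 1.0}
hw_area = {"f": 2.0, "g": 3.0, "h": 3.0}


def toy_cost(a: Assignment) -> float:
    time = sum(sw_time[v] for v in a if a[v] == 0)
    area = sum(hw_area[v] for v in a if a[v] == 1)
    return area + 100.0 * max(0.0, time - 3.0)


start = {"f": 1, "g": 1, "h": 0}
refined = fm_pass(start, toy_cost)
print(refined, toy_cost(refined))
```

In this toy case the pass moves `g` back to software, which still meets the deadline and saves hardware area.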

5. Partitioning Results

The HiPART algorithm has been applied to several applications: a fuzzy controller, a combustion engine control and a compress algorithm. The first two examples could be partitioned within seconds without clustering, due to their small number of nodes (< 100). The partitioning has also been applied to a C compress algorithm (868 nodes). Timing constraints have been set on the full execution time. The partitioning into a mixed HW/SW realization was performed successfully within five minutes of CPU time (SPARC20).

6. Conclusion

The paper presented a new hierarchical partitioning approach that realizes a new hybrid fine-/coarse-grained partitioning strategy and allows any degree of user interaction. Due to fast incremental updates of the partitioning cost function values, the proposed method is a step towards debugging tools for hardware/software systems. In view of the increasing complexity and heterogeneity of applications, the system designer's knowledge and application information can be incorporated into the overall codesign process. In many cases this seems to be necessary to obtain efficient solutions.

DICE represents a hardware/software codesign system applicable both to implementing complex applications on a heterogeneous target hardware platform (incl. optimized synthesized communication architectures) and to rapid system prototyping of efficient implementation solutions for systems-on-a-chip (SoCs).

References

[1] J. K. Adams and D. E. Thomas. The Design of Mixed Hardware/Software Systems. In Proceedings of the Design Automation Conference, pages 515-520, June 1996.

[2] E. Barros, W. Rosenstiel, and X. Xiong. A Method for Partitioning UNITY Language in Hardware and Software. In Proceedings of the European Conference on Design Automation, pages 220-225, Sept. 1994.

[3] R. Ernst, J. Henkel, and T. Benner. Hardware-Software Cosynthesis for Microcontrollers. In IEEE Design & Test, pages 64-75, Dec. 1993.

[4] D. Gajski, F. Vahid, S. Narayan, and J. Gong. Specification and Design of Embedded Systems. Prentice-Hall, 1994.

[5] M. Gasteier. Cosimulation und Kommunikationssynthese im Entwurf gemischter Hardware/Software-Systeme. PhD thesis, Darmstadt University of Technology.

[6] R. K. Gupta, C. N. Coelho, and G. De Micheli. Program Implementation Schemes for Hardware-Software Systems. IEEE Computer, pages 48-55, Jan. 1994.

[7] R. W. Hartenstein and J. Becker. Two-Level Partitioning of Image Processing Algorithms for the Parallel Map-oriented Machine. In Proceedings of the 4th Int. Workshop on Hardware/Software Co-Design CODES/CASHE '96, Pittsburgh.

[8] J. Hou and W. Wolf. Process Partitioning for Distributed Embedded Systems. In Proceedings of the Int. Workshop on HW/SW Codesign (CODES/CASHE), pages 70-76, Mar. 1996.

[9] T. B. Ismail, K. O'Brien, and A. Jerraya. Interactive System-level Partitioning with PARTIF. In Proceedings of the European Design & Test Conference, pages 464-468, Mar. 1994.

[10] A. Kirschbaum and M. Glesner. Rapid Prototyping of Communication Architectures. In IEEE Workshop on Rapid System Prototyping, pages 136-141, Chapel Hill, USA, June 1997.

[11] P. V. Knudsen and J. Madsen. A Dynamic Programming Algorithm for Hardware/Software Partitioning. In Proceedings of the CODES/CASHE '96, Pittsburgh, USA.

[12] F. Vahid. Modifying Min-Cut for Hardware and Software Functional Partitioning. In Proceedings of the 5th Int. Workshop on Hardware/Software Co-Design CODES/CASHE '97, Braunschweig, Germany.

