IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR...

11
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS 1 Tap Delay-and-Accumulate Cost Aware Coefficient Synthesis Algorithm for the Design of Area-Power Efficient FIR Filters Jiajia Chen, Member, IEEE, Chip-Hong Chang, Senior Member, IEEE, Jiatao Ding, Rui Qiao, and Mathias Faust, Member, IEEE Abstract—Finite-impulse response filters are widely used in digital signal processing applications. Prodigious research in the past two decades has substantially reduced the implementation cost of the multiple constant multiplication blocks. Further area and power consumption savings are stagnated by the structural adders and registers in the tap delay-and-accumulate line, which unfortunately dominate the overall hardware cost of FIR filter and are difficult to minimize by existing resource sharing approaches. Retiming or relocating the structural adders and registers can improve merely the throughput. To close the area-power efficiency gap, we reformulate the filter coefficient synthesis problem to explore the design space for the tap delay- and accumulate line by bisecting at some tap position. An efficient Genetic Algorithm is proposed to solve this integer programming problem at quadratic computational complexity by refining the search space for finding an optimized solution to fulfill the frequency response specifications. Field programmable gate array and application specific integrated circuit logic synthesis results from twelve benchmark filter specifications showed that the average area and power consumptions of the solutions generated by our proposed algorithm have been reduced by up to 26.8% and 27.5% respectively, in comparison with the solutions obtained by existing design methods. Index Terms—FIR filter design, structural adder, digital signal processing. I. I NTRODUCTION F INITE Impulse Response (FIR) filter for signal condi- tioning is commonly realized in digital domain owing to its stability and linear phase response. The research in low complexity FIR filter design methodologies continues to thrive as footprint and power budget on smart devices are increasingly squeezed by technology revolution such as the upcoming 5G communication and recent proliferation of augmented and virtual reality multimedia applications. Manuscript received May 8, 2017; revised July 6, 2017; accepted July 7, 2017. This work was supported by the SUTD-MIT International Design Centre, Singapore University of Technology and Design. This paper was recommended by Associate Editor G. Jovanovic Dolecek. (Corresponding author: Chip-Hong Chang.) J. Chen, J. Ding, and R. Qiao are with the Singapore University of Technology and Design, Singapore 487372 (e-mail: [email protected]). C.-H. Chang is with the School of Electrical and Electronic Engi- neering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]). M. Faust is with Mfnet GmbH, 4132 Muttenz, Switzerland (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCSI.2017.2725916 The emerging Internet of Things (IoT) [1] entails the digitiza- tion and interconnection of ubiquitous sensors and distributed nodes with central computing resource. Industry 4.0 also mandates integration of heterogeneous sensors into intelligent sensor network to provide fully automated control. Appro- priate signal conditioning circuits are needed to remove the undesired noises and interferences [2]–[5]. As application- specific digital filters play influential role in the SNR, response time, robustness and mobility of sensor network, the stagnation in their size, power, sensitivity and accuracy will impact the techno-optimism of IoT and the related technologies [6]. Existing algorithms [7]–[13] for the design of low complex- ity FIR filter mostly minimize the multiple constant multipli- cation (MCM) block and neglect the complexity of structural adders and registers in the tap delay-and-accumulate (TDA) block of FIR filter. The main reason is the operands of the structural adders are derived from different time-delayed input samples. Unlike the MCM block, there is neither exploitable correlation nor apparent redundancy to enable resource sharing by common subexpression elimination. Unfortunately, these structural adders are more expensive and power hungry than the sharable adders in the optimized MCM block, which makes them the hindrance to further area and power reduction of FIR filter implementation. Some recent algorithms, such as [14] and [15], investigated this problem and modified the structure of the TDA block to reduce the overall complex- ity or increase the throughput of the filter. In [14], splitting and remerging of operands to selected structural adders is used to control the progressive growth in the sizes of structural adders and registers in different segments of TDA block. In [15], pipeline registers are introduced to retime the critical path and increase the throughput of the filters. The effectiveness of these methods is still largely limited by the pre-determined filter coefficient set. The area and power consumption of the filter designed in [15] can and are likely to increase due to the additional pipeline registers. This is because the primary objective of this method is to reduce the filter delay at the expense of reasonable hardware overhead. Taking cognizance of the limitation imposed by fixed filter coefficients on the splitting of structural adders for complexity reduction, we propose a new design algorithm to unleash the restriction of TDA block optimization. Instead of retiming the critical adder paths or synthesizing filter coefficients to 1549-8328 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Transcript of IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR...

Page 1: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS 1

Tap Delay-and-Accumulate Cost Aware CoefficientSynthesis Algorithm for the Design of

Area-Power Efficient FIR FiltersJiajia Chen, Member, IEEE, Chip-Hong Chang, Senior Member, IEEE, Jiatao Ding,

Rui Qiao, and Mathias Faust, Member, IEEE

Abstract— Finite-impulse response filters are widely used indigital signal processing applications. Prodigious research in thepast two decades has substantially reduced the implementationcost of the multiple constant multiplication blocks. Furtherarea and power consumption savings are stagnated by thestructural adders and registers in the tap delay-and-accumulateline, which unfortunately dominate the overall hardware costof FIR filter and are difficult to minimize by existing resourcesharing approaches. Retiming or relocating the structural addersand registers can improve merely the throughput. To close thearea-power efficiency gap, we reformulate the filter coefficientsynthesis problem to explore the design space for the tap delay-and accumulate line by bisecting at some tap position. An efficientGenetic Algorithm is proposed to solve this integer programmingproblem at quadratic computational complexity by refining thesearch space for finding an optimized solution to fulfill thefrequency response specifications. Field programmable gate arrayand application specific integrated circuit logic synthesis resultsfrom twelve benchmark filter specifications showed that theaverage area and power consumptions of the solutions generatedby our proposed algorithm have been reduced by up to 26.8%and 27.5% respectively, in comparison with the solutions obtainedby existing design methods.

Index Terms— FIR filter design, structural adder, digital signalprocessing.

I. INTRODUCTION

F INITE Impulse Response (FIR) filter for signal condi-tioning is commonly realized in digital domain owing

to its stability and linear phase response. The research inlow complexity FIR filter design methodologies continuesto thrive as footprint and power budget on smart devicesare increasingly squeezed by technology revolution such asthe upcoming 5G communication and recent proliferationof augmented and virtual reality multimedia applications.

Manuscript received May 8, 2017; revised July 6, 2017; acceptedJuly 7, 2017. This work was supported by the SUTD-MIT InternationalDesign Centre, Singapore University of Technology and Design. Thispaper was recommended by Associate Editor G. Jovanovic Dolecek.(Corresponding author: Chip-Hong Chang.)

J. Chen, J. Ding, and R. Qiao are with the Singapore University ofTechnology and Design, Singapore 487372 (e-mail: [email protected]).

C.-H. Chang is with the School of Electrical and Electronic Engi-neering, Nanyang Technological University, Singapore 639798 (e-mail:[email protected]).

M. Faust is with Mfnet GmbH, 4132 Muttenz, Switzerland (e-mail:[email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2017.2725916

The emerging Internet of Things (IoT) [1] entails the digitiza-tion and interconnection of ubiquitous sensors and distributednodes with central computing resource. Industry 4.0 alsomandates integration of heterogeneous sensors into intelligentsensor network to provide fully automated control. Appro-priate signal conditioning circuits are needed to remove theundesired noises and interferences [2]–[5]. As application-specific digital filters play influential role in the SNR, responsetime, robustness and mobility of sensor network, the stagnationin their size, power, sensitivity and accuracy will impact thetechno-optimism of IoT and the related technologies [6].

Existing algorithms [7]–[13] for the design of low complex-ity FIR filter mostly minimize the multiple constant multipli-cation (MCM) block and neglect the complexity of structuraladders and registers in the tap delay-and-accumulate (TDA)block of FIR filter. The main reason is the operands of thestructural adders are derived from different time-delayed inputsamples. Unlike the MCM block, there is neither exploitablecorrelation nor apparent redundancy to enable resource sharingby common subexpression elimination. Unfortunately, thesestructural adders are more expensive and power hungry thanthe sharable adders in the optimized MCM block, whichmakes them the hindrance to further area and power reductionof FIR filter implementation. Some recent algorithms, suchas [14] and [15], investigated this problem and modified thestructure of the TDA block to reduce the overall complex-ity or increase the throughput of the filter. In [14], splitting andremerging of operands to selected structural adders is used tocontrol the progressive growth in the sizes of structural addersand registers in different segments of TDA block. In [15],pipeline registers are introduced to retime the critical pathand increase the throughput of the filters. The effectivenessof these methods is still largely limited by the pre-determinedfilter coefficient set. The area and power consumption of thefilter designed in [15] can and are likely to increase due tothe additional pipeline registers. This is because the primaryobjective of this method is to reduce the filter delay at theexpense of reasonable hardware overhead.

Taking cognizance of the limitation imposed by fixed filtercoefficients on the splitting of structural adders for complexityreduction, we propose a new design algorithm to unleash therestriction of TDA block optimization. Instead of retimingthe critical adder paths or synthesizing filter coefficients to

1549-8328 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS

Fig. 1. Transposed direct form FIR filter with structural adders A0 to AN−2and registers R0 to RN−2.

maximize sharing of common subexpressions in MCM blockwithout compromising the magnitude response specification,the finite-precision filter coefficients are synthesized to findan optimal splitting point to achieve the maximum effectivecost saving of TDA block with a simple split-once technique.The coefficient synthesis problem is modelled as an integerprogramming problem and solved by a customized GeneticAlgorithm (GA) [16]. To the best of our knowledge, this isthe first filter coefficient synthesis algorithm that tackles theminimization of TDA block in the literature.

The rest of this paper is organized as follows. Section IIprovides an overview of TDA bisection technique and howthe performance is affected by choosing different bisectionposition. Section III introduces our formulation of the filtercoefficient synthesis problem and presents our customizedGA solver to this problem. Section IV presents and comparesthe synthesis results of a group of benchmark filters, gener-ated by our proposed algorithm and three other competingalgorithms. Lastly, our conclusion is provided in Section V.

II. PRELIMINARIES AND MOTIVATION

A. Complexity of MCM and TDA Blocks in FIR Filter

An N-tap finite impulse response (FIR) digital filter can berealized from the following discrete time convolution.

y[n] = h[n] ∗ x[n] =N−1∑

i=0

h[i ]x [n − i] (1)

where x[n] and y[n] are the n-th time domain input andoutput data samples, respectively. h[i ], i = 0, 1, . . . , N − 1,are the filter coefficients of the impulse response function

H (z) =N−1∑i=0

h [i ] z−i .

Fig. 1 shows the transposed direct form implementation ofa FIR filter. For linear-phase FIR filter, h[n] = h[N − 1 − n]and

H (z) =

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎪⎪⎩

h

(N − 1

2

)z− N−1

2

+N−1

2 −1∑

i=0

h (i)(

z−i + z−N−1−i)

odd N

N2 −1∑

i=0

h (i)(

z−i + z−N−1−i)

even N

(2)

TABLE I

COMPARISON OF HARDWARE RESOURCESBETWEEN MCM AND TDA BLOCKS

Due to the coefficient symmetry, only (N−1)/2 distinctcoefficients need to be computed within the MCM blockof Fig. 1, but the numbers of structural adders (SAs) andregisters required to sum the delayed partial sums (whichare the product of the input sample x(n) and filtercoefficient h(i)) in the cascaded TDA line remain unchanged.Unfortunately, TDA block uses more than double the hardwareresources of MCM block, as evidenced by the number of fulladders (#FA), full subtractors (#FS) and flip-flops (#FF) usedby the MCM and TDA blocks of four FIR filters [7], [18]–[20]that are widely

used for benchmarking MCM minimization algorithmsin Table I. The filter tap N , coefficient wordlength w andcoefficients of these filters are obtained by Park-Mclleanalgorithm [17] and the MCM blocks are optimized using Paskoalgorithm [7]. 8-bit input variable x is assumed for all filters.The ratio of TDA area to MCM area is shown in the last rowof Table I by assuming that the area of a FF is comparableto that of a FA. It is apparent that it takes only a smallfraction of TDA area to outweigh any area savings derivedfrom the MCM block minimization. For the case of FIR2,a trivial 5.5% saving from TDA block is more than sufficientto cover the area of the entire MCM block. This is because theSAs in TDA block are much larger than the adders and sub-tractors in MCM. Unfortunately, as the input signals to eachSA are derived from the input signal sampled at differentclock cycles, the addends are largely uncorrelated. Hence, theSAs cannot be reduced by common subexpression eliminationas in MCM block.

As the impulse response is a sinc-like function, the outputsof MCM block have a 1/x envelope. As each addend is theproduct of a constant and a fixed-width integer variable x ,the inputs to the SA are discrete random integer variables.Thus, the number of distinct integer output values is far lowerthan 2wx N for the accumulation of N continuous randominteger variables, where wx is the word length of the inputvariable x . For example, the 121-tap FIR filter [19] with 8-bitinput x has less than 225 ≈ 3.3 × 107 possible output valuesas opposed to 256121 ≈ 2.5 × 10291 different output values.It has been shown in [21] that central limit theorem holdsfor the probability distribution of SA output values. As theprobability of recurrent integer outputs increases after somenumber of accumulations, the bit width required to representthe output of a SA grows towards the output of the filter till themiddle tap and remains relatively constant from the middle tap

Page 3: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEN et al.: TDA COST AWARE COEFFICIENT SYNTHESIS ALGORITHM 3

Fig. 2. Bisection of partial sum p j before SA A j .

to the output. The probability of the output values of a fixed-coefficient FIR filter can thus be approximated by a normaldistribution. In fact, the tail values of the actual distribution areeven less likely to occur than those at the tails of the normaldistribution. This means that the range of SA outputs in thelater half of the TDA can be accurately estimated from wx

and the predetermined filter coefficient values to reduce thelengths of the corresponding SAs.

B. SA Reduction by Bisection of Long Operand

Let pi and qi be the two inputs to the i -th SA, denotedby Ai , where pi is the output of the i -th register Ri of theTDA block and qi is the i -th coefficient multiplier output h(i)·x[n] of MCM block in Fig. 1. Let w(α) denote the word lengthof a signal variable α. In [14], the sizes of some SAs arereduced by bisecting a selected long partial sum p j at thej -th tap into two parts pu

j and plj as shown in Fig. 2. By saving

the upper part puj for m (1 ≤ m ≤ j) delays, the lower part

plj can be accumulated to qi by reduced word length SAs,

Ai for i = j, j − 1, . . . , j − m. After m − 1 cycle delays,the saved pu

j is added to the signed extended plj−m−1 where

w(

plj−m−1

)> w

(pl

j

). If m = j , then p−1 represents the

output of the last SA A0. The bit width w(Ai ) of each reducedSA Ai , for i = j, j−1, . . . , j−m, can be estimated by intervalarithmetic as follows:

w (Ai )>

⎡⎢⎢⎢

log2

⎝max

⎝2w(pl

i

)−1+

j∑

i= j−m

v+i ,−

j∑

i= j−m

v−i

⎤⎥⎥⎥(3)

where �·� is the ceiling function and[v+

j , v−j

]is the dynamic

range of q j .As v−

i ≤ 0 and v+i ≥ 0, w(Ai ) increases monotonically

with i . Assuming ripple carry adder (RCA) is used for theSA implementation to conserve area and power, the totalnumber of FAs saved by the m reduced SAs starting fromthe j -th tap is given by:

� j,m =m∑

i=0

[w

(p j

) − w(

A j−i) − ρ

(w

(A j−i

) − w(

plj

))]

−[w

(A j−m

) − w(

plj

)](4)

In (3), w(

p j) − w

(A j−i

)represents the FA cost saving and

w(

A j−i) − w

(pl

j

)represents the register overhead required

Fig. 3. TDA block with bisection technique at SA A j with m = j .

TABLE II

FA COST AND FA SAVING � j BY DIRECT APPLICATION OF TDABISECTION TECHNIQUE AT DIFFERENT j FOR FIR1 TO FIR4

to save puj for m cycles, where ρ is the area ratio of a FA to

a FF. The last term w(

A j−m) − w

(pl

j

)refers to the FA cost

required to merge the saved puj .

Although w(

plj

)can be any value smaller than w

(p j

),

to minimize the SAs, w(

plj

)is made equal to w

(q j

).

Apparently, � j,m depends on the tap position j to split p j

and the number of delay cycles, which is m − 1, before themerging. To investigate the effect of maximizing the SA costsaved by performing only one splitting of addend at sometap position j with m = j such that all the SAs from thej -th tap to the output can be reduced, as shown in Fig. 3.Table II lists the values of � j = � j, j at different splittingposition j for FIR1 to FIR4 with ρ = 1. The first row ofTable II with j = −1 represents the equivalent FA cost of theTDA block without applying the bisection method. It showsthat the register overhead incurred by the bisection increaseswith m and outweighs the saving in SAs if the cost of a FF iscomparable to a FA.

The negative � j results of Table II signify that limited sav-ings can be obtained even if different combinations of j and mare explored simultaneously with multiple bisections andmerges. The trade-off between the register overhead andSA savings is severely constrained by the pre-determined filtercoefficients. In cognizant of the influence of filter coefficientson the word length distribution of qi and the dominance ofTDA block area on the overall filter complexity, we proposeto synthesize the FIR filter coefficients to achieve notableTDA block minimization by simple bisection with j = m.

Page 4: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS

III. TDA MINIMIZATION ALGORITHM BY

FILTER COEFFICIENT SYNTHESIS

A. FIR Filter Coefficients and Frequency Response

The frequency response of a linear-phase FIR filter canbe described by a real-valued zero-phase frequency responsefunction HR(ω) and a real-valued phase function θ (ω), i.e.,

H (ω) = HR (ω) e jθ(ω) (5)

where ω is the angular frequency in radian per second.Since

∣∣e jθ(ω)∣∣ = 0, |H (ω)| = |HR (ω)|. Ignoring the

linear phase term, e− jω (N−1)/2, the frequency response of aFIR filter of length N can be expressed as [22]:

HR (ω) = a0 + 2�N/2∑

i=1

ai T (ω, i) (6)

where ai is a function of the filter coefficient h(i), T (ω, i) isa trigonometric function.

For a linear phase FIR filter, T (ω, i) = cos(iω). If N isodd, then a0 = h

( N−12

)and ai = h

( N−12 − i

). If N is even,

then a0= 0 and ai = h( N

2 − i). Thus, the magnitude response

of H (ω) can also be expressed as:

H (ω) =

⎧⎪⎪⎪⎪⎪⎪⎪⎪⎨

⎪⎪⎪⎪⎪⎪⎪⎪⎩

h

(N − 1

2

)

+2

N−12 −1∑i=0

h (i) cos

((N − 1

2− i

)N is odd

2

N2 −1∑i=0

h (i) cos

((N − 1

2− i

)N is even

(7)

For fixed point implementation, the real-valued coeffi-cients h(i) for i = 1, 2, . . . , N can be directly truncated intoa set of L-bit finite precision coefficients denoted by H.

Different finite precision coefficient sets H′ can be furtherderived from H to meet the minimum filter response specifica-tions, which are typically specified in terms of the filter pass-band and stopband frequencies, maximum allowable passbandripple, and minimum stopband attenuation. The tolerance offrequency response error E (ω) = ∣∣H (ω) − H ′ (ω)

∣∣, whereH ′ (ω) is the approximated magnitude response obtainedby H′, is exploited by filter design algorithms to synthesize H′with a minimal number of signed power-of-two terms [19] or amaximal number of common subexpressions [8], [22]. Unfor-tunately, the filter coefficient set H′ synthesized by all existingdesign algorithms is helpful in reducing the complexity of onlythe MCM block but not the TDA block.

B. Problem Formulation

For an input signal x of word length w(x), the word lengthof qi = h′ (i) x (n), where h′ (i) is the i -th coefficient of H′,is given by:

w (qi ) = w(h′ (i)

) + w (x) − 1 (8)

From (4), if the operand p j of SA A j is chosen for bisectionand m = j , then the net FA saving � j of the bisection method

TABLE III

PASSBAND, TRANSITION BAND AND STOPBANDFOR DIFFERENT FILTER TYPES

based on the coefficient set H′ can be expressed as:

� j

=j∑

i=0

[w(p j )−w(A j−i)−ρ(w(A j−i)−w(h′( j)−w(x)+1))]

− [w(A0) − w(h′( j) − w(x) + 1)] (9)

The passband, transition band and stopband for differentfilter types can be specified by the four lower and uppertransition band edge frequencies, ω1, ω2, ω3 and ω4, as shownin Table III, where ω1 < ω2 < ω3 < ω4. For low pass filter,ω3 and ω4 represent the passband and stopband edge frequen-cies, respectively, and the lower transition band edge frequen-cies, ω1 = ω2 = 0. For high pass filter, ω1 and ω2 representthe stopband and passband edge frequencies, respectively, andthe upper transition band edge frequencies, ω3 = ω4 = ∞. Forband pass filter, ω1 and ω4 are the stopband edge frequencieswhile ω2 and ω3 are the passband edge frequencies. For theband stop filter, ω1 and ω4 are the passband edge frequencieswhile ω2 and ω3 are the stopband edge frequencies.

Let δp and δs denote the passband ripple and stopband atten-uation, respectively. Then, to reduce the TDA block complex-ity, H′ should be synthesized such that its magnitude responsefulfills δp ≤ δpmax and δs ≥ δsmin with the maximum � j ,where δpmax and δsmin are the maximum allowable passbandripple and the minimum stopband attenuation, respectively.The FIR filter design problem can be reformulated as:

FindN, j and H′ to maximize � j s.t.

H ′ (ω) ≤ δp max + 1 for ω ∈ passband, and

H ′ (ω) ≤ 1 − δs min for ω ∈ stopband (10)

where j ∈ [1, N − 1] is the optimal tap position to bisectthe input operand p j to the j -th SA, A j , where the index jis numbered from the output tap ( j = 0) towards the input( j = N − 1) of the filter.

The solution to this integer linear programming problemcan be evolved from an initial finite precision coefficientset H of word length L and filter order N determined by thePark-Mcllean algorithm based on H (ω). Although N and Lof H are not large, the search space for a solution coefficientset H′ that possesses an optimal point to maximize � j is huge.To ensure a fast convergence to an optimal solution, two addi-tional constraints are imposed. First, the solution set H′ mustmaintain the linear phase response. Secondly, the dynamicrange of each coefficient h′ (i) should not be significantlydifferent from that of its corresponding coefficient h (i). This isaccomplished by limiting their precision difference to ±b units

Page 5: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEN et al.: TDA COST AWARE COEFFICIENT SYNTHESIS ALGORITHM 5

of least precision (ulp) of h (i). Forh (i) with L-bit precision,its ulp = 2−L . By setting b = 2, a more constrained integerlinear programming problem is defined:

Find N, j and H′ to maximize � j s.t.

h′ (i) = h′ (N − i − 1) ∀i ∈ [0, N − 1] ,

h′ (i) ∈[h (i) − 21−L , h (i) + 21−L

]∀i ∈ [0, N − 1] ,

H ′ (ω) ≤ δp max + 1 for ω ∈ passband and

H ′ (ω) ≤ 1 − δs min for ω ∈ stopband (11)

C. Proposed Coefficient Synthesis Algorithm

The integer linear programming problem described by (11)can be solved by Genetic Algorithm (GA) [16]. By introducingstochastic search [23] to escape local minima, GA helps inconverging the solution towards greater optimality uniformlyfrom many different initial coefficient sets H′ without demand-ing precise modeling of magnitude response errors.

The GA is initialized with a population of K randomlyselected solutions as the parent set. For the problem of (10),each of these K solutions is a combination of a bisectionpoint j and a coefficient set H′. A portion of this populationis then selected to breed a new generation of solutions throughan iterative process of crossover and mutation operations untilthe termination criterion is met, which is determined by thepoint of diminishing return in the net positive saving � j .

The solutions in the parent set are paired to perform thecrossover. Selecting the two most optimal solutions directlyfrom the parent set to crossover can hardly evade the localminima problem. If an optimal solution is paired with a non-optimal solution and crossover, an additional optimal solutionis more likely to be evolved. In our method, S solutions arefirst randomly selected from the parent set. Two solutionswith the largest � j in these S randomly selected solutions arepaired and the remaining S − 2 solutions are returned to theparent set. Next, another S unpaired solutions in the parentset are randomly selected and two of them with the largest� j are paired. This pairing continues until all solutions inthe parent set are paired. Let h′

A (α) and h′B (α) be the α-th

coefficient (α ∈ [0, N − 1]) of a pair with solutions, A and B ,respectively. Crossover is performed by swapping two differentbits in the same position of h′

A (α) and h′B (α) once, starting

from the least significant bit (LSB) position. The two newsolutions generated by a crossover will undergo the constraintcheck based on (11). The qualified new solutions that meetthe constraints are added into the next generation parent setand the disqualified ones are discarded. Once a successfulcrossover between h′

A (α) and h′B (α) is made, the remaining

bits of h′A (α) and h′

B (α) will not be swapped. The crossoverwill be performed for all coefficients of a pair and for allpaired solutions in the parent set of current generation.

The pairing and crossover operations are described by thepseudocodes in Fig. 4. The function top_two_saving selectstwo candidate solutions with the greatest savings from thegroup of S solutions. The function swap(Paired_grp(i), α)swaps a digit in one solution with that in another solutionof the i -th candidate solution pair at the same bit position.

Fig. 4. Crossover operation of proposed GA-based filter coefficient synthesisalgorithm.

The check(new_solution) function tests if the resulting newsolution meets the constraints of (11), and returns 1 if it fulfillsthe constraints.

Mutation is performed after crossover, where one bit of acoefficient in a solution is changed to yield a new candidatesolution, starting from the LSB. In practice, the constraintsof (11) are unlikely to be met when bit weighted heavierthan the second LSB is varied, as verified by rigorous experi-mentation. Therefore, mutation is only performed on the twoLSBs of a coefficient in our algorithm, which produces atmost four possible new candidate solutions. Even then, onlya few big coefficients of a FIR filter can afford to have theirtwo LSBs changed without violating the constraints of (11).To reduce the search space of our algorithm, highly probableineffective mutations are avoided by applying the mutationoperation to only the M largest coefficients of a candidatesolution. Thus, the maximum number of mutations performedis 4M for each solution. All the newly generated qualifiedsolutions by mutation are added into the parent set of the nextgeneration.

The size of the parent set has expanded after crossoverand mutation due to many more newly generated candidatesolutions. The size of the parent set is reduced back to Kbefore the next iteration by keeping only the K candidatesolutions with the largest net positive savings � j . The overallsaving of these K new candidate solutions is expected to begreater than or equal to the overall saving of the K candidatesolutions in the parent set of the previous generation. As theaverage cost of the candidate solutions continue to reducefrom generation to generation, the effect of crossover andmutation becomes less prominent. Therefore, the iteration will

Page 6: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS

Fig. 5. GA program flow for TDA cost driven coefficient synthesis.

terminate if the increment in average cost saving between twoiterations has reduced to below certain empirical threshold,which indicates that further iteration gains significantly lesssavings for the same computational effort. To limit the com-putation time, the number of iterations is also capped at somemaximum value. The GA terminates when either criterion ismet. Upon termination, the GA returns the L-bit precisioncoefficient set H′ with the position j of the TDA block wherethe bisection is applied.

Fig. 5 presents the GA program flow of the TDA cost drivencoefficient synthesis. The function GA(δpmax, δsmin , ω1, ω2,ω3,ω4, H) takes the design specifications and initial coefficientset as inputs and generates the optimized finite coefficientset H′. The function random_generate (H, constraint, K )randomly generates K candidate solutions into the parent set,by varying j and H. Each solution is represented by aninteger j and a coefficient set H′ fulfilling the constraintsspecified in (11). The function saving_compute(parent_set)evaluates and adds up the savings of all solutions in theparent set. The functions crossover and mutation performcrossover and mutation, respectively, as described earlier, andgenerate new candidate solutions into the parent set of thenext generation. The function select(K , parent_set) selects theK solutions with the largest savings from the parent set forthe next iteration. When one of the two termination criteriais met, the solution with the largest saving is selected byselect(1, parent_set) as the final solution.

Fig. 6 describes the complete design flow of the proposedTDA block minimization (TDAmin) algorithm. TDAmin gen-erates a succinct implementation topology of optimized TDAand MCM blocks based on the design specification of aFIR filter. The minimum filter length N is estimated byPM_Min_N and the N filter coefficients of H are synthesizedby PM using Park-Mcllean algorithm. The coefficient set H′that minimizes the TDA block cost is then generated by callingour GA function to solve the integer programming problemformulated in (11). The function Design_TDA returns thedesign of TDA block by applying the bisection technique at

Fig. 6. Proposed TDAmin algorithm.

SA position j . To complete the FIR filter implementation,the MCM block is designed and optimized by Design_MCMusing one of the existing MCM algorithms with the finiteprecision coefficients of H′ generated by our GA algorithm.

D. Algorithm Complexity

According to the constraints of (11), the search for H′in our proposed GA algorithm is bounded by ±2(−L+1) ofeach L-bit coefficient originated from h (i) of H. Due tocarry and borrow, at most three LSBs of a coefficient ofthe candidate solution can possibly be changed regardlessof L. Different bisection points j ∈ (1, N − 1) are testedto evaluate the net positive saving of each candidate solution.The search for the optimal bisection point of each candidatesolution grows linearly with N . In practice, crossover islimited to the two LSBs of any coefficients. In each iter-ation, the parent set maintains K coefficient sets to formK /2 pairs and each coefficient set has N coefficients. So thecomplexity of crossover is O(2N K/2) = O(N K ). Similarly,four possible mutations can be performed on each of theM largest coefficients. So the mutation complexity for Kcandidate solutions of the parent set is O(4M K ). The overallcomplexity of crossover and mutation for each generationis O(N(N K + 4M K )) = O(N2 K + 4M K N). From theexperimental results of algorithmic convergence of practicalFIR filters, both the number of iterations and the chosenK value are well bounded and independent of N . Hence,the overall complexity can be approximated by O(N2+4M N).It is reasonable to assume M is proportional to N . Hence, ourproposed GA algorithm has a quadratic complexity of O(N2).In practice, the FIR filter order N rarely exceeds 200. Thecomputation efficiency is attested by running the proposedalgorithm on one exceptionally long filter of order N = 293from [24]. It took only 8.2s to obtain a solution with optimizedTDA cost on a PC equipped with Intel i7-4500CPU runningat 1.8GHz and 16GB RAM.

E. Design Example

FIR1 from [18] is used to demonstrate the design flow ofour proposed algorithm. This is a low pass filter with thenormalized passband and stopband frequencies at ω3 = 0.25and ω4 = 0.3, respectively. The maximum passband rippleand minimum stopband attenuation are 0.5 dB and 27 dB,respectively. The input signal wordlength s is assumed tobe 8 bits. With these specifications, the filter order determinedby Park-Mcllean algorithm is N = 24. The 24 infinite-precision coefficients obtained by Park-Mcllean algorithm are

Page 7: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEN et al.: TDA COST AWARE COEFFICIENT SYNTHESIS ALGORITHM 7

Fig. 7. (a) Frequency responses of FIR1 designed by TDAmin, [8], [10]and [15], (b) Frequency responses in passband.

rounded at the 50th bit position. Their decimal values of therounded coefficients are listed below.

H = {0.028225,−0.012702,−0.017768, 0.024334,

0.013875,−0.038491,−0.0078193, 0.066189,−0.017955,

−0.11552, 0.10696, 0.48576, 0.48576, 0.10696,−0.11552,

−0.017955, 0.066189,−0.0078193,−0.038491, 0.013875,

0.024334,−0.017768,−0.012702, 0.028225}.With H and the constraints specified in (11), we performthe coefficient synthesis algorithm presented in Fig. 5. Theproduced decimal values of the coefficients of this optimalsolution set H′ are listed below.

H′ = {0.027344,−0.012695,−0.017578,

0.023438, 0.013672,−0.038086,−0.0078125, 0.06543,

−0.017578,−0.11523, 0.10645, 0.48633, 0.48633, 0.10645,

−0.11523,−0.017578, 0.06543,−0.0078125,−0.038086,

0.013672, 0.023438,−0.017578,−0.012695, 0.027344}.The frequency response of this coefficient set H′ is plotted inFig. 7, which meets all specifications of this FIR filter [18].The frequency responses of this filter designed by three othermethods [8], [10], and [15] are also plotted in Fig. 7 forcomparison. All four responses meet the design specificationswith different hardware implementation costs. The compar-isons of their hardware area and power consumptions in fieldprogrammable gate array (FPGA) and application specificintegrated circuit (ASIC) implementations will be presentedand discussed in Section IV.

Fig. 8. FIR filter (FIR1) designed by TDAmin.

The optimized TDA block produced by H′ with j = 9is shown in Fig. 8. Since h′ (9) = 0.000111010 and theword length of input sample, s = 8, w(q9) = 13. The wordlength of the partial sum p9, w(p9) = 25, which is longerthan 13 bits. Using conventional TDA block design, 25 FAsare required for the carry ripple adder (CRA) of A9. Theword lengths of subsequent adders A8 to A0 as well as theregisters R8 to R0 also increase accumulatively. Assume anarea ratio of 1:1 between a FA and a FF, the equivalent FAcost from A9 to A0 in the TDA line is 590 without using thebisection technique. With our proposed algorithm, the 25-bitlong p9 is split into a 13-bit operand pl

9 and a 12-bit operandpu

9 . pl9 is added with q9 and pu

9 is latched until the outputof A0 is generated. As the wordlength w(A9) has reducedfrom 25 bits to 13 bits, the sizes of subsequent A8 to A0and their accompanying registers R8 to R0 are all reduced.Their area costs total up to 370 FAs. Accounting for the areaoverhead of 37 FAs required for the additional final adder andthe registers to latch pu

9 , there is still a significant 31% FA costsaving of � j = 590 − 370 − 37 = 183.

The other three filters FIR2 to FIR4 in Table II are alsodesigned by TDAmin algorithm. Positive savings are obtainedas opposed to the negative savings obtained using the coef-ficients generated from Park-Mcllean algorithm directly. Thiscomparison shows that simple TDA bisection technique caneffectively reduce the overall complexity of FIR filter withthe help of coefficient set generated by TDAmin. To furtherevaluate the effectiveness of the proposed TDAmin algorithm,in the next section, logic synthesis will be performed on thebenchmark filters designed by TDAmin and compared withsolutions generated by other methods.

IV. LOGIC SYNTHESIS RESULTS AND DISCUSSION

In this section, a group of practical filters are designedby the proposed algorithm and a few recently publishedcompeting methods. The logic synthesis and power simulationresults will be compared and discussed. As the filter length ofmost practical filters ranges from 25 to 200 and the precisionof their coefficients ranges from 8 bits to 16 bits, the parentset size K of our algorithm is set to 50, the group size Sfor pairing the solutions is set to 10 and the number of largecoefficients M is set to N /10 rounded to its nearest integer.The latter implies that only 10% of the largest coefficients of acoefficient set are eligible for mutation. The minimum numberof iteration T is set to 6 and the threshold in the average savingincrement for termination is set to 1/K . After H′ is generated

Page 8: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS

TABLE IV

SPECIFICATIONS OF THE TWELVE BENCHMARK FILTERS

by TDAmin, Pasko algorithm [7] is used to minimize theMCM block.

The benchmark filter specifications are listed in Table IV,which are used by the proposed TDAmin and its competingalgorithms to synthesize the filter coefficients. FIR1 [18]is a low pass filter, which is used as design example inSection III.E. FIR2 to FIR12 are benchmark filters presentedin the respective references cited next to them. These filtersinclude practical filters with different specifications and filterlengths. For example, FIR2 and FIR4 are filters that have verylarge stopband attenuation (> 80dB). These filters are usedfor the applications that demand very high SNR. FIR4, FIR8,FIR10 and FIR12 are high selectivity filters with very narrowtransition band. These filters have lengths from 20 to 293,representing a good sampling of short, median and long filters.The designs generated by the TDAmin are compared againstthose generated in [8], [10], and [15]. The design method in [8]is also a linear programming based algorithm. The objectiveof its coefficient synthesis is to maximize the sharing of twomost frequently encountered types of common subexpressionsin the MCM block without considering the optimization ofTDA block. As it is also formulated based on integer linearprogramming, it is a good candidate to compare the overallcost reduction achievable by TDA cost aware and typicalMCM cost aware design algorithms. The design algorithm [10]is a recently proposed low complexity FIR filter synthesisalgorithm. By using nonlinear quantization with uneven wordlength, it minimizes the bit width of filter coefficients. Thesolutions generated by the coefficients of this latest MCMblock minimization algorithm [10] can further attests thedominance of structural adder costs and the savings achievedby TDAmin is far more than the savings achievable by thestate-of-the-art minimization of MCM block. The methodproposed in [15] is the most recent design method targetingSA block minimization. Thus, its comparison provides themost compelling proof of the effectiveness of TDAmin inproducing a more optimized TDA block.

All the designs are mapped to Xilinx Spartan6, xc6slx75tFPGA device and synthesized using Xilinx ISE WebPACKv14.7. The synthesized areas in number of LUTs and delays inns are presented in Table V. After the design is implemented,

TABLE V

SYNTHESIZED FPGA AREAS IN #LUT AND DELAYS IN ns FORFIR FILTERS DESIGNED IN [8], [10], AND [15] AND TDAmin

TABLE VI

TOTAL POWER IN μW ON FPGA IMPLEMENTATION OF FIRFILTERS DESIGNED IN [8], [10], [15] AND TDAmin

the NCD file output from Place & Route (PAR) is read byXilinx ISE Xpower Analyzer (XPA) to estimate the activityrates. Input frequency and supply voltage among others arespecified in the settings of simulation activity files for XPAto perform the power analysis. A clock frequency of 70 MHzand a supply voltage of 1.2V are set consistently for all thedesigns. The power dissipation results in μW simulated byXPA are listed in Table VI.

From Tables V and VI, the proposed TDAmin algorithmreduces the LUT costs by 26.8%, 8.7% and 9.4% on averageover [8], [10] and [15], respectively. As the MCM blocksare implemented by the same algorithm using the coefficientsets synthesized by all algorithms in comparison, the lowerimplementation complexity of the FIR filters designed byTDAmin is attributed solely to the optimization of TDA blockwith appropriate choice of SA splitting position. The delays ofFIR filters designed by TDAmin are also 1.2%, 4.6% and 1.2%shorter than [8], [10], and [15], respectively. From the totalpower comparison, it is evident that the solutions generated byTDAmin are the most power efficient among all. The powerreductions over [8], [10] and [15] are on average 9.7%, 3.0%and 27.5%, respectively.

All the generated designs are also mapped to STM 65nmstandard cell library and synthesized by Synopsys Design-Compiler™. The synthesized areas in μm2 and delays in

Page 9: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEN et al.: TDA COST AWARE COEFFICIENT SYNTHESIS ALGORITHM 9

TABLE VII

SYNTHESIZED ASIC AREAS IN μm2 AND DELAYS IN ns FOR FIRFILTERS DESIGNED IN [8], [10], AND [15] AND TDAmin

TABLE VIII

SIMULATED TOTAL POWER IN μW ON ASIC IMPLEMENTATION OF

FIR FILTERS DESIGNED IN [8], [10], AND [15] AND TDAmin

ns of the designs generated by the proposed and competingalgorithms are presented in Table VII. The logic synthesis andoptimization effort is set to minimum delay. The total power,including dynamic power and leakage power, consumed bythe FIR circuits are simulated by Synopsys Prime TimePX version: Z-2006.12. The supply voltage and clockfrequency are set to 0.9V and 250 MHz, respectively.As switching power is input dependent, Monte Carlo powersimulation [29] was adopted by applying randomly generatedinput vectors epoch by epoch until the maximum error in meanpower falling within the 95% confident interval convergesto 5% or lower. This goal was reached with slightly morethan 360 test vectors. Eventually, 400 random input vectorswere applied, which worked out to a statistical error boundof 4% within 95% confidence interval. The power simulationresults are presented in Table VIII.

From Tables VII, the average area reductions of the FIRfilters designed by the proposed TDAmin algorithm are 28.8%,15.8% and 11.2% compared with those designed in [8], [10],and [15], respectively. The noteworthy 28.8% area reductionover [8] indicates that the TDA block optimization contributesmore significantly to the overall complexity reduction than theMCM block minimization. As for delay, TDAmin producessolutions with 1.1%, 0.2% and 3.4% shorter delays than those

TABLE IX

THE POWER-AREA PRODUCT IN μW × μm2 FOR ASIC IMPLEMENTATION

OF FIR FILTERS DESIGNED IN [8], [10], AND [15] AND TDAmin

of [8], [10], and [15], respectively. The timing optimizationeffort made by the synthesis tool is capable of closing thegap between the critical path delays of solutions generatedby different design algorithms. From Table VIII, the totalpower reductions of the solutions generated by our proposedalgorithm over [8], [10], and [15] are 22.1%, 11.3% and 6.7%,respectively. The savings in power consumption are ascribedto the less complicated adders and registers of TDA block.A couple of short filters, such as FIR1 of N = 24 andFIR2 of N = 36, designed by TDAmin are found to consumeslightly more area and power than those designed in [15].This is because the short TDA line has limited the numberof structural adders that can be benefited from the lengthreduction after the bisection point. The optimal bisection pointdepends on the design specifications. TDAmin works better ifthe coefficient values after the bisection are much smaller thanthose values before bisection. If the design specifications ofshort filters are too restrictive, the low variance in coefficientvalues may push the optimal bisection point closest to theoutput. Even though the cost of the solution generated bythe bisection technique may not be the least under suchcircumstance, it is not too far off. FIR1 and FIR2 designedby TDAmin consume on average only 2.9% more area thanthe best solutions designed in [15]. For most other short filters,such as FIR6 with bisection at j = 13 for N = 20, FIR7 withbisection at j = 13 for N = 26 and FIR11 with bisection atj = 17 for N = 28, the complexity of their structural addersafter the bisection point can still be substantially reduced byTDAmin, making them the least area and power consumptionamong all solutions in comparison. Excluding FIR1 and FIR2,TDAmin saves on average 14% area over [15]. The sameobservation goes for power consumption. The complexityof TDA block increases faster than the MCM block asN increases. FIR10 is an exemplary case with an extremelynarrow transition band. FIR10 has more than 270 coefficientsof 16-bit precision. Its area, delay and power consumptionhave been reduced by 22.6%, 20.4% and 18.5%, respectivelyby TDAmin compared with the solution generated in [15].

As the delays of the solutions generated by these algorithmsare not appreciably different, to better compare their moresignificant overall performance deviation in logic complexity

Page 10: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

10 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR PAPERS

Fig. 9. Average AP complexities in μW × μm2 of the twelve benchmarkfilters designed in [8], [10], and [15] and TDAmin.

and power consumption, the area-power (AP) complexity iscalculated as the product of silicon area in μm2 and totalpower in μW. The AP complexities of the twelve benchmarkfilters designed by the four algorithms in comparison are listedin Table IX. It is evident that, on average, the AP complexityof the filters designed by the proposed TDAmin algorithmis lower than those of [8], [10, and [15] by 43.8%, 25.1%,and 16.8%, respectively. The average AP complexities of thetwelve filters designed by the four algorithms are plottedin Fig. 9.

V. CONCLUSION

Motivated by the dominating cost of TDA block in FIRfilter which cannot be effectively reduced by existing designalgorithms that maximize sharing of adders in the MCM block,this paper presents a new design methodology for area-powerefficient FIR filter implementation by synthesizing the filtercoefficients to specifically maximize the cost savings of TDAblock. The effective implementation cost of TDA block isdetermined based on operand bisection at some optimal tapposition to reduce the sizes of subsequent structural addersand registers without violating the filtering specifications. Therenewed coefficient synthesis problem is solved by a dedicatedGenetic Algorithm. Our proposed solutions are at least 16.8%more area-power efficient based on the comparison of syn-thesis results in 65nm standard cell technology with designsgenerated by three other state-of-the-art design algorithms fortwelve benchmark filter specifications.

VI. ACKNOWLEDGEMENT

They would like to thank Dr. Lou Xin for his kind shar-ing of the source code of his recently published algorithm.They would also thank Dr. Sumedh Somnath Dhabu andMr. Zhang Yufei for their kind support in part of theexperiments.

REFERENCES

[1] H. Ma, L. Liu, A. Zhou, and D. Zhao, “On networking of Internet ofThings: Explorations and challenges,” IEEE Internet Things J., vol. 3,no. 4, pp. 441–452, Aug. 2016.

[2] J. Chen and C. H. Chang, “High-level synthesis algorithm for thedesign of reconfigurable constant multiplier,” IEEE Trans. Comput.-Aided Design Integr., vol. 28, no. 12, pp. 1844–1856, Dec. 2009.

[3] T. Chen, Y. V. Zakharow, and C. Liu, “Low-complexity channel-estimatebased adaptive linear equalizer,” IEEE Signal Process. Lett., vol. 18,no. 7, pp. 427–430, Jul. 2011.

[4] M. R. Zahabi, V. Meghdadi, J. P. Cances, and A. Saemi, “Mixed-signal matched filter for high-rate communication systems,” IET SignalProcess., vol. 2, no. 4, pp. 354–360, Dec. 2008.

[5] M. S. Prakash and R. A. Shaik, “Low-area and high-throughput archi-tecture for an adaptive filter using distributed arithmetic,” IEEE Trans.Circuits Syst. II, Exp. Briefs, vol. 60, no. 11, pp. 781–785, Nov. 2013.

[6] H. Mahdavifar, M. El-Khamy, J. Lee, and I. Kang, “Performance limitsand practical decoding of interleaved Reed–Solomon polar concatenatedcodes,” IEEE Trans. Commun., vol. 62, no. 5, pp. 1406–1417, May 2014.

[7] R. Pasko, P. Schaumont, V. Derudder, S. Vernalde, and D. Durackova,“A new algorithm for elimination of common subexpressions,” IEEETrans. Comput.-Aided Design Integr., vol. 18, no. 1, pp. 58–68,Jan. 1999.

[8] Y. J. Yu and Y. C. Lim, “Design of linear phase FIR filters insubexpression space using mixed integer linear programming,” IEEETrans. Circuits Syst. I, Reg. Papers, vol. 54, no. 10, pp. 2330–2338,Oct. 2007.

[9] R. Mahesh and A. P. Vinod, “New reconfigurable architectures for imple-menting FIR filters with low complexity,” IEEE Trans. Comput.-AidedDesign Integr. Circuits Syst., vol. 29, no. 2, pp. 275–288, Feb. 2010.

[10] S. Hsiao, J. Z. Jian, and M. Chen, “Low-cost FIR filter designsbased on faithfully rounded truncated multiple constant multiplica-tion/accumulation,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 60,no. 5, pp. 287–291, May 2013.

[11] J. Chen, C. H. Chang, F. Feng, W. Ding, and J. Ding, “Novel design algo-rithm for low complexity programmable FIR filters based on extendeddouble base number system,” IEEE Trans. Circuits Syst. I, Reg. Papers,vol. 62, no. 1, pp. 224–233, Jan. 2015.

[12] F. Feng, J. Chen, and C. H. Chang, “Hypergraph based minimumarborescence algorithm for the optimization and reoptimization of mul-tiple constant multiplications,” IEEE Trans. Circuits Syst. I, Reg. Papers,vol. 63, no. 2, pp. 233–244, Feb. 2016.

[13] J. Ding, J. Chen, and C. H. Chang, “A new paradigm of commonsubexpression elimination by unification of addition and subtraction,”IEEE Trans. Comput. Aided Designs, vol. 35, no. 10, pp. 1605–1617,Oct. 2016.

[14] M. Faust and C. H. Chang, “Optimization of structural adders in fixedcoefficient transposed direct form FIR filters,” in Proc. IEEE Int. Symp.Circuits Syst., Taipei, Taiwan, May 2009, pp. 2185–2188.

[15] X. Lou, Y. Yu, and P. K. Meher, “Analysis and optimization of product-accumulation section for efficient implementation of FIR filters,” IEEETrans. Circuits Syst. I, Reg. Papers, vol. 63, no. 10, pp. 1701–1713,Oct. 2016.

[16] J. E. Beasley, Advances in Linear and Integer Programming. Oxford,U.K.: Oxford Science, 1996.

[17] J. H. McClellan and T. W. Parks, “A personal history of theParks–McClellan algorithm,” IEEE Signal Process. Mag., vol. 22, no. 2,pp. 82–86, Mar. 2005.

[18] A. G. Dempster, S. S. Dimirsoy, and I. Kale, “Designing multiplierblocks with low logic depth,” in Proc. IEEE Int. Symp. Circuits Syst.,Scottsdale, Arizona, May 2002, pp. 773–776.

[19] Y. Lim and S. Parker, “Discrete coefficient FIR digital filter design basedupon an LMS criteria,” IEEE Trans. Circuits Syst., vol. 30, no. 10,pp. 723–739, Oct. 1983.

[20] J. Laskowski and H. Samueli, “A 150-MHz 43-tap half-band FIR digitalfilter in 1.2- μm CMOS generated by silicon compiler,” in Proc. IEEECustom Integr. Circuits Conf., May 1992, pp. 11.4.1–11.4.4.

[21] M. Faust, “Design methodologies for complexity reduction of FIR fil-ters,” Ph.D. dissertation, School Elect. Electron. Eng., Nanyang Technol.Univ., Singapore, 2013.

[22] J. Chen, J. Tan, C.-H. Chang, and F. Feng, “A new cost-aware sensitivity-driven algorithm for the design of FIR filters,” IEEE Trans. CircuitsSyst. I, Reg. Papers, vol. 64, no. 6, pp. 1588–1598, Jun. 2017.

[23] K. Deep, K. P. Singh, M. L. Kansal, and C. Mohan, “A real codedgenetic algorithm for solving integer and mixed integer optimizationproblems,” Appl. Math. Comput., vol. 212, no. 2, pp. 505–518, 2009.

[24] M. Aktan, A. Yurdakul, and G. Dundar, “An algorithm for the design oflow-power hardware-efficient FIR filters,” IEEE Trans. Circuits Syst. I,Reg. Papers, vol. 55, no. 6, pp. 1536–1545, Jul. 2008.

[25] S. R. Kotha, S. Vij, and S. K. Sahoo, “A study on strategies and Mutantfactor in differential evolution algorithm for FIR filter design,” in Proc.Int. Conf. Signal Process. Integr. Netw. (SPIN), Feb. 2014, pp. 50–55.

[26] S. Mandal, S. P. Ghoshal, R. Kar, and D. Mandal, “Design of optimallinear phase FIR high pass filter using craziness based particle swarmoptimization technique,” J. Comput. King Saud Univ. Inf. Sci., vol. 24,no. 1, pp. 83–92, Jan. 2012.

Page 11: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–I: REGULAR ...kresttechnology.com/krest-academic-projects/krest-mtech-projects/E… · Design Centre, Singapore University of Technology

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

CHEN et al.: TDA COST AWARE COEFFICIENT SYNTHESIS ALGORITHM 11

[27] C.-L. Chen and A. N. Willson, Jr., “A trellis search algorithm forthe design of FIR filters with signed-powers-of-two coefficients,” IEEETrans. Circuits Syst. II, Analog Digit. Signal Process., vol. 46, no. 1,pp. 29–39, Jan. 1999.

[28] A. Shahein, Q. Zhang, N. Lotze, and Y. Manoli, “A novel hybridmonotonic local search algorithm for FIR filter coefficients optimiza-tion,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 59, no. 3,pp. 616–627, Mar. 2012.

[29] R. Burch, F. N. Najm, P. Yang, and T. N. Trick, “A Monte Carlo approachfor power estimation,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst.,vol. 1, no. 1, pp. 63–71, Mar. 1993.

Jiajia Chen (M’14) received the B.Eng. (Hons.)and Ph.D. degrees from Nanyang Technological Uni-versity, Singapore, in 2004 and 2010, respectively.Since 2012, he has been with the Singapore Uni-versity of Technology and Design, where he is cur-rently a Senior Lecturer. He has authored one bookchapter and over 30 research papers in SCI journalsand international conferences. His research interestincludes the computational transformations of low-complexity digital filters, programmable filters, anddigital signal processing. He served as the Web Chair

of the Asia-Pacific Computer Systems Architecture Conference 2005 and theTechnical Program Committee Member of the European Signal ProcessingConference 2014. He has been an Associate Editor of the Springer EURASIPJournal on Embedded Systems since 2016.

Chip-Hong Chang (S’92–M’98–SM’03) receivedthe B.Eng. (Hons.) degree from the National Uni-versity of Singapore in 1989, and the M.Eng. andPh.D. degrees from Nanyang Technological Uni-versity (NTU) in 1993 and 1998, respectively. Heserved as a Technical Consultant in industry prior tojoining the School of Electrical and Electronic Engi-neering (EEE), NTU, in 1999, where he is currentlyan Associate Professor. He held joint appointmentswith NTU as an Assistant Chair of Alumni of theSchool of EEE from 2008 to 2014, the Deputy

Director of the Center for High Performance Embedded Systems from 2000 to2011, and the Program Director of the Center for Integrated Circuits andSystems from 2003 to 2009. He has edited four books, published ten bookchapters, 87 international journal papers (two-thirds are IEEE) and over160 refereed international conference papers (mostly in the IEEE), anddelivered over 30 colloquia. His current research interests include hardwaresecurity and trustable computing, low-power and fault-tolerant computing,residue number systems, and application-specific digital signal processing.

Dr. Chang has been an Associate Editor of the IEEE TRANSAC-TIONS ON VERY LARGE SCALE INTEGRATION SYSTEMS since 2011,the IEEE ACCESS since 2013, the IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS and the IEEETRANSACTIONS ON INFORMATION FORENSIC AND SECURITY since 2016,the Springer Journal of Hardware and System Security since 2016, andthe Microelectronics Journal since 2014. He was an Associate Editor ofthe IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I from 2010 to2013, and Integration, the VLSI Journal from 2013 to 2015. He was theEditorial Advisory Board Member of the Open Electrical and ElectronicEngineering Journal from 2007 to 2013 and the Journal of Electricaland Computer Engineering from 2008 to 2014, and the Guest Editor ofspecial issues of the IEEE TRANSACTIONS ON DEPENDABLE AND SECURE

COMPUTING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS-I,the Journal of Circuits, Systems and Computers, and the Journal of Electricaland Computer Engineering. He also served in the Organizing and TechnicalProgram Committee for over 50 international conferences (mostly the IEEE).He is a fellow of the IET.

Jiatao Ding is currently pursuing the B.Eng. degreein computer engineering with the Singapore Univer-sity of Technology and Design. He is currently aPart-Time Research Assistant with the SUTD-MITInternational Design Center, Singapore Universityof Technology and Design. His research interestsinclude digital filter design and implementation,CAD algorithms for high-level synthesis, and newlogical operator design.

Rui Qiao is currently pursuing the B.Eng. degreein computer science with the Singapore Univer-sity of Technology and Design. He is currentlya Student Research Assistant with the SUTD-MITInternational Design Center, Singapore Universityof Technology and Design. His research interestsinclude digital filter design, system optimization, anddata analysis.

Mathias Faust (S’07–M’14) received the Dipl.Ing.and B.Eng. degrees from the University of AppliedSciences Rapperswil, Switzerland, in 2007, andthe Ph.D. degree from Nanyang TechnologicalUniversity, Singapore, in 2014. He is currently anIT Manager, a System Engineer, and a Consultantin industry. His research interest includes low com-plexity digital signal processing, digital filter designand optimization, and computer-aided design forcomputer arithmetic circuits. He is a member of theIET and Swiss Engineering STV.