IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE...

9
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 A High-Performance FIR Filter Architecture for Fixed and Reconfigurable Applications Basant Kumar Mohanty, Senior Member, IEEE , and Pramod Kumar Meher, Senior Member, IEEE Abstract— Transpose form finite-impulse response (FIR) filters are inherently pipelined and support multiple constant multiplications (MCM) technique that results in significant sav- ing of computation. However, transpose form configuration does not directly support the block processing unlike direct- form configuration. In this paper, we explore the possibility of realization of block FIR filter in transpose form configuration for area-delay efficient realization of large order FIR filters for both fixed and reconfigurable applications. Based on a detailed computational analysis of transpose form configuration of FIR filter, we have derived a flow graph for transpose form block FIR filter with optimized register complexity. A generalized block formulation is presented for transpose form FIR filter. We have derived a general multiplier-based architecture for the proposed transpose form block filter for reconfigurable applications. A low-complexity design using the MCM scheme is also presented for the block implementation of fixed FIR filters. The proposed structure involves significantly less area- delay product (ADP) and less energy per sample (EPS) than the existing block implementation of direct-form structure for medium or large filter lengths, while for the short-length filters, the block implementation of direct-form FIR structure has less ADP and less EPS than the proposed structure. Application- specific integrated circuit synthesis result shows that the proposed structure for block size 4 and filter length 64 involves 42% less ADP and 40% less EPS than the best available FIR filter structure proposed for reconfigurable applications. For the same filter length and the same block size, the proposed structure involves 13% less ADP and 12.8% less EPS than that of the existing direct-form block FIR structure. Index Terms— Block processing, finite-impulse response (FIR) filter, reconfigurable architecture, VLSI. I. I NTRODUCTION F INITE-IMPULSE response (FIR) digital filter is widely used in several digital signal processing applications, such as speech processing, loud speaker equalization, echo cancella- tion, adaptive noise cancellation, and various communication applications, including software-defined radio (SDR) and so on [1]. Many of these applications require FIR filters of large order to meet the stringent frequency specifications [2]–[4]. Very often these filters need to support high sampling rate for high-speed digital communication [5]. The number of multiplications and additions required for each filter output, however, increases linearly with the filter order. Since there is Manuscript received November 19, 2014; accepted February 28, 2015. B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Guna 473226, India (e-mail: [email protected]). P. K. Meher is with the School of Computer Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2015.2412556 no redundant computation available in the FIR filter algorithm, real-time implementation of a large order FIR filter in a resource constrained environment is a challenging task. Filter coefficients very often remain constant and known a priori in signal processing applications. This feature has been utilized to reduce the complexity of realization of multiplications. Several designs have been suggested by various researchers for efficient realization of FIR filters (having fixed coeffi- cients) using distributed arithmetic (DA) [18] and multiple constant multiplication (MCM) methods [7], [11]–[13]. DA-based designs use lookup tables (LUTs) to store precomputed results to reduce the computational complexity. The MCM method on the other hand reduces the number of additions required for the realization of multiplications by common subexpression sharing, when a given input is multiplied with a set of constants. The MCM scheme is more effective, when a common operand is multiplied with more number of constants. Therefore, the MCM scheme is suitable for the implementation of large order FIR filters with fixed coefficients. But, MCM blocks can be formed only in the transpose form configuration of FIR filters. Block-processing method is popularly used to derive high-throughput hardware structures. It not only provides throughput-scalable design but also improves the area-delay efficiency. The derivation of block-based FIR structure is straightforward when direct-form configuration is used [16], whereas the transpose form configuration does not directly support block processing. But, to take the computational advantage of the MCM, FIR filter is required to be realized by transpose form configuration. Apart from that, transpose form structures are inherently pipelined and supposed to offer higher operating frequency to support higher sampling rate. There are some applications, such as SDR channel- izer, where FIR filters need to be implemented in a reconfigurable hardware to support multistandard wireless communication [6]. Several designs have been suggested during the last decade for efficient realization of reconfig- urable FIR (RFIR) using general multipliers and constant multiplication schemes [7]–[10]. A RFIR filter architecture using computation sharing vector-scaling technique has been proposed in [7]. Chen and Chiueh [8] have proposed a canonic sign digit (CSD)-based RFIR filter, where the nonzero CSD values are modified to reduce the precision of filter coefficients without significant impact on filter behavior. But, the reconfiguration overhead is significantly large and does not provide an area-delay efficient structure. The architectures in [7] and [8] are more appropriate for lower order filters and not suitable for channel filters due to their large area complexity. Constant shift method (CSM) and programmable shift method 1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Transcript of IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE...

Page 1: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1

A High-Performance FIR Filter Architecture forFixed and Reconfigurable Applications

Basant Kumar Mohanty, Senior Member, IEEE, and Pramod Kumar Meher, Senior Member, IEEE

Abstract— Transpose form finite-impulse response (FIR)filters are inherently pipelined and support multiple constantmultiplications (MCM) technique that results in significant sav-ing of computation. However, transpose form configurationdoes not directly support the block processing unlike direct-form configuration. In this paper, we explore the possibility ofrealization of block FIR filter in transpose form configurationfor area-delay efficient realization of large order FIR filters forboth fixed and reconfigurable applications. Based on a detailedcomputational analysis of transpose form configuration of FIRfilter, we have derived a flow graph for transpose form blockFIR filter with optimized register complexity. A generalizedblock formulation is presented for transpose form FIR filter.We have derived a general multiplier-based architecture forthe proposed transpose form block filter for reconfigurableapplications. A low-complexity design using the MCM schemeis also presented for the block implementation of fixed FIRfilters. The proposed structure involves significantly less area-delay product (ADP) and less energy per sample (EPS) thanthe existing block implementation of direct-form structure formedium or large filter lengths, while for the short-length filters,the block implementation of direct-form FIR structure has lessADP and less EPS than the proposed structure. Application-specific integrated circuit synthesis result shows that theproposed structure for block size 4 and filter length 64 involves42% less ADP and 40% less EPS than the best available FIR filterstructure proposed for reconfigurable applications. For the samefilter length and the same block size, the proposed structureinvolves 13% less ADP and 12.8% less EPS than that of theexisting direct-form block FIR structure.

Index Terms— Block processing, finite-impulse response (FIR)filter, reconfigurable architecture, VLSI.

I. INTRODUCTION

F INITE-IMPULSE response (FIR) digital filter is widelyused in several digital signal processing applications, such

as speech processing, loud speaker equalization, echo cancella-tion, adaptive noise cancellation, and various communicationapplications, including software-defined radio (SDR) and soon [1]. Many of these applications require FIR filters of largeorder to meet the stringent frequency specifications [2]–[4].Very often these filters need to support high sampling ratefor high-speed digital communication [5]. The number ofmultiplications and additions required for each filter output,however, increases linearly with the filter order. Since there is

Manuscript received November 19, 2014; accepted February 28, 2015.B. K. Mohanty is with the Department of Electronics and Communication

Engineering, Jaypee University of Engineering and Technology, Guna 473226,India (e-mail: [email protected]).

P. K. Meher is with the School of Computer Engineering, NanyangTechnological University, Singapore 639798 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TVLSI.2015.2412556

no redundant computation available in the FIR filter algorithm,real-time implementation of a large order FIR filter in aresource constrained environment is a challenging task. Filtercoefficients very often remain constant and known a priori insignal processing applications. This feature has been utilizedto reduce the complexity of realization of multiplications.Several designs have been suggested by various researchersfor efficient realization of FIR filters (having fixed coeffi-cients) using distributed arithmetic (DA) [18] and multipleconstant multiplication (MCM) methods [7], [11]–[13].DA-based designs use lookup tables (LUTs) to storeprecomputed results to reduce the computational complexity.The MCM method on the other hand reduces the numberof additions required for the realization of multiplicationsby common subexpression sharing, when a given input ismultiplied with a set of constants. The MCM scheme is moreeffective, when a common operand is multiplied with morenumber of constants. Therefore, the MCM scheme is suitablefor the implementation of large order FIR filters with fixedcoefficients. But, MCM blocks can be formed only in thetranspose form configuration of FIR filters.

Block-processing method is popularly used to derivehigh-throughput hardware structures. It not only providesthroughput-scalable design but also improves the area-delayefficiency. The derivation of block-based FIR structure isstraightforward when direct-form configuration is used [16],whereas the transpose form configuration does not directlysupport block processing. But, to take the computationaladvantage of the MCM, FIR filter is required to be realizedby transpose form configuration. Apart from that, transposeform structures are inherently pipelined and supposed to offerhigher operating frequency to support higher sampling rate.

There are some applications, such as SDR channel-izer, where FIR filters need to be implemented in areconfigurable hardware to support multistandard wirelesscommunication [6]. Several designs have been suggestedduring the last decade for efficient realization of reconfig-urable FIR (RFIR) using general multipliers and constantmultiplication schemes [7]–[10]. A RFIR filter architectureusing computation sharing vector-scaling technique has beenproposed in [7]. Chen and Chiueh [8] have proposed acanonic sign digit (CSD)-based RFIR filter, where the nonzeroCSD values are modified to reduce the precision of filtercoefficients without significant impact on filter behavior. But,the reconfiguration overhead is significantly large and does notprovide an area-delay efficient structure. The architectures in[7] and [8] are more appropriate for lower order filters and notsuitable for channel filters due to their large area complexity.Constant shift method (CSM) and programmable shift method

1063-8210 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

2 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

have been proposed in [9] for RFIR filters, specificallyfor SDR channelizer. Recently, Park and Meher [10] haveproposed an interesting DA-based architecture for RFIR filter.The existing multiplier-based structures use either direct-form configuration or transpose form configuration. But, themultiplier-less structures of [9] use transpose form configura-tion, whereas the DA-based structure of [10] uses direct-formconfiguration. But, we do not find any specific block-baseddesign for RFIR filter in the literature. A block-based RFIRstructure can easily be derived using the scheme proposedin [15] and [16]. But, we find that the block structure obtainedfrom [15] and [16] is not efficient for large filter lengths andvariable filter coefficients, such as SDR channelizer. Therefore,the design methods proposed in [15] and [16] are more suitablefor 2-D FIR filters and block least mean square adaptive filters.

In this paper, we explore the possibility of realization ofblock FIR filter in transpose form configuration in orderto take advantage of the MCM schemes and the inherentpipelining for area-delay efficient realization of large orderFIR filters for both fixed and reconfigurable applications. Themain contributions of this paper are as follows.

1) Computational analysis of transpose form configura-tion of FIR filter and derivation of flow graph fortranspose form block FIR filter with reduced registercomplexity.

2) Block formulation for transpose form FIR filter.3) Design of transpose form block filter for reconfigurable

applications.4) A low-complexity design method using MCM scheme

for the block implementation of fixed FIR filters.The remainder of this paper is organized as follows.

In Section II, computational analysis and mathematicalformulation of block transpose form FIR filter are presented.The proposed architectures for fixed and reconfigurableapplications are presented in Section III. Hardware and timecomplexities along with performance comparison arepresented in Section IV. Finally, the conclusion is drawnin Section V.

II. COMPUTATIONAL ANALYSIS AND MATHEMATICAL

FORMULATION OF BLOCK TRANSPOSE

FORM FIR FILTER

The output of an FIR filter of length N can be computedusing the relation

y(n) =N−1∑

i=0

h(i) · x(n − i). (1)

The computation of (1) can be expressed by the recurrencerelation

Y (z) = [z−1(· · · (z−1(z−1h(N − 1) + h(N − 2))+ h(N − 3))

· · · + h(1)) + h(0)]X (z). (2)

A. Computational Analysis

The data-flow graphs (DFG-1 and DFG-2) of transpose formFIR filter for filter length N = 6, as shown in Fig. 1, for

Fig. 1. DFG of transpose form structure for N = 6. (a) DFG-1 foroutput y(n). (b) DFG-2 for output y(n − 1).

Fig. 2. (a) DFT of multipliers of DFG shown in Fig. 1(a) corresponding tooutput y(n). (b) DFT of multipliers of DFG shown in Fig. 1(b) correspondingto output y(n − 1). Arrow: accumulation path of the products.

a block of two successive outputs {y(n), y(n − 1)} that arederived from (2). The product values and their accumulationpaths in DFG-1 and DFG-2 of Fig. 1 are shown in data-flow tables (DFT-1 and DFT-2) of Fig. 2. The arrows inDFT-1 and DFT-2 of Fig. 2 represent the accumulation path ofthe products. We find that five values of each column of DFT-1are same as those of DFT-2 (shown in gray color in Fig. 2).These redundant computation of DFG-1 and DFG-2 can beavoided using nonoverlapped sequence of input blocks, asshown in Fig. 3. DFT-3 and DFT-4 of DFG-1 and DFG-2for nonoverlapping input blocks are, respectively, shownin Fig. 3(a) and (b). As shown in Fig. 3(a) and (b), DFT-3and DFT-4 do not involve redundant computation. It is easyto find that the entries in gray cells in DFT-3 and DFT-4 ofFig. 3(a) and (b) correspond to the output y(n), whereas theother entries of DFT-3 and DFT-4 correspond to y(n −1). TheDFG of Fig. 1 needs to be transformed appropriately to obtainthe computations according to DFT-3 and DFT-4.

B. DFG Transformation

The computation of DFT-3 and DFT-4 can be realized byDFG-3 of nonoverlapping blocks, as shown in Fig. 4. We refer

Page 3: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

MOHANTY AND MEHER: HIGH-PERFORMANCE FIR FILTER ARCHITECTURE 3

Fig. 3. DFT of DFG-1 and DFG-2 for three nonoverlapped input blocks[x(n), x(n − 1)], [x(n − 2), x(n − 3)], and [x(n − 4), x(n − 5)]. (a) DFT-3 forcomputation of output y(n). (b) DFT-4 for computation of output y(n − 1).

Fig. 4. Merged DFG (DFG-3: transpose form type-I configuration for blockFIR structure).

Fig. 5. DFG-4 (retimed DFG-3) transpose form type-II configuration forblock FIR structure.

it to block transpose form type-I configuration of block FIRfilter. The DFG-3 can be retimed to obtain the DFG-4 of Fig. 5,which is referred to block transpose form type-II configuration.Note that both type-I and type-II configurations involve thesame number of multipliers and adders, but type-II configura-tion involves nearly L times less delay elements than those of

type-I configuration. We have, therefore, used block transposeform type-II configuration to derive the proposed structure.In Section II-C, we present mathematical formulation of blocktranspose form type-II FIR filter for a generalized formulationof the concept of block-based computation of transpose formFIR filers.

C. Mathematical Formulation of the TransposeForm Block FIR Filter

Suppose in every cycle, the block FIR filter takes a blockof L new input samples, and processes those to produce ablock of L output samples. The kth block of filter output yk iscomputed using the relation

yk = Xk · h (3)

where the weight vector h is defined as

h = [h(0), h(1), . . . , h(N − 1)]T.

The input matrix Xk is defined as

Xk = [x0

k x1k . . . x4

k . . . xN−1k

](4)

where xik is the (i + 1)th column of Xk are defined as

xik =[x(kL − i) x(kL − i − 1) · · · x(kL − i − L + 1)]T. (5)

Substituting (4) in (3), the matrix-vector product is expressedin the form of scalar–vector product as

yk =N−1∑

i=0

xik · h(i). (6)

Suppose N is a composite number and decomposed asN = M L, then index i is expressed as i = l + mL, for0 ≤ l ≤ L − 1, and 0 ≤ m ≤ M − 1. Substituting i = l + mLin (5), we have

xl+mLk = xl

k−m . (7)

Substituting (7) in (4), we have

Xk = [x0

k x1k · · · xL−1

k x0k−1 x1

k−1 · · · xL−1k−1 · · ·

x0k−M+1 x1

k−M+1 · · · xL−1k−M+1

]. (8)

Substituting (8) in (3), we have

yk =L−1∑

l=0

M−1∑

m=0

xlk−m · h(l + mL). (9)

The input matrix Xk of (8) has an interestingfeature. The data block x0

k is the current block,while {x0

k−1, x0k−2, . . . , x0

k−M+1} are blocks delayedby 1, 2, . . . , (M − 1) cycles. The overlapped blocks{x1

k−1, x1k−2, . . . , x1

k−L+1} are, respectively, 1 clock cycle,2 clock cycles, . . . , (M − 1) cycles delayed versionof overlapped block x1

k . To take the advantage ofthis feature, the input-matrix Xk is decomposed intoM small matrices Sl

k , such that S0k contains L input-

blocks {x0k , x1

k , . . . , xL−1k }, and S1

k contains input blocks

{x0k−1, x1

k−1, . . . , xL−1k−1 }. Similarly, the input block

{x0k−M+1, x1

k−M+1, . . . , xL−1k−M+1} constitute the matrix SM−1

k .

Page 4: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

4 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

The coefficient vector h is also decomposed into small weightvectors cm = {h(mL), h(mL + 1), . . . , h(mL + L − 1)}.Interestingly, Sm

k is symmetric and satisfy the followingidentity:

Smk = S0

k−m . (10)

According to (10), Smk (for 1 ≤ m ≤ M − 1) are m clock

cycle delayed with respect to S0k . Computation of (9) can be

expressed in matrix-vector product using S0k−m and cm as

yk =M−1∑

m=0

rmk (11a)

rmk = S0

k−m · cm . (11b)

The computations of (11) may be expressed in a recurrenceform

Y(z) = S0(z)[(z−1(· · · (z−1(z−1cM−1 + cM−2) + cM−3)

+ · · · ) + c1) + c0] (12)

where S0(z) and Y(z) are the z-domain representation ofS0

k and yk , respectively.The DFG-4 of block transpose form type-II configuration

(shown in Fig. 5 for N = 6 and L = 2) can be derivedusing the recurrence relation of (12). The delay operator {z−1}of (12) represents a delay for a block of data in the transposeform type-II structure that stores the product of S0

k and cm .The proposed structure (transpose form type-II) is presentedin Section III.

III. PROPOSED STRUCTURES

There are several applications where the coefficients ofFIR filters remain fixed, while in some other applications,like SDR channelizer that requires separate FIR filters ofdifferent specifications to extract one of the desired narrow-band channels from the wideband RF front end. These FIRfilters need to be implemented in a RFIR structure to supportmultistandard wireless communication [6]. In this section, wepresent a structure of block FIR filter for such reconfigurableapplications. In this section, we discuss the implementation ofblock FIR filter for fixed filters as well using MCM scheme.

A. Proposed Structure for Transpose Form Block FIRFilter for Reconfigurable Applications

The proposed structure for block FIR filter is [based onthe recurrence relation of (12)] shown in Fig. 6 for theblock size L = 4. It consists of one coefficient selectionunit (CSU), one register unit (RU), M number of inner-product units (IPUs), and one pipeline adder unit (PAU).The CSU stores coefficients of all the filters to be usedfor the reconfigurable application. It is implemented usingN ROM LUTs, such that filter coefficients of any particularchannel filter are obtained in one clock cycle, where N is thefilter length. The RU [shown in Fig. 7(a)] receives xk duringthe kth cycle and produces L rows of S0

k in parallel. L rows ofS0

k are transmitted to M IPUs of the proposed structure. TheM IPUs also receive M short-weight vectors from the CSU

Fig. 6. Proposed structure for block FIR filter.

Fig. 7. (a) Internal structure of RU for block size L = 4. (b) Structure of(m + 1)th IPU.

such that during the kth cycle, the (m + 1)th IPU receivesthe weight vector cM−m−1 from the CSU and L rows of S0

k

form the RU. Each IPU performs matrix-vector product of S0k

with the short-weight vector cm , and computes a block ofL partial filter outputs (rm

k ). Therefore, each IPU performsL inner-product computations of L rows of S0

k with a commonweight vector cm . The structure of the (m +1)th IPU is shownin Fig. 7(b). It consists of L number of L-point inner-productcells (IPCs). The (l +1)th IPC receives the (l +1)th row of S0

kand the coefficient vector cm , and computes a partial result ofinner product r(kL − l), for 0 ≤ l ≤ L − 1. Internal structureof (l + 1)th IPC for L = 4 is shown in Fig. 8(a). All the MIPUs work in parallel and produce M blocks of result (rm

k ).These partial inner products are added in the PAU [shownin Fig. 8(b)] to obtain a block of L filter outputs. In eachcycle, the proposed structure receives a block of L inputs andproduces a block of L filter outputs, where the duration of eachcycle is T = TM + TA + TFA log2 L, TM is one multiplierdelay, TA is one adder delay, and TFA is one full-adder delay.

B. MCM-Based Implementation of Fixed-CoefficientFIR Filter

We discuss the derivation of MCM units for transpose formblock FIR filter, and the design of proposed structure for fixedfilters. For fixed-coefficient implementation, the CSU of Fig. 6

Page 5: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

MOHANTY AND MEHER: HIGH-PERFORMANCE FIR FILTER ARCHITECTURE 5

Fig. 8. (a) Internal structure of (l + 1)th IPC for L = 4. (b) Structure ofPAU for block size L = 4.

is no longer required, since the structure is to be tailored foronly one given filter. Similarly, IPUs are not required. Themultiplications are required to be mapped to the MCM unitsfor a low-complexity realization. In the following, we showthat the proposed formulation for MCM-based implementationof block FIR filter makes use of the symmetry in input matrixS0

k to perform horizontal and vertical common subexpressionelimination [17] and to minimize the number of shift-addoperations in the MCM blocks.

The recurrence relation of (12) can alternatively beexpressed as

Y(z) = z−1 · · · z−1(z−1rM−1 + rM−2 + rM−3)

+ · · · + r1 + r0. (13)

The M intermediate data vectors rm , for 0 ≤ m ≤ M − 1 canbe computed using the relation

R = S0k · C (14)

where R and C are defined as

R = [rT

0 rT1 · · · rT

M−1

](15a)

C = [cT

0 cT1 · · · cT

M−1

]. (15b)

To illustrate the computation of (14) for L = 4 and N = 16,we write it as a matrix product given by (16). From (16),we can observe that the input matrix contains six-inputsamples {x(4k), x(4k − 1), x(4k − 2), x(4k − 3), x(4k − 4),x(4k − 5), x(4k − 6)}, and multiplied with several constantcoefficients, as shown in Table I.

As shown in Table I, MCM can be applied in both horizontaland vertical direction of the coefficient matrix. The samplex(4k−3) appears in four rows or four columns of the following

TABLE I

MCM IN TRANSPOSE FORM BLOCK FIR FILTER OF

LENGTH 16 AND BLOCK SIZE 4

input matrix:

R =

⎡⎢⎢⎣

x(4k) x(4k − 1) x(4k − 2) x(4k − 3)x(4k − 1) x(4k − 2) x(4k − 3) x(4k − 4)x(4k − 2) x(4k − 3) x(4k − 4) x(4k − 5)x(4k − 3) x(4k − 4) x(4k − 5) x(4k − 6)

⎤⎥⎥⎦

×

⎢⎢⎣

h(0) h(4) h(8) h(12)h(1) h(5) h(9) h(13)h(2) h(6) h(10) h(14)h(3) h(7) h(11) h(15)

⎥⎥⎦ (16)

whereas x(4k) appears in only one row or one column.Therefore, all the four rows of coefficient matrix are involvedin the MCM for the x(4k − 3), whereas only the first row ofcoefficients are involved in the MCM for x(4k). For largervalues of N or the smaller block sizes, the row size of thecoefficient matrix is larger that results in larger MCM sizeacross all the samples, which results into larger saving incomputational complexity.

The proposed MCM-based structure for FIR filters for blocksize L = 4 is shown in Fig. 9 for the purpose of illustra-tion. The MCM-based structure (shown in Fig. 9) involvessix MCM blocks corresponding to six input samples. EachMCM block produces the necessary product terms as listed inTable I. The subexpressions of the MCM blocks are shift addedin the adder network to produce the inner-product values (rl,m ),for 0 ≤ l ≤ L − 1 and 0 ≤ m ≤ (N/L) − 1 correspondingto the matrix product of (14). The inner-product values arefinally added in the PAU of Fig. 8(b) to obtain a block offilter output.

IV. COMPLEXITIES AND PERFORMANCE CONSIDERATIONS

A. Hardware and Time Complexities

The proposed structure for reconfigurable application con-sists of one CSU, one RU, M IPUs, and one PAU. The CSU

Page 6: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

6 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE II

GENERAL COMPARISON OF HARDWARE AND TIME COMPLEXITIES

Fig. 9. Proposed MCM-based structure for fixed FIR filter of block sizeL = 4 and filter length N = 16.

consists of N ROM units of P words each, where P is thenumber of FIR filters to be implemented by the proposedreconfigurable structure. We have excluded complexity of CSUin the performance comparison, since it is common in all theRFIR structures. Each IPU is comprised of L IP cells, whereeach IP cell involves L multipliers and (L−1) adders. The RUinvolves (L − 1) registers of B-bit width. The PAU involvesL(M−1) adders and the same number of registers, where eachregister has a width of (B + B ′), B , and B ′ respectively, beingthe bit width of input sample and filter coefficients. Therefore,the proposed structure involves L N multipliers, L(N − 1)adders, and [B(N − 1) + B ′(N − L)] (flip flops) FFs; andprocesses L samples in every cycle where the duration ofcycle period T = [TM + TA + TFA(log2 L)]. We do not find amultiplier-based direct-form block FIR structure on RFIR inthe literature. However, direct-form multiplier-based block FIRstructure can be derived from the block formulation of [15].We have derived the direct-form block FIR structure using[15, eq. (4)], and estimated its hardware and time complexitiesfor comparison purpose.

B. Performance Comparison

The hardware and time complexities of the proposedstructure and the extracted direct-from structure of [15] alongwith those of the existing RFIR filter structures in [9] and [10]are listed in Table II for comparison. We have assumed fixedword length (B + B ′) for the adder tree in case of direct-formstructure, as well as the pipeline adder in case of transpose

form structure. As shown in Table II, the direct-form structureof [15] and the proposed structures involve the same numberof multipliers and adders, but the proposed one involves{(log2 M − 1)TFA} less cycle period, where M = N/L, at amarginal cost of B ′(N − L) FFs. The register complexity ofthe proposed structure is independent of block size as in thecase of direct-form structure. However, the cycle period of theproposed structure depends on the input-block size, whereasin case of the existing direct-form block FIR structure of[15], it depends on the filter length. Since filter length isusually higher than the block length, the cycle period ofthe existing direct-form structure increase for large orderfilters. To compare with the DA-based structure of [10] andthe proposed structure, we find that the proposed structureinvolves (L N) multipliers in place of (3N B B ′/2) MUXes(bit level), nearly (2L/B) times more adders and B ′Nmore FFs, but offers nearly L times higher throughput.Similarly, compared with the CSM-based structureof [9], the proposed structure involve (L N) multipliers inplace of ≈(7N B B ′/3) MUXes (bit level), ≈(3L/B ′) timesmore adders, B ′(L − 1) less FFs, and offers L times higherthroughput. In spite of more FFs, the proposed structuremay have less area-delay product (ADP) and less energy persample (EPS) than the existing direct-form structure due toits small cycle period.

We have estimated hardware and time complexities of theproposed structure for block sizes L = 4, 8, and filter lengthsN = 32 and 64. Also, we have estimated the hardware andthe time complexities of the direct-form structure extractedfrom [15] for the same block size and for the filter lengths,and those in [9] and [10] for the same filter lengths. We haveconsidered B = 8 (word length of input sample), B ′ = 16(word length of filter coefficient), and 24-bit word length forthe intermediate and output signals for all the designs. Theestimated values are listed in Table III for comparison. We canfind from Table III that the multiplier and adder complexitiesof the proposed structure increases proportionately with blocksize and filter length as in the case of direct-form structure. Thecycle period of direct-form structure increases proportionatelywith the filter length such that it increases by an amount TFAwhen filter-length doubles. But, cycle period of the proposedstructure is independent of filter length and increases by anamount TFA when the block size doubles. The area-delayperformance of the proposed structure is found better than thatof direct-form structure of [15] for higher filter lengths due tosmaller cycle period. Besides, the proposed structure supports

Page 7: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

MOHANTY AND MEHER: HIGH-PERFORMANCE FIR FILTER ARCHITECTURE 7

TABLE III

THEORETICALLY ESTIMATED HARDWARE AND TIME COMPLEXITIES OF PROPOSED AND EXISTING STRUCTURES FOR B = 8 AND B ′ = 16

MCM scheme when fixed-coefficient filters are implemented,whereas direct-form structure does not support MCM scheme.We have shown that the proposed structure offers both horizon-tal and vertical MCM, which can be exploited in the proposedstructure to reduce the area complexity substantially furthercompared with the direct-form structure for the implementa-tion of fixed filters.

Compared with the direct-form structure, the proposedstructure for block size 4 involves 192, 448, 960, and 1984more FFs and its cycle period less by TFA, 2TFA, 3TFA, and4TFA for filter lengths 16, 32, 64, and 128, respectively. Theproposed structure involves more number of FFs than thedirect-form structure for higher filter lengths, but the excessarea due to those FFs is very small compared with the totalarea of the direct-form structure. On the other hand, the savingin cycle period in the transpose form structure for higherfilter lengths is significant with respect to the cycle periodof direct-form structure. Therefore, the overall ADP of theproposed structure is found to be less than that of direct-form structure [15] for higher filter lengths. Compared withthe structure of [10], the proposed structure for N = 64and L = 4 involves 256 multipliers against 13 824 bit-levelMUXes, 35 less adders, 48 less FFs, and offers nearly fourtimes higher throughput rate. For the same block size andfilter length, the proposed structure involves 256 multipliersagainst 28 416 bit-level MUXes, 198 less adders, 48 less FFsthan those of the structure of [9], and it offers more than fourtimes higher throughput rate due to its smaller cycle period.

C. Synthesis Results

We have coded the proposed structure in VHDL for filterlengths 16, 32, and 64 and block sizes 4 and 8. In addition,we have coded the direct-form block FIR structure extracted

from [15] for the same filter lengths and the same blocksizes, and the structures in [9] and [10] for the same fil-ter lengths. We have considered B = 8, B ′ = 16, and24-bit word length for the intermediate and the output signalsof all the designs. All the designs are synthesized usingSynopsys Design Compiler TMSC 65-nm CMOS library. Thearea, the minimum clock period (MCP), and power estimatesobtained from the synthesis reports generated by the DesignCompiler are listed in Table IV for comparison. As shown inTable IV, the proposed structure involves more area and con-sumes more power than the existing direct-form structure [15]due to extra FFs. But, it has less MCP (higher samplingfrequency,1 block size = 1, in case of [9] and [10]) thanthe corresponding direct-form structure of [15] due to shortercritical path. We have estimated the increase in area (�A) andreduction in MCP (�T ) of proposed structure over the direct-form structure of [15] for different block sizes and differentfilter lengths. Graphs are plotted using these estimated valuesand shown in Figs. 10 and 11. Note that the ADP variesdirectly with (�A), whereas it varies inversely with (�T ). Asshown in Figs. 10 and 11, the intersection point of two curves{�A and �T } gives a filter length (N0), where the direct-form structure of [15] and proposed structure have nearly thesame ADP. For N < N0, (�A) is higher than (�T ) andproposed structure has higher ADP than that of direct-fromstructure of [15]. Similarly, for N > N0, the (�T ) is higherthan (�A) and the proposed structure has less ADP than thedirect-from structure of [15]. The N0 shift marginally towardhigher value for higher block sizes due to increase in MCP ofthe proposed structure.

1Sampling frequency = block-size/MCP.

Page 8: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

8 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

TABLE IV

SYNTHESIS RESULTS OF PROPOSED STRUCTURES AND EXISTING

STRUCTURES USING TMSC 65-nm CMOS STANDARD CELL

LIBRARY, N : FILTER LENGTH. POWER ESTIMATED AT MCP

Fig. 10. Increase in area and decrease in minimum cycle period of proposedstructure with respect to direct-form structure of [15]. (a) For block size 4.(b) For block size 8. N0 is the filter length for the direct-form structure of [15]and the proposed structures have nearly the same area-delay efficiency.

We have estimated ADP2 and EPS3 of proposed structureand direct-form structure of [15]. We have also estimated thereduction of ADP (RADP) and reduction of EPS (REPS) of theproposed structure over the direct-form structure of [15]. Theestimated RADP and REPS values are shown in the bar chart

2ADP = area/sampling frequency.3EPS = power/sampling frequency.

Fig. 11. RADP and REPS of proposed structure compared with that ofdirect-form structure of [15]. (a) RADP. (b) REPS.

Fig. 12. (a) Comparison of ADP. (b) Comparison of EPS.

of Fig. 11 for comparison. As shown in Fig. 11, the proposedstructure has higher ADP and EPS (negative RADP and REPS)than the existing direct-form structure of [15] for small filterlengths, while it has less ADP and less EPS (positive RADPand REPS) for filter length N ≥ 32. The proposed structureoffers higher saving of ADP and EPS saving than the existingdirect-form structure [15] for higher filter lengths.

We have estimated ADP and EPS of the existing struc-tures of [9] and [10], and these estimated values along withthose of the proposed structures and the direct-form structureof [15] are shown in bar charts of Fig. 12 for comparison.As shown in Fig. 12, the proposed structure has significantlyless ADP and EPS than the existing multiplier-less structuresof [9] and [10]. Compared with the existing direct-formstructure of [15], the proposed structure offers higher ADPand EPS saving for medium and large filter lengths, while theexisting direct-form structure has higher ADP and EPS savingthan the proposed structure for short-length filters. Comparedwith the structure of [10] that is the best among the existingdesigns, the block-proposed structure for block size 4 andfilter length 64 involve 42% less ADP and 40% less EPS,respectively.

Page 9: IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI ... IEEE PAPERS/vlsi/a high.pdf · IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 ... 2015. B. K. Mohanty

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.

MOHANTY AND MEHER: HIGH-PERFORMANCE FIR FILTER ARCHITECTURE 9

V. CONCLUSION

In this paper, we have explored the possibility of realizationof block FIR filters in transpose form configuration for area-delay efficient realization of both fixed and reconfigurableapplications. A generalized block formulation is presentedfor transpose form block FIR filter, and based on that wehave derived transpose form block filter for reconfigurableapplications. We have presented a scheme to identify the MCMblocks for horizontal and vertical subexpression elimination inthe proposed block FIR filter for fixed coefficients to reducethe computational complexity. Performance comparison showsthat the proposed structure involves significantly less ADPand less EPS than the existing block direct-form structurefor medium or large filter lengths while for the short-lengthfilters, the existing block direct-form structure has less ADPand less EPS than the proposed structure. Application-specificintegrated circuit synthesis result shows that the proposedstructure for block size 4 and filter length 64 involve 42%less ADP and 40% less EPS than the best available FIR filterstructure of [10] for reconfigurable applications. For the samefilter length and the same block size, the proposed structureinvolves 13% less ADP and 12.8% less EPS than that of theexisting direct-from block FIR structure of [15].

REFERENCES

[1] J. G. Proakis and D. G. Manolakis, Digital Signal Processing:Principles, Algorithms and Applications. Upper Saddle River, NJ, USA:Prentice-Hall, 1996.

[2] T. Hentschel and G. Fettweis, “Software radio receivers,” inCDMA Techniques for Third Generation Mobile Systems. Dordrecht,The Netherlands: Kluwer, 1999, pp. 257–283.

[3] E. Mirchandani, R. L. Zinser, Jr., and J. B. Evans, “A new adaptive noisecancellation scheme in the presence of crosstalk [speech signals],” IEEETrans. Circuits Syst. II, Analog Digit. Signal Process., vol. 39, no. 10,pp. 681–694, Oct. 1995.

[4] D. Xu and J. Chiu, “Design of a high-order FIR digital filtering andvariable gain ranging seismic data acquisition system,” in Proc. IEEESoutheastcon, Apr. 1993, p. 1–6.

[5] J. Mitola, Software Radio Architecture: Object-Oriented Approaches toWireless Systems Engineering. New York, NY, USA: Wiley, 2000.

[6] A. P. Vinod and E. M. Lai, “Low power and high-speed implementationof FIR filters for software defined radio receivers,” IEEE Trans. WirelessCommun., vol. 7, no. 5, pp. 1669–1675, Jul. 2006.

[7] J. Park, W. Jeong, H. Mahmoodi-Meimand, Y. Wang, H. Choo, andK. Roy, “Computation sharing programmable FIR filter for low-powerand high-performance applications,” IEEE J. Solid State Circuits,vol. 39, no. 2, pp. 348–357, Feb. 2004.

[8] K.-H. Chen and T.-D. Chiueh, “A low-power digit-based reconfigurableFIR filter,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 53, no. 8,pp. 617–621, Aug. 2006.

[9] R. Mahesh and A. P. Vinod, “New reconfigurable architectures for imple-menting FIR filters with low complexity,” IEEE Trans. Comput.-AidedDesign Integr. Circuits Syst., vol. 29, no. 2, pp. 275–288, Feb. 2010.

[10] S. Y. Park and P. K. Meher, “Efficient FPGA and ASIC realizationsof a DA-based reconfigurable FIR digital filter,” IEEE Trans. CircuitsSyst. II, Exp. Briefs, vol. 61, no. 7, pp. 511–515, Jul. 2014.

[11] P. K. Meher, “Hardware-efficient systolization of DA-based calculationof finite digital convolution,” IEEE Trans. Circuits Syst. II, Exp. Briefs,vol. 53, no. 8, pp. 707–711, Aug. 2006.

[12] P. K. Meher, S. Chandrasekaran, and A. Amira, “FPGA realizationof FIR filters by efficient and flexible systolization using distributedarithmetic,” IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3009–3017,Jul. 2008.

[13] P. K. Meher, “New approach to look-up-table design and memory-based realization of FIR digital filter,” IEEE Trans. CircuitsSyst. I, Reg. Papers, vol. 57, no. 3, pp. 592–603, Mar. 2010.

[14] K. K. Parhi, VLSI Digital Signal Processing Systems: Design andImplementation. New York, NY, USA: Wiley, 1999.

[15] B. K. Mohanty and P. K. Meher, “A high-performance energy-efficientarchitecture for FIR adaptive filter based on new distributed arithmeticformulation of block LMS algorithm,” IEEE Trans. Signal Process.,vol. 61, no. 4, pp. 921–932, Feb. 2013.

[16] B. K. Mohanty, P. K. Meher, S. Al-Maadeed, and A. Amira, “Memoryfootprint reduction for power-efficient realization of 2-D finite impulseresponse filters,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 61, no. 1,pp. 120–133, Jan. 2014.

[17] R. Mahesh and A. P. Vinod, “A new common subexpression eliminationalgorithm for realizing low-complexity higher order digital filters,” IEEETrans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 2,pp. 217–219, Feb. 2008.

[18] S. A. White, “Applications of distributed arithmetic to digitalsignal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, no. 3,pp. 4–19, Jul. 1989.

Basant Kumar Mohanty (M’06–SM’11) receivedthe M.Sc. degree in physics from SambalpurUniversity, Sambalpur, India, in 1989, and thePh.D. degree in VLSI for digital signal process-ing from Berhampur University, Berhampur, India,in 2000.

He joined as a Lecturer with the Department ofElectrical and Electronics Engineering, BITS Pilani,Pilani, India, in 2001, and an Assistant Professorwith the Department of Electronics and Commu-nication Engineering, Mody Institute of Education

Research (Deemed University), Sikar, India. In 2003, he joined the JaypeeUniversity of Engineering and Technology, Guna, India, where he became anAssociate Professor in 2005 and a Full Professor in 2007. He has authorednearly 50 technical papers in journals and conferences of international repute.His current research interests include design and implementation of low-powerand high-performance systems for adaptive filters, image and video-processingapplications, secured communication, and reconfigurable architectures.

Dr. Mohanty is a Lifetime Member of the Institution of Electronics andTelecommunication Engineering, New Delhi, India. He was a recipient ofthe Rashtriya Gaurav Award from the India International Friendship Society,New Delhi, in 2012. He serves as an Associate Editor of the Journal ofCircuits, Systems, and Signal Processing.

Pramod Kumar Meher (SM’03) received theB.Sc. (Hons.) and M.Sc. degrees in physics, and thePh.D. degree in science from Sambalpur University,Sambalpur, India, in 1976, 1978, and 1996,respectively.

He was a Reader in Electronics with BerhampurUniversity, Berhampur, India, from 1993 to1997, and a Professor of Computer Applicationswith Utkal University, Bhubaneswar, India, from1997 to 2002. He is currently a Senior ResearchScientist with Nanyang Technological University,

Singapore. He has contributed over 200 technical papers to various reputedjournals and conference proceedings. His current research interests includedesign of dedicated and reconfigurable architectures for computation-intensivealgorithms pertaining to signal, image and video processing, communication,bioinformatics, and intelligent computing.

Dr. Meher is a fellow of the Institution of Electronics andTelecommunication Engineers, India. He was a recipient of the SamantaChandrasekhar Award for excellence in research on engineering andtechnology in 1999. He was a Speaker for the Distinguished LecturerProgram of the IEEE Circuits Systems Society from 2011 to 2012,and an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITSAND SYSTEMS—II: EXPRESS BRIEFS from 2008 to 2011, the IEEETRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS from2012 to 2013, and the IEEE TRANSACTIONS ON VERY LARGE SCALE

INTEGRATION SYSTEMS from 2009 to 2014. He serves as an AssociateEditor of the Journal of Circuits, Systems, and Signal Processing andIntegration, the VLSI Journal.