120 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, VOL. 61, NO. 1, JANUARY 2014

Memory Footprint Reduction for Power-Efficient Realization of 2-D Finite Impulse Response Filters

Basant K. Mohanty, Senior Member, IEEE, Pramod K. Meher, Senior Member, IEEE, Somaya Al-Maadeed, Senior Member, IEEE, and Abbes Amira, Senior Member, IEEE

Abstract: We have analyzed memory footprint and combinational complexity to arrive at a systematic design strategy to derive area-delay-power-efficient architectures for two-dimensional (2-D) finite impulse response (FIR) filters. We have presented novel block-based structures for separable and non-separable filters with less memory footprint by memory sharing and memory-reuse along with appropriate scheduling of computations and design of storage architecture. The proposed structures involve $L$ times less storage per output (SPO), and nearly $L$ times less energy consumption per output (EPO) compared with the existing structures, where $L$ is the input block-size. They involve $L$ times more arithmetic resources than the best of the corresponding existing structures, and produce $L$ times more throughput with less memory band-width (MBW) than others. We have also proposed separate generic structures for separable and non-separable filter-banks, and a unified structure of filter-bank constituting symmetric and general filters. The proposed unified structure for 6 parallel filters involves more multipliers and more adders but fewer registers than the similar existing unified structure, and computes more filter outputs per cycle with less MBW than the existing design; the corresponding factors are expressed in terms of $N$, the FIR filter size in each dimension. ASIC synthesis results show that for filter size ($4 \times 4$), image-size ($512 \times 512$) and the input block-size considered, the proposed block-based non-separable and generic non-separable structures, respectively, involve 5.95 times and 11.25 times less area-delay-product (ADP), and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the corresponding existing structure.

Index Terms: Block processing, 2-dimensional (2-D) finite impulse response (FIR), digital filters, VLSI architecture.

Manuscript received December 09, 2012; revised February 16, 2013; accepted March 31, 2013. Date of publication June 17, 2013; date of current version January 06, 2014. This work was supported by an internal grant from Qatar University (QUUG-CENG-DCS-11/12-7). The statements made herein are solely the responsibility of the authors. This paper was recommended by Associate Editor F. Clermidy.

B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee University of Engineering and Technology, Raghogarh, Guna, Madhya Pradesh 473226, India (e-mail: [email protected]).
P. K. Meher is with the Institute for Infocomm Research, 138632 Singapore (e-mail: [email protected]).
S. Al-Maadeed is with the Department of Computer Science and Engineering, Qatar University, Doha, Qatar (e-mail: [email protected]).
A. Amira is with the School of Computing, University of West Scotland, Paisley, Scotland, UK (e-mail: [email protected]).
Digital Object Identifier 10.1109/TCSI.2013.2265953

I. INTRODUCTION

TWO-DIMENSIONAL (2-D) digital filters are frequently used in 2-D signal processing as well as in image and video processing applications such as image enhancement, image restoration [1], template matching [2], face recognition, feature extraction for bio-metric systems [3]–[5], and video communication. The 2-D FIR filters are more popularly used than their infinite impulse response (IIR) counterparts due to their numerical stability and simplicity of design. The system function of a 2-D FIR filter is often non-separable, while in a few cases it is separable, namely when the impulse response $h(i,j)$ can be expressed as $h_1(i) \cdot h_2(j)$. The non-separable and separable system functions are, respectively, given as:

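$$H(z_1, z_2) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} h(i,j)\, z_1^{-i} z_2^{-j} \tag{1}$$

$$H(z_1, z_2) = H_1(z_1)\, H_2(z_2) = \left[ \sum_{i=0}^{N-1} h_1(i)\, z_1^{-i} \right] \left[ \sum_{j=0}^{N-1} h_2(j)\, z_2^{-j} \right] \tag{2}$$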

where $h(i,j)$ is the impulse response matrix of the non-separable 2-D FIR filter of size $(N \times N)$, while $h_1(i)$ and $h_2(j)$ are the impulse responses of the 1-D FIR filters used for row-wise and column-wise processing of the 2-D input.

Block diagrams of the conventional realization of non-separable and separable 2-D FIR filters are shown in Fig. 1. As shown in this figure, both these filter structures consist of two types of hardware components: (i) the combinational component and (ii) the memory or storage component. The combinational component consists mainly of the arithmetic circuits along with some steering logic like multiplexers and demultiplexers, while the storage component consists of transposition buffers and/or shift-registers to provide appropriate data to the combinational units. The non-separable structure uses shift-registers to introduce the necessary row-delays for the processing of intermediate data, while the separable structure uses shift-registers for transposition of intermediate data. We can find from (1) that a non-separable 2-D FIR filter of size $(N \times N)$ involves $(N-1)$ shift-registers (SRs) of size $M$ each, $N(N-1)$ registers (for row-column processing), and $N^2$ multipliers and $(N^2-1)$ adders to compute one filter output per cycle. Similarly, we can find from (2) that the separable 2-D filter of size $(N \times N)$ involves $(N-1)$ SRs of $M$ words each, $(N-1)$ registers, $2N$ multipliers and $2(N-1)$ adders to compute one filter output per cycle. Combinational and memory (register) complexities of full-parallel non-separable and separable filters are given in Table I. Since the image size $(M \times M)$ is higher than the filter-size $(N \times N)$ by more than an order of magnitude in most image-processing applications, memory becomes the dominant component of the hardware complexity of 2-D FIR structures, and consumes the major share of chip-area and total power consumption.
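The relation between the two system functions can be checked numerically. The following is a minimal Python sketch, not taken from the paper: the function names are hypothetical, samples outside the image are assumed to be zero, and the choice of which 1-D filter acts across rows is an assumption made only for illustration. It evaluates a non-separable filter directly and a separable filter as two 1-D passes, and counts the multiplications needed per output in each case.

```python
def fir2d_nonseparable(x, h):
    """Direct evaluation: y(m, n) = sum_i sum_j h[i][j] * x[m-i][n-j]."""
    rows, cols, taps = len(x), len(x[0]), len(h)
    y = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            y[m][n] = sum(
                h[i][j] * x[m - i][n - j]
                for i in range(taps)
                for j in range(taps)
                if m - i >= 0 and n - j >= 0  # samples outside the image count as zero
            )
    return y


def fir2d_separable(x, h1, h2):
    """Two 1-D passes: first across rows with h1, then along each row with h2."""
    rows, cols, taps = len(x), len(x[0]), len(h1)
    # Intermediate signal u(m, n) = sum_i h1[i] * x[m-i][n]
    u = [[sum(h1[i] * x[m - i][n] for i in range(taps) if m - i >= 0)
          for n in range(cols)] for m in range(rows)]
    # Output y(m, n) = sum_j h2[j] * u[m][n-j]
    return [[sum(h2[j] * u[m][n - j] for j in range(taps) if n - j >= 0)
             for n in range(cols)] for m in range(rows)]


def multipliers_per_output(taps, separable):
    """Multiplications per output sample: 2N for separable, N*N otherwise."""
    return 2 * taps if separable else taps * taps


# Illustrative data (hypothetical values, integers so the comparison is exact).
h1, h2 = [1, 2, 1], [1, 0, -1]
h = [[a * b for b in h2] for a in h1]   # h(i, j) = h1(i) * h2(j)
image = [[(3 * r + 5 * c) % 7 for c in range(6)] for r in range(5)]
y_direct = fir2d_nonseparable(image, h)
y_two_pass = fir2d_separable(image, h1, h2)
```

When the impulse response factorizes, both routes give the same output, while the separable route needs only $2N$ multiplications per output instead of $N^2$.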

1549-8328 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Fig. 1. Conventional structure of 2-D FIR filter. (a) Non-separable method, (b) Separable method.

TABLE I. COMBINATIONAL AND MEMORY COMPLEXITY OF FULL-PARALLEL SEPARABLE AND NON-SEPARABLE 2-D FIR FILTERS

Some systolic architectures have been suggested for VLSI implementation of 2-D FIR filters to achieve high-throughput and low-latency implementation [6]–[11]. Recently, some efficient structures have been proposed for 2-D IIR filters [13]–[17], and it has been shown that non-separable 2-D FIR filters can also be realized efficiently using those structures. In all these existing designs, systolization [18] of the structure is considered as the major issue, and a substantially large number of delay elements are placed in the data-path to avoid global communication. Similarly, the symmetry of the impulse response matrix has been exploited to reduce the hardware and time complexities of the structures [12]–[17]. In the last four decades, several design approaches have been suggested for reducing the arithmetic complexity of one-dimensional (1-D) FIR filters [19]–[28]. All those methods are now quite mature, and can be applied to reduce the complexity of 1-D filters, which could be used for the implementation of 2-D FIR filters as well. On the other hand, memory complexity constitutes the dominant part of the overall area complexity of these structures, and plays a significant role in power-efficient realization of 2-D FIR filters. Keeping that in view, in this paper we present memory-centric designs of 2-D FIR filters for the reduction of total memory usage, memory-reuse, reduction of memory band-width and memory-sharing, which could lead to area-delay-power-efficient structures.

Many image processing applications use 2-D filter banks comprised of both separable and non-separable filters. One such application is a biometric system, where a Gabor filter bank is used for feature extraction for face recognition and finger-print matching [3]–[5]. The Gabor filter bank is generated for different center frequencies and orientations. Consequently, the constituent filters are of both separable and non-separable types, with symmetry as well as without symmetry. Recently, a unified structure has been proposed in [17] for the realization of filters with diagonal, four-fold rotational, quadrant and octal symmetries. However, this structure does not favour the realization of a filter-bank, since only one filter can be realized at a time. Moreover, separable filters cannot be realized using this structure. The main objective of the unified structure of [17] is to achieve saving of arithmetic resources, whereas in this paper we aim at presenting a unified memory-efficient structure for a filter bank with separable and non-separable 2-D FIR filters.

The main contributions of this paper are:

• Analysis of memory footprint and exploration of possible storage optimization to have a memory-efficient design strategy for the implementation of 2-D FIR filter.

• Shared memory design for separable and non-separable structures.

• Scheduling of computations and architecture design for memory-reuse and memory band-width reduction in separable filters.

• Unified structure of filter bank comprised of separable and non-separable filters with low memory footprint per output.

The rest of the paper is organized as follows: The proposed design strategy is discussed in Section II and the block formulation of 2-D FIR filter is given in Section III. Proposed structures are presented in Section IV for separable and non-separable filters. The generic structure of a filter-bank consisting of different types of filters is presented in Section V. Hardware complexity and performance of the proposed structures are discussed in Section VI. The conclusion is presented in Section VII.

II. PROPOSED DESIGN STRATEGY

To arrive at the proposed design strategy, we analyze here the memory complexities of possible configurations of the 2-D FIR filter. The system function of the non-separable 2-D FIR filter (given by (1)) can be written in a split form as:

$$H(z_1, z_2) = \sum_{i=0}^{N-1} H_i(z_2)\, z_1^{-i} \tag{3a}$$

$$H_i(z_2) = \sum_{j=0}^{N-1} h(i,j)\, z_2^{-j} \tag{3b}$$

Computations of (3a) and (3b) can each be performed by a direct-form or a transposed-form structure, which gives four possible configurations for the realization of the non-separable filter: fully-direct (direct-direct), fully-transposed (transpose-transpose), hybrid-1 (direct-transpose), and hybrid-2 (transpose-direct), as shown in Fig. 2. All these four structures require the same number of arithmetic components (multipliers and adders) and delay elements (shift-registers and fixed registers); they differ only in the locations of the delay elements in the data-path. Since the bit-widths of arithmetic units, buses and registers in the data-path are different for input signals and intermediate signals, the overall memory requirements of the different configurations differ in terms of the number of storage bits, although all of them have the same number of delay elements.¹ We have estimated the memory complexity of all four types of structures for an input image of size $512 \times 512$ (i.e., $M = 512$) and for the filter lengths considered, and listed the results in Table II for comparison. We find that the fully-direct structure has the lowest memory requirement of the four. Interestingly, the memory complexity of the fully-direct structure is independent of the word-length of intermediate signals, since all the delay elements of this structure are placed on the input path only. This is a very useful feature to be exploited for memory footprint reduction in 2-D FIR filter structures.

¹ The input image to be processed by the 2-D FIR filter consists of 8-bit pixels, while a higher bit-width is required for intermediate signals.

Fig. 2. Four different configurations for realization of 2-D FIR filter. (a) Fully-direct structure, (b) Hybrid-1 structure, (c) Hybrid-2 structure, and (d) Fully-transposed structure. The two kinds of delay blocks in the figure represent a shift-register of $M$ words and a single register, respectively.

TABLE II. MEMORY COMPLEXITY OF FULLY-DIRECT, FULLY-TRANSPOSED, HYBRID-1 AND HYBRID-2 STRUCTURES. $N$: FILTER-SIZE, $M$: INPUT IMAGE-SIZE, $w$: INPUT SIGNAL BIT-WIDTH AND $w'$: INTERMEDIATE SIGNAL BIT-WIDTH
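The effect of delay-element placement on storage can be illustrated with a small calculation. This is a rough sketch under stated assumptions rather than a reproduction of Table II: it assumes $(N-1)$ shift-registers of $M$ words plus $N(N-1)$ fixed registers, assumes that every delay element of the fully-transposed form holds an intermediate-width word, and uses a hypothetical intermediate bit-width of 16. The two hybrid configurations are not modelled, and the function names are illustrative only.

```python
def delay_elements(N, M):
    """Delay elements of one configuration: (N-1) shift-registers of M words
    plus N(N-1) fixed registers (assumed counts)."""
    return (N - 1) * M + N * (N - 1)


def storage_bits(N, M, bits_per_delay):
    """Total storage when every delay element holds a word of the given width."""
    return delay_elements(N, M) * bits_per_delay


N, M = 4, 512
W_INPUT = 8            # 8-bit pixels, as stated in the footnote
W_INTERMEDIATE = 16    # hypothetical width, chosen only for illustration

# Fully-direct: all delays sit on the input path, so they store input-width words.
fully_direct_bits = storage_bits(N, M, W_INPUT)
# Fully-transposed (assumption): all delays store intermediate-width words.
fully_transposed_bits = storage_bits(N, M, W_INTERMEDIATE)
```

With these assumed widths, the same number of delay elements costs twice as many bits in the fully-transposed form, and the fully-direct figure does not change if the intermediate width grows.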

A. Exploration of Memory-Reuse Possibilities

To explore the memory reuse possibilities, let us consider the input data-flow of the fully-direct structure for the computation of four consecutive outputs of the $m$-th row, as shown in Fig. 3 for a $(4 \times 4)$ filter ($N = 4$). The samples required to compute a given filter output are shown in a pair of curly braces and the corresponding filter output is shown at its right. Each arrow shows the source (shift-register/register) of the samples. To compute each output of the $(4 \times 4)$ filter, 16 input-samples corresponding to 4 rows and 4 columns of the 2-D input are required. Out of the 4 rows (numbered as $m$, $m-1$, $m-2$, $m-3$), the $m$-th row is the current input-row and the others are the immediate past rows. The memory-unit uses three shift-registers (SR-1, SR-2, SR-3) to buffer the 3 past rows of input samples. The required past columns of samples of a particular row are provided by a serial-in parallel-out (SIPO) shift-register of $(N-1) = 3$ words. As shown in Fig. 3, all the 16 input-samples required to compute a filter output are obtained from the memory-unit and the current input. Therefore, the memory used by the fully-direct structure to compute each output is $(3M + 12)$ words, where the SR-unit provides $3M$ words and the fixed register-unit provides 12 words. The memory band-width (MBW) of the structure is 15 (3 read operations on the SR-unit, and 12 values obtained from the register-unit to compute an output). This can be generalized to find the number of memory words used by the fully-direct structure to be $(N-1)M + N(N-1)$ and the MBW to be $(N^2 - 1)$.

As shown in Fig. 3, in total 64 input samples are required to compute the four outputs. These 64 samples belong to 4 rows and 7 columns of the input image. It can be noticed that, out of the 64 samples, 28 samples are different from each other, while the remaining 36 samples are redundant. The redundant values corresponding to each filter output are shown in the blue boxes in Fig. 3. These redundant memory accesses could be avoided by parallel computation of the four filter outputs. In that case, the four current input samples corresponding to the four outputs need to be available in each cycle. Four input values of each of the past rows are retrieved from the respective shift-registers in every cycle. To realize this, each of the shift-registers (of $M$ words) is required to be split into four equal parts of $M/4$ words. Therefore, 12 input values are obtained from the SR-unit in every cycle. Out of the 28 unique samples, 4 are obtained as input samples of the current cycle and 12 samples are obtained from the SR-unit. The remaining 12 samples are obtained from four SIPO shift-registers. The memory usage for parallel computation of 4 filter outputs is $(3M + 12)$ words, which is the same as the memory usage of the fully-direct structure for one output. The MBW for parallel computation of four filter outputs is $(12 + 12) = 24$. The memory words required to compute a block of $L$ filter outputs per cycle of the 2-D FIR filter of size $(N \times N)$ can therefore be found to be $(N-1)M + N(N-1)$, and the MBW can be found to be $(N-1)(L + N)$.

Fig. 3. Memory-unit data-flow of fully-direct structure for four consecutive outputs of the $m$-th row. Redundant memory-values are shown by blue color boxes.
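The sample counts quoted above (64 in total, 28 unique, 36 redundant) can be reproduced by enumerating the input samples that each output needs. The following short Python sketch does this; the function names and the use of (row, column) tuples are illustrative choices, not taken from the paper.

```python
def window(m, n, N):
    """Input samples (row, column) needed for the output at (m, n) of an N x N filter."""
    return [(m - i, n - j) for i in range(N) for j in range(N)]


def reuse_stats(N, L, m=0, n=0):
    """Count total, unique and redundant samples for L consecutive outputs of row m."""
    needed = [sample for k in range(L) for sample in window(m, n + k, N)]
    unique = set(needed)
    return len(needed), len(unique), len(needed) - len(unique)


total, unique, redundant = reuse_stats(N=4, L=4)
rows_touched = len({r for r, _ in set(s for k in range(4) for s in window(0, k, 4))})
cols_touched = len({c for _, c in set(s for k in range(4) for s in window(0, k, 4))})
```

For $N = 4$ and $L = 4$ the unique samples span 4 rows and 7 columns, which is why computing the four outputs in parallel needs only 28 distinct values per cycle.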

Interestingly, the storage space of the fully-direct structure of a 2-D FIR filter of size $(N \times N)$ is the same as that needed for parallel computation of $L$ filter outputs, due to memory reuse during parallel computation. Only the arithmetic resources increase proportionately with the throughput rate. This is an important feature which can be utilized for area and power saving. The memory reuse efficiency (MRE) of a block-based structure can be estimated as MRE = (total input words $-$ actual memory usage)/(actual memory usage), where the total input words is $L$ times the memory-usage of the single-input single-output (SISO) structure. For a block-based structure with block size $L$, MRE $= (L - 1)$. Therefore, the MRE of the block-based parallel structure increases proportionately with the block-size $L$. The higher the MRE, the lower the memory requirement per output. Consequently, the structure is more area-delay efficient than the SISO structure. The MBW of the block-based structure increases by a factor of $(L + N)/(N + 1)$ instead of $L$ times, where $L$ is the input block-size. This is mainly due to the number of redundant samples $R = N(N-1)(L-1)$. The higher $R$ is, the better the MBW reduction. We have estimated the memory band-width saving (MBS) using the formula MBS = ($L \times$ (MBW of SISO structure) $-$ MBW of block-based structure of block-size $L$)/($L \times$ MBW of SISO structure). Therefore, we can have MBS $= R/(L(N^2 - 1))$. Since $R$ varies with $N$ and $L$, MBS is higher for larger filters with higher input block-size.

We have estimated $R$, MBS and MRE of the parallel 2-D FIR structure for different filter sizes $(N)$ and input block-sizes $(L)$. The estimated values are listed in Table III to quantify the scope of memory reuse. We can find from Table III that a block-based realization enhances memory-reuse and reduces memory band-width per output. Therefore, we present the block formulation of 2-D FIR filters and its subsequent implementation.
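The MRE and MBS definitions above can be evaluated directly. The sketch below is a hedged illustration: it assumes a SISO memory band-width of $(N-1) + N(N-1)$ reads per output and a block band-width of $L(N-1) + N(N-1)$ reads per cycle, which match the figures 15 and 24 quoted for $N = 4$, $L = 4$, and it assumes the storage count $(N-1)M + N(N-1)$. The function names are hypothetical, and the computed values are not claimed to reproduce the entries of Table III.

```python
from fractions import Fraction


def memory_words(N, M):
    """Storage of the fully-direct structure (assumed): shift-registers plus registers."""
    return (N - 1) * M + N * (N - 1)


def mbw_siso(N):
    """Reads per output of the SISO structure: (N-1) from SRs, N(N-1) from registers."""
    return (N - 1) + N * (N - 1)


def mbw_block(N, L):
    """Reads per cycle when L outputs are computed in parallel."""
    return L * (N - 1) + N * (N - 1)


def mre(N, M, L):
    """Memory reuse efficiency: (L x SISO storage - actual storage) / actual storage."""
    actual = memory_words(N, M)   # the block structure needs the same storage as SISO
    return Fraction(L * memory_words(N, M) - actual, actual)


def mbs(N, L):
    """Memory band-width saving relative to L independent SISO computations."""
    reference = L * mbw_siso(N)
    return Fraction(reference - mbw_block(N, L), reference)


# Trend over a few illustrative filter sizes and block sizes.
trend = {(N, L): (mre(N, 512, L), mbs(N, L)) for N in (4, 8) for L in (2, 4, 8)}
```

For $N = 4$ and $L = 4$ this gives MRE $= 3$ and MBS $= 3/5$, and the saving grows when either the filter size or the block size is increased.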

B. Memory-Sharing in Generalized 2-D FIR Filter Structures

The fully-direct non-separable structure as well as the conventional separable structure use $(N-1)$ shift-registers to store $M$ words each, but the shift-registers of the fully-direct non-separable structure store input pixel values while those of the separable structure store intermediate values. Due to the difference in bit-width, a common shift-register unit cannot be shared by these two structures. A different design approach for the separable filter is required, where the shift-register unit stores the input signal only. For this purpose, we use an efficient decomposition scheme for the 2-D separable filter in the following.

TABLE III. MEMORY REUSE EFFICIENCY AND MEMORY BAND-WIDTH SAVING FOR DIFFERENT SIZE FILTERS AND INPUT-BLOCK SIZE

The input-output relation of the separable 2-D FIR filter, given by (2), can be written as:

(4)

Computation of (4) can be expressed in split form as:

(5a)

(5b)

The terms in (5a) and (5b) can be expressed as inner-products of pairs of $N$-point vectors as

(6a)

(6b)

where the vectors in (6a) and (6b) are given by

(7a)

(7b)

(7c)

(7d)

According to (5), input-vectors are fed to the row-filter in column overlapped order, and the row-filter generates intermediate values column-wise, exactly in the same order as the column-filter consumes intermediate values. Consequently, the transposition-unit in this case is comprised of $(N-1)$ fixed registers instead of shift-registers.

Fig. 4. Separable 2-D FIR filter structure based on proposed decomposition scheme.

The separable filter structure based on the decomposition scheme of (5) is shown in Fig. 4. The input samples are fed as overlapped blocks, where successive blocks of a column are overlapped by $(N-1)$ samples. The input-blocks are fed in row-serial order from a shift-register unit comprising $(N-1)$ shift-registers of $M$ words each. We can find from Figs. 3 and 4 that the shift-register unit of the fully-direct non-separable structure and that of the separable structure based on this decomposition scheme have the same number of shift-registers, and that these are of the same size. Interestingly, the data-input and data-output formats of both these shift-register units are identical. This favours the possible sharing of the shift-register units of the fully-direct non-separable structure and the separable structure, and leads to a generalized structure for both non-separable and separable filters, individually or in parallel configuration.

Data-flow of a shared shift-register unit is shown in Fig. 5, where the input-data requirement of both the fully-direct non-separable and the separable structures is taken care of by the shared shift-register unit. A shared shift-register unit not only offers memory-saving, but also allows parallel realization of both non-separable and separable filters. It is shown in the later Sections that the parallel implementation of the generalized structure offers higher area-delay-power efficiency than the sequential structure. Keeping these facts in view, we outline here a systematic memory-centric design strategy to derive an area-delay-power efficient structure for 2-D FIR filter.

• A fully-direct form structure should be used for the non-separable filter to have less storage-complexity.

• A block implementation of the fully-direct structure should be used for MBW reduction.

• A separable structure based on the proposed decomposition algorithm could be derived for shift-register sharing with the non-separable filter structure.

• An appropriate algorithm partitioning and scheduling scheme needs to be used for the separable filter to minimize memory bandwidth and increase register sharing.
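The sharing idea in the strategy above can be sketched as a small streaming model. This is an illustrative Python sketch under stated assumptions, not the authors' architecture: the function names are hypothetical, the image borders are zero-padded, and the separable filter is assumed to filter across rows first (on raw input pixels) and then along the row using $(N-1)$ registers for intermediate values. A single line buffer that stores input pixels only feeds both a non-separable and a separable filter.

```python
from collections import deque


def column_vectors(image, N):
    """Yield (m, n, col) with col[i] = x(m-i, n); rows above the image read as zero.

    `past_rows` plays the role of the N-1 shift-registers and holds input pixels only.
    """
    width = len(image[0])
    past_rows = deque([[0] * width for _ in range(N - 1)], maxlen=N - 1)
    for m, row in enumerate(image):
        for n, pixel in enumerate(row):
            yield m, n, [pixel] + [past_rows[i][n] for i in range(N - 1)]
        past_rows.appendleft(list(row))  # the oldest row drops out automatically


def filter_with_shared_buffer(image, h, h1, h2):
    """Run a non-separable and a separable filter from one shared line buffer."""
    N = len(h1)
    rows, cols = len(image), len(image[0])
    y_nonsep = [[0] * cols for _ in range(rows)]
    y_sep = [[0] * cols for _ in range(rows)]
    past_cols = deque(maxlen=N - 1)   # fixed registers: past input columns
    past_u = deque(maxlen=N - 1)      # fixed registers: past intermediate values
    for m, n, col in column_vectors(image, N):
        if n == 0:  # start of a row: clear the fixed registers
            past_cols.clear()
            past_cols.extend([[0] * N] * (N - 1))
            past_u.clear()
            past_u.extend([0] * (N - 1))
        window = [col] + list(past_cols)            # window[j][i] = x(m-i, n-j)
        y_nonsep[m][n] = sum(h[i][j] * window[j][i]
                             for i in range(N) for j in range(N))
        u = sum(h1[i] * col[i] for i in range(N))   # filtering across rows on raw pixels
        taps = [u] + list(past_u)                   # taps[j] = u(m, n-j)
        y_sep[m][n] = sum(h2[j] * taps[j] for j in range(N))
        past_cols.appendleft(col)
        past_u.appendleft(u)
    return y_nonsep, y_sep


# Illustrative data (hypothetical values).
demo_h1, demo_h2 = [1, 2, 1], [1, 0, -1]
demo_h = [[a * b for b in demo_h2] for a in demo_h1]
demo_image = [[(2 * r + 3 * c) % 5 for c in range(6)] for r in range(5)]
y_nonsep, y_sep = filter_with_shared_buffer(demo_image, demo_h, demo_h1, demo_h2)
```

In this model the only row-length storage is `past_rows`, which holds input-width samples; the separable path needs just $(N-1)$ registers for its intermediate values, which mirrors the argument for sharing one shift-register unit between the two filter types.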

The register sharing property of the proposed non-separable and separable filter structures is exploited further to derive generic structures. The proposed generic structures can be configured for realization of a single filter of any type (separable, non-separable, or symmetric (diagonal, four-fold rotational, quadrant)) or for parallel realization of any combination of these filters.

Fig. 5. Data-flow in a shared shift-register unit of fully-direct structure and separable structure.

III. BLOCK FORMULATION OF 2-D FIR FILTERS

A. For Non-Separable Filter

Let us consider a non-separable filter which processes a block of $L$ input samples and generates a block of $L$ outputs in every cycle. The $k$-th block of filter output of the $m$-th row is computed by the relation:

(8)

where the vectors appearing in (8) are defined as

(9a)

(9b)

The intermediate vector is computed as the product of an input-matrix and an impulse-response vector, and is given by

(10)

The input matrix is derived from one row of the input image of size $(M \times M)$ and is given by (11). The impulse-response vector is given by:

(12a)

From (9) and (11), we find that each element of the intermediate vector is the inner-product of one row of the input matrix and the impulse-response vector, given by

(13)
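The block formulation can be illustrated with a short Python sketch. It is a minimal example under assumptions, since the notation here is chosen for illustration: each of the $N$ filter rows contributes a partial output-vector obtained as the product of an $(L \times N)$ input-matrix of overlapping windows with one row of the impulse-response matrix, and the $N$ partial vectors are summed. The function names and the zero-padding at the image border are hypothetical choices.

```python
def input_matrix(x, row, k, L, N):
    """L x N matrix whose l-th row is the N-point window ending at column k*L + l."""
    def sample(r, c):
        return x[r][c] if r >= 0 and c >= 0 else 0   # zero outside the image
    return [[sample(row, k * L + l - j) for j in range(N)] for l in range(L)]


def matvec(A, v):
    """Matrix-vector product: one inner product per row of A."""
    return [sum(a * b for a, b in zip(r, v)) for r in A]


def block_output(x, h, m, k, L):
    """k-th block of L outputs of row m as a sum of N matrix-vector products."""
    N = len(h)
    y = [0] * L
    for i in range(N):
        partial = matvec(input_matrix(x, m - i, k, L, N), h[i])
        y = [a + b for a, b in zip(y, partial)]
    return y


# Illustrative data (hypothetical values): 4 x 8 image, 3 x 3 filter, block size 4.
demo_x = [[(r * 8 + c) % 11 for c in range(8)] for r in range(4)]
demo_h = [[1, 2, 3], [0, -1, 1], [2, 0, 1]]
demo_block = block_output(demo_x, demo_h, m=3, k=1, L=4)
```

Consecutive rows of each input-matrix overlap in $N-1$ samples, so one block of $L$ outputs touches only $L + N - 1$ distinct samples per input row.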

B. For Separable 2-D FIR Filter

Let us consider a separable filter which processes a block of $L$ input samples and generates a block of $L$ outputs in every cycle. The $k$-th block of filter output of the $m$-th row is computed in this case by two successive matrix-vector products given by

(14a)

(14b)

and the input-matrix and intermediate-matrix are given by (15) and (16).

(11)

(15)

Fig. 6. Proposed block-based structure for non-separable 2-D FIR filter for block-size $L = 4$ and filter-size $N = 8$.

IV. PROPOSED STRUCTURES

In this Section, we derive two separate structures for block implementation of non-separable and separable 2-D FIR filters.

A. Block-Based Structure for Non-Separable 2-D FIR Filter

The computations of (8) and (10) are mapped into a fully-direct parallel structure to derive the proposed block-based structure for non-separable 2-D FIR filter. The proposed structure is shown in Fig. 6. It consists of one memory-module and one arithmetic module.

1) Memory-Module Design: The memory-module of Fig. 6 is comprised of one shift-register array and $N$ input-register units (IRUs). The shift-register array further consists of $L(N-1)$ shift-registers of $P$ words each, where $P = M/L$. The proposed structure receives a block of $L$ input samples and computes a block of $L$ outputs in each cycle. All the $L$ samples of each input-block belong to the same row, and the inputs are fed to the structure block-by-block and then row-by-row in serial order. The $k$-th input-block of the $m$-th row of the input image is fed to the structure during the $k$-th cycle of the $m$-th set of $P$ cycles, and the entire image is fed in $MP$ cycles. The $L(N-1)$ shift-registers of the shift-register array are arranged in $(N-1)$ groups referred to as SR-blocks, and each SR-block has $L$ shift-registers. As shown in Fig. 6, SR-1, SR-2, SR-3, SR-4 constitute SR-block-1 and SR-25, SR-26, SR-27, SR-28 constitute SR-block-7. One SR-block stores one input row and the shift-register array stores $(N-1)$ rows of input. The SRs of the SR-blocks are connected such that a block of $L$ samples transfers from one SR-block to the adjacent SR-block to its right after every cycle. Therefore, a block of $L$ inputs of a particular row is obtained from each SR-block in every cycle, and $(N-1)$ input-blocks corresponding to $(N-1)$ consecutive input rows are obtained from the shift-register unit in every cycle.

The current input-block and the $(N-1)$ past input-blocks available in the shift-register array are sent to the $N$ IRUs to generate $N$ input-matrices of size $(L \times N)$. During the $k$-th cycle of each set of $P$ cycles, the first IRU receives the current block of input, and each of the remaining IRUs receives an input-block from the corresponding SR-block. Each IRU generates one input-matrix. According to (11), a block of $(L + N - 1)$ consecutive samples of one input row is required to construct each input-matrix. Each IRU receives $L$ samples from the corresponding SR-block during the $k$-th cycle and uses $(N-1)$ past samples belonging to past input-blocks. The internal structure of an IRU is shown in Fig. 7. It consists of $(N-1)$ registers, and produces $L$ number of $N$-point input-vectors corresponding to the $L$ rows of the input-matrix.

Fig. 7. Internal structure of an input register unit (IRU).

2) Arithmetic-Module Design: The arithmetic module is

comprised of functional-units (FUs) and one adder tree (AT).In each cycle, FUs of arithmetic module receive setsof input-vectors from IRUs of storage-module such that

-th FU receives input-vectors generated by -thIRU and it performs separate inner-product computationwith the -th row of impulse-response matrix toobtain -point partial output-vector according to (10).Internal structure of the FU is shown in Fig. 8(a). It consists of

(16)

Page 8: 06542014

MOHANTY et al.: MEMORY FOOTPRINT REDUCTION FOR POWER-EFFICIENT REALIZATION 127

Fig. 8. (a) Structure of -th functional unit (FU) for and .(b) Structure of adder-block for and .

Fig. 9. Internal structure of inner-product cell (IPC) for .

inner-product cells (IPCs). Each IPC (shown in Fig. 9) performs -point inner product of input-vector and weight-vector. Finally, partial-output vectors of FUs are added together in an adder-block (shown in Fig. 8(b)) according to (8) to obtain one block of output in every cycle, where one clock period , and are, respectively, one multiplication time, addition time and one full-adder delay. One complete row of output is obtained in cycles and the entire output matrix of size in cycles.
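As a behavioural illustration of the data-flow described in this subsection, the following Python sketch models one plausible reading of it: for every block of outputs, one "functional unit" per kernel row forms overlapping input-vectors from the corresponding input row, each "inner-product cell" computes one inner product with that kernel row, and an "adder block" sums the partial output-vectors. The function names, the parameter `block` (standing in for the input block-size), and the zero-padding at the image borders are assumptions made for this sketch. It is a functional model only, not a cycle-accurate description of the proposed hardware.

```python
def fir2d_direct(x, h):
    """Reference 2-D FIR: y[m][n] = sum_i sum_j h[i][j] * x[m-i][n-j] (zero outside the image)."""
    rows, cols, size = len(x), len(x[0]), len(h)
    y = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            y[m][n] = sum(h[i][j] * x[m - i][n - j]
                          for i in range(size) for j in range(size)
                          if m - i >= 0 and n - j >= 0)
    return y


def sample(x, m, n):
    """Input sample with zero-padding for indices before the image start (assumption)."""
    return x[m][n] if m >= 0 and n >= 0 else 0


def fir2d_block(x, h, block):
    """Block-based model: `block` outputs of one row are produced per iteration ("cycle")."""
    rows, cols, size = len(x), len(x[0]), len(h)
    assert cols % block == 0, "row length must be a multiple of the block size"
    y = [[0] * cols for _ in range(rows)]
    for m in range(rows):
        for k in range(cols // block):
            start = k * block
            partials = []
            for i in range(size):  # one functional unit per kernel row
                # Overlapping input-vectors built from input row m - i
                vectors = [[sample(x, m - i, start + l - j) for j in range(size)]
                           for l in range(block)]
                # One inner-product cell per input-vector
                partials.append([sum(a * b for a, b in zip(vec, h[i])) for vec in vectors])
            # Adder block: add the partial output-vectors of all functional units
            y[m][start:start + block] = [sum(col) for col in zip(*partials)]
    return y
```

Both functions return the same output matrix for any block size that divides the row length; only the grouping of the computation changes, which is the property the block-based formulation relies on.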

B. Block-Based Structure for Separable Filter

The proposed block-based structure for separable 2-D FIR filter is shown in Fig. 10. It consists of one processing cell (PC) and one transposition-unit (TU), where a PC consists of two FUs. Structure of each FU is the same as the one shown in Fig. 8(a) except that a constant vector is stored in each FU. In this case, FU-1 and FU-2 store the constant-vectors and , respectively. It processes the -th block of input of the -th row of during the -th cycle of a set of cycles, and produces an output block , where and is the input block-size. One complete row of is processed in cycles and the complete image in cycles. Proposed structure receives a block of input samples through number of -point input-vectors where each of the input-vectors is overlapped by samples. Input-vectors of -th and -th cycles of the -th set of input cycles are shown in Fig. 10 for and . Components of each input-vector are shown in the rectangular box adjacent to its left. The input-vectors of -th cycle form the matrix as given in (15), where is the -th row of , for , and . rows of are fed to IPCs of FU-1 in parallel to perform one matrix-vector multiplication with the constant vector to calculate an -point intermediate-vector according to (14a). From each intermediate-vector, one intermediate-matrix (as given in (15)) is generated, such that is generated from . TU generates the required rows of from . We can find from (11) and (16) that the elements of and satisfy similar property. Therefore, TU performs the same function as the IRUs and its structure is identical with that of an IRU (as shown in Fig. 7). The TU of separable structure generates rows , for ) of in parallel and feeds those to FU-2 in parallel to perform one matrix-vector product with constant-vector to compute a block of filter output according to (16b).

Fig. 10. Proposed block-based structure for separable 2-D FIR filter for and .
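The two-stage computation described for the separable structure can be illustrated functionally. The Python sketch below assumes a separable kernel given by two 1-D vectors, here called `g_row` and `g_col` (hypothetical names), and checks that filtering along rows first and then along columns gives the same result as a direct 2-D FIR with the outer-product kernel. The whole-matrix `transpose` is used only to keep the model short; it is not meant to represent how the TU of the proposed structure arranges data.

```python
def fir1d_rows(x, g):
    """Filter every row of x with the 1-D kernel g (zero initial conditions)."""
    return [[sum(g[j] * row[n - j] for j in range(len(g)) if n - j >= 0)
             for n in range(len(row))]
            for row in x]


def transpose(x):
    return [list(col) for col in zip(*x)]


def fir2d_separable(x, g_row, g_col):
    """Two-stage separable filtering: a row pass followed by a column pass."""
    u = fir1d_rows(x, g_row)               # first stage: intermediate result
    v = fir1d_rows(transpose(u), g_col)    # second stage, applied along columns
    return transpose(v)


def fir2d_direct(x, h):
    """Reference 2-D FIR with a full kernel matrix h (zero outside the image)."""
    rows, cols, size = len(x), len(x[0]), len(h)
    return [[sum(h[i][j] * x[m - i][n - j]
                 for i in range(size) for j in range(size)
                 if m - i >= 0 and n - j >= 0)
             for n in range(cols)]
            for m in range(rows)]
```

For a kernel of size N in each dimension, the two-stage form needs 2N multiplications per output sample instead of N squared, which is why separable filters are treated as a separate case.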

V. GENERIC STRUCTURES

In this Section, we derive two separate generic structures for non-separable and separable filter banks. Also we have proposed a unified structure for realization of 2-D FIR filter-bank comprised of non-separable and separable filters.

A. Generic Block-Based Structure for Non-Separable Filters

The coefficient matrices of non-separable filters can have varieties of symmetry, e.g., diagonal, centro, four-fold rotational, quadrant etc. These symmetries can be exploited to realize the transfer functions with fewer multiplications. The arithmetic module of proposed structure for non-separable filters could be optimized to take advantage of these symmetry properties. The storage-module of the structure can be interfaced


Fig. 11. Proposed generic block-based structure for non-separable 2-D FIR filters, is the filter size and is the input block-size.

Fig. 12. Proposed generic block-based structure for separable 2-D FIR filters, is the filter size and is the input block-size.

as a common unit with arithmetic modules of filters with and without symmetry. This results in a generic structure shown in Fig. 11. Each sub-module of arithmetic-module of generic structure represents arithmetic module of a constituent filter. The arithmetic-module of each filter is enabled with a select signal (EN , for ) to switch off the arithmetic module to have power saving if the output of a particular filter is not required. The proposed generic structure can be used to realize any of the four types of filters or a parallel combination of filters by selecting the arithmetic modules of respective filters through the select signals. The proposed generic structure has higher MRE and MBS than the non-separable structure for a given block-size and filter-size due to common storage-unit. The area-delay-product (ADP) and power consumed per output (PPO) of the proposed generic structure in parallel configuration is expected to be significantly less than separate implementation of individual filters.
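To make the saving from coefficient symmetry concrete, the sketch below compares a general inner product over a square window with one that exploits diagonal symmetry, taken here to mean h[i][j] = h[j][i] (an assumption about the definition; the other symmetry types are not modelled). Mirrored input samples are added first and multiplied once. The function names and the multiplication counters are introduced purely for illustration and do not reproduce the multiplier counts of the proposed sub-modules.

```python
def inner_product_general(window, h):
    """Full inner product of an n x n window with an n x n kernel: n*n multiplications."""
    n = len(h)
    acc = sum(h[i][j] * window[i][j] for i in range(n) for j in range(n))
    return acc, n * n


def inner_product_diag_symmetric(window, h):
    """Inner product for a kernel with h[i][j] == h[j][i].

    Samples at mirrored positions are pre-added, so each distinct
    coefficient is multiplied only once: n*(n+1)/2 multiplications.
    """
    n = len(h)
    acc, mults = 0, 0
    for i in range(n):
        acc += h[i][i] * window[i][i]          # diagonal coefficients
        mults += 1
        for j in range(i + 1, n):
            acc += h[i][j] * (window[i][j] + window[j][i])  # pre-add, then multiply
            mults += 1
    return acc, mults
```

For a 4 x 4 kernel this reduces 16 multiplications to 10 while the number of additions stays the same (15 in both cases), only their position relative to the multipliers changes.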

B. Generic Block-Based Structure for Separable Filters

The proposed generic structure for the realization of separable filters with and without symmetry is shown in Fig. 12. The structure is similar to proposed generic non-separable structure except that the shift-register array is only common with PCs of the processing unit. Proposed generic structure can be used to realize a separable filter with and without symmetry in parallel.

Fig. 13. Data-flow of common shift-register array and data distribution block for non-separable and separable generic structures.

C. Unified Structure for 2-D FIR Filter-Bank

The shift-register size and input to the shift-register array are the same in the proposed separable and non-separable generic structures. NL input samples are obtained from the shift-register array and fed to the IRU-array of generic non-separable structure as blocks of samples each, while the processing unit of the generic separable structure is fed with blocks of samples each. Therefore, the input-blocks of both non-separable and separable filters can be obtained from the same shift-register array. Outputs of the common shift-register array need to be rearranged appropriately for the non-separable and separable generic structures.

Data rearrangement of a common shift-register array is shown in Fig. 13. The data-flow of an SR-block is shown in blue color. The input-blocks shifting through the SR-block are shown for the -th and -th cycle. The contents of the first two locations of each SR of the SR-block are shown for the -th and -th input cycles of the -th input row. The rectangular dotted boxes show the content of one cell of the SR-block comprised of 4 SRs corresponding to . A direct flow of data from the shift-register unit meets the data-flow requirement of non-separable generic structure while shift-register output data are rearranged (shown by the data-distribution block in Fig. 13) for the separable generic structure.

The proposed unified structure for 2-D FIR filter is shown in Fig. 14. It has a common storage unit for both separable and non-separable sections which can be used for the realization of any one of four types of non-separable or two types of separable filters. It also can be used for parallel realization of any combination of separable or non-separable filters. It involves memory-words and computes outputs each of all the six filters in every cycle in its full-parallel configuration. Although we have shown the unified structure for 6 parallel filters, it can be used for realization of more than 6 filters without any additional storage. The processing module complexity (arithmetic-modules and PCs of each filter) only increases proportionately with the number of parallel filters. This is a major advantage for area-delay-power efficient realization


Fig. 14. Proposed block-based unified structure for 2-D FIR filter.

of large size filter banks consisting of separable and non-separable filters.
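The idea of one storage unit shared by several selectable filters can be sketched in software. In the hypothetical `SharedWindowFilterBank` class below, each window of input samples is formed once and then reused by every enabled filter, so the amount of buffered data does not depend on how many filters are attached; the `enabled` flags loosely play the role of the select signals. The class, its method names and the zero-padding at the borders are assumptions for this illustration, and all kernels are assumed to have the same size.

```python
class SharedWindowFilterBank:
    """Functional model of a 2-D FIR filter bank with one shared input window."""

    def __init__(self, kernels):
        self.kernels = dict(kernels)                       # name -> n x n kernel
        self.enabled = {name: True for name in self.kernels}

    def set_enable(self, name, on):
        """Switch the arithmetic of one filter on or off."""
        self.enabled[name] = bool(on)

    def filter(self, x):
        n = len(next(iter(self.kernels.values())))
        rows, cols = len(x), len(x[0])
        active = [name for name, on in self.enabled.items() if on]
        out = {name: [[0] * cols for _ in range(rows)] for name in active}
        for m in range(rows):
            for c in range(cols):
                # Shared storage: the window is read once per output position...
                window = [[x[m - i][c - j] if m - i >= 0 and c - j >= 0 else 0
                           for j in range(n)] for i in range(n)]
                # ...and reused by every enabled filter.
                for name in active:
                    h = self.kernels[name]
                    out[name][m][c] = sum(h[i][j] * window[i][j]
                                          for i in range(n) for j in range(n))
        return out
```

Adding a kernel to the bank adds arithmetic work per window but no additional window storage, which mirrors the scaling argument made for the unified structure.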

VI. COMPLEXITIES AND PERFORMANCE CONSIDERATIONS

The proposed structure for non-separable filter consists of a storage-module and an arithmetic module, while the proposed separable structure consists of one shift-register array and one processing cell, where the processing cell again consists of both arithmetic and storage components. The storage module of non-separable structure consists of one shift-register array and IRUs. The shift-register array of both non-separable and separable structures consists of SRs of words each. Each IRU consists of registers. The arithmetic module of non-separable structure consists of FUs and one adder-block while the processing cell of separable structure consists of two FUs and one TU (same as the IRU). Each FU consists of IPCs, and each IPC consists of multipliers and one adder-tree (AT) to add words. The adder-block consists of such ATs.

A. Complexity of Block-Based Structures for Separable and Non-Separable Filters

The arithmetic-module of block-based non-separable structure involves multipliers, adders while each processing cell of separable structure involves multipliers, adders and registers. Apart from these, the non-separable structure involves registers and the separable structure involves registers. Both these structures compute outputs per cycle, where one cycle period is and , respectively for non-separable and separable structure, for and are, respectively, one multiplier delay, adder delay and full-adder delay².

² The delay of AT increases by after each level of the tree. Since we have considered a (4 × 4) 2-D FIR filter to synthesize the proposed design, it requires an AT of two levels. So the delay introduced by the adder tree becomes . Therefore, the critical path is not affected much. For large order filter one can use a pipeline adder-tree (PAT) instead. For simplicity, if we consider the multiplier consisting of ripple carry adders, the multiplication time is twice that of an addition time . To maintain a critical path of one , we need to introduce a pipeline stage after the first level of adder tree using registers. Up to a filter of order 511 we do not require any extra pipeline stage for word length 8 bit or more.
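The statement that a (4 × 4) filter needs an adder tree of two levels follows from the depth of a binary tree that adds four words. The short Python sketch below, with hypothetical helper names, computes that depth and also performs the pairwise reduction so the level count can be checked against an actual summation. It assumes two-input adders at every level and says nothing about the delay per level.

```python
def adder_tree_levels(words):
    """Number of levels of a two-input adder tree that adds `words` operands."""
    return (words - 1).bit_length() if words > 1 else 0


def adder_tree_sum(values):
    """Add operands pairwise, level by level; return (sum, number of levels used)."""
    level, depth = list(values), 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # an unpaired operand passes to the next level
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth
```

For four operands both functions report two levels; for sixteen operands the depth grows to four, which is where a pipelined adder tree becomes attractive.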

Fig. 15. (a) Memory reuse efficiency (MRE). (b) Memory band-width per output (MBWPO). NS, S, NSG, SG and UNI, respectively, stand for non-separable, separable, non-separable generic, separable generic and unified.

TABLE IV. GENERAL COMPARISON OF MRE AND MBWPO OF PROPOSED STRUCTURES. : INPUT-IMAGE WIDTH/HEIGHT, : FILTER-SIZE, : INPUT-BLOCK LENGTH

B. Complexity of Generic Structures

Arithmetic module of non-separable generic structure comprises four sub-modules corresponding to four types of filters where each sub-module represents arithmetic-module of the corresponding filter. Sub-modules of symmetric filters involve the same number of adders as those of sub-module of general filter which is , and only differ by the number of multipliers. Sub-module of diagonal, four-fold and quadrant symmetry filters, respectively, involves and multipliers. Arithmetic-module of non-separable generic structure, therefore, involves multipliers and adders. Processing unit of proposed separable generic structure is comprised of two processing cells (one general and one symmetric filter). It involves


TABLE V. COMPARISON OF HARDWARE- AND TIME-COMPLEXITY OF THE PROPOSED STRUCTURES AND EXISTING STRUCTURES

TABLE VI. COMPARISON OF HARDWARE- AND TIME-COMPLEXITY OF PROPOSED NON-SEPARABLE GENERIC AND UNIFIED STRUCTURES AND EXISTING UNIFIED STRUCTURE OF [17]

Fig. 16. Normalized storage per filter output (calculated for and ). NS, S, NSG, SG and UNI, respectively, stand for non-separable, separable, non-separable generic, separable generic and unified.

multipliers, adders and registers. The proposed non-separable generic structure, therefore, involves multipliers and adders and registers. It computes outputs of each of the four filters in every cycle. Similarly, the proposed separable generic structure involves multipliers, adders and registers, and computes outputs of each pair of filters in every cycle. The proposed unified structure has one storage unit and one processing module which is comprised of one non-separable section and one separable section. The non-separable section represents arithmetic-module of proposed non-separable generic structure whereas the separable section represents one processing unit of separable generic structure. Complexity of storage unit is the same as that of proposed block-based non-separable structure. The proposed unified structure, therefore, involves multipliers, adders and registers, and computes filter outputs of each of six filters (four non-separable and two separable) in every cycle.

TABLE VII. HARDWARE AND TIME COMPLEXITIES OF PROPOSED GENERIC AND UNIFIED STRUCTURES AND EXISTING UNIFIED STRUCTURE OF [17] FOR INPUT-IMAGE SIZE , FILTER-SIZE

C. Memory Reuse Efficiency and Bandwidth Requirement

Using the definition of MRE and MBWPO given in Section II, we have calculated MRE and MBWPO of proposed structures according to expressions given in Table IV. Using these expressions, we have estimated MRE and MBWPO of the proposed structure for filter-size and image-size . The estimated values are shown in graphs of Fig. 15. The MRE of proposed structures increases with block-size. MRE of unified structure is higher than others for a given block-size. This is mainly due to memory-sharing by more filters in the unified structure.

D. Performance Comparison

In [16], an efficient 2-D IIR filter (pole-zero) structure is presented which can be modified for realization of 2-D FIR


TABLE VIII. COMPARISON OF POST-LAYOUT SYNTHESIS RESULTS OF THE PROPOSED STRUCTURES AND THE STRUCTURES OF [16], [17] AND [29] FOR AND , USING TSMC 90 NM CMOS TECHNOLOGY LIBRARY, POWER ESTIMATED AT 20 MHZ FREQUENCY

filter by removing the all-pole structure from the pole-zero structure. Accordingly, we have extracted an FIR filter structure from [16] and synthesized that for comparison. Hardware complexity of this extracted structure along with those of proposed structures and structure of [11] and [29] are listed in Table V for comparison of complexities. We find from Table V that proposed non-separable structure involves times more multipliers and adders than the existing structures and equal number of registers, but it offers times higher throughput and times less MBWPO than others. Similarly, proposed separable structure involves times more multipliers and adders than those of [29] and the same number of registers, but offers times higher throughput and nearly 2 times less MBWPO than those of [29]. Proposed separable generic structure involves times more multipliers and times more adders than those of [29] and more registers, and offers times higher throughput with nearly 4 times less MBWPO than those of [29].

We have extracted a unified 2-D FIR filter structure from the multimodal structure of [17] for comparison purpose. The hardware and time complexities of this structure and the proposed non-separable generic and unified structure are listed in Table VI. Compared with structure of [17], proposed non-separable generic structure and unified structure involve nearly times more multipliers, times more adders and compute and times more outputs, respectively. Proposed generic structure involves less registers than that of [17] and it has nearly times less MBWPO than other. Similarly, the proposed unified structure involves less registers and nearly times less MBWPO.

We have estimated hardware complexity of proposed non-separable generic and unified structures and the unified structure of [17] for , block-size , 8 and for image-size . The estimated values are listed in Table VII. We can find from Table VII that proposed non-separable generic structure involves 13.6% less normalized multipliers, 34.7% less normalized adders and 90% less MBWPO than those of [17]. Proposed unified structure involves 24.27% less normalized multipliers, 47.82% less normalized adders and 94% less MBWPO than those of [17]. Unlike the existing unified structure, the proposed one does not require any steering logic for signal switching. Note that the proposed non-separable generic structure is comprised of one general filter and three symmetric filters, while the proposed unified structure is comprised of one general non-separable and one general separable filter, and four symmetric filters. But, unified structure of [17] is comprised of only four symmetric filters. Proposed structures have several advantages over the existing structures: they (i) provide filter-banks for non-separable and/or separable filters with symmetry and/or without symmetry which could be used in many image processing applications, (ii) can be configured for implementation of any one filter of the filter-bank or parallel configuration of any of the filters of the filter-bank, and (iii) can be easily scaled for high-throughput.

We have estimated normalized storage per filter output (SPO) of proposed structures and the existing structure of [11], [16], [17], [29] for and and shown in the graph of Fig. 16. We find that proposed structures involve less normalized SPO than the existing structure. This is mainly due to memory-reuse efficiency and memory-sharing of the proposed structures.

E. ASIC Synthesis Result

We have coded the proposed designs in VHDL for filter size , block-size and image-size (512 × 512), and synthesized using Synopsys tool. We have also synthesized non-separable 2-D FIR filter structures extracted from IIR structure of [16] and unified structure of [17] and separable structure of [29]. We have used Wallace-tree based generic Booth-multiplier of Synopsys DesignWare building blocks library for all the designs. Shift-registers and registers of all designs are synthesized using D-FF (flip-flop). We have considered input signal width -bit and all intermediate signals and output signal width -bit. We have set switching


activity toggle rate 0.25 and static probability 0.5. The netlist file obtained from the Synopsys Design Compiler is processed in IC Compiler. After place, route and clock synthesis, the area, time and power (leakage and dynamic) reported by the IC Compiler are listed in Table VIII for comparison. Power consumption of all the synthesized designs is estimated for 20 MHz clock. Due to memory footprint reduction, the area, leakage power and dynamic power consumption of the proposed structures do not increase proportionately with the number of outputs. Consequently, proposed structures involve significantly less area-delay-product (ADP³) and consume less energy per output (EPO⁴). Compared with [16], proposed generic non-separable structure involves 11.25 times less ADP and 15.63 times less EPO. The proposed generic separable structure involves 4.85 times less ADP and 5.53 times less EPO than those of [29]. Compared with [17], proposed unified structure involves 4.64 times less area and 9.78 times less EPO.
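The two figures of merit used in this comparison are defined in the footnotes as ADP = area × data-arrival time / outputs per cycle and EPO = power × clock period / outputs per cycle. The small Python sketch below evaluates them; the function names and all numeric inputs are hypothetical and are not values from Table VIII. Only the 50 ns clock period is tied to the text, since it corresponds to the 20 MHz clock used for power estimation.

```python
def area_delay_product(area, data_arrival_time, outputs_per_cycle):
    """ADP normalized by throughput: area x DAT / outputs per cycle."""
    return area * data_arrival_time / outputs_per_cycle


def energy_per_output(power_mw, clock_period_ns, outputs_per_cycle):
    """EPO in picojoules: power (mW) x clock period (ns) / outputs per cycle."""
    return power_mw * clock_period_ns / outputs_per_cycle


# Hypothetical design producing 4 outputs per cycle at a 20 MHz clock (50 ns period)
adp = area_delay_product(area=100.0, data_arrival_time=2.0, outputs_per_cycle=4)
epo = energy_per_output(power_mw=10.0, clock_period_ns=50.0, outputs_per_cycle=4)
```

Because both metrics divide by the number of outputs per cycle, a block-based design whose area and power grow less than proportionately with block size improves on both.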

VII. CONCLUSION

We have analyzed memory footprint and combinational complexity of 2-D FIR structures to arrive at a systematic design strategy to derive area-delay-power-efficient architectures. Based on that we have presented novel block-based separable and non-separable structures, generic structures for separable and non-separable filter-banks and unified structure for concurrent realization of both separable and non-separable filter-banks. It is shown that storage requirement of proposed structures does not change with input block-size . Similarly, the storage size of generic non-separable structure is independent of the number of parallel filters in a filter-bank. It increases marginally by words in case of generic separable and unified structures, where is the filter size. Proposed structures, therefore, offer higher memory reuse efficiency and MBW reduction for higher values of and . This reduces SPO and PPO of proposed structures.

Compared with the existing structures, proposed block-based separable and non-separable structures involve the same number of storage words, and proportionately the same or less number of arithmetic resources than the corresponding best of existing structures, and compute PL times more filter outputs per cycle with PL times less MBW. The proposed unified structure with 4 non-separable filters and 2 separable filters involves nearly times more multipliers, times more adders, less registers than existing unified structure, and computes times more filter outputs per cycle with times less MBW than the other. ASIC synthesis result for filter-size (4 × 4), input-block size and image-size (512 × 512) shows that proposed block-based non-separable and generic non-separable structures, respectively, involve 5.95 times and 11.25 times less ADP, and 5.81 times and 15.63 times less EPO than the corresponding existing structures. The proposed unified structure involves 4.64 times less ADP and 9.78 times less EPO than the corresponding existing structure.

When the inner-product computation involved in FIR filtering is realized by multiplier-less designs to reduce the combinational complexity, that may require additional memory, leading to an overall increase in memory complexity. But, the advantage gained by the proposed memory footprint reduction technique is not affected by such change of method of implementation of inner-products.

³ ADP = area × data-arrival time (DAT) / outputs per cycle.
⁴ EPO = power × clock-period / number of outputs per cycle.

REFERENCES

[1] D. E. Dudgeon and R. M. Mersereau, Multidimensional Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall, 1984.

[2] M. A. Sid-Ahmed, Image Processing: Theory, Algorithms, and Architectures. New York, NY, USA: McGraw-Hill, 1995.

[3] T. Barbu, “Gabor filter based face recognition technique,” in Proc. Romanian Acad. Ser. A, 2010, vol. 11, no. 3/2010, pp. 277–283.

[4] S. E. Grigorescu, N. Petkov, and P. Kruizinga, “Comparison of texture features based on Gabor filters,” IEEE Trans. Image Process., vol. 11, no. 10, pp. 1160–1167, Oct. 2002.

[5] W. Li, K. Mao, H. Zhang, and T. Chai, “Designing compact Gabor filter banks for efficient texture feature extraction,” in Proc. 11th Int. Conf. Contr., Automation, Robotics Vis., Singapore, Dec. 7–10, 2010, pp. 1193–1197.

[6] M. A. Sid-Ahmed, “A systolic realization for 2-D digital filters,” IEEE Trans. Acoustic Speech Signal Process., vol. 37, pp. 560–565, Apr. 1989.

[7] N. R. Shanbhag, “An improved systolic architecture for 2-D digital filters,” IEEE Trans. Signal Process., vol. 39, pp. 1195–1202, May 1991.

[8] B. K. Mohanty and P. K. Meher, “Cost-effective novel flexible cell-level systolic architecture for high-throughput implementation of 2-D FIR filters,” Proc. IET Comput. Digit. Tech., vol. 143, no. 6, pp. 436–439, Nov. 1996.

[9] B. K. Mohanty and P. K. Meher, “High-throughput and low-latency implementation of bit-level systolic architecture for 1-D and 2-D digital filters,” IET Proc. Computer Digital Technique, vol. 146, no. 2, pp. 91–99, Mar. 1999.

[10] L. D. Van, C. C. Tang, S. Tenqchen, and W. S. Feng, “New VLSI architecture without global broadcast for 2-D systolic digital filters,” in Proc. IEEE Int. Symp. Circuits Syst. ISCAS, Geneva, Switzerland, May 2000, vol. 1, pp. 547–550.

[11] L. D. Van, “A new 2-D systolic digital filters architecture without global broadcast,” IEEE Trans. Very Large Scale Integr. Syst., vol. 10, no. 4, pp. 477–486, Aug. 2002.

[12] H. C. Reddy, I. H. Khoo, and P. K. Rajan, “2-D symmetry: Theory and filter design applications,” IEEE Circuits System Mag., vol. 3, no. 3, pp. 4–33, 3rd Q 2003.

[13] P. Y. Chen, L. D. Van, H. C. Reddy, and C. T. Lin, “A new VLSI 2-D diagonal-symmetry filter architecture design,” in Proc. IEEE APCCAS, Macao, China, Nov. 2008, pp. 320–323.

[14] I. H. Khoo, H. C. Reddy, L. D. Van, and C. T. Lin, “2-D digital filter architectures without global broadcast and some symmetry applications,” in Proc. IEEE Int. Symp. Circuits Syst. ISCAS, May 2009, pp. 952–955.

[15] P. Y. Chen, L. D. Van, H. C. Reddy, and C. T. Lin, “A new VLSI 2-D fourfold-rotational-symmetry filter architecture design,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2009, pp. 93–96.

[16] I. H. Khoo, H. C. Reddy, L. D. Van, and C. T. Lin, “Generalized formulation of 2-D filter structures without global broadcast for VLSI implementation,” in Proc. IEEE MWSCAS, Seattle, WA, USA, Aug. 2010, pp. 426–529.

[17] P. Y. Chen, L. D. Van, I. H. Khoo, H. C. Reddy, and C. T. Lin, “Power-efficient cost-effective 2-D symmetry filter architecture,” IEEE Trans. Circuit Syst. I, Reg. Papers, vol. 58, no. 1, pp. 112–125, Jan. 2011.

[18] K. K. Parhi, VLSI Digital Signal Processing. New York, NY, USA: Wiley, 1998.

[19] C. Y. Yao, H. H. Chen, C. J. Chien, and C. T. Hsu, “A novel common-subexpression elimination method for synthesizing fixed-point FIR filters,” IEEE Trans. Circuit Syst. I, Reg. Papers, vol. 51, no. 11, pp. 2215–2221, Nov. 2004.

[20] Y. Voronenko and M. Puschel, “Multiplierless multiple constant multiplication,” ACM Trans. Algorithms, vol. 3, no. 2, May 2007, Article 11.

[21] A. P. Vinod and E. M.-K. Lai, “An efficient coefficient-partitioning algorithm for realizing low complexity digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 24, no. 12, pp. 1936–1946, Dec. 2005.


[22] R. Mahesh and A. P. Vinod, “A new common subexpression elimination algorithm for realizing low complexity higher order digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 27, no. 2, pp. 217–219, Feb. 2008.

[23] J. Luis, T.-X. , R. M. A.-P. , and M. Bayoumi, “Hybrid multiplierless FIR filter architecture based on NEDA,” in Proc. IFIP Int. Conf. Very Large Scale Integration (VLSI-SoC 2007), 2007, pp. 316–319.

[24] P. K. Meher, S. Chandrasekaran, and A. Amira, “FPGA realization of FIR filters by efficient and flexible systolization using distributed arithmetic,” IEEE Trans. Signal Process., vol. 56, no. 7, pp. 3009–3017, Jul. 2008.

[25] P. K. Meher, “New approach to look-up table design and memory-based realization of FIR digital filters,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 57, no. 3, pp. 592–603, Mar. 2010.

[26] A. Croisier, D. J. Esteban, M. E. Levilion, and V. Rizo, “Digital Filter for PCM Encoded Signals,” U.S. Patent 3777130, Apr. 12, 1973.

[27] S. A. White, “Applications of the distributed arithmetic to digital signal processing: A tutorial review,” IEEE ASSP Mag., vol. 6, no. 3, pp. 5–19, July 1989.

[28] R. I. Hartley, “Subexpression sharing in filters using canonic signed digit multipliers,” IEEE Trans. Circuits Syst. II, Analog Digital Signal Process., vol. 43, no. 10, pp. 677–688, Oct. 1996.

[29] B. K. Mohanty and P. K. Meher, “New scan method and pipeline architecture for VLSI implementation of separable 2-D FIR filters without using transposition,” in Proc. IEEE Region 10 TENCON2008 Conf., Hyderabad, India, Nov. 2008.

Basant K. Mohanty (M’06–SM’11) received the M.Sc. degree in physics from Sambalpur University, India, in 1989 and the Ph.D. degree in the field of VLSI for Digital Signal Processing from Berhampur University, Orissa, in 2000.

In 2001 he joined as Lecturer in EEE Department, BITS Pilani, Rajasthan. Then he joined as an Assistant Professor in the Department of ECE, Mody Institute of Education Research (Deemed University), Rajasthan. In 2003 he joined Jaypee University of Engineering and Technology, Guna, Madhya Pradesh, where he became Associate Professor in 2005 and full Professor in 2007. His research interest includes design and implementation of low-power and high-performance systems for adaptive filters, image and video-processing applications, secured communication and reconfigurable architectures. He has published nearly 50 technical papers.

Dr. Mohanty is serving as Associate Editor for the Journal of Circuits, Systems, and Signal Processing. He is a life time member of the Institution of Electronics and Telecommunication Engineering, New Delhi, India, and he was the recipient of the Rashtriya Gaurav Award conferred by India International friendship Society, New Delhi for the year 2012.

Pramod Kumar Meher (SM’03) received the M.Sc. degree in physics and the Ph.D. degree in science from Sambalpur University, India, in 1978, and 1996, respectively.

Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Previously, he was a Professor of Computer Applications with Utkal University, India, from 1997 to 2002, and a Reader in electronics with Berhampur University, India, from 1993 to 1997. His research interest includes design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal, image and video processing, communication, bio-informatics and intelligent computing. He has contributed nearly 200 technical papers to various reputed journals and conference proceedings.

Dr. Meher has served as a speaker for the Distinguished Lecturer Program (DLP) of IEEE Circuits Systems Society during 2011 and 2012 and Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—II: EXPRESS BRIEFS during 2008 to 2011. Currently, he is serving as Associate Editor for the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS—I: REGULAR PAPERS, the IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, and Journal of Circuits, Systems, and Signal Processing. Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India. He was the recipient of the Samanta Chandrasekhar Award for excellence in research in engineering and technology for 1999.

Somaya Al-Maadeed (SM’12) received the B.Sc. degree in computer science from Qatar University, Qatar, in 1994, the M.Sc. degree in mathematics and computer science from Alexandria University, Egypt, in 1999, and the Ph.D. in computer science from Nottingham University, Nottingham, U.K., in 2004.

Following her Ph.D., she worked as an Assistant Professor with Qatar University, where she did research in the areas of biometrics, digital filters, speech recognition, image processing, and document management. She has published around forty papers in the above general areas.

Abbes Amira (M’01–SM’07) received the Ph.D. degree in the area of electronic and computer engineering from Queens University Belfast, U.K., in 2001, developing a coprocessor for matrix algorithms using FPGAs.

He is a Full Professor in visual communication at the University of the West of Scotland (UWS), Scotland. Prior to joining UWS, he took academic positions at the university of Ulster-UK as Reader in embedded systems in the Nanotechnology and Integrated Bio-Engineering Centre (NIBEC); Associate Professor in Embedded Systems in the department of electrical engineering at Qatar University-Qatar, Senior lecturer at Brunel University-UK within the division of Electronic and Computer Engineering and a lectureship in Computer Engineering at Queens University. He has been awarded a number of grants from government and industry and has published around 200 publications in the area of reconfigurable computing and image and signal processing during his career to date. Twelve Ph.D. students have successfully completed their Ph.D. under his supervision. He took consultancy positions with many companies in UK, and he holds two visiting professor positions at the University of Nancy, Henri Poincare, France and University of Tun Hussein Onn, Malaysia. His research interests include: embedded systems, high performance reconfigurable computing, image and video processing, multi-resolution analysis, biometrics and connected health applications.

Prof. Amira has been invited to give talks, short courses and tutorials at universities and international conferences and being chair, program committee for a number of conferences. He was one of the tutorials presenters at ICIP 2009, Conference Chair of ECVW 2011, Program Chair of ECVW2010, Program Co-Chair of ICM12, DELTA 2008, IMVIP 2005. He is also one of the 2008 VARIAN prize recipients. He has been an external examiner for many Universities in UK, Hong Kong, Australia and Malaysia. He was one of the guest editors for the Special Issue in the Pattern Recognition Journal, titled Feature Generation and Machine Learning for Robust Multimodal Biometrics, March 2008. He is a Fellow of IET, Fellow of the Higher Education Academy, and Senior member of ACM.