Efficient architectures for 3D HWT using dynamic partial reconfiguration

Journal of Systems Architecture 56 (2010) 305–316


A. Ahmad a,b,*, B. Krill a, A. Amira c, H. Rabah d

a Department of Electronic and Computer Engineering, School of Engineering and Design, Brunel University, West London, UB83PH Uxbridge, United Kingdom
b Department of Computer Engineering, Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia (UTHM), 86400 Batu Pahat, Johor, Malaysia
c Nanotechnology and Integrated BioEngineering Centre (NIBEC), University of Ulster, Shore Road Newtownabbey, BT37 0QB Belfast, Northern Ireland
d Laboratoire d’Instrumentation, Electronique de Nancy, University Henri Poincare, 540003 Nancy, France

Article info

Article history:
Received 28 September 2009
Received in revised form 4 February 2010
Accepted 9 February 2010
Available online 20 February 2010

Keywords:
Dynamic partial reconfiguration (DPR)
Medical image processing
Haar wavelet transform (HWT)

1383-7621/$ - see front matter © 2010 Elsevier B.V. All rights reserved.
doi:10.1016/j.sysarc.2010.02.001

* Corresponding author. Address: Department of Electronic and Computer Engineering, School of Engineering and Design, Brunel University, West London, UB83PH Uxbridge, United Kingdom.

E-mail address: [email protected] (A. Ahmad).

Abstract

This paper presents the design and implementation of a three dimensional (3D) Haar wavelet transform (HWT) with transpose based computation and a dynamic partial reconfiguration (DPR) mechanism on a field programmable gate array (FPGA). Due to the separability property of the multi-dimensional HWT, the proposed architecture has been implemented using a cascade of three N-point one dimensional (1D) HWTs and two transpose memories for a 3D volume of N × N × N, suitable for real-time 3D medical imaging applications. These applications require continuous hardware servicing, hence DPR has been introduced. Two architectures were synthesised using VHDL and implemented on Xilinx Virtex-5 FPGAs. Experimental results and comparisons between different configurations using partial and non-partial reconfiguration processes, together with a detailed performance analysis of the area, power consumption and maximum frequency, are presented in this paper.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Real-time three dimensional (3D) medical image processing is a niche area concerned with the processing of real-time sequences of medical image data, which represents a developing trend in medical industries. Due to significant advances in medical technology, there are various 3D medical imaging modalities such as magnetic resonance imaging (MRI), computed tomography (CT) and positron emission tomography (PET) that have been widely used, especially for cancer screening and diagnosis.

With a more realistic representation of anatomic structure, including information about position, size and shape, and ease of visualisation, 3D images provide more accurate parameters for the assessment of pathological changes [1], leading to great advantages for image based screening, diagnosis and treatments.

Despite their many advantages, most 3D medical imaging algorithms are computationally intensive, with matrix transformation as the most fundamental operation involved in transform based methods. The computational complexities of various algorithms that have been employed in 3D real-time medical imaging [2] range from $O(N \cdot \log N)$ for the fast Fourier transform (FFT) to $O(N^2 \cdot J)$ for the recently developed curvelet transform (where N is the transform size and J is the maximum transform resolution level), and hence these algorithms are extremely computationally intensive for large medical volume data [3].

Therefore, there is a need for high performance systems whose architectures remain flexible enough to allow quick upgrades to cope with real-time applications. Moreover, an efficient implementation of these operations is of significant importance and leads to efficient solutions for large medical volume data in real-time medical imaging applications.

As diverse as the applications may seem, there is no doubt that the discrete wavelet transform (DWT) plays a significant role in image processing applications as an alternative to classical time-frequency representation techniques such as the discrete Fourier transform (DFT) and discrete cosine transform (DCT) [4].

Owing to its multi-resolution characteristics and its capability to represent real life non-stationary signals such as image and speech, DWT has attracted a great deal of research and development. The Haar wavelet transform (HWT) is the simplest of all wavelets and it has been applied in image multi-resolution due to the following advantages: it is conceptually simple, fast, memory-efficient, and exactly reversible without the edge effects which are a problem with other wavelet transforms [5,6].

Reconfigurable hardware, especially field programmable gate arrays (FPGAs), is widely used in real-time processing, from simple low-resolution and low-bandwidth applications to very high-resolution and high-bandwidth ones [7].

The advantages offered by FPGAs, such as massive parallelism capabilities, multimillion gate counts and special low power packages, help to minimise the amount of memory used, computational complexity and power consumption [8]. This makes FPGAs an excellent hardware candidate for an efficient implementation of real-time 3D medical image processing compared to application specific integrated circuit (ASIC) and even digital signal processor (DSP) platforms [9].

The application of 3D real-time image processing uses several building blocks for its computationally intensive algorithms to perform matrix transformation operations. Moreover, the complexity of addressing and accessing large medical volume data, and the massive amount of data to be processed, have resulted in vast challenges from a hardware implementation point of view.

In order to cope with these issues, FPGAs with efficient reconfigurability mechanisms should be employed to meet the requirements of these applications in terms of maximum speed, size, power consumption and throughput.

Dynamic partial reconfiguration (DPR) is a promising mechanism for reducing the hardware required to implement an efficient design for 3D medical image processing applications, as well as improving the performance, speed and power consumption of the system.

With this technique, the design can be divided into sub-designs that fit into the available hardware resources and can be uploaded into the reconfigurable hardware when needed. As the first configuration finishes its operations, the next configuration can be uploaded to the hardware [10]. In short, by timesharing, large designs can be implemented on limited hardware resources.

From the designer and FPGA technology perspective, partial reconfiguration enables designers to implement multiple modules of a complex system on one FPGA device, where a static area must be defined for communication between components, while a full reconfiguration implementation treats the entire FPGA as one module [10].

Xilinx provides many FPGA devices that support active partial reconfiguration, ranging from the Spartan-3 device to the latest Virtex-5 device. Active, or dynamic, reconfiguration allows the modification of a part of the FPGA while the remaining part of the device is active [11]. Thanks to the internal configuration access port (ICAP), the reconfiguration management can be implemented in the FPGA so that the final system is self-reconfigurable [12,13].

Basically, a dynamically and partially reconfigurable system based on commercial FPGAs is composed of a static part and different dynamic parts. The communication between these parts requires the use of bus macros. In the Virtex-II family these bus macros are implemented using tri-state buffers, which is not very effective. A new alternative based on slice macros, in the form of predefined routed macros, has been introduced in more recent devices (Virtex-4, 5 and 6) [14]. These macros are placed at the boundaries separating dynamic and/or static region modules and allow communication between different regions of the FPGA. Further explanation is given in Section 4 for the implementation of 3D HWT with the DPR mechanism.

The aim of this paper is to develop efficient architectures for 3D HWT with transpose based computation using the DPR technique. The proposed architectures will be deployed in a reconfigurable environment for adaptive medical image compression. An in-depth evaluation of these architectures in terms of area, power consumption and maximum frequency is also carried out. The influence of the transform size on the hardware performance of both proposed architectures is addressed.

The rest of the paper is organised as follows. An overview of the related work on 3D HWT and DPR is given in Section 2. Section 3 explains the mathematical background for HWT and matrix transposition. Section 4 presents the proposed architecture of 3D HWT with the DPR mechanism. Experimental results, comparison and analysis are presented in Section 5. Finally, concluding remarks and further potential ideas to be explored are given in Section 6.

2. Related work

In this section, existing 3D DWT architectures that have been implemented on FPGAs and a broad overview of DPR implementation techniques in various fields are reviewed.

2.1. 3D DWT

Despite its complexity, there has recently been an interest in 3D DWT implementation on various platforms. However, a survey of the existing literature indicates that the research is still in its infancy, as demonstrated by the limited contributions proposed, which can be classified into three categories: architecture development [15–17], architecture with FPGA implementation [18–22] and finally architecture that has been implemented on other silicon platforms [23]. Since the aim and contribution of this paper concern the implementation of partial and non-partial reconfiguration on FPGA, the review and discussion focus on previously published FPGA implementations only.

With regard to FPGA implementation, the authors of Ref. [18] proposed a memory-efficient real-time architecture with the memory requirement optimised to the order of $O(KN^2 + (K - 2) \cdot N)$. Moreover, the proposed architecture is low power and can run at a higher frequency, fully utilising the advantages of multiplierless, parallel and pipelined structures for filter design. The architecture has been implemented on Xilinx FPGA devices with an operating frequency of 75 MHz. However, it is noted that the drawback of this architecture is the reduction in compression of video sequences, and modifications are required for better optimisation in performance.

Another significant implementation [19,20] focuses on an area-efficient, high-throughput architecture. The architecture is based on distributed arithmetic (DA) and is suitable for real-time medical imaging applications. It was implemented in VHDL and synthesised on Xilinx Virtex-E FPGAs. With the advantages offered by the DA design technique, the architecture has been shown to be area-efficient and high-throughput, with the processor running at up to 85 MHz and capable of processing a five-level DWT analysis of a 128 × 128 × 128 functional magnetic resonance imaging (fMRI) volume image in 20 ms.

Ismail et al. presented another hardware design and FPGA implementation of a new efficient 3D DWT for video compression applications [21]. The design has been implemented on a device from Altera. It exhibits low memory demands and lower latencies for the compression and decompression processes. The advantages of their design are that it is generic to any wavelet filter coefficients and scalable to fit any frame size of the video sequence.

A more recent FPGA implementation of the 3D wavelet transform was presented by Salem et al. for video segmentation in traffic monitoring. Empirically, the design has a slice utilisation of 63% and the maximum frequency allows a 100 MHz system clock [22].

Table 1 summarises the previously published architectures for FPGA implementation of 3D DWT [18–22]. Because the implementation of 3D DWT on FPGA is still immature, the following considerations have been taken into account when evaluating and conducting a comparative study:

1. The architectures proposed in Refs. [18–22] were targeted at different FPGA platforms, applications and design techniques. Since each device uses a different FPGA technology, such as process technology and circuit topology, any comparison and performance evaluation between them is inconsistent.

Table 1
Comparative study of the 3D DWT architectures with FPGA implementation.

| Parameters | [18] | [19,20] | [21] | [22] |
|---|---|---|---|---|
| FPGA | XCV50TQ144 | V300EFG256 | EPF10K200SFC672-1 | XC2VP30 |
| Filter | Daubechies | Daubechies | Daubechies | Haar |
| Arith. Tech. | CSD | DA | NA | NA |
| Area (Slices) | 738 out of 768 | 1271 | LCs 9984 = 7.94% | 8663 out of 13696 |
| Max. Freq. (MHz) | 75 | 85 | NA | 101.245 |

Note: Columns [18] to [22] are architectures with FPGA implementation. Arith. Tech., arithmetic techniques. Max. Freq., maximum frequency.


2. With different design techniques targeted at specific design tradeoffs, each proposed architecture naturally claims its own novelty. This differs from the aim of this paper, which is to demonstrate the advantages offered by DPR for complex systems such as medical image compression.

2.2. DPR

During the last decade, DPR has been widely studied in various fields: the development of the Erlangen slot machine (ESM) as an FPGA-based reconfigurable computer [24], video based driver assistance in automotive applications [25] and a waveform-like reconfiguration for DPR in Ref. [26].

A significant contribution presented in Ref. [10], a novel FPGA-based scalable architecture for DCT using DPR, has been the motivation for this paper. The proposed scalable architecture is based on DA techniques and works in two different modes to reduce computational complexity and power consumption. It has been reported that with the DPR mechanism, the processing elements of the DCT architecture can be changed easily and the scalable architecture can be adjusted to perform different types of DCT zonal coding.

In addition, the scalable architecture can also replace the unused DCT computation with the motion estimation module to fully utilise the advantages of DPR. The proposed architecture provides significant results for the partial reconfiguration process, saving 3.03% of power consumption and reducing the processing clock cycles and the reconfiguration overhead by 39.5% and 50% respectively.

These achievements provide a strong justification for further exploring the 3D HWT implementation with both partial and non-partial reconfiguration processes and for evaluating their performance in terms of area, power consumption and maximum speed.

In addition to the previously reviewed work on DPR, there exist two other significant references discussing the advantages and challenges of DPR [27,28]. Shoa and Shirani thoroughly explain different issues in run-time reconfiguration (RTR) systems and list the implemented systems which support RTR, as well as discussing different applications of RTR and the improvements achieved by applying it.

An evaluation of DPR for signal and image processing is another interesting contribution, presented in Ref. [28]. The authors present the advantages and limitations of DPR in professional electronics applications and guidelines to improve its applicability. Furthermore, the missing elements of the design flow for DPR have been identified and explained.

3. Mathematical background

3.1. 3D Haar wavelet transform (HWT) and matrix transposition

The wavelet transform is a mathematical tool with a great variety of possible applications. As the simplest wavelet transform, the Haar wavelet $\psi_{haar}$ is a discontinuous, symmetric wavelet in the Daubechies family, and the only one that has an explicit expression. The scale function $\phi_{haar}$ is a simple average function. The wavelet and the scale functions can be expressed as follows:

$$\psi_{haar}(t) = \begin{cases} 1 & \text{if } 0 \le t < \frac{1}{2} \\ -1 & \text{if } \frac{1}{2} \le t < 1 \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

$$\phi_{haar}(t) = \begin{cases} 1 & \forall\, 0 \le t < 1 \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

In addition, the HWT is simple and computationally cheap because it can be implemented by a few integer additions, subtractions and shift operations [29]. This wavelet was selected because of its simplicity and its mathematical features: it is the simplest wavelet basis, it can be implemented using pairwise averaging and differencing, it is both unitary and orthogonal, and it has compact support. Therefore, this wavelet basis is by no means the most suitable to achieve close to optimal compression performance in 3D medical image compression application.

The one dimensional (1D) HWT can be extended naturally into higher dimensions simply by taking tensor products of 1D filters [30]. This approach is fundamentally based on partitioning the image into small blocks containing eight voxels. Consider a block with dimensions of 2 × 2 × 2 whose voxels are labeled $c_{i,j,k}$, $0 \le i, j, k \le 1$. In this case, 0 represents low-pass filtering and 1 high-pass filtering. The Haar transform in 3D is expressed as in Fig. 1. In the decomposition of this small block, $c_{LLL}$ represents the overall average, or successive low-pass filtering coefficient, and the remaining seven values are detail, or wavelet, coefficients, determined by the order of filtering.

As an example, the coefficient $c_{LHL}$ is obtained by applying the low-pass filter L, the high-pass filter H and then the low-pass filter L along the three principal axes, respectively [29].

In a more realistic example, taking an image with dimensions of 8 × 8 × 8, the image volume is partitioned into a series of sub-blocks as defined previously. After the decomposition of each of these sub-blocks, the 64 overall averaging coefficients are combined to form eight new 2 × 2 × 2 sub-blocks and the next level in the decomposition hierarchy.

The complete decomposition ends when only one overall averaging coefficient and five hundred and eleven detail coefficients remain. For an n × n × n image volume, $\log_2 n$ iterations are required to perform a complete decomposition, and an illustration of the method is presented in Fig. 2.
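The coefficient counts quoted above can be checked with a short sketch. It only assumes what the text states: each decomposition level halves the side length and leaves seven detail coefficients per 2 × 2 × 2 sub-block. The function name `decomposition_summary` is hypothetical and is not part of the paper.

```python
def decomposition_summary(n):
    """Count the coefficients of a complete 3D Haar decomposition.

    Assumes an n x n x n volume with n a power of two, as in the text.
    Returns (levels, averaging_coefficients, detail_coefficients).
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    levels = n.bit_length() - 1      # log2(n) iterations
    details = 0
    side = n
    for _ in range(levels):
        side //= 2                   # each level halves every dimension
        details += 7 * side ** 3     # seven detail values per 2x2x2 sub-block
    return levels, side ** 3, details


if __name__ == "__main__":
    print(decomposition_summary(8))
```

For the 8 × 8 × 8 example, the three levels contribute 448, 56 and 7 detail coefficients, which sum to the five hundred and eleven quoted in the text.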

Matrix transposition can be viewed as changing the storage of matrix elements from row-major order to column-major order, or vice versa. Mathematically it can be expressed by Eq. (3) for all $i, j = 0 \ldots N - 1$, where $a_{ij}$ is the coefficient of the matrix. An example is shown in Fig. 3.

$$a_{ij} = a_{ji} \tag{3}$$
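As a minimal software sketch of this change of storage order, the function below transposes an N × N matrix held in a flat, row-major list by writing element (i, j) to the address of element (j, i). The flat-list layout and the name `transpose_flat` are assumptions made for illustration; the paper's transpose modules perform the equivalent address calculation in hardware.

```python
def transpose_flat(mem, n):
    """Transpose an n x n matrix stored row-major in the flat list `mem`."""
    if len(mem) != n * n:
        raise ValueError("mem must hold exactly n * n elements")
    out = [None] * (n * n)
    for i in range(n):
        for j in range(n):
            # element (i, j) is read from address i*n + j
            # and written to the address of element (j, i)
            out[j * n + i] = mem[i * n + j]
    return out


if __name__ == "__main__":
    print(transpose_flat([1, 2, 3, 4], 2))
```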

Fig. 1. 3D HWT expression.

Fig. 2. Decomposition based on tensor product of one-dimensional filters. (a) Original image volume. (b) Image volume partitioned into 2 × 2 × 2 sub-blocks. (c) One overall low-pass coefficient is obtained from each sub-block after the first decomposition stage. (d) All sub-block averaging coefficients are clustered to form new sub-blocks, which are then decomposed further to obtain one overall low-pass coefficient. (e) Image after two stage decomposition on a 4 × 4 × 4 image volume.

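$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & \cdots & \cdots & a_{NN} \end{bmatrix} \xrightarrow{\text{Transposition}} \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{N1} \\ a_{12} & a_{22} & \cdots & a_{N2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1N} & \cdots & \cdots & a_{NN} \end{bmatrix}$$

Fig. 3. Transposition of a matrix.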


3.2. Pipelined direct mapping implementation

For illustration purposes, the 1D HWT flow diagram with N input samples for the pipelined direct mapping implementation is shown in Fig. 4, where the ‘Avg.’ and ‘Diff.’ notations refer to averaging and differencing respectively.

The registers, adders and difference blocks are triggered at the rising edge of the clock. At the first clock edge, the data are loaded, and at the second edge they are passed from the registers to the adder and difference blocks.

At the third clock edge, the adder and the difference blocks perform their respective operations as described in Eqs. (4) and (5), where $i = 0 \ldots \left(\frac{N}{2} - 1\right)$.

$$H_i = \frac{a_{2i} + a_{2i+1}}{2} \tag{4}$$

$$H_{\left(\frac{N}{2} + i\right)} = a_{2i} - a_{2i+1} \tag{5}$$
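Eqs. (4) and (5) can be illustrated in software as follows. This is only a behavioural sketch, not the pipelined hardware: `haar_step` applies one averaging and differencing stage, and `haar_1d` repeats it on the averages, which is an assumption based on the successive Avg./Diff. stages drawn in Fig. 4. Both function names are hypothetical.

```python
def haar_step(a):
    """One averaging/differencing stage following Eqs. (4) and (5)."""
    n = len(a)
    if n < 2 or n % 2:
        raise ValueError("input length must be even and at least 2")
    half = n // 2
    out = [0] * n
    for i in range(half):
        out[i] = (a[2 * i] + a[2 * i + 1]) / 2    # Eq. (4): average
        out[half + i] = a[2 * i] - a[2 * i + 1]   # Eq. (5): difference
    return out


def haar_1d(a):
    """Repeat the stage on the averages (length must be a power of two)."""
    out = list(a)
    length = len(out)
    while length > 1:
        out[:length] = haar_step(out[:length])
        length //= 2
    return out


if __name__ == "__main__":
    samples = [8, 6, 4, 2, 1, 3, 5, 7]
    print(haar_step(samples))
    print(haar_1d(samples))
```

With N = 8 the first four outputs of `haar_step` are the averages and the last four are the differences, matching the index offset of N/2 in Eq. (5).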

4. Architecture

4.1. Proposed system overview applications

Fig. 5a illustrates an application overview of the proposed medical image compression system, including the transform, quantization and entropy coding blocks. In each block, buffers are used for storing intermediate results to be processed. Since the application targets 3D medical images, the transform block in this system includes the 3D HWT. Our goal is to propose an adaptive compression system for medical images in which all blocks are reconfigurable; in this paper, the focus is only on the transform block.

The DPR is used for two reasons. The first is to make it possible to exchange the type of algorithm (Haar, Daubechies, biorthogonal 9/7, etc.), thus making the architecture flexible. The second is to find the best tradeoffs between area, power consumption and hardware performance.

Fig. 4. 1D HWT flow diagram with N inputs sample for direct mapped architecture. [Flow diagram: the N input samples i(0) … i(N-1) pass through register stages and successive Avg./Diff. stages to produce the output vector o(0) … o(N-1).]

Fig. 5. Proposed system architectures. (a) Compression system overview. (b) Architecture for 3D HWT with transpose based computation. (c) Input data for sub-images $[I]_z$ with $z \in [0, 1 \ldots 7]$. (d) Transpose matrix after T1 for sub-images/slices with $z \in [0, 1 \ldots 7]$. (e) Transpose matrix after T2 for sub-images/slices with $x \in [0, 1 \ldots 7]$.

4.2. 3D HWT with transpose based computation

4.2.1. System architecture

The proposed system for 3D HWT with transpose based computation is illustrated in Fig. 5b. The whole chain takes a 3D image of N × N × N points as input and outputs N × N × N coefficients. To simplify the hardware design, the 3D HWT is split into three 1D HWT calculations cascaded together with transpose modules in between. This is achieved by performing the first 1D HWT along the rows (columns) of the array, followed by a 1D HWT along the columns (rows) of the transformed array. The third 1D HWT is performed on corresponding pixels in each of the N sub-images that constitute the third dimension. All transposition modules store the transposed coefficients into memory, and a fetch unit module reads back the coefficients for the next 1D HWT calculation.

4.2.2. 3D HWT and transpose

The proposed 3D HWT implementation works as follows. The input to the first 1D HWT is read row by row, the 1D HWT is performed on each input vector as it is provided, and the calculated values are sent to the transpose module T1, which calculates the memory addresses for the transposition and stores the data into memory. The transpose T1 acts as a memory forwarder and performs the matrix transpose, since row vectors are provided by the 1D HWT. After transposition of the resultant matrix, another 1D HWT is performed on the coefficients stored in memory to yield the two dimensional (2D) HWT coefficients.

This is the conventional row-column 2D HWT computation. The 2D HWT computation is performed on each sub-image $[I]_z$ for $z \in [0, 1 \ldots 7]$, where $[I]_0$ is the first sub-image and $[I]_7$ is the eighth sub-image of the input volume, as shown in Fig. 5c with the notation $I^z_{xy}$.

The output coefficients of the 2D HWT are sent to the second transpose, T2. As described before, all coefficients are stored into memory, and the transpositions of T2 are stored into memory after transformation. As shown in Fig. 5d and e, $T^z_{1,xy}$ and $T^z_{2,xy}$ represent the coefficients after performing transpositions T1 and T2 respectively. As an example, the HWT computations for the first transpose are $T^z_{1,00} = \frac{I^z_{00} + I^z_{01}}{2}$ for averaging and $T^z_{1,04} = I^z_{00} - I^z_{01}$ for differencing; these two operations are based on Eqs. (4) and (5). Algorithm 1 gives the complete description of the 3D HWT process.

Algorithm 1. The 3D HWT pseudo-code

for slice = 1 to noslices do
    for row = 1 to norows do
        Apply a 1D HWT column-wise
    end for
end for
for slice = 1 to noslices do
    for col = 1 to nocols do
        Apply a 1D HWT row-wise
    end for
end for
for col = 1 to nocols do
    for row = 1 to norows do
        Apply a 1D HWT slice-wise
    end for
end for
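The three loop nests of Algorithm 1 can be sketched in software as three separable passes over a cubic volume. The sketch below is illustrative only: it applies a single averaging/differencing stage per axis (Eqs. (4) and (5)), indexes the volume as `volume[z][y][x]`, and uses the hypothetical names `haar_step` and `hwt_3d`; it does not model the transpose memories or the DPR mechanism.

```python
def haar_step(a):
    """One averaging/differencing stage, as in Eqs. (4) and (5)."""
    half = len(a) // 2
    avg = [(a[2 * i] + a[2 * i + 1]) / 2 for i in range(half)]
    diff = [a[2 * i] - a[2 * i + 1] for i in range(half)]
    return avg + diff


def hwt_3d(volume):
    """Apply one 1D Haar stage along x, then y, then z of an n x n x n volume."""
    n = len(volume)
    out = [[list(row) for row in plane] for plane in volume]  # deep copy
    # Pass 1: transform every row of every slice (along x).
    for z in range(n):
        for y in range(n):
            out[z][y] = haar_step(out[z][y])
    # Pass 2: transform every column of every slice (along y).
    for z in range(n):
        for x in range(n):
            col = haar_step([out[z][y][x] for y in range(n)])
            for y in range(n):
                out[z][y][x] = col[y]
    # Pass 3: transform across slices for every (row, column) position (along z).
    for y in range(n):
        for x in range(n):
            line = haar_step([out[z][y][x] for z in range(n)])
            for z in range(n):
                out[z][y][x] = line[z]
    return out


if __name__ == "__main__":
    block = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    print(hwt_3d(block))
```

For a single 2 × 2 × 2 block, the element at index [0][0][0] of the result is the overall average of the eight voxels, and the other seven entries are the detail coefficients.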

The Haar function is able to process a given vector in $\frac{N}{2} + 4$ cycles (8 cycles for N = 8), as shown in Eq. (6). Each transposition with memory writes takes N cycles (8 cycles for N = 8), as shown in Eq. (7). The number of clock cycles for a sub-image is given by Eq. (8), and the number of clock cycles required for the complete 3D image is given by Eq. (9).

$$F_{c_h} = \frac{N}{2} + 4 \tag{6}$$

$$F_{c_t} = N \tag{7}$$

$$F_{c_{subimg}} = N \cdot [6 \cdot (F_{c_h} + F_{c_t})] \tag{8}$$

$$F_{c_{img}} = N \cdot F_{c_{subimg}} \tag{9}$$
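As a rough worked example, Eqs. (6) to (9) can be evaluated for a given transform size. The sketch assumes that the factor 6 in Eq. (8) is read correctly from the source; the function name `cycle_counts` and the parameter `subimg_factor` are hypothetical.

```python
def cycle_counts(n, subimg_factor=6):
    """Evaluate the clock cycle estimates of Eqs. (6)-(9) for transform size n.

    subimg_factor is the constant multiplier in Eq. (8), assumed here to be 6.
    """
    f_ch = n // 2 + 4                               # Eq. (6): 1D HWT of one vector
    f_ct = n                                        # Eq. (7): one transposition
    f_subimg = n * (subimg_factor * (f_ch + f_ct))  # Eq. (8): one sub-image
    f_img = n * f_subimg                            # Eq. (9): whole 3D image
    return f_ch, f_ct, f_subimg, f_img


if __name__ == "__main__":
    for size in (8, 16, 32, 64, 128):
        print(size, cycle_counts(size))
```

For N = 8 this gives 8 cycles for the Haar function and 8 cycles for a transposition, in agreement with the values quoted in the text.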

4.3. 3D HWT with DPR

4.3.1. Proposed architecture

Fig. 6 illustrates the proposed idea of reconfigurable and adaptive system architectures using DPR. It is composed of several reconfigurable processing modules (RPM), a reconfigurable interface, an off-chip memory and a MicroBlaze (µBlaze), and the system is connected to the host personal computer (PC) via peripheral component interconnect (PCI) express. The reconfigurable processing modules allow hardware acceleration and can be reconfigured based on the system demand, while the communication interface is used to build the interconnection between the RPMs and the other components.

Fig. 6. Proposed reconfigurable and adaptive system architectures. [Block diagram: RPM1 to RPM5, interfaces, bus, ICAP, µBlaze, PCI express, SDRAM interface and off-chip SDRAM.]

Both proposed architecture implementations on the FPGA are given in Fig. 7a and b. The DPR framework consists of two reconfigurable areas and a static area. One DPR area is used for the 1D HWT and the second is used for the different transposition modules. The 1D HWT DPR area is directly connected to the transpose DPR area. The static area consists of the data fetch unit and the memory controller (Wishbone compliant). On the other hand, the implementation of 3D HWT without DPR defines the entire FPGA device as one module, as opposed to the implementation with the DPR mechanism.

4.3.2. Reconfigurable and static area

The proposed system is implemented with the current partial reconfiguration suite from Xilinx (ISE 9.2PR and PlanAhead 10.1) [14]. It uses module based DPR, where configuration frames are reconfigured and bus macros are used to connect the DPR areas with the static area [31]. The partial reconfiguration design flow is illustrated in Fig. 8a and b, which show the required steps and the module definitions.

This methodology has the restriction that all design files and reconfigurable modules must be available to the build environment in order to build partial modules. The main advantage of DPR is that an implementation of a given design can be integrated into a smaller FPGA. This reduces cost, package size and power. Power consumption and logic size can also be reduced by cascading operation/calculation modules.

In the 3D HWT case, the transposition module and the 1D HWT module can be changed. The transposition module will be changed during image calculation three times for each sub-image. First, transposition T1 performs the row to column transposition, which is active until a sub-image is transposed. After the T1 sub-image transposition, the DPR area is reconfigured with the T2 transposition, which saves the sub-images. This operation is repeated for all sub-images. After all sub-images are computed and transposed with T2, the transposition DPR area is reconfigured with the straight transposition and the last 1D HWT is performed on all T2 sub-images. The HWT DPR area can be reconfigured to switch between different pixel sizes. The dependency on the pixel size N is propagated from the HWT module to all connected modules, which gives the advantage that no other logic changes are necessary.

The DPR module connections are made with simple bus interfaces. The data fetch unit and the HWT DPR area are connected by a bus of defined data bit width, a request line and a back signal, free. The fetch unit sends data and the request to the HWT core as long as the free signal is active. The HWT and transposition modules are connected by the defined data bit width bus and an enable signal. In each cycle where the enable signal is active, data are transposed and written into the memory.

5. Experimental results

5.1. FPGA implementation

The Xilinx early access partial reconfiguration (EAPR) design flow [31] has been used as the design flow reference, and the two proposed architectures have been implemented on the Xilinx Virtex-5 (XC5VLX110T-3FF1136).

Table 2 lists the overall performance results for both proposed architectures for N = 8 and 128. The implementation of 3D HWT with the DPR mechanism provides significant results, saving area and reducing power consumption by 47.07% and 9.77% respectively. In terms of maximum frequency, the DPR mechanism yields a 51.10% better maximum frequency than the implementation without DPR.

Fig. 7. Proposed top architecture of 3D HWT. (a) Without DPR. (b) With DPR. [Data flow in both panels on the Virtex-5 XC5VLX110T: (1) read data from memory, (2) send data to be processed in 1D HWT, (3) send data for transposition operation, (4) write the results to different memory locations.]


Table 3 lists a comparison between non-partial and partial reconfiguration scenarios. In the case of non-partial reconfiguration, a full bitstream needs to be generated and stored for each 3D HWT configuration. A full bitstream of 3,889,941 bytes is required for a 3D HWT configuration, and even its shortest configuration time, 4.8 ms, is the worst of the scenarios compared.

In contrast, the partial bitstreams generated are significantly smaller, which reduces the storage space required to store the various bitstreams. The results show that, for a transform size of N = 64, the file size of the partial bitstreams is about 86.95% smaller than that of a full bitstream, and the configuration time is also reduced by 86.88%. In summary, comparing the file sizes of the bitstreams shows that partial reconfiguration produces more efficient bitstreams and that a smaller bitstream decreases the configuration time.
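To put the percentages into absolute terms, the following sketch applies the stated reductions to the full bitstream figures quoted above. It assumes that both percentages are reductions relative to the full bitstream; the resulting values are approximations derived here for illustration, not measurements reported in the paper. The helper name `after_reduction` is hypothetical.

```python
FULL_BITSTREAM_BYTES = 3_889_941   # full bitstream size quoted in the text
FULL_CONFIG_TIME_MS = 4.8          # full configuration time quoted in the text


def after_reduction(value, percent_reduction):
    """Return what remains of `value` after reducing it by the given percentage."""
    return value * (1.0 - percent_reduction / 100.0)


# Assumption: 86.95% and 86.88% are reductions relative to the full bitstream.
partial_bytes = after_reduction(FULL_BITSTREAM_BYTES, 86.95)
partial_time_ms = after_reduction(FULL_CONFIG_TIME_MS, 86.88)

if __name__ == "__main__":
    print(f"approximate partial bitstream size: {partial_bytes:,.0f} bytes")
    print(f"approximate partial configuration time: {partial_time_ms:.2f} ms")
```

Under that assumption, the partial bitstreams for N = 64 occupy roughly half a megabyte and load in well under one millisecond.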

5.2. Discussions

5.2.1. Result analysisFor the FPGA implementation purposes, there are different

transform sizes (N = 8, 16, 32, 64 and 128) have been used to eval-uate the relationship of the transform sizes on the area (slices),power consumption (mW) and maximum speed (MHz). The ideasof various transform sizes used also act as a good mechanism basedon the large medical volumes data involves in real-time 3D medi-cal image processing applications.

Influence of transform size on area, power consumption andmaximum frequency is depicted in Figs. 9–11, respectively. Results

indicate that the proposed 3D HWT with transpose based compu-tation requires more area to be implemented while by using DPRmechanism, the area saving can be achieved between 40.82% and47.18% and the relationship of the area is increasingly proportionalto the transform sizes. The power consumptions can be estimatedby the large areas required by non-partial reconfiguration con-sumes up to 1872.83 mW for N = 128 and by performing partialreconfiguration it saves power consumption by 0.74–9.77%.

To evaluate the performance of the proposed architectures in terms of maximum frequency, Fig. 11 shows the relationship between transform size and maximum achievable operating frequency; a higher maximum frequency is achieved with DPR. Since the static build contains all transpositions and the 1D HWT function, the design occupies more space on the FPGA. Moreover, the signals for the different transpositions have to be multiplexed, which increases the routing and the number of signals. This results in a lower frequency than for the DPR design. Furthermore, the partial design can only run at the clock speed of its slowest module. In the proposed system the maximum frequency is 356.9 MHz, since the transposition T2 has the lowest frequency, as shown in Fig. 12.

Moreover, to visualise the impact of non-partial and partial reconfiguration on the proposed architecture, chip layouts on different Virtex-5 FPGA devices are shown in Fig. 13. With the DPR mechanism, the static and reconfigurable areas can be specified, and they are clearly visible in the generated layouts.

[Figure 8. Panel (a): flow chart of six numbered steps, grouped under "Design partitioning" and "Design floorplanning and budgeting": (1) partition the system into modules; (2) define static modules and reconfigurable modules; (3) decide the number of partial reconfiguration regions; (4) decide partial reconfiguration region sizes, shapes and locations; (5) map modules to partial reconfiguration regions; (6) define partial reconfiguration region interfaces and instantiate slice macros for the partial reconfiguration region interfaces. Panel (b): block diagram of static and reconfigurable modules, with blocks labelled Data fetch unit, Block RAMs, Controller, Transpose and 1D HWT.]

Fig. 8. Partial reconfiguration design flow. (a) Steps for partial design flow. (b) Define static and reconfigurable modules.

Table 2
Resources utilisation and overall performance of the proposed 3D HWT architectures on XC5VLX110T-3FF113.

| Parameters | Without DPR, N = 8 | Without DPR, N = 128 | With DPR, N = 8 | With DPR, N = 128 |
|---|---|---|---|---|
| Area (slices) | 3180 (4.60%) | 39,261 (56.80%) | 1882 (2.72%) | 20,779 (30.06%) |
| Power consumption (mW) | 1102.64 | 1872.83 | 1094.48 | 1689.84 |
| Maximum frequency (MHz) | 206 | 170.13 | 371.15 | 347.92 |

Table 3
Comparison of generated bitstream sizes and configuration times for different transform sizes.

| N (transform size) | Without DPR, BS (bytes) | Without DPR, CT (ms) | With DPR, BS (bytes) | With DPR, CT (ms) |
|---|---|---|---|---|
| 8 | 3,889,941 | 4.8 | 191,024 | 0.23 |
| 16 | 3,889,941 | 4.8 | 287,573 | 0.36 |
| 32 | 3,889,941 | 4.8 | 313,101 | 0.39 |
| 64 | 3,889,941 | 4.8 | 507,677 | 0.63 |

Note: BS, bitstream. CT, configuration time.

A comparative study of the non-partial and partial reconfiguration processes leads to a significant conclusion concerning the advantages offered by DPR, especially in processing large medical volumes. Analysis of the performance achieved for different parameters, such as area utilised, power consumed and maximum frequency achieved, clearly reveals that with DPR, complex designs can be implemented on limited hardware resources, leading to better performance.

5.2.2. Advantages of partial reconfiguration

Non-partial reconfiguration can be defined as an arrangement of all utilised resources on the FPGA, and it offers full-device reconfiguration. With this approach, any design errors can be resolved by refining the bitstream of the design. An FPGA with full-device configuration allows the chip to be configured with one specific design at one time and with another design at another time.

From the hardware resources point of view, the reconfiguration mechanism makes the total number of resources on the FPGA available to each design by loading the designs separately, whereas without reconfiguration both designs are loaded together. In the latter case, the total resource usage of the designs cannot exceed the number of resources available on the FPGA.
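This resource argument can be illustrated with a small sketch. The function names, the capacity and the design sizes below are hypothetical and are not taken from the implementation; they only show that resident designs are constrained by the sum of their sizes, while designs loaded one at a time are constrained by the largest one.

```python
def fits_without_reconfiguration(design_sizes, capacity):
    """All designs are resident together, so their total must fit on the device."""
    return sum(design_sizes) <= capacity


def fits_with_reconfiguration(design_sizes, capacity):
    """Designs are loaded one at a time, so only the largest one must fit."""
    return max(design_sizes) <= capacity


# Hypothetical numbers, in slices, chosen only for illustration.
CAPACITY = 1000
DESIGNS = [700, 600]

print("fits without reconfiguration:", fits_without_reconfiguration(DESIGNS, CAPACITY))
print("fits with reconfiguration:   ", fits_with_reconfiguration(DESIGNS, CAPACITY))
```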

While non-partial reconfiguration suffers from limited hardware resources, partial reconfiguration provides all the advantages offered by full-device reconfiguration, with two additional advantages concerning its execution process and the size of the generated bitstreams.

Since the unchanged portion of the FPGA is not affected and the generated bitstream is smaller than a full bitstream, a 3D real-time medical imaging application whose modules must operate continuously can be implemented on the same chip. Because the size of the bitstream is directly proportional to the number of resources being configured, partial reconfiguration directly reduces the space needed to store the configurations.

Fig. 9. Influence of transform size on area (slices). [Plot of area (slices) against transform size (N), without DPR and with DPR.]

Fig. 10. Influence of transform size on power consumption (mW). [Plot of power consumption (mW) against transform size (N), without DPR and with DPR.]

Fig. 11. Influence of transform size on maximum frequency (MHz) for 1D HWT modules. [Plot of maximum frequency (MHz) against transform size (N), without DPR and with DPR.]

Fig. 12. Comparison of maximum frequency (MHz) achieved for each transpose function. [Plot of maximum frequency (MHz) for the transpose functions img2subimg (T2), row2col (T1) and straight, without DPR and with DPR.]

Fig. 13. Comparison of chip layout for different Virtex-5 devices for N = 64. [Layouts for Virtex-5 (XC5VLX110T-3FF113) and Virtex-5 (XC5VLX50T-3FF1136), each shown (a) without DPR and (b) with DPR; labelled regions: Static, RPM 1 (1D HWT) and RPM 2 (Transpose).]

In terms of configuration time, there are four major phases in the reconfiguration process: clearing configuration memory, initialisation, bitstream loading and device startup. As complex as these phases may seem, a smaller bitstream results in a shorter configuration time, and this time saving is particularly useful for computationally intensive applications such as 3D real-time medical imaging.
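One way to examine the link between bitstream size and configuration time is to compute the loading rate implied by each row of Table 3. The sketch below does this under the assumption that the reported configuration time scales with the number of bytes loaded; the names are chosen for this example, and the rates are only approximate because the tabulated times are rounded.

```python
# Table 3 entries: (bitstream size in bytes, configuration time in ms).
FULL = (3_889_941, 4.8)
PARTIAL = {
    8: (191_024, 0.23),
    16: (287_573, 0.36),
    32: (313_101, 0.39),
    64: (507_677, 0.63),
}


def bytes_per_ms(size_bytes, time_ms):
    """Implied configuration rate for one bitstream."""
    return size_bytes / time_ms


full_rate = bytes_per_ms(*FULL)
rates = {n: bytes_per_ms(bs, ct) for n, (bs, ct) in PARTIAL.items()}

print(f"full bitstream: {full_rate:,.0f} bytes/ms")
for n, rate in rates.items():
    # A ratio close to 1 means time is roughly proportional to size.
    print(f"N = {n:2d}: {rate:,.0f} bytes/ms (ratio to full: {rate / full_rate:.3f})")
```

The implied rates of the partial bitstreams stay within a few percent of the full-bitstream rate, which is consistent with configuration time shrinking in proportion to bitstream size.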

In the first proposed architecture of 3D HWT with transpose based computation, the implementation of the 1D HWT and transpose treats the entire Virtex-5 as one module, as opposed to the second architecture with the DPR mechanism. The consequence is demonstrated by implementing the proposed architecture on a smaller Xilinx FPGA device (XC5VLX30T-3FF323) for N = 128: as shown in Table 4, the design without DPR is overmapped. By implementing DPR, complex designs can be implemented on limited hardware resources, leading to better performance.

5.2.3. 3D HWT implementation for medical imaging application

The proposed architecture of 3D HWT, implemented with transpose based computation using the DPR mechanism, is suitable for medical image compression systems. Fig. 14a–f illustrates the best quality and compression comparison between the original and reconstructed first slices of CT and MRI medical volumes, using 3D HWT in a medical image compression system with context-adaptive variable-length coding (CAVLC).
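As a software illustration of transpose based computation, the following minimal sketch applies a one-level 1D Haar step along the innermost axis of a small volume and then rotates the axes, three times in a row. It is not the VHDL design: the unnormalised averaging and differencing form of the Haar step, the nested-list data layout and all function names are assumptions made for this example, and the third rotation is included only to restore the original axis order.

```python
def haar_1d(x):
    """One-level Haar step on an even-length sequence: averages, then differences."""
    half = len(x) // 2
    avg = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    diff = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return avg + diff


def transform_rows(vol):
    """Apply haar_1d along the innermost axis of a nested-list volume."""
    return [[haar_1d(row) for row in plane] for plane in vol]


def rotate_axes(vol):
    """Cyclic transpose: the element at [i][j][k] moves to [k][i][j]."""
    n0, n1, n2 = len(vol), len(vol[0]), len(vol[0][0])
    return [[[vol[i][j][k] for j in range(n1)] for i in range(n0)]
            for k in range(n2)]


def haar_3d(vol):
    """Separable 3D Haar step: 1D transform followed by a rotation, once per axis."""
    for _ in range(3):
        vol = rotate_axes(transform_rows(vol))
    return vol


volume = [[[1, 2], [3, 4]],
          [[5, 6], [7, 8]]]
coeffs = haar_3d(volume)
print(coeffs)
```

For this 2 x 2 x 2 volume, the coefficient at index [0][0][0] is the mean of all eight samples, and the remaining coefficients hold the differences along each axis.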

6. Conclusion and future works

Two architectures for 3D HWT have been proposed in this paper based on transpose computation and partial reconfiguration. A comparative study of the non-partial and partial reconfiguration processes has shown that DPR offers many advantages and leads to a promising solution for implementing computationally intensive applications such as 3D medical image compression. Using DPR, several large systems can be mapped to small hardware resources, and the area, power and maximum frequency are improved. Ongoing research is focusing on the design and FPGA implementation of 3D HWT using other arithmetic techniques such as distributed arithmetic (DA) and systolic design.

Table 4
Device summary report of the proposed 3D HWT architecture on XC5VLX30T-3FF323.

| Parameters | Without DPR | With DPR |
|---|---|---|
| Number of slice registers | 27,886 (145%) (overmapped) | Haar static = 178 (1%); Haar partial = 14,469 (75%); Trans img2subimg = 267 (1%); Trans row2col = 235 (1%) |

Note: Trans row2col, T1. Trans img2subimg, T2.

Fig. 14. Comparison of original and reconstructed CT and MRI images for the first slices.

Acknowledgment

The authors would like to thank the British Council, the French Program Alliance Egide and the Ministry of Higher Education Malaysia for funding this research work.

References

[1] A. Mayer, H.-P. Meinzer, High performance medical image processing in client/server-environments, Comput. Meth. Prog. Biomed. 58 (1999) 207–217.

[2] D.-Z. Tian, M.-H. Ha, Applications of wavelet transform in medical imageprocessing, in: Proceedings of International Conference on Machine Learningand Cybernetics, Baoding, China, 2004, pp. 1816–1821.

[3] I.S. Uzun, Design and FPGA Implementation of Matrix Transforms for Imageand Video Processing, Ph.D. dissertation, School of Computer Science, TheQueen’s University of Belfast, 2006.

[4] S.G. Mallat, A theory for multiresolution signal decomposition: the waveletrepresentation, IEEE Trans. Pattern Anal. Machine Intell. 11 (1989) 674–693.

[5] C. Capilla, Application of the Haar wavelet transform to detect microseismicsignal arrivals, J. Appl. Geophys. 59 (2006) 36–46.

[6] A. Khashman, K. Dimililer, Image compression using neural networks and haarwavelet, WSEAS Trans. Signal Process. 4 (2008) 330–339.

[7] A. Ahmad, K.K. Loo, J. Cosmas, VLSI architecture design approaches for real-time video processing, WSEAS Trans. Circuit Syst. 7 (2008) 855–868.

[8] S. Chandrasekaran, A. Amira, S. Minghua, A. Bermak, An efficient VLSIarchitecture and FPGA implementation of the finite ridgelet transform, J.Real-Time Image Process. 3 (2008) 183–193.

[9] P. Dang, VLSI architecture for real-time image and video processing systems, J.Real Time Image Processing. 1 (2006) 57–62.

[10] J. Huang, M. Parris, J. Lee, R.F. Demara, Scalable FPGA-based architecture forDCT computation using dynamic partial reconfiguration, ACM Trans. Embed.Comput. Syst. 9 (1) (2009) 1–18.

[11] C. Kao, Benefits of partial reconfiguration, in: Xcell Journal Fourth Quarter,2005, pp. 66–67.

[12] R. Fong, S. Harper, P. Athanas, A versatile framework for FPGA field updates: Anapplication of partial self-reconfiguration, in: Proceedings of the 14th IEEEInternational Workshop on Rapid Systems Prototyping, 2003, pp. 117–123.

[13] X. Zhang, H. Rabah, S. Weber, Dynamic slowdown and partial reconfigurationto optimize energy in FPGA based auto-adaptive SoPC, in: 4th IEEEInternational Symposium on Electronic Design, Test and Applications, DELTA2008, pp. 153–157.

[14] http://www.xilinx.com.[15] M. Weeks, M. Bayoumi, 3D discrete wavelet transform architectures, in:

Proceedings of the IEEE International Symposium on Circuits and Systems(ISCAS ’98), Monterey, CA, 1998, pp. 57–60.

[16] M. Weeks, M. Bayoumi, 3-D discrete wavelet transform architectures, IEEETrans. Signal Process. 50 (2002) 2050–2063.

[17] M. Weeks, M. Bayoumi, Discrete wavelet transform: architectures, design andperformance issues, J. VLSI Signal Process. 35 (2003) 155–178.

[18] B. Das, S. Banerjee, A memory efficient 3-D DWT architecture, in: Proceedingsof the Conference IEEE International Conference on Acoustics, Speech, andSignal Processing, Florida, USA, 2002, pp. 3224–3227.

[19] M. Jiang, D. Crookes, Area-efficient high-speed 3D DWT processor architecture,Electron. Lett. 43 (2007) 502–503.

[20] M. Jiang, D. Crookes, FPGA implementation of 3D discrete wavelet transformfor real-time medical imaging, in: Proceedings of the 18th European Conf. onCircuit Theory and Design (ECCTD 2007), Seville, Spain, 2007, pp. 519–522.

[21] S. Ismail, A. Salama, M. Abu-ElYazee, FPGA implementation of an efficient 3D-WT temporal decomposition algorithm for video compression, in: Proceedingsof the Conference IEEE International Symposium on Signal Processing andInformation Technology, Cairo, Egypt, 2007, pp. 154–159.

[22] M.-M. Salem, M. Appel, F. Winkler, B. Meffert, FPGA-based smart camera for 3Dwavelet-based image segmentation, in: Proceedings of the Second ACM/IEEEInternational Conference on Distributed Smart Cameras (ICDSC 2008),California, USA, 2008, pp. 1–8.

[23] G. Zhang, M. Talley, W. Badawy, M. Weeks, M. Bayoumi, A low powerprototype for a 3-D discrete wavelet transform processor, in: Proceedings ofthe Conference IEEE International Symposium on Circuits and Systems (ISCAS’99), Florida, USA, 1999, pp. 145–148.

[24] M. Majer, J. Teich, A. Ahmadinia, C. Bobda, The Erlangen slot machine: adynamically reconfigurable FPGA-based computer, J. VLSI Signal Process. 47(2007) 15–31.

[25] C. Claus, J. Zeppenfeld, F. Muller, W. Stechele, Using partial-run-timereconfigurable hardware to accelerate video processing in driver assistancesystem, in: Proceedings of the Conference Design, Automation, Test andExhibition in Europe (DATE ’07), Nice, France, 2007, pp. 1–6.

[26] L. Braun, K. Paulsson, H. Kromer, M. Hubner, J. Becker, Data path drivenwaveform-like reconfiguration, in: Proceedings of the InternationalConference on Field Programmable Logic and Applications (FPL 2008),Heidelberg, Germany, 2008, pp. 607–610.

[27] A. Shoa, S. Shirani, Run-time reconfigurable systems for digital signalprocessing applications: A survey, J. VLSI Signal Process. 39 (2005) 213–235.

[28] P. Manet, D. Maufroid, L. Tosi, G. Gailliard, O. Mulertt, M. Di Ciano, J.-D. Legat,D. Aulagnier, C. Gamrat, R. Liberati, V. La Barba, P. Cuvelier, B. Rousseau, P.Gelineau, An evaluation of dynamic partial reconfiguration for signal andimage processing in professional electronics applications, EURASIP J. Embed.Syst. 2008 (2008) 1–11.

[29] C. Bajaj, I. Ihm, S. Park, 3D RGB image compression for interactive applications,ACM Trans. Graphics 20 (1) (2001) 10–38.

[30] D. Montgomery, F. Murtagh, A. Amira, A wavelet based 3D image compressionsystem, in: Proceedings of the Seventh International Symposium on SignalProcessing and Its Applications, vol. 1, 2003, pp. 65–68.

[31] P. Lysaght, B. Blodget, J. Mason, J. Young, B. Bridgford, Invited paper: enhancedarchitectures, design methodologies and CAD tools for dynamicreconfiguration of Xilinx FPGAs, in: International Conference on FieldProgrammable Logic and Applications, 2006. FPL ’06, 2006, pp. 1–6.

Afandi Ahmad received his B.Sc. Engineering in Electrical from Universiti Teknologi Malaysia (UTM – ITTHO) and M.Sc. in Microelectronics from Universiti Kebangsaan Malaysia (UKM) in 2002 and 2003, respectively. He worked as a lecturer at the Faculty of Electrical and Electronic Engineering, Universiti Tun Hussein Onn Malaysia (UTHM). Currently, he is a Ph.D. candidate in Electronic and Computer Engineering, School of Engineering and Design, Brunel University, United Kingdom, working on efficient reconfigurable architectures for 3D medical image compression. His research interests include 3D transforms, medical imaging and reconfigurable computing.

Benjamin Krill obtained his diploma in Computer Engineering from Furtwangen University, Germany and M.Sc. in Distributed System Engineering from Brunel University, West London. Currently, he is a Ph.D. candidate in Electronic and Computer Engineering, School of Engineering and Design, Brunel University, United Kingdom, working on dynamic partial reconfiguration for image and signal processing applications.

Abbes Amira has recently been appointed as Reader in Embedded Systems in the Nanotechnology and Integrated BioEngineering Centre (NIBEC) at the University of Ulster. From May 2006 to March 2010, he was a senior lecturer at Brunel University, West London, within the division of Electronic and Computer Engineering in the School of Engineering and Design. Before he joined Brunel University, he held a lectureship in Computer Science at Queen’s University Belfast (QUB) from November 2001. He received his Ph.D. in Computer Science from Queen’s University Belfast in 2001. He has been awarded a number of grants from government and industry and has published over 130 publications during his career to date. He has been invited to give talks and tutorials at universities and international conferences and has served as chair and program committee member for a number of conferences. He was one of the tutorial presenters at ICIP 2009, Program Chair of ECVW2010, and Program Co-Chair of DELTA 2008 and IMVIP 2005. He is also one of the 2008 VARIAN prize recipients. He is a senior member of the IEEE, senior member of ACM, member of IET and Fellow of the Higher Education Academy. His research interests include embedded systems, high performance reconfigurable computing, image processing, multiresolution analysis and medical applications.


Hassan Rabah received the MS degree in electronics and control engineering and the Ph.D. degree in electronics from Henri Poincaré University of Nancy, France in 1987 and 1993, respectively. In 1993 he joined the Electronic Instrumentation Laboratory of Nancy (LIEN) as an Associate Professor. In 1997 he joined the architecture group of LIEN, where he supervised research on VLSI implementation of parallel architecture for image processing. He also supervised research in collaboration with industrial partners in the FPGA implementation of adaptable architecture for smart sensors. He participated in several national projects for quality of service measurement in DVB-T, optimisation of transmission over ADSL of video by transcoding techniques of MPEG4/H264 AVC and using scalability of the H264 SVC standard. His research activities and interests include partial and dynamic reconfigurable architectures for adaptive systems, FPGA power optimisation, and video compression, decompression and transcoding in MPEG4-AVC and SVC. He is the author and co-author of more than 60 papers in international journals and conferences and he holds two patents.
