Optimized Implementation across Slice Fabric on FPGA

By

Aqib Perwaiz

2006-NUST-TfrPhD-ComE-06

Supervisor

Dr. Shoab Ahmed Khan

College of Electrical and Mechanical Engineering National University of Sciences and Technology, Pakistan

August 2013

Optimized Implementation across Slice Fabric on FPGA

By

Aqib Perwaiz

2006-NUST-TfrPhD-ComE-06

A thesis submitted in partial fulfillment of the requirement for the

Degree of Doctor of Philosophy

Supervisor

Dr. Shoab Ahmed Khan

Department of Computer Engineering

College of Electrical and Mechanical Engineering Pakistan

August 2013

ACKNOWLEDGEMENT

First of all, I am thankful to ALLAH Almighty for His mercy, help and guidance, without which this work would not have been possible. I would like to express my gratitude to Prof. Dr. Younus Javed, Dean ASG. He has always emphasized higher education and advanced studies in the University and tried to promote a culture of research and technological development; without his efforts and interest, this work would not have been possible.

It was a great honor for me to be supervised by Dr. Shoab A Khan. Besides his significant contribution to this work, he influenced my development as a member of the research community in this field. Ever since I started my studies for the Bachelor of Electrical Engineering, Dr. Shoab has been a role model for me.

I would like to thank the members of the Research Monitoring Committee and the

foreign experts who have guided me throughout my work and helped me in keeping my

research on the right path.

I owe my parents for every success in my life; their encouragement and support are a key factor in every achievement that I have ever made. I am also indebted to my wife for her continuous encouragement and patience during the course of my PhD. I would also like to thank my children for their patience and for the motivation that I always find in their smiles.

Special thanks to the Higher Education Commission for their financial support.

This dissertation is dedicated to my family for

their love, deep understanding, endless

patience and especially my wife for her

encouragement at all times.

SUMMARY

This thesis proposes a mathematical-modeling-based technique that optimizes the mapping of Digital Signal Processing (DSP) algorithms on FPGAs. The thesis mathematically models the problem by defining an objective function that optimizes attributes such as area, power, and timing under a set of design constraints. The constraints list the embedded blocks on FPGAs as resources. Any high-end DSP system consists of multiple sub-systems, and each sub-system has multiple architectural options to select from; multiple architectural options of a software defined radio / software defined jammer are discussed. Besides architectural design options, there are many other attributes that directly affect the mapped resources. Word length quantization plays a critical role in further optimizing the selected architectural option. The thesis models all these attributes, and the solution lists the resources required for the optimized mapping. The thesis then indexes the results to select the best FPGA from the database. The model also works on an already selected FPGA and optimizes its resources to best fit a complex design in the available hardware (HW). The thesis further discusses the effect of word length on hardware complexity. The experiments demonstrate that increasing the word length of intermediate variables does not improve performance beyond a certain point. The thesis explores the intricate relationship between intermediate variable lengths and the overall accuracy of the results, and links it with the complexity of the HW. Several design examples are listed to show the validity of the findings. As an example, the CORDIC algorithm is explored to analyze the effect of bit resolution on hardware complexity and least mean square error.

In the design space exploration, several architectural options are discussed. The options include bit serial, byte serial, folded, unfolded, and distributed arithmetic based architectures. In the discussion, novel techniques for mapping algorithms on these architectures are also presented. For example, while discussing bit serial architectures, a novel design of serial multiplication is presented. The multiplier created in the process is used in the design of subsystems. In this context, the design of a serial least mean square adaptive filter is presented. A bitwise serial CORDIC architecture used in a direct digital frequency synthesizer is also explored.

The thesis further focuses on architectural design options that best map on FPGAs. Architectures that are optimal for custom design may perform poorly once mapped on an FPGA. This observation is substantiated by design examples from compression trees. These trees are fundamental to DSP architectures due to their wide use in general purpose multiplication, multiplication by constants, and multiple operand addition and subtraction. Different compression ratios for the Wallace tree have been explored to identify the ratio that best maps on LUT-based FPGAs.

Mapping a DSP algorithm on hardware entails floating point to fixed point conversion. MATLAB® has been used to map the above-mentioned algorithms on hardware, Xilinx® tools have been used to synthesize them, and LP solve has been used to solve the mathematical model.

LIST OF ACRONYMS

DSP Digital Signal Processing

FPGA Field Programmable Gate Array

ASIC Application Specific Integrated Circuit

COMB Combined application of WLA and HLS

DCT Discrete Cosine Transform

DFG Data Flow Graph

FIR Finite Impulse Response

FU Functional Unit

HOM Homogeneous-architecture approach

HET Heterogeneous-architecture approach

HLS High-Level Synthesis

IIR Infinite Impulse Response

IOB Input / Output Block

LE Logic Element

LMS Least Mean Squares

LSB Least Significant Bit

LUT Look-Up Table

MILP Mixed Integer Linear Programming

MSB Most Significant Bit

MSE Mean Square Error

MUX Multiplexer

MWL Multiple Word-Length

RTL Register Transfer Level

SEQ Sequential application of WLA and HLS

SFG Signal Flow Graph

SNR Signal to Noise Ratio

SQNR Signal to Quantization Noise Ratio

UWL Uniform Word-Length

WLA Word-Length Allocation

Contents

Summary
List of Acronyms
List of Figures

1. Overview
1.1 Introduction
1.2 Problem statement
1.3 Structure of this thesis
1.4 References

2. An Optimal Designing Solution for Efficient Utilization and Mapping of Resources on FPGAs
2.1 Introduction
2.2 Optimization Mathematical Model
2.3 WCDMA Receiver Example
2.4 Results
2.5 Conclusion
2.6 References

3. Hardware Mapping On FPGA
3.1 Overview
3.2 Hardware resources available on FPGA
    3.2.1 Express fabric technology
    3.2.2 Routing and interconnect architecture
    3.2.3 Block Rams
    3.2.4 Clock management
    3.2.5 Dedicated MAC modules
3.3 Look up Table
3.4 Digital Signal Processor (DSP 48)
3.5 References

4. Trading Off Word Length with Optimized Area
4.1 Overview
4.2 Proposed Algorithm
    4.2.1 Format Conversion
    4.2.2 Insertion of accuracy handlers
    4.2.3 Modeling of hardware utilization for the selected word length
    4.2.4 Iteration on word length
    4.2.5 Analysis for percentage increase in area, decrease in timing and increase in accuracy
    4.2.6 Design space determination
    4.2.7 Design space exploration
    4.2.8 Selecting the appropriate word length that offers best accuracy, area and timing trade off
4.3 Design Example
    4.3.1 CORDIC Algorithm
    4.3.2 CORDIC Modeling
    4.3.3 CORDIC synthesis on XILINX
    4.3.4 Experimental Results
4.4 Conclusion
4.5 References

5. Optimizing Bit Serial Architecture
5.1 Overview
5.2 Bit Serial Multiplication
5.3 Algorithm for Bit Wise Serial Multiplication
5.4 Design Example of Bit Serial Multiplier
5.5 Architecture
5.6 Implementation and Results
5.7 The LMS FIR filter Using Bit Serial Compressor
    5.6.1 Bit Serial Adder
5.8 LMS Filter Architecture
5.9 Implementation and Results
5.10 References

6. Optimization On FPGA Slice Fabric
6.1 Overview
6.2 Optimization Techniques vs FPGA architecture
    6.2.1 Compression Trees
    6.2.2 Multiplier pipelining
    6.2.3 Optimization of Bit resolution
6.3 Design Optimization
    6.3.1 Optimization of FIR filter
    6.3.2 Optimization of IIR filter
6.4 Complex Multiplier
6.5 Experimental Results
    6.5.1 FIR filter
    6.5.2 IIR filter
6.6 Complex Multiplier Synthesis
    6.6.1 Optimization of Bit width
6.7 Conclusion
6.8 References

7. Conclusion and Future Work

List of Figures

Fig 2.1 Block layout of WCDMA receiver

Fig 2.2 Data rate of WCDMA receiver

Fig 3.1 Block diagram of Virtex-5 6-input LUT

Fig 3.2 LUT showing programmable I/O blocks

Fig 3.3 Internal architecture of Digital Signal processor DSP 48 showing the registers and carry chain

Fig 4.1 I/O systems with multiple inputs and outputs

Fig 4.2 Effects of increase in bit width, hardware complexity and its effects on LMS error in the design

Fig 4.3 Bit resolution Vs LMS error where bit width of X,Y= bit width of ф

Fig 4.4 Bit resolution Vs LMS error where bit width of X,Y< bit width of ф

Fig 4.5 Bit resolution Vs LMS error where bit width of X,Y> bit width of ф

Fig 4.6 Analysis on no. of slices, registers and IO’s with bit resolution of X,Y( varying) and ф ( fixed)

Fig 4.7 Analysis on no. of slices, registers and IO’s with bit resolution of X,Y( fixed) and ф ( varying)

Fig 5.1 Multiplication of two numbers having bit width of 8 x bits each

Fig 5.2 Serial compression of two numbers illustrated in dot notation

Fig 5.3 Compression cycles for serial multiplication shown in dot notation

Fig 5.4 Serial multiplication input to triangular compressor

Fig 5.5 Multiplication of two four x bit numbers

Fig 5.6 Bit wise dot product of first bit of A and B

Fig 5.7 Bit serial compressor based multiplication architecture showing the input X and Y, output p, cycle tracker, terms generator and triangular serial compressor

Fig 5.8 LMS FIR filter with serial i/p and o/p

Fig 5.9 Bit wise serial adder

Fig 5.10 Architecture of bit wise serial LMS filter composed of triangular compressor serial adder’s error calculator and filter weight adjuster

Fig 5.11 Bit serial CORDIC architecture

Fig 5.12 Flow chart of algorithm for the calculation of sine and cosine

Fig 5.13 Bit serial modified CORDIC architecture

Fig 5.14 Error analysis

Fig 6.1 6 input LUT’s, CLB’s and carry chain of Virtex-5 exploded view

Fig 6.2 Virtex-5 FPGA DSP 48 slice

Fig 6.3 FIR filter having sever taps

Fig 6.4 Systolic FIR filter with cut sets represented by dashed lines

Fig 6.5 Schematic of 6:3 type compression trees




Fig 6.9 IIR filter of first order

Fig 6.10 First order transformation of IIR filter

Fig 6.11 Schematic of Complex multiplier

Fig 6.12 Complex multiplier incorporating booth encoded Wallace tree reduction technique

Fig 6.13 The frequency (MHz) and number of utilized LUTs in CSD by using different compression trees for FIR filter compression

Fig 6.14 The number of utilized LUTs and frequency (MHz) in CSD by using different compression trees for FIR filter compression

Fig 6.15 Complex multiplier using different compression trees for comparison of LUTs and path delay

Fig 6.16 LUTs and clock rates for FIR filter

Overview 2013

Chapter 1

Overview ____________________________________________________________________________________

1.1 Introduction

In signal processing systems, Field Programmable Gate Arrays (FPGAs) are widely used to prototype and evaluate algorithms for timing performance and system throughput. The latest FPGAs have virtual embedded computational blocks [1], which offer higher-speed computational units. When designing a specific algorithm, the structure of the embedded blocks, the resources available on the hardware, the bit width of the inputs and the depth of pipelining play a vital role in achieving area and timing performance [2].

The latest FPGAs offer reconfigurable logic blocks custom designed for high-throughput multiply-accumulate operations, dedicated carry chain support, Block Random Access Memories (RAMs) and an internal slice cascade structure [3]. The layout of logic elements in FPGA blocks restricts the application of customary optimization techniques and creates a need for specialized techniques specific to the resources available in the FPGA. Traditional optimization techniques [4] that have proved well suited to one family of FPGAs may not exhibit the same superior performance on another; it is therefore essential to treat each family of FPGAs separately, with optimization methods specific to it, to generate an optimal hardware architecture [5]. Advanced applications require custom optimizations on a particular FPGA with the goal of maximizing performance. In short, the extent of algorithm optimization depends heavily on the target device configuration.

1.2 Problem Statement

The objective of this research is to develop novel optimal techniques for the implementation of signal processing algorithms such as Infinite Impulse Response (IIR) and Finite Impulse Response (FIR) filters, the Direct Digital Frequency Synthesizer (DDFS) and the Coordinate Rotation Digital Computer (CORDIC) algorithm on FPGA-based architectures. DSP algorithms have constraints, and different architectural options within the FPGA design space can realize the same design. Selecting a suitable option while implementing the design therefore becomes a complex problem. An algorithm has been developed which identifies the architectural option for the design considering the design constraints. The optimization is performed based on the throughput requirement and the FPGA fabric architecture. These algorithms are selected for their wide use in many DSP applications. As the optimization targets throughput and accuracy, bit and word serial architectures are also considered for lower throughput requirements. The research also explores the tradeoff between word length, accuracy and the area / timing of the design.

The thesis first presents the novel mathematical model for optimization within the design space and then lays a foundation by giving an elaborate account of the high-speed computational resources available in new-generation FPGAs. Virtex-5 is used as the platform of choice. The thesis then discusses the optimization effects of varying word length and hardware mapping on FPGA.

For lower throughput requirements, bit serial architectures are proposed. An algorithm for a serial multiplier has been developed, and multiple instances of the multiplier have been used to realize a bit serial Least Mean Squares (LMS) filter. A serial implementation of the CORDIC algorithm is also discussed to model a bit serial CORDIC which can be used as a DDFS.
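
To make the bit serial idea concrete, the following is a minimal Python sketch of a generic textbook bit serial adder. It is not the serial multiplier or the LMS filter developed in this thesis. The function names and the LSB-first list representation are assumptions chosen for illustration. One full adder and a single carry register process one bit per clock cycle, which is the area-for-throughput trade that bit serial architectures make.

```python
def to_bits(value, width):
    """Return `value` as a list of `width` bits, least significant bit first."""
    return [(value >> k) & 1 for k in range(width)]


def from_bits(bits):
    """Rebuild an unsigned integer from an LSB-first bit list."""
    return sum(bit << k for k, bit in enumerate(bits))


def bit_serial_add(a_bits, b_bits):
    """Add two equal-length LSB-first bit streams, one bit per 'clock cycle'."""
    carry = 0  # models the single carry flip-flop of a serial adder
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)              # sum bit of the full adder
        carry = (a & b) | (carry & (a ^ b))    # carry kept for the next cycle
    out.append(carry)                          # final carry becomes the MSB
    return out


if __name__ == "__main__":
    total = bit_serial_add(to_bits(13, 8), to_bits(200, 8))
    print(from_bits(total))
```

An N-bit addition takes N cycles here, whereas a parallel adder would take one cycle but need N full adders.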

Fixed-point arithmetic is used for the implementation of DSP circuits. To minimize the design cost and the least mean square error, the word length has to be selected very precisely. An algorithm for the word length optimization of CORDIC [6] has been developed. Synthesis of the algorithm shows that, with an increase in bit resolution, the hardware complexity increases linearly while the least mean square error decreases only marginally. It is therefore necessary to find an optimum point where performance and minimum hardware complexity converge.
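
The accuracy side of this trade-off can be explored with a small fixed-point model. The sketch below is a generic rotation-mode CORDIC written in Python. It is an illustration under stated assumptions and not the implementation synthesized in this work: the function names, the choice of one iteration per fractional bit, and the set of test angles are all hypothetical. It only models numerical error; it says nothing about slices or registers.

```python
import math


def cordic_sin_cos(theta, frac_bits):
    """Rotation-mode CORDIC on integers with `frac_bits` fractional bits.

    `theta` is in radians and should lie within [-pi/2, pi/2].
    Returns (sin, cos) converted back to floats.
    """
    iterations = frac_bits  # assumption: one iteration per fractional bit
    scale = 1 << frac_bits
    # Arctangent lookup table quantized to the chosen word length.
    atan_table = [round(math.atan(2.0 ** -i) * scale) for i in range(iterations)]
    # Fold the CORDIC gain into the starting vector so no final multiply is needed.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = round(scale / gain), 0, round(theta * scale)
    for i in range(iterations):
        if z >= 0:
            x, y, z = x - (y >> i), y + (x >> i), z - atan_table[i]
        else:
            x, y, z = x + (y >> i), y - (x >> i), z + atan_table[i]
    return y / scale, x / scale


def mean_square_error(frac_bits, samples=65):
    """Mean square error of sin and cos over evenly spaced angles in [-pi/2, pi/2]."""
    total = 0.0
    for k in range(samples):
        theta = -math.pi / 2 + math.pi * k / (samples - 1)
        s, c = cordic_sin_cos(theta, frac_bits)
        total += (s - math.sin(theta)) ** 2 + (c - math.cos(theta)) ** 2
    return total / (2 * samples)


if __name__ == "__main__":
    for bits in (8, 12, 16, 20):
        print(bits, mean_square_error(bits))
```

Sweeping `frac_bits` in such a model gives the error curve; the matching area and timing figures would have to come from synthesis.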

Correct selection of the target device is vital for the desired optimization. Programmable devices are an attractive choice for system designers, as their reconfigurable capabilities make FPGAs [7] a suitable prototyping platform. FPGAs have been used for the analysis of the algorithms developed during this work, as the latest advancements in FPGAs offer new possibilities for implementing high performance DSP algorithms. Optimal usage of the resources available in the FPGA gives insight into the mapping of different compression trees. For Virtex-5, it has been concluded through experimentation that a compression ratio of 6 to 3 yields reduced-area, high-speed implementations of multiplication using addition.
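
As an illustration of what a 6 to 3 compression step computes, the sketch below models a generic 6:3 counter in Python. It is a behavioral model only, with a hypothetical function name, and is not the compression tree synthesized in this thesis. Six bits of equal weight are reduced to a 3-bit binary count. Each output bit is a Boolean function of the same six inputs, which is one plausible reason (stated here as an assumption, not a synthesis result) why this ratio suits a device built from 6-input LUTs.

```python
from itertools import product


def compress_6_to_3(bits):
    """Reduce six equal-weight bits to a 3-bit count, returned LSB first."""
    if len(bits) != 6 or any(b not in (0, 1) for b in bits):
        raise ValueError("expected exactly six bits, each 0 or 1")
    count = sum(bits)  # between 0 and 6, so three output bits are enough
    return (count & 1, (count >> 1) & 1, (count >> 2) & 1)


if __name__ == "__main__":
    # Check that the weighted outputs always equal the number of ones in the column.
    for bits in product((0, 1), repeat=6):
        b0, b1, b2 = compress_6_to_3(bits)
        assert b0 + 2 * b1 + 4 * b2 == sum(bits)
    print(compress_6_to_3((1, 1, 0, 1, 0, 1)))
```

In a compression tree the three outputs are passed to the columns of weight 1, 2 and 4 relative to the input column, and the step is repeated until two rows remain for a final adder.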

1.3 Structure of this Thesis

The work consists of seven chapters, including this introductory chapter. The second chapter covers the optimization of DSP algorithms considering multiple architectural options and the selection of the appropriate FPGA device to meet the design constraints. The third chapter covers hardware mapping on FPGA; the different resources available within an FPGA, including computation blocks and multiplier units, are discussed. Chapter 4 describes trading off word length against optimized area; the CORDIC algorithm is implemented to analyze the results and determine the effects of varying word length on least mean square error. Chapter 5 covers the bit serial design of the CORDIC algorithm and a bit serial multiplier; multiple instances of the same multiplier are used to realize a bit serial LMS filter. Chapter 6 covers optimized implementation on the slice fabric of the FPGA; different compression trees are analyzed with respect to a specific FPGA family, and word length optimization techniques for FIR and IIR digital filters and complex multipliers are also discussed. Chapter 7 concludes the research and highlights future work.

1.4 References

[1] C.H. Ho, P.H.W. Leong and W. Luk, "Virtual Embedded Blocks: A Methodology for Evaluating Embedded Elements in FPGAs," 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), 0-7695-2661-6/06.

[2] L.W. Couch II, Modern Communication Systems, Prentice Hall, 1994.

[3] L.K. Tan, et al., "An 800-MHz quadrature digital synthesizer," IEEE JSSC, vol. 30, no. 12, pp. 1463-1473, 1995.

[4] R. El-Ashry, M. Rehan, Hassan El Kamchouchi and F. Gebali, "Performance-optimized FPGA implementation for the flexible triangle search block-based motion estimation algorithm," 24th Canadian Conference on Electrical and Computer Engineering (CCECE), May 2011.

[5] J.E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, pp. 330-334, 1959.

[6] Er. Manoj Arora, Er. R S Chauhan, Er. Lalit Bagg, "FPGA Prototyping of Hardware Implementation of CORDIC Algorithm," International Journal of Scientific & Engineering Research, Volume 3, Issue 1, January 2012, ISSN 2229-5518.

[7] Steve Kilts, Advanced FPGA Design: Architecture, Implementation, and Optimization, chapter 1.

An Optimal Designing Solution for Efficient Utilization and Mapping of Resources on FPGAs 2013

Chapter 2

An Optimal Designing Solution for Efficient Utilization and

Mapping of Resources on FPGAs

____________________________________________________________________________________

2.1 Introduction

This chapter presents a novel model for optimizing device resources based on the design constraints. The model also identifies the target device to be used based on the optimization constraints. For a DSP algorithm, multiple optimization options are available under a set of constraints to give the best solution in terms of throughput, area, timing and power consumption [1]. High-end FPGAs have millions of embedded as well as distributed resources. Complex applications can be mapped by adopting multiple design options for each architectural option, e.g. folding, unfolding and parallel design options, where selection is mainly based on the throughput of the design. The same design can be implemented using different sets of resources to achieve the set criteria. There are multiple mapping options where an algorithm or component can be mapped such that it uses different types of resources within the same device [2]. These options pose an intricate optimization problem to any designer of a complex digital system. This chapter presents a mathematical model in which multiple design options can be worked out based on constraints to select the best available optimization within a multi-variable FPGA [3]-[9] design space. The algorithm is used to select the best option when multiple solutions are available for the given hardware resource constraints. The chapter considers the design and implementation of a high data rate WCDMA receiver for optimization using the proposed technique.

2.2 Optimization Mathematical model

The optimization problem is first modeled as an integer programming problem. To demonstrate the working of the model, the design of a WCDMA receiver is considered. The receiver consists of several blocks, and for each block multiple architectural design options are available based on the throughput requirements. The designer has to explore the tradeoffs in this multi-variable design space to get the optimal solution that best fits on a selected FPGA and optimizes its resources while meeting the throughput constraints. The complexity of the problem grows exponentially for complex designs and thus requires a tool to make the selection for the designer. The proposed technique develops an integer programming model for the problem. To demonstrate the effectiveness of the technique, the model is applied to a WCDMA receiver to optimize the target clock, number of MACs, number of adders, number of LUTs and number of registers while meeting the throughput constraint.

The modeling starts by defining decision variables. Let $x_{ij}$ be a decision variable in the optimization problem, where $j$ is the component and $i$ is the architectural option available for that component. A few of the possible architectural options for each design component are described in Table 2.1.

Table 2.1. Architectural options and their description

Ser   Option number   Description
a.    0               Embedded resources
b.    1               Distributed resources
c.    2               Bit serial / word serial architecture
d.    3               Folded architecture
e.    4               Unfolded architecture

The optimization problem is solved for a set of design constraints. These constraints relate to the resources on the FPGA and the throughput requirements on each component. The constraints are as follows:

Area Constraints

This set of constraints relates to the resources on the FPGA. The designer can budget these resources for each part of the design and use the budgeted number as a constraint, or can let the optimization model solve for a globally optimal solution for the complete design, with the solution constrained by the available resources.

The Adders Constraint

The adder constraint relates to the adders on the FPGA for a component $j$ having architectural option $i$. The designer can fix the number of adders in a design to be implemented on a specific target device. If $a_{ij}$ represents the adders for a component $j$ having architectural option $i$, and $A$ represents the total number of adders available on the FPGA, then the adder constraint is defined as follows:

$$\sum_{j}\sum_{i} a_{ij}\, x_{ij} \le A \qquad (1)$$

The Multiplier Constraint

The multiplier constraint relates to the MACs on the FPGA for a component $j$ having architectural option $i$. The designer can plan the number of MACs in a design to be implemented on a specific target device. If $m_{ij}$ represents the MACs for a component $j$ having architectural option $i$, and $M$ represents the total number of MACs available on the FPGA, then the MAC constraint is written as follows:

$$\sum_{j}\sum_{i} m_{ij}\, x_{ij} \le M \qquad (2)$$

The Register Constraint

The register constraint relates to the registers on the FPGA for a component $j$ having architectural option $i$. The designer can plan the number of registers in a design to be implemented on a specific target device. If $r_{ij}$ represents the registers for a component $j$ having architectural option $i$, and $R$ represents the total number of registers available on the FPGA, then the register constraint is defined as follows:

$$\sum_{j}\sum_{i} r_{ij}\, x_{ij} \le R \qquad (3)$$

The Look Up Table(LUT) Constraint

The LUT constraint relates to the LUTs on the FPGA for a component $j$ having architectural option $i$. The designer can plan the number of LUTs in a design to be implemented on a specific target device. If $l_{ij}$ represents the LUTs in the design for a component $j$ having architectural option $i$, and $L$ represents the total number of LUTs available on the FPGA, then the LUT constraint is defined as follows:

$$\sum_{j}\sum_{i} l_{ij}\, x_{ij} \le L \qquad (4)$$

Memory Constraints (SRAM Block constraint)

This constraint is optional and optimizes the use of RAM blocks, which is directly related to the power consumption of the FPGA. If $\mathit{ram}_{ij}$ represents the RAM blocks for a component $j$ having architectural option $i$, and $\mathit{RAM}$ represents the total number of RAM blocks available on the FPGA, then the RAM constraint is defined as follows:

$$\sum_{j}\sum_{i} \mathit{ram}_{ij}\, x_{ij} \le \mathit{RAM} \qquad (5)$$

Power Constraint

This is an optional constraint and can be placed on designs with low power objectives. If $p_{ij}$ represents the desired power level in the design for a component $j$ having architectural option $i$, and $P$ represents the total power the FPGA can handle, then the power constraint is defined as follows:

$$\sum_{j}\sum_{i} p_{ij}\, x_{ij} \le P \qquad (6)$$

As discussed earlier, the decision variable $x_{ij}$ must satisfy equation (7), so that exactly one architectural option is selected for each of the $N$ components in the design:

$$\sum_{i} x_{ij} = 1 \quad \forall\, j = 1, 2, 3, 4, \ldots, N \qquad (7)$$

For the design, the following expression is to be minimized:

$$\sum_{j}\sum_{i} \left( \alpha_l\, l_{ij}\, x_{ij} + \alpha_m\, m_{ij}\, x_{ij} + \alpha_r\, r_{ij}\, x_{ij} + \alpha_{ram}\, \mathit{ram}_{ij}\, x_{ij} + \alpha_a\, a_{ij}\, x_{ij} + \alpha_p\, p_{ij}\, x_{ij} \right) \qquad (8)$$

where $\alpha_l$ is the weight of the LUT term $l_{ij}$ in the design, $\alpha_m$ is the weight of the MAC term $m_{ij}$, $\alpha_r$ is the weight of the register term $r_{ij}$, $\alpha_{ram}$ is the weight of the memory term $\mathit{ram}_{ij}$, $\alpha_a$ is the weight of the adder term $a_{ij}$ and $\alpha_p$ is the weight of the power term $p_{ij}$ in the design. For any problem, these equations are solved using the tool LPsolve, which determines the architectural option to be selected under the specific constraints. These constraints map on the resources of the FPGA and therefore determine the target device for implementation. For better understanding, this mathematical model has been applied to a WCDMA receiver.
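
The structure of the model can be sketched in a few lines of Python. This is a minimal sketch, assuming a tiny two-component design with made-up resource figures; it uses exhaustive enumeration instead of LPsolve, and every name and number in it is hypothetical. Picking exactly one option per component enforces the selection constraint of equation (7), the capacity check corresponds to the resource constraints (1) to (6), and the weighted sum corresponds to the objective (8).

```python
from itertools import product

RESOURCES = ("lut", "mac", "reg", "ram", "adder", "power")


def usage(**amounts):
    """Resource usage of one component under one option; unspecified resources are 0."""
    return {res: amounts.get(res, 0) for res in RESOURCES}


def select_options(costs, capacity, weights):
    """Exhaustively search for the cheapest feasible option per component.

    costs[j][i] is the usage of component j under option i.
    Returns (objective, choice), or None when no assignment fits the device.
    """
    best = None
    # One option index per component, so exactly one option is chosen for each.
    for choice in product(*(range(len(options)) for options in costs)):
        totals = {
            res: sum(costs[j][i][res] for j, i in enumerate(choice))
            for res in RESOURCES
        }
        if any(totals[res] > capacity[res] for res in RESOURCES):
            continue  # violates at least one resource constraint
        objective = sum(weights[res] * totals[res] for res in RESOURCES)
        if best is None or objective < best[0]:
            best = (objective, choice)
    return best


# Hypothetical design: option 0 uses an embedded MAC, option 1 uses distributed logic.
COSTS = [
    [usage(mac=1, power=1), usage(lut=40, reg=20, adder=4, power=2)],
    [usage(mac=1, power=1), usage(lut=30, reg=10, adder=3, power=2)],
]
CAPACITY = usage(lut=100, mac=1, reg=100, ram=10, adder=10, power=10)
WEIGHTS = {res: 1.0 / CAPACITY[res] for res in RESOURCES}

if __name__ == "__main__":
    print(select_options(COSTS, CAPACITY, WEIGHTS))
```

Enumeration grows exponentially with the number of components, which is why a dedicated integer programming solver is the practical choice for a design with twelve blocks.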

2.3 WCDMA Receiver Example

The purpose of a digital receiver is to recover the baseband signal without synchronous demodulation; it includes the signal processing immediately after the Analog Front End (AFE), from detecting the start of burst to recovering the actual stream of information intended for communication. The proposed mathematical model is applied to a WCDMA receiver for a Software Defined Radio under development at the Center for Advanced Research in Engineering. The component layout and interconnection of the WCDMA receiver are shown in Figure 2.1. For each of the components / blocks there are multiple design options based on the throughput, timing and other design constraints. The typical data rate is high, and to achieve it every block / component has to operate at a high clock rate, which relates to pipelining and the number of registers; the data rate of each component is shown in Figure 2.2. The more registers there are, the more power is consumed, which makes the problem complex to solve. If the exact weightage of each subcomponent is known at design time, then the architecture to implement is known exactly and the FPGA that supports the complete design is also identified.


[Figure 2.1: block diagram of the WCDMA receiver. Blocks as listed: Signal down sampling; Correlation of received signal with spreading sequence; Start of burst detection and timing compensation; Data despreading; Coarse frequency estimation; Channel estimation and compensation; Coarse frequency compensation; Fine frequency estimation and compensation; Channel equalization and phase adjustment; Training sequence removal; Forward error correction; Symbol demapping]

Figure 2.1. Block layout of WCDMA receiver

Since there are multiple modules, the throughput of each module differs depending on the data stream it handles. The following formula was used for the chip rate calculation:

$$\text{Chip rate} = \frac{\text{Throughput} \times \text{Spreading gain} \times (\text{Data length} + \text{Training length})}{\text{Bits per symbol} \times \text{Forward error correction} \times \text{Data length}}$$

The constraints in this design are as per Table 2.3 below.

Table 2.3. Design parameters for WCDMA Receiver

Serial   Parameter                  Target design
1.       Training Length            32
2.       Spreading Factor           16
3.       Data Length                288
4.       Modulation Index           4
5.       Modulation Schemes         QPSK
6.       Target throughput          512 kbps
7.       Forward Error Correction   ½
8.       Chip Rate                  9.102 Mcps
9.       Up sampling Factor         4

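With the above values, the chip rate is

$$\text{Chip rate} = \frac{512000 \times 16 \times (288 + 32)}{2 \times 0.5 \times 288} = 9.102\ \text{Mcps}$$

Since the up sampling factor is 4, the actual bandwidth becomes

$$\text{Bandwidth} = \text{Chip rate} \times \text{Up sampling factor} = 9.102 \times 4 = 36.408\ \text{Mcps}$$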
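
The arithmetic above can be reproduced with a short Python sketch. The function and parameter names are hypothetical; the values are those of Table 2.3, with 2 bits per symbol for QPSK as used in the calculation. The last line divides the chip rate by the spreading factor, which is an inference made here to show where the 568.89 Ksps rate of the later blocks would come from; the source does not state this step explicitly.

```python
def chip_rate(throughput_bps, spreading_gain, data_length,
              training_length, bits_per_symbol, fec_rate):
    """Chip rate in chips per second, following the formula used in this chapter."""
    numerator = throughput_bps * spreading_gain * (data_length + training_length)
    denominator = bits_per_symbol * fec_rate * data_length
    return numerator / denominator


# Values from Table 2.3; QPSK carries 2 bits per symbol.
rate = chip_rate(
    throughput_bps=512_000,
    spreading_gain=16,
    data_length=288,
    training_length=32,
    bits_per_symbol=2,
    fec_rate=0.5,
)
bandwidth = rate * 4      # up sampling factor of 4
symbol_rate = rate / 16   # assumption: one symbol per 16 chips after despreading

if __name__ == "__main__":
    print(f"chip rate   : {rate / 1e6:.4f} Mcps")
    print(f"bandwidth   : {bandwidth / 1e6:.4f} Mcps")
    print(f"symbol rate : {symbol_rate / 1e3:.2f} Ksps")
```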


[Figure 2.2: WCDMA receiver block diagram annotated with data rates: 36.408 Mcps, 9.102 Mcps (chip rate), 568.89 Ksps, 512 Ksps, 256 Ksps and 512 Kbps (demodulated bit stream)]

Figure 2.2. Data rates for WCDMA receiver

There are a total of 12 sub-modules in the WCDMA receiver, and for each module there are several options as per Table 2.4.

Table 2.4. Architectural options for WCDMA receiver

Serial   Module                                                    Throughput     Options available
1.       Signal down sampling                                      36.408 Mcps    4
2.       Correlation of received signal with spreading sequence    9.102 Mcps     4
3.       Start of burst detection and timing sequence              568.89 Ksps    3
4.       Data de-spreading                                         568.89 Ksps    4
5.       Coarse frequency estimation                               568.89 Ksps    4
6.       Channel estimation and compensation                       568.89 Ksps    4
7.       Coarse frequency compensation                             568.89 Ksps    4
8.       Fine frequency estimation and compensation                568.89 Ksps    5
9.       Channel equalization and phase adjustment                 568.89 Ksps    5
10.      Training sequence removal                                 568.89 Ksps    5
11.      Forward error correction                                  256 Ksps       4
12.      Signal de-mapping                                         512 Kbps       4

The whole design was expressed in terms of the equations defined above and the constraint values were defined. The LP solve tool solved the equations and provided the best architectural option for each sub-component / block in the design. The initial values of the design constraints are as per Table 2.5.

Table 2.5. Initial values of constraints

Serial   Constraint        Value (Option 1)   Value (Option 2)
a.       $\alpha_l$        0.0004             0.00028
b.       $\alpha_m$        0.0416             0.0357
c.       $\alpha_r$        0.0003             0.00025
d.       $\alpha_a$        0.00019            0.0001530
e.       $\alpha_p$        0.01               0.008
f.       $\alpha_{ram}$    0.083              0.002
g.       $A$               5200               6500
h.       $M$               24                 28
i.       $R$               3000               4000
j.       $L$               2500               3500
k.       $RAM$             120                500
l.       $P$               100                125

Against these constraints, a grid of 12x5 (12 components and 5 options each) was initialized at max, and the options selected by the LP solve solution are highlighted. As the constraints change, the selected options and ultimately the target device also change; the details are as per Table 2.6.

Table 2.6. Grid illustration of the selected architectural option for the WCDMA receiver. The green dots and interconnect represent the option 1 constraint values, and the blue dots and interconnect represent the option 2 values of the design constraints.

[Grid of 12 rows (components 1 to 12) by 5 columns: Embedded resources, Distributed resources, Bit / word serial, Folded, Unfolded]

The selected architecture for each sub-component / block is as per Table 2.7.

Table 2.7. Selected implementation option for each module of WCDMA receiver

Serial   Sub component / block                                     Throughput     Selected option            Selected option
                                                                                  (Option 1 constraint)      (Option 2 constraint)
1.       Signal down sampling                                      36.408 Mcps    0 (Embedded)               0 (Embedded)
2.       Correlation of received signal with spreading sequence    9.102 Mcps     1 (Distributed)            3 (Folded)
3.       Start of burst detection and timing sequence              568.89 Ksps    1 (Distributed)            3 (Folded)
4.       Data de-spreading                                         568.89 Ksps    3 (Folded)                 1 (Distributed)
5.       Coarse frequency estimation                               568.89 Ksps    3 (Folded)                 1 (Distributed)
6.       Channel estimation and compensation                       568.89 Ksps    0 (Embedded)               1 (Distributed)
7.       Coarse frequency compensation                             568.89 Ksps    0 (Embedded)               1 (Distributed)
8.       Fine frequency estimation and compensation                568.89 Ksps    1 (Distributed)            4 (Unfolded)
9.       Channel equalization and phase adjustment                 568.89 Ksps    0 (Embedded)               3 (Folded)
10.      Training sequence removal                                 568.89 Ksps    0 (Embedded)               3 (Folded)
11.      Forward error correction                                  256 Ksps       0 (Embedded)               3 (Folded)
12.      Signal de-mapping                                         512 Kbps       0 (Embedded)               3 (Folded)

As the constraint values change, the selected architectural options change as well. Because each architectural option draws on specific FPGA resources, the constraint values directly affect the choice of target device.
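The dependence of the selected options on the constraint values can be illustrated with a minimal Python sketch. It is not the LP solve model used in this work: it replaces the solver with an exhaustive search, considers only two resources, and every cost figure and block name in `OPTIONS` is a hypothetical placeholder rather than a value taken from the design.

```python
from itertools import product

# Hypothetical (slices, multipliers) cost of each architectural option per block.
# Option keys follow the numbering of Table 2.7 (0 = embedded, 1 = distributed,
# 3 = folded); the cost numbers are placeholders, not measured values.
OPTIONS = {
    "down_sampling": {0: (120, 4), 1: (400, 0), 3: (150, 1)},
    "correlation":   {0: (200, 8), 1: (900, 0), 3: (300, 2)},
    "de_spreading":  {0: (80, 2),  1: (250, 0), 3: (100, 1)},
}


def best_assignment(options, max_slices, max_mults):
    """Return (total_slices, {block: option}) with the fewest slices that
    respects both budgets, or None when no combination fits."""
    names = list(options)
    best = None
    for choice in product(*(options[name].keys() for name in names)):
        slices = sum(options[n][c][0] for n, c in zip(names, choice))
        mults = sum(options[n][c][1] for n, c in zip(names, choice))
        if slices <= max_slices and mults <= max_mults:
            if best is None or slices < best[0]:
                best = (slices, dict(zip(names, choice)))
    return best


# A generous multiplier budget favours the embedded option everywhere;
# tightening it pushes every block to the folded option.
relaxed = best_assignment(OPTIONS, max_slices=1000, max_mults=14)
tight = best_assignment(OPTIONS, max_slices=1000, max_mults=4)
```

With these placeholder costs, changing only the multiplier budget changes the selected option of all three blocks, which is the behaviour described above for the full 12 x 5 grid.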

2.4 Results

The WCDMA receiver for software defined radio was implemented using the mathematical model to select the architectural option for each component/block of the design. Two values were given for each constraint and the model was solved using LP solve. The results show that, for the two sets of constraint values, two FPGAs meet the requirements: the first is the Spartan 3A device xc3sd3400a-4cs484 [12] and the second is the Virtex-5 device XC5VSX240T. The specifications of these devices are given in Table 2.8.

Table 2.8. Resources available on Spartan 3A and Virtex-5 FPGA

Serial  Resources      Spartan 3A  Virtex-5
1.      Slices         23872       37440
2.      4-input LUTs   47774       149760
3.      Flip flops     47774       149760
4.      DSP Blocks     126         1056
5.      Block RAMs     126         516

2.5 Conclusion

This mathematical model presents a novel technique that helps an algorithm designer map an algorithm onto the different available architectural options; by adjusting the weightages of the different resources, the best-fit target FPGA is also identified. The complex example of a WCDMA receiver has been discussed: with the given throughput requirement at each stage, the design maps onto the Spartan 3A FPGA under the option 1 constraints and onto the Virtex-5 FPGA under the option 2 constraints. The implementation results and the LP solve solution confirm the novelty of the algorithm. Any system can be optimally designed to fit in the FPGA design space through fine adjustment of the constraints, and by carefully adjusting the constraints low power solutions are realizable. Another application of this model could be modern software defined jammers, which have almost the same complex components with the addition of a few for the spectrum search.

2.6 References

1. Vinoo Sumeri and Ranga Venuri, "Throughput optimization with design space exploration during partitioning of multi FPGA Architectures", Laboratory for Digital Design Environment.

2. Alastair M. Smith, George A. Constantinides, and Peter Y. K. Cheung, "FPGA Architecture Optimization using Geometric Programming".

3. Ognjen Šcekic, "FPGA comparative analysis", pages 2 – 140.

4. J. Lamoureux and S. J. E. Wilton, "On the Interaction between Power-Aware FPGA CAD Algorithms," IEEE International Conference on Computer-Aided Design, Nov. 2003.

5. M. French, L. Wang, T. Anderson, M. Wirthlin, “Integrated Tool Suite for Post

Synthesis FPGA Power Consumption Analysis,” Military and Aerospace

Programmable Logic Devices (MAPLD) International Conference, Washington,

D.C., September 2005.

6. B. Hutchings, P. Bellows, J. Hawkins, S. Hemmert, B. Nelson, “A CAD Suite for

High Performance FPGA Design,” Field Customizable Computing Machines,

1999.

7. L. Shang, A. Kaviani, and K. Bathala, “Dynamic Power Consumption in

Virtex-II FPGA Family,” FPGA ’02, Monterey, California, February, 2002.

8. M. French, L. Wang, T. Anderson, and M. Wirthlin, “Post Synthesis-Level

Power Estimation for FPGAs,” IEEE Symposium on Field-Programmable

Custom Computing Machines, April 2005.

9. L. Wang, M. French, A. Davoodi, D. Agarwal, "FPGA Dynamic Power Minimization Through Placement and Routing Constraints."

10. A public domain version of LP_Solve is maintained by the Open Source Community at the URL: http://sourceforge.net/projects/lpsolve

11. LP_Solve Mixed Integer Linear Programming (MILP) solver was originally developed by Michel Berkelaar (mailto:[email protected]) in ANSI C as non-public domain software, available via anonymous FTP at ftp://ftp.es.ele.tue.nl/pub

12. http://www.xilinx.com/support/documentation/data_sheets/ds610.pdf


Chapter 3

Hardware Mapping on FPGA

______________________________________________________________________

3.1 Overview

Latest generation FPGAs [1], [2] offer high integration densities and a large number of dedicated processing and storage resources running at high clock speeds. These features make them an attractive choice for mapping complex DSP algorithms to achieve the desired performance.

For an algorithm designer the main focus is on optimizing performance parameters such as area, power and timing delay [3], [4]. If the algorithm bit width is correctly mapped onto the bit handling capacity of the resources available on the target device, the complete design area can be taken as the sum of the individual component areas. In the same way, the overall power consumption can be computed as the sum of the switching power of all the input signals and the mean power consumption of each functional unit (FU) [5], [6], [7], [8], [9], [10]. To optimize cost in terms of resource usage on an FPGA architecture with intrinsic features, those features must be included and properly modeled within the optimization process [11], [12], [13], [14].

Latest FPGAs have built-in specialized blocks. When the algorithm is correctly mapped onto the DSP block of the FPGA, in terms of the internal pipelining of the multiplier compressor, the result is a reduction in design cost [15], [16], [17], [18] and design time. The designer therefore needs a thorough understanding of the resources available on the target device in order to map the algorithm for optimum performance.

3.2 Hardware resources available on FPGA

The hardware resources available on the FPGA play a vital role when mapping the algorithm onto the target device. For all practical purposes the Xilinx Virtex-5 FPGA is considered here, and a few of the resources available on it are discussed below:-

3.2.1 ExpressFabric technology

ExpressFabric technology is based on a 6-input LUT architecture and routing. The combination of carry chains/dedicated multiplexers, Look-Up Tables (LUTs) and Flip-Flops (FFs) determines the efficiency and performance of implementing arithmetic and logic functions. The Virtex-5 family has a fully independent (not shared) 6-input LUT as shown in Figure 3.1.

Figure 3.1 Block Diagram of a Virtex-5 6-Input LUT

The LUT input architecture is the determining factor in minimizing the critical path delay, which ultimately represents the performance of the logic fabric. To minimize the critical path, the algorithm has to be mapped exactly onto the 6-input LUTs; otherwise the wider-input LUTs are used inefficiently and the die size, which determines the area, also increases.

3.2.2 Routing and interconnect Architecture

Interconnect timing delays, which can account for more than 50% of the critical path delay, are minimized in the Virtex-5 FPGA by changing the interconnect pattern. The diagonally symmetric interconnect pattern enhances performance through an improved places-versus-hops ratio and an improved connections-versus-hops ratio. This design helps in finding the optimal routes.

3.2.3 Block RAMs

RAM blocks are used for on-chip data storage. The block RAM base size in the Virtex-5 family is double that of the Virtex-4 family, which allows deeper pipelining, larger memory arrays and the use of a full RAM as two half RAMs. When operated in Simple Dual Port mode, the Virtex-5 block RAM effectively doubles the block RAM bandwidth. The enhanced block RAM is therefore well suited to maximizing performance and managing power.

3.2.4 Clock management

These blocks are used to synthesize the various clock signals. Being dedicated, they boost internal performance while increasing the board system frequency.

3.2.5 Dedicated MAC modules

The Virtex-5 family introduced the DSP48E slice, a new DSP slice with an enhanced multiplier width (25 x 18), an independent C register and logic unit functionality. The family also offers a dedicated hardware central processing unit in the form of a hard PowerPC core. If the bit width of the algorithm is exactly mapped onto this DSP slice, the desired area and timing performance can be achieved.
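The effect of operand bit width on DSP usage can be sketched in Python as below. This is a rough estimate only: it assumes plain tiling of the operands onto the 25 x 18 multiplier ports and ignores the sign-bit handling that a real signed decomposition needs; the function name `dsp48e_tiles` is a hypothetical helper, not part of any vendor tool.

```python
import math


def dsp48e_tiles(a_bits, b_bits, port_a=25, port_b=18):
    """Rough number of 25 x 18 multipliers needed for an a_bits x b_bits product.

    Assumption: operands are simply cut into port-sized chunks; sign handling
    and the adders that combine partial results are not modelled.
    """
    # Try both operand orientations and keep the cheaper one.
    straight = math.ceil(a_bits / port_a) * math.ceil(b_bits / port_b)
    swapped = math.ceil(a_bits / port_b) * math.ceil(b_bits / port_a)
    return min(straight, swapped)


# An operand that is one bit too wide doubles the multiplier count.
exact_fit = dsp48e_tiles(25, 18)
one_bit_over = dsp48e_tiles(26, 18)
```

Under this simplification a 25 x 18 product fits one multiplier, while widening one operand by a single bit already needs two, which is why exact mapping of the bit width matters.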

3.3 Look Up Table (LUT)

A LUT in an FPGA is an array of interconnected programmable logic blocks (transistors). These programmable logic blocks are programmed to switch on or off, which connects the wires; a large number of these blocks can be wired together in this way. Input/output from the FPGA is via special I/O pads which contain sequential logic circuitry.

The Virtex-5 architecture has a real 6-input LUT with dual-LUT capability. There are a total of 64 bits of logic programming space and 6 independent inputs, so any function of 6 inputs, as well as numerous combinations of one or two smaller functions, can easily be implemented.
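The statement that 64 bits of programming space suffice for any function of 6 inputs can be illustrated with a small Python model. It is a behavioural sketch only; the helper names `make_lut6` and `lut6_read` are illustrative and do not correspond to any Xilinx primitive or tool.

```python
def make_lut6(func):
    """Pack a 6-input Boolean function into a 64-bit integer truth table."""
    table = 0
    for index in range(64):
        bits = [(index >> k) & 1 for k in range(6)]  # bits[0] is input 0
        if func(*bits):
            table |= 1 << index
    return table


def lut6_read(table, inputs):
    """Return the stored output bit for six input bits (input 0 first)."""
    index = sum(bit << k for k, bit in enumerate(inputs))
    return (table >> index) & 1


# Example: a 6-input parity (XOR) function stored in one LUT.
parity = make_lut6(lambda a, b, c, d, e, f: a ^ b ^ c ^ d ^ e ^ f)
```

The six inputs simply form an address into the 64 stored bits, so the delay of the LUT does not depend on which function is programmed.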


Figure 3.2 Look up table showing programmable I/O blocks

The 6-input LUT also includes associated carry logic, MUXs and a flip-flop, as shown in Figure 3.2.

3.4 Digital signal processor (DSP 48E)

DSP48E is the digital signal processing slice in the Virtex-5 FPGA. By using several slices together, efficient digital filters can be realized. If design styles as shown in Figure 3.3 are incorporated, substantial savings can result [20][Xilinx.com].

Figure 3.3 Internal Architecture of digital signal processor (DSP 48) showing the Registers and carry

chain

To achieve the required performance and power characteristics, pipelining of DSP algorithms is often required. The DSP48E slice has three pipelining stages, and when it is used as a multiplier with all the stages utilized, performance is guaranteed. Enabling the MREG shown in Figure 3.3 results in a saving of almost 15% for the overall slice.

The resources discussed above are part of almost every recent FPGA. When realizing a DSP algorithm, the designer has to map the algorithm onto the resources available on the FPGA in order to save resources and keep the algorithm efficient.

3.5 References

[1] Altera Corp. www.altera.com.

[2] Xilinx Inc. http://www.xilinx.com.

[3] A. Nayak, M. Haldar, A. Choudhary, and P. Banerjee. Accurate Area and Delay Estimators for FPGAs. In Proc. Design, Automation and Test in Europe Conference and Exhibition, 2002.

[4] C. Brandolese, W. Fornaciari, and F. Salice. An Area Estimation Methodology for FPGA Based Designs at SystemC-Level. In Proc. Design Automation Conference, 2004, pages 129–132, 2004.

[5] S. Bilavarn, G. Gogniat, and J.L. Philippe. Area Time Power Estimation for FPGA

Based Designs at a Behavioral Level. In Proc. Int. Conf. on Electronics, Circuits

and Systems, volume 1, pages 524–527, 2000.

[6] J.A. Clarke, A.A. Gaffar, and G.A. Constantinides. Parameterized Logic Power

Consumption Models for FPGA-based Arithmetic. In Proc. Int. Conf. on Field

Programmable Logic and Applications, pages 626 – 629, 2005.

[7] J.A. Clarke, A.A. Gaffar, and G.A. Constantinides. Fast Word-Level Power Models

for Synthesis of FPGA-Based Arithmetic. In Proc. IEEE Int. Symp. on Circuits and

Systems, pages 1299–1302, 2006.

[8] R. Jevtic, C. Carreras, and G. Caffarena. High-level Switching Activity Models for

Multipliers in FPGAs. In Proc. ACM/SIGDA Int. Symp. on Field Programmable

Gate Arrays, pages 224–225. ACM Press, 2007.

[9] R. Jevtic, C. Carreras, and G. Caffarena. Switching Activity Models for Power Estimation in FPGA Multipliers. In Proc. Int. Workshop on Applied Reconfigurable Computing, pages 201–213, 2007.

[10] C.S. Bouganis, G.A. Constantinides, and P.Y.K. Cheung. A Novel 2D Filter

Design Methodology for Heterogeneous Devices. In Proc. IEEE Symposium on

Field-Programmable Custom Computing Machines, 2005.

[11] G. Caffarena, J. A. López, C. Carreras, and O. Nieto-Taladriz. High-Level Synthesis of Multiple Word-Length DSP Algorithms using Heterogeneous-Resource FPGAs. In Proc. Field Programmable Logic and Applications, pages 675–678, 2006.

[12] G. Caffarena, J. A. López, C. Carreras, and O. Nieto-Taladriz. Optimized

Implementation of DSP Cores on FPGAs Using Logic-based and Embedded

Resources. In Symp. on System-on-Chip, pages 103–106, 2006.

[13] D. Chen and J. Cong. Register Binding and Port Assignment for Multiplexer Optimization. In Proc. IEEE Asilomar Conf. on Signals, Systems and Computers, volume 1, pages 68–73, 1994.

[14] P. Metzgen and D. Nancekievill. Multiplexer Restructuring for FPGA

Implementation Cost Reduction. In Proc. Design Automation Conference, pages

421 – 426, 2005.

[15] H.A. Atat and I. Ouaiss. Register Binding for FPGAs with Embedded Memory. In

Proc. IEEE Symp. on Field-Programmable Custom Computing Machines, pages

165–175, 2004.

[16] G.W. Morris, G.A. Constantinides, and P.Y.K. Cheung. Using DSP Blocks for

ROM Replacement: A Novel Synthesis Flow . In Proc. Int. Conf. Field

Programmable Logic and Applications, pages 77–82, 2005.

[17] S.J.E. Wilton. Implementing Logic in FPGA Memory Arrays: Heterogeneous

Memory Architectures. In Proc. IEEE Int. Conf. on Field-Programmable

Technology, 2002.

[18] X. Liang, J.S. Vetter, M.C. Smith, and A.S. Bland. Balancing FPGA Resource

Utilities. In Proc. Int. Conf. on Eng. of Reconf. Systems and Algorithms, pages

156–162, 2005.



Chapter 4

Trading off word length with optimized area

______________________________________________________________________

4.1 Overview

In a digital signal processing system, the number of bits processed at a time is a major source of resource wastage. The word lengths of the variables are selected to meet the output error tolerance of the application. The designer's aim is to determine the word length at which the cost and the output distortion meet criteria that depend on the application under consideration.

Figure 4.1 (a) I/O system with multiple inputs and outputs (b) Optimal word length: cost vs distortion tradeoff

Consider an I/O system comprising M inputs, N outputs, an internal variable S and a desired quantization Q, as shown in Figure 4.1(a). For the quantization error T to stay within given limits, the requirement is to determine the sizes of the different variables and states that give the desired quantization error with minimum hardware (word length, registers, multipliers etc.).

For an algorithm-specific quantization error, the widths of the input variables constrain the size of Q. This relationship can be determined empirically, which helps in setting the optimal bit width of the different variables for achieving the desired Q. The example of the CORDIC algorithm discussed below analyzes the effect on Q of varying the bit width of the input variables. To achieve the optimal word length, fixed-point arithmetic is used for the implementation. The tradeoff is shown in Figure 4.1(b): the cost and distortion curves show that a longer word length may improve application performance but at an increased hardware cost, whereas a shorter word length reduces the hardware cost but may increase the quantization errors and overflows [1] [2]. The aim is to find the optimal point at which the performance of the application is maximized with minimum hardware cost and minimum quantization error. The outcome of this research is an algorithm for word length optimization, the details of which are discussed below.

4.2 Proposed Algorithm

The algorithm for word-length optimization is an eight-step process. The details are as under:-

4.2.1 Format conversion

This is the start in which the format conversion is carried out. The input in floating

point is converted to fixed point for implementation on the hardware.

4.2.2 Insertion of accuracy handlers

Since the format conversion introduces quantization noise, accuracy handlers are inserted to log the quantization error. The log is used to find an optimal point in the design space that minimizes area while maintaining the required quantization performance.
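The format conversion and the accuracy handler can be sketched together in Python as below. This is a minimal illustration under stated assumptions: a signed fixed-point format with round-to-nearest and saturation is assumed, and the names `to_fixed`, `to_float` and `AccuracyHandler` are hypothetical, not taken from the tool flow used in this work.

```python
def to_fixed(value, frac_bits, word_bits):
    """Convert a float to a signed fixed-point integer, saturating on overflow."""
    scaled = round(value * (1 << frac_bits))
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(fixed, frac_bits):
    """Convert a fixed-point integer back to a float."""
    return fixed / (1 << frac_bits)


class AccuracyHandler:
    """Quantizes values and logs the quantization error of each one."""

    def __init__(self, frac_bits, word_bits):
        self.frac_bits = frac_bits
        self.word_bits = word_bits
        self.errors = []

    def quantize(self, value):
        fixed = to_fixed(value, self.frac_bits, self.word_bits)
        quantized = to_float(fixed, self.frac_bits)
        self.errors.append(value - quantized)
        return quantized

    def mean_square_error(self):
        if not self.errors:
            return 0.0
        return sum(e * e for e in self.errors) / len(self.errors)
```

A handler would be attached to each signal of interest, for example `AccuracyHandler(frac_bits=8, word_bits=10).quantize(0.3)`, and its logged error compared against the allowed output error.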

4.2.3 Modeling of HW utilization for the selected word length

The system is analyzed for different word lengths based on the resources available on a particular architecture. Area, timing and word length are then selected based on the quantization error constraint on the output. This selection varies from application to application.

4.2.4 Iterating on word length

Iterations are carried out to explore the design space. These iterations usually require changing the inputs bit by bit. The outputs are analyzed to check whether the desired level of performance has been achieved.

4.2.5 Analysis of percentage increase in area, decrease in timing and percentage increase in accuracy

Analysis is carried out for different sets of word lengths, indicating the change in timing performance, the increase in area and the reduction in quantization error.

4.2.6 Design Space Determination

The exhaustive search is reduced by finding the relationship between the word lengths of the different input signals. There is usually a strong relationship among the input signals that governs the quantization performance of the output signals. These relationships can be extracted from the mathematical dependencies between the inputs and outputs or, for highly complex algorithms, determined empirically by running the algorithm for different word lengths.

4.2.7 Design space exploration

In exploring the design space, an optimum point is sought beyond which any increase in the word length has the least effect on the quantization error, and which offers the best tradeoff for area and resource usage.

4.2.8 Selecting the appropriate word length that offers best accuracy, area and timing tradeoff

The optimum point analyzed in the last step is the selected word length.
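The iteration over word lengths can be sketched in Python as below. It is a simplified stand-in for the procedure above: only the fractional word length of a single signal is iterated, the error measure is the mean square quantization error, hardware cost is assumed to grow with word length so the first acceptable length is taken as the optimum, and the sine test signal in `SAMPLES` is an arbitrary choice.

```python
import math


def quantize(value, frac_bits):
    """Round a value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def mse_for(frac_bits, samples):
    """Mean square quantization error of the samples at this word length."""
    return sum((x - quantize(x, frac_bits)) ** 2 for x in samples) / len(samples)


def smallest_word_length(samples, max_mse, candidates=range(2, 25)):
    """Return the first (smallest) candidate whose error meets the limit."""
    for frac_bits in candidates:
        if mse_for(frac_bits, samples) <= max_mse:
            return frac_bits
    return None


# Test signal: one period of a sine wave (an arbitrary choice for the sketch).
SAMPLES = [math.sin(2 * math.pi * k / 64) for k in range(64)]
chosen = smallest_word_length(SAMPLES, max_mse=1e-6)
```

In a full flow the acceptance test would also include the area and timing figures obtained for each candidate word length.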

4.3 Design Example

To illustrate the proposed algorithm, the CORDIC algorithm has been implemented.

4.3.1 CORDIC Algorithm

The CORDIC (Coordinate Rotation Digital Computer) algorithm is used for the generation of digital sine and cosine [3] [6]; the transformation is achieved by iterating the equations recursively. The accuracy of the algorithm is, however, proportional to the bit width of the angle d. A vector (a_1, b_1) is transformed into a new vector (a_2, b_2), which can be written in equation form as:-

\[ a_2 = a_1 \cos(\Phi) - b_1 \sin(\Phi) \qquad (1) \]

\[ b_2 = a_1 \sin(\Phi) + b_1 \cos(\Phi) \qquad (2) \]

Where

\[ \Phi_2 = \Phi_1 + d \qquad (3) \]
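A rotation-mode CORDIC can be sketched in Python as below and checked against the direct rotation of a vector by an angle. The sketch assumes the usual counter-clockwise rotation convention and uses floating point for readability, whereas the implementation discussed in this chapter is fixed point; the function name `cordic_rotate` and the default of 16 iterations are illustrative choices.

```python
import math


def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) by `angle` radians using shift-and-add micro-rotations.

    Valid for |angle| up to about 1.74 rad, the sum of the micro-rotation
    angles. Floats are used here; hardware would use shifts on integers.
    """
    # Each micro-rotation stretches the vector; compensate at the end.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # remaining angle still to be rotated
    for i in range(iterations):
        direction = 1.0 if z >= 0 else -1.0
        x, y = (x - direction * y * 2.0 ** -i,
                y + direction * x * 2.0 ** -i)
        z -= direction * math.atan(2.0 ** -i)
    return x / gain, y / gain


# Rotating (1, 0) by 30 degrees yields approximately (cos 30, sin 30).
cos30, sin30 = cordic_rotate(1.0, 0.0, math.pi / 6)
```

Each additional iteration roughly halves the residual angle, which is why the achievable accuracy is tied to the bit width of the angle.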

4.3.2 CORDIC Modeling

MATLAB has been used for the design, modeling and simulation of the CORDIC algorithm. The built-in quantization functionality of MATLAB [7] - [9] has been used to map the floating-point arithmetic of the algorithm to fixed point.

Figure 4.3, Figure 4.4 and Figure 4.5 show the LMS error vis-à-vis the bit width resolution. The analysis follows a trend in which the least mean square error is minimized (almost approaching zero) when the relation in the equation below holds.

Bit width of input A, B > Bit width of angle Φ (4)

Beyond a certain point, any further increase in bit width has minimal effect on the LMS error but a drastic effect on the hardware complexity; the same conclusion is illustrated in Figure 4.2.

Figure 4.2 Effects of increase in bit width, hardware complexity and its effects on LMS error in the design

space

This shows that, for the CORDIC algorithm, the condition below guarantees minimum hardware utilization with minimum quantization error.

Min {bit width (X, Y) > (bit width (Φ) minus 2)} = Min least mean square error

Figure 4.3 Bit resolution vs LMS error where bit width of X, Y = bit width of Φ

Figure 4.4 Bit resolution vs LMS error where bit width of X, Y < bit width of Φ

Figure 4.5 Bit resolution vs LMS error where bit width of X, Y > bit width of Φ

Having analyzed the effect of an increase in bit width on the least mean square error within the design space, the same algorithm has also been explored to analyze the effect of bit width resolution on the hardware complexity. The CORDIC algorithm has been implemented in MODELSIM, followed by synthesis on Xilinx.

4.3.3 CORDIC Synthesis on XILINX

MODELSIM was used for the implementation of the CORDIC algorithm and Xilinx software was used for its synthesis. The 1's complement values of angles ranging from 0 to π were used as the system input, and different numbers of iterations were realized to reduce the LMS error.

Table 4.1a and Table 4.1b show the results of synthesis. In Table 4.1a the bit width of X and Y was varied from 10 bits to 30 bits with the bit resolution of angle Φ kept fixed at 9 bits, whereas in Table 4.1b the bit width of X and Y was kept fixed and the bit resolution of angle Φ was varied from 9 bits to 16 bits.

Table 4.1a Synthesis results with bit resolution of X, Y varied and Φ fixed (selected device: v50fg256-6)

Serial  Device Utilization  Bit Resolution
                            10,9  11,9  12,9  13,9  14,9  15,9  16,9  17,9
1       No. of slices       67    77    85    93    98    107   108   124
2       No. of registers    43    48    53    44    50    57    53    60
3       No. of IO's         33    35    37    39    41    43    45    49


Table 4.1b Synthesis results with bit resolution of X, Y fixed and Φ varied

Serial  Device Utilization  Bit Resolution
                            20,9  20,10  20,11  20,12  20,13  20,14  20,15  20,16
1       No. of slices       163   164    164    165    166    167    167    168
2       No. of registers    74    75     76     77     78     79     80     81
3       No. of IO's         61    62     63     64     65     66     67     68

The variation in resource utilization for different bit width selections of X, Y and Φ is shown in Figure 4.6 and Figure 4.7 respectively.

Figure 4.6 Analysis of the number of slices, registers and IO's with bit resolution of X, Y (varying) and Φ (fixed). X axis: bit resolution of X, Y, Φ.

Figure 4.7 Analysis of the number of slices, registers and IO's with bit resolution of X, Y (fixed) and Φ (varying)

4.3.4 Experimental Results

The experimental results show that resource utilization increases with the bit width resolution of X, Y and Φ, while the LMS error decreases when the bit resolution of X, Y is greater than the bit resolution of Φ. For the CORDIC algorithm the ideal bit width is 11 bits for X and Y and 9 bits for Φ. The resource utilization at this bit width selection is tabulated in Table 4.2.

Table 4.2 Device Utilization of CORDIC

Serial  Resource Utilization  Bit width resolution (X, Y, Φ) = 11,11,9
1       No. of slices         77
2       Slice flip flops      48
3       IO's                  35

4.4 Conclusion

Fixed point arithmetic is used for mapping most FPGA designs because of the high complexity and cost of floating point hardware. For all practical purposes, the bit resolution of the input variables should be greater than the bit resolution of the angle when CORDIC is used as a Direct Digital Frequency Synthesizer (DDFS).

4.5 References

[1] L.W. Couch II, Modern Communication Systems, Prentice Hall, 1994.

[2] L.K. Tan, et al. "An 800-MHz quadrature digital synthesizer," IEEE JSSC, vol. 30, no. 12, pp. 1463-1473, 1995.

[3] J.E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, pp. 330-334, 1959.

[4] V.F. Kroupa, "Spectral Properties of DDFS: Computer Simulations and Experimental Verifications," IEEE International Frequency Control Symposium, pp. 613-23, 1994.

[5] M.J. Flanagan, G.A. Zimmerman, "Spur-reduced digital sinusoid synthesis," IEEE Trans. Comm., vol. 43, no. 7, pp. 2254-2262, 1995.

[6] C.M. Rader, "VLSI systolic arrays for adaptive nulling," IEEE Signal Processing

Magazine, 1996.

[7] Mathworks Corp, MATLAB Technical Computing Environment, www.Mathworks.com, Jan. 2003.

[8] L. Presti, G. Cardamone, "A direct digital frequency synthesizer using an IIR

filter implemented with a DSP microprocessor," IEEE ICASSP-94, vol. 3, 1994

[9] E. Grayver, B. Daneshrad, "Reconfigurable Signal Processing ASIC

Architecture for High Speed Data Communications," ISCAS 98, June 1998


Chapter 5

Optimizing Bit Serial Architecture

______________________________________________________________________

5.1 Overview

Bit serial architectures are an attractive choice for applications where data I/O is on a serial interface. Many high speed serial interfaces, such as the Telecom serial interface port (TSIP) and the DSP serial peripheral interface, are in everyday use. In these applications it is tempting to use the serial clock to execute the design, which requires innovative designs that can work on a bit by bit basis. This section presents two designs of considerable complexity to demonstrate the feasibility of mapping algorithms onto serial architectures: an adaptive filter application and the CORDIC algorithm. As the multiplier and the adder are basic components in most signal processing applications, their architectures are discussed first. These architectures are then used in the complex examples to show the effect of efficient component design on the overall application.

Pin count, floor space and wire length requirements are reduced in bit-serial arithmetic VLSI designs. However, performing bit-serial arithmetic poses challenging design and implementation problems. Research in bit-serial arithmetic using conventional binary representations has focused on the design of multipliers and squarers [15] - [18].

5.2 Bit Serial Multiplication

Bit serial multiplication can be performed either by the serial-serial multiplication

technique or by serial-parallel multiplication technique. We have used the serial-serial

multiplication technique to realize a triangular compressor which performs efficient bit

serial multiplication.

A review of the background research on serial multiplication reveals that significant work has been done in the past. R. F. Lyon [1] discussed a very efficient two's complement pipelined serial multiplier; its drawback was that it was heavy on resources. H. J. Sips [2] and N. R. Strader and V. T. Rhyne [3] focused on the multiplication of unsigned numbers and designed a modular full precision bit serial multiplier. R. Gnanasekaran [4] developed a complex multiplication scheme which automatically caters for the negative weight of the most significant bit of the operands in the two's complement representation. Rhyne and Strader [5] presented a Booth recoded multiplication scheme in which n identical cells produce a 2n-bit product, but this design resulted in unnecessary complexity [6]. A few serial/parallel implementations were also presented by Gnanasekaran [7]. Denyer and Renshaw used the modified Booth's algorithm [8] and designed an NMOS serial multiplier which utilized multiplier cells [9]. Kanopoulos presented a bit serial 3 x 3 matrix/vector multiplier [10]. After reviewing the serial multipliers that have been designed and implemented, and keeping in view our requirement of handling video streaming, which involves bit serial multiplication, a bit serial multiplication algorithm was realized based on the serial-serial multiplication technique. Figure 5.1 illustrates the multiplication of two numbers [12], each with a bit width of eight bits. The multiplication generates eight partial products (PP_0 - PP_7), as shown below:-

A7 A6 A5 A4 A3 A2 A1 A0

B7 B6 B5 B4 B3 B2 B1 B0

A7B0 A6B0 A5B0 A4B0 A3B0 A2B0 A1B0 A0B0 PP_0








P14 P13 P12 P11 P10 P9 P8 P7 P6 P5 P4 P3 P2 P1 P0

Figure 5.1 Multiplication of two numbers having a bit width of 8 bits each

Figure 5.1 illustrates the bit serial multiplication. As the first bits of A and B arrive for the multiplication, a dot product takes place and results in A0B0, which is P0 (the LSB of the final product P), along with a carry out. As the cycles continue, the serial multiplication proceeds with the arrival of each successive bit of A and B, as illustrated in the figure above.

In each cycle the number of terms increases, following the trend 2n+1, where n is the cycle number. The layout of Figure 5.1 is triangular, which is why the technique is termed the triangular compression technique. Figure 5.2 shows the multiplication in dot notation.
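The 2n+1 trend can be checked with a short Python sketch that lists the partial-product terms first available in each cycle. It assumes that bit n of A and bit n of B arrive together in cycle n, LSB first; the helper names `new_terms` and `all_terms` are illustrative.

```python
def new_terms(cycle):
    """Terms (i, j), meaning A_i AND B_j, that first become available in `cycle`."""
    with_new_a = [(cycle, j) for j in range(cycle + 1)]  # A_cycle meets B_0..B_cycle
    with_new_b = [(i, cycle) for i in range(cycle)]      # A_0..A_(cycle-1) meet B_cycle
    return with_new_a + with_new_b


def all_terms(width):
    """Every term generated while the `width` input bits arrive."""
    return [term for cycle in range(width) for term in new_terms(cycle)]


# Number of new terms in cycles 0..7 of an 8-bit multiplication.
counts = [len(new_terms(n)) for n in range(8)]
```

Over the eight input cycles the counts 1, 3, 5, ... add up to the 64 terms of an 8 x 8 multiplication, with each term generated exactly once.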

Figure 5.2 Serial Compression of two numbers illustrated in dot notation

The detailed working and the partial product generation in each cycle are shown in Figure 5.3. The complexity of this algorithm is O(n).

Figure 5.3 Compression cycles for serial multiplication shown in dot notation

5.3 Algorithm for Bit wise Serial Multiplication

An algorithm for bit wise serial multiplication is discussed here; the algorithm can be mapped onto any bit wise serial multiplication architecture.

The description of the algorithm is as under:-

Algorithm

INPUT: A, B

OUTPUT: X

INITIALIZE: A_i and B_i = 0 for i > W-1 (where W is the width of the input)

carry_{i,j} and sum_{i,j} = 0 for all i, j

Generation of Terms

Begin

for i=0 to W-1

begin

for j=0 to W-1

begin

, 1, 1 , ,& 2i i i j i i j i j i jA B carry sum carry sum ;

end

p_i = sum_{i,0}

end

for i = W to 2W-1

p_i = sum_{W-1, i-W+1}

End

Triangular compression

begin

for i=0 to W-1

begin

for j=0 to W-1

begin

{carry[i+1], product[i-1]} = cycle[i][j] + cycle[i+1][j] + product[i];

end

end

end

Figure 5.4 Serial Multiplication Input to Triangular Compressor (partial-product terms A_i B_j for i, j = 0 to 7, arranged over product columns 14 down to 0)

5.4 Design Example of Bit Serial Multiplier

For a better understanding of the algorithm, an example of the proposed multiplier is discussed. Let A and B be two four-bit serial inputs such that

A = 0101

B = 1111

      0101   A
      1111   B
   _______
      0101   pp_1
     0101    pp_2
    0101     pp_3
   0101      pp_4
   _______
   1001011   X

Figure 5.5 Multiplication of two four-bit numbers

The step by step multiplication as per the algorithm discussed above is as under:-

1. The dot product of the first bit of A and the first bit of B results in the first bit of X and a carry out, as illustrated in Figure 5.6.

2. The dot product of the second bit of A and the second bit of B results in the second bit of X and no carry forward, as indicated in Figure 5.6.

3. The dot product of the third bit of A and the third bit of B results in the third bit of X and a carry forward, as shown in Figure 5.6.

Figure 5.6 (a) Bit wise dot product of first bit of A and B (b) Bit wise dot product of second bit of A and B (c) Bit wise dot product of third bit of A and B (d) Bit wise dot product of fourth bit of A and B

The partial products of the fourth cycle, together with the carry of the third cycle, yield the final product X, as shown in Figure 5.6.
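The cycle-by-cycle behaviour of the example can be illustrated with a short sketch. It is a column-wise reading of serial multiplication written for this example only, not the exact triangular compressor architecture: in each cycle it sums the partial-product terms of one product column, adds the carry from the previous cycle, emits one product bit and keeps the rest as carry. The function names `serial_multiply` and `bits_to_int` are hypothetical.

```python
def serial_multiply(a, b, width):
    """Return the 2*width product bits of a*b, least significant bit first."""
    a_bits = [(a >> i) & 1 for i in range(width)]
    b_bits = [(b >> i) & 1 for i in range(width)]
    carry = 0
    product_bits = []
    for k in range(2 * width):
        # Partial-product terms a_i & b_j that fall in column k (i + j == k).
        column = sum(a_bits[i] & b_bits[k - i]
                     for i in range(width) if 0 <= k - i < width)
        total = column + carry
        product_bits.append(total & 1)  # bit emitted in this cycle
        carry = total >> 1              # carried into the next cycle
    return product_bits


def bits_to_int(bits):
    """Convert an LSB-first bit list back to an integer."""
    return sum(bit << i for i, bit in enumerate(bits))


bits = serial_multiply(0b0101, 0b1111, 4)
print(bits_to_int(bits))  # 75, i.e. 1001011 in binary
```

For A = 0101 and B = 1111 the sketch reproduces the product X = 1001011 of Figure 5.5.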

5.5 Architecture

The architecture of the bit-wise serial multiplier is shown in Figure 5.7. The bit-wise serial output is available in the cycle immediately after the input is received serially. Since each serial input is eight bits wide, the final product is sixteen bits wide, of which the first eight bits contain most of the information. To keep the output limited to eight bits, the last eight bits of the output are truncated. This results in a loss of precision, but for applications such as video streaming the loss is tolerable, and the saving in hardware resources vis-a-vis the precision compromised is large.

Figure 5.7 Bit Serial Compressor Based Multiplication Architecture showing the input X and Y, output P,

cycle tracker, terms generator and triangular serial compressor
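The effect of the truncation can be estimated with a small sketch. It assumes unsigned eight-bit operands and assumes that the eight retained bits are the most significant bits of the sixteen-bit product; both are assumptions made for illustration, and the function names are hypothetical.

```python
def truncate_product(x, y, width=8):
    """Keep the upper `width` bits of the 2*width-bit unsigned product."""
    full = x * y            # full 2*width-bit result
    return full >> width    # discard the lower `width` bits


def truncation_error(x, y, width=8):
    """Difference between the full product and the rescaled truncated product."""
    full = x * y
    kept = truncate_product(x, y, width) << width
    return full - kept      # always in the range 0 .. 2**width - 1


# Exhaustive search over all 8-bit operand pairs for the largest error.
worst = max(truncation_error(x, y) for x in range(256) for y in range(256))
print(worst)  # 255
```

Under these assumptions the error never exceeds 255, which is below 0.4% of the sixteen-bit full scale.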

5.6 Implementation and Results

The efficiency of the proposed bit-serial multiplier is compared with that of a conventional bit-serial multiplier [11]. Both designs were implemented on FPGA, and the implementation results are shown in Table 5.1. The results show about 38% saving in the number of look-up tables, 30% saving in the number of flip-flops and a 25% increase in the operating frequency.



Table 5.1 Implementation Results

Design                                    Look-Up Tables (number)     Flip-Flops (number)         Clock Frequency (MHz)
                                          Xilinx      Altera          Xilinx      Altera          Xilinx      Altera
                                          Virtex-5    Stratix-III     Virtex-5    Stratix-III     Virtex-5    Stratix-III
Conventional bit-wise serial multiplier   13          11              12          12              454         656
Proposed bit-wise serial multiplier       9           8               8           8               565         840

5.7 The LMS FIR filter Using Bit Serial Compressor

In the least mean square (LMS) FIR filter, a weighted linear sum of the present and past K samples of the input signal is used to find the filter output at any instant of time. Mathematically, it can be represented as

$$x(i) = \mathbf{v}^T(i)\,\mathbf{y}(i) = \sum_{j=0}^{K-1} v_j(i)\,y(i-j)$$ (1)

where

$$\mathbf{y}(i) = [y(i), y(i-1), y(i-2), \ldots, y(i-K+1)]^T$$

$$\mathbf{v}(i) = [v_0(i), v_1(i), v_2(i), \ldots, v_{K-1}(i)]^T$$

Equation 2 and Equation 3 update the weights of the algorithm:

$$\mathbf{v}(i+1) = \mathbf{v}(i) + \mu\,e(i)\,\mathbf{y}(i)$$ (2)

$$e(i) = d(i) - x(i)$$ (3)

Here d(i) is the signal used as the reference and \mu is the step size.

The filter computation and the adaptation require O(D) operations for a D-tap filter [13], namely 2D additions and 2D+1 multiplications.
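The three LMS equations can be exercised with a minimal floating-point sketch. It is not the bit-serial hardware of this chapter: it only follows Equations (1) to (3) in software to identify a known 3-tap system. The step size `mu`, the random input, the seed and the function name `lms_identify` are assumptions chosen for illustration.

```python
import random


def lms_identify(true_weights, mu=0.1, steps=2000, seed=0):
    """Adapt the weights v so that the filter output x(i) tracks the reference d(i)."""
    rng = random.Random(seed)
    taps = len(true_weights)
    v = [0.0] * taps        # adaptive weight vector v(i)
    y = [0.0] * taps        # input window [y(i), y(i-1), ...]
    for _ in range(steps):
        y = [rng.uniform(-1.0, 1.0)] + y[:-1]             # shift in a new sample
        d = sum(w * s for w, s in zip(true_weights, y))   # reference signal d(i)
        x = sum(w * s for w, s in zip(v, y))              # filter output, Equation (1)
        e = d - x                                         # error, Equation (3)
        v = [w + mu * e * s for w, s in zip(v, y)]        # weight update, Equation (2)
    return v


weights = lms_identify([0.5, -0.25, 0.125])
print([round(w, 4) for w in weights])
```

Each iteration uses one multiply-accumulate pass for the output and one for the update, which is where the 2D additions and 2D+1 multiplications per sample come from.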

Figure 5.8 LMS FIR Filter with serial i/p and o/p

Figure 5.8 shows a 3-tap LMS FIR filter in which both inputs are bit-wise serial. Three multipliers form part of the filter; they are realized as multiple instances of the proposed triangular compressor based serial multiplier. For the addition, the bit-wise serial adder discussed below has been implemented.

5.7.1 Bit Serial Adder

Figure 5.9 shows a bit-wise serial adder, which performs the operations given in Equation 1 and Equation 2.

$$\text{Sum} = A \oplus B \oplus \text{Carry\_in}$$ (1)

$$\text{Carry\_out} = AB + (A \oplus B)\,\text{Carry\_in}$$ (2)



Figure 5.9 Bit wise serial adder
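A software model of such an adder is sketched below, assuming both operands arrive least significant bit first and that a single stored carry plays the role of the carry flip-flop. The helper names `serial_add`, `to_bits` and `from_bits` are hypothetical.

```python
def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams, one bit pair per cycle."""
    carry = 0                                   # state of the carry flip-flop
    sum_bits = []
    for a, b in zip(a_bits, b_bits):
        sum_bits.append(a ^ b ^ carry)          # sum bit of this cycle
        carry = (a & b) | ((a ^ b) & carry)     # carry used in the next cycle
    sum_bits.append(carry)                      # flush the final carry
    return sum_bits


def to_bits(value, width):
    """LSB-first bit list of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Integer value of an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


print(from_bits(serial_add(to_bits(5, 4), to_bits(15, 4))))  # 20
```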

5.8 LMS Filter Architecture

Figure 5.10 shows the architecture of the above filter as implemented. The architecture comprises the proposed bit-wise serial multipliers, bit-wise serial adders, registers, an error calculator and a filter weight adjuster, which adjusts the weights based on the difference between the final filter output and the reference signal.

Figure 5.10 Architecture of bit-wise serial LMS filter composed of triangular compressors, serial adders, error calculator and filter weight adjuster



5.9 Implementation and Results

Two versions of the LMS adaptive filter, one utilizing the bit-serial triangular compressor based multiplier and the other utilizing the conventional bit-serial multiplier, were implemented on FPGA. The results of both filter versions are tabulated in Table 5.2.

Table 5.2 Implementation Results

Design                         Look-Up Tables (number)     Flip-Flops (number)         Clock Frequency (MHz)
                               Xilinx      Altera          Xilinx      Altera          Xilinx      Altera
                               Virtex-5    Stratix-III     Virtex-5    Stratix-III     Virtex-5    Stratix-III
Conventional adaptive filter   39          33              36          36              454         565
Proposed adaptive filter       27          24              8           24              656         840

The results show 38% saving in the number of Look Up Tables, 30% saving in the

number of Flip Flops and 25 % increase in the clock frequency.

5.10 References

[1] R. F. Lyon, “Two’s complement pipeline multipliers,” IEEE Trans. Communication, vol. COM-24, no. 4, pp. 418-425, Apr. 1976.

[2] H. J. Sips, “Comments on ‘An O(n) parallel multiplier with bit sequential input and output,’” IEEE Trans. Computer, vol. C-31, no. 4, pp. 325-327, Apr. 1982.

[3] N. R. Strader and V. T. Rhyne, “A canonical bit-sequential multiplier,” IEEE Trans. Computer, vol. C-31, no. 8, pp. 791-795, Aug. 1982.

[4] R. Gnanasekaran, “On a bit-serial input and bit-serial output multiplier,” IEEE Trans. Computer, vol. C-32, no. 9, pp. 878-880, Sept. 1983.

[5] T. Rhyne and N. R. Strader, II, “A signed bit-sequential multiplier,” IEEE Trans. Computer, vol. C-35, no. 10, pp. 896-901, Oct. 1986.

[6] L. Dadda, “On serial-input multipliers for two’s complement numbers,” IEEE Trans. Computer, vol. 38, no. 9, pp. 1341-1345, Sept. 1989.

[7] “A fast serial-parallel binary multiplier,” IEEE Trans. Computer, vol. C-34, no. 8, pp. 741-744, 1985.

[8] P. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach, Addison-Wesley, 1985.

[9] J. Newkirk and R. Mathews, The VLSI Designer’s Library, Addison-Wesley, 1983.

[10] N. Kanopoulos, “A bit-serial architecture for digital signal processing,” IEEE Trans. Circuits Sys., vol. CAS-32, no. 3, pp. 289-291, 1985.

[11] C. W. Ng, N. Wong and T. S. Ng, “Efficient FPGA implementation of bit stream multipliers,” Electronics Letters, online no. 20070293, Department of Electrical and Electronic Engineering, The University of Hong Kong, 26 April 2007.

[12] Woon-Seng Gan and Sen M. Kuo, “Teaching DSP Software Development: From Design to Fixed-Point Implementations,” IEEE Transactions on Education, vol. 49, no. 1, February 2006.

[13] Daniel Allred, Venkatesh Krishnan, Walter Huang, and David Anderson, “Implementation of an LMS Adaptive Filter on an FPGA Employing Multiplexed Multiplier Architecture,” Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta, GA 30332-0250.

[14] C. W. Ng, N. Wong and T. S. Ng, “Efficient FPGA implementation of bit stream multipliers,” Electronics Letters, online no. 20070293, Department of Electrical and Electronic Engineering, The University of Hong Kong, 26 April 2007.

[15] L. Dadda, “On Serial-Input Multipliers for Two’s Complement Numbers,” IEEE Transactions on Computers, vol. 38, no. 9, pp. 1341-1345, Sep. 1989.

[16] P. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach, Addison-Wesley, 1985.

[17] M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer, Boston, 1994.

[18] N. R. Strader and V. T. Rhyne, “A Canonical Bit-Sequential Multiplier,” IEEE Transactions on Computers, vol. C-31, no. 8.

[19] R. Andraka, “Building a high performance bit serial processor in an FPGA,” On-Chip System Design Conference, North Kingstown, 1996.

[20] http://comparch.doc.ic.ac.uk/publications/files/osk00jvlsisp.ps


Chapter 6

Optimization on FPGA Slice Fabric

6.1 Overview

The FPGA is an essential part of almost every present-day communication system involving software-defined signal processing applications. The reason is that design algorithms are tested for their performance in terms of accuracy, timing, complexity, area and power consumption after being mapped on the FPGA. All these parameters are related to the bit width that is processed at a time, which in turn depends on the architecture of the FPGA, i.e. the available resources. A change of even a single bit may degrade performance considerably; therefore, taking the built-in structure of the target device into account at the design stage leads to optimal performance [1].

This work extends the application of the methods described in [2] [3] [4], which resulted in a reduced critical path. By introducing multiple pipelining along with other techniques, optimization has been achieved in the design of different digital filters.

Compression trees play a vital role in the overall optimization. They come in different configurations and can optimize the algorithm if selected properly. The same is the case with pipelining and bit-width reduction, depending upon the type of optimization required. These techniques, when applied at the very beginning, i.e. at the design stage, result in considerable optimization. The optimization techniques that map on the structure of the FPGA are described below, as this is the first step towards performance maximization.

6.2 Optimization Techniques vs FPGA architecture

The internal structure of the Virtex-5 FPGA is shown in Figure 6.1. The ExpressFabric consists of CLBs, each of which has multiple LUTs and a dedicated carry chain for high-speed data propagation.

Figure 6.1 6-input LUTs, CLBs and carry chain of Virtex-5 slice (exploded view)

Different input and output patterns of the LUTs provide different options for implementing combinational logic. From Figure 6.1 it is clear that the LUT has 6 inputs, and for an optimized implementation each available resource has to be used carefully, keeping in view the data it can handle.



Figure 6.2 Virtex-5 FPGA DSP48 slice

The DSP slice in Virtex-5 is shown in Figure 6.2. It has a rated frequency of 550 MHz and a 25 x 18 bit resolution; if fully utilized, this feature helps in the design and prototyping of high-performance digital filters [5].

A detailed analysis of the Virtex-5 reveals that an optimal implementation of any algorithm based on multiplication reduction methods can be mapped on this FPGA using a 6:3 compression tree structure, as opposed to 3:2, 4:2 or other similar structures. This is due to the presence of 6-input LUTs, which reduce the number of hops and thereby the critical path.

6.2.1 Compression Trees

Compression trees are used for multiplication instead of dedicated hardware; one example is the Wallace compression tree [4]. The Wallace tree with a compression ratio of 4:2 was recognized as the most efficient, but with the 6-input LUT present in the Virtex-5 FPGA, the Wallace tree with a compression ratio of 6:3 provides the best results in terms of performance.
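Why a 6:3 ratio fits a 6-input LUT can be seen from a behavioural sketch. A 6:3 counter encodes the number of ones among six equally weighted bits as a three-bit value, and each of the three output bits is a Boolean function of the same six inputs, so it fits one 64-entry truth table. The code below is an illustrative model only, not the synthesized design; all function names are hypothetical.

```python
def counter_6_3(bits):
    """6:3 counter: encode the number of ones among six bits as (weight 4, weight 2, weight 1)."""
    assert len(bits) == 6
    count = sum(bits)                      # 0..6 fits in three bits
    return (count >> 2) & 1, (count >> 1) & 1, count & 1


def lut_tables():
    """Tabulate each output bit as a 64-entry truth table (one 6-input LUT each)."""
    tables = [[], [], []]
    for address in range(64):
        bits = [(address >> i) & 1 for i in range(6)]
        for table, out in zip(tables, counter_6_3(bits)):
            table.append(out)
    return tables


def compress_column(bits):
    """Reduce one column of equally weighted bits with a single level of 6:3 counters.
    Returns the bits produced at relative weights 1, 2 and 4."""
    w1, w2, w4 = [], [], []
    for start in range(0, len(bits), 6):
        group = bits[start:start + 6]
        group = group + [0] * (6 - len(group))   # pad the last group with zeros
        c4, c2, c1 = counter_6_3(group)
        w4.append(c4)
        w2.append(c2)
        w1.append(c1)
    return w1, w2, w4


w1, w2, w4 = compress_column([1] * 12)
print(sum(w1) + 2 * sum(w2) + 4 * sum(w4))  # 12
```

One level of 6:3 counters turns twelve bits of a column into six bits spread over three columns while preserving the arithmetic value, and each counter costs one LUT level of delay.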

6.2.2 Multiplier Pipelining

The DSP48 is the block that performs multiplication and accumulation in the Virtex-5 FPGA. It has three inherent pipeline stages, which enable up to 4 levels of pipelining without incurring any additional hardware resources. The design's throughput can be enhanced from 80 MHz to 500 MHz [5] by using this slice efficiently.

6.2.3 Optimization of Bit Resolution

Appropriate choice of bit width of an algorithm has a direct impact on the

consumption of power, mean square error (MSE) and complexity [8]. The bit

width has to be carefully selected so as to map accurately on the internal

resource structure of FPGA. This will result in a guaranteed optimized design.
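The trade-off between bit width and error can be illustrated with a small fixed-point sketch. It assumes signed coefficients in the range [-1, 1) quantized to one sign bit plus a chosen number of fractional bits (for example Q1.15 for 15 fractional bits). The coefficient values and the function names are hypothetical and serve only as an illustration.

```python
def quantize(value, frac_bits):
    """Round to a signed fixed-point value with `frac_bits` fractional bits,
    saturating to the representable range [-1, 1)."""
    scale = 1 << frac_bits
    code = round(value * scale)
    code = max(-scale, min(scale - 1, code))   # saturate
    return code / scale


def mse(coeffs, frac_bits):
    """Mean square error between coefficients and their quantized versions."""
    errors = [(c - quantize(c, frac_bits)) ** 2 for c in coeffs]
    return sum(errors) / len(errors)


# Hypothetical 7-tap coefficient set, for illustration only.
coeffs = [0.0123, -0.0871, 0.2954, 0.5588, 0.2954, -0.0871, 0.0123]
for bits in (7, 11, 15):
    print(bits, mse(coeffs, bits))
```

With rounding, the error per coefficient is bounded by half a least significant bit, so every added fractional bit lowers the bound on the MSE by a factor of four while widening every adder and multiplier in the data path.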



6.3 Design Optimizations

To analyze the hardware design optimizations, a few digital filters were studied. The first was an FIR filter [6], which was implemented in different forms; the same filter was then converted to its CSD form and synthesized using different compression trees. The second was an IIR filter, implemented in direct and pipelined form. Thirdly, a complex multiplier was also implemented.

6.3.1 Optimization of FIR filter

Figure 6.3 shows an FIR filter with seven taps, which was synthesized on an FPGA of the Xilinx Virtex-5 family. The design was implemented using 8 DSP48 blocks running at 73.678 MHz [9] [10].

Figure 6.3 FIR filter having seven taps

The systolic implementation of the same filter ran about 8 times faster, i.e. at 592 MHz, as shown in Figure 6.4.


Figure 6.4 Systolic FIR filter with cut-set represented by dashed lines

Compression trees with compression ratios of 3:2, 4:2, 6:3 and 7:3 were used after transforming the same filter into Canonic Signed Digit (CSD) form. Figure 6.5, Figure 6.6, Figure 6.7 and Figure 6.8 show the schematics of the various compression trees. The number of non-zero digits in a coefficient is reduced by around 33% using the CSD representation [13].
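The CSD conversion itself can be sketched in a few lines. The routine below is a textbook recoding for non-negative integer coefficients, given as an illustration rather than as the conversion tool used in this work; the function name `to_csd` is hypothetical.

```python
def to_csd(value):
    """Canonic signed digit form of a non-negative integer.
    Returns digits in {-1, 0, 1}, least significant first,
    with no two adjacent non-zero digits."""
    digits = []
    while value != 0:
        if value & 1:
            digit = 2 - (value & 3)   # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


coefficient = 0b01110111                     # 119 has six ones in binary
csd = to_csd(coefficient)
print(csd)                                   # [-1, 0, 0, -1, 0, 0, 0, 1]
print(sum(1 for d in csd if d != 0))         # 3 non-zero digits: 128 - 8 - 1
```

Each non-zero digit corresponds to one shifted operand entering the compression tree, so fewer non-zero digits mean fewer rows to compress.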

Figure 6.5 Schematic of 6:3 type compression trees

Figure 6.8 Schematic of 7:3 type compression trees

6.3.2 Optimization of IIR filter

Figure 6.9 shows the implementation of a first-order IIR filter [7] [8]. Pipeline stages were added to the filter, as shown in Figure 6.10, by application of the look-ahead transformation [2]. The synthesis of both filters shows an increase in clock speed from 247.588 MHz to 370.157 MHz.

$$x(i+1) = a\,x(i) + b\,y(i+1)$$ (6)

Figure 6.9 IIR Filter of first order

Now, after applying the transform, we have

$$x(i+2) = a\,x(i+1) + b\,y(i+2) = a^2 x(i) + ab\,y(i+1) + b\,y(i+2)$$ (7)

Figure 6.10 First order transformation of IIR filter
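That the look-ahead form computes the same output as the original recursion can be checked with a short floating-point sketch. It assumes zero initial state and reads the first-order recursion as x(i+1) = a x(i) + b y(i+1); the coefficient values and function names are assumptions for illustration.

```python
def iir_direct(y, a, b):
    """First-order recursion: each output depends on the previous output."""
    x_prev, out = 0.0, []
    for sample in y:
        x_prev = a * x_prev + b * sample
        out.append(x_prev)
    return out


def iir_look_ahead(y, a, b):
    """Look-ahead form: each output depends on the output two samples back,
    so the feedback loop contains two delays that can be used for pipelining."""
    out = []
    for n, sample in enumerate(y):
        x_two_back = out[n - 2] if n >= 2 else 0.0
        y_one_back = y[n - 1] if n >= 1 else 0.0
        out.append(a * a * x_two_back + a * b * y_one_back + b * sample)
    return out


samples = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0, 0.3]
direct = iir_direct(samples, 0.9, 0.1)
pipelined = iir_look_ahead(samples, 0.9, 0.1)
print(max(abs(p - q) for p, q in zip(direct, pipelined)))
```

The extra multiplications by a^2 and ab are feed-forward terms, so they can be pipelined freely; only the loop through a^2 limits the clock rate.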

6.4 Complex Multiplier

Each complex multiplication involves 4 multiplications, 1 addition and 1 subtraction:

(a + ib) x (c + id) = (ac - bd) + i(ad + bc)    (9)

Figure 6.11 shows the schematic of the complex multiplier.

Figure 6.11 Schematic of complex multiplier
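The operation count of Equation (9) can be made explicit in a minimal sketch, written here for integer operands. The function name `complex_multiply` is hypothetical.

```python
def complex_multiply(a, b, c, d):
    """(a + ib)(c + id) using four real multiplications, one subtraction and one addition."""
    ac, bd, ad, bc = a * c, b * d, a * d, b * c   # the four real multiplications
    return ac - bd, ad + bc                       # (real part, imaginary part)


print(complex_multiply(3, 4, 1, -2))  # (11, -2)
```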


A LUT-based method is used to implement the complex multiplier; utilizing the carry chain makes the implementation very efficient. Partial products are generated with the Booth recoding algorithm [11], which reduces their number by half, and are then reduced by a Wallace tree. Compression trees with different compression ratios are implemented to compare the LUTs used, and the path delays are optimized by left-to-right scanning of the operands.

The two's complement equivalent of a multiplier X is described by the following equations:

$$X = -b_{i-1}\,2^{i-1} + \sum_{k=0}^{i-2} b_k\,2^k$$ (10)

$$X = (b_{i-3} + b_{i-2} - 2b_{i-1})\,2^{i-2} + \cdots + (b_{-1} + b_0 - 2b_1)\,2^0$$ (11)

$$X = \sum_{k=0}^{i/2-1} (b_{2k-1} + b_{2k} - 2b_{2k+1})\,2^{2k}$$ (12)

Here, for an even value of i, b_{i-1} represents the sign bit and b_{-1} = 0. The following equation gives the product Y of X and a multiplicand C:

$$Y = C \sum_{k=0}^{i/2-1} (b_{2k-1} + b_{2k} - 2b_{2k+1})\,2^{2k}$$ (13)

The overall architecture of the optimized complex multiplier, which encodes every two consecutive bits into a single digit by scanning three consecutive bits, is given below. This reduces the number of partial products by half.


Figure 6.12 Complex multiplier incorporating Booth-encoded Wallace tree reduction technique
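The recoding of Equations (12) and (13) can be sketched in software as follows. The code models only the arithmetic of radix-4 Booth recoding, not the LUT-level implementation; the function names and the choice of an 8-bit multiplier are assumptions for illustration.

```python
def booth_radix4_digits(x, width):
    """Radix-4 Booth digits of a `width`-bit two's complement value (width even).
    Digit k equals b[2k-1] + b[2k] - 2*b[2k+1], with b[-1] = 0."""
    assert width % 2 == 0

    def bit(n):
        return 0 if n < 0 else (x >> n) & 1

    return [bit(2 * k - 1) + bit(2 * k) - 2 * bit(2 * k + 1)
            for k in range(width // 2)]


def booth_multiply(c, x, width):
    """Product of multiplicand c and multiplier x as a sum of width/2 partial products."""
    return sum(c * digit * 4 ** k
               for k, digit in enumerate(booth_radix4_digits(x, width)))


print(booth_radix4_digits(7, 8))    # [-1, 2, 0, 0]
print(booth_multiply(13, -7, 8))    # -91
```

An 8-bit multiplier yields four digits, each in the range -2 to 2, so only four partial products (each 0, C, 2C or their negatives) enter the compression tree instead of eight.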

6.5 Experimental Results

Figure 6.3 and Figure 6.9 represent the FIR and IIR filters. These filters have been realized using compression trees with different compression ratios. The designs were coded in VHDL and implemented using Xilinx ISE and the ModelSim simulator.

6.5.1 FIR Filter

Designs were synthesized with a focus on the clock frequency. From the synthesis results, the minimum clock period and the logic utilization are compared. The results of different implementations of the FIR filter, mapped with compression trees of different compression ratios, were compared.

Figure 6.13 Comparison of the frequency (MHz) and number of utilized LUTs in CSD form using different compression trees for the FIR filter

6.5.2 IIR Filter

IIR filters are compared by implementing different forms and incorporating different compression tree ratios.

Figure 6.14 Comparison of the number of utilized LUTs and frequency (MHz) in CSD form using different compression trees for the IIR filter



6.6 Complex Multiplier Synthesis

With the same optimization parameters, a 32-bit complex multiplier was synthesized by incorporating compression trees with different compression ratios, and the results were compared. Figure 6.15 shows the results of the synthesis in terms of look-up tables utilized and path delays.

6.6.1 Optimization of Bit Width

A direct-form FIR filter was synthesized using the CSD implementation [12] for various bit widths of the input and the filter coefficients. Figure 6.16 shows the resulting LUTs and clock speed.

Figure 6.15 Comparison of LUTs and path delay of the complex multiplier using different compression trees

Figure 6.16 LUTs and Clock rates for FIR filter

The results show a 10% saving in look-up tables and an increase of 1.1% in clock speed. The error has also been reduced: the results show a variance of 0.1704 in the LMS error when the Q1.15 format was used during the implementation.

6.7 Conclusion

Key components of DSP systems have been implemented. Throughout the implementation, the focus was on reducing the LUT count and the critical path delay while keeping in view the resources available on the target platform. Compression trees with different compression ratios were realized during the implementation, and the results show that the compression ratio of 6:3 maps correctly on the inherent structure of the Virtex-5 FPGA for all practical purposes.



6.8 References

[1] Xcell Journal, “Achieve high performance with Virtex-5 FPGA”, fourth quarter 2006.

[2] K. Satoh, J. Tada, H. Yanagida, and Y. Tamura, “Parallel Image Reconstruction Operation by Dedicated Hardware for Three Dimensional Ultrasound Imaging”, Proc. of IEEE UFFC, pp. 1522-1525, Nov. 2007.

[3] Keshab K. Parhi, “Pipelined and parallel recursive and adaptive filters”, chapter 10 of Pipelined Adaptive Digital Filters.

[4] Keshab K. Parhi, “Bit level arithmetic architectures”, chapter 13 of Pipelined Adaptive Digital Filters.

[5] Vojin G. Oklobdzija, “The Computer Engineering Handbook”, CRC Press.

[6] Anna Kuncheva and George Yanchev, “Synthesis and implementation of DSP Algorithm in Advanced Programmable architectures”, Proc. of ISCCS 2008.

[7] Antoli Sergyienko, Volodymir Lepekha, Juri Kanevski and Przemyslaw Soltan, “Implementation of IIR Digital Filters in FPGA”, Poland.

[8] Shanthala S and S. Y. Kulkarni, “High speed and low power FPGA Implementation of FIR Filter for DSP Applications”, European Journal of Scientific Research, ISSN 1450-216X, Vol. 31, No. 1 (2009), pp. 19-28.

[9] Xilinx Co., “Xcell Journal vol. 58.59”, Spring 2007.

[10] D. Phanthavong, “Designing with DSP48 blocks using Precision Synthesis,” Xcell Journal, 2005.

[11] Ki-seon Cho, Jong, Jin Seok, Goang Choi, “54x54 bit Radix 4 Multiplier based on modified Booth algorithm”, ACM 2003 1-58113-677.

[12] Aqib Perwaiz and Shoab A. Khan, “Effect of Bit Precision on hardware complexity for DDFS architecture”, IEEE Conference.


Chapter 7

Conclusions and Future Work

This work has addressed optimization techniques tailored to the target technology under consideration. A mathematical model that optimizes the mapping of Digital Signal Processing (DSP) algorithms on FPGAs has been presented. Any high-end DSP system consists of multiple sub-systems. Each sub-system can be defined by multiple architectural options based on the design constraints. Besides architectural design options, there are many other attributes that directly affect the mapped resources. The word-length quantization plays a critical role in further optimizing the selected architectural option. The thesis has modeled all these attributes, and the solution lists the resources required for the optimized mapping. The target device is selected based on the results and the constraints defined in the design. By adjusting the constraints, the target device is changed and low-power solutions are possible. The experiments demonstrate that increasing the word length of intermediate variables does not help in improving the performance beyond a certain point. The thesis has also explored the intricate relationship between intermediate variable lengths and the overall accuracy of the results, and links it with the complexity of the hardware. Several design examples have been listed to confirm the validity of the findings.

In the design space exploration, several architectural options have been discussed. The options include bit-serial, byte-serial, folded, unfolded, and distributed arithmetic based architectures. The architectures that are optimal for custom design may perform poorly once mapped on an FPGA. This observation is substantiated by design examples from compression trees. These trees are fundamental to DSP architectures due to their wide use in general-purpose multiplication, multiplication with constants, and multiple-operand addition and subtraction. Different compression ratios for the Wallace tree have been explored to identify the ratio that best maps on LUT-based FPGAs.

The inherent architecture of the device under consideration plays an important role in optimizing the mapping of the algorithm on the FPGA. An automatic technique that explores different architectural options subject to design constraints can save FPGA resources. The automatic technique is based on a sound mathematical model that helps in suggesting the best target device that meets all the constraints in an optimized solution. Besides exploring architectural options, there are many other design parameters that further help in optimizing the design to meet the required specifications. The quantization of each variable in the algorithm is critical. Optimized Word-Length Allocation (WLA) tailors the precision of arithmetic operations and results in savings in area and cost. The thesis lists techniques for optimization and implements them in pursuit of the ultimate goal of algorithm design. There have been contributions in directly saving a high percentage of resources in the case of FIR and IIR filters, and likewise for complex multipliers. The deductions from the thesis are listed below:

1. The mathematical model presented in chapter 2 helps an algorithm designer to map an algorithm on the different available architectural options; by adjusting the weightages of different resources, the best-fit target FPGA is also identified. The complex example of a WCDMA receiver has been discussed; with the given throughput requirement at each stage, the design maps on the Spartan-3A FPGA under one set of constraints and on the Virtex-5 FPGA under the other set of constraints. Any system can be optimally designed to fit in the FPGA design space based on fine adjustment of the constraints. By carefully adjusting the constraints, low-power solutions are realizable.

2. In a digital signal processing system, the number of bits processed at a time is a major source of resource wastage. The word lengths of variables are selected to meet the application's output error tolerance. The target for an algorithm designer is to achieve an optimum word length at which the cost and the output distortion match set criteria that depend upon the application. In the case of CORDIC (discussed in chapter 4), for all practical purposes the bit resolution of the input variables should be greater than the bit resolution of the angle when CORDIC is used as a Direct Digital Frequency Synthesizer (DDFS).

3. A DSP algorithm designer must determine the dynamic range and desired precision of input, intermediate, and output signals in a design implementation to ensure that the algorithm fidelity criteria are met. In most cases, results show a linear increase in hardware complexity with increasing bit resolution. Going beyond a certain bit resolution is not advisable, as it only adds to the hardware complexity without contributing to a reduction in the least mean square error.

4. For bit-serial multiplication in DSP algorithms, the proposed bit-serial multiplier proved to be more efficient than the conventional one.

5. Compression trees are used to add the partial products of a multiplication and eliminate the need for dedicated multiplier hardware. Traditionally, the 4:2 Wallace tree has been considered the most efficient compression choice, but in our case the 6:3 compression technique is a better option, as it maps exactly on the inherent structure of FPGAs that have 6-input LUTs.

Combining all the above deductions yields a procedure for optimization across the slice fabric of an FPGA. The optimization steps are as follows:

1. Ascertain the word-length allocation.

2. Check the internal pipelining of the DSP blocks within the FPGA in use.

3. For multiplication, use the compression technique that maps exactly on the internal architecture of the target device.

4. For serial multiplication, use the proposed bit-wise serial triangular compressor multiplier.

5. The proposed CORDIC can be used as a DDFS.



A further extension of the work is the compilation of a component library for different FPGA vendors to automate the optimization of DSP algorithms by designers. Another extension is the application of multiple implementation techniques to the components of any complex digital signal processing system.
