Module 6 - Imperial College London

Digital Signal Processing. Slide 6.1

Module 6

Architectures for DSP


Contents

• Nature of DSP algorithms

• Microprocessor architectures
  - Von Neumann
  - Harvard

• Desirable features in DSP hardware

• Examples
  - TMS320C5x
  - DSP56000/1
  - DSP56300

• FPGA Implementations


Practical Applications

• Real-time implementation
  - One sample out for one sample in

• Finite precision
  - Fixed or floating point computation

• DSP algorithms are
  - Compute intensive
  - Data move intensive

• Standard microprocessor architectures are not fast enough


DSP Processor Architecture

• Von Neumann (general purpose computers)
  - Single memory space
  - Single address bus
  - Single data bus

• Harvard architecture (DSP hardware)
  - Separate memory space for program and data
  - Separate memory and data busses
  - Separate address arithmetic unit for each memory space


Useful Hardware Features for DSP

• Single cycle multiply-accumulate (MAC)

• Long accumulator word length
  - Results of MAC grow

• Bit reverse addressing
  - Needed to re-order data in FFTs

• Zero-overhead looping
  - Many DSP algorithms have short loops executed many times

• Short cycle time
  - The faster the better!
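As an illustration of the bit reverse addressing mentioned above, the following minimal Python sketch lists the order in which a radix-2 FFT of length N would access its data. The function names `bit_reverse` and `bit_reversed_order` are hypothetical and chosen for this example only; a DSP processor generates these addresses in hardware rather than in a software loop.

```python
def bit_reverse(index: int, num_bits: int) -> int:
    """Reverse the lowest num_bits bits of index."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)  # shift in the current LSB
        index >>= 1
    return result


def bit_reversed_order(n: int) -> list:
    """Return the indices 0..n-1 in bit-reversed order (n must be a power of two)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    num_bits = n.bit_length() - 1
    return [bit_reverse(i, num_bits) for i in range(n)]
```

For an 8-point FFT, `bit_reversed_order(8)` gives `[0, 4, 2, 6, 1, 5, 3, 7]`.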


Example ALU


Motorola DSP56300

Key Features

• General-purpose digital signal processing for multimedia and telecommunication applications, videoconferencing, cellular telephony
• A member of the DSP56300 core family of programmable CMOS DSPs
• 24-bit addressing, instruction cache, and DMA
• 80/100 MIPS using an internal 80/100 MHz clock at 3.0-3.6 volts
• Highly parallel instruction set
• Fully pipelined 24 x 24-bit parallel Multiplier-Accumulator (MAC)
• 56-bit parallel barrel shifter
• 24-bit or 16-bit arithmetic support under software control
• Addressing modes optimized for DSP applications
• Nested hardware DO loops
• Fast auto-return interrupts
• 32-bit parallel PCI Host Interface
• Two Enhanced Synchronous Serial Interfaces (ESSI)
• Very low power CMOS design
• Wait and Stop low power standby modes

Memory Interface

• 1024-4096 x 24-bit Program RAM
• 2048/3072 x 24-bit X data RAM
• 2048/3072 x 24-bit Y data RAM
• 3K x 24-bit bootstrap ROM
• Data memory expansion to two 16 M x 24-bit word memory spaces
• Program memory expansion to one 16 M x 24-bit word memory space
• On-chip DRAM controller requires no additional circuitry to interface to DRAMs
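The earlier point that the results of a MAC grow can be made concrete with a small Python sketch. It assumes a signed 24 x 24-bit multiplier feeding a 56-bit accumulator; the 56-bit accumulator width is an assumption here, suggested by the 56-bit barrel shifter in the feature list. The function name `max_safe_accumulations` is hypothetical.

```python
def max_safe_accumulations(operand_bits: int, acc_bits: int) -> int:
    """Number of worst-case signed products that can be summed without overflow.

    The worst-case product of two signed operand_bits-wide values is
    (-2**(operand_bits - 1)) ** 2, and the largest value a signed
    accumulator of acc_bits can hold is 2**(acc_bits - 1) - 1.
    """
    worst_product = (1 << (operand_bits - 1)) ** 2
    acc_max = (1 << (acc_bits - 1)) - 1
    return acc_max // worst_product
```

Under these assumptions a 48-bit accumulator could hold only a single worst-case product, whereas the 8 extra bits of a 56-bit accumulator leave room for several hundred accumulations, which is why a long accumulator word length is desirable.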


Motorola StarCore

• High performance DSP processor

• Supported by the joint effort of a consortium involving manufacturers and developers (e.g. Lucent)

• A key criterion for the success of a processor is ‘on the plate’ delivery to developers of:
  - pre-written software modules for standard functions such as FFTs and filters
  - application notes
  - prototyping support such as hardware evaluation boards with PC software development environments


• StarCore has a strongly pipelined architecture
  - 4 MACs per clock
  - 1 StarCore MIP is worth about 4 regular MIPS

• Dedicated on-chip co-processor for filtering

• 16, 24 and double precision 32 bit support

• Aimed mostly at ‘high-end’ applications
  - Telecoms infrastructure products such as network gateways
  - E.g. 96 channels of 128-tap adaptive filtering per chip

The FPGA DSP Evolution

• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore’s law.

• Late 1990s: FPGAs allow multipliers to be implemented in FPGA logic fabric. A few multipliers per device are possible.

• Early 2000s: FPGAs place hardwired multipliers onto the device with clocking speeds of > 100MHz. Number of multipliers from 4 to > 500.

• Mid 2000s: FPGAs place DSP algorithm signal flow graphs (SFGs) onto devices. Full (pipelined) FIR SFG filters, for example, are available (DSP48 slice).

• Late 2000s: who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), perhaps block floating point. But rest assured there is more coming....

August 2005, For Academic Use Only, All Rights Reserved

Patrick Naylor

FPGA Architecture

[Figure: FPGA architecture, showing routing channels, programmable interconnect points (PIPs), I/O blocks, RAM and logic blocks.]

The Logic Slice

• A logic block, or slice in Xilinx terminology, should contain typical logic components to be flexible and allow a wide range of logic functionality and arithmetic functionality to be created:

[Figure: generic FPGA logic block, containing look-up tables, multiplexers, flip-flops and carry logic.]

• The elements of a logic block or group of logic blocks are used to build user defined functions.

FPGAs: A “Box” of DSP blocks

• We might be tempted to view the latest Xilinx FPGAs as repositories of DSP components just waiting to be connected together.

• In the days of circuit boards one had to be careful about running busses close together, lengths of wires etc. Similar considerations are required for FPGAs and are dealt with by synthesis and other tools.

• However, the high level concept is: take the blocks, and build it:

[Figure: a “box” of blocks (“connectors”, logic, arithmetic, registers and memory, clocks, input/output) feeding a flow of design, verify, and place and route.]

Using FPGA - Rethinking DSP Design

• Think very fast - Current data clocking rates of 200MHz and more are achievable now.

• Think minimum data bit-widths - FPGA data words need only be as wide as is necessary for the algorithm/application.

• Think DSP “tricks” - we will be using some “simple” (i.e. low cost filters) - CIC, difference filters, moving average - on the FPGA.

• Think Oversampling Strategies - using sigma delta techniques to produce simple multiplier-free digital filters.

• Think Undersampling Strategies - for communications we can make use of high sampling rates and digital filters for digital downconversion.

• Think algorithms with square root and divide operations, which are traditionally avoided for conventional DSP processors.

• Think differently - it’s a new design challenge.
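As one hedged illustration of the low cost filters referred to above, the Python sketch below computes a running sum over the last N samples using only one addition and one subtraction per sample, with no multiplier. The function name `moving_sum` is hypothetical, and the sketch models behaviour only, not any particular FPGA implementation.

```python
from collections import deque


def moving_sum(samples, length):
    """Running sum of the last `length` samples (history initialised to zero)."""
    window = deque([0] * length, maxlen=length)
    total = 0
    out = []
    for x in samples:
        total += x - window[0]  # add the newest sample, subtract the oldest
        window.append(x)        # maxlen makes the oldest sample drop off
        out.append(total)
    return out
```

When the length is a power of two, the moving average is the running sum shifted right by log2(length) bits, so the whole filter stays multiplier-free.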

DSP Implementation with FPGAs

• The power of FPGAs for DSP is primarily in their low level simplicity on which to build high level complexity.

• We can demonstrate some of the design options by building a digital filter from first principles using just full adders (FA):

A B Cin | Cout Sout
0 0 0   | 0    0
0 0 1   | 0    1
0 1 0   | 0    1
0 1 1   | 1    0
1 0 0   | 0    1
1 0 1   | 1    0
1 1 0   | 1    0
1 1 1   | 1    1

$S_{out} = \bar{A}\bar{B}C_{in} + \bar{A}B\bar{C}_{in} + A\bar{B}\bar{C}_{in} + ABC_{in} = A \oplus B \oplus C_{in}$

$C_{out} = \bar{A}BC_{in} + A\bar{B}C_{in} + AB\bar{C}_{in} + ABC_{in} = AB + AC_{in} + BC_{in}$

[Figure: full adder symbol Σ with inputs A, B and Cin and outputs Sout and Cout, fclk = 200MHz.]

• With a typical FPGA logic block we can produce one or more FAs (either from available logic or via look-up table approaches).

Notes:

In a typical FPGA the FA circuit can be clocked at a very high rate; for argument's sake we will specify fclk = 200MHz.

In the following sequence of high level designs we want to demonstrate how this simple FA can be used to produce a powerful DSP digital filter also, potentially, running at 200 Msamples/second. Therefore the slides perhaps do not present exactly how a custom DSP digital filter would be produced. However the design will allow us to demonstrate the difference between data rates and logic clock rates, and strategies for reducing costs by efficient design.

The design techniques associated with implementing multipliers and adders and so on are probably well known to ASIC engineers, but probably not well known to DSP engineers.
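The full adder that these notes build on can be sketched in a few lines of Python. This is a behavioural model only, using the sum and carry expressions from the slide; the function names `full_adder` and `truth_table` are hypothetical.

```python
from itertools import product


def full_adder(a, b, cin):
    """Return (cout, sout) for single-bit inputs a, b and cin."""
    sout = a ^ b ^ cin                       # Sout = A xor B xor Cin
    cout = (a & b) | (a & cin) | (b & cin)   # Cout = AB + ACin + BCin
    return cout, sout


def truth_table():
    """All eight rows as (A, B, Cin, Cout, Sout) tuples."""
    return [(a, b, cin) + full_adder(a, b, cin)
            for a, b, cin in product((0, 1), repeat=3)]
```

Each row satisfies 2·Cout + Sout = A + B + Cin, which is a quick way to check the table.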

FPGA - 8 Bit Parallel Adder

• 8 FAs and some flip-flops produce an 8 bit full adder (pipelined):

[Figure: eight full adders (Σ) chained with a carry-in of 0 to form the parallel adder Σp, fclk = 200MHz. Worked example: 00101001 + 01000101 = 01101110.]

• Data can be clocked into this circuit at a rate of 200MHz, i.e. 200,000,000 8-bit additions per second.

Notes:

If we choose to pipeline the 8 bit parallel adder then we can reliably clock this also at fclk = 200MHz, meaning 200,000,000 adds/s.

If we choose not to pipeline (insert no single bit delays or flip-flops between FAs) then the adder has a carry ripple which can limit the maximum clocking speed. (This is not necessarily a wrong thing to do and in some cases not pipelining may be desirable, however for this sequence of examples we choose to pipeline.)

Note that we could also use a single FA and perform the addition serially. Because we are sharing the FA, the data processing rate is reduced by a factor of 8, i.e. fdata = 200 / 8 = 25MHz, 25,000,000 adds/s.

Note that to extend to, for example, a 14 bit adder, just add 6 more stages for the parallel adder, or clock the serial adder for another 6 cycles; however the serial data rate then reduces to 200/14 ≈ 14.3 million adds/s.

[Figure: bit serial 8 bit adder Σs, a single full adder (Σ) with a one-bit delay feeding the carry back, fclk = 200MHz, fdata = 25MHz. Inputs 10010100 and 10100010 and output 01110110 are shown LSB first.]
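The difference between the parallel and the bit serial adder can be sketched behaviourally in Python. Both functions below use the same full adder; the parallel version models one FA per bit position, while the serial version models a single shared FA that is clocked once per bit. All function names are hypothetical, and the sketch ignores pipelining latency.

```python
def full_adder(a, b, cin):
    """Return (cout, sout) for single-bit inputs."""
    return (a & b) | (a & cin) | (b & cin), a ^ b ^ cin


def to_bits_lsb_first(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb_first(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def parallel_add(x, y, width=8):
    """Model of `width` chained FAs; returns (sum, carry_out)."""
    carry, out = 0, []
    for a, b in zip(to_bits_lsb_first(x, width), to_bits_lsb_first(y, width)):
        carry, s = full_adder(a, b, carry)
        out.append(s)
    return from_bits_lsb_first(out), carry


def serial_add(x_bits, y_bits):
    """Model of one shared FA; returns (sum bits LSB first, clock cycles used)."""
    carry, out, cycles = 0, [], 0
    for a, b in zip(x_bits, y_bits):        # one bit pair per clock cycle
        carry, s = full_adder(a, b, carry)  # carry is held in the one-bit delay
        out.append(s)
        cycles += 1
    return out, cycles


def data_rate_mhz(fclk_mhz, cycles_per_result):
    """Result rate when each result occupies the logic for several clock cycles."""
    return fclk_mhz / cycles_per_result
```

With the slide's example, `parallel_add(0b00101001, 0b01000101)` returns the sum 0b01101110 with no carry out, and the serial adder produces the same bits LSB first after 8 cycles, giving `data_rate_mhz(200, 8)` = 25 MHz.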

FPGA - 8 bit Multiplier

• With just a few additional logic gates, we can produce an 8 bit parallel multiplier:

[Figure: parallel multiplier Mp, an array of cells with fdata = 200MHz. Legend: Σ is the full adder for the multiply array (FAx); each cell also contains additional logic (flip-flop, XOR gate...).]

• Data in this circuit can also be pipelined and clocked in at 200MHz (although there would be a latency), i.e. 200,000,000 8 bit multiplies per second.

Notes:

This “example” array is simply a “mapping” of a direct 8 bit multiplication whereby 8 partial products are created and added together. The cost of each “cell” is just a little more than the logic cost of a full adder (FA), which we might denote simply as FAx. (Regardless of how the multiplier is implemented there is a cost associated, and the more bits then the higher the cost, e.g. if done using memory then more memory is required for more bits.)

        11010110
      x 00101101
----------------
        11010110
       00000000
      11010110
     11010110
    00000000
   11010110
  00000000
 00000000
----------------
0010010110011110

The array has many variants (for signed numbers, carry lookahead etc).

Alternatively we could reduce the hardware costs and use one parallel adder and feed back partial products to produce a serial multiplier. The logic in this circuit could still be clocked at 200MHz, but one multiply would take 8 cycles and hence the data rate is only 200/8 = 25MHz.

The concept of a constant area-speed product is evident from a comparison of the parallel and serial multipliers (and also the parallel and serial adder example above). Generally speaking, if we reduce the silicon area/resources required by a factor of N, then the computation time increases by N, as the resources must be shared by N different sub-computations in a time sequential manner.

[Figure: serial multiplier Ms, fdata = 25MHz.]
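The partial product view of the multiplier can be sketched behaviourally in Python. The array version sums all 8 partial products at once, while the serial version feeds one partial product per clock cycle into a single accumulating adder. The function names are hypothetical and the model covers unsigned operands only.

```python
def partial_products(multiplicand, multiplier, width=8):
    """One shifted partial product per multiplier bit, LSB first."""
    return [(multiplicand if (multiplier >> i) & 1 else 0) << i
            for i in range(width)]


def array_multiply(multiplicand, multiplier, width=8):
    """Model of the parallel array: all partial products are added together."""
    return sum(partial_products(multiplicand, multiplier, width))


def serial_multiply(multiplicand, multiplier, width=8):
    """Model of the serial multiplier; returns (product, clock cycles used)."""
    acc, cycles = 0, 0
    for pp in partial_products(multiplicand, multiplier, width):
        acc += pp      # one shared adder, one partial product per cycle
        cycles += 1
    return acc, cycles
```

For the example in the notes, both versions give 11010110 x 00101101 = 0010010110011110, and the serial version takes 8 cycles, which is the factor of 8 between the 200MHz clock and the 25MHz data rate.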

FPGA FIR Filter

• Using 7 parallel multipliers and 6 parallel adders we can produce an 8 tap parallel FIR digital filter (FIRp):

[Figure: parallel FIR filter FIRp, with weights w0 to w6 each feeding a parallel multiplier Mp, combined by six parallel adders Σp, fdata = 200MHz.]

• Data in this FIR filter is pipelined and clocked at 200MHz, i.e. 200,000,000 samples per second.

Notes:

If we choose to use the slower serial multipliers and adders, then the cost of the hardware reduces to about 1/8, however the data rate is only 25MHz:

[Figure: serial FIR filter FIRs, built from serial multipliers Ms and serial adders Σs, fdata = 25MHz.]

Or alternatively we could share a single parallel multiplier and a single parallel adder - approximately the same cost as the circuit above - but this time only have a data sampling rate of 200/8 = 25MHz.

[Figure: FIR filter FIRs sharing one parallel multiplier Mp and one parallel adder Σp, with the data and weights multiplexed into the multiplier, fdata = 25MHz.]

Sharing a single parallel multiplier is of course similar to the concept of a DSP processor!
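What the FIR structures above compute can be sketched in Python as a direct-form filter: each output sample is the sum of the weights multiplied by the most recent input samples, i.e. one multiply-accumulate per weight. The function name `fir_filter` is hypothetical, and the weight values in the usage note are illustrative only, not taken from the slides.

```python
def fir_filter(samples, weights):
    """Direct-form FIR: y[n] = sum over k of weights[k] * x[n - k]."""
    delay_line = [0] * len(weights)  # models the chain of sample delays
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        out.append(sum(w * d for w, d in zip(weights, delay_line)))
    return out
```

Feeding in a unit impulse returns the weights themselves: with the illustrative weights `[1, 2, 3, 4, 3, 2, 1]` for w0 to w6, the input `[1, 0, 0, 0, 0, 0, 0, 0]` gives `[1, 2, 3, 4, 3, 2, 1, 0]`. In the parallel hardware all of the multiplies for one output happen in the same clock cycle; in the shared-multiplier version they happen one after another.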

FIR Filter Banks

• For a particular digital communications application we require 4 channels each of 25 MHz bandwidth:

[Figure: magnitude against frequency, showing four adjacent channels each 25MHz wide.]

• We can set up 4 parallel FIR filters with each one running at 200MHz sampling rate:

[Figure: four parallel FIR filters FIRp, fdata = 200MHz.]

• So the total computation rate is 4 x 7 x 200M = 5.6 billion MAC/sec!

MAC - Multiply/accumulate operation

Notes:

5.6 billion MACs/sec is a lot of processing! With current FPGA technology this type of design is absolutely possible. In fact we can easily go an order of magnitude higher with high specification FPGAs.

Typically a state of the art DSP processor could implement around 500 million multiply-adds per second. Therefore around 12 are required to sustain this rate! Of course the DSP processor has other capabilities and flexibilities that the FPGA does not have, however for this specific requirement a DSP processor is a poor solution compared to the FPGA solution.

Once again, if our requirement was different and only a data sampling rate of 25MHz was required then we could design using serial FIR filters with a total of 1/8 of the hardware cost (remember the individual logic elements are still clocked at 200MHz):

[Figure: four serial FIR filters FIRs, fdata = 25MHz.]

Or we could share one fully parallel FIR filter and multiplex the four channels:

[Figure: one parallel FIR filter FIRp shared between the four channels, fdata = 25MHz.]
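The arithmetic behind these figures can be checked with a short Python sketch. The function names `mac_rate` and `processors_needed` are hypothetical; the numbers are those quoted in the slides (4 channels, 7 MACs per output sample, 200MHz sampling rate, about 500 million multiply-adds per second per DSP processor).

```python
import math


def mac_rate(channels, macs_per_sample, sample_rate_hz):
    """Total multiply-accumulate operations per second."""
    return channels * macs_per_sample * sample_rate_hz


def processors_needed(required_mac_rate, processor_mac_rate):
    """Whole number of processors needed to sustain the required MAC rate."""
    return math.ceil(required_mac_rate / processor_mac_rate)
```

`mac_rate(4, 7, 200_000_000)` gives 5,600,000,000 MAC/s, and dividing by 500,000,000 gives 11.2, which rounds up to the 12 processors mentioned in the notes.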
