Module 6 - Imperial College London
Digital Signal Processing. Slide 6.1
Module 6
Architectures for DSP
Digital Signal Processing. Slide 6.2
Contents
• Nature of DSP algorithms
• Microprocessor architectures
  - Von Neumann
  - Harvard
• Desirable features in DSP hardware
• Examples
  - TMS320C5x
  - DSP56000/1
  - DSP56300
• FPGA Implementations
Digital Signal Processing. Slide 6.3
Practical Applications
• Real-time implementation
  - One sample out for one sample in
• Finite precision
  - Fixed or floating point computation
• DSP algorithms are
  - Compute intensive
  - Data move intensive
• Standard microprocessor architectures are not fast enough
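The "one sample out for one sample in" constraint can be read as a cycle budget per sample. The following is a minimal sketch under assumed figures: the function names are hypothetical, and the 80 MHz clock and 48 kHz sample rate are example values chosen for illustration, not taken from this slide.

```python
def cycles_per_sample(clock_hz: float, sample_rate_hz: float) -> int:
    """Whole processor cycles available between two consecutive input samples."""
    return int(clock_hz // sample_rate_hz)


def max_fir_taps(clock_hz: float, sample_rate_hz: float, cycles_per_tap: int = 1) -> int:
    """Upper bound on FIR length if every tap costs `cycles_per_tap` cycles
    and all other overheads (I/O, loop control) are ignored."""
    return cycles_per_sample(clock_hz, sample_rate_hz) // cycles_per_tap


if __name__ == "__main__":
    # Assumed example: 80 MHz clock, 48 kHz sampling.
    print(cycles_per_sample(80e6, 48e3))   # budget in cycles
    print(max_fir_taps(80e6, 48e3, 1))     # single-cycle MAC
    print(max_fir_taps(80e6, 48e3, 3))     # three cycles per tap
```

Under these assumptions the budget is 1666 cycles per sample, so a processor that needs three cycles per tap supports a filter only a third as long (555 taps) as one with a single-cycle MAC.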
Digital Signal Processing. Slide 6.4
DSP Processor Architecture
• Von Neumann (general purpose computers)
  - Single memory space
  - Single address bus
  - Single data bus
• Harvard architecture (DSP hardware)
  - Separate memory space for program and data
  - Separate memory and data busses
  - Separate address arithmetic unit for each memory space
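One way to see why separate memories help is to count bus cycles for a single filter tap, which needs one instruction fetch, one coefficient read and one sample read. This is a deliberately simplified model, not a description of any particular processor: the function name and the assumption that each bus carries exactly one access per cycle are illustrative only.

```python
def bus_cycles_per_tap(shared_program_and_data: bool, data_memories: int = 1) -> int:
    """Bus cycles needed for one multiply-accumulate tap in a toy memory model.

    Each tap needs: 1 instruction fetch, 1 coefficient read, 1 sample read.
    Every bus is assumed to carry one access per cycle.
    """
    instruction_fetches = 1
    data_reads = 2  # coefficient + sample
    if shared_program_and_data:
        # Von Neumann: all three accesses queue on the single bus.
        return instruction_fetches + data_reads
    # Harvard: the instruction fetch overlaps with the data reads,
    # and the data reads are spread over the available data memories.
    return -(-data_reads // data_memories)  # ceiling division


if __name__ == "__main__":
    print(bus_cycles_per_tap(True))        # Von Neumann
    print(bus_cycles_per_tap(False, 1))    # Harvard, one data memory
    print(bus_cycles_per_tap(False, 2))    # Harvard, two data memories
```

In this toy model the single-bus machine needs 3 cycles per tap, a Harvard machine with one data memory needs 2, and one with two data memories needs 1.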
Digital Signal Processing. Slide 6.5
Useful Hardware Features for DSP
• Single cycle multiply-accumulate (MAC)
• Long accumulator word length
  - Results of MAC grow
• Bit reverse addressing
  - Needed to re-order data in FFTs
• Zero-overhead looping
  - Many DSP algorithms have short loops executed many times
• Short cycle time
  - The faster the better!
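Bit reverse addressing can be illustrated in software. The sketch below computes the bit-reversed index order that a radix-2 FFT uses to re-order its data; the function names are hypothetical, and a DSP would generate these addresses in hardware rather than with a loop.

```python
def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (e.g. 0b001 -> 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n: int) -> list:
    """Bit-reversed index sequence for an n-point radix-2 FFT (n must be a power of two)."""
    bits = n.bit_length() - 1
    if n <= 0 or (1 << bits) != n:
        raise ValueError("n must be a positive power of two")
    return [bit_reverse(i, bits) for i in range(n)]


if __name__ == "__main__":
    print(bit_reversed_order(8))
```

For an 8-point FFT the order is 0, 4, 2, 6, 1, 5, 3, 7.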
Digital Signal Processing. Slide 6.6
Example ALU
Digital Signal Processing. Slide 6.7
Motorola DSP56300
Key Features
• General-purpose digital signal processing, with multimedia and telecommunication applications such as videoconferencing and cellular telephony
• It is a member of the DSP56300 core family of programmable CMOS DSPs
• 24-bit addressing, instruction cache, and DMA
• 80/100 MIPS using an internal 80/100 MHz clock at 3.0-3.6 volts
• Highly parallel instruction set
• Fully pipelined 24 x 24-bit parallel Multiplier-Accumulator (MAC)
Digital Signal Processing. Slide 6.8
• 56-bit parallel barrel shifter
• 24-bit or 16-bit arithmetic support under software control
• Addressing modes optimized for DSP applications
• Nested hardware DO loops
• Fast auto-return interrupts
• 32-bit parallel PCI Host Interface
• Two Enhanced Synchronous Serial Interfaces (ESSI)
• Very low power CMOS design
• Wait and Stop low power standby modes
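The need for a long accumulator (results of MAC grow) can be checked with a little arithmetic. This is a minimal sketch, assuming a 56-bit accumulator fed by 24 x 24-bit signed products. The slides list a 24 x 24-bit MAC and a 56-bit barrel shifter; the accumulator width and the function names are assumptions made for illustration.

```python
def accumulator_bits(word_bits: int, n_terms: int) -> int:
    """Conservative accumulator width: full product width plus growth (guard) bits."""
    product_bits = 2 * word_bits
    guard_bits = (n_terms - 1).bit_length()  # ceil(log2(n_terms)) for n_terms >= 1
    return product_bits + guard_bits


def fits(value: int, bits: int) -> bool:
    """True if `value` is representable as a `bits`-wide two's complement word."""
    return -(1 << (bits - 1)) <= value <= (1 << (bits - 1)) - 1


if __name__ == "__main__":
    # Worst case: 256 products of the most negative 24-bit value with itself.
    worst_sum = 256 * (-(1 << 23)) * (-(1 << 23))
    print(accumulator_bits(24, 256))
    print(fits(worst_sum, 48), fits(worst_sum, 56))
```

With these assumptions, 256 accumulations call for 48 + 8 = 56 bits: the worst-case sum overflows a 48-bit word but fits in 56 bits.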
Digital Signal Processing. Slide 6.9
Memory Interface
• 1024-4096 x 24-bit Program RAM
• 2048/3072 x 24-bit X data RAM
• 2048/3072 x 24-bit Y data RAM
• 3K x 24-bit bootstrap ROM
• Data memory expansion to two 16 M x 24-bit word memory spaces
• Program memory expansion to one 16 M x 24-bit word memory space
• On-chip DRAM controller requires no additional circuitry to interface to DRAMs
Digital Signal Processing. Slide 6.10
Motorola StarCore
• High performance DSP processor
• Supported by joint effort of a consortium involving manufacturers and developers (e.g. Lucent)
• A key criterion for success of a processor is ‘on the plate’ delivery to developers
  - pre-written software modules for standard functions such as FFTs and filters
  - application notes
  - prototyping support such as hardware evaluation boards with PC software development environments
Digital Signal Processing. Slide 6.11
• StarCore has a strongly pipelined architecture
  - 4 MACs per clock
  - 1 StarCore MIPS is worth about 4 regular MIPS
• Dedicated on-chip co-processor for filtering
• 16, 24 and double precision 32 bit support
• Aimed mostly at ‘high-end’ applications
  - Telecoms infrastructure products such as network gateways
  - E.g. 96 channels of 128-tap adaptive filtering per chip
Slide 1.2
The FPGA DSP Evolution
• Since around 1998 the evolution of FPGAs into the DSP market has been sustained by classic technology progress such as the ever present Moore’s law.
• Late 1990s: FPGAs allow multipliers to be implemented in FPGA logic fabric. A few multipliers per device are possible.
• Early 2000s: FPGAs place hardwired multipliers onto the device with clocking speeds of > 100MHz. Number of multipliers from 4 to > 500.
• Mid 2000s: FPGAs place DSP algorithm signal flow graphs (SFGs) onto devices. Full (pipelined) FIR SFG filters, for example, are available (DSP48 slice).
• Late 2000s: who knows! Probably more DSP power, more arithmetic capability (fast square root, divide), perhaps block floating point. But rest assured there is more coming....
August 2005, For Academic Use Only, All Rights Reserved
Slide 7.5
FPGA Architecture
[Figure: FPGA architecture showing logic blocks, I/O blocks, RAM, routing channels and programmable interconnect points (PIPs)]
Slide 7.6
The Logic Slice
• A logic block, or slice in Xilinx terminology, should contain typical logic components to be flexible and allow a wide range of logic functionality and arithmetic functionality to be created:
[Figure: generic FPGA logic block containing look-up tables, multiplexers, flip-flops and carry logic]
• The elements of a logic block or group of logic blocks are used to build user defined functions.
Slide 1.3
FPGAs: A “Box” of DSP blocks
• We might be tempted to view the latest Xilinx FPGAs as repositories of DSP components just waiting to be connected together.
• In the days of circuit boards one had to be careful about running busses close together, lengths of wires etc. Similar considerations are required for FPGAs and are dealt with by synthesis and other tools.
• However, the high level concept is: take the blocks, and build it:
[Figure: a box of blocks (“connectors”, logic, arithmetic, registers and memory, clocks, input/output) feeding a flow of design, verify, place and route]
Slide 1.12
Using FPGA - Rethinking DSP Design
• Think very fast - Current data clocking rates of 200 MHz and more are achievable now.
• Think minimum data bit-widths - FPGA data words need only be as wide as is necessary for the algorithm/application.
• Think DSP “tricks” - we will be using some “simple” (i.e. low cost filters) - CIC, difference filters, moving average FPGA.
• Think Oversampling Strategies - using sigma delta techniques to produce simple multiplier-free digital filters.
• Think Undersampling Strategies - for communications we can make use of high sampling rates and digital filters for digital downconversion.
• Think algorithms with square root and divide operations, which are traditionally avoided for conventional DSP processors.
• Think differently - it’s a new design challenge.
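As an illustration of a "simple", multiplier-free filter of the kind listed above, a moving average can be computed as a running sum that needs only one add and one subtract per sample. This is a minimal software sketch with hypothetical function names; restricting the length to a power of two (so the division becomes a shift) is an assumption made here, not something stated on the slide.

```python
from collections import deque


def moving_sum(samples, length):
    """Running sum of the last `length` samples, using one add and one
    subtract per input sample (no multiplies). History starts at zero."""
    window = deque([0] * length, maxlen=length)
    acc = 0
    out = []
    for x in samples:
        acc += x - window[0]  # add the newest sample, remove the oldest
        window.append(x)      # maxlen discards the oldest entry
        out.append(acc)
    return out


def moving_average_pow2(samples, log2_length):
    """Moving average over 2**log2_length integer samples; the divide is a right shift."""
    length = 1 << log2_length
    return [s >> log2_length for s in moving_sum(samples, length)]


if __name__ == "__main__":
    print(moving_sum([1, 2, 3, 4, 5], 3))
    print(moving_average_pow2([4, 4, 4, 4, 4, 4], 2))
```

The first call gives [1, 3, 6, 9, 12] and the second [1, 2, 3, 4, 4, 4], showing the start-up transient while the window fills.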
Slide 1.13
DSP Implementation with FPGAs
• The power of FPGAs for DSP is primarily in their low level simplicity on which to build high level complexity.
• We can demonstrate some of the design options by building a digital filter from first principles using just full adders (FA):
[Figure: full adder (Σ) with inputs A, B, Cin and outputs Sout, Cout, fclk = 200MHz]

A  B  Cin | Cout Sout
0  0  0   |  0    0
0  0  1   |  0    1
0  1  0   |  0    1
0  1  1   |  1    0
1  0  0   |  0    1
1  0  1   |  1    0
1  1  0   |  1    0
1  1  1   |  1    1

With C denoting Cin:
$S_{out} = \bar{A}\bar{B}C + \bar{A}B\bar{C} + A\bar{B}\bar{C} + ABC = A \oplus B \oplus C$
$C_{out} = \bar{A}BC + A\bar{B}C + AB\bar{C} + ABC = AB + AC + BC$
• With a typical FPGA logic block we can produce one or more FAs (either from available logic or via look-up table approaches).
Notes:
In a typical FPGA the FA circuit can be clocked at a very high rate; for argument's sake we will specify fclk = 200MHz.
In the following sequence of high level designs we want to demonstrate how this simple FA can be used to produce a powerful DSP digital filter also, potentially, running at 200 Msamples/second. Therefore the slides perhaps do not present exactly how a custom DSP digital filter would be produced. However the design will allow us to demonstrate the difference between data rates and logic clock rates, and strategies for reducing costs by efficient design.
The design techniques associated with implementing multipliers and adders and so on are probably well known to ASIC engineers, but probably not well known to DSP engineers.
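The full adder logic can be checked in a few lines. The sketch below models the FA in software in both its compact form (XOR for the sum, majority for the carry) and its sum-of-products form, then confirms that the two agree with ordinary arithmetic for all eight input combinations. The function names are hypothetical.

```python
from itertools import product


def full_adder(a: int, b: int, cin: int):
    """Compact form: Sout = A xor B xor Cin, Cout = AB + ACin + BCin."""
    sout = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return cout, sout


def full_adder_sop(a: int, b: int, c: int):
    """Sum-of-products form read directly from the truth table."""
    na, nb, nc = 1 - a, 1 - b, 1 - c  # complemented inputs
    sout = (na & nb & c) | (na & b & nc) | (a & nb & nc) | (a & b & c)
    cout = (na & b & c) | (a & nb & c) | (a & b & nc) | (a & b & c)
    return cout, sout


if __name__ == "__main__":
    print("A B Cin | Cout Sout")
    for a, b, cin in product((0, 1), repeat=3):
        cout, sout = full_adder(a, b, cin)
        assert (cout, sout) == full_adder_sop(a, b, cin)
        assert 2 * cout + sout == a + b + cin
        print(a, b, cin, " |", cout, "  ", sout)
```

Running it prints the truth table; the assertions confirm that both forms describe the same circuit.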
Slide 1.14
FPGA - 8 Bit Parallel Adder
• 8 FAs and some flip-flops produce an 8 bit full adder (pipelined):
[Figure: eight FAs (Σ) in a row forming the parallel adder Σp, fclk = 200MHz. Worked example: 00101001 + 01000101 = 01101110]
• Data can be clocked into this circuit at a rate of 200MHz, i.e. 200,000,000 8-bit additions per second.
Notes:
If we choose to pipeline the 8 bit parallel adder then we can reliably clock this also at fclk = 200MHz, meaning 200,000,000 adds/s.
If we choose not to pipeline (insert no single bit delays or flip-flops between FAs) then the adder has a carry ripple which can limit the maximum clocking speed. (This is not necessarily a wrong thing to do and in some cases not pipelining may be desirable, however for this sequence of examples we choose to pipeline.)
Note that we could also use a single FA and perform the addition serially. Because we are sharing the FA, the data processing rate is reduced by a factor of 8, i.e. fdata = 200 / 8 = 25MHz, or 25,000,000 adds/s.
Note that to extend to, for example, a 14 bit adder, just add 6 more stages for the parallel adder, or clock the serial adder for another 6 cycles; however the serial data rate then reduces to 200/14, approximately 14.3 million adds/s.
[Figure: bit serial 8 bit adder Σs, a single FA with a carry delay, fclk = 200MHz, fdata = 25MHz. Inputs 10010100 and 10100010 and output 01110110 are shown LSB first.]
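The parallel and bit serial adders can be compared with a small software model built from the same full adder. This sketch models function and data rate only, not pipelining or timing; the function names are hypothetical, and the operands are those of the worked example on the slide.

```python
def full_adder(a, b, cin):
    """One FA: returns (carry out, sum bit)."""
    return (a & b) | (a & cin) | (b & cin), a ^ b ^ cin


def parallel_add(x, y, bits=8):
    """Model of `bits` FAs chained by their carries; returns (sum, carry out)."""
    carry, result = 0, 0
    for i in range(bits):
        carry, s = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


def serial_add(x_lsb_first, y_lsb_first):
    """Model of one shared FA plus a carry delay: one bit per clock, LSB first."""
    carry, out = 0, []
    for a, b in zip(x_lsb_first, y_lsb_first):
        carry, s = full_adder(a, b, carry)
        out.append(s)
    return out


def data_rate_mhz(fclk_mhz, bits, serial):
    """A serial adder needs `bits` clocks per word; a pipelined parallel one needs 1."""
    return fclk_mhz / bits if serial else fclk_mhz


if __name__ == "__main__":
    print(format(parallel_add(0b00101001, 0b01000101)[0], "08b"))
    print(serial_add([1, 0, 0, 1, 0, 1, 0, 0], [1, 0, 1, 0, 0, 0, 1, 0]))
    print(data_rate_mhz(200, 8, serial=True), data_rate_mhz(200, 14, serial=True))
```

Both models give the same sum, 01101110 (the serial result appears LSB first as 0, 1, 1, 1, 0, 1, 1, 0), while the serial data rate falls to 25 MHz for 8 bits and to about 14.3 MHz for 14 bits.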
Slide 1.15
FPGA - 8 bit Multiplier
• With just a few additional logic gates, we can produce an 8 bit parallel multiplier:
[Figure: parallel multiplier Mp, an 8 x 8 array of FAx cells (a full adder for the multiply array plus additional logic such as a flip-flop, XOR gate...), fdata = 200MHz]
• Data in this circuit can also be pipelined and clocked in at 200MHz (although there would be a latency), i.e. 200,000,000 8 bit multiplies per second.

Worked example:
        11010110
      x 00101101
----------------
        11010110
       00000000
      11010110
     11010110
    00000000
   11010110
  00000000
 00000000
----------------
0010010110011110

Notes:
This “example” array is simply a “mapping” of a direct 8 bit multiplication whereby 8 partial products are created and added together. The cost of each “cell” is just a little more than the logic cost of a full adder (FA), which we might denote simply as FAx. (Regardless of how the multiplier is implemented there is a cost associated, and the more bits then the higher the cost, e.g. if done using memory then more memory is required for more bits.)
The array has many variants (for signed numbers, carry lookahead etc).
Alternatively we could reduce the hardware costs and use one parallel adder and feed back partial products to produce a serial multiplier. The logic in this circuit could still be clocked at 200 MHz, but one multiply would take 8 cycles and hence the data rate is only 200/8 = 25MHz.
[Figure: serial multiplier Ms, one row of eight FAx cells with feedback, fdata = 25MHz]
The concept of a constant area-speed product is evident from a comparison of the parallel and serial multipliers (and also the parallel and serial adder example above). Generally speaking, if we reduce the silicon area/resources required by a factor of N, then the computation time increases by N, as the resources must be shared by N different sub-computations in a time sequential manner.
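The partial-product view of the multiplier, and the area-speed trade between the parallel and serial versions, can be modelled in software. This is a sketch of the arithmetic only (unsigned operands, no pipelining), with hypothetical function names, using the operands from the worked example.

```python
def partial_products(x, y, bits=8):
    """One shifted copy of x for each multiplier bit of y, LSB first (0 where the bit is 0)."""
    return [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]


def array_multiply(x, y, bits=8):
    """Parallel array: all partial products exist at once and are summed."""
    return sum(partial_products(x, y, bits))


def serial_multiply(x, y, bits=8):
    """Serial version: one adder reused over `bits` cycles; returns (product, cycles)."""
    acc, cycles = 0, 0
    for i in range(bits):
        if (y >> i) & 1:
            acc += x << i  # feed the running sum back through the single adder
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    x, y = 0b11010110, 0b00101101
    print(format(array_multiply(x, y), "016b"))
    product, cycles = serial_multiply(x, y)
    print(product, cycles, 200 / cycles)  # product, clocks used, data rate in MHz
```

Both versions give 0010010110011110 (214 x 45 = 9630); the serial one uses 8 clocks per multiply, which is the factor of 8 between the 200 MHz and 25 MHz data rates.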
Slide 1.16
FPGA FIR Filter
• Using 7 parallel multipliers and 6 parallel adders we can produce a 7 tap parallel FIR digital filter (FIRp):
[Figure: FIRp, seven parallel multipliers Mp with weights w0 to w6 feeding six parallel adders Σp, fdata = 200MHz]
• Data in this FIR filter is pipelined and clocked at 200MHz, i.e. 200,000,000 samples per second.
Notes:
If we choose to use the slower serial multipliers and adders, then the cost of the hardware reduces to about 1/8, however the data rate is only 25MHz:
[Figure: FIRs, seven serial multipliers Ms and six serial adders Σs, fdata = 25MHz]
Or alternatively we could share a single parallel multiplier and a single parallel adder - approximately the same cost as the circuit above - but this time only have a data sampling rate of 200/8 = 25MHz.
[Figure: FIRs built from one shared parallel multiplier Mp and one parallel adder Σp, with the data multiplexed into the multiplier, fdata = 25MHz]
Sharing a single parallel multiplier is of course similar to the concept of a DSP processor!
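The filter structure with weights w0 to w6 can be sketched as a direct-form FIR in software. In the parallel hardware all seven multiplies happen in the same clock; the inner loop below instead visits the taps one after another, which is closer to the shared-multiplier variant. The function name and the weight values are hypothetical and chosen only for illustration.

```python
def fir_filter(samples, weights):
    """Direct-form FIR: y[n] = sum over k of weights[k] * x[n - k], zero initial state."""
    delay_line = [0] * len(weights)
    out = []
    for x in samples:
        delay_line = [x] + delay_line[:-1]  # shift the new sample in
        acc = 0
        for w, d in zip(weights, delay_line):
            acc += w * d                    # one multiply-accumulate per tap
        out.append(acc)
    return out


if __name__ == "__main__":
    weights = [1, 2, 3, 4, 3, 2, 1]   # hypothetical w0..w6
    impulse = [1, 0, 0, 0, 0, 0, 0]
    print(fir_filter(impulse, weights))
    print(fir_filter([1, 1, 1, 1, 1, 1, 1, 1], weights))
```

An impulse at the input reproduces the weights at the output, and a constant input settles to the sum of the weights (16 here) once the delay line is full.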
Slide 1.17
FIR Filter Banks
• For a particular digital communications application we require 4 channels each of 25 MHz bandwidth:
[Figure: magnitude against frequency showing the channels, each 25MHz wide]
• We can set up 4 parallel FIR filters with each one running at a 200MHz sampling rate.
[Figure: four FIRp filters in parallel, fdata = 200MHz]
• So the total computation rate is 4 x 7 x 200M = 5.6 billion MAC/sec!
MAC - Multiply/accumulate operation
Notes:
5.6 billion MACs/sec is a lot of processing! With current FPGA technology this type of design is absolutely possible. In fact we can easily go an order of magnitude higher with high specification FPGAs.
Typically a state of the art DSP processor could implement around 500 million multiply-adds per second. Therefore around 12 are required to sustain this rate! Of course the DSP processor has other capabilities and flexibilities that the FPGA does not have, however for this specific requirement a DSP processor is a poor solution compared to the FPGA solution.
Once again, if our requirement was different and only a data sampling rate of 25MHz was required then we could design using serial FIR filters with a total of 1/8 of the hardware cost (remember the individual logic elements are still clocked at 200MHz):
[Figure: four FIRs filters in parallel, fdata = 25MHz]
Or we could share one fully parallel FIR filter and multiplex the four channels:
[Figure: one FIRp shared between the four channels, fdata = 25MHz]
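The arithmetic behind these figures can be reproduced directly. The sketch below uses the numbers quoted on the slide (4 channels, 7 MACs per output sample, 200 MHz sampling, roughly 500 million MAC/s per DSP processor); the function names are hypothetical.

```python
import math


def total_mac_rate(channels: int, macs_per_sample: int, sample_rate_hz: int) -> int:
    """Multiply-accumulates per second for a bank of identical FIR filters."""
    return channels * macs_per_sample * sample_rate_hz


def processors_needed(required_mac_rate: int, mac_rate_per_processor: int) -> int:
    """Whole processors needed to sustain the required rate (rounded up)."""
    return math.ceil(required_mac_rate / mac_rate_per_processor)


if __name__ == "__main__":
    rate = total_mac_rate(channels=4, macs_per_sample=7, sample_rate_hz=200_000_000)
    print(rate)                                    # MAC/s for the filter bank
    print(processors_needed(rate, 500_000_000))    # DSP processors at 500 M MAC/s each
```

This gives 5,600,000,000 MAC/s, and 5.6 billion / 500 million = 11.2, which rounds up to the 12 processors quoted in the notes.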