
A New Class of High Performance FFTs

Dr. J. Greg Nash, Centar (www.centar.net)

High Performance Embedded Computing (HPEC) Workshop

19-21 September 2006

New Base-4 DFT Matrix Equation

•Traditional DFT matrix form: $Z = C X$

•New matrix form for the DFT†:

$Y = W_M \bullet (C_{M1} X)$

$Z = C_{M2} Y^t$

where "$\bullet$" = element-by-element multiply

•$C_{M1}$ and $C_{M2}$ contain only elements from the set $\{1, -1, -i, i\}$

– $C_{M1} X$ and $C_{M2} Y^t$ involve only complex additions/subtractions

•Twiddle factor matrix $W_M$ is of size N/4 x N/4 rather than the N x N of $C$

– 16x fewer multiplies than the traditional DFT equation ($Z = CX$)

†J. G. Nash, “Computationally efficient systolic architecture for computing the discrete Fourier transform,” IEEE Transactions on Signal Processing, Volume 53, Issue 12, Dec. 2005, pp. 4640 – 4651.
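The matrix form above can be checked numerically. The sketch below is one way to build matrices with the shapes the slide describes; the input ordering X[a][b] = x[(N/4)a + b], the output ordering Z[k][j] = z[j + (N/4)k], and the restriction to N divisible by 16 are assumptions chosen so that the entries of CM1 and CM2 are powers of -i. The function names are hypothetical and the slides do not confirm that these are the definitions used in the cited paper.

```python
import cmath

def dft_direct(x):
    """Reference DFT, Z = C X, with C[k][n] = exp(-2*pi*i*n*k/N)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

# Powers of -i: multiplying by these needs only swaps and sign changes.
POW_NEG_I = [1, -1j, -1, 1j]

def dft_base4(x):
    """Sketch of Y = W_M . (C_M1 X), Z = C_M2 Y^t under assumed index maps."""
    n = len(x)
    if n % 16 != 0:
        raise ValueError("this sketch assumes N is a multiple of 16")
    m = n // 4
    # X is 4 x N/4 (assumed ordering: X[a][b] = x[m*a + b]).
    X = [[x[m * a + b] for b in range(m)] for a in range(4)]
    # C_M1[j][a] = (-i)^(j*a) is N/4 x 4; W_M[j][b] = exp(-2*pi*i*j*b/N).
    Y = [[cmath.exp(-2j * cmath.pi * j * b / n)
          * sum(POW_NEG_I[(j * a) % 4] * X[a][b] for a in range(4))
          for b in range(m)] for j in range(m)]
    # C_M2[k][b] = (-i)^(k*b) is 4 x N/4; Z[k][j] holds output z[j + m*k].
    Z = [[sum(POW_NEG_I[(k * b) % 4] * Y[j][b] for b in range(m))
          for j in range(m)] for k in range(4)]
    return [Z[k][j] for k in range(4) for j in range(m)]

if __name__ == "__main__":
    x = [complex(i % 7, (3 * i) % 5) for i in range(48)]
    err = max(abs(p - q) for p, q in zip(dft_base4(x), dft_direct(x)))
    print("max error:", err)
```

In this sketch the only general complex multiplies are the (N/4)^2 twiddle products by W_M, against N^2 for the direct form, which matches the 16x reduction stated on the slide.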

Find Systolic Architecture Using SPADE†

Mathematical Algorithm

Automatic Search for Space-Time Transformations, T

Input Code

Simulator, Graphical Outputs

for j to N/4 do
  for k to N/4 do
    Y[j,k] := WM[j,k]*add(CM1[j,i]*X[i,k], i=1..4);
  od;
  for k to 4 do
    Z[k,j] := add(CM2[k,i]*Y[j,i], i=1..N/4);
  od
od;

$Y = W_M \bullet (C_{M1} X)$

$Z = C_{M2} Y^t$

†Symbolic Parallel Algorithm Development Environment

– 2-D mesh array

– Fine-grained PEs (registers, adder, mux)

– Linear arrays of multipliers, memory

FPGA Architectural Constraints

Objective Functions

Space-time mapping: $(\mathrm{time}, x, y)^t = T\,(i, j, k)^t$, applied to the variables X, Y, Z, CM1, CM2, WM
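To make the idea of a space-time transformation concrete, the sketch below maps each loop index point (i, j, k) to a (time, x, y) triple with an integer matrix T and checks that no two index points land on the same PE at the same time step. The matrices, the bounds, and the function names are hypothetical illustrations; the slide does not show the transformations SPADE actually finds or how it searches for them.

```python
from itertools import product

def apply_T(T, point):
    """Map an index point (i, j, k) to (time, x, y) using a 3x3 integer matrix."""
    return tuple(sum(T[r][c] * point[c] for c in range(3)) for r in range(3))

def is_conflict_free(T, bounds):
    """True if every index point in the box gets a distinct (time, x, y)."""
    seen = set()
    for p in product(*(range(b) for b in bounds)):
        st = apply_T(T, p)
        if st in seen:
            return False  # two computations scheduled on one PE at one time
        seen.add(st)
    return True

# Hypothetical candidate: time = i + j + k, x = j, y = k.
T_CANDIDATE = [[1, 1, 1],
               [0, 1, 0],
               [0, 0, 1]]

# Hypothetical rejected candidate: the y row repeats the x row, so T is singular.
T_REJECTED = [[1, 1, 1],
              [0, 1, 0],
              [0, 1, 0]]

if __name__ == "__main__":
    print(is_conflict_free(T_CANDIDATE, (4, 4, 4)))
    print(is_conflict_free(T_REJECTED, (4, 4, 4)))
```

A search tool would additionally score each valid T against objective functions such as latency or array area; that part is left out here.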

Functional Operation

• Processing flow for DFT of length N = N1 * N2
– Stage 1: N2 column DFTs (Xci) of length N1
– Stage 2: Twiddle multiplication
– Stage 3: N1 row DFTs (Xri) of length N2

• Systolic adder arrays for matrix multiplication
– N1/4 x 4 array for column multiplies $C_{M1} X_{ci}$ and $C_{M2} Y^t_{ci}$
– N2/4 x 4 array for row multiplies $C_{M1} X_{ri}$ and $C_{M2} Y^t_{ri}$

• N2/4 x 4 array is implemented virtually on one row of the N1/4 x 4 array

• Uses systolic 1-D array matrix multiplication
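The three-stage processing flow can be sketched in software as a standard row/column decomposition. The index maps used here (input x[N2*n1 + n2] stored at row n1, column n2; output z[k1 + N1*k2]) are assumptions, since the slide does not state the data ordering, and the plain O(N^2) `dft` helper stands in for the systolic adder arrays.

```python
import cmath

def dft(x):
    """Direct DFT used for each short column/row transform and as reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def dft_three_stage(x, n1, n2):
    """DFT of length N = N1 * N2 via column DFTs, twiddles, then row DFTs."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # Arrange the input as an N1 x N2 matrix (assumed ordering).
    a = [[x[n2 * r + c] for c in range(n2)] for r in range(n1)]
    # Stage 1: N2 column DFTs of length N1; cols[c][k1].
    cols = [dft([a[r][c] for r in range(n1)]) for c in range(n2)]
    # Stage 2: twiddle multiplication by exp(-2*pi*i*c*k1/N).
    b = [[cols[c][k1] * cmath.exp(-2j * cmath.pi * c * k1 / n)
          for c in range(n2)] for k1 in range(n1)]
    # Stage 3: N1 row DFTs of length N2; rows[k1][k2].
    rows = [dft(b[k1]) for k1 in range(n1)]
    z = [0j] * n
    for k1 in range(n1):
        for k2 in range(n2):
            z[k1 + n1 * k2] = rows[k1][k2]
    return z

if __name__ == "__main__":
    x = [complex(i, (-i) % 3) for i in range(32)]
    err = max(abs(p - q) for p, q in zip(dft_three_stage(x, 4, 8), dft(x)))
    print("max error:", err)
```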

FFT Systolic Architecture

• Simple PEs, locally connected
– Higher clock speeds
– Easier design/test/maintainability
– Lower power
– Efficient use of FPGA fabric
– Simple control

• Small memory blocks (one per PE)
– Faster read/write times
– Lower power

• Linear structure (scales in N/S direction)
– Matches fabric of FPGA linear distributed embedded elements (e.g., memory and multipliers)

Example Architecture for N = 1024 (N1 = N2 = 32)

Figure labels: Input Data (X), CM2, Data flow bus

PE 1: 2 registers, 1 adder, small memory

PE 2: multiplier, shifter, coefficient memory

PE 3: 2 registers, 1 adder, small memory

Enhanced Functionality

• Transform size N not restricted to powers of two
– N = 256n (n = 1, 2, 3, ...)
– More reachable points
– Uniform distribution of points

• Circuit is scalable
– Any DFT size can be computed on the same hardware with sufficient memory
– Larger FFT circuits constructed by replication of identical 4x4 PE array processing blocks

• Low computational latency
– Pipeline depth is small compared with traditional pipelined FFTs

• 1-D and 2-D transforms possible on the same circuit

Block Floating Point/Floating Point Operation

• Multiple “regions”, each with its own block floating point and floating point circuitry (32 regions in a 1024-point FFT)
– Column DFTs use block floating point and row DFTs use floating point
– Higher dynamic range and higher signal-to-noise ratio

• Number of regions increases with transform size

• Supports streaming FFTs

• Comparison of “single tone”, random-frequency and random-phase data sets (DR = dynamic range, “noise” = roundoff noise):

FFT circuit           S/N (dB)                       DR (dB)
Name    Size  Bits    Mean  Max   Min   Std dev      Mean  Max    Min   Std dev
Centar  256   16      89.0  94.6  83.5  3.06         96.3  99.3   90.3  2.85
Altera  256   16      77.2  87.3  72.1  3.17         84.9  87.3   78.3  1.96
Altera  256   20      86.9  92.2  82.4  2.42         98.2  101    92.8  1.66
Centar  1024  16      86.9  91.0  84.4  1.55         96.2  102.3  93.0  2.20
Altera  1024  16      71.5  84.3  67.8  3.23         84.4  87.3   80.3  0.893
Altera  1024  20      84.4  90.6  80.0  2.52         99.5  104    93.2  1.94
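For readers unfamiliar with block floating point, the sketch below shows the general idea: a block of values shares one exponent, and each value keeps only a fixed-width integer mantissa. This is a generic illustration, not the circuit's scheme; the 16-bit mantissa width echoes the word size in the table, and the function names and rounding choice are assumptions.

```python
import math

def bfp_quantize(block, mantissa_bits=16):
    """Quantize real values to signed integer mantissas sharing one exponent."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return [0] * len(block), 0
    limit = 2 ** (mantissa_bits - 1) - 1  # largest signed mantissa value
    # Smallest exponent for which peak / 2**exp still fits within the limit.
    exp = math.ceil(math.log2(peak / limit))
    scale = 2.0 ** exp
    # Round to nearest and clamp to guard against floating-point edge cases.
    mant = [max(-limit, min(limit, int(round(v / scale)))) for v in block]
    return mant, exp

def bfp_dequantize(mant, exp):
    """Recover approximate real values from mantissas and the shared exponent."""
    return [m * 2.0 ** exp for m in mant]

if __name__ == "__main__":
    block = [0.5, -0.25, 1000.0, 3.0]
    mant, exp = bfp_quantize(block)
    print(mant, exp, bfp_dequantize(mant, exp))
```

Because the exponent is set by the largest value in the block, small values in the same block lose relative precision, which is one reason a design might split a transform into several regions with separate exponents.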

Performance Comparison: 256-point DFT

• Altera block floating point circuit

• “Streaming” (continuous data in and out)

• Comparable dynamic range and signal to (roundoff) noise ratio

• Both circuits mapped to Altera Stratix II EP2S15F484C3 FPGA

• Altera circuit from Megacore FFT v2.2.0

• Results from timing analysis (Altera Quartus 5.1 software)

Category                  Altera (20-bit)   Base-4 (16-bit)
Throughput (cycles/DFT)   256               240
Clock speed (MHz)         302               363
Throughput (µs/DFT)       0.85              0.67
Dynamic Range (dB)        98.2              96.3
Signal/Noise (dB)         86.9              89.0
Total ALUTs               7555              7790
ALMs                      4192              4256
Memory Bits               48640             78708
18-bit multipliers        12                16
Dynamic Power (mW)        1748              2232

Preliminary Figure of Merit

• Altera block floating point circuits

• “Streaming” (continuous data in and out)

• Comparable dynamic range and signal to noise ratio

• Circuits mapped to Altera Stratix II FPGAs

• Altera circuit from Megacore FFT v2.2.0

FOM = Area (ALMs) x Throughput (Cycles/DFT) / Clock (MHz)

*Estimate (no timing analysis or layout)

         Altera                          Centar (Base-4)                 FOM
Points   Throughput     Clock   ALMs     Throughput     Clock   ALMs     Altera    Base-4
         (cycles/DFT)   (MHz)            (cycles/DFT)   (MHz)
256      256            302     4192     232            363     4256     3553      2720
512      512            294     4192     296            347     8441     7300      7200
1024     1024           297     5096     680            347     8441     17570     16541
2048     2048           294     7994     936            350     16500    55686     *44000
4096     4096           294     8566     2344           350     16500    119341    *110000
8192     8192           270     8682     3368           330     33000    263418    *329000
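The figure of merit is simple arithmetic on the table columns, so it can be reproduced directly. The sketch below recomputes the FOM and the time per DFT for the two 256-point rows; the function names are illustrative and lower FOM is better under this definition.

```python
def figure_of_merit(alms, cycles_per_dft, clock_mhz):
    """FOM = Area (ALMs) x Throughput (cycles/DFT) / Clock (MHz)."""
    return alms * cycles_per_dft / clock_mhz

def time_per_dft_us(cycles_per_dft, clock_mhz):
    """Cycles divided by MHz gives microseconds per transform."""
    return cycles_per_dft / clock_mhz

if __name__ == "__main__":
    # 256-point rows of the table: (ALMs, cycles/DFT, clock in MHz).
    altera = (4192, 256, 302)
    base4 = (4256, 232, 363)
    print("Altera FOM:", round(figure_of_merit(*altera)))
    print("Base-4 FOM:", round(figure_of_merit(*base4)))
    print("Altera us/DFT:", round(time_per_dft_us(256, 302), 2))
```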

Performance Comparison: 256-point DFT

Category                  Altera   Base-4
Throughput (cycles/DFT)   256      240
Clock speed (MHz)         302      363
Throughput (µs/DFT)       0.85     0.67
Signal/Noise (dB)         86.9     89.0
Total ALUTs               7555     7790
18-bit multipliers        12       16

Figure labels: Input Data (X), coefficient, Adder Array 1, Multipliers, Adder Array 2

Comparative Features

• Transform size N not restricted to powers of two

• Circuit is scalable

• Uses block floating point and floating point

• Higher throughput

• Low computational latency

• Based on small, simple PE (adder), locally connected

• 1-D or 2-D transforms
