High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations
J. Greg Nash
www.centar.net
ICNC 2014
Outline
• Motivation for new FFT designs in wireless applications?
• Review of FFT architectures
• New systolic FFT architecture
• Circuit FPGA performance comparisons
– LTE SC-FDMA
– Fixed-size power-of-two transforms
– Variable transforms (LTE, WiMAX)
• Conclusions
Future Drivers for Wireless FFT Design
• Algorithmic (OFDM)
– Large transform sizes (LTE: 2048 points; DVB: 32K points)
– Run-time scalable OFDMA (LTE: 128 to 2048 points)
– Non-power-of-two transform sizes (LTE SC-FDMA: 35 sizes, 12 to 1296 points)
– High performance (LTE Advanced)
• BW = 100 MHz with 8 MIMO streams (<1.0sec for 2K FFT)
• Critical system requirements
– Power
– Cost
FFT Architecture Review (1): Pipelined
[Figure: Signal Flow Graph (8-point DFT), with $W = e^{-2\pi I/N}$, collapsed onto pipelined hardware blocks (Block Diagram)]
Features
• Fast
• Hardware Intensive
• Non-programmable
FFT Architecture Review (2): Memory Based
Traditional
Features
• Programmable
• Compact
• Typically slow
Proposed Systolic Array
Features
• Programmable
• Faster than pipelined FFT
• Scalable
• Higher SQNR
Matrix Form DFT (16-Point DFT)
$$
\begin{bmatrix} z_1 \\ z_2 \\ z_3 \\ z_4 \\ \vdots \\ z_{16} \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & \cdots & 1 \\
1 & W & W^{2} & W^{3} & \cdots & W^{15} \\
1 & W^{2} & W^{4} & W^{6} & \cdots & W^{14} \\
1 & W^{3} & W^{6} & W^{9} & \cdots & W^{13} \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\
1 & W^{15} & W^{14} & W^{13} & \cdots & W
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ \vdots \\ x_{16} \end{bmatrix}
$$
Z = C X, where the entry of C in row j, column k (j, k = 1, ..., 16) is $W^{(j-1)(k-1) \bmod 16}$
$W = e^{-2\pi I/N}$ (N=16)
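The 16-point matrix form can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the variable names are illustrative and not taken from the slides. It builds C from powers of W with exponents reduced mod 16 and compares Z = C X against a library FFT.

```python
import numpy as np

N = 16
W = np.exp(-2j * np.pi / N)                 # W = e^(-2*pi*I/N)

# Entry (j, k) of C is W^(j*k mod N), using 0-based row/column indices.
j, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
C = W ** ((j * k) % N)

rng = np.random.default_rng(0)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
z = C @ x                                   # Z = C X

# The direct matrix product agrees with NumPy's FFT.
assert np.allclose(z, np.fft.fft(x))
```

The direct product costs N² complex multiplies, which is the baseline that the factorized forms on the following slides reduce.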
Inputs X and Outputs Z in Bit-reversed Form (N=16)
$$
C_b =
\begin{bmatrix}
d_1 R_1 & d_2 R_1 & d_3 R_1 & d_4 R_1 \\
d_1 R_2 & W d_2 R_2 & W^{2} d_3 R_2 & W^{3} d_4 R_2 \\
d_1 R_3 & W^{2} d_2 R_3 & W^{4} d_3 R_3 & W^{6} d_4 R_3 \\
d_1 R_4 & W^{3} d_2 R_4 & W^{6} d_3 R_4 & W^{9} d_4 R_4
\end{bmatrix}
$$
where each $R_k$ denotes a 4×4 matrix with four identical rows: (1, 1, 1, 1) for $R_1$, (1, -I, -1, I) for $R_2$, (1, -1, 1, -1) for $R_3$, (1, I, -1, -I) for $R_4$, and
$$
d_1 = \mathrm{diag}(1, 1, 1, 1);\quad
d_2 = \mathrm{diag}(1, -I, -1, I);\quad
d_3 = \mathrm{diag}(1, -1, 1, -1);\quad
d_4 = \mathrm{diag}(1, I, -1, -I)
$$
$$
X_b = \begin{bmatrix} x_1 & x_2 & x_3 & x_4 \\ x_5 & x_6 & x_7 & x_8 \\ x_9 & x_{10} & x_{11} & x_{12} \\ x_{13} & x_{14} & x_{15} & x_{16} \end{bmatrix},\quad
W_M = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & W & W^{2} & W^{3} \\ 1 & W^{2} & W^{4} & W^{6} \\ 1 & W^{3} & W^{6} & W^{9} \end{bmatrix},\quad
Z_b = \begin{bmatrix} z_1 & z_2 & z_3 & z_4 \\ z_5 & z_6 & z_7 & z_8 \\ z_9 & z_{10} & z_{11} & z_{12} \\ z_{13} & z_{14} & z_{15} & z_{16} \end{bmatrix}
$$
Z = CX becomes
$$
Y = W_M \bullet \left( \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -I & -1 & I \\ 1 & -1 & 1 & -1 \\ 1 & I & -1 & -I \end{bmatrix} X_b \right),
\qquad
Z_b = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -I & -1 & I \\ 1 & -1 & 1 & -1 \\ 1 & I & -1 & -I \end{bmatrix} Y^t
$$
"$\bullet$" = element by element multiply
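The N=16 factorization can be checked numerically. The following is a minimal sketch, assuming NumPy; the names `CB`, `WM`, `Xb`, `Zb` and `Cb` are illustrative, and the diagonal matrices d1..d4 are taken to be the columns of the 4-point DFT matrix placed on a diagonal. It computes Y = W_M • (C_B X_b) and Z_b = C_B Yᵗ, then confirms that the 16×16 block matrix maps the reordered inputs to the reordered outputs.

```python
import numpy as np

N, b = 16, 4
W = np.exp(-2j * np.pi / N)

# 4-point DFT matrix; its entries need only sign changes and real/imag swaps.
CB = np.array([[1,   1,  1,   1],
               [1, -1j, -1,  1j],
               [1,  -1,  1,  -1],
               [1,  1j, -1, -1j]])
idx = np.arange(b)
WM = W ** np.outer(idx, idx)          # twiddle matrix with entries 1, W, ..., W^9

rng = np.random.default_rng(1)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
Xb = x.reshape(b, b)                  # rows hold x1..x4, x5..x8, ...

Y = WM * (CB @ Xb)                    # "*" is the element-by-element multiply
Zb = CB @ Y.T
z = Zb.reshape(N)                     # rows hold z1..z4, z5..z8, ...
assert np.allclose(z, np.fft.fft(x))

# 16x16 block form: block (r, c) is W^(r*c) * d_c * (four copies of row r of CB).
Cb = np.block([[WM[r, c] * np.diag(CB[:, c]) @ np.tile(CB[r], (b, 1))
                for c in range(b)] for r in range(b)])
# Inputs and outputs in digit-reversed order correspond to transposed 4x4 layouts.
assert np.allclose(Cb @ Xb.T.reshape(N), Zb.T.reshape(N))
```

In this form the only general complex multiplications are the 16 entries of the element-by-element product with W_M, several of which are multiplications by 1.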
New FFT Matrix Form
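$$
Y = W_M \bullet C_{M1} X_b, \qquad Z_b = C_{M2} Y^t
$$
"$\bullet$" = element by element multiply
where
$$
C_{M1} = [\, C_B^t \;|\; C_B^t \;|\; \dots \,]^t, \qquad
C_{M2} = [\, C_B \;|\; C_B \;|\; \dots \,]
$$
$$
C_B = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & -I & -1 & I \\ 1 & -1 & 1 & -1 \\ 1 & I & -1 & -I \end{bmatrix}
\quad \text{(for } b=4\text{)}
$$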
“Base-b” FFT Architecture
Base-b DFT equations:
$$
Y = W_M \bullet C_{M1} X_b, \qquad Z_b = C_{M2} Y^t
$$
Base-4 DFT architecture:
[Figure: Virtual and Physical array views]
Processing flow for DFT of length N = Nr Nc
1. Nc column DFTs (Xci) of length Nr
2. Nr row DFTs (Xri) of length Nc
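The two-step flow can be sketched in software as follows. This is a minimal illustration assuming NumPy, a row-major Nr × Nc data layout, and the standard twiddle-factor multiplication between the two steps; the twiddle step is not listed on the slide, and the function name `rowcol_dft` is hypothetical.

```python
import numpy as np

def rowcol_dft(x, Nr, Nc):
    """DFT of length N = Nr*Nc via column DFTs, twiddles, then row DFTs."""
    N = Nr * Nc
    X = np.asarray(x, dtype=complex).reshape(Nr, Nc)   # X[n1, n2] = x[Nc*n1 + n2]
    A = np.fft.fft(X, axis=0)                          # 1. Nc column DFTs of length Nr
    k1 = np.arange(Nr)[:, None]
    n2 = np.arange(Nc)[None, :]
    A = A * np.exp(-2j * np.pi * k1 * n2 / N)          # twiddles W^(k1*n2), assumed step
    B = np.fft.fft(A, axis=1)                          # 2. Nr row DFTs of length Nc
    return B.T.reshape(N)                              # z[k1 + Nr*k2] = B[k1, k2]

rng = np.random.default_rng(3)
x = rng.standard_normal(256) + 1j * rng.standard_normal(256)
z = rowcol_dft(x, 16, 16)                              # 256-point case, Nr = Nc = 16
assert np.allclose(z, np.fft.fft(x))
```

The library FFT calls stand in for the column and row transforms; in the architecture described here those would be computed by the base-b array.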
Base-4 Array Architecture
256 Point FFT (Nr = Nc = 16)
1024 Point FFT (Nr = Nc = 32)
Array Processing Elements
Interconnection Delays
65nm Technology: 256pt FFT
[Figure: Critical Path. Altera Pipelined FFT: Fmax = 351 MHz; Systolic: Fmax = 537 MHz]
LTE Uplink: Single Carrier FDMA
• DFT spreading of data symbols in frequency domain
– Reduces PAPR in uplink
– Less dependence on frequency offset
• 35 DFT sizes N (12-points to 1296-points)
• Run-time choice of DFT size
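The count of 35 sizes can be reproduced with a short enumeration. This sketch assumes the usual LTE uplink rule that the DFT size is 12 subcarriers times a resource-block count of the form 2^a · 3^b · 5^c; that rule comes from the LTE specification rather than from these slides. The function name and the upper limit of 108 resource blocks are illustrative choices that match the 12 to 1296 range quoted above.

```python
def lte_dft_sizes(max_rb=108, sc_per_rb=12):
    """List DFT sizes sc_per_rb * n_rb where n_rb has only the prime factors 2, 3, 5."""
    sizes = []
    for n_rb in range(1, max_rb + 1):
        m = n_rb
        for p in (2, 3, 5):
            while m % p == 0:
                m //= p
        if m == 1:                      # nothing left after removing 2s, 3s, 5s
            sizes.append(sc_per_rb * n_rb)
    return sizes

sizes = lte_dft_sizes()
print(len(sizes), sizes[0], sizes[-1])  # prints: 35 12 1296
```

Every size is a product of small factors (2, 3, 4, 5, 6, ...), which is what makes a run-time programmable mixed-radix design applicable.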
LTE Systolic DFT
• Array size uses base-b = 6
• Example →
– N = 520-points (
– Use subset of physical array for P,Q≠6
[Figure labels: 36-pt DFTs, 15-pt DFTs]
Programmability
• Parameter List (Matlab):
– Matrix factorization parameters (ax,by,cz,…)
– Addresses for coefficients
240 points
LTE DFT: FPGA Cycle Counts
Design         Average Latency Time   Average Throughput Rate   Resource Block Computation Time
Altera         1.39                   0.47                      2.01
Xilinx         0.86                   0.65                      1.50
Systolic FFT   1.00                   1.00                      1.00
LTE DFT: FPGA Circuit Usage Comparisons
Design     FPGA          LUT    ALM/LE   Fmax (MHz)
Systolic   Stratix III   3582   2733     394
Xilinx     Virtex-5      4707   3864     276
Altera     Stratix III   2600   n.a.     260
Chen       Virtex-5      7791   n.a.     123
(65nm Technology)
LTE Systolic DFT: Performance Comparisons
Design         Average LTE Resource Block Compute Time
Systolic FFT   1.0
Xilinx         2.1
Altera         3.0
Fixed Size FFT: Power-of-two
• Streaming (continuous data in/out)
• Array size uses base-b = 4
• Altera Stratix III FPGAs (65nm technology)
                       Altera    Systolic FFT   Altera    Systolic FFT
                       20-bits   16-bits        20-bits   16-bits
Transform Size         256       256            1024      1024
ALMs                   4261      3982           4394      4331
Memory Bits (K)        49        40.6           195       145
Multipliers (18-bit)   24        33             24        33
SQNR                   76.6      86.7           81.3      82.8
Sample Rate (MHz)      387       566            382       533
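The slides do not say how SQNR was measured. As a rough illustration only, the sketch below assumes the common definition (reference signal power over error power, in dB) and models nothing but rounding of the FFT output to a fixed-point grid; the helper names `sqnr_db` and `quantize` are hypothetical, and the numbers it prints are not comparable to the table.

```python
import numpy as np

def sqnr_db(reference, measured):
    """Signal-to-quantization-noise ratio in dB: reference power over error power."""
    reference = np.asarray(reference)
    err = np.asarray(measured) - reference
    return 10.0 * np.log10(np.sum(np.abs(reference) ** 2) / np.sum(np.abs(err) ** 2))

def quantize(v, bits):
    """Round real and imaginary parts to a signed fixed-point grid with step 2^-(bits-1)."""
    scale = 2.0 ** (bits - 1)
    return (np.round(v.real * scale) + 1j * np.round(v.imag * scale)) / scale

rng = np.random.default_rng(2)
x = 0.5 * (rng.uniform(-1, 1, 256) + 1j * rng.uniform(-1, 1, 256))
ref = np.fft.fft(x) / 256                   # scaled so outputs stay inside [-1, 1)
for bits in (16, 20):
    print(bits, round(sqnr_db(ref, quantize(ref, bits)), 1))
```

A hardware measurement would also include rounding inside the datapath, so a bit-accurate model of the design would be needed to reproduce the tabulated figures.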
Variable Size FFT: Power-of-two
• Transform sizes: 128/256/512/1024/2048-points
• Streaming (continuous data in/out)
• Run-time transform size
• Array size uses base-b = 4
• Altera Stratix III FPGAs (65nm technology)
                        Systolic FFT              Altera
                        16-bits in/16-bits out    16-bits in/30-bits out
Architecture            Systolic                  Single Delay Feedback
ALMs                    4522                      3826
RAM Memory (K)          290                       208
Multipliers (18-bits)   33                        36
Fmax (MHz)              510                       315
Conclusion: Better FFTs are Possible
• Improved performance
– Algorithmic reduction in computation cycles
– Localized interconnects for high clock speeds (>500 MHz for 65nm FPGA technologies)
• Reduced usage of FPGA logic cells
• Programmability
• Throughput scalability due to the use of systolic algorithms
• Higher dynamic range (smaller word lengths needed)