Efficient FFTs On VIRAM
Randi Thomas and Katherine Yelick
Computer Science Division, University of California, Berkeley
IRAM Winter 2000 Retreat
{randit, yelick}@cs.berkeley.edu
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
What is the FFT?
The Fast Fourier Transform converts a time-domain function into a frequency spectrum.
Why Study The FFT?
• 1D Fast Fourier Transforms (FFTs) are:
– Critical for many signal processing problems
– Used widely for filtering in Multimedia Applications
» Image Processing
» Speech Recognition
» Audio & video
» Graphics
– Important in many Scientific Applications
– The building block for 2D/3D FFTs
All of these are VIRAM target applications!
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
VIRAM Implementation Assumptions
• System on a chip:
– Scalar processor: 200 MHz "vanilla" MIPS core
– Embedded DRAM: 32 MB, 16 banks, no subbanks
– Memory crossbar: 25.6 GB/s
– Vector processor: 200 MHz
– I/O: 4 × 100 MB/sec
VIRAM Implementation Assumptions
• Vector processor has four 64-bit pipelines = lanes
– Each lane has:
» 2 integer functional units
» 1 floating point functional unit
– All functional units have a 1-cycle multiply-add operation
– Each lane can be subdivided into:
» two 32-bit virtual lanes
» four 16-bit virtual lanes
[Diagram: four 64-bit lanes (LANE 1-4), subdivided into eight 32-bit virtual lanes (VL 1-8) or sixteen 16-bit virtual lanes (VL 1-16)]
Peak Performance
• Peak performance of this VIRAM implementation:

                                32-bit Single Precision   32-bit Integer   16-bit Integer
  Ops/cycle (no multiply-adds)            8                     16               32
  Ops/cycle (all multiply-adds)          16                     32               64
  Peak (no multiply-adds)            1.6 GFLOP/s            3.2 GOP/s        6.4 GOP/s
  Peak (all multiply-adds)           3.2 GFLOP/s            6.4 GOP/s       12.8 GOP/s

• Implemented:
– A 32 bit floating point version (8 lanes, 8 FUs)
– A 16 bit fixed point version (16 lanes, 32 FUs)
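Each peak number follows directly from the 200 MHz clock and the table's operations per cycle; for example, for 32-bit single precision with all multiply-adds:

$16\ \text{ops/cycle} \times 200\ \text{MHz} = 3.2\ \text{GFLOP/s}$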
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
Computing the DFT (Discrete FT)
• Given the N-element vector x, its 1D DFT is another N-element vector y, given by the formula:

  $y_j = \sum_{k=0}^{N-1} \omega_N^{jk} \, x_k, \qquad j = 0, 1, \ldots, (N-1)$

– where $\omega_N^{jk} = e^{2\pi i jk / N}$ is the jk-th root of unity
– N is referred to as the number of points
• The FFT (Fast FT)
– Uses algebraic identities to compute the DFT in O(N log N) steps
– The computation is organized into log2 N stages
» for the radix-2 FFT
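As a concrete reading of the formula, here is a minimal scalar sketch of the DFT in C (my own illustration, not the VIRAM code; it uses double-precision complex for clarity, while the implementations in these slides use 32-bit floating point and 16-bit fixed point):

```c
#include <complex.h>
#include <math.h>

/* Direct O(N^2) evaluation of y_j = sum_k omega_N^{jk} x_k.
 * The FFT computes the same y in O(N log N). */
void dft(int n, const double complex *x, double complex *y)
{
    for (int j = 0; j < n; j++) {
        y[j] = 0;
        for (int k = 0; k < n; k++) {
            /* omega_N^{jk} = e^{2*pi*i*j*k/N} */
            double complex w = cexp(2.0 * M_PI * I * ((double)j * k / n));
            y[j] += w * x[k];
        }
    }
}
```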
Computing A Complex FFT
• Basic computation for a radix-2 FFT (sketched in C below):

  $X_0' = X_0 + \omega \, X_{N/2}$
  $X_{N/2}' = X_0 - \omega \, X_{N/2}$

– The $X_i$ are the data points
– $\omega$ is a "root of unity"
• The basic computation on VIRAM for floating point data points:
– 2 multiply-adds + 2 multiplies + 4 adds = 8 operations
• 2 GFLOP/s is the VIRAM peak performance for this mix of instructions
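A scalar sketch of that butterfly in C (illustrative only; on VIRAM each line below is a vector operation applied across a whole butterfly group):

```c
#include <complex.h>

/* One radix-2 butterfly: t = w * x_hi is a complex multiply
 * (2 multiply-adds + 2 multiplies on VIRAM), and the +/- costs
 * 4 more adds, giving the 8-instruction mix counted above. */
static inline void butterfly(float complex *x_lo, float complex *x_hi,
                             float complex w)
{
    float complex t = w * (*x_hi);
    *x_hi = *x_lo - t;
    *x_lo = *x_lo + t;
}
```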
Vector Terminology
• The Maximum Vector Length (MVL):
– The maximum # of elements 1 vector register can hold
– Set automatically by the architecture
– Based on the data width the algorithm is using:
» 64-bit data: MVL = 32 elements/vector register
» 32-bit data: MVL = 64 elements/vector register
» 16-bit data: MVL = 128 elements/vector register
• The Vector Length (VL):
– The total number of elements to be computed
– Set by the algorithm: the inner for-loop (see the sketch below)
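A minimal sketch of how VL relates to MVL, written as a scalar C emulation (the element loop stands in for a single vector instruction; MVL = 64 assumes 32-bit data):

```c
#define MVL 64  /* elements per vector register for 32-bit data */

/* Strip-mined loop: the algorithm asks for n elements, but the
 * hardware processes at most MVL per vector instruction, so the
 * vector length register is set to VL = min(n - i, MVL) each pass. */
void vscale(int n, float *a, float s)
{
    for (int i = 0; i < n; i += MVL) {
        int vl = (n - i < MVL) ? n - i : MVL;  /* set VL for this strip */
        for (int e = 0; e < vl; e++)           /* one vector multiply   */
            a[i + e] *= s;
    }
}
```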
One More (FFT) Term!
• A butterfly group (BG):
– A set of elements that can be computed upon in 1 FFT stage using:
» The same basic computation
AND
» The same root of unity
– The number of elements in a stage’s BG determines the Vector Length (VL) for that stage
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
Cooley-Tukey FFT Algorithm
[Diagram: dataflow of a 16-point radix-2 FFT over four stages, with vector registers vr1 and vr2 holding the butterfly inputs. Stage 1: VL = 8; Stage 2: VL = 4; Stage 3: VL = 2; Stage 4: VL = 1. vr1 + vr2 = 1 butterfly group; VL = vector length]
– Diagram illustrates the "naïve" vectorization (sketched in C below)
– A stage vectorizes well when VL ≥ MVL
– Poor HW utilization when VL is small (< MVL)
– Later stages of the FFT have shorter vector lengths:
» the # of elements in one butterfly group is smaller in the later stages
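The following scalar C sketch (my reconstruction of a standard iterative radix-2 FFT on bit-reversed input, not the actual VIRAM kernel) shows where the short vectors come from: the innermost loop runs over one butterfly group, so its trip count, n/len, is that stage's VL, and it halves every stage:

```c
#include <complex.h>
#include <math.h>

void fft_stages(int n, float complex *x)  /* x is in bit-reversed order */
{
    for (int len = 2; len <= n; len <<= 1) {       /* one FFT stage      */
        for (int j = 0; j < len / 2; j++) {        /* one root of unity  */
            float complex w = cexpf(2.0f * (float)M_PI * j / len * I);
            /* one butterfly group: VL = n/len, i.e. n/2, n/4, ..., 1 */
            for (int base = 0; base < n; base += len) {
                float complex t = w * x[base + j + len / 2];
                x[base + j + len / 2] = x[base + j] - t;
                x[base + j]          += t;
            }
        }
    }
}
```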
Vectorizing the FFT
[Same diagram as above: the vector length halves each stage, from VL = 8 in Stage 1 to VL = 1 in Stage 4]
Naïve Algorithm: What Happens When Vector Lengths Get Short?
[Plot: MFLOPS (0-2200) vs. FFT stage # (1-10) for 1024-, 512-, 256-, and 128-point FFTs, with the IRAM peak performance (2000 MFLOPS) marked; VL = 64 = MVL in the early stages, falling to VL = 8 = # lanes in the late stages]
• Performance peaks (1.4-1.8 GFLOP/s) when vector lengths are ≥ MVL
• For all FFT sizes, 94% to 99% of the total time is spent doing the last 6 stages, when VL < MVL (= 64)
– For the 1024 point FFT, only 60% of the work is done in the last 6 stages
• Performance drops significantly when vector lengths fall to the # of lanes (= 8)
32 bit Floating Point
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
Optimization #1: Add auto-increment
• Automatically adds an increment to the current address in order to obtain the next address
• Auto-increment helps to:
– Reduce the scalar code overhead
• Useful:
– To jump to the next butterfly group in an FFT stage
– For processing a sub-image of a larger image, in order to jump to the appropriate pixel in the next row (see the sketch below)
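As a rough illustration (hypothetical names, scalar emulation), the pointer update in the loop below is exactly the address arithmetic that auto-increment moves out of scalar code:

```c
/* Scale one sub-image inside a larger image. Without auto-increment,
 * the scalar core must execute "row += stride" and re-issue the address
 * before every vector operation; with it, the vector load/store unit
 * advances the address itself. */
void scale_subimage(float *img, int rows, int stride, int width, float s)
{
    float *row = img;                    /* base address of first row  */
    for (int r = 0; r < rows; r++) {
        for (int e = 0; e < width; e++)  /* stands in for 1 vector op  */
            row[e] *= s;
        row += stride;                   /* done in HW by auto-increment */
    }
}
```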
Optimization #1: Add auto-increment
[Plot: MFLOPS (0-250) vs. FFT size (4-1024 points) for the naïve algorithm with and without auto-increment]
– Small gain from auto-increment
» For the 1024 point FFT: 202 MFLOP/s without AI, 225 MFLOP/s with AI
– Still, 94-99% of the time is spent in the last 6 stages, where VL < 64
– Conclusion: auto-increment helps, but scalar overhead is not the main source of the inefficiency
32 bit Floating Point
Optimization #2: Memory Transposes
• Reorganize the data layout in memory to maximize the vector length in later FFT stages
– View the 1D vector as a 2D matrix
– The reorganization is equivalent to a matrix transpose (see the sketch below)
• Transposing the data in memory only works for N ≥ (2 × MVL)
• Transposing in memory adds significant overhead
– Increased memory traffic
» cost too high to make it worthwhile
– Multiple transposes exacerbate the situation:

  FFT Size (# points)   Number of Transposes Needed
  > 2048                           1
  512 - 2048                       2
  256                              3
  128                              5
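A sketch of the reorganization in C (out-of-place for clarity; the names are illustrative): the length-n vector is viewed as an r × c matrix and transposed, which makes the extra memory traffic explicit:

```c
#include <complex.h>

/* View src as an r x c row-major matrix and write its transpose to dst.
 * Every element is loaded and stored once more per transpose -- the
 * added memory traffic the slide warns about. */
void transpose(int r, int c, const float complex *src, float complex *dst)
{
    for (int i = 0; i < r; i++)
        for (int j = 0; j < c; j++)
            dst[j * r + i] = src[i * c + j];
}
```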
Optimization #3: Register Transposes
• Rearrange the elements in the vector registers
– Provides a way to swap elements between 2 registers
– What we want to swap (after stage 1, VL = MVL = 8):

  Stage 2 (SWAP):
  vr1: 0 1 2 3 4 5 6 7
  vr2: 8 9 10 11 12 13 14 15

  Stage 3 (SWAP; VL = 4, BGs = 2):
  vr1: 0 1 2 3 8 9 10 11
  vr2: 4 5 6 7 12 13 14 15

  Stage 4 (SWAP; VL = 2, BGs = 4):
  vr1: 0 1 4 5 8 9 12 13
  vr2: 2 3 6 7 10 11 14 15

– This behavior is hard to implement with one instruction in hardware
Optimization #3: Register Transposes
• Two instructions were added to the VIRAM Instruction Set Architecture (ISA):
– vhalfup and vhalfdn: both move elements one-way between vector registers
• Vhalfup/dn:
– Are extensions of already existing ISA support for fast in-register reductions
– Required minimal additional hardware support
» mostly control lines
– Are much simpler and less costly than a general element permutation instruction
» such an instruction was rejected in the early VIRAM design phase
– Are an elegant, inexpensive, powerful solution to the short vector length problem of the later stages of the FFT
Optimization #3: Register Transposes
Stage 1 (SWAP):
vr1: 0 1 2 3 4 5 6 7
vr2: 8 9 10 11 12 13 14 15

• Three steps to swap elements:
– Copy vr1 into vr3 (move):
  vr3: 0 1 2 3 4 5 6 7
– Move vr2's low half to vr1's high half (vhalfup):
  vr1: 0 1 2 3 8 9 10 11
  » vr1 now done
– Move vr3's high half to vr2's low half (vhalfdn):
  vr2: 4 5 6 7 12 13 14 15
  » vr2 now done
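A scalar C emulation of the three steps (my sketch of the semantics as depicted above, with MVL = 8; the real vhalfup/vhalfdn each act on a whole vector register in one instruction):

```c
#include <string.h>

#define MVL  8
#define HALF (MVL / 2)

/* After this call: vr1 = old vr1 low | old vr2 low,
 *                  vr2 = old vr1 high | old vr2 high. */
void swap_halves(float vr1[MVL], float vr2[MVL])
{
    float vr3[MVL];
    memcpy(vr3, vr1, sizeof vr3);                /* step 1: copy vr1 -> vr3 */
    memcpy(vr1 + HALF, vr2, HALF * sizeof *vr1); /* step 2 (vhalfup):       */
                                                 /*   vr2 low -> vr1 high   */
    memcpy(vr2, vr3 + HALF, HALF * sizeof *vr2); /* step 3 (vhalfdn):       */
                                                 /*   vr3 high -> vr2 low   */
}
```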
Optimization #3: Final Algorithm
• The optimized algorithm has two phases:
– The naïve algorithm is used for stages whose VL ≥ MVL
– Vhalfup/dn code is used on:
» Stages whose VL < MVL, i.e. the last log2(MVL) stages
• Vhalfup/dn:
– Eliminates the short vector length problem
» Allows all vector computations to have VL equal to MVL
• Multiple butterfly groups are done with 1 basic operation
– Eliminates all loads/stores between these stages
• The optimized vhalf algorithm does:
– Auto-increment, software pipelining, code scheduling
– The bit-reversal rearrangement of the results
– Single precision, floating point, complex, radix-2 FFTs
Optimization #3: Register Transposes
[Plot: MFLOPS (0-2200) vs. FFT stage # (1-10) for 1024-, 512-, 256-, and 128-point FFTs with the vhalfup/dn algorithm; IRAM peak performance (2000 MFLOPS) and VL = 8 = # lanes marked]
• Every vector instruction operates with VL = MVL
– For all stages
– Keeps the vector pipeline fully utilized
• Time spent in the last 6 stages:
– drops to 60% to 80% of the total time
32 bit Floating Point
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
Performance Results
[Plot: time (microseconds, 0-250) vs. FFT size (128-1024 points) for three versions: Naive, Naive with no bit reversal, and Vhalfup/dn]
• Both naïve versions utilize the auto-increment feature
– 1 does bit reversal, the other does not
• Vhalfup/dn with and without bit reversal are identical
• Bit reversing the results slows the naïve algorithm, but not vhalfup/dn
32 bit Floating Point
Performance Results
[Same plot as above]
• The performance gap testifies:
– To the effectiveness of the vhalfup/dn algorithm in fully utilizing the vector unit
– To the importance of the new vhalfup/dn instructions
32 bit Floating Point
Performance Results
[Plot: time (microseconds) vs. FFT size for the three versions, annotated with 1024-point FFT times on other processors: TMS320C67x: 124 us; TigerSHARC: 41 us; CRI Pathfinder-1: 22.3 us; CRI Pulsar: 27.9 us; Wildstar: 25 us; PPC604e: 87 us; Pentium/200: 151 us; VIRAM: 37 us]
• VIRAM is competitive with high-end specialized floating point DSPs
– It could match or exceed the performance of these DSPs if the VIRAM architecture were implemented commercially
32 bit Floating Point
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
16 bit Fixed Point Implementation
• Resources:
– 16 lanes (each 16 bits wide)
» Two integer functional units per lane
» 32 operations/cycle
– MVL = 128 elements
• The fixed point multiply-add is not utilized:
– 8 bit operands are too small
» 8 bits × 8 bits = 16 bit product
– A 32 bit product is too big
» 16 bits × 16 bits = 32 bit product
16 bit Fixed Point Implementation (2)
• The basic computation takes:
– 4 multiplies + 4 adds + 2 subtracts = 10 operations
– 6.4 GOP/s is the peak performance for this mix
• To prevent overflow, two bits are shifted right, and lost, at each stage (see the sketch below):

  Input:  Sbbb bbbb bbbb bbbb.
  Output: Sbbb bbbb bbbb bbbb bb.
  (decimal points shown; the two trailing bits are shifted out)
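A scalar sketch of one fixed-point butterfly with this scaling (my illustration, assuming Q15-style 16-bit data and twiddles; products are held in 32 bits before the shifts):

```c
#include <stdint.h>

typedef struct { int16_t re, im; } cq15;   /* 16-bit complex sample */

/* The 10-operation mix counted above (4 multiplies plus 6 adds and
 * subtracts), followed by a 2-bit right shift on each result so the
 * values cannot overflow; 2 bits of precision are lost per stage. */
void butterfly_q15(cq15 *lo, cq15 *hi, cq15 w)
{
    int32_t tre = ((int32_t)w.re * hi->re - (int32_t)w.im * hi->im) >> 15;
    int32_t tim = ((int32_t)w.re * hi->im + (int32_t)w.im * hi->re) >> 15;
    cq15 a = { (int16_t)((lo->re + tre) >> 2), (int16_t)((lo->im + tim) >> 2) };
    cq15 b = { (int16_t)((lo->re - tre) >> 2), (int16_t)((lo->im - tim) >> 2) };
    *lo = a;
    *hi = b;
}
```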
Performance Results
[Plot: time (microseconds, 0-200) vs. FFT size (128-1024 points) for the 16 bit fixed point and 32 bit floating point versions, with the same processor annotations as above]
• Fixed point is faster than floating point on VIRAM
– 1024 pt: 28.3 us versus 37 us
• This implementation attains 4 GOP/s for the 1024 pt FFT, and it is:
– An unoptimized work in progress!
16 bit Fixed Point
Performance Results
[Plot: time (microseconds, 0-50) vs. FFT size for the fixed point and floating point versions. Annotations: TigerSHARC: 4.4 us (fixed pt.); Pentium III (400 MHz): 4.64 us (16 bit int); CRI Pathfinder-1: 22.3 us; CRI Pulsar: 27.9 us; Wildstar: 25 us; VIRAM-FP: 37 us; TigerSHARC: 41 us (floating pt.)]
• Again, VIRAM is competitive with high-end specialized DSPs
– CRI Scorpio, a 24 bit complex fixed point FFT DSP:
» 1024 pt = 7 microseconds
16 bit Fixed Point
Outline
• What is the FFT and Why Study it?
• VIRAM Implementation Assumptions
• About the FFT
• The “Naïve” Algorithm
• 3 Optimizations to the “Naïve” Algorithm
• 32 bit Floating Point Performance Results
• 16 bit Fixed Point Performance Results
• Conclusions and Future Work
Conclusions
• Optimizations to eliminate short vector lengths are necessary for doing the FFT
• VIRAM is capable of performing FFTs at performance levels comparable to or exceeding those of high-end floating point DSPs. It achieves this performance via:
– A highly tuned algorithm designed specifically for VIRAM
– A set of simple, powerful ISA extensions that underlie it
– The efficient parallelism of vector processing embedded in a high-bandwidth on-chip DRAM memory
Conclusions (2)
• The performance of FFTs on VIRAM has the potential to improve significantly over the results presented here:
– 32-bit fixed point FFTs could run up to 2 times faster than the floating point versions
– Compared to 32-bit fixed point FFTs, 16-bit fixed point FFTs could run up to:
» 8x faster (with multiply-add ops)
» 4x faster (with no multiply-add ops)
– Adding a second floating point functional unit would make floating point performance comparable to the 32-bit fixed point performance
– The unoptimized fixed point implementation already attains 4 GOP/s (6.4 GOP/s is peak!)
Conclusions (3)
• Since VIRAM includes both general-purpose CPU capability and DSP muscle, it shares the same space in the emerging market of hybrid CPU/DSPs as:
– Infineon TriCore
– Hitachi SuperH-DSP
– Motorola/Lucent StarCore
– Motorola PowerPC G4 (7400)
• VIRAM's vector processor plus embedded DRAM design may have further advantages over more traditional processors in:
– Power
– Area
– Performance
Future Work
• On the current fixed point implementation:
– Further optimizations and tests
• Explore the tradeoffs between precision/accuracy and performance by implementing:
– A hybrid of the current implementation which alternates the number of bits shifted off each stage:
» 2 1 1 1 2 1 1 1 ...
– A 32 bit integer version which uses 16 bit data
» If the data occupies the 16 most significant bits of the 32, then there are 16 zeros to shift off:
  Sbbb bbbb bbbb bbbb 0000 0000 0000 0000
Backup Slides
[Backup plot: time (microseconds, 0-250) vs. FFT size (128-1024 points) for the naive algorithm with and without auto-increment]
Why Vectors For IRAM?
• Low complexity architecture
– means lower power and area
• Takes advantage of on-chip memory bandwidth
– 100x the bandwidth of workstation memory hierarchies
• High performance for apps with fine-grained parallelism
• Delayed pipeline hides memory latency
– Therefore no cache is necessary
» further conserves power and area
• Greater code density than VLIW designs like:
– TI's TMS320C6000
– Motorola/Lucent StarCore
– AD's TigerSHARC
– Siemens (Infineon) Carmel