FFTs, Portability, & Performance

Steven G. Johnson, MIT Dept. Physics
Matteo Frigo, ITA Software (formerly MIT LCS)

Source: dspace.mit.edu/.../contents/lecture-notes/lec10.pdf


A Need for Speed?

Scientists (along with gamers) often push performance limits

low-level programming?

Codes have long lifetimes, and need flexibility & portability

high-level programming?

Perhaps there is a better way?


FFTW

• C library for real & complex FFTs (arbitrary size/dimensionality)

• Computational kernels (80% of code) automatically generated

• Self-optimizes for your hardware: portability + performance

(+ parallel versions for threads & MPI)

free software: http://www.fftw.org/


The “Fastest Fourier Transform in the West”?

no code is always fastest, but…


FFTW on 167MHz UltraSPARC
double precision, complex 1d transforms

But this is OLD!


okay, I’ll present some new stuff…

FFTW 3.0 (soon to be released)


FFTW on 2GHz Pentium IV

(plot legend: FFTW 3, FFTW 2)


FFTW on 1GHz Alpha (EV7)

(plot legend: FFTW 3, FFTW 2)


Why is FFTW fast?

1. FFTW implements many FFT algorithms: a planner picks the best composition by measuring the speed of different combinations.

2. The resulting plan is executed with explicit recursion: enhances locality

3. The base cases of the recursion are codelets: highly-optimized dense code that is automatically generated by a special-purpose “compiler”


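FFTW is easy to use

{
    /* Names are abbreviated as on the slide; the FFTW 3 API spells them with
       a prefix, e.g. fftw_complex, fftw_plan_dft_1d, FFTW_FORWARD, FFTW_MEASURE. */
    complex x[n];
    plan p;

    p = plan_dft_1d(n, x, x, FORWARD, MEASURE);
    ...
    execute(p); /* repeat as needed */
    ...
    destroy_plan(p);
}

Key fact: usually, many transforms of the same size are required.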


Outline

FFT algorithm basics

Recursion and caches

The planner

The codelet generator

Outline: FFT algorithm basics


Cooley-Tukey FFTs
[ 1965 … or Gauss, 1802 ]

n = pq
1d DFT of size n:
= ~2d DFT of size p x q

(diagram: an array with sides labeled p and q, a transpose step, and a marker for the contiguous direction)

first DFT columns, size q (non-contiguous)

multiply by n “twiddle factors”

transpose

finally, DFT columns, size p (non-contiguous)

= Recursive DFTs of sizes p and q

O(n^2) → O(n log n) (divide-and-conquer algorithm)

(diagram labels: size-q DFTs, twiddles, size-p DFTs)

Outline: Recursion and caches


Cooley-Tukey is Naturally Recursive

But traditional implementation is non-recursive, breadth-first traversal:

log2 n passes over the whole array

Size 8 DFT

Size 4 DFT Size 4 DFT

Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT

p = 2 (radix 2)


Recursive Divide & Conquer is Good

Size 8 DFT

Size 4 DFT Size 4 DFT

Size 2 DFT Size 2 DFT Size 2 DFT Size 2 DFT

p = 2 (radix 2)

eventually small enough to fit in cache… no matter what size the cache is


Out-of-cache FFTs: “Blocking”


Cache-oblivious Recursive FFT

[ Vitter & Shriver, 1994 ]


Cache Obliviousness

• A cache-oblivious algorithm does not know the cache size

— it can be optimal for any machine & for all levels of cache simultaneously

• They exist for matrix multiplication, LU decomposition, sorting, transposition, binary search trees, etc. [Frigo et al. 1999]

— all via the recursive divide & conquer approach

FFTW uses a finite-radix (p) recursive cache-oblivious algorithm with suboptimal “cache complexity” O(n log[n/C]),

…but an optimal algorithm is used in the generator (cache == registers)

Outline: The planner


The Planner

• There are many choices in implementing the C-T algorithm— which factor p? & memory access ordering…

Each algorithm step is represented by a solver.

• The planner tries the different solver combinations for a given n,

measures their speed, and picks the fastest.

— uses dynamic programming

— can use heuristics or saved plans if planning time is a concern


Vectors and Solvers

a problem is specified as a DFT(v,n): multi-dimensional transform n of multi-dimensional vectors v

SOLVE[v,n]: Directly solve size n with 1d vector (loop) v by an efficient codelet (hard-coded FFT loop)

CT-FACTOR[p]: DFT(v, n = pq) = DFT(v x p, q) + v size-p DFTs with twiddles [loop v of hard-coded twiddle codelet p]

VECLOOP: DFT(v x m, n) = loop m of DFT(v,n)

each solver knows what problems it can solve and tells the planner its recursive “child” problems


Dynamic Programming
the assumption of “optimal substructure”

Try all applicable solvers:

DFT(16) = fastest of:
    CT-FACTOR[2]: 2 DFT(8)
    CT-FACTOR[4]: 4 DFT(4)

DFT(8) = fastest of:
    CT-FACTOR[2]: 2 DFT(4)
    CT-FACTOR[4]: 4 DFT(2)
    SOLVE[1,8]

(assume VECLOOP strips off loops)

If exactly the same problem appears twice, assume that we can re-use the plan.
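The dynamic-programming idea can be sketched in a few lines. This is not FFTW's planner: the names (plan_entry, plan_dft, plan_table) are invented here, only power-of-two sizes are handled, and, most importantly, the two cost functions are a made-up model standing in for actually timing code, which is what the real planner does. The sketch only shows the structure: try every applicable solver, keep the cheapest, and reuse the stored plan when the same subproblem comes up again.

```c
#include <assert.h>

enum { MAX_LOG2 = 24, MAX_CODELET_LOG2 = 6 };

typedef struct {
    int planned;      /* nonzero once this size has been planned */
    int log2_radix;   /* 0 = SOLVE directly; r > 0 = CT-FACTOR[2^r] */
    double cost;      /* cost of the best plan found */
} plan_entry;

static plan_entry plan_table[MAX_LOG2 + 1];  /* memo, indexed by log2(n) */
static int candidates_costed;                /* how many candidates were costed */

/* ASSUMED cost model (not measured): a direct codelet for size 2^k. */
static double solve_cost(int k)
{
    return (double)(1L << k) * (1.0 + 0.5 * k);
}

/* ASSUMED cost model: one twiddle pass of radix 2^r over 2^k points. */
static double twiddle_pass_cost(int r, int k)
{
    return (double)(1L << k) * (2.0 + 0.5 * r);
}

/* Best plan for a DFT of size 2^k, memoized ("optimal substructure"). */
const plan_entry *plan_dft(int k)
{
    assert(k >= 0 && k <= MAX_LOG2);
    plan_entry *e = &plan_table[k];
    if (e->planned)
        return e;                     /* same problem seen before: reuse */

    double best = -1.0;
    int best_r = 0;
    if (k <= MAX_CODELET_LOG2) {      /* SOLVE applies only to small sizes */
        best = solve_cost(k);
        ++candidates_costed;
    }
    for (int r = 1; r <= MAX_CODELET_LOG2 && r < k; ++r) {
        /* CT-FACTOR[2^r]: 2^r sub-DFTs of size 2^(k-r) plus a twiddle pass. */
        double c = (double)(1L << r) * plan_dft(k - r)->cost
                 + twiddle_pass_cost(r, k);
        ++candidates_costed;
        if (best < 0.0 || c < best) {
            best = c;
            best_r = r;
        }
    }
    e->planned = 1;
    e->log2_radix = best_r;
    e->cost = best;
    return e;
}
```

After plan_dft(19), following log2_radix through plan_table from index 19 downward gives the chain of factors for n = 2^19. Because each size is planned once, the number of candidates costed grows linearly in log2 n rather than exponentially.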


More Solvers (out of ~16 total)

VECLOOP:
(a) DFT(v x m, n) = loop m of DFT(v,n)
OR
(b) DFT(m x v, n) = loop m of DFT(v,n)
i.e. interchange loop orders!

INDIRECT: DFT(v,n) = DFT(v,{}) + DFT(v,n)
(DFT(v,{}) is a zero-dimensional DFT = copy loop; the remaining DFT(v,n) is in-place)


Actual Plan for size 2^19 = 524288
(2GHz Pentium IV, double precision, out-of-place)

CT-FACTOR[4] (buffered variant)
CT-FACTOR[32] (buffered variant)
VECLOOP(b) x32
CT-FACTOR[64]
INDIRECT
VECLOOP(a) x4
SOLVE[64, 64]
VECLOOP(b) x64
VECLOOP(a) x4
COPY[64]

(SOLVE[64, 64]: ~2000 lines of hard-coded C!)

INDIRECT + VECLOOP(b) (+ …) = demolishes FFTW 2 for large 1d sizes

Unpredictable: (automated) experimentation is the only solution.

Outline: The codelet generator


The Codelet Generator
a domain-specific FFT “compiler”

• Generates fast hard-coded C for FFTs of arbitrary size

Necessary to give the planner a large space of codelets to experiment with.

Exploits modern CPU deep pipelines & large register sets.

Allows easy experimentation with different optimizations & algorithms.

…and you only have to get it right once.


The Codelet Generator
written in Objective Caml [Leroy, 1998], an ML dialect

n
→ Abstract FFT algorithm (Cooley-Tukey: n=pq, Prime-Factor: gcd(p,q) = 1, Rader: n prime, …)
→ Symbolic graph (dag)
→ Simplifications
→ Cache-oblivious scheduling (cache .EQ. registers)
→ Optimized C code (or other language)

powerful enough to e.g. derive real-input FFT from complex FFT algorithm and even find new algorithms


The Generator Finds Good/New FFTs


Symbolic Algorithms are Easy
Cooley-Tukey in OCaml


Simple Simplifications

Well-known optimizations:

Algebraic simplification, e.g. a + 0 = a

Constant folding

Common-subexpression elimination


Symbolic Pattern Matching in OCaml

The following actual code fragment is solely responsible for simplifying multiplications:

stimesM = function
  | (Uminus a, b) -> stimesM (a, b) >>= suminusM
  | (a, Uminus b) -> stimesM (a, b) >>= suminusM
  | (Num a, Num b) -> snumM (Number.mul a b)
  | (Num a, Times (Num b, c)) ->
      snumM (Number.mul a b) >>= fun x -> stimesM (x, c)
  | (Num a, b) when Number.is_zero a -> snumM Number.zero
  | (Num a, b) when Number.is_one a -> makeNode b
  | (Num a, b) when Number.is_mone a -> suminusM b
  | (a, b) when is_known_constant b && not (is_known_constant a) ->
      stimesM (b, a)
  | (a, b) -> makeNode (Times (a, b))

(Common-subexpression elimination is implicit via “memoization” and monadic programming style.)


FFT-specific optimizations:

_________________ negative constants…

Network transposition (transpose + simplify + transpose)


A Quiz: Is One Faster?

Both compute the same thing, and have the same number of arithmetic operations:

a = 0.5 * b;
c = -0.5 * d;
e = 1.0 + a;
f = 1.0 + c;

versus

a = 0.5 * b;
c = 0.5 * d;
e = 1.0 + a;
f = 1.0 - c;

The version with only positive constants is faster because there is no separate load for -0.5: 10–15% speedup
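That the two variants really compute the same values can be checked mechanically. The sketch below wraps each variant in a function (the names and the result_pair struct are invented for this illustration) and compares results exactly; it says nothing about speed, which depends on the machine and compiler.

```c
#include <assert.h>

typedef struct { double e, f; } result_pair;

/* Variant with a negative constant: needs both 0.5 and -0.5. */
result_pair with_negative_constant(double b, double d)
{
    double a = 0.5 * b;
    double c = -0.5 * d;
    result_pair r = { 1.0 + a, 1.0 + c };
    return r;
}

/* Variant with positive constants only: the sign moves into the subtraction. */
result_pair with_positive_constants(double b, double d)
{
    double a = 0.5 * b;
    double c = 0.5 * d;
    result_pair r = { 1.0 + a, 1.0 - c };
    return r;
}

/* Both variants should agree bit for bit for ordinary inputs. */
void check_variants_agree(double b, double d)
{
    result_pair u = with_negative_constant(b, d);
    result_pair v = with_positive_constants(b, d);
    assert(u.e == v.e && u.f == v.f);
}
```

The transformation is the kind of rewrite a code generator can apply everywhere at once, which is hard to do consistently by hand.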


Non-obvious transformations require experimentation


Quiz 2: Which is Faster?

accessing strided array inside codelet (amid dense numeric code)

array[stride * i]
    This is faster, of course! Except on brain-dead architectures…

array[strides[i]], using precomputed stride array: strides[i] = stride * i
    …namely, Intel Pentia: integer multiplication conflicts with floating-point, so there the precomputed array gives up to ~20% speedup
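The two access patterns can be put side by side as follows. This is a sketch with invented helper names (sum_with_multiply, make_stride_table, sum_with_table); a real codelet would have the accesses inlined among floating-point arithmetic, and which form wins has to be measured on the target machine.

```c
#include <assert.h>
#include <stdlib.h>

/* Form 1: compute each offset with an integer multiplication. */
double sum_with_multiply(const double *array, int stride, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; ++i)
        sum += array[stride * i];
    return sum;
}

/* Precompute strides[i] = stride * i once, outside the hot loop.
   The caller frees the returned table. */
int *make_stride_table(int stride, int count)
{
    int *strides = malloc((size_t)count * sizeof *strides);
    assert(strides != NULL);
    for (int i = 0; i < count; ++i)
        strides[i] = stride * i;
    return strides;
}

/* Form 2: look each offset up in the precomputed table. */
double sum_with_table(const double *array, const int *strides, int count)
{
    double sum = 0.0;
    for (int i = 0; i < count; ++i)
        sum += array[strides[i]];
    return sum;
}
```

Both functions read exactly the same elements; the only difference is whether the offset comes from a multiply or from a load.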


SIMD: The Revenge of the Crays
= Single Instruction, Multiple Data

Available on most popular processors today:

Pentium III+ SSE: operate on 4 float values
PowerPC G4 AltiVec: operate on 4 float values
AMD Athlon 3DNow!: operate on 2 float values
Pentium IV SSE2: operate on 2 double values

Modify only the generator to produce SIMD codelets
[ initiated by S. Kral and F. Franchetti, Univ. Vienna ]


SSE2 FFTW on 2GHz Pentium IV

(plot legend: SSE2 FFTW 3, FFTW 3, Intel MKL)


SSE FFTW on 2GHz Pentium IV

(plot legend: SSE FFTW 3, FFTW 3, Intel MKL)


with a generator, it’s easy to include less-popular cases…


SSE2 FFTW on 2GHz Pentium IV

(plot legend: SSE2 FFTW 3, FFTW 3)


We’ve Come a Long Way

1965 Cooley & Tukey, IBM 7094, 36-bit single precision: size 2048 DFT in 1.2 seconds

2003 FFTW3+SIMD, 2GHz Pentium-IV, 64-bit double precision: size 2048 DFT in 50 microseconds (24,000x speedup)

(= 30% improvement per year)
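As a quick check of the arithmetic, the yearly figure comes from compounding the total speedup over the 38 years from 1965 to 2003:

```latex
\frac{1.2\ \mathrm{s}}{50\ \mu\mathrm{s}} = 24{,}000,
\qquad
24{,}000^{1/38} = \exp\!\left(\frac{\ln 24{,}000}{38}\right)
\approx \exp(0.265) \approx 1.30
```

That is, a factor of about 1.30 per year, or roughly 30% improvement per year.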


We’ve Come a Long Way?

In the name of performance, computers have become complex and unpredictable.

Optimization is hard: you cannot simply minimize the number of operations.

The solution is to avoid the details, not embrace them:
(Recursive) composition of simple modules + feedback (self-optimization)

High-level languages (not C) & code generation are a powerful tool for high performance.


FFTW Homework Problems?

• Try an FFTPACK-style back-and-forth solver

• Implement Vector-Radix for multi-dimensional n

• Pruned FFTs: VECLOOP that skips zeros

• Better heuristic planner—some sort of optimization of per-solver “costs?”

• Modify generator for fixed-point arithmetic—e.g. faster integer MDCT for Ogg Vorbis audio

• Implement convolution solvers