CS 179: GPU Programming
Lecture 8
Last time
• GPU-accelerated:
  – Reduction
  – Prefix sum
  – Stream compaction
  – Sorting (quicksort)
Today
• GPU-accelerated Fast Fourier Transform
• cuFFT (FFT library)
Signals (again)
“Frequency content”
• What frequencies are present in our signals?
Discrete Fourier Transform (DFT)
• Given signal $x = (x_0, \dots, x_{N-1})$ over time, $X = W x$ represents the DFT of $x$
  – Each row of $W$ is a complex sine wave
  – Each row multiplied with $x$: inner product of wave with signal
  – Corresponding entries of $X$: “content” of that sine wave!
Discrete Fourier Transform (DFT)
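• Alternative formulation:

$$X_k = \sum_{n=0}^{N-1} x_n \, \omega_N^{kn}, \qquad \omega_N = e^{2\pi i/N}$$

  (some references use the opposite sign in the exponent)
  – $X_k$: values corresponding to wave $k$
    • Periodic: calculate for $0 \le k \le N-1$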
  – Naive runtime: $O(N^2)$
    • Sum of N terms, for each of N values of k
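The naive runtime can be seen directly in a double loop. The following is a minimal Python sketch, not the lecture's code: the function name `naive_dft` and the `sign` parameter are introduced here for illustration, and `sign=+1` is an assumption chosen to match the $e^{2\pi i/m}$ factor in the iterative pseudocode later in the lecture (many libraries use the opposite sign for the forward transform).

```python
import cmath

def naive_dft(x, sign=+1):
    """Direct O(N^2) DFT: X[k] = sum_n x[n] * exp(sign * 2*pi*i * k*n / N).

    sign=+1 is an assumed convention (hypothetical parameter, not from the slides).
    """
    N = len(x)
    X = []
    for k in range(N):              # N output values ...
        total = 0j
        for n in range(N):          # ... each a sum of N terms
            total += x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N)
        X.append(total)
    return X

# A constant signal has all of its "content" in wave k = 0.
spectrum = naive_dft([1, 1, 1, 1])
```

Every `(k, n)` term is independent of the others, which is why this form is easy to parallelize even though it does more total work than an FFT.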
Roots of unity
Discrete Fourier Transform (DFT)
• The complex exponential factors in the sum are all N-th roots of unity
  – Number of distinct values: N, not $N^2$!
(Proof)
• Breakdown (assuming N is a power of 2):
  – (Let $\omega_N = e^{2\pi i/N}$, smallest root of unity)

$$
\begin{aligned}
X_k &= \sum_{n=0}^{N-1} x_n \, \omega_N^{kn} \\
    &= \sum_{m=0}^{N/2-1} x_{2m} \, \omega_N^{2km} + \sum_{m=0}^{N/2-1} x_{2m+1} \, \omega_N^{k(2m+1)} \\
    &= \sum_{m=0}^{N/2-1} x_{2m} \, \omega_{N/2}^{km} + \omega_N^{k} \sum_{m=0}^{N/2-1} x_{2m+1} \, \omega_{N/2}^{km}
\end{aligned}
$$

  – First sum: DFT of $x_n$, even $n$! Second sum: DFT of $x_n$, odd $n$!
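One step worth spelling out is how the outputs for $k \ge N/2$ come from the same two half-size DFTs. A short sketch, taking $\omega_N = e^{2\pi i/N}$ (an assumed sign convention; the identities hold equally with the opposite sign) and writing $E_k$ and $O_k$, names introduced here for illustration, for the DFTs of the even-indexed and odd-indexed samples:

```latex
\begin{aligned}
X_k       &= E_k + \omega_N^{k} \, O_k, \\
X_{k+N/2} &= E_{k+N/2} + \omega_N^{k+N/2} \, O_{k+N/2} = E_k - \omega_N^{k} \, O_k,
\qquad 0 \le k < N/2 .
\end{aligned}
```

The second line uses two facts: $E$ and $O$ are periodic with period $N/2$, and $\omega_N^{N/2} = -1$. Each pair of outputs therefore costs one multiplication and two additions.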
(Divide-and-conquer algorithm)

    Recursive-FFT(Vector x):
        n <- x.length
        if n = 1: return x
        x_even <- (x_0, x_2, ..., x_(n-2))
        x_odd  <- (x_1, x_3, ..., x_(n-1))
        y_even <- Recursive-FFT(x_even)                  // T(n/2)
        y_odd  <- Recursive-FFT(x_odd)                   // T(n/2)
        w <- e^(2πi/n)
        for k = 0, …, (n/2)-1:                           // O(n)
            y[k]       <- y_even[k] + w^k * y_odd[k]
            y[k + n/2] <- y_even[k] - w^k * y_odd[k]
        return y
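The pseudocode translates almost line for line into Python. This is a minimal sketch under stated assumptions: the name `recursive_fft` and the `sign` parameter are hypothetical, `sign=+1` follows the $e^{2\pi i/m}$ factor used in the lecture's iterative pseudocode, and the input length is assumed to be a power of 2.

```python
import cmath

def recursive_fft(x, sign=+1):
    """Radix-2 divide-and-conquer FFT (length must be a power of 2)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of 2")
    if n == 1:
        return list(x)
    y_even = recursive_fft(x[0::2], sign)     # DFT of even-indexed samples
    y_odd = recursive_fft(x[1::2], sign)      # DFT of odd-indexed samples
    w = cmath.exp(sign * 2j * cmath.pi / n)   # assumed sign convention
    y = [0j] * n
    for k in range(n // 2):
        t = w ** k * y_odd[k]
        y[k] = y_even[k] + t
        y[k + n // 2] = y_even[k] - t
    return y

result = recursive_fft([1, 2, 3, 4])
```

The two recursive calls and the linear combine loop correspond to the `T(n/2)`, `T(n/2)` and `O(n)` terms of the recurrence.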
Runtime
• Recurrence relation:
  – T(n) = 2T(n/2) + O(n)

O(n log n) runtime! Much better than $O(n^2)$

• (Minor caveat: N must be a power of 2)
  – Usually resolvable
Parallelizable?
• $O(n^2)$ algorithm certainly is! Every term of the double loop is independent:

    for k = 0, …, N-1:
        for n = 0, …, N-1:
            ...

• Sometimes parallelizability outweighs asymptotic runtime!
  – (N-body problem, …)
Recursive index tree

    x0, x1, x2, x3, x4, x5, x6, x7
    x0, x2, x4, x6  |  x1, x3, x5, x7
    x0, x4  |  x2, x6  |  x1, x5  |  x3, x7
    x0  |  x4  |  x2  |  x6  |  x1  |  x5  |  x3  |  x7

Order?
Bit-reversal order

    Index   Binary   Reverse of…
    0       000      000
    4       100      001
    2       010      010
    6       110      011
    1       001      100
    5       101      101
    3       011      110
    7       111      111

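The leaf order of the recursion tree can be generated by reversing the bits of each position. A minimal Python sketch (the helper names `bit_reverse` and `bit_reversed_order` are hypothetical, and the length is assumed to be a power of 2):

```python
def bit_reverse(i, bits):
    """Reverse the lowest `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the lowest bit of i into r
        i >>= 1
    return r

def bit_reversed_order(n):
    """Indices 0..n-1 in bit-reversed order (n assumed to be a power of 2)."""
    bits = n.bit_length() - 1    # log2(n) for a power of 2
    return [bit_reverse(i, bits) for i in range(n)]

order = bit_reversed_order(8)    # [0, 4, 2, 6, 1, 5, 3, 7]
```

Position 1 (binary 001) holds element 4 (binary 100), matching the second row of the table.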
Iterative approach

    x0, x1, x2, x3, x4, x5, x6, x7
        ↑ Stage 3
    x0, x2, x4, x6  |  x1, x3, x5, x7
        ↑ Stage 2
    x0, x4  |  x2, x6  |  x1, x5  |  x3, x7
        ↑ Stage 1
    x0  |  x4  |  x2  |  x6  |  x1  |  x5  |  x3  |  x7

[Figure: butterfly diagram of the iterative FFT, with bit-reversed access to the input followed by Stage 1, Stage 2, Stage 3. Source: http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm]
Iterative approach

    Iterative-FFT(Vector x):
        y <- (bit-reversed order x)
        N <- y.length
        for s = 1, 2, …, log2(N):                        // O(log N) iterations
            m <- 2^s
            w_m <- e^(2πi/m)
            for k = 0, m, 2m, …, N-m:                    // O(N/m) iterations
                for j = 0, …, (m/2)-1:                   // O(m) iterations
                    u <- y[k + j]
                    t <- (w_m)^j * y[k + j + m/2]
                    y[k + j]       <- u + t
                    y[k + j + m/2] <- u - t
        return y

Total: O(N log N) runtime
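The iterative version can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's code: the name `iterative_fft` and the `sign` parameter are hypothetical, `sign=+1` mirrors the $e^{2\pi i/m}$ factor in the pseudocode, and the length is assumed to be a power of 2. The running product `w` replaces the explicit power `(w_m)^j`.

```python
import cmath

def iterative_fft(x, sign=+1):
    """In-place style radix-2 FFT: bit-reversal copy, then log2(N) butterfly stages."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of 2")
    bits = N.bit_length() - 1

    # Bit-reversed copy of the input.
    y = [0j] * N
    for i in range(N):
        r, v = 0, i
        for _ in range(bits):
            r = (r << 1) | (v & 1)
            v >>= 1
        y[r] = x[i]

    for s in range(1, bits + 1):                  # O(log N) stages
        m = 1 << s
        w_m = cmath.exp(sign * 2j * cmath.pi / m)
        for k in range(0, N, m):                  # N/m groups per stage
            w = 1 + 0j
            for j in range(m // 2):               # m/2 butterflies per group
                u = y[k + j]
                t = w * y[k + j + m // 2]
                y[k + j] = u + t
                y[k + j + m // 2] = u - t
                w *= w_m                          # w == w_m ** (j + 1)
    return y

result = iterative_fft([1, 2, 3, 4])
```

Within one stage, all butterflies touch disjoint pairs of elements, so they can run in parallel; only the stages themselves must run in order.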
CUDA approach
[Figure: the same butterfly diagram (bit-reversed access, then Stage 1, Stage 2, Stage 3), annotated with the issues below. Source: http://staff.ustc.edu.cn/~csli/graduate/algorithms/book6/chap32.htm]
• Each stage needs the previous stage's results: __syncthreads() barriers!
• Non-coalesced memory access!!
  – Load into shared memory
• Bank conflicts!!
Generalization
• Above algorithm works for array sizes of $2^n$ (integer $n$)
• Generalize?
“Bit”-reversal order
• N = 6: (small example)
  – Factors into 3, 2
    • 1st digit: base 3
    • 2nd digit: base 2

    Natural order:          00  01  10  11  20  21     (indices 0 1 2 3 4 5)
    Digit-reversed order:   00  10  20  01  11  21     (indices 0 2 4 1 3 5)

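The N = 6 ordering can be reproduced by generalizing bit reversal to mixed-radix digits. A minimal Python sketch; the function name `digit_reversed_order` and its `radices` argument are hypothetical, with `radices=(3, 2)` meaning "1st digit base 3, 2nd digit base 2":

```python
from math import prod

def digit_reversed_order(radices):
    """Element index stored at each position after mixed-radix digit reversal.

    radices lists the digit bases of an element index, most significant first.
    """
    N = prod(radices)
    order = []
    for p in range(N):
        # Write the position p with the bases in reversed order,
        # extracting its least significant digit first.
        digits = []
        rem = p
        for r in radices:
            digits.append(rem % r)
            rem //= r
        # Reading those digits in extraction order gives the element index.
        idx = 0
        for d, r in zip(digits, radices):
            idx = idx * r + d
        order.append(idx)
    return order

order_6 = digit_reversed_order((3, 2))      # [0, 2, 4, 1, 3, 5]
order_8 = digit_reversed_order((2, 2, 2))   # ordinary bit reversal
```

With all radices equal to 2 this reduces to the bit-reversal order used by the power-of-2 algorithm.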
Generalization
• Radix-3 case (followed by radix-2)
(Figure source: https://www.projectrhea.org/)
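• Suppose $N = N_1 N_2$ (factorization…)
• Algorithm: (recursive), with the input viewed as a two-dimensional array of $N_1 N_2$ entries
  – Take FFT along rows
  – Do multiply operations (complex roots of unity, the “twiddle factors”)
  – Take FFT along columns
• “Cooley-Tukey FFT algorithm”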
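The three steps can be sketched for a single level of the factorization. This is an illustrative Python sketch, not the lecture's implementation: the names `dft` and `cooley_tukey` are hypothetical, the sub-transforms use a direct DFT where a real implementation would recurse, the row/column layout (`rows[b][a] = x[n2*a + b]`) is one possible choice, and the positive exponent sign is an assumption matching the lecture's iterative pseudocode.

```python
import cmath

def dft(x):
    """Direct DFT, used here for the small sub-transforms (assumed sign: +)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def cooley_tukey(x, n1, n2):
    """One Cooley-Tukey step for N = n1 * n2."""
    N = len(x)
    if N != n1 * n2:
        raise ValueError("len(x) must equal n1 * n2")
    # View x as n2 rows of length n1: rows[b][a] = x[n2*a + b].
    rows = [[x[n2 * a + b] for a in range(n1)] for b in range(n2)]
    # Step 1: length-n1 transform along each row.
    rows = [dft(row) for row in rows]
    # Step 2: multiply entry (b, k1) by the root of unity w_N^(k1*b).
    for b in range(n2):
        for k1 in range(n1):
            rows[b][k1] *= cmath.exp(2j * cmath.pi * k1 * b / N)
    # Step 3: length-n2 transform along each column.
    X = [0j] * N
    for k1 in range(n1):
        col = dft([rows[b][k1] for b in range(n2)])
        for k2 in range(n2):
            X[k1 + n1 * k2] = col[k2]
    return X

x = [1, 2, 3, 4, 5, 6]
X = cooley_tukey(x, 3, 2)
```

Applying the same step recursively to the row and column transforms, down to small prime factors, gives the general algorithm; the power-of-2 case is the special case where every factor is 2.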
Inverse DFT/FFT
• Similarly parallelizable!
  – (Sign change in the complex exponent; the usual convention also scales by 1/N)
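A round trip makes the relationship concrete. A minimal Python sketch with hypothetical names (`dft`, `inverse_dft`, `forward_sign`); which sign counts as "forward" is a convention, and `+1` is assumed here only to match the lecture's pseudocode:

```python
import cmath

def dft(x, sign):
    """Direct DFT with an explicit exponent sign."""
    N = len(x)
    return [sum(x[n] * cmath.exp(sign * 2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def inverse_dft(X, forward_sign=+1):
    """Inverse transform: flip the exponent sign and scale by 1/N."""
    N = len(X)
    return [v / N for v in dft(X, -forward_sign)]

signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0]
recovered = inverse_dft(dft(signal, +1), forward_sign=+1)
```

Because the inverse has the same structure as the forward transform, any FFT implementation can serve both directions by switching the sign and applying the scaling.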
cuFFT
• FFT library included with CUDA
  – Approximately implements previous algorithms
    • (Cooley-Tukey/Bluestein)
• Also handles higher dimensions
cuFFT 1D example
Correction: Remember to use cufftDestroy(plan) when finished with transforms
cuFFT 3D example
Correction: Remember to use cufftDestroy(plan) when finished with transforms
Remarks
• As before, some parallelizable algorithms don’t easily “fit the mold”
  – Hardware matters more!
  – Some resources:
    • Introduction to Algorithms (Cormen et al.), aka “CLRS”, esp. Sec. 30.5
    • “An Efficient Implementation of Double Precision 1-D FFT for GPUs Using CUDA” (Liu et al.)