The Study of Cache Oblivious Algorithms Prepared by Jia Guo.
CS598dhp
Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
Outline
Cache complexity
Cache aware algorithms
Cache oblivious algorithms
  Matrix multiplication
  Matrix transposition
  FFT
Conclusion
Assumption
Only two levels of memory hierarchy:
An ideal cache
  Fully associative
  Optimal replacement strategy
  "Tall cache"
A very large memory
An Ideal Cache Model
An ideal cache model (Z,L)
Z: Total words in the cache
L: Words in one cache line
Cache Complexity
An algorithm with input size n is measured by:
  Work complexity W(n)
  Cache complexity Q(n; Z, L): the number of cache misses it incurs
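
To make the miss count Q(n; Z, L) concrete, here is a minimal sketch of a cache simulator. It is an illustration only: the function name `count_misses` is hypothetical, and it uses LRU replacement as a stand-in for the optimal replacement strategy that the ideal cache model assumes.

```python
from collections import OrderedDict

def count_misses(addresses, Z, L):
    """Count cache misses for a sequence of word addresses.

    Models a fully associative cache of Z words with lines of L words.
    Assumption: LRU replacement stands in for the ideal cache's optimal
    replacement strategy.
    """
    capacity = Z // L          # number of cache lines
    resident = OrderedDict()   # line number -> True, oldest entry first
    misses = 0
    for addr in addresses:
        line = addr // L
        if line in resident:
            resident.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            resident[line] = True
            if len(resident) > capacity:
                resident.popitem(last=False)  # evict least recently used
    return misses

# Sequential scan of 64 words with Z = 32, L = 8: one miss per line.
scan = count_misses(range(64), Z=32, L=8)

# Column-by-column walk over an 8 x 8 row-major matrix (stride 8).
column_walk = count_misses(
    [r * 8 + c for c in range(8) for r in range(8)], Z=32, L=8
)
```

With these parameters the sequential scan incurs 8 misses (n/L), while the strided walk misses on all 64 accesses, because the 8 lines it cycles through do not fit in the 4 lines of the cache.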
Cache Aware Algorithms
Contain parameters to minimize the cache complexity for a particular cache size (Z) and line length (L).
Need to adjust parameters when running on different platforms.
Example:
A blocked matrix multiplication algorithm
s is a tuning parameter to make the algorithm run fast
[Figure: an n × n matrix A partitioned into s × s blocks such as A11]
Example (2)
Cache complexity
  The three s × s submatrices should fit into the cache, so they occupy $\Theta(\max(s, s^2/L)) = \Theta(s + s^2/L)$ cache lines.
  Optimal performance is obtained when $s = \Theta(\sqrt{Z})$.
  $Z/L$ cache misses are needed to bring the 3 submatrices into cache.
  $n^2/L$ cache misses are needed to read the $n^2$ elements.
  The cache complexity is $\Theta(1 + n^2/L + (n/s)^3 (Z/L)) = \Theta(1 + n^2/L + n^3/(L\sqrt{Z}))$.
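
The blocked algorithm in this example can be sketched as follows. This is a minimal illustration, not the exact Block-MULT pseudocode: the function name `block_mult` is hypothetical, matrices are plain lists of lists, and the `min(...)` bounds are an added assumption so that s does not have to divide n.

```python
def block_mult(A, B, s):
    """Multiply two n x n matrices using s x s blocks.

    s is the tuning parameter: it should be chosen so that three
    s x s blocks fit in the cache (s on the order of sqrt(Z)).
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    # Loop over blocks of C, A, and B.
    for i0 in range(0, n, s):
        for j0 in range(0, n, s):
            for k0 in range(0, n, s):
                # Ordinary multiplication restricted to one block triple.
                for i in range(i0, min(i0 + s, n)):
                    for j in range(j0, min(j0 + s, n)):
                        acc = C[i][j]
                        for k in range(k0, min(k0 + s, n)):
                            acc += A[i][k] * B[k][j]
                        C[i][j] = acc
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
product = block_mult(A, I, s=2)  # multiplying by the identity returns A
```

The result does not depend on s; only the memory access pattern does, which is why s has to be retuned for each platform.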
Cache Oblivious Algorithms
Have no parameters that depend on the hardware, such as cache size (Z) or cache-line length (L).
No tuning needed; platform independent.
The algorithms introduced next are proved to have optimal cache complexity.
Matrix Multiplication
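Partition matrices A and B in half along the largest dimension. A: n × m, B: m × p
  Case n ≥ max(m, p): split A horizontally into A1 and A2; compute A1·B and A2·B.
  Case m ≥ max(n, p): split A vertically and B horizontally; compute A1·B1 + A2·B2.
  Case p ≥ max(n, m): split B vertically into B1 and B2; compute A·B1 and A·B2.
Proceed recursively until reaching the base case of one element.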
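
The recursive partitioning can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `rec_mult` and the index-range arguments are hypothetical, matrices are lists of lists, and the result is accumulated into a preallocated C.

```python
def rec_mult(A, B, C, i0, i1, k0, k1, j0, j1):
    """Accumulate A[i0:i1, k0:k1] * B[k0:k1, j0:j1] into C[i0:i1, j0:j1].

    No cache parameter appears anywhere: the recursion simply halves
    the largest of the three dimensions n, m, p.
    """
    n, m, p = i1 - i0, k1 - k0, j1 - j0
    if n == 0 or m == 0 or p == 0:
        return
    if n == 1 and m == 1 and p == 1:          # base case: one element
        C[i0][j0] += A[i0][k0] * B[k0][j0]
    elif n >= max(m, p):                      # split the rows of A
        mid = i0 + n // 2
        rec_mult(A, B, C, i0, mid, k0, k1, j0, j1)
        rec_mult(A, B, C, mid, i1, k0, k1, j0, j1)
    elif m >= max(n, p):                      # split columns of A, rows of B
        mid = k0 + m // 2
        rec_mult(A, B, C, i0, i1, k0, mid, j0, j1)
        rec_mult(A, B, C, i0, i1, mid, k1, j0, j1)
    else:                                     # split the columns of B
        mid = j0 + p // 2
        rec_mult(A, B, C, i0, i1, k0, k1, j0, mid)
        rec_mult(A, B, C, i0, i1, k0, k1, mid, j1)

A = [[1, 2, 3], [4, 5, 6]]          # 2 x 3
B = [[7, 8], [9, 10], [11, 12]]     # 3 x 2
C = [[0, 0], [0, 0]]
rec_mult(A, B, C, 0, 2, 0, 3, 0, 2)
```

In practice the recursion would stop at a larger base case to reduce call overhead; the one-element base case is kept here to match the description above.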
Matrix Multiplication (2)
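Assume the sizes of A and B are n × 4n and 4n × n.
$$A \cdot B = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2$$
$$A_1 B_1 = \begin{pmatrix} A_{11} & A_{12} \end{pmatrix} \begin{pmatrix} B_{11} \\ B_{12} \end{pmatrix} = A_{11} B_{11} + A_{12} B_{12}$$
$$A_2 B_2 = \begin{pmatrix} A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{21} \\ B_{22} \end{pmatrix} = A_{21} B_{21} + A_{22} B_{22}$$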
Matrix Multiplication (3)
Intuitively, once a sub problem fits into the cache, its smaller sub problems can be solved in cache with no further misses.
Matrix Multiplication (4)
Cache complexity
  Matches the cache complexity of the cache-aware Block-MULT algorithm.
  For square matrices, the optimal cache complexity is achieved.
Matrix Transposition
A → A^T, where A is m × n and B = A^T is n × m:
for i ← 1 to m
  for j ← 1 to n
    B(j, i) = A(i, j)
If n is very large, the column-wise access of B causes a cache miss every time (no spatial locality in B).
Matrix Transposition (2)
Partition array A along the longer dimension and recursively execute the transpose function.
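$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \quad \longrightarrow \quad A^T = \begin{pmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{pmatrix}$$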
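
A minimal sketch of the recursive transpose follows. The function name `rec_transpose` and its index-range arguments are hypothetical, and the matrices are lists of lists rather than a flat row-major array.

```python
def rec_transpose(A, B, r0, r1, c0, c1):
    """Write the transpose of A[r0:r1, c0:c1] into B[c0:c1, r0:r1].

    The block is halved along its longer dimension until a single
    element remains; no cache parameter is used.
    """
    rows, cols = r1 - r0, c1 - c0
    if rows == 0 or cols == 0:
        return
    if rows == 1 and cols == 1:       # base case: copy one element
        B[c0][r0] = A[r0][c0]
    elif rows >= cols:                # split the rows
        mid = r0 + rows // 2
        rec_transpose(A, B, r0, mid, c0, c1)
        rec_transpose(A, B, mid, r1, c0, c1)
    else:                             # split the columns
        mid = c0 + cols // 2
        rec_transpose(A, B, r0, r1, c0, mid)
        rec_transpose(A, B, r0, r1, mid, c1)

A = [[1, 2, 3], [4, 5, 6]]            # m x n = 2 x 3
B = [[0] * 2 for _ in range(3)]       # n x m = 3 x 2
rec_transpose(A, B, 0, 2, 0, 3)
```

Once a sub-block and its destination both fit in the cache, all deeper recursive calls on that block run without further misses, which is the intuition behind the optimal bound.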
Matrix Transposition (3)
Cache complexity
  It has the optimal cache complexity: $Q(m, n) = \Theta(1 + mn/L)$
Fast Fourier Transform
Use the Cooley-Tukey algorithm.
Cooley-Tukey algorithms recursively re-express a DFT of a composite size n = n1n2 as:
  Perform n2 DFTs of size n1.
  Multiply by complex roots of unity called twiddle factors.
  Perform n1 DFTs of size n2.
$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}, \qquad \omega_n = e^{2\pi\sqrt{-1}/n}$$
$$Y[i] = \sum_{j=0}^{n-1} X[j]\,\omega_n^{-ij}$$
Writing $i = i_1 + i_2 n_1$ and $j = j_1 n_2 + j_2$:
$$Y[i_1 + i_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\,\omega_{n_1}^{-i_1 j_1} \right) \omega_n^{-i_1 j_2} \right] \omega_{n_2}^{-i_2 j_2}$$
[Figure: X viewed as an n1 × n2 matrix]
Assume X is a row-major n1 × n2 matrix.
Steps:
  1. Transpose X in place.
  2. Compute n2 DFTs.
  3. Multiply by twiddle factors.
  4. Transpose X in place.
  5. Compute n1 DFTs.
  6. Transpose X in place.
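
The six steps can be sketched as a short recursive routine. This is an illustration under stated assumptions, not the paper's implementation: the names `fft` and `transpose` are hypothetical, the transposes build new lists instead of working in place, the input length must be a power of 2, and the sign convention is the forward DFT with factors exp(-2πi·jk/n).

```python
import cmath

def transpose(flat, rows, cols):
    """Return the cols x rows transpose of a row-major rows x cols matrix."""
    return [flat[r * cols + c] for c in range(cols) for r in range(rows)]

def fft(x):
    """Forward DFT of a list whose length n is a power of 2."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    if n == 1:
        return list(x)
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    k = n.bit_length() - 1            # n = 2**k
    n1 = 1 << ((k + 1) // 2)          # n1 * n2 = n, both near sqrt(n)
    n2 = n // n1
    # 1. Transpose the n1 x n2 input so each size-n1 DFT is contiguous.
    t = transpose(x, n1, n2)
    # 2. Compute n2 DFTs of size n1 (one per row of the n2 x n1 matrix).
    rows = [fft(t[j2 * n1:(j2 + 1) * n1]) for j2 in range(n2)]
    # 3. Multiply entry (j2, i1) by the twiddle factor.
    t = [rows[j2][i1] * cmath.exp(-2j * cmath.pi * i1 * j2 / n)
         for j2 in range(n2) for i1 in range(n1)]
    # 4. Transpose back to n1 x n2.
    t = transpose(t, n2, n1)
    # 5. Compute n1 DFTs of size n2.
    rows = [fft(t[i1 * n2:(i1 + 1) * n2]) for i1 in range(n1)]
    t = [v for row in rows for v in row]
    # 6. Final transpose puts output i1 + i2*n1 at the right position.
    return transpose(t, n1, n2)

def naive_dft(x):
    """Direct O(n^2) evaluation of the DFT definition, for checking."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * i * j / n)
                for j in range(n)) for i in range(n)]

signal = [1, 2, 3, 4, 5, 6, 7, 8]
fast = fft(signal)
slow = naive_dft(signal)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```

For n = 8 the split is n1 = 4, n2 = 2, and the size-4 subproblems recurse with n1 = n2 = 2 before reaching the base case.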
Fast Fourier Transform
[Figure: worked example with n1 = 4, n2 = 2]
  Transpose to select n2 DFTs of size n1.
  Call FFT recursively with n1 = 2, n2 = 2; reach the base case and return.
  Multiply by the twiddle factors.
  Transpose to select n1 DFTs of size n2.
  Transpose and return.
Fast Fourier Transform
Cache complexity
  Optimal for a Cooley-Tukey algorithm when n is an exact power of 2:
  $Q(n) = O(1 + (n/L)(1 + \log_Z n))$
Other Cache Oblivious Algorithms
Funnelsort
Distribution sort
LU decomposition without pivots
Questions
How large is the range of practicality of cache-oblivious algorithms?
What are the relative strengths of cache-oblivious and cache-aware algorithms?
Practicality of Cache-oblivious Algorithms
[Figure: average time to transpose an N × N matrix, divided by N²]
Practicality of Cache-oblivious Algorithms (2)
[Figure: average time taken to multiply two N × N matrices, divided by N³]
Question 2
Do cache-oblivious algorithms perform as well as cache-aware algorithms?
  FFTW library
  No answer yet.
References
Cache-Oblivious Algorithms, by Matteo Frigo, Charles E. Leiserson, Harald Prokop, and Sridhar Ramachandran. In the 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17-18 October, 1999, New York, NY, USA.
Cache-Oblivious Algorithms, by Harald Prokop. Master's Thesis, MIT Department of Electrical Engineering and Computer Science. June 1999.
Optimizing Matrix Multiplication with a Classifier Learning System, by Xiaoming Li and María Jesus Garzarán. LCPC 2005.