Parallel Algorithms and Applications(2012)
Parallel Algorithms
January 2012
Kenichi Miura
Classification of Computational Models (Miura 1980)
• Continuum Model - Fluid Model (Eulerian view) - Discretization of PDEs
• Particle Model - Many-body Problems (Lagrangian view) - Discretization of ODEs (e.g., Newton’s Equations)
• Structural Model - Discrete Model - Sparse Matrix Formulation
• Mathematical Transform - Fourier Transform - Linear Algebra
High-end simulation in the physical sciences consists of seven algorithms:
1. Structured Grids (including locally structured grids, e.g., AMR)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
Well-defined targets from an algorithmic and software standpoint. Slide from “Defining Software Requirements for Scientific Computing”, Phillip Colella, 2004
Phillip Colella’s “Seven Dwarfs” (2004)
High-end simulation in the physical sciences consists of thirteen algorithms:
1. Dense Linear Algebra
2. Sparse Linear Algebra
3. Spectral Methods (Fast Fourier Transform)
4. N-Body Methods
5. Structured Grids
6. Unstructured Grids
7. MapReduce (including Monte Carlo Methods)
8. Combinational Logic
9. Graph Traversal
10. Dynamic Programming
11. Backtrack and Branch-and-Bound
12. Graphical Models
13. Finite State Machines
The “Thirteen Dwarfs” (Asanović et al., Berkeley, 2006), extending Colella’s seven
Steps in Conducting Simulations
• Physical phenomena
• Modeling and mathematical formulation
• Algorithm selection/development
• Programming
• Run on a hardware platform
• Verification of the results
• SISD (Sequential Processing)
• SIMD (Lock-step, Data Parallel)
• MIMD (Control Parallel)
• SPMD (A variation of MIMD; the Data Parallel model on an MIMD machine)
Mike Flynn (1967)
Parallelism from the Viewpoint of Computational Models and Parallelism Description
Parallel Programming -A Necessary Evil?-
• Ideally, one would obtain performance (i.e., shorter wall-clock time) without touching the code at all ---- an automatically parallelizing compiler
• Why doesn’t this work so easily in many cases?
- The computational algorithm is inherently serial (examples: recursive formulations, many branches)
- The algorithm may be parallelizable, but the actual implementation of the code is NOT
- The employed data structure is not suitable for parallel processing (examples: stack vs. FIFO, array vs. linked list)
Bottlenecks in Parallel Processing
• Overhead in creating and finalizing tasks
• Overhead in synchronization
• A significant fraction of non-parallel code (Amdahl’s Law)
• Overhead in data transfer (latency, bandwidth) for the Distributed Memory architecture
• Memory contention for the Shared Memory architecture
• Lack of load balancing
Parallel Processing and Amdahl’s Law
Can a program be run faster in proportion to the number of processors?
Synchronous Parallel Processing Model: Barrier Model, Amdahl (1967)
Asynchronous Parallel Processing Model: Critical Section Model, Miura (1991)
Synchronous Parallel Processing Model
Barrier Model Gene Amdahl (1967)
Sp(n) = 1/((1 − α) + α/n), where α is the parallel fraction and 1 − α the serial fraction
[Figure: execution time split into a serial part (1 − α) and a parallel part (α); photo of Dr. G. M. Amdahl (2008)]
Amdahl’s Law
Sp(n) = 1/(1−α + α/n)
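The law above is a one-liner in code. A minimal sketch (the 0.95 parallel fraction and 1024 processors are illustrative values, not from the slides):

```python
def amdahl_speedup(alpha, n):
    """Sp(n) = 1 / ((1 - alpha) + alpha/n); alpha is the parallel fraction."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

# A perfectly parallel code (alpha = 1) scales linearly with n,
# but a 5% serial fraction caps 1024 processors below a 20x speed-up (~19.6).
print(amdahl_speedup(1.0, 1024))
print(amdahl_speedup(0.95, 1024))
```

As n grows, Sp(n) approaches 1/(1 − α): the serial fraction alone bounds the achievable speed-up.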
Asynchronous Parallel Processing Model
Critical Section Model Miura (1991)
[Figure: processors P1..P4 issuing requests through a queue into a critical section C; modeled as an M/M/1 queuing system]
Asynchronous Case (Miura)
Load Imbalance Model
[Figure: n tasks T1, T2, T3, ..., Tn of unequal length, one per processor]
T = Σ Ti (i = 1, ..., n)
Sp(n) = T / max(Ti) ≤ n
The speed-up factor reaches n only if T1 = T2 = ... = Tn.
Synchronization Overhead Model (load balance is assumed)
Sp(n) = T(1)/T(n) = 1/(1/n + ε n) = n/(1 + ε n²)  (linear overhead)
Sp(n) = T(1)/T(n) = 1/(1/n + ε log2 n) = n/(1 + ε n log2 n)  (logarithmic overhead)
[Figure: speed-up vs. n for the two overhead models with ε = 0.001, plotted for n up to 50 and for n up to 2000]
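A sketch of the two overhead formulas with ε = 0.001 as in the figure; it also locates the maximum of the linear-overhead curve numerically:

```python
import math

def speedup_linear(n, eps):
    """Sp(n) = n / (1 + eps*n^2): synchronization cost grows linearly with n."""
    return n / (1.0 + eps * n * n)

def speedup_log(n, eps):
    """Sp(n) = n / (1 + eps*n*log2(n)): tree-structured (logarithmic) barrier."""
    return n / (1.0 + eps * n * math.log2(n))

eps = 0.001
# The linear-overhead curve peaks near n = 1/sqrt(eps) and then falls:
# adding processors past that point makes the program slower.
best_n = max(range(1, 2001), key=lambda n: speedup_linear(n, eps))
print(best_n, speedup_linear(best_n, eps))
```

With ε = 0.001 the linear model tops out around n = 32, while the logarithmic model still delivers useful speed-up at n = 1024.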
Hockney’s Performance Model (n1/2 Model)
T(n) = T_OH + n·τ  (total time = overhead + execution time)
P(n) = n/T(n) = R∞ · n/(n + n_1/2)
R∞: peak performance (= 1/τ); n_1/2: the n which gives half of peak performance (= T_OH/τ)
A performance model for vector and parallel supercomputers.
[Figure: T(n) rising linearly from the intercept T_OH; P(n) approaching R∞ and crossing R∞/2 at n = n_1/2]
Note: n refers to the problem size, not the number of processors.
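The model in a few lines; the values of τ and T_OH are made up for illustration:

```python
def total_time(n, t_oh, tau):
    """Hockney: T(n) = T_OH + n * tau (startup overhead plus per-element time)."""
    return t_oh + n * tau

def performance(n, t_oh, tau):
    """P(n) = n / T(n) = R_inf * n / (n + n_half)."""
    return n / total_time(n, t_oh, tau)

tau, t_oh = 0.01, 0.5      # illustrative values: R_inf = 1/tau = 100
n_half = t_oh / tau        # the problem size at which P(n) = R_inf / 2
print(n_half, performance(n_half, t_oh, tau))
```

At n = n_1/2 the achieved performance is exactly half of R∞; only for n much larger than n_1/2 does the machine approach its peak.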
SIMD Computational Model ( Data Parallel)
• Simple parallelism (vector, matrix): ci = ai + bi, C = A + B
• Reduction: s = a1 + a2 + a3 + a4 + ...
• Broadcast: ai = s
• Shift/Rotate: ai = b(i-k)
• Recurrence: ai = a(i-1) + bi ← Problem!
Note: vector processing is also included in this category.
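The primitives above, sketched on plain Python lists (a SIMD machine would execute each bracketed expression in lock-step across all lanes):

```python
a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]

c = [ai + bi for ai, bi in zip(a, b)]            # simple parallelism: c_i = a_i + b_i
s = sum(a)                                        # reduction: s = a_1 + a_2 + ...
bcast = [s] * len(a)                              # broadcast: a_i = s
k = 1
shifted = [b[i - k] for i in range(k, len(b))]    # shift: a_i = b_(i-k)

# Recurrence: a_i = a_(i-1) + b_i. Each step depends on the previous result,
# so it cannot be issued as one independent data-parallel operation.
r = [b[0]]
for bi in b[1:]:
    r.append(r[-1] + bi)
```

The first four map directly onto vector hardware; the recurrence is exactly the case the later slides rework (recursive doubling, cyclic reduction).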
SIMD Computational Model and Vector/Parallel Processing
- Both vectorization and parallelization detect identical but independently executable arithmetic operations
- Vectorization searches from the innermost loop outward; parallelization searches from the outermost loop inward
- The two are the same when partitioning data in simple loops or innermost loops
Examples where modification of the algorithm is necessary (1) - Simple Recurrence -
A(i) = k(i) A(i-1) is well suited to serial computing (data locality, better utilization of memory, etc.), but it unrolls to
A(i) = k(i) k(i-1) ... k(3) k(2) k(1) A(0)
[Figure: recursive doubling combines k1 k2 k3 k4 k5 k6 k7 k8 pairwise in log2 n steps]
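A sketch of recursive doubling for this recurrence: all prefix products k(i)...k(1)·A(0) emerge in ⌈log2 n⌉ sweeps, each of which is a fully data-parallel update:

```python
def prefix_products(k, a0=1.0):
    """All-prefix products by recursive doubling.
    After the sweep with stride d, p[i] holds the product of k[i-2d+1 .. i];
    on a SIMD machine every lane of each sweep updates simultaneously."""
    p = list(k)
    n = len(p)
    d = 1
    while d < n:
        # Build the new vector from the old one: p[i] *= p[i - d].
        p = [p[i] * p[i - d] if i - d >= 0 else p[i] for i in range(n)]
        d *= 2
    return [a0 * pi for pi in p]
```

For n = 8 this takes 3 sweeps instead of 7 serial multiply steps; the trade is more total multiplications for far fewer dependent steps.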
Examples where modification of the algorithm is necessary (2) - Linear Recurrence -
a(i) = k(i) a(i-1) + b(i) can be written in matrix form:
( a(i) )   ( k(i)  b(i) ) ( a(i-1) )
(  1   ) = (  0     1   ) (   1    )
so that a(i) = M(i) M(i-1) ... M(2) M(1) a(0), with each M(i) a 2x2 matrix.
[Figure: the products M1 M2 M3 M4 M5 M6 M7 M8 combined pairwise by recursive doubling]
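A sketch of the matrix formulation. Because 2x2 matrix multiplication is associative, the chain of M's can be evaluated as a parallel (tree) reduction; here it is simply folded left to right to verify the algebra:

```python
def matmul2(X, Y):
    """Product of two 2x2 matrices."""
    return [[X[0][0]*Y[0][0] + X[0][1]*Y[1][0], X[0][0]*Y[0][1] + X[0][1]*Y[1][1]],
            [X[1][0]*Y[0][0] + X[1][1]*Y[1][0], X[1][0]*Y[0][1] + X[1][1]*Y[1][1]]]

def linear_recurrence(k, b, a0):
    """a_i = k_i*a_(i-1) + b_i via the product M_i...M_1, M_i = [[k_i, b_i], [0, 1]].
    Returns the whole sequence a_1..a_n."""
    M = [[1.0, 0.0], [0.0, 1.0]]          # running product, starts at identity
    out = []
    for ki, bi in zip(k, b):
        M = matmul2([[ki, bi], [0.0, 1.0]], M)
        out.append(M[0][0] * a0 + M[0][1])  # top row applied to (a0, 1)
    return out
```

Swapping the serial fold for a tree of matmul2 calls gives the parallel evaluation, at the cost of doing 2x2 matrix products instead of scalar multiply-adds.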
Examples of Recurrence Formulas for Iterative Methods
One-dimensional case:
K(i-1) C(i-1) + K(i) C(i) + K(i+1) C(i+1) = d(i)
Two-dimensional case:
K(i-1,j-1) C(i-1,j-1) + K(i-1,j+1) C(i-1,j+1) + K(i,j) C(i,j) + K(i+1,j-1) C(i+1,j-1) + K(i+1,j+1) C(i+1,j+1) = d(i)
Cyclic Reduction for Tridiagonal Equations
b1 c1          x1     k1
a2 b2 c2    *  x2  =  k2
   a3 b3 c3    x3     k3
(zeros outside the three diagonals; the pattern continues down the diagonal)
Cyclic Reduction(Serial)
Cyclic Reduction(Parallel)
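A serial sketch of cyclic reduction for the tridiagonal system above (assuming n = 2^k − 1 unknowns). In the parallel variant, all the eliminations within one sweep are mutually independent and can run simultaneously:

```python
def cyclic_reduction(a, b, c, d):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], with a[0] = c[-1] = 0.
    n = len(b) must be 2**k - 1.  Each sweep eliminates every other unknown,
    halving the system; the new equations do not depend on each other."""
    n = len(b)
    if n == 1:
        return [d[0] / b[0]]
    m = (n - 1) // 2
    a2, b2, c2, d2 = [0.0] * m, [0.0] * m, [0.0] * m, [0.0] * m
    for j in range(m):                      # independent -> parallelizable
        i = 2 * j + 1
        al = -a[i] / b[i - 1]               # eliminate x[i-1] via equation i-1
        be = -c[i] / b[i + 1]               # eliminate x[i+1] via equation i+1
        a2[j] = al * a[i - 1]
        b2[j] = b[i] + al * c[i - 1] + be * a[i + 1]
        c2[j] = be * c[i + 1]
        d2[j] = d[i] + al * d[i - 1] + be * d[i + 1]
    half = cyclic_reduction(a2, b2, c2, d2)  # solves for x at odd positions
    x = [0.0] * n
    for j in range(m):
        x[2 * j + 1] = half[j]
    for j in range(m + 1):                   # back-substitute even positions
        i = 2 * j
        left = x[i - 1] if i > 0 else 0.0
        right = x[i + 1] if i < n - 1 else 0.0
        x[i] = (d[i] - a[i] * left - c[i] * right) / b[i]
    return x
```

Each level costs O(n) arithmetic but only O(1) parallel steps, so the whole solve takes O(log n) dependent steps versus O(n) for the serial Thomas algorithm.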
Random Number Generation Algorithms
(1) Linear Congruential Method: X(n) = (a X(n-1) + c) mod M, where M = 2^j (usually the machine word size)
    Multiplicative (c = 0): period = 2^(j-2)
    Mixed (c ≠ 0): period = 2^j
(2) Binary M-sequence with a Primitive Trinomial: X(n) = (X(n-m) xor X(n-k)) mod 2 → period = 2^k - 1 (m < k)
(3) Generalized Fibonacci Method with a Primitive Trinomial: X(n) = (X(n-m) op X(n-k)) mod M → period = (2^k - 1) 2^(j-1) ~ 2^(k+j-1), where op is {+, -, *}
(4) Generalized Recurrence Method with a Large Prime Modulus (Multiple Recursive Generator, MRG):
    X(n) = (a1 X(n-1) + a2 X(n-2) + a3 X(n-3) + ... + ak X(n-k)) mod p → period = p^k - 1 ~ 2^(j·k), where p ~ 2^j is a prime number (e.g., 2^31 - 1)
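A sketch of the mixed LCG of (1). The multiplier and increment here are the well-known Numerical Recipes constants, chosen purely for illustration:

```python
def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Mixed linear congruential generator: X_n = (a*X_(n-1) + c) mod M.
    With c != 0 and M = 2^j, suitable a and c give the full period 2^j."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

g = lcg(1)
sample = [next(g) for _ in range(5)]
```

Because the whole state is one machine word, an LCG vectorizes with the same jump-ahead trick shown for the MRG below: X(n+s) is again a linear function of X(n) mod M.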
An Example of an MRG with an 8th-order Full Primitive Polynomial
Consider p = 2^31 - 1 and k = 8:
f(x) = (x^8 - a1 x^7 - a2 x^6 - a3 x^5 - a4 x^4 - a5 x^3 - a6 x^2 - a7 x - a8) mod p
- Good lattice structure with full coefficients
- Simple and fast implementation with 64-bit arithmetic when the modulus is p = 2^31 - 1
- Long period: (2^31 - 1)^8 - 1 ~ 4.5 * 10^74
- Fraction of primitive polynomials: φ(p^k - 1)/k/(p^k - 1) = 2.2%
- Easy to extend for vector/parallel processing
Vectorization and Parallelization of MRG - Decimating the Sequence -
Transfer matrix A (the k = 8 companion matrix) and state vector x:

    | a1 a2 a3 a4 a5 a6 a7 a8 |        | x(n-1) |         | x(n)   |
    |  1  0  0  0  0  0  0  0 |        | x(n-2) |         | x(n-1) |
    |  0  1  0  0  0  0  0  0 |        | x(n-3) |         | x(n-2) |
A = |  0  0  1  0  0  0  0  0 | ,  x = | x(n-4) | ,  x' = | x(n-3) |
    |  0  0  0  1  0  0  0  0 |        | x(n-5) |         | x(n-4) |
    |  0  0  0  0  1  0  0  0 |        | x(n-6) |         | x(n-5) |
    |  0  0  0  0  0  1  0  0 |        | x(n-7) |         | x(n-6) |
    |  0  0  0  0  0  0  1  0 |        | x(n-8) |         | x(n-7) |

Then x' = A x mod p. Compute A^2, A^4, A^8, ... mod p once; the new polynomial is obtained by multiplying the matrices.
Vectorization and Parallelization of MRG (Continued)
In order to compute x(n) = A^n x(0) mod p:
(1) Store A(j) = A^(2^j) mod p (j = 0, 1, 2, 3, 4, ...).
(2) Represent n in binary form, e.g., (b(m-1), ..., b2, b1, b0).
(3) Multiply I by A(j) mod p whenever b(j) = 1 (j = 0, ..., m-1), where I is the k-th order identity matrix.
Note: the same strategy works with polynomial arithmetic (Knuth).
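A sketch of this jump-ahead scheme: binary exponentiation of the companion matrix mod p. The 2nd-order Fibonacci-style recurrence used in the test is only a toy stand-in for the 8th-order generator:

```python
P = 2**31 - 1  # Mersenne prime modulus from the slides

def matmul_mod(A, B, p=P):
    """k x k matrix product mod p."""
    k = len(A)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) % p for j in range(k)]
            for i in range(k)]

def matpow_mod(A, n, p=P):
    """A**n mod p by binary exponentiation: scan the bits of n, squaring A
    each step and multiplying the accumulator where the bit is 1."""
    k = len(A)
    R = [[1 if i == j else 0 for j in range(k)] for i in range(k)]  # identity
    while n:
        if n & 1:
            R = matmul_mod(R, A, p)
        A = matmul_mod(A, A, p)
        n >>= 1
    return R

def companion(coeffs):
    """Transfer matrix of x_n = a1*x_(n-1) + ... + ak*x_(n-k) (mod p)."""
    k = len(coeffs)
    return [list(coeffs)] + [[1 if j == i else 0 for j in range(k)] for i in range(k - 1)]
```

With A^s precomputed for a stride s, each of s processors (or vector lanes) advances its own decimated subsequence x(i), x(i+s), x(i+2s), ... independently.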
SPMD Computational Model -Application of Data Parallel to MIMD Architecture-
• A “SINGLE” program for all processors
• Multiple instruction streams at execution time
• Each processor takes care of a portion of the large-sized data; the behavior is similar to SIMD, but locally independent operations (say, conditional branches) are allowed
Hence the need for various synchronization mechanisms and a way to describe them.
MIMD Computational Model -Control Parallel or Task Parallel-
• Master-Slave type operations
• Fork/Join constructs
• Synchronization mechanisms and their description (Barrier, Semaphore, Lock/Unlock)
• Data transfer in the case of Distributed Memory (Send/Receive protocol)
Examples: event-parallel transport Monte Carlo simulation, ray tracing
Parallel Prefix
Source: L. Snyder, “Parallel Programming”
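A sketch of a (Hillis-Steele style) parallel prefix scan; each of the log2 n sweeps is a single data-parallel step, and any associative operator can be plugged in:

```python
def inclusive_scan(x, op=lambda u, v: u + v):
    """Parallel prefix by pointer doubling: after the sweep with stride d,
    position i holds the combination of up to 2d trailing elements.
    Each sweep is one fully data-parallel update."""
    y = list(x)
    d = 1
    while d < len(y):
        y = [op(y[i - d], y[i]) if i >= d else y[i] for i in range(len(y))]
        d *= 2
    return y
```

With op = + this computes running sums; with op = max, running maxima; the same skeleton serves carry-lookahead, list ranking, and the recurrence tricks on the earlier slides.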
Three Models of Dense Matrix Multiplications
• Inner Product:    Cij = Σk Aik * Bkj
Do i = 1, n
  Do j = 1, n
    Do k = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
• Middle Product:
Do j = 1, n
  Do k = 1, n
    Do i = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
• Outer Product:
Do k = 1, n
  Do j = 1, n
    Do i = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
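All three orderings perform the identical updates C(i,j) += A(i,k)*B(k,j); only the traversal order, and hence the memory-access pattern, differs. A sketch verifying this on small lists of lists:

```python
def matmul(A, B, order="ijk"):
    """Dense matrix product with the three loop orders from the slides:
    'ijk' = inner product, 'jki' = middle product, 'kji' = outer product."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    loops = {
        "ijk": [(i, j, k) for i in range(n) for j in range(n) for k in range(n)],
        "jki": [(i, j, k) for j in range(n) for k in range(n) for i in range(n)],
        "kji": [(i, j, k) for k in range(n) for j in range(n) for i in range(n)],
    }
    for i, j, k in loops[order]:
        C[i][j] += A[i][k] * B[k][j]
    return C
```

The choice matters in practice: with column-major (Fortran) storage, the middle- and outer-product forms stride through columns of A and C contiguously, which is why they suit vector machines better than the inner-product form.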
Three Models of Dense Matrix Multiplications (1)
• Inner Product:
Do i = 1, n
  Do j = 1, n
    sum = 0
    Do k = 1, n
      sum = sum + A(i,k)*B(k,j)
    enddo
    C(i,j) = sum
  enddo
enddo
Cij = Σk Aik * Bkj
[Figure: one element of C computed as the dot product of a row of A and a column of B]
Three Models of Dense Matrix Multiplications (2)
• Middle Product:
Do j = 1, n
  Do k = 1, n
    Do i = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
Cij = Σk Aik * Bkj
[Figure: a column of C accumulated as columns of A scaled by single elements of B]
Three Models of Dense Matrix Multiplications(3)
• Outer Product:
Do k = 1, n
  Do j = 1, n
    Do i = 1, n
      C(i,j) = C(i,j) + A(i,k)*B(k,j)
    enddo
  enddo
enddo
Cij = Σk Aik * Bkj
[Figure: C accumulated as a sum of rank-1 (outer) products of column k of A and row k of B]
Strassen’s Algorithm for Matrix Multiply
• Multiplication of two matrices is one of the most basic operations of linear algebra and scientific computing.
• The conventional standard algorithm for n x n matrices requires O(n^3) operations.
• Strassen’s algorithm, introduced in 1969, requires O(n^log2(7)) ≈ O(n^2.807) operations.
• Another Divide-and-Conquer approach.
References:
1. V. Strassen, “Gaussian Elimination is Not Optimal”, Numerische Mathematik, 13:354-356, 1969.
2. S. Huss-Lederman et al., “Implementation of Strassen’s Algorithm for Matrix Multiplication”, SC96 Technical Paper.
Conventional Matrix Multiply
C = A * B, where A, B, C are n x n matrices.
No. of scalar multiplications = n^3
No. of scalar additions = n^3 - n^2
Total no. of arithmetic operations = 2n^3 - n^2
Conventional Matrix Multiply (with submatrices)
| C11 C12 |   | A11 A12 |   | B11 B12 |
| C21 C22 | = | A21 A22 | * | B21 B22 |
C11 = A11 B11 + A12 B21 C12 = A11 B12 + A12 B22
C21 = A21 B11 + A22 B21 C22 = A21 B12 + A22 B22
No. of matrix multiplications = 8
No. of matrix additions = 4
Total arithmetic operations = 8(2 (n/2)3 – (n/2)2) + 4 (n/2)2 = 2 n3 - n2
Strassen’s Algorithm - 1/3
• Strassen’s method uses fewer multiplications, offset by more additions and subtractions.
• For each pair of sub-matrices there are 7 multiplications and 18 additions/subtractions.
• Among these, the 7 multiplications and 10 of the additions/subtractions occur in the steps that calculate the P’s.
| C11 C12 |   | A11 A12 |   | B11 B12 |
| C21 C22 | = | A21 A22 | * | B21 B22 |
C11 = P1 + P4 - P5 + P7
C12 = P3 + P5
C21 = P2 + P4
C22 = P1 + P3 - P2 + P6

P1 = (A11 + A22)(B11 + B22)
P2 = (A21 + A22) B11
P3 = A11 (B12 - B22)
P4 = A22 (B21 - B11)
P5 = (A11 + A12) B22
P6 = (A21 - A11)(B11 + B12)
P7 = (A12 - A22)(B21 + B22)
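The seven products and four recombinations above, written out on scalar operands (one level of the recursion; in practice each entry is an (n/2) x (n/2) block):

```python
def strassen_2x2(A, B):
    """One level of Strassen: 7 multiplications, 18 additions/subtractions,
    exactly the P1..P7 combinations from the slide."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    p1 = (a11 + a22) * (b11 + b22)
    p2 = (a21 + a22) * b11
    p3 = a11 * (b12 - b22)
    p4 = a22 * (b21 - b11)
    p5 = (a11 + a12) * b22
    p6 = (a21 - a11) * (b11 + b12)
    p7 = (a12 - a22) * (b21 + b22)
    return [[p1 + p4 - p5 + p7, p3 + p5],
            [p2 + p4, p1 + p3 - p2 + p6]]
```

Replacing the scalar `*` by a recursive call on the blocks gives the full O(n^log2(7)) algorithm.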
Strassen’s Algorithm - 2/3
• On 2 x 2 matrices (with n x n blocks as elements), the count of arithmetic operations is:

              Mult   Add   Complexity
Conventional    8      4   16n^3 - 4n^2
Strassen        7     18   14n^3 + 11n^2

• One matrix multiplication is replaced by 14 matrix additions.
Strassen’s Algorithm - 3/3
• When one level of Strassen’s algorithm is applied to a 2 x 2 matrix of (n/2) x (n/2) blocks and the conventional algorithm is used for the seven block multiplications, the total operation count is
7(2(n/2)^3 - (n/2)^2) + 18(n/2)^2 = (7/4)n^3 + (11/4)n^2
R = Strassen operation count / Conventional operation count = (7n^3 + 11n^2) / (8n^3 - 4n^2)
lim (n→∞) R = 7/8, a 12.5% improvement!
Note: the code can be accessed from: http://www-unix.mcs.anl.gov/prism/lib/software.html
Winograd’s Variant of Strassen’s Algorithm
• Based on Strassen’s algorithm, Winograd reduced the number of additions/subtractions by 3 by rearranging the calculation into 4 stages.

Stage 1          Stage 2          Stage 3        Stage 4
S1 = A21 + A22   T1 = B12 - B11   P1 = A11 B11   U1 = P1 + P2
S2 = S1 - A11    T2 = B22 - T1    P2 = A12 B21   U2 = P1 + P4
S3 = A11 - A21   T3 = B22 - B12   P3 = S1 T1     U3 = U2 + P5
S4 = A12 - S2    T4 = B21 - T2    P4 = S2 T2     U4 = U3 + P7
                                  P5 = S3 T3     U5 = U3 + P3
                                  P6 = S4 B22    U6 = U2 + P3
                                  P7 = A22 T4    U7 = U6 + P6

C11 = U1, C12 = U7, C21 = U4, C22 = U5
7 multiplications, 15 additions/subtractions.
(It takes 7^k multiplications and 5(7^k - 4^k) additions to multiply 2^k x 2^k matrices.)
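The four stages as straight-line code on scalar operands (block operands in practice), directly transcribing the table above:

```python
def winograd_2x2(A, B):
    """Winograd's variant of Strassen: 7 multiplications and
    15 additions/subtractions, organized in the four stages on the slide."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    # Stage 1: sums of A blocks
    s1 = a21 + a22; s2 = s1 - a11; s3 = a11 - a21; s4 = a12 - s2
    # Stage 2: sums of B blocks
    t1 = b12 - b11; t2 = b22 - t1; t3 = b22 - b12; t4 = b21 - t2
    # Stage 3: the seven multiplications
    p1 = a11 * b11; p2 = a12 * b21; p3 = s1 * t1; p4 = s2 * t2
    p5 = s3 * t3; p6 = s4 * b22; p7 = a22 * t4
    # Stage 4: recombination
    u1 = p1 + p2; u2 = p1 + p4; u3 = u2 + p5
    u4 = u3 + p7; u5 = u3 + p3; u6 = u2 + p3; u7 = u6 + p6
    return [[u1, u7], [u4, u5]]
```

Note how the U's chain off one another (U3 from U2, U4 and U5 from U3): reusing intermediate sums is exactly where the three additions are saved relative to Strassen's original recombination.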
Usage of Strassen’s Algorithm
• Usage of Strassen’s algorithm is limited because:
  - Additional memory is required to store the P matrices.
  - More memory traffic is necessary, so memory bandwidth plays a key role.
• Loss of significance in Strassen’s algorithm:
  - Caused by adding relatively large and very small numbers.
Performance example (over the Cray library MXM on a Cray-2): n = 64 => x 1.35; n = 2048 => x 2.01 (David Bailey, 1988)
Fourier Transform and FFT
• Discrete Fourier Transform(DFT)
Z(i) = (1/n) Σk ω^(i k) X(k), where ω = exp(-2πj/n)
O(n^2) → O(n log n)
Fourier Transform and FFT
• Butterfly operation (DIT:Decimation in Time)
Z = X + αn^k * Y
W = X - αn^k * Y
where αn = exp(2πj/n)
[Figure: butterfly diagram — inputs X and Y; Y is multiplied by the twiddle factor αn^k, then the sum gives Z and the difference gives W]
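A recursive radix-2 DIT sketch built from this butterfly. Note it uses the exp(−2πj/n) sign of the DFT definition on the previous slide, whereas the butterfly here is written with αn = exp(2πj/n):

```python
import cmath

def fft_dit(x):
    """Radix-2 decimation-in-time FFT.  Each stage applies the butterfly
    Z = X + alpha^k * Y, W = X - alpha^k * Y to an even/odd half-transform.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])   # transform of even-indexed samples
    odd = fft_dit(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle * Y
        out[k] = even[k] + tw               # Z = X + alpha^k * Y
        out[k + n // 2] = even[k] - tw      # W = X - alpha^k * Y
    return out
```

Each level performs n/2 independent butterflies, which is what makes the transform both O(n log n) in work and naturally data-parallel.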
Fourier Transform and FFT
• Butterfly operation (DIF: Decimation in Frequency)
Z = X + Y
W = (X - Y) * αn^k
where αn = exp(2πj/n)
[Figure: butterfly diagram — inputs X and Y; the sum gives Z, and the difference is multiplied by the twiddle factor αn^k to give W]
Number of Operations for Complex Fourier Transform ( N:power of 2)
FFT: per butterfly, add/sub = 6, mult = 4; for N points, total ops = 5 N log2 N
DFT (complex matrix-vector multiplication): per point, add/sub = 4N - 2, mult = 4N; for N points, total ops = 8N^2 - 2N = 2N(4N - 1)

If the efficiency of the FFT is 1%, a 512-point DFT is faster than the FFT.
If the efficiency of the FFT is 3%, a 128-point DFT is faster than the FFT.
If the efficiency of the FFT is 5%, a 64-point DFT is faster than the FFT.
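The break-even claims can be checked directly from the two operation counts (reading "efficiency" as the ratio of FFT work to DFT work):

```python
import math

def fft_ops(n):
    """Total real operations for an N-point FFT: 5 N log2 N."""
    return 5 * n * math.log2(n)

def dft_ops(n):
    """Total real operations for a direct N-point DFT: 2 N (4N - 1)."""
    return 2 * n * (4 * n - 1)

# The FFT does roughly 5.9%, 3.4%, and 1.1% of the DFT's work at these sizes,
# matching the slide's 5% / 3% / 1% efficiency thresholds.
for n in (64, 128, 512):
    print(n, 100.0 * fft_ops(n) / dft_ops(n))
```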
Fast Fourier Transform
FFT (Isogeometric)
FFT (Self-sorting variant due to Stockham)
Applications
Concept of Numerical Weather Prediction (Richardson, 1922)
Simulation of Precipitation
Simulation of typhoon
Source: Earth Simulator Center
Simulation of Tsunami (March 11, 2011)
Source: Prof. Imamura, Tohoku Univ.
Seismic Data Processing for Oil Exploration
Application Areas for Petaflops
• Computational testing and simulation as a replacement for weapons testing (stockpile stewardship)
• Simulation of plasma fusion devices and basic physics for controlled fusion (to optimize design of future reactors)
• Design of new chemical compounds and synthesis pathways (environmental safety and cost improvements)
• Comprehensive modeling of groundwater and oil reservoirs (contamination and management)
• Modeling of complex transportation, communication and economic systems
• Time-dependent simulations of complex biomolecules (membranes, synthesis machinery and DNA)
• Multidisciplinary optimization problems combining structures, fluids and geometry
• Modeling of integrated earth systems (ocean, atmosphere, bio-geosphere)
• Improved 4d/6d data assimilation capability applied to remote sensing and environmental models
• Computational cosmology (integration of particle models, astrophysical fluids and radiation transport)
• Materials simulations that bridge the gap between microscale and macroscale (bulk materials)
• Coupled electro-mechanical simulations of nano-scale structures (dynamics and mechanics of micromachines)
• Full plant optimization for complex processes (chemical, manufacturing and assembly problems)
• High-resolution reacting flow problems (combustion, chemical mixing and multiphase flow)
• High-realism immersive virtual reality based on real-time radiosity modeling and complex scenes