Analysis and Optimization of CGPOP
Hongtao Cai, Xiaoxiang Hu, Haoruo Peng
Department of CST, Tsinghua University
SIAM Annual Meeting, July 9, 2012
Acknowledgment
Prof. Xiaoge Wang, Prof. Wei Xue
Support from the State 863 Project Fund
Support from the Explore-100, Tianhe-1A, and Shenwei supercomputer systems
Support from SIAM
Outline
Background
Research
Analysis of original PCG Method in CGPOP
Optimizations:
Chebyshev
Richardson-PCG
Richardson-Chebyshev
Experiments
Future Work
Parallel Ocean Program
The crucial role of oceans in global climate
Cover 70% of the Earth's surface
Water has 1000 times the heat capacity of air
Repository of carbon (93%)
Transport heat
POP: surface pressure of oceans [1]
Conjugate Gradient Parallel Ocean Program (CGPOP)
Three computation parts: Barotropic, 3D-update, Baroclinic
Barotropic computation dominates when the core count exceeds 10,000 [2]
CGPOP contains the core part of the Barotropic computation
Conjugate Gradient Parallel Ocean Program (CGPOP)
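Linear equation system in every time step
$$\left[\nabla \cdot H \nabla - \frac{1}{g\alpha\tau\Delta t}\right]\eta^{n+1} = \nabla \cdot H\left[\frac{U}{g\alpha\tau} + \nabla\eta^{n-1}\right] - \frac{\eta^{n}}{g\alpha\tau\Delta t} - \frac{q_W^{n}}{g\alpha\tau}$$
$Ax = b$
(A is a real, sparse, symmetric, positive-definite matrix)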
Our work: exploring new algorithms in CGPOP, with experiments on top supercomputers in the world.
ChronGear (Chronopoulos-Gear) Preconditioned Conjugate Gradient Solver
Legend: matrix-vector multiplication, dot product, Daxpy, communication
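The solver on this slide can be sketched in a few lines. The following is a minimal serial NumPy sketch of a Chronopoulos-Gear style PCG, not the CGPOP Fortran code. It assumes a diagonal preconditioner passed as `M_inv_diag`; the function name, arguments, and convergence test are illustrative choices. The point is the structure of one iteration: one matrix-vector product, four axpy-style updates, and three dot products that a parallel code can fuse into a single global reduction.

```python
import numpy as np

def chrongear_pcg(A, b, M_inv_diag, tol=1e-10, max_iter=1000):
    """Chronopoulos-Gear style PCG sketch (serial, diagonal preconditioner assumed)."""
    x = np.zeros_like(b)
    r = b.copy()                 # residual for the initial guess x = 0
    s = np.zeros_like(b)         # search direction
    p = np.zeros_like(b)         # A applied to the search direction
    rho_old, sigma = 1.0, 0.0
    b_norm2 = b @ b
    for it in range(max_iter):
        z = M_inv_diag * r       # apply the preconditioner
        az = A @ z               # the single matrix-vector product
        # Three dot products; in parallel they share one global reduction.
        rho, delta, r_norm2 = r @ z, z @ az, r @ r
        if r_norm2 <= tol**2 * b_norm2:
            return x, it
        beta = rho / rho_old
        sigma = delta - beta**2 * sigma   # equals s.T @ A @ s without an extra product
        alpha = rho / sigma
        s = z + beta * s         # axpy 1
        p = az + beta * p        # axpy 2
        x += alpha * s           # axpy 3
        r -= alpha * p           # axpy 4
        rho_old = rho
    return x, max_iter
```

Because the three scalars are computed back to back from data already in hand, a distributed version needs only one all-reduce per iteration, which is why the dot product is the communication-critical step.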
PCG Solver
Figure: Percentage of time consumed by dot product (on the Shenwei supercomputer)
Three Variants
1S1D / 2S2D / 2S1D
1-sided MPI (1S): put/get
2-sided MPI (2S): send/receive
2D: direct data access, more memory
1D: ocean points stored compactly; less memory, indirect data access
Figure: 2D and 1D data layouts
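The 2D versus 1D trade-off can be illustrated with a toy example. The sketch below is not CGPOP code: the 4x5 land/ocean mask, the array names, and the helper `east_neighbor_1d` are all hypothetical. It shows that the 1D layout stores only ocean points, while neighbor access goes through an index lookup instead of direct `[i, j + 1]` addressing.

```python
import numpy as np

# Hypothetical 4x5 land/ocean mask (True = ocean point).
mask = np.array([[1, 1, 0, 0, 1],
                 [1, 1, 1, 0, 1],
                 [0, 1, 1, 1, 1],
                 [0, 0, 1, 1, 0]], dtype=bool)
ncols = mask.shape[1]

# 2D layout: one value per grid cell, land included.
field_2d = np.arange(mask.size, dtype=np.float64).reshape(mask.shape)

# 1D layout: only ocean points, stored compactly.
ocean_idx = np.flatnonzero(mask)            # flat grid position of each ocean point
field_1d = field_2d.ravel()[ocean_idx]

# Indirect addressing: map flat grid position -> compact index (-1 for land).
lookup = np.full(mask.size, -1)
lookup[ocean_idx] = np.arange(ocean_idx.size)

def east_neighbor_1d(k):
    """Compact index of the east neighbor of compact point k, or -1 if land or edge."""
    i, j = divmod(int(ocean_idx[k]), ncols)
    if j + 1 >= ncols:
        return -1
    return int(lookup[i * ncols + j + 1])
```

In this toy grid the compact array holds 13 of 20 values, at the price of one extra lookup per neighbor access.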
Three Variants
Figure: Total time for one time step (on Tianhe-1A)
Analysis Conclusions
Dot products consume a large share of the solver time
Three variants: 2S1D selected as the benchmark
Chebyshev
Mat-vec Mul, Daxpy, No Dot Product
Chebyshev
PCG: (4 Daxpy + 1 MV + 3 DP) × total_iter1
CBS: (3 Daxpy + 1 MV) × total_iter2
DP: dot product; MV: matrix-vector multiplication; CBS: Chebyshev
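The per-iteration count of 3 Daxpy and 1 MV with no dot product can be seen in a textbook Chebyshev iteration. The sketch below is an unpreconditioned serial NumPy version, not the CGPOP implementation. It assumes the caller supplies bounds `lam_min` and `lam_max` on the eigenvalues of A; those names and the fixed iteration count are illustrative assumptions.

```python
import numpy as np

def chebyshev_solve(A, b, lam_min, lam_max, num_iter):
    """Chebyshev iteration for SPD A with eigenvalues in [lam_min, lam_max]."""
    theta = 0.5 * (lam_max + lam_min)    # center of the spectrum interval
    delta = 0.5 * (lam_max - lam_min)    # half-width of the interval
    sigma = theta / delta
    rho = 1.0 / sigma
    x = np.zeros_like(b)
    r = b.copy()                         # residual for the initial guess x = 0
    d = r / theta                        # first update direction
    for _ in range(num_iter):
        x += d                           # axpy 1
        r -= A @ d                       # matrix-vector product + axpy 2
        rho_new = 1.0 / (2.0 * sigma - rho)
        d = (rho_new * rho) * d + (2.0 * rho_new / delta) * r   # axpy 3
        rho = rho_new
    return x
```

All scalars come from a fixed recurrence, so no global reduction is needed inside the loop. The cost is that the method needs eigenvalue bounds up front and usually takes more iterations than PCG, which is the total_iter2 versus total_iter1 trade-off in the counts above.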
Richardson-PCG
Single precision: faster [5]
A processor can operate on 2 doubles or 4 singles at a time
Lower memory pressure
Double precision: more accurate
Mix them up
Richardson-PCG
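Richardson Method (Splitting Method)
$Ax = b,\quad A = M - N \;\Rightarrow\; x = M^{-1}Nx + M^{-1}b$
Iteration: $x_{k+1} \leftarrow M^{-1}Nx_k + M^{-1}b$, which converges when $\rho(M^{-1}N) < 1$
Our Motivation: let $M = A_{\text{float}}$, so that $M^{-1}N = I - M^{-1}A \approx 0$
Our Method: $x_{k+1} \leftarrow x_k + M^{-1}(b - Ax_k)$
Same as solving $A_{\text{float}}\,\Delta x = b - Ax_k$
Approximation: the inner solve only needs to meet a tolerance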
Richardson-PCG
$x_{k+1} \leftarrow x_k + M^{-1}(b - Ax_k)$
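The update above can be sketched as a mixed-precision loop: the residual is formed in double precision, the correction is solved for in single precision, and the result is added back in double. This is a minimal NumPy sketch under stated assumptions, not the authors' code. The inner solver is a plain unpreconditioned CG, and the names `cg_single`, `richardson_pcg`, `inner_tol`, and `outer_tol` are hypothetical.

```python
import numpy as np

def cg_single(A32, b32, tol, max_iter=500):
    """Plain CG run entirely in float32 (inner solver, unpreconditioned here)."""
    x = np.zeros_like(b32)
    r = b32.copy()
    p = r.copy()
    rs = r @ r
    b_norm = np.sqrt(b32 @ b32)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * b_norm:
            break
        Ap = A32 @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def richardson_pcg(A, b, outer_tol=1e-12, inner_tol=1e-4, max_outer=50):
    """Outer Richardson loop in float64 around a float32 inner solve."""
    A32 = A.astype(np.float32)                  # convert matrix once (CM)
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for outer in range(max_outer):
        r = b - A @ x                           # double-precision residual
        if np.linalg.norm(r) <= outer_tol * b_norm:
            return x, outer
        dx32 = cg_single(A32, r.astype(np.float32), inner_tol)   # CV, then inner solve
        x += dx32.astype(np.float64)            # CV, then double-precision update
    return x, max_outer
```

The inner tolerance can be loose because each outer step recomputes the true residual in double precision, so single-precision error does not accumulate in the final answer.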
Richardson-PCG
PCG: (4 Daxpy + 1 DMV + 3 DDP) × total_iter1
Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP) × total_iter3 + (2 CV + 2 Daxpy + 1 DMV + 1 DDP) × outer_iter
DMV: double mat-vec mul; DDP: double dot product; Daxpy: double-precision axpy
SMV: single mat-vec mul; SDP: single dot product; Saxpy: single-precision axpy
CV: convert vector; CM: convert matrix
Richardson-Chebyshev
Rich-CBS: 1 CM + (3 Saxpy + 1 SMV) × total_iter4 + (2 CV + 2 Daxpy + 1 DMV + 1 DDP) × outer_iter4
Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP) × total_iter3 + (2 CV + 2 Daxpy + 1 DMV + 1 DDP) × outer_iter3
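The Rich-CBS combination can be sketched by replacing the inner solver with a fixed number of single-precision Chebyshev steps, which leaves one double-precision dot product per outer iteration as the only reduction. This is an illustrative NumPy sketch, not the authors' implementation. The eigenvalue bounds `lam_min` and `lam_max`, the fixed `inner_steps`, and both function names are assumptions.

```python
import numpy as np

def chebyshev_steps(A32, r32, lam_min, lam_max, steps):
    """Approximate solve of A d = r with a fixed number of float32 Chebyshev steps."""
    f32 = np.float32
    theta = f32(0.5 * (lam_max + lam_min))
    delta = f32(0.5 * (lam_max - lam_min))
    sigma = theta / delta
    rho = f32(1.0) / sigma
    x = np.zeros_like(r32)
    r = r32.copy()
    d = r / theta
    for _ in range(steps):
        x += d                                     # Saxpy 1
        r -= A32 @ d                               # SMV + Saxpy 2
        rho_new = f32(1.0) / (f32(2.0) * sigma - rho)
        d = (rho_new * rho) * d + (f32(2.0) * rho_new / delta) * r   # Saxpy 3
        rho = rho_new
    return x

def richardson_chebyshev(A, b, lam_min, lam_max, inner_steps=12,
                         tol=1e-12, max_outer=50):
    """Float64 Richardson loop around a float32 Chebyshev inner solve."""
    A32 = A.astype(np.float32)                     # convert matrix once (CM)
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for outer in range(max_outer):
        r = b - A @ x                              # DMV + Daxpy
        if np.linalg.norm(r) <= tol * b_norm:      # the only dot product (DDP)
            return x, outer
        d32 = chebyshev_steps(A32, r.astype(np.float32),
                              lam_min, lam_max, inner_steps)   # CV + inner steps
        x += d32.astype(np.float64)                # CV + Daxpy
    return x, max_outer
```

With a fixed inner step count, the inner loop has no convergence test and therefore no reduction, which matches the absence of SDP in the Rich-CBS count.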
Experiments
Our supercomputers:
Tianhe-1A
CPU: 2.93 GHz Intel Xeon X5670
Memory: 32/48 GB per node. Bandwidth: 40 GB/s
Network: 160 Gbps, 22 ns. Fat-tree structure
Shenwei
CPU: 1.1 GHz Shenwei processor
Memory: 32 GB per node. Bandwidth: 68 GB/s (list result)
Network: crossbar for every 256 CPUs. Fat-tree structure
Experiments on Tianhe-1A
Experiments on Shenwei
Conclusion
Two techniques
Reducing dot products
Effective at large core counts (more than 5000)
Mixed precision
Effective at small core counts (fewer than 1000)
Future Work
Complete the investigation of the current code
Integrate the optimization techniques into our ocean modeling programs
Apply our methods to other parallel programs
References
[1] R. Smith, P. Gent, “Reference Manual for the Parallel Ocean Program (POP)”, May 2002, pp. 1-74.
[2] A. Stone, J. M. Dennis, M. M. Strout, “The CGPOP Miniapp, Version 1.0”, July 2011, pp. 4-5.
[3] Y. Saad, A. Sameh, P. Saylor, “Solving elliptic difference equations on a linear array of processors”, SIAM J. Sci. Stat. Comput., Vol. 6, No. 4, October 1985, pp. 1049-1063.
[4] E. Stiefel, “Kernel polynomials in linear algebra and their numerical applications”, Nat. Bur. Standards, Appl. Math. Series 49, 1958, pp. 1-22.
[5] A. Buttari, E. Lyon, J. Dongarra, “Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy”, ACM Transactions on Math. Software, Vol. 34, No. 4, Article 17, pp. 1-8.