CGPOP Analysis and Optimization


Transcript of CGPOP Analysis and Optimization

Page 1: CGPOP Analysis and Optimization

Analysis and Optimization of CGPOP

Hongtao Cai, Xiaoxiang Hu, Haoruo Peng Department of CST, Tsinghua University

SIAM Annual Meeting, July 9, 2012

Page 2: CGPOP Analysis and Optimization

Acknowledgment

Prof. Xiaoge Wang, Prof. Wei Xue

Support from the State 863 Project Fund

Support from the Explore-100, Tianhe-1A, and Shenwei supercomputer systems

Support from SIAM


Page 3: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method in CGPOP

Optimizations:

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 4: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method in CGPOP

Optimizations:

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 5: CGPOP Analysis and Optimization

Parallel Ocean Program

The crucial role of Oceans in Global Climate

Cover 70% of the Earth's surface

Water has about 1000 times the heat capacity of air

Repository of carbon (93%)

Transport heat

POP: solves for the surface pressure of the oceans [1]


Page 6: CGPOP Analysis and Optimization

Conjugate Gradient Parallel Ocean Program (CGPOP)

Three computation parts: Barotropic, 3D-update, Baroclinic

The Barotropic computation dominates when the core count exceeds 10,000 [2]

CGPOP contains the core part of the Barotropic computation


Page 7: CGPOP Analysis and Optimization

Conjugate Gradient Parallel Ocean Program (CGPOP)

Linear equation system in every time step

$$\left[\nabla \cdot H\nabla - \frac{1}{g\alpha\tau\,\Delta t}\right]\eta^{n+1} = \nabla \cdot H\left[\frac{U}{g\alpha\tau} + \nabla\eta^{n-1}\right] - \frac{\eta^{n}}{g\alpha\tau\,\Delta t} - \frac{qW^{n}}{g\alpha\tau}$$

Ax = b

(A is a real, sparse, symmetric, positive-definite matrix)

Our work: exploring new algorithms in CGPOP, with experiments on some of the world's top supercomputers.


Page 8: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method in CGPOP

Optimizations:

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 9: CGPOP Analysis and Optimization

Chron-Gear Preconditioned Conjugate Gradient Solver

Per-iteration operations: matrix-vector multiplication, dot product, daxpy. The matrix-vector product and the dot products are the communication points (halo exchanges and global reductions, respectively); daxpy is purely local.
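For reference, a minimal preconditioned CG sketch in Python/NumPy. It is the textbook formulation rather than the exact Chron-Gear variant used in CGPOP, and the dense operators are illustrative stand-ins for the distributed stencil:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-8, max_iter=1000):
    """Textbook preconditioned CG. Per iteration: 1 mat-vec,
    3 dot products, and ~4 axpys. In a distributed run each dot
    product becomes a global reduction, which is what dominates
    communication at scale."""
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    r = b - A @ x                    # initial residual (1 mat-vec)
    z = M_inv @ r                    # apply preconditioner
    p = z.copy()
    rz = r @ z                       # dot product (global reduction)
    for _ in range(max_iter):
        Ap = A @ p                   # mat-vec (halo exchange in parallel)
        alpha = rz / (p @ Ap)        # dot product
        x += alpha * p               # axpy
        r -= alpha * Ap              # axpy
        if np.linalg.norm(r) <= tol * b_norm:  # norm = dot product + sqrt
            break
        z = M_inv @ r                # preconditioner
        rz_new = r @ z               # dot product
        p = z + (rz_new / rz) * p    # axpy
        rz = rz_new
    return x
```

A diagonal preconditioner, e.g. `M_inv = np.diag(1.0 / np.diag(A))`, is enough to exercise the sketch.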

Page 10: CGPOP Analysis and Optimization

PCG Solver

[Figure: percentage of time consumed by the dot product, measured on the Shenwei supercomputer]


Page 11: CGPOP Analysis and Optimization

Three Variants

1S1D / 2S2D / 2S1D

1-sided MPI: put/get

2-sided MPI: send/receive

2D: direct data access, more memory

1D: ocean points stored compactly; less memory, indirect data access

[Figure: 2D vs. 1D data layouts]
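To make the 2D-versus-1D trade-off concrete, a toy sketch (all names hypothetical): the 2D layout stores the full grid including land and allows direct (i, j) access, while the 1D layout packs only ocean points and pays one level of indirection:

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((6, 6)) > 0.3          # True = ocean point, False = land

# 2D layout: full grid, land included -> direct (i, j) access, more memory.
field_2d = np.zeros(mask.shape)

# 1D layout: only ocean points, stored compactly -> less memory, but every
# access goes through an index map that is built once up front.
ocean_i, ocean_j = np.nonzero(mask)
field_1d = np.zeros(ocean_i.size)

k = 0                                    # the k-th ocean point
v_2d = field_2d[ocean_i[k], ocean_j[k]]  # 2D: direct access into the grid
v_1d = field_1d[k]                       # 1D: compact storage, indirect map
```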


Page 12: CGPOP Analysis and Optimization

Three Variants

[Figure: total time for one time step, measured on Tianhe-1A]

Page 13: CGPOP Analysis and Optimization

Analysis Conclusions

The dot product consumes a significant share of solver time

Of the three variants, 2S1D is selected as the benchmark


Page 14: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method

Optimizations

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 15: CGPOP Analysis and Optimization

Chebyshev

Matrix-vector multiplication and daxpy only; no dot product
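A minimal sketch of the Chebyshev iteration in its classical three-term form (following Saad's Algorithm 12.1); the eigenvalue bounds lmin and lmax are assumed inputs that must be estimated separately, and the fixed iteration count is what keeps the loop free of dot products:

```python
import numpy as np

def chebyshev(A, b, lmin, lmax, n_iter=200):
    """Chebyshev iteration for SPD A with spectrum in [lmin, lmax].
    Each iteration: 1 mat-vec plus a few axpys and no dot products,
    so there are no global reductions inside the loop."""
    theta = 0.5 * (lmax + lmin)          # center of the spectrum
    delta = 0.5 * (lmax - lmin)          # half-width of the spectrum
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = np.zeros_like(b)
    r = b - A @ x
    d = r / theta
    for _ in range(n_iter):
        x = x + d                        # axpy
        r = r - A @ d                    # mat-vec + axpy
        rho_new = 1.0 / (2.0 * sigma1 - rho)
        d = rho_new * rho * d + (2.0 * rho_new / delta) * r  # axpys
        rho = rho_new
    return x
```

The price for dropping dot products is that convergence cannot be monitored inside the loop without reintroducing a reduction; Chebyshev also typically needs more iterations than PCG (total_iter2 > total_iter1), trading extra local work for the removed global reductions.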


Page 16: CGPOP Analysis and Optimization

Chebyshev

PCG: (4 Daxpy + 1 MV + 3 DP) × total_iter1

CBS: (3 Daxpy + 1 MV) × total_iter2

Dot product (DP), Daxpy, mat-vec mul (MV)


Page 17: CGPOP Analysis and Optimization

Chebyshev


Page 18: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method

Optimizations

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 19: CGPOP Analysis and Optimization

Richardson-PCG

Single precision: faster [5]

A processor can operate on 2 doubles or 4 singles at a time

Less memory pressure

Double precision: more accurate

Mix them: single precision for speed, double precision for accuracy


Page 20: CGPOP Analysis and Optimization

Richardson-PCG

Richardson method (a splitting method):

$$Ax = b,\quad A = M - N \;\Rightarrow\; x = M^{-1}Nx + M^{-1}b$$

Iteration:

$$x_{k+1} \leftarrow M^{-1}Nx_k + M^{-1}b, \qquad \text{convergent when } \rho(M^{-1}N) < 1$$

Our motivation: let $M = A_{\mathrm{float}}$, so that $M^{-1}N = I - M^{-1}A \approx 0$

Our method:

$$x_{k+1} \leftarrow x_k + M^{-1}(b - Ax_k)$$

This is the same as solving $A_{\mathrm{float}}\,\Delta x = b - Ax_k$ for the correction, approximately, up to a tolerance
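A minimal sketch of the idea, assuming a plain single-precision CG as the inner solver (the actual inner solver in Richardson-PCG is a preconditioned CG; all names here are hypothetical):

```python
import numpy as np

def cg32(A32, b32, tol=1e-4, max_iter=100):
    """Plain CG in single precision: stand-in for the inner solve."""
    x = np.zeros_like(b32)
    r = b32 - A32 @ x
    p = r.copy()
    rs = r @ r
    b_norm = np.linalg.norm(b32)
    for _ in range(max_iter):
        Ap = A32 @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def richardson_mixed(A, b, tol=1e-10, max_outer=50):
    """Outer Richardson loop in double precision, inner solve in single:
    x_{k+1} <- x_k + M^{-1}(b - A x_k) with M = A_float."""
    A32 = A.astype(np.float32)           # convert matrix once (the 1 CM cost)
    x = np.zeros_like(b)
    b_norm = np.linalg.norm(b)
    for _ in range(max_outer):
        r = b - A @ x                    # double-precision residual (1 DMV)
        if np.linalg.norm(r) <= tol * b_norm:       # 1 DDP
            break
        dx = cg32(A32, r.astype(np.float32))        # A_float dx ~= r (+CV)
        x += dx.astype(np.float64)       # accumulate correction (+CV, Daxpy)
    return x
```

Only the correction equation runs in single precision while the outer loop stays in double, which is where the Saxpy/SMV/SDP savings in the cost breakdown below come from.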


Page 21: CGPOP Analysis and Optimization

Richardson-PCG

$$x_{k+1} \leftarrow x_k + M^{-1}(b - Ax_k)$$


Page 22: CGPOP Analysis and Optimization

Richardson-PCG

PCG: (4 Daxpy + 1 DMV + 3 DDP) × total_iter1

Rich-PCG: 1 CM + (2 Daxpy + 1 DMV + 1 DDP) × outer_iter + (4 Saxpy + 1 SMV + 3 SDP + 2 CV) × total_iter3

Double mat-vec mul (DMV), Daxpy, double dot product (DDP); single mat-vec mul (SMV), Saxpy, single dot product (SDP); convert vector (CV), convert matrix (CM)


Page 23: CGPOP Analysis and Optimization

Richardson-PCG


Page 24: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method

Optimizations

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 25: CGPOP Analysis and Optimization

Richardson-Chebyshev


Page 26: CGPOP Analysis and Optimization

Richardson-Chebyshev

Rich-CBS: 1 CM + (2 Daxpy + 1 DMV + 1 DDP) × outer_iter4 + (3 Saxpy + 1 SMV + 2 CV) × total_iter4

Rich-PCG: 1 CM + (2 Daxpy + 1 DMV + 1 DDP) × outer_iter3 + (4 Saxpy + 1 SMV + 3 SDP + 2 CV) × total_iter3
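To turn these operation counts into a concrete comparison, a small tally sketch; the per-operation weights and iteration counts below are purely illustrative assumptions, not measured values:

```python
# Hypothetical relative costs per operation (illustrative only; in a real
# run the dot products DDP/SDP also carry global-reduction latency).
COST = {"Daxpy": 1.0, "Saxpy": 0.5, "DMV": 4.0, "SMV": 2.0,
        "DDP": 3.0, "SDP": 2.5, "CV": 0.5, "CM": 5.0}

def tally(terms):
    """terms: list of (multiplier, {op: count}) pairs for one solver run."""
    return sum(m * sum(n * COST[op] for op, n in ops.items())
               for m, ops in terms)

# Assumed counts: 10 outer iterations, 80 inner iterations in total.
rich_pcg = tally([(1,  {"CM": 1}),
                  (10, {"Daxpy": 2, "DMV": 1, "DDP": 1}),            # outer_iter3
                  (80, {"Saxpy": 4, "SMV": 1, "SDP": 3, "CV": 2})])  # total_iter3
rich_cbs = tally([(1,  {"CM": 1}),
                  (10, {"Daxpy": 2, "DMV": 1, "DDP": 1}),            # outer_iter4
                  (80, {"Saxpy": 3, "SMV": 1, "CV": 2})])            # total_iter4
print(f"Rich-PCG: {rich_pcg}, Rich-CBS: {rich_cbs}")
```

Rich-CBS drops the 3 SDP per inner iteration relative to Rich-PCG, so its advantage grows as reductions get more expensive at scale.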


Page 27: CGPOP Analysis and Optimization

Richardson-Chebyshev


Page 28: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method

Optimizations

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 29: CGPOP Analysis and Optimization

Experiments

Our supercomputers:

Tianhe-1A

CPU: 2.93 GHz Intel Xeon X5670

Memory: 32/48 GB per node; bandwidth: 40 GB/s

Network: 160 Gbps, 22 ns; fat-tree topology

Shenwei

CPU: 1.1 GHz Shenwei processor

Memory: 32 GB per node; bandwidth: 68 GB/s (listed result)

Network: crossbar for every 256 CPUs; fat-tree topology


Page 30: CGPOP Analysis and Optimization

Experiments on Tianhe-1A


Page 31: CGPOP Analysis and Optimization

Experiments on Shenwei


Page 32: CGPOP Analysis and Optimization

Conclusion

Two techniques:

Reducing dot products

Effective at large core counts (more than 5,000)

Mixed precision

Effective at small core counts (fewer than 1,000)


Page 33: CGPOP Analysis and Optimization

Outline

Background

Research

Analysis of original PCG Method

Optimizations

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Page 34: CGPOP Analysis and Optimization

Future Work

Complete the investigation of the current code

Integrate the optimization techniques into our ocean modeling programs

Apply our methods to other parallel programs


Page 35: CGPOP Analysis and Optimization

References

[1] R. Smith, P. Gent, "Reference Manual for the Parallel Ocean Program (POP)", May 2002, pp. 1-74.

[2] A. Stone, J. M. Dennis, M. M. Strout, "The CGPOP Miniapp, Version 1.0", July 2011, pp. 4-5.

[3] Y. Saad, A. Sameh, P. Saylor, "Solving elliptic difference equations on a linear array of processors", SIAM J. Sci. Stat. Comput., Vol. 6, No. 4, October 1985, pp. 1049-1063.

[4] E. Stiefel, "Kernel polynomials in linear algebra and their numerical applications", Nat. Bur. Standards, Appl. Math. Series 49, 1958, pp. 1-22.

[5] A. Buttari, J. Dongarra, J. Kurzak, P. Luszczek, S. Tomov, "Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy", ACM Transactions on Mathematical Software, Vol. 34, No. 4, Article 17, 2008.
