CGPOP Analysis and Optimization

Analysis and Optimization of CGPOP

Hongtao Cai, Xiaoxiang Hu, Haoruo Peng

Department of CST, Tsinghua University

SIAM Annual Meeting, July 9, 2012

Acknowledgment

Prof. Xiaoge Wang, Prof. Wei Xue

Support from the State 863 Project Fund

Support from Explore-100, Tianhe-1A, Shenwei supercomputer systems

Support from SIAM


Outline

Background

Research

Analysis of original PCG Method in CGPOP

Optimizations:

Chebyshev

Richardson-PCG

Richardson-Chebyshev

Experiments

Future Work


Parallel Ocean Program

The crucial role of oceans in global climate

70% of Earth's surface

Water has 1000 times the heat capacity of air

Repository of carbon (93%)

Transports heat

POP: Surface Pressure of Oceans [1]


Conjugate Gradient Parallel Ocean Program (CGPOP)

Three computation parts: Barotropic, 3D-update, Baroclinic

Barotropic computation dominates when the core count exceeds 10,000 [2]

CGPOP contains the core part of the Barotropic computation


Conjugate Gradient Parallel Ocean Program (CGPOP)

Linear equation system in every time step

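$$\left[\nabla \cdot H \nabla - \frac{1}{g \alpha \tau \Delta t}\right] \eta^{n+1} = \nabla \cdot H \left[\frac{U}{g \alpha \tau} + \nabla \eta^{n-1}\right] - \frac{\eta^{n}}{g \alpha \tau \Delta t} - \frac{q_W^{n}}{g \alpha \tau}$$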

Ax = b

(A is a real, sparse, symmetric, positive-definite matrix)

Our work: exploring new algorithms in CGPOP, with experiments on some of the world's top supercomputers.


Chronopoulos-Gear (ChronGear) Preconditioned Conjugate Gradient Solver

Matrix-vector Multiplication, Dot Product, Daxpy: Communication
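To make the three kernel types concrete, here is a minimal sketch of a textbook preconditioned conjugate gradient loop in plain Python. It is not the ChronGear variant and not CGPOP's code: the diagonal preconditioner, the dense list-of-lists matrix, the function names, and the small 1D Laplacian test problem are all assumptions chosen for illustration. The comments mark where the matrix-vector product and the dot products occur; in a parallel run each dot product would require a global reduction.

```python
def matvec(A, x):
    """Dense matrix-vector product (MV)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Dot product (DP); in a parallel run this needs a global reduction."""
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(A, b, tol=1e-10, max_iter=200):
    """Textbook PCG with a diagonal preconditioner (illustrative assumption)."""
    n = len(b)
    inv_diag = [1.0 / A[i][i] for i in range(n)]
    x = [0.0] * n
    r = list(b)                                   # r = b - A*0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # apply preconditioner
    p = list(z)
    rz = dot(r, z)
    stop = tol * tol * dot(b, b)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)                         # 1 MV
        alpha = rz / dot(p, Ap)                   # DP 1
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) <= stop:                     # DP 2 (convergence check)
            return x, k
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)                        # DP 3
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Stand-in test problem: 1D Laplacian (tridiagonal -1, 2, -1), which is SPD.
n = 8
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x, iters = pcg(A, b)
```

The vector updates are purely local, while every `dot` call is a synchronization point across all processes.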

[Figure: percentage of PCG solver time consumed by dot products, on the Shenwei supercomputer]

Three Variants

1S1D / 2S2D / 2S1D

1S: one-sided MPI (put/get)

2S: two-sided MPI (send/receive)

2D: direct data access, more memory

1D: ocean points stored compactly; less memory, indirect data access
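The trade-off between the 2D and 1D layouts can be sketched as follows. This is an illustrative assumption about how a compact layout might be built from a land/ocean mask, not CGPOP's actual data structure; the function names and the tiny 3x3 grid are hypothetical.

```python
def build_ocean_index(mask):
    """List the (row, col) position of every ocean point in a 2D land/ocean mask."""
    return [(i, j) for i, row in enumerate(mask)
            for j, is_ocean in enumerate(row) if is_ocean]


def pack_1d(field_2d, ocean_index):
    """2D -> 1D: keep only ocean values, stored contiguously."""
    return [field_2d[i][j] for i, j in ocean_index]


def unpack_2d(values_1d, ocean_index, nrows, ncols, land_value=0.0):
    """1D -> 2D: scatter compact values back through the index (indirect access)."""
    field = [[land_value] * ncols for _ in range(nrows)]
    for value, (i, j) in zip(values_1d, ocean_index):
        field[i][j] = value
    return field


# Hypothetical 3x3 grid: True marks an ocean point, False marks land.
mask = [[True, True, False],
        [True, False, False],
        [True, True, True]]
field_2d = [[1.0, 2.0, 0.0],
            [3.0, 0.0, 0.0],
            [4.0, 5.0, 6.0]]
ocean_index = build_ocean_index(mask)
field_1d = pack_1d(field_2d, ocean_index)   # 6 values instead of 9
```

The 1D form stores no land points, but every access to a grid neighbour has to go through an index lookup rather than a direct `[i][j]` offset.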

[Figure: 2D and 1D data layouts]

Three Variants

[Figure: total time for one time step, on Tianhe-1A]

Analysis Conclusions

Dot products are a major time consumer in the PCG solver

Of the three variants, 2S1D is selected as the benchmark


Chebyshev

Operations: Mat-Vec Mul and Daxpy only; no Dot Product
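A minimal sketch of a Chebyshev iteration shows why no dot products are needed: the step coefficients come from a scalar recurrence driven by bounds on the spectrum, not from inner products of the iterates. The version below is the standard unpreconditioned three-term form; the names `lam_min` and `lam_max`, the fixed iteration count, and the 1D Laplacian test problem with its known eigenvalues are assumptions for illustration, not the configuration used in the talk.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product (MV)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def chebyshev_solve(A, b, lam_min, lam_max, num_iter):
    """Chebyshev iteration for SPD A with eigenvalues in [lam_min, lam_max]."""
    theta = 0.5 * (lam_max + lam_min)     # centre of the spectrum interval
    delta = 0.5 * (lam_max - lam_min)     # half-width of the interval
    sigma1 = theta / delta
    rho = 1.0 / sigma1
    x = [0.0] * len(b)
    r = list(b)                           # r = b - A*0
    d = [ri / theta for ri in r]
    for _ in range(num_iter):
        x = [xi + di for xi, di in zip(x, d)]              # vector update
        Ad = matvec(A, d)                                  # 1 MV
        r = [ri - adi for ri, adi in zip(r, Ad)]           # vector update
        rho_next = 1.0 / (2.0 * sigma1 - rho)              # scalar recurrence only
        d = [rho_next * rho * di + (2.0 * rho_next / delta) * ri
             for di, ri in zip(d, r)]                      # vector update
        rho = rho_next
    return x


# Stand-in test problem: 1D Laplacian, whose eigenvalues are known in closed form.
n = 8
A = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
lam_min = 2.0 - 2.0 * math.cos(math.pi / (n + 1))
lam_max = 2.0 - 2.0 * math.cos(n * math.pi / (n + 1))
x = chebyshev_solve(A, b, lam_min, lam_max, num_iter=120)
```

The price of avoiding global reductions is that the eigenvalue bounds must be supplied in advance, and the method typically needs more iterations than PCG for the same accuracy.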


Chebyshev

PCG: (4 Daxpy + 1 MV + 3 DP) × total_iter1

CBS: (3 Daxpy + 1 MV) × total_iter2

Legend: Dot Product (DP), Daxpy, Mat-Vec Mul (MV)


Richardson-PCG

Single precision: faster [5]

A processor can operate on 2 double-precision or 4 single-precision values at a time

Less memory pressure

Double precision: more accurate

Mixed precision: combine the two


Richardson-PCG

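Richardson Method (Splitting Method)

$Ax = b,\quad A = M - N \;\Rightarrow\; x = M^{-1}Nx + M^{-1}b$

Iteration: $x_{k+1} \leftarrow M^{-1}Nx_k + M^{-1}b$, which converges when $\rho(M^{-1}N) < 1$

Our Motivation: Let $M = A_{float}$, s.t. $M^{-1}N = I - M^{-1}A \approx 0$

Our Method: $x_{k+1} \leftarrow x_k + M^{-1}(b - Ax_k)$

Same as solving $A_{float}\,\Delta x = b - Ax_k$

Approximation: the inner system is solved only to a tolerance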
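The outer update with M taken as the single-precision copy of A can be sketched in plain Python as below. This is an illustration under several assumptions, not the authors' implementation: single precision is emulated by rounding values to float32 with the standard `struct` module, the inner solver is an unpreconditioned CG with scalars kept in double for simplicity, and the function names, tolerances, and the small tridiagonal test matrix are hypothetical.

```python
import math
import struct


def to_single(v):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def matvec(A, x):
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cg_single(A32, rhs32, tol, max_iter):
    """Inner solve of A_float * dx = rhs; vectors are rounded to float32 each update."""
    x = [0.0] * len(rhs32)
    r = list(rhs32)
    p = list(r)
    rr = dot(r, r)
    stop = tol * tol * rr
    for _ in range(max_iter):
        if rr <= stop:
            break
        Ap = [to_single(v) for v in matvec(A32, p)]
        alpha = rr / dot(p, Ap)
        x = [to_single(xi + alpha * pi) for xi, pi in zip(x, p)]
        r = [to_single(ri - alpha * api) for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        p = [to_single(ri + (rr_new / rr) * pi) for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def richardson_mixed(A, b, tol=1e-10, max_outer=20, inner_tol=1e-3, inner_iter=50):
    """Outer Richardson loop in double; corrections come from the float32 solve."""
    A32 = [[to_single(a) for a in row] for row in A]          # convert matrix once
    x = [0.0] * len(b)
    b_norm = math.sqrt(dot(b, b))
    for outer in range(max_outer):
        r = [bi - axi for bi, axi in zip(b, matvec(A, x))]    # double residual
        if math.sqrt(dot(r, r)) <= tol * b_norm:              # double dot product
            return x, outer
        r32 = [to_single(ri) for ri in r]                     # convert vector
        dx = cg_single(A32, r32, inner_tol, inner_iter)       # approximate M^-1 r
        x = [xi + dxi for xi, dxi in zip(x, dx)]              # double update
    return x, max_outer


# Stand-in SPD test matrix; 2.1 is not exactly representable in float32.
n = 8
A = [[2.1 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [1.0] * n
x, outer_iters = richardson_mixed(A, b)
residual = [bi - axi for bi, axi in zip(b, matvec(A, x))]
```

Because the residual is always recomputed in double precision, the loose single-precision inner solve only affects how many outer iterations are needed, not the accuracy finally reached.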


Richardson-PCG

$x_{k+1} \leftarrow x_k + M^{-1}(b - Ax_k)$

Richardson-PCG

PCG: (4 Daxpy + 1 DMV + 3 DDP) × total_iter1

Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP + 2 CV) × total_iter3 + (2 Daxpy + 1 DMV + 1 DDP) × outer_iter

Legend: Double Mat-Vec Mul (DMV), Daxpy, Double Dot Product (DDP)

Single Mat-Vec Mul (SMV), Saxpy, Single Dot Product (SDP)

Convert Vector (CV), Convert Matrix (CM)



Rich-CBS: 1 CM + (3 Saxpy + 1 SMV + 2 CV) × total_iter4 + (2 Daxpy + 1 DMV + 1 DDP) × outer_iter4

Rich-PCG: 1 CM + (4 Saxpy + 1 SMV + 3 SDP + 2 CV) × total_iter3 + (2 Daxpy + 1 DMV + 1 DDP) × outer_iter3



Experiments

Our supercomputers:

Tianhe-1A

CPU: 2.93 GHz Intel Xeon X5670

Memory: 32/48 GB per node; bandwidth: 40 GB/s

Network: 160 Gbps, 22 ns; fat-tree structure

Shenwei

CPU: 1.1 GHz Shenwei processor

Memory: 32 GB per node; bandwidth: 68 GB/s (list result)

Network: crossbar for every 256 CPUs; fat-tree structure

Experiments on Tianhe-1A

Experiments on Shenwei

Conclusion

Two techniques:

Reducing dot products: effective at large core counts (more than 5000)

Mixed precision: effective at small core counts (fewer than 1000)


Future Work

Complete the investigation of the current code

Integrate the optimization techniques into our ocean modeling programs

Apply our methods to other parallel programs


References

[1] R. Smith, P. Gent, “Reference Manual for the Parallel Ocean Program (POP)”, May 2002, pp. 1-74.

[2] A. Stone, J. M. Dennis, M. M. Strout, “The CGPOP Miniapp, Version 1.0”, July 2011, pp. 4-5.

[3] Y. Saad, A. Sameh, P. Saylor, “Solving elliptic difference equations on a linear array of processors”, SIAM J. Sci. Stat. Comput., Vol. 6, No. 4, October 1985, pp. 1049-1063.

[4] E. Stiefel, “Kernel polynomials in linear algebra and their numerical applications”, Nat. Bur. Standards, Appl. Math. Series 49, 1958, pp. 1-22.

[5] A. Buttari, E. Lyon, J. Dongarra, “Using Mixed Precision for Sparse Matrix Computations to Enhance the Performance while Achieving 64-bit Accuracy”, ACM Transactions on Math. Software, Vol. 34, No. 4, Article 17, pp. 1-8.
