Applying Automated Memory Analysis to improve the iterative solver in the Parallel Ocean Program

John M. Dennis: [email protected]
Elizabeth R. Jessup: [email protected]
April 5, 2006
Petascale Computation for the Geosciences Workshop

Page 2: Motivation

Outgrowth of PhD thesis on memory-efficient iterative solvers:
- Data movement is expensive
- Developed techniques to improve memory efficiency

Apply Automated Memory Analysis to POP:
- The Parallel Ocean Program (POP) solver accounts for a large percentage of run time
- The solver has scalability issues

Page 3: Outline

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

Page 4: Automated Memory Analysis?

- Analyzes an algorithm written in Matlab
- Predicts the data movement if the algorithm were written in C/C++ or Fortran -> the minimum required
- Predictions allow us to:
  - Evaluate design choices
  - Guide performance tuning

Page 5: POP using 20x24 blocks (gx1v3)

POP data structure:
- Flexible block structure
- Land 'block' elimination

Small blocks:
- Better load balance and land-block elimination
- Larger halo overhead (quantified in the sketch below)

Larger blocks:
- Smaller halo overhead
- Load imbalance
- No land-block elimination

Grid resolutions: test (128x192), gx1v3 (320x384)
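To make the halo trade-off concrete, here is a minimal sketch that computes the fraction of a block's storage consumed by halo points for a few block sizes. The halo width of 2 ghost cells is an assumption for illustration, not a figure taken from the slides.

    program halo_overhead
       implicit none
       integer, parameter :: nghost = 2        ! assumed halo width
       integer :: nx(3) = (/ 16, 20, 40 /)
       integer :: ny(3) = (/ 16, 24, 48 /)
       integer :: k, interior, total
       do k = 1, 3
          interior = nx(k)*ny(k)
          total    = (nx(k)+2*nghost) * (ny(k)+2*nghost)
          print '(i2,a,i2,a,f5.1,a)', nx(k), 'x', ny(k), &
               ' block: halo = ', 100.0*(total-interior)/total, '% of storage'
       end do
    end program halo_overhead

Under this assumption a 16x16 block spends about 36% of its storage on halo points while a 40x48 block spends about 16%, which is the small-versus-large trade-off the slide describes.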

Page 6: Alternate Data Structure

2D data structure:
- Advantages: regular stride-1 access; compact form of the stencil operator
- Disadvantages: includes land points; problem-specific data structure

1D data structure:
- Advantages: no more land points; general data structure
- Disadvantages: indirect addressing; larger stencil operator

The two access patterns are sketched below.
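A minimal sketch of the two access patterns, using a 5-point stencil for brevity (POP's barotropic operator is wider, which is one reason the slide notes a "larger stencil operator" for the 1D form). All array and table names here are illustrative, not taken from solvers.F90.

    ! 2D form: regular stride-1 access over the block; land points included
    subroutine stencil_2d(nx, ny, ac, an, as, ae, aw, p, q)
       implicit none
       integer, intent(in) :: nx, ny
       real, dimension(nx,ny), intent(in)  :: ac, an, as, ae, aw, p
       real, dimension(nx,ny), intent(out) :: q
       integer :: i, j
       do j = 2, ny-1
          do i = 2, nx-1
             q(i,j) = ac(i,j)*p(i,j)                      &
                    + an(i,j)*p(i,j+1) + as(i,j)*p(i,j-1) &
                    + ae(i,j)*p(i+1,j) + aw(i,j)*p(i-1,j)
          end do
       end do
    end subroutine stencil_2d

    ! 1D form: land points compressed out; neighbors reached by indirect
    ! addressing through a precomputed table nbr(1:4,1:nocn)
    subroutine stencil_1d(nocn, a1, nbr, p1, q1)
       implicit none
       integer, intent(in) :: nocn
       real,    intent(in) :: a1(0:4,nocn), p1(*)   ! p1 includes neighbor/halo points
       integer, intent(in) :: nbr(4,nocn)           ! precomputed neighbor indices
       real,    intent(out):: q1(nocn)
       integer :: n
       do n = 1, nocn
          q1(n) = a1(0,n)*p1(n)                               &
                + a1(1,n)*p1(nbr(1,n)) + a1(2,n)*p1(nbr(2,n)) &
                + a1(3,n)*p1(nbr(3,n)) + a1(4,n)*p1(nbr(4,n))
       end do
    end subroutine stencil_1d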

Page 7: Outline

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

Page 8: Data movement

- Working-set load size (WSL): data moved from main memory into the L1 cache
- Measure it using PAPI (WSL_M); a measurement sketch follows
- Compute platforms:
  - Sun Ultra II (400 MHz)
  - IBM POWER4 (1.3 GHz)
  - SGI R14K (500 MHz)
- Compare with the prediction (WSL_P)
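One plausible way to obtain WSL_M, sketched with PAPI's high-level Fortran counter interface: count L1 data-cache misses around the solver and convert misses to bytes. The measured region (solver_step), the 32-byte line size, and the neglect of write traffic are all assumptions to check against your platform; the slides do not show the instrumentation itself.

    subroutine measure_wslm
       implicit none
       include 'fpapi.h'                            ! PAPI Fortran constants
       integer   :: events(1), check
       integer*8 :: values(1)
       real      :: wslm_kb
       events(1) = PAPI_L1_DCM                      ! L1 data-cache misses
       call PAPIF_start_counters(events, 1, check)
       call solver_step()                           ! hypothetical region under test
       call PAPIF_stop_counters(values, 1, check)
       ! misses * line size ~ bytes loaded from main memory into L1
       wslm_kb = real(values(1)) * 32.0 / 1024.0    ! assumes 32-byte L1 lines
       print *, 'WSL_M ~ ', wslm_kb, ' Kbytes'
    end subroutine measure_wslm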

Page 9: Predicting Data Movement

- solver w/2D data structure (Matlab): 4902 Kbytes
- solver w/1D data structure (Matlab): 3218 Kbytes

These Matlab analyses give the predicted working-set load sizes (WSL_P). The 1D data structure yields a 34% reduction in predicted data movement ((4902 - 3218)/4902 ~ 34%).

Page 10: Measured versus Predicted data movement

Solver        WSL_P    Ultra II        POWER4          R14K
                       WSL_M    err    WSL_M    err    WSL_M    err
PCG2+2D v1    4902     5163     5%     5068     3%     5728     17%
PCG2+2D v2    4902     4905     0%     4865    -1%     4854     -1%
PCG2+1D       3218     3164    -2%     3335     4%     3473      8%

(All sizes in Kbytes; err = (WSL_M - WSL_P) / WSL_P.)

Page 11: Measured versus Predicted data movement

Solver        WSL_P    Ultra II        POWER4          R14K
                       WSL_M    err    WSL_M    err    WSL_M    err
PCG2+2D v1    4902     5163     5%     5068     3%     5728     17%
PCG2+2D v2    4902     4905     0%     4865    -1%     4854     -1%
PCG2+1D       3218     3164    -2%     3335     4%     3473      8%

Note the excessive measured data movement for PCG2+2D v1.

Page 12: Two blocks of source code

PCG2+2D v1 (w0 array accessed after the loop!):

    do i=1,nblocks
       p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
       q(:,:,i) = A*p(:,:,i)              ! apply the stencil operator
       w0(:,:,i) = q(:,:,i)*p(:,:,i)      ! full 3D w0 written here ...
    enddo
    delta = gsum(w0,lmask)                ! ... then re-read for the reduction

PCG2+2D v2 (extra access of w0 eliminated):

    ldelta = 0
    do i=1,nblocks
       p(:,:,i) = z(:,:,i) + p(:,:,i)*beta
       q(:,:,i) = A*p(:,:,i)
       w0 = q(:,:,i)*p(:,:,i)             ! w0 is now a single 2D scratch block
       ldelta = ldelta + lsum(w0,lmask)   ! partial sum while the block is in cache
    enddo
    delta = gsum(ldelta)

In v1 the whole 3D w0 array streams through the cache twice, once on the write and again in gsum; v2 reduces each block while it is still cache-resident, eliminating that extra pass.

Page 13: Measured versus Predicted data movement

Solver        WSL_P    Ultra II        POWER4          R14K
                       WSL_M    err    WSL_M    err    WSL_M    err
PCG2+2D v1    4902     5163     5%     5068     3%     5728     17%
PCG2+2D v2    4902     4905     0%     4865    -1%     4854     -1%
PCG2+1D       3218     3164    -2%     3335     4%     3473      8%

With the fix, measured data movement matches the prediction!

Page 14: Outline

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

Page 15: Using 1D data structures in POP2 solver (serial)

- Replace solvers.F90
- Measure execution time on cache-based microprocessors
- Examine two CG algorithms with a diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product) [D'Azevedo 93]
- Grid: test (128x192 grid points) with 16x16 blocks

A sketch of the PCG2 iteration and its two reductions follows.
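For reference, a textbook sketch of one diagonally preconditioned PCG2 iteration, written in the array style of the slide's code; apply_A and diagA are illustrative names, and gsum is the global-sum reduction from the slides, not POP's actual solvers.F90. The two gsum calls are the two inner products that name the method; the D'Azevedo PCG1 variant rearranges the recurrences so that a single combined reduction per iteration suffices.

    ! one PCG2 iteration: two global reductions per pass
    q     = apply_A(p)                 ! stencil operator (illustrative name)
    alpha = rz / gsum(p*q, lmask)      ! inner product 1: p'q
    x     = x + alpha*p
    r     = r - alpha*q
    z     = r / diagA                  ! diagonal preconditioner
    rznew = gsum(r*z, lmask)           ! inner product 2: r'z
    beta  = rznew / rz
    p     = z + beta*p
    rz    = rznew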

Page 16: Serial execution time on IBM POWER4 (test)

[Bar chart: seconds for 20 timesteps on the IBM POWER4 (1.3 GHz) for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D; y-axis 0-6 seconds.]

The 1D solvers deliver a 56% reduction in cost per iteration.

Page 17: Outline

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

Page 18: Using 1D data structure in POP2 solver (parallel)

- New parallel halo update (sketch below)
- Examine several CG algorithms with a diagonal preconditioner:
  - PCG2 (2 inner products)
  - PCG1 (1 inner product)
- Existing solver/preconditioner technology: Hypre (LLNL)
  - http://www.llnl.gov/CASC/linear_solvers
  - PCG solver
  - Preconditioners: diagonal
  - Hypre integration -> work in progress
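A minimal sketch of what a halo update for the 1D structure can look like: post receives, gather boundary points through index lists, exchange, and scatter into the contiguous halo region. The schedule arrays (send_rank/recv_rank, sptr/rptr buffer offsets, the sidx gather list) are assumed to be precomputed, and none of these names come from POP's actual routine.

    subroutine halo_update_1d(x, nocn, nhalo, nsend, nrecv, &
                              send_rank, recv_rank, sptr, rptr, sidx)
       use mpi
       implicit none
       integer, intent(in) :: nocn, nhalo, nsend, nrecv
       integer, intent(in) :: send_rank(nsend), recv_rank(nrecv)
       integer, intent(in) :: sptr(nsend+1), rptr(nrecv+1)   ! buffer offsets
       integer, intent(in) :: sidx(sptr(nsend+1)-1)          ! gather index list
       real*8,  intent(inout) :: x(nocn+nhalo)               ! owned points, then halo
       real*8  :: sbuf(sptr(nsend+1)-1), rbuf(nhalo)
       integer :: m, ierr, tag
       integer :: sreq(nsend), rreq(nrecv)
       tag = 100
       do m = 1, nrecv                        ! post all receives first
          call MPI_Irecv(rbuf(rptr(m)), rptr(m+1)-rptr(m), MPI_REAL8, &
                         recv_rank(m), tag, MPI_COMM_WORLD, rreq(m), ierr)
       end do
       do m = 1, nsend                        ! gather boundary points, then send
          sbuf(sptr(m):sptr(m+1)-1) = x(sidx(sptr(m):sptr(m+1)-1))
          call MPI_Isend(sbuf(sptr(m)), sptr(m+1)-sptr(m), MPI_REAL8, &
                         send_rank(m), tag, MPI_COMM_WORLD, sreq(m), ierr)
       end do
       call MPI_Waitall(nrecv, rreq, MPI_STATUSES_IGNORE, ierr)
       x(nocn+1:nocn+nhalo) = rbuf(1:nhalo)   ! halo points stored contiguously
       call MPI_Waitall(nsend, sreq, MPI_STATUSES_IGNORE, ierr)
    end subroutine halo_update_1d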

Page 19: Solver execution time for POP2 (20x24) on BG/L (gx1v3)

[Bar chart: seconds for 200 timesteps on 64 processors for PCG2+2D, PCG1+2D, PCG2+1D, PCG1+1D, and Hypre (PCG+Diag); y-axis 0-40 seconds. Annotations: 48% cost/iteration; 27% cost/iteration.]

Page 20: 64 processors != PetaScale

Page 21: Outline

- Motivation
- Background
- Data movement
- Serial Performance
- Parallel Performance
- Space-Filling Curves
- Conclusions

Page 22: 0.1 degree POP

- Global eddy-resolving configuration
- Computational grid: 3600 x 2400 x 40
- Land creates problems: load imbalance, poor scalability
- Alternative partitioning algorithm: space-filling curves
- Evaluate using benchmark: 1 day / internal grid / 7-minute timestep

Page 23: Partitioning with Space-filling Curves

- Map the 2D block grid to 1D
- Curves for a variety of block-grid sizes Nb:
  - Hilbert (Nb = 2^n)
  - Peano (Nb = 3^m)
  - Cinco (Nb = 5^p) [new]
  - Hilbert-Peano (Nb = 2^n 3^m)
  - Hilbert-Peano-Cinco (Nb = 2^n 3^m 5^p) [new]
- Partition the resulting 1D array (sketch after this list)
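Once a curve orders the ocean blocks in 1D (with land blocks already removed), the partitioning step reduces to cutting that list into npes contiguous, near-equal pieces; contiguous segments of the curve stay spatially compact in 2D, which keeps halo-exchange partners few and nearby. A minimal sketch, assuming unit work per ocean block:

    ! assign the i-th ocean block along the curve to a processor rank
    subroutine sfc_partition(nocn, npes, owner)
       implicit none
       integer, intent(in)  :: nocn, npes
       integer, intent(out) :: owner(nocn)   ! owner(i) = rank of i-th block on curve
       integer :: i
       do i = 1, nocn
          owner(i) = (npes*(i-1)) / nocn     ! ranks 0..npes-1, remainder spread evenly
       end do
    end subroutine sfc_partition

Weighted work per block (e.g., counting ocean points rather than blocks) fits the same pattern by cutting at equal cumulative weight instead of equal counts.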

Page 24: Partitioning with SFC

[Figure: example partition for 3 processors.]

Page 25: POP using 20x24 blocks (gx1v3)

[Figure.]

Page 26: POP (gx1v3) + Space-filling curve

[Figure.]

Page 27: Space-filling curve (Hilbert, Nb = 2^4)

[Figure.]

Page 28: Remove land blocks

[Figure.]

Page 29: Space-filling curve partition for 8 processors

[Figure.]

Page 30: POP 0.1 degree benchmark on Blue Gene/L

[Figure.]

Page 31: POP 0.1 degree benchmark

[Figure.] Courtesy of Y. Yoshida, M. Taylor, P. Worley

Page 32: Conclusions

1D data structures in the barotropic solver:
- No more land points
- Reduce execution time versus the 2D data structure:
  - 48% reduction in solver time! (64 procs, BG/L)
  - 9.5% reduction in total time! (64 procs, POWER4)
- Allow use of solver/preconditioner packages
- Implementation quality is critical!

Automated Memory Analysis (SLAMM):
- Evaluate design choices
- Guide performance tuning

Page 33: Conclusions (cont'd)

- Good scalability to 32K processors on BG/L
- Simulation rate increased by 2x on 32K processors:
  - SFC partitioning
  - 1D data structure in solver
  - Modified 7 source files

Future work:
- Improve scalability (55% efficiency from 1K to 32K processors)
- Better preconditioners
- Improve load balance:
  - Different block sizes
  - Improved partitioning algorithm

Page 34: Acknowledgements/Questions?

Thanks to: F. Bryan (NCAR), J. Edwards (IBM), P. Jones (LANL), K. Lindsay (NCAR), M. Taylor (SNL), H. Tufo (NCAR), W. Waite (CU), S. Weese (NCAR)

Blue Gene/L time:
- NSF MRI Grant
- NCAR
- University of Colorado
- IBM (SUR) program
- BGW Consortium Days
- IBM Research (Watson)

Page 35: Serial execution time on multiple platforms (test)

[Bar chart: seconds for 20 timesteps for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D on IBM POWER4 (1.3 GHz), IBM POWER5 (1.9 GHz), IBM PPC 440 (700 MHz), AMD Opteron (2.2 GHz), and Intel P4 (2.0 GHz); y-axis 0-10 seconds.]

Page 36: Total execution time for POP2 (40x48) on POWER4 (gx1v3)

[Bar chart: seconds for 200 timesteps on 64 processors for PCG2+2D, PCG1+2D, PCG2+1D, and PCG1+1D; y-axis 66-88 seconds.]

The 1D solver yields a 9.5% reduction in total time, eliminating the need for ~216,000 CPU hours per year at NCAR.

Page 37: POP 0.1 degree blocksize

blocksize   Nb    Nb^2    Max ||
36x24       100   10000   7545
30x20       120   14400   10705
24x16       150   22500   16528
18x12       200   40000   28972
15x10       240   57600   41352
12x8        300   90000   64074

(Nb is the number of blocks per grid dimension, Nb^2 the total block count, and Max || the blocks remaining after land elimination, i.e., the maximum available parallelism. Moving down the table: increasing parallelism -->, decreasing overhead -->.)