
Transcript
Page 1: Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Michael Lange (1), Gerard Gorman (1), Michele Weiland (2),
Lawrence Mitchell (2), Xiaohu Guo (3), James Southern (4)

(1) AMCG, Imperial College London; (2) EPCC, University of Edinburgh
(3) STFC, Daresbury Laboratory; (4) Fujitsu Laboratories of Europe Ltd.

9 July, 2013


Page 2: Motivation

Fluidity

- Unstructured finite element code
- Anisotropic mesh adaptivity
- Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.

PETSc

- Linear solver engine
- Hybrid MPI/OpenMP version


Page 3: Programming for Exascale

Three levels of parallelism in modern HPC architectures [1]:

- Between nodes: message passing via MPI
- Between cores: shared-memory communication
- Within a core: SIMD

Hybrid MPI/OpenMP parallelism (see the sketch after this list):

- Memory argument:
  - MPI memory footprint does not scale
  - Replication of halo data
- Speed argument:
  - Message-passing overhead
  - Improved load balance with fewer MPI ranks
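As a minimal illustration of the hybrid set-up these arguments assume, the C sketch below starts one MPI rank (e.g. per node or NUMA domain) with a team of OpenMP threads inside it. It is not taken from the slides; the requested thread-support level is an assumption made so that a non-master thread may later drive MPI, as in the task-based scheme discussed on a later page.

```c
/* Minimal hybrid MPI/OpenMP set-up (illustrative sketch, not the PETSc code). */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size;

    /* Assumption: request a high thread-support level so that a dedicated
     * (non-master) thread may call MPI later on. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    #pragma omp parallel
    {
        /* Shared-memory parallelism within the rank; SIMD lives inside loops. */
        #pragma omp single
        printf("rank %d of %d: %d OpenMP threads\n",
               rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```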

[1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, pages 5:1–5:8, New York, NY, USA, 2010. ACM.


Page 4: PETSc Overview

Matrix and Vector classes are used in all other components.

Added OpenMP threading to low-level implementations (see the sketch after this list):

- Vector operations
- CSR matrices
- Block-CSR matrices
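A hedged sketch of the kind of threading added at this level: an OpenMP parallel loop over the local part of a vector, in the style of an AXPY kernel (y = alpha*x + y). This is illustrative C, not PETSc's actual implementation.

```c
/* Illustrative threaded AXPY kernel, y <- alpha*x + y, over the local part of
 * a vector. A sketch of the style of threading added to low-level Vec
 * operations, not the library's actual code. */
static void vec_axpy_omp(int n, double alpha, const double *restrict x,
                         double *restrict y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```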


Page 5: Sparse Matrix-Vector Multiplication

Matrix-multiply is the most expensive component of the solve.

[Figure: matrix rows and vector entries distributed across processes P1–P8]

Parallel matrix-multiply (a sketch follows this list):

- Multiply the diagonal submatrix
- Scatter/gather remote vector elements
- Multiply-add the off-diagonal submatrices
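The C sketch below illustrates this split (it is not PETSc's internal MatMult code): the diagonal block is applied while the off-process vector entries are being gathered, then the off-diagonal block is multiply-added. The CSR layout and the two gather helpers are hypothetical names standing in for the MPI scatter/gather machinery.

```c
/* Illustrative split SpMV, y = A*x, with A stored as a diagonal block acting
 * on locally owned vector entries plus an off-diagonal block acting on ghost
 * entries gathered from other ranks. Names are hypothetical, not PETSc's API. */
typedef struct {
    int n;                 /* local rows                 */
    const int *ia, *ja;    /* CSR row pointers / columns */
    const double *a;       /* CSR values                 */
} CSRBlock;

void begin_ghost_gather(double *xghost);   /* hypothetical: start MPI scatter */
void end_ghost_gather(double *xghost);     /* hypothetical: finish the scatter */

static void spmv(const CSRBlock *A, const double *x, double *y, int add)
{
    for (int i = 0; i < A->n; i++) {
        double sum = add ? y[i] : 0.0;
        for (int k = A->ia[i]; k < A->ia[i + 1]; k++)
            sum += A->a[k] * x[A->ja[k]];
        y[i] = sum;
    }
}

void matmult(const CSRBlock *Ad, const CSRBlock *Ao,
             const double *xlocal, double *xghost, double *y)
{
    begin_ghost_gather(xghost);   /* scatter/gather remote vector elements   */
    spmv(Ad, xlocal, y, 0);       /* multiply diagonal submatrix (overlapped) */
    end_ghost_gather(xghost);     /* wait for remote entries                  */
    spmv(Ao, xghost, y, 1);       /* multiply-add off-diagonal submatrix      */
}
```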


Page 6: Sparse Matrix-Vector Multiplication

Input vector elements require MPI communication

- Hide MPI latency by overlapping with local computation
- Not all MPI implementations work asynchronously [2]

Task-based Matrix-Multiply

- Dedicated thread for MPI communication (see the sketch after this list):
  - Advances the communication protocol
  - Copies data to/from buffers
- In contrast to vector-based threading:
  - Parallel section is lifted to include the scatter/gather operation
  - Cannot use a parallel for pragma
  - N − 1 threads are enough to saturate memory bandwidth
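A hedged sketch of the task-based idea, reusing the CSRBlock type and the hypothetical gather helpers from the previous sketch: one OpenMP thread drives the MPI scatter/gather while the remaining N − 1 threads work through a manually partitioned row range. The row ranges and spmv_rows (a row-ranged variant of the earlier spmv) are illustrative, not PETSc's implementation.

```c
#include <omp.h>

/* Hypothetical row-ranged SpMV: apply rows [r0, r1) of A, adding if add != 0. */
void spmv_rows(const CSRBlock *A, const double *x, double *y,
               int r0, int r1, int add);

/* Illustrative task-based SpMV: the last thread handles communication, the
 * other N-1 threads compute. row_start[]/row_end[] partition the local rows
 * across the N-1 compute threads only. */
void matmult_taskbased(const CSRBlock *Ad, const CSRBlock *Ao,
                       const double *xlocal, double *xghost, double *y,
                       const int *row_start, const int *row_end)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        if (tid == nth - 1) {
            /* Communication thread: progress MPI and fill the ghost buffer. */
            begin_ghost_gather(xghost);
            end_ghost_gather(xghost);
        } else {
            /* Compute threads: diagonal block over this thread's row range. */
            spmv_rows(Ad, xlocal, y, row_start[tid], row_end[tid], 0);
        }

        #pragma omp barrier   /* ghost values must be complete before use */

        if (tid < nth - 1)
            spmv_rows(Ao, xghost, y, row_start[tid], row_end[tid], 1);
    }
}
```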

[2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339–358, 2011.


Page 7: Sparse Matrix-Vector Multiplication

Thread-level Load Balance

- Matrix rows partitioned in blocks
- Partitioning created based on non-zero elements per row [3]
- Partitioning cached with the matrix object

Explicit thread-balancing scheme (sketched below):

- Initial greedy allocation
- Local diffusion algorithm
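A hedged sketch of the non-zero-based partitioning idea: a greedy pass assigns consecutive rows to each thread until it holds roughly nnz/nthreads non-zeros. The local diffusion refinement mentioned above is omitted, and all names are illustrative rather than PETSc's internals.

```c
/* Greedy nnz-balanced row partitioning (illustrative; diffusion step omitted).
 * ia is the CSR row-pointer array of length nrows+1; thread t receives the
 * half-open row range [row_start[t], row_end[t]). */
void partition_rows_by_nnz(int nrows, const int *ia, int nthreads,
                           int *row_start, int *row_end)
{
    long long total  = ia[nrows];                       /* total non-zeros   */
    long long target = (total + nthreads - 1) / nthreads;
    int row = 0;

    for (int t = 0; t < nthreads; t++) {
        row_start[t] = row;
        long long nnz = 0;
        /* Greedily take rows until this thread holds ~target non-zeros,
         * leaving at least one row for each remaining thread. */
        while (row < nrows && nnz < target && nrows - row > nthreads - 1 - t) {
            nnz += ia[row + 1] - ia[row];
            row++;
        }
        row_end[t] = row;
    }
    row_end[nthreads - 1] = nrows;                      /* last thread takes the rest */
}
```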

[3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.


Page 8: Benchmark

Global baroclinic ocean simulation [4]

- Mesh based on extruded bathymetry data
- Pressure matrix:
  - 371,102,769 non-zero elements
  - 13,491,933 degrees of freedom
- Solver options (a PETSc configuration sketch follows this list):
  - Conjugate Gradient method
  - Jacobi preconditioner
  - 10,000 iterations
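One hedged way to express this solver set-up with PETSc's C API is sketched below; the slide does not show the exact options or tolerances used in the study, and the function name is hypothetical.

```c
#include <petscksp.h>

/* Hedged sketch (not the benchmark driver): configure a KSP to match the
 * stated solver options, CG with a Jacobi preconditioner capped at 10,000
 * iterations. Equivalent runtime flags would be
 * -ksp_type cg -pc_type jacobi -ksp_max_it 10000. */
static PetscErrorCode configure_pressure_solver(KSP ksp)
{
    PC pc;

    PetscFunctionBeginUser;
    KSPSetType(ksp, KSPCG);           /* Conjugate Gradient method        */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);          /* Jacobi (diagonal) preconditioner */
    KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT, 10000);
    PetscFunctionReturn(0);
}
```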

[4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003–1015, 2008.


Page 9: Architecture Overview

Cray XE6 (HECToR)

- NUMA architecture
- 32 cores per node
- 4 NUMA domains, 8 cores each

Fujitsu PRIMEHPC FX10

- UMA architecture
- 16 cores per node

IBM BlueGene/Q

- UMA architecture
- 16 cores per node
- 4-way hardware threading (SMT)


Page 10: Hardware Utilisation: 128 cores

[Plot: runtime (s) vs. number of threads per MPI process (1–32) on 128 cores, for XE6 and FX10 with vector-based, task-based, and task-based NZ-balanced variants]

- On the XE6, slowdown when spanning multiple NUMA domains
- Performance bound by memory latency


Page 11: Hardware Utilisation: 1024 cores

[Plot: runtime (s) vs. number of threads per MPI process (1–32) on 1024 cores, for XE6 and FX10 with vector-based, task-based, and task-based NZ-balanced variants]

- Both task-based algorithms improve
- NZ-based load balancing is now faster


Page 12: Hardware Utilisation: 4096 cores

[Plot: runtime (s) vs. number of threads per MPI process (1–32) on 4096 cores, for XE6 with vector-based, task-based, and task-based NZ-balanced variants]

- Vector-based approach bound by MPI communication
- Explicit thread-balancing improves memory-bandwidth utilisation, but worsens latency effects!


Page 13: Strong Scaling: Cray XE6

[Plots: runtime (s) and parallel efficiency (%) vs. number of cores (32–8192) on the Cray XE6, comparing pure MPI with vector-based, task-based, and task-based NZ-balanced hybrid variants]


Page 14: Strong Scaling: Cray XE6

[Plots: runtime (s) and parallel efficiency (%) vs. number of cores (256–32768) on the Cray XE6, comparing pure MPI with vector-based, task-based, and task-based NZ-balanced hybrid variants]


Page 15: Strong Scaling: BlueGene/Q

[Plots: runtime (s) and parallel efficiency (%) vs. number of cores (128–8192) on BlueGene/Q, comparing pure MPI with vector-based, task-based, and task-based NZ-balanced hybrid variants (SMT=4)]


Page 16: Conclusion

OpenMP-threaded PETSc version:

- Threaded vector and matrix operators
- Task-based sparse matrix multiplication
- Non-zero-based thread partitioning

Strong scaling optimisation:

- Performance deficit on small numbers of nodes (latency-bound)
- Increased performance in the strong-scaling limit (bandwidth-bound)
- Marshalling load imbalance:
  - Inter-process balance improved with fewer MPI ranks
  - Load imbalance among threads handled explicitly


Page 17: Acknowledgements

Threaded PETSc version is available at:

- Open Petascale Libraries: http://www.openpetascale.org/

The work presented here was funded by:

- Fujitsu Laboratories of Europe Ltd.
- European Commission in FP7 as part of the APOS-EU project

Many thanks to:

- EPCC
- Hartree Centre
- PETSc development team
