Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation


  • Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

    Michael Lange¹, Gerard Gorman¹, Michele Weiland², Lawrence Mitchell², Xiaohu Guo³, James Southern⁴

    ¹AMCG, Imperial College London   ²EPCC, University of Edinburgh
    ³STFC, Daresbury Laboratory   ⁴Fujitsu Laboratories of Europe Ltd.

    9 July 2013


  • Motivation

    Fluidity

    - Unstructured finite element code
    - Anisotropic mesh adaptivity
    - Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.

    PETSc

    - Linear solver engine
    - Hybrid MPI/OpenMP version


  • Programming for Exascale

    Three levels of parallelism in modern HPC architectures [1]:

    - Between nodes: message passing via MPI
    - Between cores: shared-memory communication
    - Within core: SIMD

    Hybrid MPI/OpenMP parallelism (a minimal start-up sketch follows the reference below):

    - Memory argument
      - MPI memory footprint not scalable
      - Replication of halo data
    - Speed argument
      - Message-passing overhead
      - Improved load balance with fewer MPI ranks

    [1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns (ParaPLoP '10), pages 5:1–5:8. ACM, New York, NY, USA, 2010.
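    As a concrete illustration of the first two levels (the SIMD level is left to the compiler), below is a minimal hybrid MPI/OpenMP start-up in C. This is a generic sketch, not code from the talk; MPI_THREAD_FUNNELED is assumed to be sufficient because only one thread per rank makes MPI calls.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Ask for a threading level; FUNNELED suffices if only one
           (dedicated) thread per rank makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Shared-memory parallelism inside each rank. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d of %d runs %d threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }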


  • PETSc Overview

    Matrix and Vector classes are used in all other components

    Added OpenMP threading to low-level implementations:

    - Vector operations
    - CSR matrices
    - Block-CSR matrices
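    To make the threading concrete, here is a minimal sketch of an OpenMP-threaded CSR sparse matrix-vector product of the kind these low-level kernels perform. It assumes the standard CSR arrays (row pointers ia, column indices ja, values a) and is illustrative, not PETSc's actual implementation; note that the static schedule splits rows, not non-zeros, evenly across threads.

    /* y = A*x for an m-row CSR matrix (ia: row pointers, ja: column
       indices, a: values). Illustrative kernel, not PETSc source. */
    void csr_spmv_omp(int m, const int *ia, const int *ja,
                      const double *a, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int k = ia[i]; k < ia[i + 1]; k++)
                sum += a[k] * x[ja[k]];
            y[i] = sum;
        }
    }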


  • Sparse Matrix-Vector Multiplication

    Matrix-multiply is the most expensive component of the solve

    [Figure: matrix rows and vector entries partitioned across processes P1–P8]

    Parallel Matrix-Multiply:

    - Multiply diagonal submatrix
    - Scatter/gather remote vector elements
    - Multiply-add off-diagonal submatrices
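    This structure can be written schematically as below. The type and helper names (csr_t, halo_begin, halo_end) are illustrative stand-ins, not the PETSc API: the halo functions represent the VecScatter-style gather of remote vector elements, and the two csr_spmv calls correspond to the diagonal and off-diagonal submatrix multiplies.

    /* Distributed SpMV: the local matrix is split into a "diagonal" block
       (columns owned by this rank) and an "off-diagonal" block (columns
       owned by other ranks), so communication can overlap computation. */
    typedef struct { int m; const int *ia, *ja; const double *a; } csr_t;

    static void csr_spmv(const csr_t *A, const double *x, double *y, int add)
    {
        for (int i = 0; i < A->m; i++) {
            double sum = add ? y[i] : 0.0;
            for (int k = A->ia[i]; k < A->ia[i + 1]; k++)
                sum += A->a[k] * x[A->ja[k]];
            y[i] = sum;
        }
    }

    /* Placeholders for the scatter/gather of remote vector elements. */
    static void halo_begin(const double *x, double *x_halo)
    { (void)x; (void)x_halo; /* post MPI_Isend/MPI_Irecv for ghost entries */ }
    static void halo_end(double *x_halo)
    { (void)x_halo;          /* MPI_Waitall and unpack into the halo buffer */ }

    void dist_spmv(const csr_t *Adiag, const csr_t *Aoff,
                   const double *x_local, double *x_halo, double *y)
    {
        halo_begin(x_local, x_halo);     /* start gathering remote elements  */
        csr_spmv(Adiag, x_local, y, 0);  /* multiply diagonal submatrix      */
        halo_end(x_halo);                /* wait for remote elements         */
        csr_spmv(Aoff, x_halo, y, 1);    /* multiply-add off-diagonal blocks */
    }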


  • Sparse Matrix-Vector Multiplication

    Input vector elements require MPI communication

    - Hide MPI latency by overlapping with local computation
    - Not all MPI implementations work asynchronously [2]

    Task-based Matrix-Multiply

    - Dedicated thread for MPI communication
      - Advances communication protocol
      - Copy data to/from buffer
    - In contrast to Vector-based threading:
      - Lift parallel section to include scatter/gather operation
      - Cannot use parallel for pragma
      - N − 1 threads enough to saturate memory bandwidth

    (a sketch of this scheme follows the reference below)

    [2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339–358, 2011.
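    The sketch below reuses the illustrative csr_t, halo_begin and halo_end from the previous sketch (again, not the PETSc API). One thread drives communication while the remaining N − 1 threads multiply the diagonal block over explicit row ranges, which is why a plain parallel for cannot be used here; at least two threads are assumed.

    #include <omp.h>

    void dist_spmv_task(const csr_t *Adiag, const csr_t *Aoff,
                        const double *x_local, double *x_halo, double *y)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nth = omp_get_num_threads();   /* assumed >= 2 */

            if (tid == 0) {
                /* Dedicated communication thread: advance the MPI protocol
                   and copy remote vector elements into the halo buffer. */
                halo_begin(x_local, x_halo);
                halo_end(x_halo);
            } else {
                /* The other N-1 threads multiply their slice of the
                   diagonal block (explicit row range, no parallel for). */
                int lo = (int)((long long)Adiag->m * (tid - 1) / (nth - 1));
                int hi = (int)((long long)Adiag->m *  tid      / (nth - 1));
                for (int i = lo; i < hi; i++) {
                    double sum = 0.0;
                    for (int k = Adiag->ia[i]; k < Adiag->ia[i + 1]; k++)
                        sum += Adiag->a[k] * x_local[Adiag->ja[k]];
                    y[i] = sum;
                }
            }

            /* Halo data and diagonal part must both be complete. */
            #pragma omp barrier

            /* Off-diagonal multiply-add, now over all N threads. */
            #pragma omp for schedule(static)
            for (int i = 0; i < Aoff->m; i++) {
                double sum = 0.0;
                for (int k = Aoff->ia[i]; k < Aoff->ia[i + 1]; k++)
                    sum += Aoff->a[k] * x_halo[Aoff->ja[k]];
                y[i] += sum;
            }
        }
    }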


  • Sparse Matrix-Vector Multiplication

    Thread-level Load Balance

    - Matrix rows partitioned in blocks
    - Create partitioning based on non-zero elements per row [3]
    - Cache partitioning with matrix object

    Explicit thread-balancing scheme (a sketch follows the reference below):

    - Initial greedy allocation
    - Local diffusion algorithm

    [3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.
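    A minimal sketch of the non-zero-based partitioning idea: since the CSR row pointer is already a prefix sum of non-zeros per row, a single greedy sweep can place the per-thread block boundaries so each thread owns roughly nnz/nthreads non-zeros. This stands in for the initial greedy allocation only; the local diffusion refinement is not shown, and the names are illustrative rather than PETSc's.

    /* Compute row block boundaries so thread t owns rows
       [start[t], start[t+1]) with roughly equal non-zero counts.
       ia is the CSR row pointer (length m+1); start has nthreads+1 entries.
       The result can be cached with the matrix and reused each multiply. */
    void nnz_partition(int m, const int *ia, int nthreads, int *start)
    {
        long long nnz = ia[m];
        int t = 0;
        start[0] = 0;
        for (int i = 0; i < m && t + 1 < nthreads; i++) {
            /* Close block t once it has reached its share of non-zeros. */
            if (ia[i + 1] >= nnz * (t + 1) / nthreads)
                start[++t] = i + 1;
        }
        while (++t <= nthreads)      /* fill any remaining boundaries */
            start[t] = m;
    }

    Each thread then runs the CSR kernel over its rows [start[t], start[t+1]) instead of an even split by row count.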


  • Benchmark

    Global baroclinic ocean simulation [4]

    - Mesh based on extruded bathymetry data
    - Pressure matrix:
      - 371,102,769 non-zero elements
      - 13,491,933 degrees of freedom
    - Solver options (see the configuration sketch after the reference below):
      - Conjugate Gradient method
      - Jacobi preconditioner
      - 10,000 iterations

    [4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003–1015, 2008.
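    For reference, the solver configuration listed above could be requested through the PETSc C API as sketched below; the slides do not show how it was actually set, and the same effect is commonly achieved with the run-time options -ksp_type cg -pc_type jacobi -ksp_max_it 10000. The ksp object is assumed to be already attached to the pressure matrix.

    #include <petscksp.h>

    PetscErrorCode configure_pressure_solver(KSP ksp)
    {
        PC             pc;
        PetscErrorCode ierr;

        ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);     /* Conjugate Gradient method */
        ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
        ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);    /* Jacobi preconditioner     */
        ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT,
                                PETSC_DEFAULT, 10000);CHKERRQ(ierr); /* up to 10,000 iterations */
        ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);     /* allow run-time overrides  */
        return 0;
    }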


  • Architecture Overview

    Cray XE6 (HECToR)

    - NUMA architecture
    - 32 cores per node
    - 4 NUMA domains, 8 cores each

    Fujitsu PRIMEHPC FX10

    - UMA architecture
    - 16 cores per node

    IBM BlueGene/Q

    - UMA architecture
    - 16 cores per node
    - 4-way hardware threading (SMT)


  • Hardware Utilisation: 128 cores

    [Figure: runtime (s) vs. number of threads per MPI process (1–32) at 128 cores; series: XE6 and FX10, each with Vector, Task, and Task NZ-balanced variants]

    - On the XE6, slowdown with multiple NUMA domains
    - Performance bound by memory latency


  • Hardware Utilisation: 1024 cores

    [Figure: runtime (s) vs. number of threads per MPI process (1–32) at 1024 cores; series: XE6 and FX10, each with Vector, Task, and Task NZ-balanced variants]

    - Both task-based algorithms improve
    - NZ-based load balancing now faster


  • Hardware Utilisation: 4096 cores

    [Figure: runtime (s) vs. number of threads per MPI process (1–32) at 4096 cores; series: XE6 with Vector, Task, and Task NZ-balanced variants]

    - Vector-based approach bound by MPI communication
    - Explicit thread-balancing improves memory bandwidth utilisation, but worsens latency effects


  • Strong Scaling: Cray XE6

    [Figure: runtime (s) and parallel efficiency (%) vs. number of cores (32–8192) on the Cray XE6; series: Vector-based, Task-based, Task-based NZ-balanced, Pure-MPI]


  • Strong Scaling: Cray XE6

    [Figure: runtime (s) and parallel efficiency (%) vs. number of cores (256–32768) on the Cray XE6; series: Vector-based, Task-based, Task-based NZ-balanced, Pure-MPI]


  • Strong Scaling: BlueGene/Q

    [Figure: runtime (s) and parallel efficiency (%) vs. number of cores (128–8192) on BlueGene/Q; series: Pure-MPI, and SMT=4 runs of Vector-based, Task-based, and Task-based NZ-balanced]


  • Conclusion

    OpenMP-threaded PETSc version

    - Threaded vector and matrix operators
    - Task-based sparse matrix multiplication
    - Non-zero-based thread partitioning

    Strong scaling optimisation

    - Performance deficit on small numbers of nodes (latency-bound)
    - Increased performance in the strong-scaling limit (bandwidth-bound)
    - Marshalling load imbalance:
      - Inter-process balance improved with fewer MPI ranks
      - Load imbalance among threads handled explicitly


  • Acknowledgements

    Threaded PETSc version is available at:

    - Open Petascale Libraries: http://www.openpetascale.org/

    The work presented here was funded by:

    - Fujitsu Laboratories of Europe Ltd.
    - European Commission in FP7 as part of the APOS-EU project

    Many thanks to:

    - EPCC
    - Hartree Centre
    - PETSc development team


    Download: http://www.openpetascale.org/index.php/public/page/download