Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Michael Lange (1), Gerard Gorman (1), Michele Weiland (2), Lawrence Mitchell (2), Xiaohu Guo (3), James Southern (4)
(1) AMCG, Imperial College London  (2) EPCC, University of Edinburgh
(3) STFC, Daresbury Laboratory  (4) Fujitsu Laboratories of Europe Ltd.
9 July, 2013
Motivation
Fluidity
- Unstructured finite element code
- Anisotropic mesh adaptivity
- Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.
PETSc
- Linear solver engine
- Hybrid MPI/OpenMP version
Programming for Exascale
Three levels of parallelism in modern HPC architectures [1]:
- Between nodes: message passing via MPI
- Between cores: shared-memory communication
- Within core: SIMD
Hybrid MPI/OpenMP parallelism:
- Memory argument
  - MPI memory footprint not scalable
  - Replication of halo data
- Speed argument
  - Message-passing overhead
  - Improved load balance with fewer MPI ranks

[1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, pages 5:1-5:8, New York, NY, USA, 2010. ACM.
PETSc Overview
Matrix and Vector classes are used in all other components.
Added OpenMP threading to low-level implementations:
- Vector operations
- CSR matrices
- Block-CSR matrices
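To illustrate the kind of threading added to these low-level operations, below is a minimal sketch of an OpenMP-parallel CSR matrix-vector product. The function and array names are assumptions for illustration, not the actual PETSc MatMult code.

/* Minimal sketch of an OpenMP-threaded CSR matrix-vector product y = A*x.
 * Illustration only; not the actual PETSc MatMult_SeqAIJ implementation. */
void csr_spmv(int nrows, const int *rowptr, const int *colidx,
              const double *val, const double *x, double *y)
{
    int i, j;
    #pragma omp parallel for private(j) schedule(static)
    for (i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (j = rowptr[i]; j < rowptr[i+1]; j++)
            sum += val[j] * x[colidx[j]];
        y[i] = sum;
    }
}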
Sparse Matrix-Vector Multiplication
The matrix-vector multiply is the most expensive component of the solve.
[Figure: row-wise partitioning of the matrix and input vector across processes P1-P8.]
Parallel Matrix-Multiply:
- Multiply diagonal submatrix
- Scatter/gather remote vector elements
- Multiply-add off-diagonal submatrices
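The three steps above map onto a routine of roughly the following shape. This is a minimal sketch: the CSR struct and the halo_exchange_begin/end helpers are illustrative stand-ins for PETSc's internal data layout and VecScatter calls (cf. MatMult_MPIAIJ), not the actual implementation.

/* Sketch of the distributed matrix-multiply y = A*x on one MPI rank.
 * A is stored as two local CSR blocks: Ad (diagonal block, acting on the
 * locally owned vector entries) and Ao (off-diagonal block, acting on
 * ghost entries gathered from remote ranks). Names are illustrative. */
typedef struct { int n; int *rowptr, *colidx; double *val; } CSR;

/* Hypothetical halo-exchange helpers (PETSc uses VecScatterBegin/End). */
void halo_exchange_begin(const double *x_own, double *x_ghost);
void halo_exchange_end(double *x_ghost);

void mpi_spmv(const CSR *Ad, const CSR *Ao,
              const double *x_own, double *x_ghost, double *y)
{
    halo_exchange_begin(x_own, x_ghost);          /* start fetching remote x  */
    for (int i = 0; i < Ad->n; i++) {             /* y = Ad * x_own (overlap) */
        double sum = 0.0;
        for (int j = Ad->rowptr[i]; j < Ad->rowptr[i+1]; j++)
            sum += Ad->val[j] * x_own[Ad->colidx[j]];
        y[i] = sum;
    }
    halo_exchange_end(x_ghost);                   /* wait for ghost entries   */
    for (int i = 0; i < Ao->n; i++)               /* y += Ao * x_ghost        */
        for (int j = Ao->rowptr[i]; j < Ao->rowptr[i+1]; j++)
            y[i] += Ao->val[j] * x_ghost[Ao->colidx[j]];
}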
Sparse Matrix-Vector Multiplication
Input vector elements require MPI communication
- Hide MPI latency by overlapping with local computation
- Not all MPI implementations progress communication asynchronously [2]
Task-based Matrix-Multiply
- Dedicated thread for MPI communication
  - Advances the communication protocol
  - Copies data to/from buffers
- In contrast to vector-based threading:
  - Lift the parallel section to include the scatter/gather operation
  - Cannot use a parallel for pragma
  - N - 1 threads are enough to saturate memory bandwidth

[2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339-358, 2011.
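A minimal sketch of the task-based variant, reusing the illustrative CSR struct and vector layout from the sketch above and assuming a hypothetical blocking halo_exchange() helper: the last OpenMP thread drives the MPI communication while the remaining N - 1 threads multiply the diagonal block, with row ranges computed by hand because the scatter/gather now sits inside the parallel region. This is not the PETSc internals.

#include <omp.h>

/* Hypothetical blocking halo exchange performed by the communication
 * thread (progresses MPI and copies data to/from the ghost buffer). */
void halo_exchange(const double *x_own, double *x_ghost);

/* Sketch of the task-based multiply; assumes at least two OpenMP threads
 * and the CSR struct from the previous sketch. Illustration only. */
void task_spmv(const CSR *Ad, const CSR *Ao,
               const double *x_own, double *x_ghost, double *y)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        if (tid == nth - 1) {
            /* Dedicated communication thread. */
            halo_exchange(x_own, x_ghost);
        } else {
            /* Compute threads: diagonal block split over nth-1 threads. */
            int lo = (Ad->n * tid) / (nth - 1);
            int hi = (Ad->n * (tid + 1)) / (nth - 1);
            for (int i = lo; i < hi; i++) {
                double sum = 0.0;
                for (int j = Ad->rowptr[i]; j < Ad->rowptr[i+1]; j++)
                    sum += Ad->val[j] * x_own[Ad->colidx[j]];
                y[i] = sum;
            }
        }
        #pragma omp barrier                 /* ghost data now available */

        /* All threads join in the off-diagonal multiply-add. */
        #pragma omp for
        for (int i = 0; i < Ao->n; i++)
            for (int j = Ao->rowptr[i]; j < Ao->rowptr[i+1]; j++)
                y[i] += Ao->val[j] * x_ghost[Ao->colidx[j]];
    }
}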
Sparse Matrix-Vector Multiplication
Thread-level Load Balance
- Matrix rows partitioned in blocks
- Partitioning created based on non-zero elements per row [3]
- Partitioning cached with the matrix object
Explicit thread-balancing scheme
- Initial greedy allocation
- Local diffusion algorithm

[3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178-194, 2009.
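A minimal sketch of the greedy first pass of the non-zero-based partitioning, assuming a CSR row pointer as in the earlier sketches: block boundaries are placed so that each thread owns roughly nnz/nthreads non-zeros. The local diffusion refinement is omitted, and the function name and signature are illustrative, not the PETSc API.

/* Greedy non-zero-based row partitioning: start[t]..start[t+1]-1 are the
 * rows owned by thread t, chosen so each block holds ~nnz/nthreads
 * non-zeros. Illustration only; the diffusion refinement is not shown. */
void nz_partition(int nrows, const int *rowptr, int nthreads, int *start)
{
    long nnz = rowptr[nrows];
    int  row = 0;
    for (int t = 0; t < nthreads; t++) {
        start[t] = row;
        long target = (nnz * (t + 1)) / nthreads;   /* cumulative nnz quota */
        while (row < nrows && rowptr[row + 1] <= target)
            row++;
    }
    start[nthreads] = nrows;
}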
Benchmark
Global baroclinic ocean simulation [4]
- Mesh based on extruded bathymetry data
- Pressure matrix:
  - 371,102,769 non-zero elements
  - 13,491,933 degrees of freedom
- Solver options:
  - Conjugate Gradient method
  - Jacobi preconditioner
  - 10,000 iterations

[4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003-1015, 2008.
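This configuration corresponds to standard PETSc options (equivalently -ksp_type cg -pc_type jacobi -ksp_max_it 10000 on the command line). A minimal sketch using the public KSP API is shown below; A, b and x are assumed to be an assembled Mat and Vecs, and error checking is omitted for brevity.

#include <petscksp.h>

/* Sketch of the benchmark solver set-up: Conjugate Gradient with a Jacobi
 * preconditioner, capped at 10,000 iterations. Error checking omitted. */
PetscErrorCode solve_pressure(Mat A, Vec b, Vec x)
{
    KSP ksp;
    PC  pc;

    KSPCreate(PETSC_COMM_WORLD, &ksp);
    KSPSetOperators(ksp, A, A);           /* recent-PETSc signature        */
    KSPSetType(ksp, KSPCG);               /* Conjugate Gradient            */
    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCJACOBI);              /* Jacobi preconditioner         */
    KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT, 10000);
    KSPSetFromOptions(ksp);               /* allow command-line overrides  */
    KSPSolve(ksp, b, x);
    KSPDestroy(&ksp);
    return 0;
}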
Architecture Overview
Cray XE6 (HECToR)
- NUMA architecture
- 32 cores per node
- 4 NUMA domains, 8 cores each
Fujitsu PRIMEHPC FX10
- UMA architecture
- 16 cores per node
IBM BlueGene/Q
- UMA architecture
- 16 cores per node
- 4-way hardware threading (SMT)
Hardware Utilisation: 128 cores
[Figure: runtime (s) vs. number of threads per MPI process (1, 2, 4, 8, 16, 32) on 128 cores; XE6 and FX10 curves for the vector-based, task-based, and NZ-balanced task-based variants.]
- On the XE6, slowdown with multiple NUMA domains
- Performance bound by memory latency
Hardware Utilisation: 1024 cores
[Figure: runtime (s) vs. number of threads per MPI process (1, 2, 4, 8, 16, 32) on 1024 cores; XE6 and FX10 curves for the vector-based, task-based, and NZ-balanced task-based variants.]
- Both task-based algorithms improve
- NZ-based load balancing is now faster
Hardware Utilisation: 4096 cores
[Figure: runtime (s) vs. number of threads per MPI process (1, 2, 4, 8, 16, 32) on 4096 cores; XE6 curves for the vector-based, task-based, and NZ-balanced task-based variants.]
- Vector-based approach bound by MPI communication
- Explicit thread balancing improves memory bandwidth utilisation, but worsens latency effects!
Strong Scaling: Cray XE6
[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (32-8192) on the Cray XE6, comparing pure MPI with the vector-based, task-based, and NZ-balanced task-based hybrid variants.]
Strong Scaling: Cray XE6
[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (256-32768) on the Cray XE6, comparing pure MPI with the vector-based, task-based, and NZ-balanced task-based hybrid variants.]
Strong Scaling: BlueGene/Q
[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (128-8192) on BlueGene/Q, comparing pure MPI with the vector-based, task-based, and NZ-balanced task-based variants at SMT=4.]
Conclusion
OpenMP threaded PETSc version
- Threaded vector and matrix operators
- Task-based sparse matrix multiplication
- Non-zero-based thread partitioning
Strong scaling optimisation
- Performance deficit on small numbers of nodes (latency-bound)
- Increased performance in the strong limit (bandwidth-bound)
- Marshalling load imbalance:
  - Inter-process balance improved with fewer MPI ranks
  - Load imbalance among threads handled explicitly
Acknowledgements
Threaded PETSc version is available at:
- Open Petascale Libraries: http://www.openpetascale.org/
The work presented here was funded by:
- Fujitsu Laboratories of Europe Ltd.
- European Commission in FP7 as part of the APOS-EU project
Many thanks to:
- EPCC
- Hartree Centre
- PETSc development team