Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation
Michael Lange1 Gerard Gorman1 Michele Weiland2
Lawrence Mitchell2 Xiaohu Guo3 James Southern4
1AMCG, Imperial College London 2EPCC, University of Edinburgh
3STFC, Daresbury Laboratory 4Fujitsu Laboratories of Europe Ltd.
9 July, 2013
Motivation
Fluidity
- Unstructured finite element code
- Anisotropic mesh adaptivity
- Applications:
  - CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.
PETSc
- Linear solver engine
- Hybrid MPI/OpenMP version
Programming for Exascale
Three levels of parallelism in modern HPC architectures [1]:
- Between nodes: message passing via MPI
- Between cores: shared-memory communication
- Within a core: SIMD

Hybrid MPI/OpenMP parallelism (a minimal set-up is sketched below):

- Memory argument
  - MPI memory footprint does not scale
  - Replication of halo data
- Speed argument
  - Message-passing overhead
  - Improved load balance with fewer MPI ranks
[1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns, ParaPLoP '10, pages 5:1–5:8, New York, NY, USA, 2010. ACM.
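As a hedged illustration of the three levels, a minimal hybrid MPI/OpenMP set-up might look like the following sketch (plain MPI and OpenMP, not the PETSc implementation; the FUNNELED threading level matches a model in which only one thread makes MPI calls):

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Level 1: message passing between nodes. Request FUNNELED so that
       only the master thread performs MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Level 2: shared-memory threads within a node or NUMA domain. */
    #pragma omp parallel
    {
        /* Level 3: SIMD within a core, left to the compiler here. */
        #pragma omp master
        printf("rank %d of %d running %d threads\n",
               rank, nranks, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```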
PETSc Overview
Matrix and Vector classes are used in all other components
Added OpenMP threading to low-level implementations:
- Vector operations
- CSR matrices
- Block-CSR matrices
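A hedged sketch of the kind of loop-level OpenMP threading this refers to, shown for a plain CSR matrix-vector product (illustrative C, not the actual PETSc MatMult kernel; array names are assumptions):

```c
/* y = A*x for a CSR matrix with row pointer ia, column indices ja and
   values a; the outer row loop is shared among OpenMP threads. */
void csr_spmv(int nrows, const int *ia, const int *ja,
              const double *a, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = ia[i]; k < ia[i + 1]; k++)
            sum += a[k] * x[ja[k]];
        y[i] = sum;
    }
}
```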
Sparse Matrix-Vector Multiplication
The matrix-vector multiply is the most expensive component of the solve.
[Figure: sparse matrix and input vector partitioned row-wise across processes P1–P8.]
Parallel Matrix-Multiply:
- Multiply the diagonal submatrix
- Scatter/gather remote vector elements
- Multiply-add the off-diagonal submatrices
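In PETSc's distributed matrix format each rank stores a diagonal block acting on its local vector entries and an off-diagonal block acting on gathered ghost entries. A hedged sketch of the resulting multiply using the public VecScatter/Mat interfaces (function and variable names are illustrative, and the classic CHKERRQ error-checking style is assumed):

```c
#include <petscmat.h>

/* y = A*x with the communication of ghost values overlapped with the
   local (diagonal-block) multiply. Ad/Ao are the diagonal and
   off-diagonal blocks, lvec holds the gathered ghost entries. */
PetscErrorCode MatMultOverlapSketch(Mat Ad, Mat Ao, VecScatter ctx,
                                    Vec x, Vec lvec, Vec y)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  /* Start gathering remote vector elements (non-blocking). */
  ierr = VecScatterBegin(ctx, x, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  /* Multiply the diagonal submatrix using purely local data. */
  ierr = MatMult(Ad, x, y);CHKERRQ(ierr);
  /* Finish the scatter, then multiply-add the off-diagonal block. */
  ierr = VecScatterEnd(ctx, x, lvec, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = MatMultAdd(Ao, lvec, y, y);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```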
Sparse Matrix-Vector Multiplication
Input vector elements require MPI communication
- Hide MPI latency by overlapping with local computation
- Not all MPI implementations work asynchronously [2]
Task-based Matrix-Multiply
- Dedicated thread for MPI communication (see the sketch below)
  - Advances the communication protocol
  - Copies data to/from buffers
- In contrast to vector-based threading
  - The parallel section is lifted to include the scatter/gather operation
  - Cannot use a parallel for pragma
  - N − 1 compute threads are enough to saturate memory bandwidth
[2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339–358, 2011.
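A hedged OpenMP sketch of the task-based scheme: within one parallel region, thread 0 drives the halo exchange while the remaining N − 1 threads multiply the diagonal block over explicitly assigned row ranges. All names (halo_exchange, the CSR arrays) are illustrative, not the actual PETSc code.

```c
#include <omp.h>

/* Task-based SpMV sketch: diagonal block (ia/ja/a) uses local x,
   off-diagonal block (io/jo/ao) uses the gathered ghost values. */
void spmv_taskbased(int nrows,
                    const int *ia, const int *ja, const double *a,
                    const int *io, const int *jo, const double *ao,
                    const double *x, double *xghost, double *y,
                    void (*halo_exchange)(const double *x, double *xghost))
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        int nth = omp_get_num_threads();

        if (tid == 0) {
            /* Dedicated communication thread: post, progress and
               complete the MPI exchange of remote vector elements. */
            halo_exchange(x, xghost);
        } else {
            /* Compute threads: explicit row ranges instead of a
               "parallel for", so the parallel section also covers
               the scatter/gather above. */
            int cth = nth - 1, ct = tid - 1;
            int lo = (nrows * ct) / cth, hi = (nrows * (ct + 1)) / cth;
            for (int i = lo; i < hi; i++) {
                double sum = 0.0;
                for (int k = ia[i]; k < ia[i + 1]; k++)
                    sum += a[k] * x[ja[k]];
                y[i] = sum;
            }
        }

        /* Wait for both the local multiply and the halo exchange,
           then all threads multiply-add the off-diagonal block. */
        #pragma omp barrier
        int lo = (nrows * tid) / nth, hi = (nrows * (tid + 1)) / nth;
        for (int i = lo; i < hi; i++) {
            double sum = 0.0;
            for (int k = io[i]; k < io[i + 1]; k++)
                sum += ao[k] * xghost[jo[k]];
            y[i] += sum;
        }
    }
}
```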
Sparse Matrix-Vector Multiplication
Thread-level Load Balance
- Matrix rows partitioned into blocks
- Partitioning based on non-zero elements per row [3]
- Partitioning cached with the matrix object
Explicit thread-balancing scheme
- Initial greedy allocation (sketched below)
- Local diffusion algorithm
[3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.
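A hedged sketch of the greedy non-zero-based allocation: row ranges are chosen so that each thread owns roughly the same number of non-zeros, using a single sweep over the CSR row pointer (the local diffusion refinement mentioned above is omitted; names are illustrative):

```c
/* Compute per-thread row ranges so that each of nthreads threads owns
   roughly nnz_total/nthreads non-zeros. ia is the CSR row pointer of
   length nrows+1; rowstart has length nthreads+1 on return. */
void nz_balanced_partition(int nrows, const int *ia,
                           int nthreads, int *rowstart)
{
    const int    nnz_total = ia[nrows];
    const double target    = (double)nnz_total / nthreads;
    int          row       = 0;

    rowstart[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        /* Greedy: advance until thread t-1 has reached its share. */
        while (row < nrows && ia[row] < t * target)
            row++;
        rowstart[t] = row;
    }
    rowstart[nthreads] = nrows;
}
```

As noted above, the resulting partitioning can be cached with the matrix object so that repeated multiplies reuse it.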
Benchmark
Global baroclinic ocean simulation [4]

- Mesh based on extruded bathymetry data
- Pressure matrix:
  - 371,102,769 non-zero elements
  - 13,491,933 degrees of freedom
- Solver options (see the sketch below):
  - Conjugate Gradient method
  - Jacobi preconditioner
  - 10,000 iterations
[4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003–1015, 2008.
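The solver options above correspond roughly to the following PETSc configuration, either via runtime options (-ksp_type cg -pc_type jacobi -ksp_max_it 10000) or in code. This is a hedged sketch assuming the three-argument KSPSetOperators of PETSc >= 3.5; the helper name is illustrative:

```c
#include <petscksp.h>

/* Configure a KSP as in the benchmark: Conjugate Gradient with a
   Jacobi preconditioner, capped at 10,000 iterations. A is assumed
   to be the assembled pressure matrix. */
PetscErrorCode setup_benchmark_solver(Mat A, KSP *ksp)
{
  PC             pc;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PETSC_COMM_WORLD, ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(*ksp, A, A);CHKERRQ(ierr);
  ierr = KSPSetType(*ksp, KSPCG);CHKERRQ(ierr);
  ierr = KSPGetPC(*ksp, &pc);CHKERRQ(ierr);
  ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);
  ierr = KSPSetTolerances(*ksp, PETSC_DEFAULT, PETSC_DEFAULT,
                          PETSC_DEFAULT, 10000);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(*ksp);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```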
Architecture Overview
Cray XE6 (HECToR)
- NUMA architecture
- 32 cores per node
- 4 NUMA domains, 8 cores each

Fujitsu PRIMEHPC FX10

- UMA architecture
- 16 cores per node

IBM BlueGene/Q

- UMA architecture
- 16 cores per node
- 4-way hardware threading (SMT)
Hardware Utilisation: 128 cores
[Figure: runtime (s) vs. number of threads per MPI process (1–32) at 128 cores; XE6 and FX10, each with vector-based, task-based, and NZ-balanced task-based variants.]
- On the XE6, slowdown with multiple NUMA domains
- Performance bound by memory latency
Hardware Utilisation: 1024 cores
[Figure: runtime (s) vs. number of threads per MPI process (1–32) at 1024 cores; XE6 and FX10, each with vector-based, task-based, and NZ-balanced task-based variants.]
- Both task-based algorithms improve
- NZ-based load balancing is now faster
Hardware Utilisation: 4096 cores
[Figure: runtime (s) vs. number of threads per MPI process (1–32) at 4096 cores; XE6 vector-based, task-based, and NZ-balanced task-based variants.]
- Vector-based approach bound by MPI communication
- Explicit thread-balancing improves memory bandwidth utilisation, but worsens latency effects!
Strong Scaling: Cray XE6
[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (32–8192) on the Cray XE6, comparing vector-based, task-based, NZ-balanced task-based, and pure-MPI variants.]
Strong Scaling: Cray XE6
[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (256–32768) on the Cray XE6, comparing vector-based, task-based, NZ-balanced task-based, and pure-MPI variants.]
Strong Scaling: BlueGene/Q
[Figure: runtime (s) and parallel efficiency (%) vs. number of cores (128–8192) on BlueGene/Q, comparing pure-MPI with the vector-based, task-based, and NZ-balanced task-based variants at SMT=4.]
Conclusion
OpenMP-threaded PETSc version
- Threaded vector and matrix operators
- Task-based sparse matrix multiplication
- Non-zero-based thread partitioning
Strong scaling optimisation
- Performance deficit on small numbers of nodes (latency-bound)
- Increased performance in the strong-scaling limit (bandwidth-bound)
- Marshalling load imbalance
  - Inter-process balance improved with fewer MPI ranks
  - Load imbalance among threads handled explicitly
Acknowledgements
Threaded PETSc version is available at:
- Open Petascale Libraries: http://www.openpetascale.org/
The work presented here was funded by:
- Fujitsu Laboratories of Europe Ltd.
- European Commission in FP7 as part of the APOS-EU project
Many thanks to:
- EPCC
- Hartree Centre
- PETSc development team
Download: http://www.openpetascale.org/index.php/public/page/download