Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation


  • Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

    Michael Lange¹, Gerard Gorman¹, Michele Weiland², Lawrence Mitchell², Xiaohu Guo³, James Southern⁴

    ¹AMCG, Imperial College London   ²EPCC, University of Edinburgh
    ³STFC, Daresbury Laboratory   ⁴Fujitsu Laboratories of Europe Ltd.

    9 July 2013


  • Motivation

    Fluidity

    - Unstructured finite element code
    - Anisotropic mesh adaptivity
    - Applications: CFD, geophysical flows, ocean modelling, reservoir modelling, mining, nuclear safety, renewable energies, etc.

    PETSc

    - Linear solver engine
    - Hybrid MPI/OpenMP version


  • Programming for Exascale

    Three levels of parallelism in modern HPC architectures [1]:

    - Between nodes: message passing via MPI
    - Between cores: shared-memory communication
    - Within core: SIMD

    Hybrid MPI/OpenMP parallelism (a minimal start-up sketch follows the reference below):

    - Memory argument
      - MPI memory footprint not scalable
      - Replication of halo data
    - Speed argument
      - Message-passing overhead
      - Improved load balance with fewer MPI ranks

    [1] A. D. Robison and R. E. Johnson. Three layer cake for shared-memory programming. In Proceedings of the 2010 Workshop on Parallel Programming Patterns (ParaPLoP '10), pages 5:1–5:8. ACM, New York, NY, USA, 2010.
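    As a concrete illustration of the first two levels (the SIMD level is left to the compiler), below is a minimal hybrid MPI/OpenMP start-up in C. This is a generic sketch, not code from the talk; MPI_THREAD_FUNNELED is assumed to be sufficient because only one thread per rank makes MPI calls.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank, nranks;

        /* Ask for a threading level; FUNNELED suffices if only one
           (dedicated) thread per rank makes MPI calls. */
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Shared-memory parallelism inside each rank. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("rank %d of %d runs %d threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }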


  • PETSc Overview

    Matrix and Vector classes are used in all other components

    Added OpenMP threading to low-level implementations:

    - Vector operations
    - CSR matrices
    - Block-CSR matrices
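    To make the threading concrete, here is a minimal sketch of an OpenMP-threaded CSR sparse matrix-vector product of the kind these low-level kernels perform. It assumes the standard CSR arrays (row pointers ia, column indices ja, values a) and is illustrative, not PETSc's actual implementation; note that the static schedule splits rows, not non-zeros, evenly across threads.

    /* y = A*x for an m-row CSR matrix (ia: row pointers, ja: column
       indices, a: values). Illustrative kernel, not PETSc source. */
    void csr_spmv_omp(int m, const int *ia, const int *ja,
                      const double *a, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < m; i++) {
            double sum = 0.0;
            for (int k = ia[i]; k < ia[i + 1]; k++)
                sum += a[k] * x[ja[k]];
            y[i] = sum;
        }
    }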


  • Sparse Matrix-Vector Multiplication

    Matrix-multiply is the most expensive component of the solve

    [Figure: matrix rows and vector entries partitioned across processes P1–P8]

    Parallel Matrix-Multiply:

    - Multiply diagonal submatrix
    - Scatter/gather remote vector elements
    - Multiply-add off-diagonal submatrices
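    This structure can be written schematically as below. The type and helper names (csr_t, halo_begin, halo_end) are illustrative stand-ins, not the PETSc API: the halo functions represent the VecScatter-style gather of remote vector elements, and the two csr_spmv calls correspond to the diagonal and off-diagonal submatrix multiplies.

    /* Distributed SpMV: the local matrix is split into a "diagonal" block
       (columns owned by this rank) and an "off-diagonal" block (columns
       owned by other ranks), so communication can overlap computation. */
    typedef struct { int m; const int *ia, *ja; const double *a; } csr_t;

    static void csr_spmv(const csr_t *A, const double *x, double *y, int add)
    {
        for (int i = 0; i < A->m; i++) {
            double sum = add ? y[i] : 0.0;
            for (int k = A->ia[i]; k < A->ia[i + 1]; k++)
                sum += A->a[k] * x[A->ja[k]];
            y[i] = sum;
        }
    }

    /* Placeholders for the scatter/gather of remote vector elements. */
    static void halo_begin(const double *x, double *x_halo)
    { (void)x; (void)x_halo; /* post MPI_Isend/MPI_Irecv for ghost entries */ }
    static void halo_end(double *x_halo)
    { (void)x_halo;          /* MPI_Waitall and unpack into the halo buffer */ }

    void dist_spmv(const csr_t *Adiag, const csr_t *Aoff,
                   const double *x_local, double *x_halo, double *y)
    {
        halo_begin(x_local, x_halo);     /* start gathering remote elements  */
        csr_spmv(Adiag, x_local, y, 0);  /* multiply diagonal submatrix      */
        halo_end(x_halo);                /* wait for remote elements         */
        csr_spmv(Aoff, x_halo, y, 1);    /* multiply-add off-diagonal blocks */
    }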


  • Sparse Matrix-Vector Multiplication

    Input vector elements require MPI communication

    - Hide MPI latency by overlapping with local computation
    - Not all MPI implementations work asynchronously [2]

    Task-based Matrix-Multiply

    - Dedicated thread for MPI communication
      - Advances communication protocol
      - Copy data to/from buffer
    - In contrast to Vector-based threading:
      - Lift parallel section to include scatter/gather operation
      - Cannot use parallel for pragma
      - N − 1 threads enough to saturate memory bandwidth

    (a sketch of this scheme follows the reference below)

    [2] G. Schubert, H. Fehske, G. Hager, and G. Wellein. Hybrid-parallel sparse matrix-vector multiplication with explicit communication overlap on current multicore-based systems. Parallel Processing Letters, 21(3):339–358, 2011.
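    The sketch below reuses the illustrative csr_t, halo_begin and halo_end from the previous sketch (again, not the PETSc API). One thread drives communication while the remaining N − 1 threads multiply the diagonal block over explicit row ranges, which is why a plain parallel for cannot be used here; at least two threads are assumed.

    #include <omp.h>

    void dist_spmv_task(const csr_t *Adiag, const csr_t *Aoff,
                        const double *x_local, double *x_halo, double *y)
    {
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            int nth = omp_get_num_threads();   /* assumed >= 2 */

            if (tid == 0) {
                /* Dedicated communication thread: advance the MPI protocol
                   and copy remote vector elements into the halo buffer. */
                halo_begin(x_local, x_halo);
                halo_end(x_halo);
            } else {
                /* The other N-1 threads multiply their slice of the
                   diagonal block (explicit row range, no parallel for). */
                int lo = (int)((long long)Adiag->m * (tid - 1) / (nth - 1));
                int hi = (int)((long long)Adiag->m *  tid      / (nth - 1));
                for (int i = lo; i < hi; i++) {
                    double sum = 0.0;
                    for (int k = Adiag->ia[i]; k < Adiag->ia[i + 1]; k++)
                        sum += Adiag->a[k] * x_local[Adiag->ja[k]];
                    y[i] = sum;
                }
            }

            /* Halo data and diagonal part must both be complete. */
            #pragma omp barrier

            /* Off-diagonal multiply-add, now over all N threads. */
            #pragma omp for schedule(static)
            for (int i = 0; i < Aoff->m; i++) {
                double sum = 0.0;
                for (int k = Aoff->ia[i]; k < Aoff->ia[i + 1]; k++)
                    sum += Aoff->a[k] * x_halo[Aoff->ja[k]];
                y[i] += sum;
            }
        }
    }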


  • Sparse Matrix-Vector Multiplication

    Thread-level Load Balance

    - Matrix rows partitioned in blocks
    - Create partitioning based on non-zero elements per row [3]
    - Cache partitioning with matrix object

    Explicit thread-balancing scheme (a sketch follows the reference below):

    - Initial greedy allocation
    - Local diffusion algorithm

    [3] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. Parallel Computing, 35(3):178–194, 2009.
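    A minimal sketch of the non-zero-based partitioning idea: since the CSR row pointer is already a prefix sum of non-zeros per row, a single greedy sweep can place the per-thread block boundaries so each thread owns roughly nnz/nthreads non-zeros. This stands in for the initial greedy allocation only; the local diffusion refinement is not shown, and the names are illustrative rather than PETSc's.

    /* Compute row block boundaries so thread t owns rows
       [start[t], start[t+1]) with roughly equal non-zero counts.
       ia is the CSR row pointer (length m+1); start has nthreads+1 entries.
       The result can be cached with the matrix and reused each multiply. */
    void nnz_partition(int m, const int *ia, int nthreads, int *start)
    {
        long long nnz = ia[m];
        int t = 0;
        start[0] = 0;
        for (int i = 0; i < m && t + 1 < nthreads; i++) {
            /* Close block t once it has reached its share of non-zeros. */
            if (ia[i + 1] >= nnz * (t + 1) / nthreads)
                start[++t] = i + 1;
        }
        while (++t <= nthreads)      /* fill any remaining boundaries */
            start[t] = m;
    }

    Each thread then runs the CSR kernel over its rows [start[t], start[t+1]) instead of an even split by row count.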


  • Benchmark

    Global baroclinic ocean simulation [4]

    - Mesh based on extruded bathymetry data
    - Pressure matrix:
      - 371,102,769 non-zero elements
      - 13,491,933 degrees of freedom
    - Solver options (see the configuration sketch after the reference below):
      - Conjugate Gradient method
      - Jacobi preconditioner
      - 10,000 iterations

    [4] M. D. Piggott, G. J. Gorman, C. C. Pain, P. A. Allison, A. S. Candy, B. T. Martin, and M. R. Wells. A new computational framework for multi-scale ocean modelling based on adapting unstructured meshes. International Journal for Numerical Methods in Fluids, 56(8):1003–1015, 2008.
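    For reference, the solver configuration listed above could be requested through the PETSc C API as sketched below; the slides do not show how it was actually set, and the same effect is commonly achieved with the run-time options -ksp_type cg -pc_type jacobi -ksp_max_it 10000. The ksp object is assumed to be already attached to the pressure matrix.

    #include <petscksp.h>

    PetscErrorCode configure_pressure_solver(KSP ksp)
    {
        PC             pc;
        PetscErrorCode ierr;

        ierr = KSPSetType(ksp, KSPCG);CHKERRQ(ierr);     /* Conjugate Gradient method */
        ierr = KSPGetPC(ksp, &pc);CHKERRQ(ierr);
        ierr = PCSetType(pc, PCJACOBI);CHKERRQ(ierr);    /* Jacobi preconditioner     */
        ierr = KSPSetTolerances(ksp, PETSC_DEFAULT, PETSC_DEFAULT,
                                PETSC_DEFAULT, 10000);CHKERRQ(ierr); /* up to 10,000 iterations */
        ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);     /* allow run-time overrides  */
        return 0;
    }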


  • Architecture Overview

    Cray XE6 (HECToR)

    - NUMA architecture
    - 32 cores per node
    - 4 NUMA domains, 8 cores each

    Fujitsu PRIMEHPC FX10

    - UMA architecture
    - 16 cores per node

    IBM BlueGene/Q

    - UMA architecture
    - 16 cores per node
    - 4-way hardware threading (SMT)


  • Hardware Utilisation: 128 cores

    [Figure: runtime (s) vs. number of threads per MPI process (1–32) at 128 cores; series: XE6 and FX10, each with Vector, Task, and Task NZ-balanced variants]

    - On the XE6, slowdown with multiple NUMA domains
    - Performance bound by memory latency


  • Hardware Utilisation: 1024 cores

    [Figure: runtime (s) vs. number of threads per MPI process (1–32) at 1024 cores; series: XE6 and FX10, each with Vector, Task, and Task NZ-balanced variants]

    - Both task-based algorithms improve
    - NZ-based load balancing now faster


  • Hardware Utilisation: 4096 cores

    [Figure: runtime (s) vs. number of threads per MPI process (1–32) at 4096 cores; series: XE6 with Vector, Task, and Task NZ-balanced variants]

    - Vector-based approach bound by MPI communication
    - Explicit thread-balancing improves memory bandwidth utilisation, but worsens latency effects


  • Strong Scaling: Cray XE6

    [Figure: runtime (s) and parallel efficiency (%) vs. number of cores (32–8192) on the Cray XE6; series: Vector-based, Task-based, Task-based NZ-balanced, Pure-MPI]


  • Strong Scaling: Cray XE6

    [Figure: runtime (s) and parallel efficiency (%) vs. number of cores (256–32768) on the Cray XE6; series: Vector-based, Task-based, Task-based NZ-balanced, Pure-MPI]


  • Strong Scaling: BlueGene/Q

    [Figure: runtime (s) and parallel efficiency (%) vs. number of cores (128–8192) on BlueGene/Q; series: Pure-MPI, and SMT=4 runs of Vector-based, Task-based, and Task-based NZ-balanced]


  • Conclusion

    OpenMP-threaded PETSc version

    - Threaded vector and matrix operators
    - Task-based sparse matrix multiplication
    - Non-zero-based thread partitioning

    Strong scaling optimisation

    - Performance deficit on small numbers of nodes (latency-bound)
    - Increased performance in the strong-scaling limit (bandwidth-bound)
    - Marshalling load imbalance:
      - Inter-process balance improved with fewer MPI ranks
      - Load imbalance among threads handled explicitly


  • Acknowledgements

    Threaded PETSc version is available at:

    - Open Petascale Libraries: http://www.openpetascale.org/

    The work presented here was funded by:

    - Fujitsu Laboratories of Europe Ltd.
    - European Commission in FP7 as part of the APOS-EU project

    Many thanks to:

    - EPCC
    - Hartree Centre
    - PETSc development team


    Download: http://www.openpetascale.org/index.php/public/page/download