
  • Porting VASP to GPU using OpenACC

    Martijn Marsman, Stefan Maintz, Andreas Hehn, Markus Wetzstein, and Georg Kresse

    SC19, Denver, 19th Nov. 2019

  • The Vienna Ab-initio Simulation Package: VASP

    Electronic structure from first principles:

    𝐻𝜓 = 𝐸𝜓

    • Approximations:

    • Density Functional Theory (DFT)

    • Hartree-Fock/DFT-HF hybrid functionals

    • Random-Phase Approximation (GW, ACFDT)

    • 3500+ licensed academic and industrial groups worldwide.

    • 10k+ publications in 2015 (Google Scholar), and rising.

    • Developed in the group of Prof. G. Kresse at the University of Vienna.

  • VASP: Computational Characteristics

    VASP does:

    • Lots of “smallish” FFTs (e.g. 100⨉100⨉100)

    • Matrix-matrix multiplication (DGEMM and ZGEMM)

    • Matrix diagonalization: 𝒪(𝑁³), where 𝑁 ≈ number of electrons

    • All-to-all communication

    Using:

    • FFTW 3 (or the FFTW wrappers to MKL's FFTs)

    • LAPACK/BLAS3 (MKL, OpenBLAS)

    • ScaLAPACK (or ELPA)

    • MPI (OpenMPI, impi, …) [+ OpenMP]

    VASP is pretty well characterized by the SPECfp2006 benchmark

  • VASP on GPU

    • VASP has organically grown over more than 25 years (450k+ lines of Fortran 77/90/2003/2008/… code)

    • Current release: some features were ported with CUDA C (DFT and hybrid functionals)

    • Upcoming VASP6 release: re-ported to GPU using OpenACC

    • The OpenACC port is already more complete than the CUDA port (Gamma-only version and support for reciprocal-space projectors)

  • Porting VASP to GPU using OpenACC

    • Compiler-directive based: single source, readability, maintainability, …

    • cuFFT, cuBLAS, cuSOLVER, CUDA-aware MPI, NCCL (see the interop sketch after this list)

    • Some dedicated kernel versions: e.g. batching FFTs, loop re-ordering

    • “Manual” deep copies of derived types (nested and/or with pointer members)

    • Multiple MPI ranks sharing a GPU (using MPS)

    • Combine OpenACC and OpenMP (OpenMP threads driving asynchronous execution queues)
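    As an illustration of the library interoperability listed above, here is a minimal sketch (our own example, not VASP code) of handing OpenACC-managed device arrays to cuBLAS through host_data. It assumes the cublas interface module shipped with the PGI/NVHPC compilers; array names are hypothetical.

    ! Minimal interop sketch (not VASP code): an OpenACC data region
    ! feeding a cuBLAS DGEMM. Assumes the PGI/NVHPC 'cublas' module.
    program acc_cublas_demo
      use cublas                          ! PGI/NVHPC cuBLAS interfaces
      implicit none
      integer, parameter :: n = 512
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 2.0d0; c = 0.0d0
      !$acc data copyin(a,b) copy(c)
      ! host_data exposes the device addresses of a, b, c, so the
      ! DGEMM operates entirely on GPU-resident data.
      !$acc host_data use_device(a,b,c)
      call cublasDgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
      !$acc end host_data
      !$acc end data
      print *, 'c(1,1) =', c(1,1)         ! expect 2*n = 1024
    end program acc_cublas_demo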

  • OpenACC directives

    Data directives are designed to be optional. OpenACC directives:

    • Manage data movement

    • Initiate parallel execution

    • Optimize loop mappings

    !$acc data copyin(a,b) copyout(c)
    ...
    !$acc parallel
    !$acc loop gang vector
    do i = 1, n
       c(i) = a(i) + b(i)
       ...
    enddo
    !$acc end parallel
    ...
    !$acc end data

  • Nested derived types: managing VASP aggregate data structures


    • OpenACC + Unified Memory is not an option today: some aggregates have static members

    • OpenACC 2.6 manual deep copy was key (see the sketch below)

    • Requires large numbers of directives in some cases, but these are well encapsulated (107 lines for COPYIN)

    • Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives

    • When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all

    [Diagram: the nested hierarchy of derived types to be deep-copied.
    Derived Type 1: 3 dynamic members, 1 of Derived Type 2.
    Derived Type 2: 21 dynamic members, 1 of Derived Type 3, 1 of Derived Type 4.
    Derived Type 3: only static members.
    Derived Type 4: 8 dynamic members, 4 of Derived Type 5, 2 of Derived Type 6.
    Derived Type 5: 3 dynamic members.
    Derived Type 6: 8 dynamic members.
    The per-type COPYIN directives add +12, +48, +26, +8, and +13 lines of code (107 lines in total).]
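    The sketch below illustrates the OpenACC 2.6 manual deep-copy pattern on a hypothetical two-level type (the type and member names are ours, not VASP's): the outer object is copied first, then every dynamic member is attached explicitly with its own directive, recursing into nested derived types.

    ! Manual deep copy in OpenACC 2.6: a minimal sketch on a
    ! hypothetical nested type (names are illustrative, not VASP's).
    module deepcopy_demo
      type grid_t
         real(8), allocatable :: weights(:)        ! dynamic member
      end type grid_t
      type wavefun_t
         integer :: nbands                         ! static member
         complex(8), allocatable :: coeffs(:,:)    ! dynamic member
         type(grid_t) :: grid                      ! nested derived type
      end type wavefun_t
    contains
      subroutine acc_copyin_wavefun(w)
        type(wavefun_t), intent(in) :: w
        ! Copy the outer shell first (this covers static members) ...
        !$acc enter data copyin(w)
        ! ... then each dynamic member, which the runtime attaches
        ! to the device copy of w ...
        !$acc enter data copyin(w%coeffs)
        ! ... and recurse into the members of nested derived types.
        !$acc enter data copyin(w%grid%weights)
      end subroutine acc_copyin_wavefun
    end module deepcopy_demo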

  • VASP on GPU benchmarks

    CuC_vdW

    • C@Cu surface (Ω ≅ 2800 Å³)

    • 96 Cu + 2 C atoms (1064 e−)

    • vdW-DFT

    • RMM-DIIS

    • OpenACC port outperforms the previous CUDA port …

    Speedup vs. CPU (CuC_vdW):

    |        | VASP 5 | VASP 6 | VASP 6+ |
    |--------|--------|--------|---------|
    | CPU    | 1.0    | 1.0    | 1.0     |
    | 1 V100 | 1.7    | 2.3    | 2.5     |
    | 2 V100 | 2.2    | 3.3    | 3.7     |
    | 4 V100 | 2.9    | 4.1    | 4.7     |
    | 8 V100 | 3.3    | 5.4    | 6.6     |

    • CPU: 2⨉ E5-2698 v4 @ 2.20 GHz (40 physical cores)

  • CUDA C vs. OpenACC port

    • Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:

    • Amdahl’s law for the non-GPU-accelerated parts of the code affects both implementations, but blurs differences

    • Using OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken with the CUDA version

    • The OpenACC version uses GPU-aware MPI to speed up the more communication-heavy parts, like orthonormalization

    • The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older

    Can we find a fairer comparison? Let’s look at the RMM-DIIS algorithm …

  • Iterative diagonalization: RMM-DIIS (EDDRMM)

    • The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions

    • The CUDA version uses kernel fusing; the OpenACC version uses two refactored kernels (see the sketch after this list)

    • minimal amount of MPI communication

    • The OpenACC version improves scaling with the number of GPUs
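    For flavor, a minimal sketch (ours, not the actual VASP kernels) of the kind of kernel fusion referred to above: two element-wise passes combined into one OpenACC kernel, so the data is traversed once per launch. The routine and array names are hypothetical.

    ! Illustrative only: two element-wise passes fused into a single
    ! OpenACC kernel. Names are hypothetical, not the VASP kernels.
    subroutine precondition_and_update(res, prec, wpsi, n)
      implicit none
      integer, intent(in) :: n
      complex(8), intent(inout) :: res(n), wpsi(n)
      real(8), intent(in) :: prec(n)
      integer :: i
      ! Assumes the arrays were placed on the device by an enclosing
      ! !$acc data region.
      !$acc parallel loop gang vector present(res, prec, wpsi)
      do i = 1, n
         res(i)  = prec(i) * res(i)     ! precondition the residual
         wpsi(i) = wpsi(i) + res(i)     ! update the trial state in the same pass
      end do
    end subroutine precondition_and_update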

    [Chart: EDDRMM section of silica_IFPEN on V100, speedup over CPU vs. number of V100 GPUs (1, 2, 4, 8), comparing VASP 5.4.4 and dev_OpenACC.]

    • EDDRMM takes 17% of total runtime

    • Benefits for the computation of expectation values are included

    • These high speedups are not the only factor in the overall improvement, but they are an important contribution

    • OpenACC improves scaling yet again

    • MPS always helps, but does not pay off in total time due to start-up overhead

    CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1. GPU runs: VASP 5.4.4 compiled with Intel 17.0.1; dev_OpenACC compiled with PGI 18.3 (CUDA 9.1).

  • Orthonormalization

    • GPU-aware MPI benefits from NVLink latency and bandwidth (see the sketch after the table below)

    • Data remains on the GPU, whereas the CUDA port streamed data for the GEMMs

    • Cholesky on the CPU saves a (smaller) memory transfer

    • 180 ms (40%) are saved by GPU-aware MPI alone

    • 33 ms (7.5%) by the other improvements

    Section-level comparison for orthonormalization:

    | Section                      | CUDA C port            | OpenACC port            |
    |------------------------------|------------------------|-------------------------|
    | Redistributing wavefunctions | Host-only MPI (185 ms) | GPU-aware MPI (110 ms)  |
    | Matrix-matrix muls           | Streamed data (19 ms)  | GPU-local data (15 ms)  |
    | Cholesky decomposition       | CPU-only (24 ms)       | cuSolver (12 ms)        |
    | Matrix-matrix muls           | Default scheme (30 ms) | Better blocking (13 ms) |
    | Redistributing wavefunctions | Host-only MPI (185 ms) | GPU-aware MPI (80 ms)   |

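    A minimal sketch of the GPU-aware MPI pattern behind the table above (our illustration; the routine and buffer names are hypothetical): host_data hands the device addresses straight to MPI, so a CUDA-aware MPI library can move the wavefunction blocks over NVLink without staging through the host.

    ! Illustrative sketch of GPU-aware MPI (not VASP code). Assumes a
    ! CUDA-aware MPI library and that both buffers already reside on
    ! the device (e.g. via an enclosing !$acc data region).
    subroutine redistribute_wavefunctions(wf_send, wf_recv, nloc, comm)
      use mpi
      implicit none
      integer, intent(in) :: nloc, comm        ! nloc: elements per rank
      complex(8), intent(in)    :: wf_send(:)  ! size nloc * #ranks
      complex(8), intent(inout) :: wf_recv(:)
      integer :: ierr
      ! host_data passes the *device* addresses to MPI_Alltoall, so
      ! the transfer bypasses host memory entirely.
      !$acc host_data use_device(wf_send, wf_recv)
      call MPI_Alltoall(wf_send, nloc, MPI_DOUBLE_COMPLEX, &
                        wf_recv, nloc, MPI_DOUBLE_COMPLEX, comm, ierr)
      !$acc end host_data
    end subroutine redistribute_wavefunctions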

  • VASP on GPU benchmarks

    Si256_VJT_HSE06

    • Vacancy in Si (Ω ≅ 5200 Å³)

    • 255 Si atoms (1020 e−)

    • DFT/HF-hybrid functional

    • Conjugate gradient

    • Batched FFTs (see the sketch after the table below)

    • Explicit overlay of computation and communication using non-blocking collectives (NCCL)

    • CPU: 2⨉ E5-2698 v4 @ 2.20 GHz (40 physical cores)

    Speedup vs. CPU (Si256_VJT_HSE06):

    |        | VASP 6 | VASP NV |
    |--------|--------|---------|
    | CPU    | 1.0    | 1.0     |
    | 1 V100 | 4.7    | 4.7     |
    | 2 V100 | 8.8    | 9.0     |
    | 4 V100 | 15.7   | 15.9    |
    | 8 V100 | 28.1   | 28.7    |
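    A minimal sketch of the FFT batching mentioned above (our illustration, not VASP code): many “smallish” 3-D transforms are planned once with cufftPlanMany and executed in a single call on device-resident data. It assumes the cufft interface module shipped with the PGI/NVHPC compilers.

    ! Illustrative sketch of batched FFTs (not VASP code). Assumes the
    ! PGI/NVHPC 'cufft' module.
    program batched_fft_demo
      use cufft
      implicit none
      integer, parameter :: nx=100, ny=100, nz=100, nbatch=8
      complex(8), allocatable :: psi(:,:)
      integer :: plan, ierr, dims(3)
      allocate(psi(nx*ny*nz, nbatch))
      psi = (1.0d0, 0.0d0)
      dims = [nz, ny, nx]                  ! slowest dimension first
      ! One plan covering all nbatch transforms, instead of nbatch
      ! separate small launches.
      ierr = cufftPlanMany(plan, 3, dims, dims, 1, nx*ny*nz, &
                           dims, 1, nx*ny*nz, CUFFT_Z2Z, nbatch)
      !$acc data copy(psi)
      ! host_data hands the device address of psi to cuFFT.
      !$acc host_data use_device(psi)
      ierr = cufftExecZ2Z(plan, psi, psi, CUFFT_FORWARD)
      !$acc end host_data
      !$acc end data
      ierr = cufftDestroy(plan)
    end program batched_fft_demo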

  • The OpenACC port: current limitations

    • Some bottlenecks must still be addressed: the computation of the local potential is still done CPU-side.

    • Not all features are ported yet: currently we are porting the linear-response solvers and cubic-scaling ACFDT (RPA total energies)

    • Some features of VASP, e.g. cubic-scaling RPA, are very (very) memory intensive and involve diagonalization of large complex matrices (> 100k ⨉ 100k): e.g. cusolverMgSyevd

    • PGI compilers only

  • New Release: VASP6

    • …

    • Cubic-scaling RPA (ACFDT, GW)

    • On-the-fly machine-learned force fields

    • Electron-Phonon coupling

    • MPI+OpenMP

    • OpenACC port

    • …

    • Caveat: the OpenACC port is still regarded as “experimental” at this stage

    • Actively gathering feedback (from HPC sites)

    • Intensive support effort

    https://www.vasp.at/wiki/index.php/Category:VASP6

  • THE END

    Special thanks to Stefan Maintz, Andreas Hehn, and Markus Wetzstein from NVIDIA and PGI!

    And to Ani Anciaux-Sedrakian and Thomas Guignon at IFPEN!

    And to you for listening!