
  • Porting VASP to GPU using OpenACC

    Martijn Marsman, Stefan Maintz, Andreas Hehn, Markus Wetzstein, and Georg Kresse

    SC19, Denver, 19th Nov. 2019

  • The Vienna Ab-initio Simulation Package: VASP

    Electronic structure from first principles:

    𝐻𝜓 = 𝐸𝜓

    • Approximations:

    • Density Functional Theory (DFT)

    • Hartree-Fock/DFT-HF hybrid functionals

    • Random-Phase Approximation (GW, ACFDT)

    • 3500+ licensed academic and industrial groups worldwide.

    • 10k+ publications in 2015 (Google Scholar), and rising.

    • Developed in the group of Prof. G. Kresse at the University of Vienna.

  • VASP: Computational Characteristics

    VASP does:

    • Lots of “smallish” FFTs (e.g. 100⨉100⨉100)

    • Matrix-matrix multiplication (DGEMM and ZGEMM)

    • Matrix diagonalization: 𝒪(𝑁³), where 𝑁 ≈ number of electrons

    • All-to-all communication

    Using:

    • FFTW 3 (or the FFTW wrappers to MKL's FFTs)

    • LAPACK/BLAS3 (MKL, OpenBLAS)

    • ScaLAPACK (or ELPA)

    • MPI (OpenMPI, impi, …) [+ OpenMP]

    VASP is pretty well characterized by the SPECfp2006 benchmark

  • VASP on GPU

    • VASP has organically grown over more than 25 years (450k+ lines of Fortran 77/90/2003/2008/… code)

    • Current release: some features were ported with CUDA C (DFT and hybrid functionals)

    • Upcoming VASP6 release: re-ported to GPU using OpenACC

    • The OpenACC port is already more complete than the CUDA port (Gamma-only version and support for reciprocal-space projectors)

  • Porting VASP to GPU using OpenACC

    • Compiler-directive based: single source, readability, maintainability, …

    • cuFFT, cuBLAS, cuSOLVER, CUDA-aware MPI, NCCL (see the interop sketch after this list)

    • Some dedicated kernel versions: e.g. batching FFTs, loop re-ordering

    • “Manual” deep copies of derived types (nested and/or with pointer members)

    • Multiple MPI ranks sharing a GPU (using MPS)

    • Combine OpenACC and OpenMP (OpenMP threads driving asynchronous execution queues)
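    As an illustration of the library interoperability listed above, here is a minimal sketch (our own example, not VASP code) of handing OpenACC-managed device arrays to cuBLAS through host_data. It assumes the cublas interface module shipped with the PGI/NVHPC compilers; array names are hypothetical.

    ! Minimal interop sketch (not VASP code): an OpenACC data region
    ! feeding a cuBLAS DGEMM. Assumes the PGI/NVHPC 'cublas' module.
    program acc_cublas_demo
      use cublas                          ! PGI/NVHPC cuBLAS interfaces
      implicit none
      integer, parameter :: n = 512
      real(8), allocatable :: a(:,:), b(:,:), c(:,:)
      allocate(a(n,n), b(n,n), c(n,n))
      a = 1.0d0; b = 2.0d0; c = 0.0d0
      !$acc data copyin(a,b) copy(c)
      ! host_data exposes the device addresses of a, b, c, so the
      ! DGEMM operates entirely on GPU-resident data.
      !$acc host_data use_device(a,b,c)
      call cublasDgemm('N','N',n,n,n,1.0d0,a,n,b,n,0.0d0,c,n)
      !$acc end host_data
      !$acc end data
      print *, 'c(1,1) =', c(1,1)         ! expect 2*n = 1024
    end program acc_cublas_demo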

  • OpenACC directives

    Data directives are designed to be optional. OpenACC directives:

    • Manage data movement

    • Initiate parallel execution

    • Optimize loop mappings

    !$acc data copyin(a,b) copyout(c)
    ...
    !$acc parallel
    !$acc loop gang vector
    do i = 1, n
       c(i) = a(i) + b(i)
       ...
    enddo
    !$acc end parallel
    ...
    !$acc end data

  • Nested derived types: managing VASP aggregate data structures


    • OpenACC + Unified Memory is not an option today: some aggregates have static members

    • OpenACC 2.6 manual deep copy was key (see the sketch below)

    • Requires large numbers of directives in some cases, but these are well encapsulated (107 lines for COPYIN)

    • Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives

    • When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all

    [Diagram: the nested hierarchy of derived types to be deep-copied.
    Derived Type 1: 3 dynamic members, 1 of Derived Type 2.
    Derived Type 2: 21 dynamic members, 1 of Derived Type 3, 1 of Derived Type 4.
    Derived Type 3: only static members.
    Derived Type 4: 8 dynamic members, 4 of Derived Type 5, 2 of Derived Type 6.
    Derived Type 5: 3 dynamic members.
    Derived Type 6: 8 dynamic members.
    The per-type COPYIN directives add +12, +48, +26, +8, and +13 lines of code (107 lines in total).]
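    The sketch below illustrates the OpenACC 2.6 manual deep-copy pattern on a hypothetical two-level type (the type and member names are ours, not VASP's): the outer object is copied first, then every dynamic member is attached explicitly with its own directive, recursing into nested derived types.

    ! Manual deep copy in OpenACC 2.6: a minimal sketch on a
    ! hypothetical nested type (names are illustrative, not VASP's).
    module deepcopy_demo
      type grid_t
         real(8), allocatable :: weights(:)        ! dynamic member
      end type grid_t
      type wavefun_t
         integer :: nbands                         ! static member
         complex(8), allocatable :: coeffs(:,:)    ! dynamic member
         type(grid_t) :: grid                      ! nested derived type
      end type wavefun_t
    contains
      subroutine acc_copyin_wavefun(w)
        type(wavefun_t), intent(in) :: w
        ! Copy the outer shell first (this covers static members) ...
        !$acc enter data copyin(w)
        ! ... then each dynamic member, which the runtime attaches
        ! to the device copy of w ...
        !$acc enter data copyin(w%coeffs)
        ! ... and recurse into the members of nested derived types.
        !$acc enter data copyin(w%grid%weights)
      end subroutine acc_copyin_wavefun
    end module deepcopy_demo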

  • VASP on GPU benchmarks

    CuC_vdW

    • C@Cu surface (Ω ≅ 2800 Å³)

    • 96 Cu + 2 C atoms (1064 e−)

    • vdW-DFT

    • RMM-DIIS

    • OpenACC port outperforms the previous CUDA port …

    Speedup vs. CPU (CuC_vdW):

    |        | VASP 5 | VASP 6 | VASP 6+ |
    |--------|--------|--------|---------|
    | CPU    | 1.0    | 1.0    | 1.0     |
    | 1 V100 | 1.7    | 2.3    | 2.5     |
    | 2 V100 | 2.2    | 3.3    | 3.7     |
    | 4 V100 | 2.9    | 4.1    | 4.7     |
    | 8 V100 | 3.3    | 5.4    | 6.6     |

    • CPU: 2⨉ E5-2698 v4 @ 2.20 GHz (40 physical cores)

  • CUDA C vs. OpenACC port

    • Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:

    • Amdahl’s law for the non-GPU-accelerated parts of the code affects both implementations, but blurs differences

    • Using OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken with the CUDA version

    • The OpenACC version uses GPU-aware MPI to speed up the more communication-heavy parts, like orthonormalization

    • The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older

    Can we find a fairer comparison? Let’s look at the RMM-DIIS algorithm …

  • Iterative diagonalization: RMM-DIIS (EDDRMM)

    • The EDDRMM part has comparable GPU coverage in the CUDA and OpenACC versions

    • The CUDA version uses kernel fusing; the OpenACC version uses two refactored kernels (see the sketch after this list)

    • minimal amount of MPI communication

    • The OpenACC version improves scaling with the number of GPUs
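    For flavor, a minimal sketch (ours, not the actual VASP kernels) of the kind of kernel fusion referred to above: two element-wise passes combined into one OpenACC kernel, so the data is traversed once per launch. The routine and array names are hypothetical.

    ! Illustrative only: two element-wise passes fused into a single
    ! OpenACC kernel. Names are hypothetical, not the VASP kernels.
    subroutine precondition_and_update(res, prec, wpsi, n)
      implicit none
      integer, intent(in) :: n
      complex(8), intent(inout) :: res(n), wpsi(n)
      real(8), intent(in) :: prec(n)
      integer :: i
      ! Assumes the arrays were placed on the device by an enclosing
      ! !$acc data region.
      !$acc parallel loop gang vector present(res, prec, wpsi)
      do i = 1, n
         res(i)  = prec(i) * res(i)     ! precondition the residual
         wpsi(i) = wpsi(i) + res(i)     ! update the trial state in the same pass
      end do
    end subroutine precondition_and_update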

    [Chart: EDDRMM section of silica_IFPEN on V100, speedup over CPU vs. number of V100 GPUs (1, 2, 4, 8), comparing VASP 5.4.4 and dev_OpenACC.]

    • EDDRMM takes 17% of total runtime

    • Benefits for the computation of expectation values are included

    • These high speedups are not the only factor in the overall improvement, but they are an important contribution

    • OpenACC improves scaling yet again

    • MPS always helps, but does not pay off in total time due to start-up overhead

    CPU: dual-socket Broadwell E5-2698 v4, compiler Intel 17.0.1. GPU runs: VASP 5.4.4 compiled with Intel 17.0.1; dev_OpenACC compiled with PGI 18.3 (CUDA 9.1).

  • Orthonormalization

    • GPU-aware MPI benefits from NVLink latency and bandwidth (see the sketch after the table below)

    • Data remains on the GPU, whereas the CUDA port streamed data for the GEMMs

    • Cholesky on the CPU saves a (smaller) memory transfer

    • 180 ms (40%) are saved by GPU-aware MPI alone

    • 33 ms (7.5%) by the other improvements

    Section-level comparison for orthonormalization:

    | Section                      | CUDA C port            | OpenACC port            |
    |------------------------------|------------------------|-------------------------|
    | Redistributing wavefunctions | Host-only MPI (185 ms) | GPU-aware MPI (110 ms)  |
    | Matrix-matrix muls           | Streamed data (19 ms)  | GPU-local data (15 ms)  |
    | Cholesky decomposition       | CPU-only (24 ms)       | cuSolver (12 ms)        |
    | Matrix-matrix muls           | Default scheme (30 ms) | Better blocking (13 ms) |
    | Redistributing wavefunctions | Host-only MPI (185 ms) | GPU-aware MPI (80 ms)   |

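    A minimal sketch of the GPU-aware MPI pattern behind the table above (our illustration; the routine and buffer names are hypothetical): host_data hands the device addresses straight to MPI, so a CUDA-aware MPI library can move the wavefunction blocks over NVLink without staging through the host.

    ! Illustrative sketch of GPU-aware MPI (not VASP code). Assumes a
    ! CUDA-aware MPI library and that both buffers already reside on
    ! the device (e.g. via an enclosing !$acc data region).
    subroutine redistribute_wavefunctions(wf_send, wf_recv, nloc, comm)
      use mpi
      implicit none
      integer, intent(in) :: nloc, comm        ! nloc: elements per rank
      complex(8), intent(in)    :: wf_send(:)  ! size nloc * #ranks
      complex(8), intent(inout) :: wf_recv(:)
      integer :: ierr
      ! host_data passes the *device* addresses to MPI_Alltoall, so
      ! the transfer bypasses host memory entirely.
      !$acc host_data use_device(wf_send, wf_recv)
      call MPI_Alltoall(wf_send, nloc, MPI_DOUBLE_COMPLEX, &
                        wf_recv, nloc, MPI_DOUBLE_COMPLEX, comm, ierr)
      !$acc end host_data
    end subroutine redistribute_wavefunctions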

  • VASP on GPU benchmarks

    Si256_VJT_HSE06

    • Vacancy in Si (Ω ≅ 5200 Å³)

    • 255 Si atoms (1020 e−)

    • DFT/HF-hybrid functional

    • Conjugate gradient

    • Batched FFTs (see the sketch after the table below)

    • Explicit overlay of computation and communication using non-blocking collectives (NCCL)

    • CPU: 2⨉ E5-2698 v4 @ 2.20 GHz (40 physical cores)

    Speedup vs. CPU (Si256_VJT_HSE06):

    |        | VASP 6 | VASP NV |
    |--------|--------|---------|
    | CPU    | 1.0    | 1.0     |
    | 1 V100 | 4.7    | 4.7     |
    | 2 V100 | 8.8    | 9.0     |
    | 4 V100 | 15.7   | 15.9    |
    | 8 V100 | 28.1   | 28.7    |
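    A minimal sketch of the FFT batching mentioned above (our illustration, not VASP code): many “smallish” 3-D transforms are planned once with cufftPlanMany and executed in a single call on device-resident data. It assumes the cufft interface module shipped with the PGI/NVHPC compilers.

    ! Illustrative sketch of batched FFTs (not VASP code). Assumes the
    ! PGI/NVHPC 'cufft' module.
    program batched_fft_demo
      use cufft
      implicit none
      integer, parameter :: nx=100, ny=100, nz=100, nbatch=8
      complex(8), allocatable :: psi(:,:)
      integer :: plan, ierr, dims(3)
      allocate(psi(nx*ny*nz, nbatch))
      psi = (1.0d0, 0.0d0)
      dims = [nz, ny, nx]                  ! slowest dimension first
      ! One plan covering all nbatch transforms, instead of nbatch
      ! separate small launches.
      ierr = cufftPlanMany(plan, 3, dims, dims, 1, nx*ny*nz, &
                           dims, 1, nx*ny*nz, CUFFT_Z2Z, nbatch)
      !$acc data copy(psi)
      ! host_data hands the device address of psi to cuFFT.
      !$acc host_data use_device(psi)
      ierr = cufftExecZ2Z(plan, psi, psi, CUFFT_FORWARD)
      !$acc end host_data
      !$acc end data
      ierr = cufftDestroy(plan)
    end program batched_fft_demo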

  • The OpenACC port: current limitations

    • Some bottlenecks must still be addressed: the computation of the local potential is still done CPU-side.

    • Not all features are ported yet: currently we are porting the linear-response solvers and cubic-scaling ACFDT (RPA total energies)

    • Some features of VASP, e.g. cubic-scaling RPA, are very (very) memory intensive and involve diagonalization of large complex matrices (> 100k ⨉ 100k): e.g. cusolverMgSyevd

    • PGI compilers only

  • New Release: VASP6

    • …

    • Cubic-scaling RPA (ACFDT, GW)

    • On-the-fly machine-learned force fields

    • Electron-Phonon coupling

    • MPI+OpenMP

    • OpenACC port

    • …

    • Caveat: the OpenACC port is still regarded as “experimental” at this stage

    • Actively gathering feedback (from HPC sites)

    • Intensive support effort

    https://www.vasp.at/wiki/index.php/Category:VASP6

  • THE END

    Special thanks to Stefan Maintz, Andreas Hehn, and Markus Wetzstein from NVIDIA and PGI!

    And to Ani Anciaux-Sedrakian and Thomas Guignon at IFPEN!

    And to you for listening!