Porting Computational Physics Applications to the Titan Supercomputer with OpenACC and OpenMP
PORTING VASP TO GPUS WITH OPENACC
Transcript of PORTING VASP TO GPUS WITH OPENACC
2
AGENDA
Introduction to VASP
Status of the CUDA port
Prioritizing Use Cases for GPU Acceleration
OpenACC in VASP
Comparative Benchmarking
3
INTRODUCTION TO VASP
A leading electronic structure program for solids, surfaces and interfaces
Used to study chemical and physical properties, reaction paths, etc.
Atomic-scale materials modeling from first principles, from 1 to 1000s of atoms
Liquids, crystals, magnetism, semiconductors/insulators, surfaces, catalysts
Solves many-body Schrödinger equation
Scientific Background
4
INTRODUCTION TO VASP
Density Functional Theory (DFT)
Enables solving sets of Kohn-Sham equations
In a plane-wave based framework (PAW)
Hybrid DFT adding (parts of) exact exchange (Hartree-Fock) and even beyond!
Quantum-mechanical methods
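For background, the Kohn-Sham equations referred to above have the standard single-particle form (textbook notation, not VASP-specific; shown here only to fix ideas):

\left[ -\frac{\hbar^2}{2m}\nabla^2 + v_{\mathrm{ext}}(\mathbf{r}) + v_{\mathrm{H}}(\mathbf{r}) + v_{\mathrm{xc}}(\mathbf{r}) \right] \psi_i(\mathbf{r}) = \varepsilon_i \, \psi_i(\mathbf{r})

In the plane-wave framework the orbitals \psi_i are expanded in plane waves, and hybrid DFT replaces part of v_{\mathrm{xc}} with exact (Hartree-Fock) exchange.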
5
INTRODUCTION TO VASP
12-25% of CPU cycles at supercomputing centers
Application domains: Material Sciences, Chemical Engineering, Physics & Physical Chemistry
Users in academia and in companies: large semiconductor companies; oil & gas; chemicals (bulk or fine); materials (glass, rubber, ceramic, alloys, polymers and metals)
Top 5 HPC Applications (Source: Intersect360 2017 Site Census Mentions)
Rank | Application
1    | GROMACS
2    | ANSYS Fluent
3    | Gaussian
4    | VASP
5    | NAMD
[Pie charts: share of CPU cycles by code at CSC, Finland (2012) and at Archer, UK (03/2018); codes shown: VASP, cp2k, CASTEP, Gromacs, Other; values shown: 18.5%, 5.6%, 5.4%, 4.9%.]
Source: http://www.archer.ac.uk/status/codes/, 03/21/2018
6
INTRODUCTION TO VASP
Developed by G. Kresse’s group at University of Vienna (and external contributors)
Under development/refactoring for about 25 years
460K lines of Fortran 90, some FORTRAN 77
MPI parallel, OpenMP recently added for multicore
First endeavors on GPU acceleration date back to before 2011, using CUDA C
Details on the code
7
AGENDA
Introduction to VASP
Status of the CUDA port
Prioritizing Use Cases for GPU Acceleration
OpenACC in VASP
Comparative Benchmarking
8
COLLABORATION ON CUDA C PORT OF VASP
Collaborators: U of Chicago
CUDA Port Project Scope:
Minimization algorithms to calculate the electronic ground state: blocked Davidson (ALGO = Normal & Fast) and RMM-DIIS (ALGO = VeryFast & Fast)
Parallelization over k-points
Exact-exchange calculations
Earlier Work:
Speeding up plane-wave electronic-structure calculations using graphics-processing units; Maintz, Eck, Dronskowski
VASP on a GPU: Application to exact-exchange calculations of the stability of elemental boron; Hutchinson, Widom
Accelerating VASP Electronic Structure Calculations Using Graphic Processing Units; Hacene, Anciaux-Sedrakian, Rozanska, Klahr, Guignon, Fleurat-Lessard
9
GPU READY APPLICATIONS PROGRAM
Instructions to compile and run VASP on GPUs
NVIDIA offers step-by-step instructions to compile and run various important applications:
https://www.nvidia.com/en-us/data-center/gpu-accelerated-applications/
Exemplary benchmarks to test against expected performance
Hints on tuning important run-time parameters
10
CUDA SOURCE INTEGRATION IN VASP 5.4.4
Original source tree (Fortran): davidson.F, …
Accelerated call tree, drop-in replacements (Fortran): davidson_gpu.F, …
Custom kernels and support code (CUDA C): davidson.cu, cuda_helpers.h, cuda_helpers.cu, …
A makefile switch selects between the original and the accelerated source tree.
11
CUDA C ACCELERATED VERSION OF VASP
• All GPU acceleration with CUDA C
• Only some cases are ported to GPUs
• Different source trees for Fortran vs CUDA C
• CPU code gets continuously updated and enhanced, required for various platforms
• Challenge to keep CUDA C sources up-to-date
• Long development cycles to port new features
Available today on NVIDIA Tesla GPUs with VASP 5.4.4
[Diagram: the upper levels of the code call into either the CPU call tree (routine A, routine B) or a duplicated GPU call tree (routine A, routine B).]
12
CUDA C ACCELERATED VERSION OF VASP
Source code duplication for CUDA C in VASP led to:
• increased maintenance cost
• improvements in CPU code need replication
• long development cycles to port new solvers
Available today on NVIDIA Tesla GPUs with VASP 5.4.4
[Diagram: duplicated CPU and GPU call trees below the upper levels (routine A, routine B in each).]
Explore OpenACC as an improvement for GPU acceleration
13
AGENDA
Introduction to VASP
Status of the CUDA port
Prioritizing Use Cases for GPU Acceleration
OpenACC in VASP
Comparative Benchmarking
14
CATEGORIES FOR METHODOLOGICAL OPTIONS
Levels of theory: Standard DFT, Hybrid DFT (exact exchange), RPA (ACFDT, GW), Bethe-Salpeter Equations (BSE), …
Solvers / main algorithm: Davidson, RMM-DIIS, Davidson+RMM-DIIS, Damped, …
Projection scheme: Real space, Real space (automatic optimization), Reciprocal space
Executable flavors: Standard variant, Gamma-point simplification variant, Non-collinear spin variant
15
EXAMPLE BENCHMARK: SILICA_IFPEN
Options exercised by this benchmark: Standard DFT (level of theory), Davidson and RMM-DIIS (solvers), Real space (projection scheme), standard (executable flavor), KPAR/NSIM/NCORE (parallelization options)
Other options shown for contrast: Gamma-point and Non-collinear (executable flavors), Reciprocal and Automatic (projection schemes), Davidson+RMM-DIIS and Damped (solvers), Hybrid DFT, RPA and BSE (levels of theory)
16
PARALLELIZATION LAYERS IN VASP
[Diagram: the wavefunction Ψ is split along several physical quantities, each mapped to a parallelization feature:
Spins (↑, ↓)
k-points (k1, k2, k3, …): distributed with KPAR > 1
Bands/orbitals (n1, n2, …): distributed by default
Plane-wave coefficients (C1, C2, C3, …): distributed with NCORE > 1]
17
PARALLELIZATION OPTIONS

KPAR
Distributes k-points
Highest level of parallelism, more or less embarrassingly parallel
Can help for smaller systems
Not always possible

NSIM
Blocking of orbitals
Grouping (not distributing) that influences caching and communication
Ideal value differs between CPU and GPU
Needs to be tuned

NCORE
Distributes plane waves
Lowest level of parallelism, needs parallel 3D FFT, generates many MPI messages
Can help with load-balancing problems
Not supported in the CUDA port
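As a purely illustrative example, these options are set in the INCAR file; the values below are placeholders that need tuning per system and per machine (they are not recommendations from this talk):

KPAR  = 2    ! split the k-points over 2 groups of ranks
NSIM  = 4    ! treat 4 orbitals as one block in RMM-DIIS/Davidson
NCORE = 8    ! 8 ranks share the plane-wave coefficients of one orbital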
18
POSSIBLE USE CASES IN VASP
VASP supports a plethora of run-time options that define the workload (use case)
These methodological options can be grouped into categories
Some, but not all, are combinable
The combination determines whether GPU acceleration is supported, and how well it performs
Benchmarking the complete space of combinations is tremendously complex
Each with a different computational profile
19
WHERE TO START
Ideally every use case would be ported
Standard and Hybrid DFT alone give 72 use cases (ignoring parallelization options)!
Need to select most important use-cases
Selection should be based on real-world or supercomputing-facility scenarios
You cannot accelerate everything (at least not right away)
20
STATISTICS ON VASP USE CASES
Zhengji Zhao (NERSC) collected such data (INCAR) for 30397 VASP jobs over nearly 2 months
Data is based on job count, but has no timing information
Includes 130 unique users on Edison (CPU-only system)
No 1:1-mapping of parameters possible, expect large error margins
Data does not include calculation sizes, but it’s a great start
NERSC job submission data 2014
21
EMPLOYED VASP FEATURES AT NERSC
Levels of theory and main algorithms w.r.t. job count
[Charts: distribution of jobs by main algorithm (Davidson, Dav+RMM, RMM-DIIS, Damped, Exact, RPA, Conjugate, BSE, EIGENVAL) and by level of theory (standard DFT, hybrid DFT, RPA, BSE); highlighted shares of 51% and 2% appear in the charts.]
Source: based on data provided by Zhengji Zhao, NERSC, 2014
22
SUMMARY
Start with standard DFT, to accelerate most jobs
RMM-DIIS and Davidson nearly equally important, share a lot of routines anyway
Real-space projection more important for large setups
The Gamma-point executable flavor is as important as the standard one, so start with the more general (standard) one
Support as many parallelization options as possible (KPAR, NSIM, NCORE)
Communication is important, but scaling to large node counts is low priority (62% of jobs fit into 4 nodes, 95% used ≤ 12 nodes)
Where to start
23
VASP OPENACC PORTING PROJECT
• Can we get a working version with today's compilers, tools, and hardware?
• Decision to focus on one algorithm: RMM-DIIS
• Guidelines:
• work from the existing CPU code
• be minimally invasive to the CPU code
• Goals:
• allow for a performance comparison to the CUDA port
• assess maintainability and the threshold for future porting efforts
feasibility study
24
AGENDA
Introduction to VASP
Status of the CUDA port
Prioritizing Use Cases for GPU Acceleration
OpenACC in VASP
Comparative Benchmarking
25
OPENACC DIRECTIVES
Data directives are designed to be optional
Manage data movement / Initiate parallel execution / Optimize loop mappings

!$acc data copyin(a,b) copyout(c)
...
!$acc parallel
!$acc loop gang vector
do i = 1, n
   c(i) = a(i) + b(i)
   ...
enddo
!$acc end parallel
...
!$acc end data
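To make the fragment above concrete, a minimal, self-contained version could look as follows (a sketch; array names and sizes are invented for the example, and the compile line assumes the PGI compiler used later in these slides):

program vecadd
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: a(:), b(:), c(:)
  integer :: i
  allocate(a(n), b(n), c(n))
  a = 1.0
  b = 2.0
  ! Manage data movement: a and b are copied to the device, c is copied back.
  !$acc data copyin(a,b) copyout(c)
  ! Initiate parallel execution on the device.
  !$acc parallel
  ! Optimize the loop mapping onto gangs and vector lanes.
  !$acc loop gang vector
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$acc end parallel
  !$acc end data
  print *, 'c(1) =', c(1)
  deallocate(a, b, c)
end program vecadd

With PGI 18.x this would be built with something along the lines of pgfortran -acc -ta=tesla vecadd.f90 (exact flags depend on the compiler version and target GPU).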
26
DATA REGIONS IN OPENACC
All static intrinsic data types of the programming language can appear in an OpenACC data directive, e.g. real, complex, and integer scalar variables in Fortran.
The same holds for all fixed-size arrays of intrinsic types, and for dynamically allocated arrays of intrinsic types, e.g. allocatable and pointer variables in Fortran.
The compiler knows the base address and size (in C, the size needs to be specified in the directive).
… So what about derived types? Two variants:
intrinsic data types, static and dynamic
type stat_def
integer a,b
real c
end type stat_def
type(stat_def)::var_stat
type dyn_def
integer m
real,allocatable,dimension(:) :: r
end type dyn_def
type(dyn_def)::var_dyn
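For the static variant (var_stat above), a single data clause is enough because the compiler knows the full size of the type; a minimal sketch (the directives are illustrative, not taken from VASP):

! var_stat only has statically sized members (a, b, c),
! so one clause moves the whole object.
!$acc enter data copyin(var_stat)
! ... use var_stat inside compute regions ...
!$acc exit data delete(var_stat)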
27
DEEPCOPY IN OPENACC
The generic case is a main goal for a future OpenACC 3.0 specification. This is often referred to as full deep copy.
Until then, writing a manual deep copy is the best way to handle derived types:
• OpenACC 2.6 provides functionality (attach/detach)
• Static members in a derived type are handled by the compiler.
• The programmer manually copies every dynamic member in the derived type.
• AND ensures correct pointer attachment/detachment in the parent!
full vs manual
28
DERIVED TYPES IN OPENACC
Manual copy
type dyn_def
integer m
real,allocatable,dimension(:) :: r
end type dyn_def
type(dyn_def)::var_dyn
…
allocate(var_dyn%r(some_size))
!$acc enter data copyin(var_dyn,var_dyn%r)
…
!$acc exit data copyout(var_dyn%r,var_dyn)
What each clause does:
copyin(var_dyn): 1. allocates device memory for var_dyn, 2. copies m (H2D), 3. copies the host pointer for var_dyn%r → device pointer invalid
copyin(var_dyn%r): 1. allocates device memory for r, 2. copies r (H2D), 3. attaches the device copy's pointer var_dyn%r to the device copy of r
copyout(var_dyn%r): 1. copies r (D2H), 2. deallocates device memory for r, 3. detaches var_dyn%r on the device, i.e. overwrites r with its host value → device pointer invalid
copyout(var_dyn): 1. copies m (D2H), 2. copies var_dyn%r → host pointer intact, 3. deallocates device memory for var_dyn
29
MANAGING VASP AGGREGATE DATA STRUCTURES
• OpenACC + Unified Memory is not an option today, since some aggregates have static members
• OpenACC 2.6 manual deepcopy was key
• Requires large numbers of directives in some cases, but well encapsulated (107 lines for COPYIN; the pattern is sketched after this slide's diagram)
• Future versions of OpenACC (3.0) will add true deep copy and require far fewer data directives
• When CUDA Unified Memory + HMM supports all classes of data, there is potential for a VASP port with no data directives at all
[Diagram: nested hierarchy of VASP derived types and the directive count for their manual deep copy:
Derived Type 1: 3 dynamic members, 1 member of Derived Type 2 (+12 lines of code)
Derived Type 2: 21 dynamic members, 1 member of Derived Type 3, 1 member of Derived Type 4 (+48 lines of code)
Derived Type 3: only static members
Derived Type 4: 8 dynamic members, 4 members of Derived Type 5, 2 members of Derived Type 6 (+26 lines of code)
Derived Type 5: 3 dynamic members (+8 lines of code)
Derived Type 6: 8 dynamic members (+13 lines of code)]
Manual deepcopy allowed porting RMM-DIIS
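The encapsulated COPYIN code mentioned above essentially repeats one pattern per aggregate: copy the parent, then each dynamic member, recursing into nested derived types. A minimal sketch of that pattern (type and routine names are invented for illustration and are not VASP's actual data structures):

module deepcopy_sketch
  implicit none
  ! Hypothetical nested types standing in for VASP's aggregates.
  type inner_t
     real(8), allocatable :: work(:)
  end type inner_t
  type outer_t
     integer :: n
     real(8), allocatable :: coeff(:)
     type(inner_t) :: child
  end type outer_t
contains
  subroutine acc_copyin_outer(v)
    type(outer_t), intent(inout) :: v
    ! 1. Copy the parent: static members plus (stale) host pointers.
    !$acc enter data copyin(v)
    ! 2. Copy each dynamic member; with OpenACC 2.6 this also attaches
    !    the device pointer inside the device copy of the parent.
    !$acc enter data copyin(v%coeff)
    ! 3. Recurse into nested derived-type members in the same way.
    !$acc enter data copyin(v%child%work)
  end subroutine acc_copyin_outer

  subroutine acc_copyout_outer(v)
    type(outer_t), intent(inout) :: v
    ! Mirror image: dynamic members first (detach), parent last.
    !$acc exit data copyout(v%child%work)
    !$acc exit data copyout(v%coeff)
    !$acc exit data copyout(v)
  end subroutine acc_copyout_outer
end module deepcopy_sketch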
30
STATUS OF PORTING VASP WITH OPENACC
Indeed, porting was possible with only minor code refactoring
Manual deep copy was key
Interfaced to the cuFFT, cuBLAS and cuSolver math libraries, and to NCCL for communication (see the sketch after this list)
Ongoing integration of OpenACC into current VASP development source
Public availability expected with next VASP release
Intermediate results on feasibility
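To give a flavor of what interfacing to these libraries looks like in OpenACC code, here is a hedged sketch of a device-resident GEMM call, assuming the cublas Fortran module shipped with the PGI/NVIDIA compilers (routine and array names are placeholders, not VASP code):

subroutine gemm_on_device(m, n, k, a, b, c)
  use cublas          ! Fortran interfaces to cuBLAS (PGI/NVIDIA compilers)
  implicit none
  integer, intent(in)     :: m, n, k
  real(8), intent(in)     :: a(m,k), b(k,n)
  real(8), intent(inout)  :: c(m,n)
  ! The arrays are assumed to already live on the device,
  ! e.g. through an enclosing !$acc data region.
  !$acc data present(a, b, c)
  !$acc host_data use_device(a, b, c)
  ! host_data hands the *device* addresses of a, b, c to cuBLAS.
  call cublasDgemm('N', 'N', m, n, k, 1.0d0, a, m, b, k, 0.0d0, c, m)
  !$acc end host_data
  !$acc end data
end subroutine gemm_on_device

cuFFT and cuSolver are interfaced in the same spirit: keep the data device-resident inside OpenACC data regions and pass device pointers through host_data.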
31
STATUS OF PORTING VASP WITH OPENACC
Solvers: RMM-DIIS, blocked Davidson (ALGO = Normal, Fast & VeryFast)
Levels of Theory: standard DFT, Fock operator action (exact exchange) for hybrid DFT
Projection schemes: Real space and Automatic; reciprocal space ongoing
Executable flavors: standard and Gamma-point variant
Most important parallelization options
Supported features and algorithms
32
AGENDA
Introduction to VASP
Status of the CUDA port
Prioritizing Use Cases for GPU Acceleration
OpenACC in VASP
Comparative Benchmarking
33
VASP OPENACC PERFORMANCE
ALGO=VeryFast / RMM-DIIS: silica_IFPEN on V100 SXM2 16GB
[Chart: full benchmark (silica_IFPEN), speedup over dual CPU vs. number of V100 GPUs (1, 2, 4, 8) for VASP 5.4.4 and dev_OpenACC; y-axis 0 to 4x.]
• Total elapsed time for the entire benchmark
• NCORE=36 / NSIM=8 on CPU: smaller workload than on the GPU, but it improves CPU performance
• 'tuned' full run takes 311 s on CPU
• GPUs outperform the dual-socket CPU node, in particular the OpenACC version
• 80 s on Volta-based DGX-1 with OpenACC
• OpenACC needs fewer ranks per GPU
CPU: dual socket Skylake Gold 6140, compiler Intel 2018 Update 4; GPU: VASP 5.4.4 (Intel 18, CUDA 10.0); dev_OpenACC (PGI 18.7, CUDA 10.0)
34
VASP OPENACC PERFORMANCE
Kernel-level comparison for energy expectation values

                                     | CUDA port   | OpenACC port
kernels per orbital                  | 1 (69 µs)   | 8 (90 µs total)
kernels per NSIM-block (4 orbitals)  | 1 (137 µs)  | 0 (0 µs)
runtime per orbital                  | 104 µs      | 90 µs
runtime per NSIM-block (4 orbitals)  | 413 µs      | 360 µs

• NSIM independent reductions
• An additional NSIM-fused kernel was probably better on older GPU generations
• Unfusing removes a synchronization point
• OpenACC adapts the optimization to the architecture with a flag
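The 'NSIM independent reductions' point can be illustrated with a small sketch: one reduction kernel per orbital instead of one fused NSIM-wide kernel (names and the loop body are invented stand-ins; the real VASP kernels are more involved):

subroutine band_energies(nsim, np, c, hc, e)
  implicit none
  integer, intent(in)     :: nsim, np
  complex(8), intent(in)  :: c(np, nsim), hc(np, nsim)
  real(8), intent(out)    :: e(nsim)
  integer :: n, i
  real(8) :: s
  do n = 1, nsim
     s = 0.0d0
     ! One independent reduction kernel per orbital; no fused kernel and
     ! therefore no extra synchronization point between orbitals.
     !$acc parallel loop reduction(+:s) present(c, hc)
     do i = 1, np
        s = s + real(conjg(c(i, n)) * hc(i, n), 8)
     end do
     e(n) = s
  end do
end subroutine band_energies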
35
VASP OPENACC PERFORMANCE
Section-level comparison for orthonormalization

                              | CUDA port               | OpenACC port
Redistributing wavefunctions  | Host-only MPI (185 ms)  | GPU-aware MPI (165 ms)
Matrix-matrix multiplies      | Streamed data (19 ms)   | GPU-local data (10 ms)
Cholesky decomposition        | CPU-only (24 ms)        | cuSolver (5 ms)
Matrix-matrix multiplies      | Default scheme (30 ms)  | Better blocking (7 ms)
Redistributing wavefunctions  | Host-only MPI (185 ms)  | GPU-aware MPI (125 ms)
• GPU-aware MPI is beneficial
• Data stays on the GPU for OpenACC; it is streamed back for the GEMMs in CUDA
• Cholesky on the CPU saves a (smaller) memory transfer
• 20% saved by GPU communication
• 10% by the other improvements
• Communication is still the bottleneck
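The GPU-aware MPI path boils down to handing device buffers straight to MPI from within OpenACC code. A minimal sketch of that pattern, assuming a CUDA-aware MPI library (buffer and routine names are illustrative, not the actual VASP redistribution code):

subroutine redistribute(sendbuf, recvbuf, ncount, comm)
  use mpi
  implicit none
  integer, intent(in)      :: ncount, comm
  complex(8), intent(in)   :: sendbuf(:)
  complex(8), intent(out)  :: recvbuf(:)
  integer :: ierr
  ! Both buffers are assumed to already be resident on the GPU.
  !$acc data present(sendbuf, recvbuf)
  !$acc host_data use_device(sendbuf, recvbuf)
  ! With a CUDA-aware MPI, device addresses go directly to MPI,
  ! avoiding the host staging copies of the host-only MPI path.
  call MPI_Alltoall(sendbuf, ncount, MPI_DOUBLE_COMPLEX, &
                    recvbuf, ncount, MPI_DOUBLE_COMPLEX, comm, ierr)
  !$acc end host_data
  !$acc end data
end subroutine redistribute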
36
VASP OPENACC PERFORMANCE
Full benchmark timings are interesting for time-to-solution, but are not an ‘apples-to-apples’ comparison between the CUDA and OpenACC versions:
• Amdahl’s law for non-GPU accelerated parts of code affects both implementations, but blurs differences
• Using OpenACC made it possible to port additional kernels with minimal effort; this has not been undertaken for the CUDA version
• The OpenACC version uses GPU-aware MPI, which helps the more communication-intensive parts, like orthonormalization
• The OpenACC version was forked from a more recent version of the CPU code, while the CUDA implementation is older
Can we find a subset that allows for a fairer comparison?
Differences between CUDA and OpenACC versions
→ Use the EDDRMM section
37
VASP OPENACC PERFORMANCE
ALGO=VeryFast / RMM-DIIS: silica_IFPEN on V100 SXM2 16GB
[Chart: EDDRMM section (silica_IFPEN), speedup over dual CPU vs. number of V100 GPUs (1, 2, 4, 8) for VASP 5.4.4 and dev_OpenACC; y-axis 0 to 12x.]
• EDDRMM part has comparable GPU coverage for the CUDA and OpenACC versions
• CUDA version uses kernel fusing, OpenACC version uses two refactored kernels
• minimal amount of MPI communication
• OpenACC version improves scaling with number of GPUs
• GPUs still outperform dual socket CPU node, in particular OpenACC version
CPU: dual socket Skylake Gold 6140, compiler Intel 2018 Update 4; GPU: VASP 5.4.4 (Intel 18, CUDA 10.0); dev_OpenACC (PGI 18.7, CUDA 10.0)
38
VASP OPENACC PERFORMANCE
ALGO=Normal / Davidson: si-Huge on V100 SXM2 16GB
• Total elapsed time for the entire benchmark
• NCORE=36 on CPU: smaller workload than in the GPU versions, improves CPU performance
• 'tuned' full run takes 3468 s on CPU
• Drastically improved scaling for blocked Davidson with OpenACC
• 521 s on Volta-based DGX-1 with OpenACC
• MPS not needed for OpenACC
CPU: dual socket Skylake Gold 6140, compiler Intel 2018 Update 4; GPU: VASP 5.4.4 (Intel 18, CUDA 10.0); dev_OpenACC (PGI 18.7, CUDA 10.0)
[Chart: full benchmark (si-Huge), speedup over dual CPU vs. number of V100 GPUs (1, 2, 4, 8) for VASP 5.4.4 and dev_OpenACC; y-axis 0 to 7x.]
39
VASP OPENACC PERFORMANCE
ALGO=Fast / Davidson+RMM-DIIS: GaAs_BiDoped0.04 on V100 SXM2 16GB
• New, customer-provided dataset that we have not specifically optimized for
• Does 5 Davidson and 8 RMM-DIIS steps
• ‘tuned’ full run takes 2932 s on CPU
• 299 s on Volta-based DGX-1 with OpenACC
• MPS not needed for OpenACC
• High speedups for standard DFT!
CPU: dual socket Skylake Gold 6140, compiler Intel 2018 Update 4; GPU: VASP 5.4.4 (Intel 18, CUDA 10.0); dev_OpenACC (PGI 18.7, CUDA 10.0)
[Chart: full benchmark (GaAs_BiDoped0.04), speedup over dual CPU vs. number of V100 GPUs (1, 2, 4, 8) for VASP 5.4.4 and dev_OpenACC; y-axis 0 to 10x.]
40
VASP OPENACC SNEAK PEEK PERFORMANCE
ALGO=All / direct optimization: UCSB_CeO2_HSE06 on V100 SXM2 16GB
• Another new, customer-provided dataset
• Gamma-point version usable
• Hybrid DFT (exact exchange)
• Direct optimization not officially supported with the CUDA port
• 'tuned' full run takes 15950 s on CPU
• 1592 s on Volta-based DGX-1 with OpenACC; 10x even though only the Fock operator action has been ported to the GPU so far!
CPU: dual socket Skylake Gold 6140, compiler Intel 2018 Update 4; GPU: VASP 5.4.4 (Intel 18, CUDA 10.0); dev_OpenACC (PGI 18.7, CUDA 10.0)
[Chart: LOOP+ time (ucsb_ceo2_hse06), speedup over dual CPU vs. number of V100 GPUs (1, 2, 4, 8) for VASP 5.4.4 and dev_OpenACC; y-axis 0 to 10x.]
41
VASP OPENACC SNEAK PEEK PERFORMANCE
ALGO=All / direct optimization: UCSB_CeO2_HSE06 on V100 SXM2 16GB
• 'tuned' timings for fock_all: OpenACC vs CPU (dev version)
• Using NCCL for double-buffered, asynchronous GPU communication and reduction
• MPS gives only a slight advantage in the 1-GPU case
• Measured with a development snapshot that needs memory transfers before and after fock_all until the surrounding routines are ported
CPU: dual socket Skylake Gold 6140, compiler Intel 2018 Update 4; GPU: dev_OpenACC (PGI 18.7, CUDA 10.0)
[Chart: Fock operator (ucsb_ceo2_hse06), speedup over dual CPU vs. number of V100 GPUs (1, 2, 4, 8) for dev_OpenACC; y-axis 0 to 20x.]
42
“For VASP, OpenACC is the way forward for GPU acceleration. Performance is similar and in some cases better than CUDA C, and OpenACC dramatically decreases GPU development and maintenance efforts.”
Prof. Georg Kresse
Computational Materials Physics
University of Vienna