QE-GPU, an experience of porting to GPUs
Cineca - Presentation, 05-Dec-2017
MHPC (ICTP - SISSA)
Dr Stefano de [email protected] (SISSA)
Ivan [email protected] (ICTP)
Filippo [email protected] (Cambridge)
If things go by plan:
- Algorithm you have seen a million times
- Background of the problem
- Proposed solutions to this problem
- How close are we to the solution?
- What you can gain from this boring presentation (CUDA Fortran)
- A single-code CPU & GPU solution to exascale QE
MPI Performance AUSURF112
[Figure: time [s] of forces, electrons and init_run vs number of processes (2-32) on KNL, BWD and SKL.]
MPI Performance SiO2 - Davidson (left), CG (right)
[Figure: time [s] vs number of processes (4-20); left (Davidson): fftw, h_psi, cdiaghg; right (CG): fftw, h_1psi, cdiaghg.]
MPI Performance AUSURF112 - Davidson (left), CG (right)
[Figure: time [s] vs number of processes (4-20); left (Davidson): fftw, cdiaghg, s_psi, h_psi; right (CG): fftw, cdiaghg, h_1psi.]
MPI+OMP Performance of Davidson (AUSURF)
[Figure: time [s] of fftw, cdiaghg and h_psi vs number of processes [MPI::OMP] (4::10, 8::5, 10::4).]
Solutions:
1. Finding a better method - the ESLW project
2. Porting the current code to GPU systems
Davidson vs PPCG Performance
[Figure: time [s] of Davidson and PPCG vs number of processes (8-60).]
Solution 2: Porting to GPU systems
1. Declare and allocate host and device memory
2. Initialize host data
3. Transfer data from the host to the device
4. Execute one or more kernels
5. Transfer results from the device to the host
module lazy
contains
  subroutine whatever(N, x, y, a)
    implicit none
    integer, intent(in)    :: N
    real,    intent(in)    :: x(N), a
    real,    intent(inout) :: y(N)
    integer :: i
    do i = 1, N
      y(i) = y(i) + a*x(i)
    end do
  end subroutine
end module
program mySGEVV
  use lazy, only: whatever
  implicit none
  integer, parameter :: N = 40000000
  real :: x(N), a
  real :: y(N)
  x = 1.0; y = 2.0; a = 2.0
  call whatever(N, x, y, a)
end program
module lazy
contains
  attributes(global) subroutine whatever(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine
end module
program mySGEVV
  use lazy; use cudafor
  implicit none
  integer, parameter :: N = 40000000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)                ! Declare and allocate host & device memory
  type(dim3) :: grid, tBlock
  tBlock = dim3(256, 1, 1)
  grid = dim3(ceiling(real(N)/tBlock%x), 1, 1)
  x = 1.0; y = 2.0; a = 2.0                     ! Initialize host data
  x_d = x; y_d = y                              ! Transfer data from the host to the device
  call whatever<<<grid, tBlock>>>(x_d, y_d, a)  ! Execute one or more kernels
  y = y_d                                       ! Transfer results from the device to the host
end program
module m
  ...
  real :: a(n)
  real, device :: a_d(n)
  ...
end module
subroutine update
#ifdef USE_GPU
  use m, only: a => a_d
#else
  use m, only: a
#endif
  ...
!$cuf kernel do <<<*,*>>>
  do i = 1, n
    a(i) = a(i) + b
  enddo
  ...
end subroutine update
subroutine add_scalar(a, n)
  real :: a(n)
#ifdef USE_GPU
  attributes(device) :: a
#endif
!$cuf kernel do <<<*,*>>>
  do i = 1, n
    a(i) = a(i) + b
  end do
end subroutine add_scalar
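The #ifdef USE_GPU scheme above keeps one source tree for both targets; the choice between CPU and GPU paths is made entirely at compile time. A minimal sketch of the two builds, assuming the PGI compiler (pgfortran) and a hypothetical file update.F90 holding the code above (the uppercase .F90 extension triggers preprocessing; -Mcuda enables CUDA Fortran; exact flags depend on the toolchain):

```shell
# CPU-only build: USE_GPU left undefined, CUF directives ignored
pgfortran -O2 -o update_cpu update.F90

# GPU build: define USE_GPU and enable CUDA Fortran code generation
pgfortran -DUSE_GPU -Mcuda -O2 -o update_gpu update.F90
```
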
GPU Performance of AUSURF112 - Drake
[Figure: two panels; time [s] of forces, electrons and init_run vs MPI::OMP::GPU configuration (2::8::2, 4::4::4).]
GPU Performance of SiO2 - Drake
[Figure: two panels; time [s] of forces, electrons and init_run vs MPI::OMP::GPU configuration (2::8::2, 4::4::4).]
GPU Performance of AUSURF112 - DAVIDE
[Figure: time [s] of Davidson (left) and CG (right) vs MPI::OMP::GPU configuration (4::4::4, 8::4::8, 16::4::16).]
GPU Performance of SiO2 - DAVIDE
[Figure: time [s] of Davidson (left) and CG (right) vs MPI::OMP::GPU configuration (4::4::4, 8::4::8, 16::4::16).]
Memory Usage Davidson vs CG
[Figure: memory usage per device [Mb] of Davidson and CG for SiO2 and Au (AUSURF) at MPI::OMP::GPU 2::8::2 and 4::4::4.]
CPU vs GPU for Davidson and CG
[Figure: time [s] of CG and Davidson. Left: CPU 8::1 vs GPU 2::2::2 (MPI::OMP::GPU), speedups 4.3x and 5.7x. Right: CPU 32::1 vs GPU 4::4::4, speedups 1.2x and 1.6x.]
Average Volatile GPU usage Davidson vs CG
[Figure: average GPU utilization [%] of CG and Davidson at MPI::OMP::GPU 2::8::2 and 4::4::4.]
Version 1.0 of QE-GPU is available at https://github.com/fspiga/qe-gpu
Contact me at [email protected]
Thank You