QE-GPU, an experience of porting to GPUs
Cineca - Presentation, 05-Dec-2017
MHPC (ICTP - SISSA)
Dr Stefano de [email protected] (SISSA)
Ivan [email protected] (ICTP)
Filippo [email protected] (Cambridge)
If things go by plan:
- Algorithm you have seen a million times
- Background of the problem
- Proposed solutions to this problem
- How close are we to the solution?
- What you can gain from this boring presentation (CUDA Fortran)
- A single-code CPU & GPU solution to exascale QE
MPI Performance AUSURF112
[Figure: time [s] of forces, electrons and init_run vs number of processes (2-32) on KNL, BWD and SKL.]
MPI Performance SiO2 - Davidson (left), CG (right)
[Figure: time [s] vs number of processes (4-20); left (Davidson): fftw, h_psi, cdiaghg; right (CG): fftw, h_1psi, cdiaghg.]
MPI Performance AUSURF112 - Davidson (left), CG (right)
[Figure: time [s] vs number of processes (4-20); left (Davidson): fftw, cdiaghg, s_psi, h_psi; right (CG): fftw, cdiaghg, h_1psi.]
MPI+OMP Performance of Davidson (AUSURF)
[Figure: time [s] of fftw, cdiaghg and h_psi vs number of processes [MPI::OMP] (4::10, 8::5, 10::4).]
Solutions:
1. Finding a better method - the ESLW project
2. Porting the current code to GPU systems
Davidson vs PPCG Performance
[Figure: time [s] of Davidson and PPCG vs number of processes (8-60).]
Solution 2: Porting to GPU systems
1. Declare and allocate host and device memory
2. Initialize host data
3. Transfer data from the host to the device
4. Execute one or more kernels
5. Transfer results from the device to the host
module lazy
contains
  subroutine whatever(N, x, y, a)
    implicit none
    integer, intent(in)    :: N
    real,    intent(in)    :: x(N), a
    real,    intent(inout) :: y(N)
    integer :: i
    do i = 1, N
      y(i) = y(i) + a*x(i)
    end do
  end subroutine
end module
program mySGEVV
  use lazy, only: whatever
  implicit none
  integer, parameter :: N = 40000000
  real :: x(N), a
  real :: y(N)
  x = 1.0; y = 2.0; a = 2.0
  call whatever(N, x, y, a)
end program
module lazy
contains
  attributes(global) subroutine whatever(x, y, a)
    implicit none
    real :: x(:), y(:)
    real, value :: a
    integer :: i, n
    n = size(x)
    i = blockDim%x * (blockIdx%x - 1) + threadIdx%x
    if (i <= n) y(i) = y(i) + a*x(i)
  end subroutine
end module
program mySGEVV
  use lazy; use cudafor
  implicit none
  integer, parameter :: N = 40000000
  real :: x(N), y(N), a
  real, device :: x_d(N), y_d(N)                ! Declare and allocate host & device memory
  type(dim3) :: grid, tBlock
  tBlock = dim3(256, 1, 1)
  grid = dim3(ceiling(real(N)/tBlock%x), 1, 1)
  x = 1.0; y = 2.0; a = 2.0                     ! Initialize host data
  x_d = x; y_d = y                              ! Transfer data from the host to the device
  call whatever<<<grid, tBlock>>>(x_d, y_d, a)  ! Execute one or more kernels
  y = y_d                                       ! Transfer results from the device to the host
end program
module m
  ...
  real :: a(n)
  real, device :: a_d(n)
  ...
end module
subroutine update
#ifdef USE_GPU
  use m, only: a => a_d
#else
  use m, only: a
#endif
  ...
!$cuf kernel do <<<*,*>>>
  do i = 1, n
    a(i) = a(i) + b
  enddo
  ...
end subroutine update
subroutine add_scalar(a, n)
  real :: a(n)
#ifdef USE_GPU
  attributes(device) :: a
#endif
!$cuf kernel do <<<*,*>>>
  do i = 1, n
    a(i) = a(i) + b
  end do
end subroutine add_scalar
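The #ifdef USE_GPU scheme above keeps one source tree for both targets; the choice between CPU and GPU paths is made entirely at compile time. A minimal sketch of the two builds, assuming the PGI compiler (pgfortran) and a hypothetical file update.F90 holding the code above (the uppercase .F90 extension triggers preprocessing; -Mcuda enables CUDA Fortran; exact flags depend on the toolchain):

```shell
# CPU-only build: USE_GPU left undefined, CUF directives ignored
pgfortran -O2 -o update_cpu update.F90

# GPU build: define USE_GPU and enable CUDA Fortran code generation
pgfortran -DUSE_GPU -Mcuda -O2 -o update_gpu update.F90
```
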
GPU Performance of AUSURF112 - Drake
[Figure: two panels; time [s] of forces, electrons and init_run vs MPI::OMP::GPU configuration (2::8::2, 4::4::4).]
GPU Performance of SiO2 - Drake
[Figure: two panels; time [s] of forces, electrons and init_run vs MPI::OMP::GPU configuration (2::8::2, 4::4::4).]
GPU Performance of AUSURF112 - DAVIDE
[Figure: time [s] of Davidson (left) and CG (right) vs MPI::OMP::GPU configuration (4::4::4, 8::4::8, 16::4::16).]
GPU Performance of SiO2 - DAVIDE
[Figure: time [s] of Davidson (left) and CG (right) vs MPI::OMP::GPU configuration (4::4::4, 8::4::8, 16::4::16).]
Memory Usage Davidson vs CG
[Figure: memory usage per device [Mb] of Davidson and CG for SiO2 and Au (AUSURF) at MPI::OMP::GPU 2::8::2 and 4::4::4.]
CPU vs GPU for Davidson and CG
[Figure: time [s] of CG and Davidson. Left: CPU 8::1 vs GPU 2::2::2 (MPI::OMP::GPU), speedups 4.3x and 5.7x. Right: CPU 32::1 vs GPU 4::4::4, speedups 1.2x and 1.6x.]
Average Volatile GPU usage Davidson vs CG
[Figure: average GPU utilization [%] of CG and Davidson at MPI::OMP::GPU 2::8::2 and 4::4::4.]
Version 1.0 of QE-GPU is available at https://github.com/fspiga/qe-gpu
Contact me at [email protected]
Thank You