GPU Computing with OpenACC and CUDA

Akira Naruse, 19th Jul. 2018

2

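About the speaker

▪ Akira Naruse (成瀬彰)

▪ 2013 to present: NVIDIA, Senior Developer Technology Engineer

▪ 1996 to 2013: Fujitsu Laboratories, researcher and other roles

▪ Expertise and interests: parallel processing, performance optimization, supercomputers and HPC, GPU computing, deep learning, …

▪ For details: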

github.com/anaruse

3

AGENDA

▪ GPU Computing

▪ Volta GPUs

▪ OpenACC

▪ CUDA

4

GPU Computing

5

GPU computing: low latency + high throughput

CPU GPU

6

Application execution

Application code

GPU

CPU

Parallel parts run on the GPU

Compute-intensive parts

Sequential parts run on the CPU

Do i=1,N

End do

7

GPU structure (Tesla P100)

A massive number of CUDA cores: parallelism is the key

64 CUDA cores/SM

56 SMs/chip

3584 CUDA cores/chip

8

GPU APPLICATIONS

Hundreds of applications are GPU-enabled

www.nvidia.com/object/gpu-applications.html

9

LEADING APPLICATIONS

10

DL FRAMEWORKS

11

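How to GPU-enable an application

Application

▪ Library: switch to GPU-enabled libraries (easy to get started)

▪ OpenACC: insert directives into existing code (easy to accelerate)

▪ CUDA: write the main computation in CUDA (high flexibility)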

12

NVIDIA LIBRARIES

cuRAND Thrust

cuDNN

AmgX cuBLAS

cuFFT

cuSPARSE cuSOLVER

Performance Primitives NCCL

nvGRAPH

TensorRT

http://code.google.com/p/thrust/downloads/list

13

PARTNER LIBRARIES

Computer Vision

Sparse direct solvers

Sparse Iterative Methods

Linear Algebra

Linear Algebra

Graph

Audio and Video Matrix, Signal and Image

Math, Signal and Image Linear Algebra

Computational Geometry

Real-time visual simulation

14


Application

Library: easy to get started

OpenACC: easy to accelerate

CUDA: write the main computation in CUDA

15

SAXPY (Y=A*X+Y)

OpenMP

    void saxpy(int n,
               float a,
               float *x,
               float *restrict y)
    {
    #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] += a*x[i];
    }
    ...
    saxpy(N, 3.0, x, y);
    ...

OpenACC

    void saxpy(int n,
               float a,
               float *x,
               float *restrict y)
    {
    #pragma acc parallel copy(y[:n]) copyin(x[:n])
        for (int i = 0; i < n; ++i)
            y[i] += a*x[i];
    }
    ...
    saxpy(N, 3.0, x, y);
    ...

16


Application

Library: easy to get started

OpenACC: easy to accelerate

CUDA: write the main computation in CUDA

17

SAXPY (Y=A*X+Y)

CPU

    void saxpy(int n, float a,
               float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a*x[i];
    }
    ...
    saxpy(N, 3.0, x, y);
    ...

CUDA

    __global__ void saxpy(int n, float a,
                          float *x, float *y)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n)
            y[i] += a*x[i];
    }
    ...
    size_t size = sizeof(float) * N;
    cudaMemcpy(d_x, x, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, size, cudaMemcpyHostToDevice);
    saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
    cudaMemcpy(y, d_y, size, cudaMemcpyDeviceToHost);
    ...

18

Volta GPUs

19

GPU roadmap

[Chart: performance per watt by year, 2008 to 2018, for the Tesla, Fermi, Kepler, Maxwell, Pascal and Volta architectures]

20

Tesla V100 overview

A GPU well suited to both deep learning and HPC

▪ Volta Architecture: Most Productive GPU

▪ Tensor Core: 125 Programmable TFLOPS Deep Learning

▪ Improved SIMT Model: New Algorithms

▪ Volta MPS: Inference Utilization

▪ Improved NVLink & HBM2: Efficient Bandwidth

21

Volta: a large boost in HPC performance

HPC application performance (relative to P100)

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.

Summit Supercomputer: 200+ PetaFlops, ~3,400 Nodes, 10 Megawatts

22

Summit: Volta as the engine of the top US supercomputer

▪ AI Exascale Today: 3+ EFLOPS Tensor Ops

▪ Performance Leadership: 10X Perf Over Titan (20 PF, 200 PF)

▪ Accelerated Science: 5-10X Application Perf Over Titan

ACME, DIRAC, FLASH, GTC, HACC, LSDALTON, NAMD, NUCCOR, NWCHEM, QMCPACK, RAPTOR, SPECFEM, XGC

23

TESLA V100

Transistors: 21B, 815 mm2

80 SMs, 5120 CUDA cores, 640 Tensor cores

HBM2: 32 GB, 900 GB/s

NVLink: 300 GB/s

*full GV100 chip contains 84 SMs

24

GPU peak performance comparison: P100 vs V100

                          P100           V100               Gain
    Training performance  10 TOPS        125 TOPS           12x
    Inference performance 21 TFLOPS      125 TOPS           6x
    FP64/FP32             5/10 TFLOPS    7.8/15.6 TFLOPS    1.5x
    HBM2 bandwidth        720 GB/s       900 GB/s           1.2x
    NVLink bandwidth      160 GB/s       300 GB/s           1.9x
    L2 cache              4 MB           6 MB               1.5x
    L1 cache              1.3 MB         10 MB              7.7x

25

HBM2 memory: higher utilization

[Chart: STREAM Triad, delivered GB/s, P100 vs V100]

Utilization of the HBM2 stack: P100 76%, V100 95%

Effective bandwidth: 1.5x

V100 measured on pre-production hardware.
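As a rough cross-check of the 1.5x figure, the sketch below multiplies the peak HBM2 bandwidths from the P100 vs V100 comparison (720 GB/s and 900 GB/s) by the utilization percentages on this slide. The function name is illustrative and not part of any NVIDIA tool.

```c
/* Delivered bandwidth = peak bandwidth x fraction of peak actually achieved.
 * Illustrative helper; the name is an assumption made for this sketch. */
double delivered_gbs(double peak_gbs, double utilization)
{
    return peak_gbs * utilization;
}
```

Under these assumptions, delivered_gbs(900.0, 0.95) / delivered_gbs(720.0, 0.76) is 855 / 547.2, or about 1.56, in line with the stated 1.5x.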

26

VOLTA NVLINK

                           P100        V100
    Links                  4           6
    Bandwidth per link     40 GB/s     50 GB/s
    Total bandwidth        160 GB/s    300 GB/s

(*) Bandwidths are bidirectional

DGX1V

27

VOLTA GV100 SM

                                  GV100
    FP32 units                    64
    FP64 units                    32
    INT32 units                   64
    Tensor cores                  8
    Register file                 256 KB
    Unified L1 / shared memory    128 KB
    Active threads                2048

(*) Per SM

28

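VOLTA GV100 SM

The easiest SM to program: improved productivity

▪ Redesigned instruction set

▪ Twice as many schedulers

▪ Simplified instruction issue logic

▪ Larger and faster L1 cache

▪ Improved SIMT model

▪ Accelerated tensor computation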

29

OpenACC

30

OpenACC programming

▪ Overview

▪ Porting a program to OpenACC

▪ OpenACC case studies

31

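OPENACC

Existing C/Fortran code, with hints added (serial code runs on the CPU, the kernels region on the GPU):

    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ... serial code ...
    End Program myscience

Simple: add hints for the compiler to existing code

Powerful: with a reasonable amount of effort, the compiler parallelizes the code automatically

Open: multiple compiler vendors support a variety of processors (NVIDIA GPU, AMD GPU, x86 CPU, …)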

32

GPU computing

Application code

GPU

CPU

Parallel parts run on the GPU

Compute-intensive parts

Sequential parts run on the CPU

Do i=1,N

End do

OpenACC


34

SAXPY (y=a*x+y)

The OpenMP and OpenACC versions differ only in the directive placed before the loop:

▪ omp → acc: "#pragma omp parallel for" becomes "#pragma acc parallel"

▪ Data movement: the OpenACC directive adds copy(y[:n]) copyin(x[:n])

    void saxpy(int n,
               float a,
               float *x,
               float *restrict y)
    {
        /* the OpenMP or OpenACC directive goes here */
        for (int i = 0; i < n; ++i)
            y[i] += a*x[i];
    }

35

SAXPY (y=a*x+y, FORTRAN)

▪ Fortran works the same way

OpenMP

    subroutine saxpy(n, a, X, Y)
      real :: a, X(:), Y(:)
      integer :: n, i
    !$omp parallel do
      do i=1,n
        Y(i) = a*X(i)+Y(i)
      enddo
    !$omp end parallel do
    end subroutine saxpy
    ...
    call saxpy(N, 3.0, x, y)
    ...

OpenACC

    subroutine saxpy(n, a, X, Y)
      real :: a, X(:), Y(:)
      integer :: n, i
    !$acc parallel copy(Y(:)) copyin(X(:))
      do i=1,n
        Y(i) = a*X(i)+Y(i)
      enddo
    !$acc end parallel
    end subroutine saxpy
    ...
    call saxpy(N, 3.0, x, y)
    ...

36

Easy to compile

OpenMP / OpenACC

    void saxpy(int n,
               float a,
               float *x,
               float *restrict y)
    {
    #pragma acc parallel copy(y[:n]) copyin(x[:n])
        for (int i = 0; i < n; ++i)
            y[i] += a*x[i];
    }

    $ pgcc -acc -Minfo=acc saxpy.c
    saxpy:
         16, Generating present_or_copy(y[:n])
             Generating present_or_copyin(x[:n])
             Generating Tesla code
         19, Loop is parallelizable
             Accelerator kernel generated
             19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */

37

Easy to run

OpenMP / OpenACC

    $ pgcc -Minfo -acc saxpy.c
    saxpy:
         16, Generating present_or_copy(y[:n])
             Generating present_or_copyin(x[:n])

    $ nvprof ./a.out
    ==10302== NVPROF is profiling process 10302, command: ./a.out
    ==10302== Profiling application: ./a.out
    ==10302== Profiling result:
    Time(%)      Time  Calls       Avg       Min       Max  Name
     62.95%  3.0358ms      2  1.5179ms  1.5172ms  1.5186ms  [CUDA memcpy HtoD]
     31.48%  1.5181ms      1  1.5181ms  1.5181ms  1.5181ms  [CUDA memcpy DtoH]
      5.56%  268.31us      1  268.31us  268.31us  268.31us  saxpy_19_gpu

38


OpenACC programming

▪ Overview

▪ Porting a program to OpenACC

▪ OpenACC case studies



39

Case study: Jacobi iteration

    while ( error > tol ) {
        error = 0.0;
        for (int j = 1; j < N-1; j++) {
            for (int i = 1; i < M-1; i++) {
                Anew[j][i] = (A[j][i+1] + A[j][i-1] +
                              A[j-1][i] + A[j+1][i]) * 0.25;
                error = max(error, abs(Anew[j][i] - A[j][i]));
            }
        }
        for (int j = 1; j < N-1; j++) {
            for (int i = 1; i < M-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
    }

[Diagram: five-point stencil around A(i,j), with neighbors A(i-1,j), A(i+1,j), A(i,j-1) and A(i,j+1)]

40

Specifying parallel regions (parallel/kernels directives)

▪ Parallel and Kernels

▪ Parallel: close affinity with OpenMP; developer-driven

▪ Kernels: generates multiple kernels; compiler-driven

    while ( error > tol ) {
        error = 0.0;
    #pragma acc kernels
        for (int j = 1; j < N-1; j++) {
            for (int i = 1; i < M-1; i++) {
                Anew[j][i] = (A[j][i+1] + A[j][i-1] +
                              A[j-1][i] + A[j+1][i]) * 0.25;
                error = max(error, abs(Anew[j][i] - A[j][i]));
            }
        }
    #pragma acc kernels
        for (int j = 1; j < N-1; j++) {
            for (int i = 1; i < M-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
    }

41

[PGI tips] Compiler messages

    $ pgcc -acc -Minfo=accel jacobi.c
    jacobi:
         44, Generating copyout(Anew[1:4094][1:4094])
             Generating copyin(A[:][:])
             ...
         45, #pragma acc loop gang /* blockIdx.y */
             ...
         49, Max reduction generated for error

42

Specifying parallel regions (kernels)

▪ Parallel and Kernels

▪ Parallel: developer-driven

▪ Kernels




error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:

59, Generating present_or_copyout(Anew[1:4094][1:4094])

Generating present_or_copyin(A[:][:])







Max reduction generated for error

43

Data transfer (data clause)

▪ Parallel and Kernels

▪ Parallel: developer-driven

▪ Kernels




error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:










44


▪ copyin (Host→GPU)

▪ copyout (Host←GPU)

▪ copy

▪ create

▪ present


error = 0.0;

#pragma acc kernels \

pcopyout(Anew[1:4094][1:4094]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}


pcopyout(A[1:4094][1:4094]) pcopyin(Anew[1:4094][1:4094])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

} N=M=4096

45

Specifying array shape

▪ An array can be transferred in part rather than in full

▪ Note: the notation differs between C/C++ and Fortran

▪ C/C++: array[ start : size ]

    float Anew[4096][4096]
    pcopyout( Anew[1:4094][1:4094] ) pcopyin( A[:][:] )

▪ Fortran: array( start : end )

    real Anew(4096,4096)
    pcopyout( Anew(2:4095, 2:4095) ) pcopyin( A(:,:) )
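The two notations above describe the same elements. As a small sketch of the conversion (the type and function names are made up for illustration, and a default Fortran lower bound of 1 is assumed), a C/C++ section [start:size] maps to a Fortran section (start+1 : start+size):

```c
/* Inclusive bounds of a Fortran array section, written (lower:upper). */
typedef struct {
    int lower;
    int upper;
} fortran_section;

/* Convert a C/C++ OpenACC section [start:size] (0-based start, element
 * count) to the Fortran section covering the same elements, assuming the
 * Fortran array uses the default lower bound of 1. */
fortran_section c_to_fortran_section(int start, int size)
{
    fortran_section s;
    s.lower = start + 1;      /* shift from 0-based to 1-based indexing */
    s.upper = start + size;   /* last element, inclusive */
    return s;
}
```

For example, c_to_fortran_section(1, 4094) gives (2:4095), matching the Anew example above.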

46


▪ copyin (Host→GPU)

▪ copyout (Host←GPU)

▪ copy

▪ create

▪ present


error = 0.0;


pcopy(Anew[:][:]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}


pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

} N=M=4096

47

[PGI tips] PGI_ACC_TIME

$ PGI_ACC_TIME=1 ./a.out

Accelerator Kernel Timing data

/home/anaruse/src/OpenACC/jacobi/C/task1-solution/jacobi.c

jacobi NVIDIA devicenum=0

time(us): 649,886

44: data region reached 200 times

44: data copyin transfers: 800

device time(us): total=14,048 max=41 min=15 avg=17

53: data copyout transfers: 800


44: compute region reached 200 times

46: kernel launched 200 times

grid: [32x4094] block: [128]

device time(us): total=382,798 max=1,918 min=1,911 avg=1,913

elapsed time(us): total=391,408 max=1,972 min=1,953 avg=1,957

46: reduction kernel launched 200 times

grid: [1] block: [256]


elapsed time(us): total=53,510 max=280 min=266 avg=267

53: data region reached 200 times

Statistics

48

[PGI tips] PGI_ACC_NOTIFY

$ PGI_ACC_NOTIFY=3 ./a.out

...

upload CUDA data file=/home/anaruse/src/OpenACC/jacobi/C/task1-solution/jacobi.c function=jacobi line=44 device=0 variable=A bytes=16777216

...

launch CUDA kernel file=/home/anaruse/src/OpenACC/jacobi/C/task1-solution/jacobi.c function=jacobi line=46 device=0 num_gangs=131008 num_workers=1 vector_length=128 grid=32x4094 block=128 shared memory=1024

...

download CUDA data file=/home/anaruse/src/OpenACC/jacobi/C/task1-solution/jacobi.c function=jacobi line=53 device=0 variable=Anew bytes=16736272

...

Trace data

49

NVIDIA VISUAL PROFILER (NVVP)

50

NVVP analysis: data transfer is the bottleneck

[Timeline figure: one cycle, containing two GPU kernel intervals]

Utilization: low

51

Excessive data transfer


error = 0.0;


pcopy(Anew[:][:]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}


pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

}

52

Excessive data transfer


error = 0.0;


pcopy(Anew[:][:]) \

pcopyin(A[:][:])

{

}


pcopy(A[:][:]) \

pcopyin(Anew[:][:])

{

}

}

#pragma acc loop reduction(max:error)

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

Host GPU

copyin

copyin

copyout

copyout

53

Specifying a data region (data directive)

▪ copyin (CPU→GPU)

▪ copyout (CPU←GPU)

▪ copy

▪ create

▪ present

#pragma acc data pcopy(A, Anew)


error = 0.0;

#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

#pragma acc kernels pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

}

54

Specifying a data region (data directive)

▪ copyin (CPU→GPU)

▪ copyout (CPU←GPU)

▪ copy

▪ create

▪ present

#pragma acc data pcopy(A) create(Anew)


error = 0.0;


for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

#pragma acc kernels pcopy(A[:][:]) pcopyin(Anew[:][:])

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

}

55

Appropriate data transfer

#pragma acc data \

pcopy(A) create(Anew)


error = 0.0;


pcopy(Anew[:][:]) \

pcopyin(A[:][:])

{

}


pcopy(A[:][:]) \

pcopyin(Anew[:][:])

{

}

}

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

copyin

copyout

Host GPU
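Putting the pieces together, a minimal self-contained sketch of the Jacobi loop inside a single data region might look as follows. It assumes a small fixed grid (N = M = 64, chosen only for illustration), uses parallel loop with explicit collapse and reduction clauses rather than kernels, and replaces max/abs with plain C so that it also builds as ordinary serial code when the directives are ignored. The function name jacobi and the iter_max cap are assumptions, not taken from the slides.

```c
#define N 64   /* illustrative grid size, not the 4096 used on the slides */
#define M 64

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Runs Jacobi sweeps until the largest update falls below tol or iter_max
 * sweeps have been done. Returns the number of sweeps performed. */
int jacobi(double A[N][M], double Anew[N][M], double tol, int iter_max)
{
    double error = tol + 1.0;
    int iter = 0;

    /* A is copied to the device once and back once; Anew is device-only
     * scratch space, so the while loop itself causes no array transfers. */
    #pragma acc data copy(A[0:N][0:M]) create(Anew[0:N][0:M])
    while (error > tol && iter < iter_max) {
        error = 0.0;

        #pragma acc parallel loop collapse(2) reduction(max:error)
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                Anew[j][i] = (A[j][i + 1] + A[j][i - 1] +
                              A[j - 1][i] + A[j + 1][i]) * 0.25;
                double diff = absd(Anew[j][i] - A[j][i]);
                if (diff > error) error = diff;
            }
        }

        #pragma acc parallel loop collapse(2)
        for (int j = 1; j < N - 1; j++) {
            for (int i = 1; i < M - 1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
        iter++;
    }
    return iter;
}
```

A caller would fill the boundary of A, call jacobi(A, Anew, tol, iter_max), and read the converged interior back from A after the data region ends.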

56

Less data transfer (NVVP)

Utilization: high (1 cycle)

57

Two tasks

Data transfer

Compute offload

Both compute offload and data transfer have to be taken into account

[Diagram: CPU memory and GPU memory connected by PCI]

58

Other ways to manage data

▪ enter data: copyin, create

▪ exit data: copyout, delete

    float *array;

    Init( ) {
        ...
        array = (float*)malloc( … );
        input_array( array );
    #pragma acc enter data copyin(array)
        ...
    }

    Fin( ) {
        ...
    #pragma acc exit data copyout(array)
        output_array( array );
        free( array );
        ...
    }

59

    #pragma acc data pcopy(A,B)
    for (k=0; k<LOOP; k++) {
    #pragma acc kernels present(A,B)
        for (i=0; i<N; i++) {
            A[i] = subA(i,A,B);
        }
    #pragma acc update self(A[0:1])
        output[k] = A[0];
        A[N-1] = input[k];
    #pragma acc update device(A[N-1:1])
    #pragma acc kernels present(A,B)
        for (i=0; i<N; i++) {
            B[i] = subB(i,A,B);
        }
    }

▪ update self: CPU ← GPU

▪ update device: CPU → GPU

60


▪ Using Unified Memory

▪ PGI compiler option: -acc -ta=…,managed,…

▪ Arrays allocated dynamically at run time (C: malloc, Fortran: allocate) are managed by Unified Memory

▪ There is no need to direct their movement with OpenACC data directives

61

Reduction


error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}

62

Reduction

▪ Operators

    +     sum
    *     product
    max   maximum
    min   minimum
    |     bitwise OR
    &     bitwise AND
    ^     XOR
    ||    logical OR
    &&    logical AND


error = 0.0;

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

#pragma acc kernels

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {


}

}

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:










63

Reduction (reduction clause)

▪ Operators (C/C++)

    +     sum
    *     product
    max   maximum
    min   minimum
    |     bitwise OR
    &     bitwise AND
    ^     XOR
    ||    logical OR
    &&    logical AND


error = 0.0;

#pragma acc kernels


for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}
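As a stand-alone illustration of the max reduction used for error, the sketch below computes the largest absolute difference between two arrays. The function name and the one-dimensional layout are assumptions made for brevity; without an OpenACC compiler the directive is ignored and the loop runs serially with the same result.

```c
/* Largest absolute element-wise difference between a and b (length n). */
double max_abs_diff(const double *a, const double *b, int n)
{
    double err = 0.0;

    /* Each thread keeps a private maximum; the partial maxima are combined
     * into err when the loop finishes. */
    #pragma acc parallel loop reduction(max:err) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; i++) {
        double d = a[i] - b[i];
        if (d < 0.0) d = -d;
        if (d > err) err = d;
    }
    return err;
}
```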

64

Directing how to parallelize



error = 0.0;



for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}

65

Directing how to parallelize



error = 0.0;



for (int j = 1; j < N-1; j++) {


for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}

$ pgcc -Minfo=acc -acc jacobi.c

jacobi:










66

Directing how to parallelize (loop directive)

▪ Gang

▪ Worker

▪ Vector … SIMD width

▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile



error = 0.0;


#pragma acc loop gang vector(1) reduction(max:error)

for (int j = 1; j < N-1; j++) {

#pragma acc loop gang vector(128)

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}

67

Execution configuration (gang, vector)


for (j = 0; j < 16; j++) {


for (i = 0; i < 16; i++) {

...

4 x 16

i

4 x 16

4 x 16

4 x 16

j


for (j = 1; j < 16; j++) {


for (i = 0; i < 16; i++) {

...

i

j

8 x 8 8 x 8

8 x 8 8 x 8

68

Fusing loops (collapse)

▪ Gang

▪ Worker


▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...



error = 0.0;


#pragma acc loop reduction(max:error) \

collapse(2) gang vector(128)

for (int j = 1; j < N-1; j++) {

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}

69

Fusing loops (collapse)

▪ Gang

▪ Worker


▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...



error = 0.0;


#pragma acc loop reduction(max:error) gang vector(128)

for (int ji = 0; ji < (N-2)*(M-2); ji++) {

j = (ji / (M-2)) + 1;

i = (ji % (M-2)) + 1;

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

...

}
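The index arithmetic in the hand-collapsed loop can be checked on the host. The following sketch (helper names are hypothetical) recovers (j, i) from the flattened index ji and verifies that the flattened loop visits the interior points in the same order as the original two-level nest.

```c
/* Recover (j, i) of the original nest
 *     for (j = 1; j < n-1; j++) for (i = 1; i < m-1; i++)
 * from the flattened index ji in [0, (n-2)*(m-2)). */
void unflatten(int ji, int m, int *j, int *i)
{
    *j = ji / (m - 2) + 1;
    *i = ji % (m - 2) + 1;
}

/* Returns 1 if the flattened loop visits the interior points in exactly
 * the same order as the nested loop, 0 otherwise. */
int collapse_matches_nest(int n, int m)
{
    int ji = 0;
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            int fj, fi;
            unflatten(ji, m, &fj, &fi);
            if (fj != j || fi != i) return 0;
            ji++;
        }
    }
    return ji == (n - 2) * (m - 2);
}
```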

70

Safe to run in parallel (independent)

▪ Gang

▪ Worker


▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...



error = 0.0;


#pragma acc loop reduction(max:error) independent

for (int jj = 1; jj < NN-1; jj++) {

int j = list_j[jj];

for (int i = 1; i < M-1; i++) {

Anew[j][i] = (A[j][i+1] + A[j][i-1] +

A[j-1][i] + A[j+1][i]) * 0.25;


}

}

...

}

71

Run sequentially (seq)

▪ Gang

▪ Worker


▪ Collapse

▪ Independent

▪ Seq

▪ Cache

▪ Tile

▪ ...


#pragma acc loop seq

for (int k = 3; k < NK-3; k++) {

#pragma acc loop

for (int j = 0; j < NJ; j++) {

#pragma acc loop

for (int i = 0; i < NI; i++) {

Anew[k][j][i] = func(

A[k-1][j][i], A[k-2][j][i], A[k-3][j][i],

A[k+1][j][i], A[k+2][j][i], A[k+3][j][i], ...

);

}

}

}

72


OpenACC programming

▪ Overview

▪ Porting a program to OpenACC

▪ OpenACC case studies



73

OPENACC ACCELERATES COMPUTATIONAL SCIENCE

▪ LSDalton (Quantum Chemistry, Aarhus University): 12X speedup, 1 week

▪ PowerGrid (Medical Imaging, University of Illinois): 40 days to 2 hours

▪ INCOMP3D (CFD, NC State University): 4X speedup

▪ NekCEM (Comp Electromagnetics, Argonne National Lab): 2.5X speedup, 60% less energy

▪ CloverLeaf (Comp Hydrodynamics, AWE): 4X speedup, single CPU/GPU code

▪ MAESTRO, CASTRO (Astrophysics, Stony Brook University): 4.4X speedup, 4 weeks effort

▪ FINE/Turbo (CFD, NUMECA International): 10X faster routines, 2X faster app

74

LSDALTON

Large-scale application for calculating high-accuracy molecular energies

“OpenACC makes GPU computing approachable for domain scientists. Initial OpenACC implementation required only minor effort, and more importantly, no modifications of our existing CPU implementation.”

Janus Juul Eriksen, PhD Fellow, qLEAP Center for Theoretical Chemistry, Aarhus University

Minimal Effort

▪ Lines of Code Modified: <100 Lines

▪ # of Weeks Required: 1 Week

▪ # of Codes to Maintain: 1 Source

Big Performance (Speedup vs CPU)

▪ ALANINE-1, 13 ATOMS: 7.9x

▪ ALANINE-2, 23 ATOMS: 8.9x

▪ ALANINE-3, 33 ATOMS: 11.7x

LS-DALTON CCSD(T) Module. Benchmarked on Titan Supercomputer (AMD CPU vs Tesla K20X)

https://developer.nvidia.com/openacc/success-stories


75

NUMECA FINE/TURBO

Commercial CFD Application

“OpenACC enabled us to target routines for GPU acceleration without rewriting code, allowing us to maintain portability on a code that is 20-years old”

David Gutzwiller, Head of HPC, NUMECA International

CHALLENGE

• Accelerate 20 year old highly optimized code on GPUs

SOLUTION

• Accelerated computationally intensive routines with OpenACC

RESULTS

• Achieved 10x or higher speed-up on key routines

• Full app speedup of up to 2x on the Oak Ridge Titan supercomputer

• Total time spent on optimizing various routines with OpenACC was just five person-months



76

POWERGRID

Advanced MRI Reconstruction Model

“Now that we’ve seen how easy it is to program the GPU using OpenACC and the PGI compiler, we’re looking forward to translating more of our projects”

Brad Sutton, Associate Professor of Bioengineering and Technical Director of the Biomedical Imaging Center, University of Illinois at Urbana-Champaign

CHALLENGE

• Produce detailed and accurate brain images by applying computationally intensive algorithms to MRI data

• Reduce reconstruction time to make diagnostic use possible

SOLUTION

• Accelerated MRI reconstruction application with OpenACC using NVIDIA GPUs

RESULTS

• Reduced reconstruction time for a single high-resolution MRI scan from 40 days to a couple of hours

• Scaled on Blue Waters at NCSA to reconstruct 3000 images in under 24 hours



77

INCOMP3D

3D Fully Implicit CFD Solver

“OpenACC is a highly effective tool for programming fully implicit CFD solvers on GPU to achieve true 4X speedup”

Lixiang Luo, Researcher, Aerospace Engineering Computational Fluid Dynamics Laboratory, North Carolina State University (NCSU)

* CPU Speedup on 6 cores of Xeon E5645, with additional cores performance reduces due to partitioning and MPI overheads

CHALLENGE

• Accelerate a complex implicit CFD solver on GPU

SOLUTION

• Used OpenACC to run solver on GPU with minimal code changes.

RESULTS

• Achieved up to 4X speedups on Tesla GPU over parallel MPI implementation on x86 multicore processor

• Structured approach to parallelism provided by OpenACC allowed better algorithm design without worrying about GPU architecture



78

NEKCEM

Computational Electromagnetics Application

“The most significant result from our performance studies is faster computation with less energy consumption compared with our CPU-only runs.”

Dr. Misun Min, Computation Scientist, Argonne National Laboratory

CHALLENGE

• Enable NekCEM to deliver strong scaling on next-generation architectures while maintaining portability

SOLUTION

• Use OpenACC to port the entire program to GPUs

RESULTS

• 2.5x speedup over a highly tuned CPU-only version

• GPU used only 39 percent of the energy needed by 16 CPUs to do the same computation in the same amount of time



79

MAESTRO & CASTRO

Astrophysics 3D Simulation

“On the reactions side, accelerated calculations allow us to model larger networks of nuclear reactions for similar computational costs as the simple networks we model now”

Adam Jacobs, PhD candidate in the Department of Physics and Astronomy at Stony Brook University

CHALLENGE

• Pursuing strong scaling and portability to run code on GPU-powered supercomputers

SOLUTION

• OpenACC compiler on OLCF’s Titan supercomputer

RESULTS

• 4.4x faster reactions than on a multi-core computer with 16 cores

• Accelerated calculations allow modeling larger networks of nuclear reactions for similar computational costs as simple networks

2 weeks to learn OpenACC

2 weeks to modify code



80

CLOVERLEAF

Performance Portability for a Hydrodynamics Application

“We were extremely impressed that we can run OpenACC on a CPU with no code change and get equivalent performance to our OpenMP/MPI implementation.”

Wayne Gaudin and Oliver Perks, Atomic Weapons Establishment, UK

[Chart: speedup vs 1 CPU core. Benchmarked Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz, Accelerator: Tesla K80]

CHALLENGE

• Application code that runs across architectures without compromising on performance

SOLUTION

• Use OpenACC to port the CloverLeaf mini app to GPUs and then recompile and run the same source code on multi-core CPUs

RESULTS

• Same performance as optimized OpenMP version on x86 CPU

• 4x faster performance using the same code on a GPU



81

Earthquake Simulation (RIKEN; Earthquake Research Institute, University of Tokyo)

▪ WACCPD 2016 (workshop co-located with SC16): Best Paper Award

Source: http://waccpd.org/wp-content/uploads/2016/04/SC16_WACCPD_fujita.pdf

82

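NICAM

▪ Weather and climate model by RIKEN AICS / University of Tokyo

▪ Huge code base (several hundred thousand lines)

▪ No hot spots (the Pareto principle does not apply)

▪ Two kinds of computation with different characteristics

▪ Dynamics: memory-bandwidth bound

▪ Physics: compute bound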

83

NICAM: dynamics (NICAM-DC)

▪ GPU port with OpenACC

▪ All major subroutines (more than 50) run on the GPU

▪ MPI supported

▪ 2 weeks

▪ Good scalability

▪ Tsubame 2.5, up to 2560 GPUs

▪ Scaling factor: 0.8

[Chart: weak scaling, performance (GFLOPS) vs number of CPUs or GPUs, log-log axes. Series: Tsubame 2.5 (GPU:K20X), K computer, Tsubame 2.5 (CPU:WSM)]

Courtesy of Dr. Yashiro from RIKEN AICS

84

NICAM: dynamics (NICAM-DC)

[Chart: measured performance (GFLOPS) vs aggregate peak memory bandwidth (GB/s), log-log axes. Series: Tsubame 2.5 (GPU:K20X), K computer, Tsubame 2.5 (CPU:WSM)]

Courtesy of Dr. Yashiro from RIKEN AICS

85

NICAM: physics (SCALE-LES)

▪ Atmospheric radiation transfer

▪ The heaviest computation in the physics part

▪ Already ported to GPU with OpenACC

Speedup vs. 1 CPU core (higher is better)

    Xeon E5-2690v2 (3.0GHz, 10-core)    1 core     1.00
                                        2 cores    1.99
                                        4 cores    3.88
                                        10 cores   8.51
    Tesla K40                           1 GPU      37.8
                                        2 GPUs     76.0
                                        4 GPUs     151

(*) Includes PCI data transfer time; grid size: 1256x32x32

86

SEISM3D

▪ Earthquake simulation by Prof. Furumura (Earthquake Research Institute, University of Tokyo)

▪ GPU port of the major subroutines is complete

▪ Memory-bandwidth bound, 3D model (2D decomposition), communication between neighboring processes

SEISM3D (480x480x1024, 1K steps), time in seconds (lower is better)

    K: 8x SPARC64 VIIIfx       605
    CPU: 8x Xeon E5-2690v2     459
    GPU: 8x Tesla K40          134

3.4x speedup (whole application)

[Chart: breakdown of GPU execution time (GPU: 8x Tesla K40), in seconds. Categories: diff3d_*, update_stress, update_stress_pml, update_vel, update_vel_pml, (other subroutines), [CUDA memcpy HtoD], [CUDA memcpy DtoH], Others (CPU, MPI and so on)]

87

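FFR/BCM (tentative name)

▪ Next-generation CFD code by Assoc. Prof. Tsubokura (RIKEN AICS / Hokkaido University)

▪ MUSCL_bench:

▪ Flux computation based on the MUSCL scheme (a very complex computation)

▪ The main part of the CFD computation (60-70%)

▪ GPU port with OpenACC: complete

Speedup vs. 1 CPU core (higher is better)

    Xeon E5-2690v2 (3.0GHz, 10-core)    1 core     1.00
                                        2 cores    1.93
                                        5 cores    4.55
                                        10 cores   8.30
    Tesla K40                           1 GPU      33.21

(*) Includes PCI data transfer time; size: 80x32x32x32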

88

CM-RCM IC-CG

▪ IC-CG benchmark code by Prof. Nakajima (University of Tokyo)

▪ Uses the CM-RCM method (Cyclic Multi-coloring Reverse Cuthill-McKee)

▪ All subroutines in the main loop are ported to GPU with OpenACC

CM-RCM ICCG (100x100x100), time in seconds (lower is better)

    Opteron 6386SE (2.8GHz, 16 cores), OpenMP     3.36
    SPARC64 IXfx (1.85GHz, 16 cores), OpenMP      1.58
    Xeon E5-2680v2 (2.8GHz, 10 cores), OpenMP     1.53
    Xeon Phi 5110P, OpenMP                        1.39
    Tesla K40, OpenACC                            0.655

Courtesy of Dr. Ohshima from U-Tokyo

89

CCS-QCD

▪ QCD code by Assoc. Prof. Ishikawa (Hiroshima University)

▪ The entire BiCGStab computation is ported to GPU with OpenACC

▪ Data layout changed: AoS → SoA

CCS-QCD: BiCGStab Total FLOPS, in GFLOPS (higher is better)

    Problem size    Xeon E5-2690v2 (3.0GHz, 10 cores, OpenMP)    Tesla K40 (OpenACC)
    8x8x8x32        32.3                                         42.5
    8x8x8x64        24.3                                         53.3
    8x8x16x64       22.1                                         57.9
    8x16x16x64      20.9                                         60.8
    16x16x16x64     19.7                                         63.4

90

CUDA

91

CUDA programming

▪ Programming model

▪ Architecture

▪ Performance tips

92

GPU computing

▪ A throughput-oriented processor

▪ Separate memory spaces

[Diagram: CPU with CPU memory and GPU with GPU memory, connected by PCI]

93

GPU program


float *x, float *y)

{


if (i < n)

y[i] += a*x[i];

}

...




saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

cudaDeviceSynchronize();


...


float *x, float *y)

{

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...


...

CPU GPU

94

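Basic flow of GPU execution

▪ The GPU operates under the control of the CPU

▪ Input data: transferred from the CPU to the GPU (H2D)

▪ GPU kernel: launched from the CPU

▪ Output data: transferred from the GPU to the CPU (D2H)

[Sequence diagram, CPU and GPU: input data transfer, GPU kernel launch, computation on the GPU, synchronization, output data transfer]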

95

GPU program


float *x, float *y)

{


if (i < n)

y[i] += a*x[i];

}

...




saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);



...


float *x, float *y)

{

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...


...

CPU GPU

Input data transfer

Kernel launch

Synchronization

Output data transfer

96

GPU program (Unified Memory)


float *x, float *y)

{


if (i < n)

y[i] += a*x[i];

}

...

saxpy<<< N/128, 128 >>>(N, 3.0, x, y);


...


float *x, float *y)

{

for (int i = 0; i < n; ++i)

y[i] += a*x[i];

}

...


...

CPU GPU

Kernel launch

Synchronization

97

GPU kernel

▪ A GPU kernel describes the work of a single GPU thread

▪ Basic pattern: one GPU thread handles one array element

CPU

    void saxpy(int n, float a,
               float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a*x[i];
    }
    ...
    saxpy(N, 3.0, x, y);
    ...

GPU

    __global__ void saxpy(int n, float a,
                          float *x, float *y)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;   // global thread ID
        if (i < n)
            y[i] += a*x[i];
    }
    ...
    saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);
    ...

98

Execution configuration (number of blocks and block size)


float *x, float *y)

{


if (i < n)

y[i] += a*x[i];

}

...

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

...

saxpy<<< number of blocks, block size >>>

Number of blocks x block size = number of array elements

i = thread ID + block size * block ID

99

Thread hierarchy (thread, block, grid)

y[i] = a*x[i] + y[i]

                       Block 0     Block 1      Block 2      Block 3 ...
    x[], y[] index     0 .. 127    128 .. 255   256 .. 383   384 ..
    Thread (global)    0 .. 127    128 .. 255   256 .. 383   384 ..
    Thread (local)     0 .. 127    0 .. 127     0 .. 127     0 ..

All blocks together form the grid.

▪ The block size (threads per block) can be set per kernel

▪ Recommended: 128 or 256 threads
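To make the mapping from (block, local thread) to global thread ID concrete, here is a CPU-only emulation in plain C. It is a sketch for illustration, not how CUDA executes threads: the two loops stand in for the grid and the block, and the function names are made up.

```c
/* Body of the SAXPY kernel for one emulated GPU thread. */
static void saxpy_thread(int block_idx, int block_dim, int thread_idx,
                         int n, float a, const float *x, float *y)
{
    int i = thread_idx + block_dim * block_idx;   /* global thread ID */
    if (i < n)
        y[i] += a * x[i];
}

/* Emulates saxpy<<< grid_dim, block_dim >>>(n, a, x, y) sequentially. */
void saxpy_emulated(int grid_dim, int block_dim,
                    int n, float a, const float *x, float *y)
{
    for (int b = 0; b < grid_dim; b++)          /* blocks in the grid */
        for (int t = 0; t < block_dim; t++)     /* threads in a block */
            saxpy_thread(b, block_dim, t, n, a, x, y);
}
```

Calling saxpy_emulated(3, 128, 300, ...) covers 384 emulated threads for 300 elements; the 84 surplus threads fail the i < n test and do nothing.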

100

Execution configuration (number of blocks and block size)


float *x, float *y)

{


if (i < n)

y[i] += a*x[i];

}

...

saxpy<<< N/128, 128 >>>(N, 3.0, d_x, d_y);

...

saxpy<<< number of blocks, block size >>>

Number of blocks x block size = number of array elements

Other choices: <<< N/64, 64 >>>, <<< N/256, 256 >>>
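The relation above holds exactly only when N is a multiple of the block size. A common way to cover the general case, sketched here in host-side C with an illustrative helper name, is to round the block count up and rely on the if (i < n) guard in the kernel to idle the surplus threads.

```c
/* Smallest number of blocks such that blocks * block_size >= n. */
int num_blocks(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}
```

With n = 1000 and a block size of 128, plain integer division gives 7 blocks (896 threads, leaving 104 elements untouched), whereas num_blocks gives 8.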

101

2D array GPU kernel example

▪ The block size (block shape) can be expressed in 1D to 3D

    __global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;   // global thread ID (x)
        int j = threadIdx.y + blockDim.y * blockIdx.y;   // global thread ID (y)
        if ( i < N && j < N )
            C[i][j] = A[i][j] + B[i][j];
    }
    ...
    dim3 sizeBlock( 16, 16 );                          // block size (x,y); alternatives: (32, 8), (64, 4)
    dim3 numBlocks( N/sizeBlock.x, N/sizeBlock.y );    // number of blocks (x,y)
    MatAdd<<< numBlocks, sizeBlock >>>(A, B, C);
    ...

102

Block mapping and thread mapping

dim3 sizeBlock(16,16)

    Block IDs (blockIdx):
    (0,0) (1,0) (2,0)
    (0,1) (1,1) (2,1)
    (0,2) (1,2) (2,2)

    Thread IDs (threadIdx) within a block: (0,0) to (15,15)

dim3 sizeBlock(32,8)

    Block IDs (blockIdx):
    (0,0) (1,0)
    (0,1) (1,1)
    (0,2) (1,2)
    (0,3) (1,3)
    (0,4) (1,4)

    Thread IDs (threadIdx) within a block: (0,0) to (31,7)

103




CUDA programming

▪ Programming model

▪ Architecture

▪ Performance tips

104

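GPU architecture overview (Pascal GP100)

▪ PCI I/F: host connection interface

▪ Giga Thread Engine: the scheduler that distributes work to the SMs

▪ DRAM I/F (HBM2): memory accessible from all SMs and the PCI I/F (device memory, frame buffer)

▪ L2 cache (4MB): read/write cache accessible from all SMs

▪ SM (Streaming Multiprocessor): the "parallel" processor; GP100 has up to 60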

105

SM (Streaming Multiprocessor), Pascal GP100

▪ CUDA cores: GPU threads run on these; GP100 has 64 per SM

▪ Other units: LD/ST, SFU, etc.

▪ Registers (32-bit): 64K

▪ Shared memory: 64KB

▪ Tex/L1 cache

106

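Flow of GPU kernel execution

▪ The CPU submits a grid to the GPU

▪ More precisely, the grid is submitted to the Giga Thread Engine

[Diagram: a grid contains blocks; a block contains threads]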

107


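▪ The Giga Thread Engine (GTE) dispatches blocks to the SMs

▪ The GTE is the block scheduler

▪ It splits the grid into blocks and assigns each block to an available SM

[Diagram: a grid contains blocks; a block contains threads]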

108

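Assigning blocks to SMs

▪ Blocks execute independently of one another

▪ There is no synchronization between blocks and no guaranteed execution order

▪ A block never spans multiple SMs

▪ An SM may have several blocks assigned to it

[Diagram: a grid split into blocks]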

109


▪ The scheduler inside the SM dispatches threads to the CUDA cores

[Diagram: a grid contains blocks; a block contains threads]

110


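▪ The scheduler inside the SM dispatches warps to the CUDA cores

▪ Warp: a group of 32 threads

▪ It splits a block into warps and assigns ready warps to free CUDA cores

[Diagram: a grid contains blocks; a block contains warps]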

111

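Assigning warps to CUDA cores

▪ The 32 threads of a warp execute the same instruction in lockstep

▪ Warps execute independently of one another

▪ Warps in the same block can synchronize explicitly (__syncthreads())

SIMT (Single Instruction Multiple Threads)

[Diagram: a grid contains blocks; a block contains warps]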

112

A programming model that is unaffected by changes in GPU architecture

Kepler, CC3.5: 192 cores/SM

Maxwell, CC5.0: 128 cores/SM

Pascal, CC6.0: 64 cores/SM

113




CUDA programming

▪ Programming model

▪ Architecture

▪ Performance tips

114

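Resource utilization (occupancy)

Raising SM utilization ≈ bringing the number of threads that can be resident on an SM close to the upper limit

▪ Register usage (per thread)

▪ Reduce it as far as possible

▪ A DP (64-bit) value consumes 2 registers

▪ Registers are allocated in units of 8

▪ Register usage vs. number of threads that can be resident

▪ 32 registers: 2048 (100%), 64 registers: 1024 (50%)

▪ 128 registers: 512 (25%), 256 registers: 256 (12.5%)

Resources per SM (GP100)

    CUDA cores:            64
    Max threads:           2048
    Max blocks:            32
    Shared memory:         64KB
    Registers (32-bit):    64K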
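The register figures above follow from dividing the SM's register file by the per-thread usage. A simplified sketch in C (it ignores the allocation granularity of 8 registers and every limit other than registers; names are illustrative):

```c
/* Per-SM limits for GP100 as listed on the slide. */
enum { REGS_PER_SM = 64 * 1024, MAX_THREADS_PER_SM = 2048 };

/* Upper bound on resident threads per SM imposed by register usage alone. */
int threads_by_registers(int regs_per_thread)
{
    int limit = REGS_PER_SM / regs_per_thread;
    return limit < MAX_THREADS_PER_SM ? limit : MAX_THREADS_PER_SM;
}
```

For instance, 64 registers per thread gives 65536 / 64 = 1024 threads, which is 50% of the 2048-thread maximum.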

115


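Raising SM utilization ≈ bringing the number of threads that can be resident on an SM close to the upper limit

▪ Threads per block

▪ Use 64 or more

▪ Below 64, the maximum number of blocks becomes the limit

▪ Shared memory usage (per block)

▪ Reduce it as far as possible

▪ Shared memory usage vs. number of blocks that can be resident

▪ 32KB: 2 blocks, 16KB: 4 blocks, 8KB: 8 blocks

Resources per SM (GP100)

    CUDA cores:            64
    Max threads:           2048
    Max blocks:            32
    Shared memory:         64KB
    Registers (32-bit):    64K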
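Both rules of thumb can be reproduced with a few integer divisions. The sketch below is a simplified model using only the GP100 per-SM limits listed above; it ignores register pressure, and the function names are assumptions made for this example.

```c
/* Per-SM limits for GP100 as listed on the slide. */
enum { MAX_THREADS = 2048, MAX_BLOCKS = 32, SMEM_BYTES = 64 * 1024 };

static int min_int(int a, int b) { return a < b ? a : b; }

/* Resident blocks per SM, limited by the block-count cap, the thread-count
 * cap and, if the kernel uses shared memory, the shared memory capacity. */
int resident_blocks(int threads_per_block, int smem_per_block)
{
    int blocks = min_int(MAX_BLOCKS, MAX_THREADS / threads_per_block);
    if (smem_per_block > 0)
        blocks = min_int(blocks, SMEM_BYTES / smem_per_block);
    return blocks;
}

/* Resident threads per SM under the same limits. */
int resident_threads(int threads_per_block, int smem_per_block)
{
    return resident_blocks(threads_per_block, smem_per_block)
           * threads_per_block;
}
```

With 32 threads per block the 32-block cap allows only 1024 resident threads, while 64 threads per block reaches the full 2048; a block using 32KB of shared memory limits the SM to 2 resident blocks.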

116


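Filling idle time

▪ CUDA streams (≈ queues)

▪ Operations submitted to the same CUDA stream run in submission order

▪ Operations submitted to different CUDA streams run asynchronously (overlapped execution)

(*) Operations: GPU kernels, data transfers

[Example of the effect of CUDA streams: GPU kernels and data transfers overlap and run at the same time]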

117

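Access device memory in bulk

▪ Coalesced access

▪ The loads/stores of the 32 threads of a warp are combined into memory transactions

▪ Transaction size: 32B, 64B or 128B

▪ The fewer transactions, the better

[Diagram: three mappings of the threads of a warp onto an array, relative to 128B boundaries]

    Best: 128B x1
    Good: 64B x2
    Bad:  32B x32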
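The effect of the access pattern on transaction count can be estimated on the host. The sketch below is a deliberately simplified model, not a description of how any particular GPU forms transactions: it counts how many distinct aligned segments a warp touches when each thread loads one 4-byte element. The function name and parameters are hypothetical.

```c
#define WARP_SIZE 32

/* Number of distinct aligned segments of seg_bytes touched when lane k of a
 * warp loads the 4-byte element at index base + k * stride.
 * Assumes base >= 0 and stride >= 0. */
int segments_touched(int base, int stride, int seg_bytes)
{
    int seen[WARP_SIZE];
    int count = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        int seg = (base + lane * stride) * 4 / seg_bytes;
        int is_new = 1;
        for (int k = 0; k < count; k++)
            if (seen[k] == seg) is_new = 0;
        if (is_new) seen[count++] = seg;
    }
    return count;
}
```

In this model, consecutive and aligned accesses touch a single 128B segment, the same accesses shifted by 16 elements straddle two segments, and a stride of 8 elements puts every thread in its own 32B segment.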

118

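Reduce branching

▪ Execution slows down when threads in a warp take different paths

▪ The threads of a warp share instructions (SIMT)

▪ The warp executes the instructions of every path taken by any of its threads

▪ While one path is executing, threads not on that path are inactive

▪ Reduce path divergence

▪ As far as possible, make the threads of a warp take the same path

[Diagram: branch on A[id] > 0; true: process X, false: process Y; threads 0 to 3]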
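As a toy model of the cost of divergence (an illustration in plain C, not a description of actual hardware scheduling), the function below counts how many of the two paths of the branch on A[id] > 0 a warp has to execute. The name paths_executed is made up for this sketch.

```c
/* Number of branch paths a 32-thread warp must execute for
 *     if (A[id] > 0) X; else Y;
 * where the warp covers ids warp_start .. warp_start + 31.
 * Returns 1 if all threads agree, 2 if the warp diverges. */
int paths_executed(const int *A, int warp_start)
{
    int any_true = 0, any_false = 0;
    for (int lane = 0; lane < 32; lane++) {
        if (A[warp_start + lane] > 0) any_true = 1;
        else any_false = 1;
    }
    return any_true + any_false;
}
```

If the data is arranged so that each warp sees values of a single sign, every warp executes one path; if signs alternate from element to element, every warp executes both.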

119

Summary

▪ GPU Computing

▪ Volta GPUs

▪ OpenACC

▪ CUDA
