The State of OpenACC (OPENACCの現状) - GTC On-Demand Featured Talks |...
![Page 1: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/1.jpg)
The State of OpenACC
Akira Naruse
NVIDIA Developer Technologies
Ways to Accelerate Applications on GPUs
Application
- Library: switch to GPU-enabled libraries (easy to get started)
- CUDA: write the key routines in CUDA (high flexibility)
- OpenACC: insert directives into existing code (easy acceleration)
OPENACC
Program myscience
... serial code ...
!$acc kernels
do k = 1,n1
do i = 1,n2
... parallel code ...
enddo
enddo
!$acc end kernels
... serial code …
End Program myscience
CPU
GPU
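Existing C/Fortran code
Simple: add compiler hints to existing code
Powerful: with modest effort, the compiler parallelizes the code automatically
Open: multiple compiler vendors support multiple accelerators
NVIDIA, AMD, Intel (planned)
Hints added (the !$acc lines)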
Execution Model
Application code
CPU: CPU code is generated for the serial parts
GPU: GPU code is generated for the parallel parts (the compute-intensive parts)
$acc parallel
$acc end parallel
SAXPY (Y=A*X+Y, C/C++)
OpenMP
void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
OpenACC
void saxpy(int n,
           float a,
           float *x,
           float *restrict y)
{
    /* "loop" added: a bare parallel construct would run the loop redundantly in every gang */
    #pragma acc parallel loop copy(y[:n]) copyin(x[:n])
    for (int i = 0; i < n; ++i)
        y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
SAXPY (Y=A*X+Y, FORTRAN)
OpenMP
subroutine saxpy(n, a, X, Y)
  real :: a, X(:), Y(:)
  integer :: n, i
  !$omp parallel do
  do i=1,n
    Y(i) = a*X(i)+Y(i)
  enddo
  !$omp end parallel do
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...
OpenACC
subroutine saxpy(n, a, X, Y)
  real :: a, X(:), Y(:)
  integer :: n, i
  !$acc parallel loop copy(Y(:)) copyin(X(:))
  do i=1,n
    Y(i) = a*X(i)+Y(i)
  enddo
  !$acc end parallel loop
end subroutine saxpy
...
call saxpy(N, 3.0, x, y)
...
Combining with OpenMP
OpenMP / OpenACC
void saxpy(int n, float a,
float *x,
float *restrict y)
{
#pragma acc kernels copy(y[:n]) copyin(x[:n])
#pragma omp parallel for
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
Easy to Compile
OpenMP / OpenACC
void saxpy(int n, float a,
float *x,
float *restrict y)
{
#pragma acc kernels copy(y[:n]) copyin(x[:n])
#pragma omp parallel for
for (int i = 0; i < n; ++i)
y[i] += a*x[i];
}
...
saxpy(N, 3.0, x, y);
...
$ pgcc -Minfo -acc saxpy.c
saxpy:
16, Generating present_or_copy(y[:n])
Generating present_or_copyin(x[:n])
Generating Tesla code
19, Loop is parallelizable
Accelerator kernel generated
19, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Easy to Run
$ nvprof ./a.out
==10302== NVPROF is profiling process 10302, command: ./a.out
==10302== Profiling application: ./a.out
==10302== Profiling result:
Time(%) Time Calls Avg Min Max Name
62.95% 3.0358ms 2 1.5179ms 1.5172ms 1.5186ms [CUDA memcpy HtoD]
31.48% 1.5181ms 1 1.5181ms 1.5181ms 1.5181ms [CUDA memcpy DtoH]
5.56% 268.31us 1 268.31us 268.31us 268.31us saxpy_19_gpu
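As a rough reading of the profile above, the sketch below computes how much of the measured GPU time is spent in host/device copies. It uses only the three times printed by nvprof; the function name `transfer_fraction` is an illustrative assumption, not part of any tool.

```c
#include <assert.h>

/* Fraction of the measured GPU time spent in host<->device copies.
   All three arguments are in the same unit (here: milliseconds). */
double transfer_fraction(double htod_ms, double dtoh_ms, double kernel_ms)
{
    double total = htod_ms + dtoh_ms + kernel_ms;
    assert(total > 0.0);
    return (htod_ms + dtoh_ms) / total;
}
```

Called as `transfer_fraction(3.0358, 1.5181, 0.26831)`, this gives roughly 0.944: for this small SAXPY run, about 94% of the time goes to copies and under 6% to the kernel itself.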
Easy Speedups
| Domain | Application | Organization | Result |
|---|---|---|---|
| Automotive | Real-Time Object Detection | Global Manufacturer of Navigation Systems | 5x in 40 hours |
| Financial | Valuation of Stock Portfolios using Monte Carlo | Global Technology Consulting Company | 2x in 4 hours |
| Life Science | Interaction of Solvents and Biomolecules | University of Texas at San Antonio | 5x in 8 hours |
Compilers and Tools
From October 2013, from December 2013, from January 2014, 2015 (planned)
Compilers
Debugging tools
OpenACC 2.0 support
Climate, Weather, and Ocean Models
Climate (C), Weather (W), Ocean (O)
NCAR-CISL, ORNL / CESM • CAM-SE (HOMME)
• LANL / POP
NASA / GEOS-5 NOAA-GFDL / CFSv2 • NOAA-GFDL / MOM6
UKMO / HadGEM3 • UM
• NEMO
MPI-M / MPI-ESM • ECHAM6
• MPIOM
RIKEN, UniTokyo / NICAM IPSL / DYNAMICO
UKMO / UM ECMWF / IFS DWD / GME NOAA-NCEP / GFS EC, CMC / GEM USNRL / NAVGEM NOAA-ESRL / FIM DWD, MPI-M / ICON NOAA-ESRL / NIM NCAR / MPAS-A
LANL / POP NOAA-GFDL / MOM6 CNRS, STFC/ NEMO USNRL / HYCOM MIT / MITgcm LANL / MPAS-O MPI-M / ICON-OCE
NCAR-M3 / WRF USNRL / COAMPS DWD, MCH / COSMO MFR / AROME MFR, ICHEC / HARMONIE • HIRLAM + ALADIN
JAMSTEC-JMA / ASUCA CAS-CMA / GRAPES UniMiami / OLAM
NCAR-M3 / WRF DWD, MCH / COSMO UniMiami / OLAM
Rutgers-UCLA / ROMS UNC-ND / ADCIRC
Next-generation models marked on the chart: MPAS-O, MPAS-A or NIM, ICON-ATM, NIM, NIM?, GungHo, PantaRhei, ICON-OCE, ICON
GPU Development (8): CAM-SE, GEOS-5, NEMO, WRF, COSMO, NIM, FIM, GRAPES
GPU Evaluation (15): POP, ICON, NICAM, OLAM, GungHo, PantaRhei, ASUCA, HARMONIE, COAMPS, HYCOM, MITgcm, ROMS, ADCIRC, DYNAMICO, MOM6
GPU Not Started (7): MPAS-A, MPAS-O, GFS, GEM, NAVGEM, AROME, ICON-OCE
Migration to OpenACC
| Model | Focus | GPU Approach | Collaboration |
|---|---|---|---|
| NCAR / WRF | NWP/Climate-R | (1) OpenACC, (2) CUDA | (1) NCAR-MMM, (2) SSEC UW-M |
| DWD / COSMO | NWP/Climate-R | CUDA+OpenACC | CSCS, MeteoSwiss (MCH) |
| ORNL / CAM-SE | Climate-G | CUDA-F OpenACC | ORNL, Cray |
| NCAR / CAM-SE | Climate-G | CUDA, CUDA-F, OpenACC | NCAR-CISL |
| NOAA / NIM&FIM | NWP/Climate-G | F2C-ACC, OpenACC | NOAA-ESRL, PGI |
| NASA / GEOS-5 | Climate-G | CUDA-F OpenACC | NASA, PGI |
| CNRS / NEMO | Ocean GCM | OpenACC | STFC |
| UKMO / GungHo | NWP/Climate-G | OpenACC | STFC, UKMO in future? |
| USNRL / HYCOM | Ocean GCM | OpenACC | US Naval Research Lab |
| RIKEN / NICAM | Climate-G | OpenACC | RIKEN, UniTokyo |
| UNC / ADCIRC | Storm Surge | OpenACC (AmgX?) | LSU LONI |
| NOAA / MOM6 | Ocean GCM | OpenACC | NOAA-GFDL |
| NASA / FV-Core | Atmospheric GCM | OpenACC | NASA, NOAA-GFDL |

Other evaluations: US: COAMPS, MPAS, ROMS, OLAM; Europe: ICON, IFS, HARMONIE, DYNAMICO; Asia-Pacific: ASUCA (JP), GRAPES (CN)
How Far Can You Go with OpenACC?
Example: JACOBI ITERATION
while ( error > tol ) {
error = 0.0;
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
A[j][i] = Anew[j][i];
}
}
}
Figure: five-point stencil. A(i,j) is updated from A(i+1,j), A(i-1,j), A(i,j-1), and A(i,j+1).
Parallel Regions (KERNELS CONSTRUCT)
parallel and kernels
- Both mark a parallel region
parallel
- Starts parallel execution
kernels
- Multiple GPU kernels
while ( error > tol ) {
error = 0.0;
#pragma acc kernels
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
A[j][i] = Anew[j][i];
}
}
}
$ pgcc -Minfo=acc -acc jacobi.c
jacobi:
60, Loop carried scalar dependence for 'error' at line 64
...
Accelerator scalar kernel generated
61, Loop carried scalar dependence for 'error' at line 64
...
Accelerator scalar kernel generated
Reductions (REDUCTION CLAUSE)
The loop-carried dependence on error that the compiler reported is a max reduction, so it is declared with reduction(max:error).
Operator types
+ : sum
* : product
max : maximum
min : minimum
| : bitwise OR
& : bitwise AND
^ : bitwise XOR
|| : logical OR
&& : logical AND
while ( error > tol ) {
error = 0.0;
#pragma acc kernels
#pragma acc loop reduction(max:error)
for (int j = 1; j < N-1; j++) {
#pragma acc loop reduction(max:error)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
A[j][i] = Anew[j][i];
}
}
}
$ pgcc -Minfo=acc -acc jacobi.c
jacobi:
59, Generating present_or_copyout(Anew[1:4094][1:4094])
Generating present_or_copyin(A[:][:])
Generating Tesla code
61, Loop is parallelizable
63, Loop is parallelizable
Accelerator kernel generated
61, #pragma acc loop gang /* blockIdx.y */
63, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Max reduction generated for error
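To make the reduction clause concrete outside the Jacobi loop, here is a minimal sketch with a sum and a max reduction over a one-dimensional array. The function names are illustrative assumptions. Compiled without OpenACC support, the pragmas are ignored and the loops simply run serially with the same result.

```c
#include <assert.h>

/* Sum reduction: every iteration accumulates into a private copy of sum,
   and the private copies are combined with + when the loop finishes. */
double array_sum(const double *x, int n)
{
    assert(n > 0);
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(x[0:n])
    for (int i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Max reduction: same idea, but the private copies are combined with max. */
double array_max(const double *x, int n)
{
    assert(n > 0);
    double m = x[0];
    #pragma acc parallel loop reduction(max:m) copyin(x[0:n])
    for (int i = 0; i < n; i++)
        if (x[i] > m) m = x[i];
    return m;
}
```

Without the reduction clause, both loops carry a dependence on the accumulator from one iteration to the next, which is the situation the earlier compiler feedback describes for error.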
Data Transfers (DATA CLAUSE)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels \
pcopyout(Anew[1:N-2][1:M-2]) pcopyin(A[0:N][0:M])
#pragma acc loop reduction(max:error)
for (int j = 1; j < N-1; j++) {
#pragma acc loop reduction(max:error)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels \
pcopyout(A[1:N-2][1:M-2]) pcopyin(Anew[1:N-2][1:M-2])
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
A[j][i] = Anew[j][i];
}
}
}
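copyin (Host → GPU)
copyout (GPU → Host)
copy: copyin on entry and copyout on exit
create: allocate on the GPU, no transfer
present: the data is already on the GPU
pcopyin, pcopyout, pcopy, pcreate: present_or_ variants, which skip the allocation and transfer when the data is already on the GPU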
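The difference between the clauses is easiest to see on a small case that needs an input, an output, and a scratch array. The sketch below is an assumption-laden illustration (the function `scale_and_square` and its arguments are made up for this purpose): the input is only copied in, the output is only copied out, and the scratch array is created on the device with no transfer at all.

```c
#include <assert.h>
#include <stdlib.h>

/* out[i] = (scale * in[i])^2, using a scratch array between the two loops.
   copyin : in is read on the device and never copied back
   copyout: out is written on the device and copied back on exit
   create : tmp is allocated on the device only, with no transfers */
void scale_and_square(int n, float scale, const float *in, float *out)
{
    assert(n > 0);
    float *tmp = malloc((size_t)n * sizeof *tmp);
    assert(tmp != NULL);

    #pragma acc kernels copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        /* independent: tells the compiler the pointers do not alias */
        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            tmp[i] = scale * in[i];

        #pragma acc loop independent
        for (int i = 0; i < n; i++)
            out[i] = tmp[i] * tmp[i];
    }
    free(tmp);
}
```

Choosing copy for all three arrays would also be correct, but it would move tmp and out to the device and in and tmp back to the host for no benefit.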
Data Transfers (DATA CLAUSE)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels \
pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop reduction(max:error)
for (int j = 1; j < N-1; j++) {
#pragma acc loop reduction(max:error)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels \
pcopy(A[:][:]) pcopyin(Anew[:][:])
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
A[j][i] = Anew[j][i];
}
}
}
Data Transfer Is the Bottleneck (NVVP)
Figure: NVVP timeline of one cycle containing two GPU kernels. GPU utilization: low.
Excessive Data Transfers
while ( error > tol ) {
    error = 0.0;
    #pragma acc kernels \
        pcopy(Anew[:][:]) \
        pcopyin(A[:][:])
    {
        #pragma acc loop reduction(max:error)
        for (int j = 1; j < N-1; j++) {
            #pragma acc loop reduction(max:error)
            for (int i = 1; i < M-1; i++) {
                Anew[j][i] = (A[j][i+1] + A[j][i-1] +
                              A[j-1][i] + A[j+1][i]) * 0.25;
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }
    }
    #pragma acc kernels \
        pcopy(A[:][:]) \
        pcopyin(Anew[:][:])
    {
        for (int j = 1; j < N-1; j++) {
            for (int i = 1; i < M-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
    }
}
Host/GPU diagram: each kernels region performs a copyin on entry and a copyout on exit, so both transfers are repeated for both regions in every iteration of the while loop.
Data Regions (DATA CONSTRUCT)
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop reduction(max:error)
for (int j = 1; j < N-1; j++) {
#pragma acc loop reduction(max:error)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
#pragma acc kernels pcopy(A[:][:]) pcopyin(Anew[:][:])
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
A[j][i] = Anew[j][i];
}
}
}
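For readers who want to try the data region end to end, the following is a minimal self-contained sketch of the same idea. Several things in it are assumptions rather than the slide's code: the grid is stored row-major in flat arrays, the function name `jacobi_acc` and the `max_iter` guard are made up, and the absolute value and maximum are written with plain comparisons so the example needs no math library.

```c
#include <assert.h>

/* Jacobi iteration on an n x m grid stored row-major in flat arrays.
   Returns the number of sweeps performed. */
int jacobi_acc(int n, int m, double *A, double *Anew, double tol, int max_iter)
{
    assert(n >= 3 && m >= 3);
    int iter = 0;
    double error = tol + 1.0;   /* force at least one sweep */

    /* A is copied in once and out once; Anew exists only on the device. */
    #pragma acc data copy(A[0:n*m]) create(Anew[0:n*m])
    while (error > tol && iter < max_iter) {
        error = 0.0;

        #pragma acc parallel loop reduction(max:error) present(A[0:n*m], Anew[0:n*m])
        for (int j = 1; j < n - 1; j++) {
            #pragma acc loop reduction(max:error)
            for (int i = 1; i < m - 1; i++) {
                double v = 0.25 * (A[j*m + i + 1] + A[j*m + i - 1] +
                                   A[(j-1)*m + i] + A[(j+1)*m + i]);
                double d = v - A[j*m + i];
                if (d < 0.0) d = -d;        /* absolute change */
                if (d > error) error = d;   /* max reduction */
                Anew[j*m + i] = v;
            }
        }

        #pragma acc parallel loop present(A[0:n*m], Anew[0:n*m])
        for (int j = 1; j < n - 1; j++) {
            #pragma acc loop
            for (int i = 1; i < m - 1; i++)
                A[j*m + i] = Anew[j*m + i];
        }
        iter++;
    }
    return iter;
}
```

Inside the data region only the scalar error moves between host and device on each sweep; the two arrays stay resident on the GPU until the while loop ends.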
Appropriate Data Transfers
#pragma acc data \
    pcopy(A) create(Anew)
while ( error > tol ) {
    error = 0.0;
    #pragma acc kernels \
        pcopy(Anew[:][:]) \
        pcopyin(A[:][:])
    {
        #pragma acc loop reduction(max:error)
        for (int j = 1; j < N-1; j++) {
            #pragma acc loop reduction(max:error)
            for (int i = 1; i < M-1; i++) {
                Anew[j][i] = (A[j][i+1] + A[j][i-1] +
                              A[j-1][i] + A[j+1][i]) * 0.25;
                error = fmax(error, fabs(Anew[j][i] - A[j][i]));
            }
        }
    }
    #pragma acc kernels \
        pcopy(A[:][:]) \
        pcopyin(Anew[:][:])
    {
        for (int j = 1; j < N-1; j++) {
            for (int i = 1; i < M-1; i++) {
                A[j][i] = Anew[j][i];
            }
        }
    }
}
Host/GPU diagram: a single copyin when the data region is entered and a single copyout when it is left; the kernels regions inside find the arrays already present.
Reduced Data Transfers (NVVP)
Figure: NVVP timeline of one cycle. GPU utilization: high.
Two Operations
- Data transfer
- Compute offload
Both compute offload and data transfer have to be considered.
Figure: GPU memory and CPU memory connected by PCI.
Kernel Tuning
Kernel Tuning (LOOP CONSTRUCT)
Gang
Worker
Vector ... SIMD width
Independent
Collapse
Seq
...
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop reduction(max:error)
for (int j = 1; j < N-1; j++) {
#pragma acc loop reduction(max:error)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
...
}
$ pgcc -Minfo=acc -acc jacobi.c
jacobi:
59, Generating present_or_copyout(Anew[1:4094][1:4094])
Generating present_or_copyin(A[:][:])
Generating Tesla code
61, Loop is parallelizable
63, Loop is parallelizable
Accelerator kernel generated
61, #pragma acc loop gang /* blockIdx.y */
63, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
Max reduction generated for error
Kernel Tuning (LOOP CONSTRUCT)
Gang
Worker
Vector ... SIMD width
Collapse
Independent
Seq
Cache
Tile
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop reduction(max:error) gang vector(1)
for (int j = 1; j < N-1; j++) {
#pragma acc loop reduction(max:error) gang vector(128)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
...
}
Execution Configuration (VECTOR CLAUSE)
#pragma acc loop gang vector(4)
for (j = 0; j < 16; j++) {
    #pragma acc loop gang vector(16)
    for (i = 0; i < 16; i++) {
        ...
Figure: the 16 x 16 iteration space (i horizontal, j vertical) is split into four blocks of 4 x 16.
#pragma acc loop gang vector(8)
for (j = 0; j < 16; j++) {
    #pragma acc loop gang vector(8)
    for (i = 0; i < 16; i++) {
        ...
Figure: the 16 x 16 iteration space (i horizontal, j vertical) is split into four blocks of 8 x 8.
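The block counts in the two figures follow from a ceiling division per loop. The sketch below works that out; the helper names are illustrative assumptions, and the mapping of gang and vector onto blocks and threads is the one suggested by the blockIdx/threadIdx comments in the compiler feedback shown earlier, which other compilers or targets need not share.

```c
#include <assert.h>

/* Number of blocks needed to cover `extent` iterations with blocks of `width`. */
int blocks_needed(int extent, int width)
{
    assert(extent > 0 && width > 0);
    return (extent + width - 1) / width;   /* ceiling division */
}

/* Blocks covering an nj x ni iteration space when each block holds vj x vi threads. */
int grid_blocks(int nj, int ni, int vj, int vi)
{
    return blocks_needed(nj, vj) * blocks_needed(ni, vi);
}
```

For the 16 x 16 loops, `grid_blocks(16, 16, 4, 16)` and `grid_blocks(16, 16, 8, 8)` both give 4 blocks of 64 threads; only the shape of each block differs, which changes the memory access pattern rather than the amount of parallelism.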
Kernel Tuning (LOOP CONSTRUCT)
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop reduction(max:error) \
collapse(2) gang vector(128)
for (int j = 1; j < N-1; j++) {
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
...
}
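One way to picture collapse(2) is as a flattening of the two nested loops into a single loop over all (j, i) pairs. The sketch below writes the same values with a collapsed nest and with a hand-flattened loop; the function names and the stored value 10*j + i are made up for illustration.

```c
#include <assert.h>

/* Nested form: collapse(2) asks the compiler to treat the two loops
   as one iteration space of n*m iterations. */
void fill_nested(int n, int m, int *out)
{
    assert(n > 0 && m > 0);
    #pragma acc parallel loop collapse(2) copyout(out[0:n*m])
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            out[j*m + i] = 10*j + i;
}

/* Hand-flattened form: one loop over k, with j and i recovered from k. */
void fill_flat(int n, int m, int *out)
{
    assert(n > 0 && m > 0);
    #pragma acc parallel loop copyout(out[0:n*m])
    for (int k = 0; k < n*m; k++) {
        int j = k / m;
        int i = k % m;
        out[k] = 10*j + i;
    }
}
```

Both functions fill the array identically. The collapsed form keeps the original loop structure readable while still exposing all n*m iterations to a single gang/vector schedule.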
Kernel Tuning (LOOP CONSTRUCT)
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
error = 0.0;
#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop reduction(max:error) independent
for (int jj = 1; jj < NN-1; jj++) {
int j = list_j[jj];
#pragma acc loop reduction(max:error)
for (int i = 1; i < M-1; i++) {
Anew[j][i] = (A[j][i+1] + A[j][i-1] +
A[j-1][i] + A[j+1][i]) * 0.25;
error = fmax(error, fabs(Anew[j][i] - A[j][i]));
}
}
...
}
Kernel Tuning (LOOP CONSTRUCT)
#pragma acc kernels pcopy(Anew[:][:]) pcopyin(A[:][:])
#pragma acc loop seq
for (int k = 3; k < NK-3; k++) {
#pragma acc loop
for (int j = 0; j < NJ; j++) {
#pragma acc loop
for (int i = 0; i < NI; i++) {
Anew[k][j][i] = func(
A[k-1][j][i], A[k-2][j][i], A[k-3][j][i],
A[k+1][j][i], A[k+2][j][i], A[k+3][j][i], ...
);
}
}
}
Is It Easy to Combine with MPI?
MPI Parallelization (HALO EXCHANGE)
- Block decomposition
- Each process handles one block
- Data in the boundary region (halo) is exchanged
Figure: five-point stencil. A(i,j) is updated from A(i+1,j), A(i-1,j), A(i,j-1), and A(i,j+1).
MPI JACOBI ITERATION
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
#pragma acc kernels pcopy(Anew) pcopyin(A)
calc_new_A( Anew, A, ... );
#pragma acc kernels pcopy(A) pcopyin(Anew)
update_A( A, Anew );
}
MPI JACOBI ITERATION
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
pack_data_at_boundary( send_buf, A, ... );
exchange_data_by_MPI( recv_buf, send_buf, ... );
unpack_data_to_halo( A, recv_buf, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A)
calc_new_A( Anew, A, ... );
#pragma acc kernels pcopy(A) pcopyin(Anew)
update_A( A, Anew );
}
1. Pack the send data
2. Exchange the data
3. Unpack the received data
Figure: two GPUs exchanging halo data via MPI.
MPI JACOBI ITERATION
#pragma acc data pcopy(A) create(Anew)
while ( error > tol ) {
#pragma acc kernels pcopyin(A) pcopyout(send_buf)
pack_data_at_boundary( send_buf, A, ... );
exchange_data_by_MPI( recv_buf, send_buf, ... );
#pragma acc kernels pcopy(A) pcopyin(recv_buf)
unpack_data_to_halo( A, recv_buf, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A)
calc_new_A( Anew, A, ... );
#pragma acc kernels pcopy(A) pcopyin(Anew)
update_A( A, Anew );
}
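1. Pack the data into the send buffer on the GPU, then transfer it to the host
2. Exchange data with the neighboring processes
3. Transfer to the GPU, then unpack the receive buffer on the GPU
Figure: two GPUs exchanging halo data via MPI.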
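The slides leave the pack and unpack routines abstract. A minimal sketch of what they might look like for one column of a row-major block is given below; `pack_column` and `unpack_column` are hypothetical names, the block layout is an assumption, and the MPI exchange itself is left out so the example stays free of external libraries.

```c
#include <assert.h>

/* Pack rows 1..n-2 of column `col` of the row-major n x m block A into buf.
   On a GPU build the loop runs on the device and buf is copied to the host. */
void pack_column(int n, int m, const double *A, int col, double *buf)
{
    assert(n >= 3 && col >= 0 && col < m);
    #pragma acc parallel loop pcopyin(A[0:n*m]) pcopyout(buf[0:n-2])
    for (int j = 1; j < n - 1; j++)
        buf[j - 1] = A[j*m + col];
}

/* Unpack buf into rows 1..n-2 of column `col` of A (a halo column).
   On a GPU build buf is copied to the device and unpacked there. */
void unpack_column(int n, int m, double *A, int col, const double *buf)
{
    assert(n >= 3 && col >= 0 && col < m);
    #pragma acc parallel loop pcopy(A[0:n*m]) pcopyin(buf[0:n-2])
    for (int j = 1; j < n - 1; j++)
        A[j*m + col] = buf[j - 1];
}
```

A column is the interesting case because its elements are strided in memory: packing on the GPU turns many small strided transfers into one contiguous buffer of n-2 values. Between the two calls, the buffer would be handed to the neighboring process, for example with an MPI send/receive pair.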
MPI JACOBI ITERATION (NVVP)
Figure: NVVP timeline of one cycle, with the data packing (Pack), MPI, and data unpacking (Unpack) phases marked.
Overlap (ASYNC/WAIT CLAUSE)
while ( error > tol ) {
#pragma acc kernels pcopyin(A) pcopyout(send_buf)
pack_data_at_boundary( send_buf, A, ... );
exchange_data_by_MPI( recv_buf, send_buf, ... );
#pragma acc kernels pcopy(A) pcopyin(recv_buf)
unpack_data_to_halo( A, recv_buf, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A)
calc_new_A( Anew, A, ... );
#pragma acc kernels pcopy(A) pcopyin(Anew)
update_A( A, Anew );
}
Overlap (ASYNC/WAIT CLAUSE)
while ( error > tol ) {
#pragma acc kernels pcopyin(A) pcopyout(send_buf)
pack_data_at_boundary( send_buf, A, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A)
calc_new_A_inside( Anew, A, ... );
exchange_data_by_MPI( recv_buf, send_buf, ... );
#pragma acc kernels pcopy(A) pcopyin(recv_buf)
unpack_data_to_halo( A, recv_buf, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A)
calc_new_A_at_boundary( Anew, A, ... );
#pragma acc kernels pcopy(A) pcopyin(Anew)
update_A( A, Anew );
}
Interior: calc_new_A_inside
Boundary: calc_new_A_at_boundary
Overlap (ASYNC/WAIT CLAUSE)
while ( error > tol ) {
#pragma acc kernels pcopyin(A) pcopyout(send_buf) async(2)
pack_data_at_boundary( send_buf, A, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A) async(1)
calc_new_A_inside( Anew, A, ... );
#pragma acc wait(2)
exchange_data_by_MPI( recv_buf, send_buf, ... );
#pragma acc kernels pcopy(A) pcopyin(recv_buf) async(2)
unpack_data_to_halo( A, recv_buf, ... );
#pragma acc kernels pcopy(Anew) pcopyin(A) async(2)
calc_new_A_at_boundary( Anew, A, ... );
#pragma acc kernels pcopy(A) pcopyin(Anew) wait(1,2)
update_A( A, Anew );
}
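The async/wait pattern above can be tried on a single process with the sketch below. It is an illustration under stated assumptions, not the speaker's code: the function name `jacobi_overlap`, the 1D periodic averaging stencil, and the host-side swap that stands in for `exchange_data_by_MPI` are all hypothetical. A compiler without OpenACC support ignores the pragmas and runs the same steps serially.

```c
#include <assert.h>

/* A and Anew hold n interior cells at indices 1..n plus one halo cell
 * at each end (indices 0 and n+1). Hypothetical example, single process. */
void jacobi_overlap(double *A, double *Anew, int n, int iters)
{
    double send_buf[2], recv_buf[2];
    assert(n >= 2);

    #pragma acc data copy(A[0:n+2]) create(Anew[0:n+2], send_buf, recv_buf)
    {
        for (int it = 0; it < iters; ++it) {
            /* 1. Queue 2: pack the two boundary cells and bring them to the host. */
            #pragma acc kernels present(A[0:n+2], send_buf) async(2)
            {
                send_buf[0] = A[1];
                send_buf[1] = A[n];
            }
            #pragma acc update self(send_buf) async(2)

            /* 2. Queue 1: interior cells never read the halo cells. */
            #pragma acc parallel loop present(A[0:n+2], Anew[0:n+2]) async(1)
            for (int i = 2; i <= n - 1; ++i)
                Anew[i] = 0.5 * (A[i - 1] + A[i + 1]);

            /* 3. Wait for the pack only, then "exchange" on the host.
             *    Assumption: a periodic swap replaces the MPI call. */
            #pragma acc wait(2)
            recv_buf[0] = send_buf[1];   /* left halo  <- right boundary */
            recv_buf[1] = send_buf[0];   /* right halo <- left boundary  */

            /* 4. Queue 2: unpack into the halo, then update the boundary cells. */
            #pragma acc update device(recv_buf) async(2)
            #pragma acc kernels present(A[0:n+2], Anew[0:n+2], recv_buf) async(2)
            {
                A[0]     = recv_buf[0];
                A[n + 1] = recv_buf[1];
                Anew[1]  = 0.5 * (A[0] + A[2]);
                Anew[n]  = 0.5 * (A[n - 1] + A[n + 1]);
            }

            /* 5. Both queues must finish before A is overwritten. */
            #pragma acc parallel loop present(A[0:n+2], Anew[0:n+2]) wait(1, 2)
            for (int i = 1; i <= n; ++i)
                A[i] = Anew[i];
        }
    }
}
```

Calling `jacobi_overlap(A, Anew, 4, 1)` with interior values 1, 2, 3, 4 leaves 3, 2, 3, 2 in `A[1]` to `A[4]`. The interior loop and the halo writes touch disjoint elements of `A`, which is what makes it safe to run them on different queues.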
![Page 47: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/47.jpg)
Overlap (NVVP)
[Figure: NVVP timeline of one cycle, with the MPI, Pack, and Unpack phases labeled]
![Page 48: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/48.jpg)
Is OpenACC actually used in practice?
![Page 49: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/49.jpg)
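NICAM
Weather and climate model by RIKEN AICS / University of Tokyo
- Huge code base (several hundred thousand lines)
- No hotspot (the Pareto principle does not hold)
Two kinds of processing with different characteristics
- Dynamics: memory-bandwidth bound
- Physics: compute bound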
![Page 50: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/50.jpg)
NICAM: dynamics (NICAM-DC)
GPU port with OpenACC
- All major subroutines (more than 50) run on the GPU
- MPI supported
- Porting time: 2 weeks
Good scalability
- Tsubame 2.5, up to 2560 GPUs
- Scaling factor: 0.8
[Figure: weak scaling, log-log plot of performance (GFLOPS, 1.E+01 to 1.E+05) versus number of CPUs or GPUs (1.E+00 to 1.E+04); series: Tsubame 2.5 (GPU: K20X), K computer, Tsubame 2.5 (CPU: WSM)]
![Page 51: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/51.jpg)
NICAM: dynamics (NICAM-DC)
[Figure: log-log plot of measured performance (GFLOPS, 1.E+01 to 1.E+05) versus aggregate peak memory bandwidth (GB/s, 1.E+02 to 1.E+06); series: Tsubame 2.5 (GPU: K20X), K computer, Tsubame 2.5 (CPU: WSM)]
![Page 52: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/52.jpg)
NICAM: physics (SCALE-LES)
Atmospheric radiation transfer
- The most expensive computation in the physics part
- GPU port with OpenACC: complete
[Figure: bar chart of speedup vs. 1 CPU core (axis 0 to 160)]
Xeon E5-2690v2 (3.0GHz, 10-core): 1 core 1.00, 2 cores 1.99, 4 cores 3.88, 10 cores 8.51
Tesla K40: 1 GPU 37.8, 2 GPUs 76.0, 4 GPUs 151
(*) Includes PCI data transfer time; grid size: 1256x32x32
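As a quick reading of the chart values (plain arithmetic on the numbers above, not additional measurements): one K40 is about 4.4 times faster than all 10 CPU cores, and the 2-GPU and 4-GPU runs scale almost linearly from the 1-GPU result.

$$
\frac{37.8}{8.51} \approx 4.4, \qquad
\frac{76.0}{2 \times 37.8} \approx 1.01, \qquad
\frac{151}{4 \times 37.8} \approx 1.00
$$

The second and third ratios are the parallel efficiency on 2 and 4 GPUs relative to 1 GPU.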
![Page 53: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/53.jpg)
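SEISM3D
Earthquake simulation by the Earthquake Research Institute, University of Tokyo (Prof. Furumura)
GPU port of the major subroutines is complete
- Memory-bandwidth bound, 3D model (2D decomposition), communication between neighboring processes
[Figure: bar chart, SEISM3D (480x480x1024, 1K steps), time in seconds (axis 0 to 600)]
K: 8x SPARC64VIIIfx: 605
CPU: 8x Xeon E5-2690v2: 459
GPU: 8x Tesla K40: 134
3.4x speedup (whole application)
[Figure: breakdown of execution time on GPU: 8x Tesla K40 (axis 0 to 140); components: Others (CPU, MPI and so on), [CUDA memcpy DtoH], [CUDA memcpy HtoD], (other subroutines), update_vel_pml, update_vel, update_stress_pml, update_stress, diff3d_*]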
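The quoted speedup can be checked against the bar values, assuming the three bars (605, 459, 134 seconds) map in order to K, CPU, and GPU. Under that reading the GPU run is about 3.4 times faster than the Xeon run, which matches the figure on the slide, and about 4.5 times faster than the K run.

$$
\frac{459\ \text{s}}{134\ \text{s}} \approx 3.4, \qquad
\frac{605\ \text{s}}{134\ \text{s}} \approx 4.5
$$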
![Page 54: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/54.jpg)
SEISM3D
[Figure: log-log plot of performance (Mgrids/sec, 100 to 10,000) versus total peak memory bandwidth (GB/s, 100 to 10,000); series: Tesla K40, SX9, FX10, K, Xeon E5-2* v2 (IVB), Xeon E5-4* (SDB), Xeon X7* (NHL EX)]
![Page 55: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/55.jpg)
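FFR/BCM
Next-generation CFD by RIKEN AICS / Hokkaido University (Assoc. Prof. Tsubokura)
MUSCL_bench:
- Flux computation based on the MUSCL scheme (a very complex computation)
- The main part of the CFD computation (60-70%)
- GPU port with OpenACC: complete
[Figure: bar chart of speedup vs. 1 CPU core (axis 0 to 35)]
Xeon E5-2690v2 (3.0GHz, 10-core): 1 core 1.00, 2 cores 1.93, 5 cores 4.55, 10 cores 8.30
Tesla K40: 1 GPU 33.21
(*) Includes PCI data transfer time; size: 80x32x32x32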
![Page 56: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/56.jpg)
Summary
Introduced the current state of OpenACC
Easy: add directives to existing code
Powerful: use GPUs with little effort
Open: a growing number of adoption cases
![Page 57: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/57.jpg)
What's new in CUDA 6
Akira Naruse
NVIDIA Developer Technologies
![Page 58: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/58.jpg)
CUDA 6
Unified Memory
XT libraries
Drop-in libraries
GPUDirect RDMA
Development tools
![Page 59: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/59.jpg)
Unified Memory
Now: host memory and GPU memory
Memory model seen by the developer: Unified Memory
![Page 60: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/60.jpg)
Cumbersome memory management
CPU code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}
GPU code:
void sortfile(FILE *fp, int N) {
    char *data;
    char *d_data;
    data = (char *)malloc(N);
    cudaMalloc(&d_data, N);
    fread(data, 1, N, fp);
    cudaMemcpy(d_data, data, N, ..);
    qsort<<<...>>>(d_data, N, 1, compare);
    cudaDeviceSynchronize();
    cudaMemcpy(data, d_data, N, ..);
    use_data(data);
    cudaFree(d_data);
    free(data);
}
![Page 61: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/61.jpg)
Simplified memory management
CPU code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}
Unified Memory (CUDA 6):
void sortfile(FILE *fp, int N) {
    char *data;
    cudaMallocManaged(&data, N);   /* one pointer, valid on both CPU and GPU */
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();       /* finish GPU work before the CPU reads data */
    use_data(data);
    cudaFree(data);
}
![Page 62: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/62.jpg)
Unified memory management (future)
CPU code:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}
Future?:
void sortfile(FILE *fp, int N) {
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    free(data);
}
![Page 63: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/63.jpg)
DEEP COPY
struct dataElem {
    int prop1;
    int prop2;
    char *text;
};
[Figure: CPU memory holds a dataElem (prop1, prop2, *text) whose text pointer refers to the string “Hello World”; GPU memory is shown empty]
![Page 64: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/64.jpg)
DEEP COPY
struct dataElem {
    int prop1;
    int prop2;
    char *text;
};
[Figure: CPU memory and GPU memory each hold a dataElem (prop1, prop2, *text) and the string “Hello World”]
Two copies are required
![Page 65: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/65.jpg)
DEEP COPY
[Figure: CPU memory and GPU memory each hold a dataElem (prop1, prop2, *text) and the string “Hello World”]
void launch(dataElem *elem) {
    dataElem *g_elem;
    char *g_text;
    int textlen = strlen(elem->text) + 1;   /* include the terminating NUL */
    cudaMalloc(&g_elem, sizeof(dataElem));
    cudaMalloc(&g_text, textlen);
    cudaMemcpy(g_elem, elem, sizeof(dataElem), cudaMemcpyHostToDevice);
    cudaMemcpy(g_text, elem->text, textlen, cudaMemcpyHostToDevice);
    /* Third copy: make the device-side struct point at the device-side string. */
    cudaMemcpy(&(g_elem->text), &g_text, sizeof(g_text), cudaMemcpyHostToDevice);
    kernel<<< ... >>>(g_elem);
}
In practice, three copies are required
![Page 66: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/66.jpg)
DEEP COPY (Unified Memory)
[Figure: CPU memory, GPU memory, and Unified Memory; the dataElem (prop1, prop2, *text) and the string “Hello World” live in Unified Memory]
void launch(dataElem *elem) {
    kernel<<< ... >>>(elem);
}
![Page 67: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/67.jpg)
Linked list
[Figure: CPU memory and GPU memory; a linked list of four nodes, each with key, data, and next fields]
![Page 68: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/68.jpg)
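Linked list
[Figure: CPU memory and GPU memory; a linked list of four nodes, each with key, data, and next fields. Transfer everything?]
Transfer everything every time
- PCI bandwidth becomes the bottleneck
Transfer everything the first time, then only the updated parts
- Very complex to implement
Keep the data in CPU memory and let the GPU access it over PCI
- Access over PCI is slow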
![Page 69: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/69.jpg)
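Linked list (Unified Memory)
[Figure: CPU memory, GPU memory, and Unified Memory; the four-node list (key, data, next) sits in Unified Memory, and both the CPU and the GPU reach it with ordinary memory accesses]
The list can be manipulated from both the CPU and the GPU
- Insertion, deletion
No explicit synchronization between CPU memory and GPU memory is needed after the list is updated
Simultaneous access from the CPU and the GPU is not allowed; mutual exclusion is required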
![Page 70: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/70.jpg)
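Roadmap
CUDA 6: ease of use
- Single pointer
- No need to write memcpy calls
- Data structures shared with host-side code
Next: optimization
- Prefetching
- Data migration hints
- Additional OS support
Pascal
- System allocator integration
- Stack memory integration
- Hardware-accelerated memory coherency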
![Page 71: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/71.jpg)
XT libraries
![Page 72: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/72.jpg)
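XT libraries
cuBLAS-XT and cuFFT-XT
No explicit data transfer calls are needed
- The library allocates the GPU memory it needs
Automatic multi-GPU support
- No multi-GPU-specific code is needed
Supports problem sizes larger than GPU memory (out-of-core)
- Overlaps kernel execution with data transfers (BLAS level 3)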
![Page 73: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/73.jpg)
CUBLAS
cuBLAS matrix multiplication code:
cublasHandle_t handle;
cublasCreate(&handle);
cudaMalloc(&d_A, ..);
cudaMalloc(&d_B, ..);
cudaMalloc(&d_C, ..);
cublasSetMatrix(.., A, .., d_A, ..);   /* host to device */
cublasSetMatrix(.., B, .., d_B, ..);
cublasDgemm(handle, .., d_A, .., d_B, .., d_C, ..);
cublasGetMatrix(.., d_C, .., C, ..);   /* device to host */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cublasDestroy(handle);
![Page 74: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/74.jpg)
CUBLAS and CUBLAS-XT
cuBLAS matrix multiplication code:
cublasHandle_t handle;
cublasCreate(&handle);
cudaMalloc(&d_A, ..);
cudaMalloc(&d_B, ..);
cudaMalloc(&d_C, ..);
cublasSetMatrix(.., A, .., d_A, ..);   /* host to device */
cublasSetMatrix(.., B, .., d_B, ..);
cublasDgemm(handle, .., d_A, .., d_B, .., d_C, ..);
cublasGetMatrix(.., d_C, .., C, ..);   /* device to host */
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
cublasDestroy(handle);
cuBLAS-XT matrix multiplication code:
cublasXtHandle_t handle;
cublasXtCreate(&handle);
cublasXtDgemm(handle, .., A, .., B, .., C, ..);   /* host pointers are passed directly */
cublasXtDestroy(handle);
![Page 75: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/75.jpg)
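CUBLAS-XT API
GPUs to use
- cublasXtDeviceSelect(): number of GPUs and the IDs of the GPUs to use
Blocking size
- cublasXtSetBlockDim(): set the blocking size
- cublasXtGetBlockDim(): get the current setting
CPU/GPU hybrid execution
- cublasXtSetCpuRoutine(): set the CPU BLAS implementation
- cublasXtSetCpuRatio(): set the share of work given to the CPU
Pinned memory
- cublasXtSetPinningMemMode(): set the pinned-memory mode
- cublasXtGetPinningMemMode(): get the current setting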
![Page 76: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/76.jpg)
CUBLAS-XT
Supports all BLAS level 3 routines
Works even when the matrix size exceeds GPU memory capacity (out-of-core)
[Figure: cuBLAS ZGEMM performance on 2 GPUs; GFLOPS (0 to 2500) versus matrix size NxN (0 to 28672); series: 1 K20c, 2 K20c; in-core and out-of-core regions marked]
![Page 77: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/77.jpg)
CUBLAS-XT (NVVP)
![Page 78: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/78.jpg)
Drop-in libraries
![Page 79: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/79.jpg)
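Drop-in libraries
Enable GPU use through standard library APIs
NVBLAS
- Automatically replaces BLAS level 3 function calls with cuBLAS
- No source changes are needed to use cuBLAS
How to use
- Add NVBLAS and recompile
- On Linux, it can be enabled by setting LD_PRELOAD (no recompilation needed)
CPU: dgemm(.., A, .., B, .., C, ..);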
![Page 80: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/80.jpg)
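Drop-in libraries
Enable GPU use through standard library APIs
NVBLAS
- Automatically replaces BLAS level 3 function calls with cuBLAS
- No source changes are needed to use cuBLAS
How to use
- Add NVBLAS and recompile
- On Linux, it can be enabled by setting LD_PRELOAD (no recompilation needed)
CPU: dgemm(.., A, .., B, .., C, ..);
NVBLAS: dgemm(.., A, .., B, .., C, ..);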
![Page 81: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/81.jpg)
NVBLAS (Linux)
Configuration file (nvblas.conf):
NVBLAS_LOGFILE nvblas.log
NVBLAS_CPU_BLAS_LIB libmkl_intel_lp64.so \
                    libmkl_core.so \
                    libmkl_intel_thread.so
NVBLAS_GPU_LIST 0 # ALL, ALL0
NVBLAS_TILE_DIM 2048
NVBLAS_AUTOPIN_MEM_ENABLED
Run:
$ LD_PRELOAD=/usr/local/cuda-6.0/lib64/libnvblas.so ./a.out
![Page 82: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/82.jpg)
NVBLAS
Applicable to applications that use BLAS level 3
- Octave, Scilab, etc.
[Figure: matrix multiplication in R; fp64 GFlops/s (0 to 3000) versus matrix dimension (0 to 35000); series: nvBLAS, 4x K20X GPUs; MKL, 6-core Xeon E5-2667 CPU]
![Page 83: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/83.jpg)
NVBLAS demo
![Page 84: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/84.jpg)
CUDA 6
Unified Memory
XT libraries
Drop-in libraries
GPUDirect RDMA
Development tools
![Page 85: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/85.jpg)
CUDA 6
Making parallel computing easy
developer.nvidia.com/cuda-toolkit
CUDA Registered Developer Program
![Page 86: OPENACCの現状 - GTC On-Demand Featured Talks | …on-demand.gputechconf.com/gtc/2014/jp/sessions/4005.pdfUNC / ADCIRC Storm Surge OpenACC (AmgX?) LSU LONI NOAA / MOM6 Ocean GCM](https://reader033.fdocuments.net/reader033/viewer/2022050903/5aefae267f8b9a8c308c4f1f/html5/thumbnails/86.jpg)
CUDA 6.5 RC
64-bit ARM machines
Microsoft Visual Studio 2013 (VC12)
cuFFT callbacks
cuSPARSE (BSR storage format)
CUDA occupancy calculator API
CUDA Fortran debugging support
Application replay mode (Visual Profiler and nvprof)
nvprune utility (object size reduction)