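Fortran Language Features That Make CUDA Fortran More Convenient

Tomohiro Degawa (出川智啓), Specially Appointed Associate Professor

Department of Electrical, Electronics and Information Engineering (electrical engineering group), Nagaoka University of Technology (長岡技術科学大学 技学研究院)

2015/9/17, Prometech Simulation Conference 2015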

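Contents

• A brief introduction to Modern Fortran (90/95/2003)

• A CUDA Fortran implementation of the 1D advection equation (example)

• Limiting the scope of porting by introducing object-oriented programming

• Summary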


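Introduction

• GPGPU has spread and its user base has broadened

• Demand from Fortran users who hold large bodies of existing code

• The arrival of CUDA Fortran

• Performance has risen with each GPU generation

• Faster hardware

• Accumulated know-how on program optimization and speedup

• GPGPU is moving from its early days to maturity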


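Introduction

• Advances in programming methodology

• Better maintainability, extensibility, and reusability of programs

• A shift from procedural to object-oriented programming

• Where Fortran stands

• Object-oriented programming is possible from Fortran 2003 onward

• Information on CUDA Fortran mostly amounts to "how do I write the same processing as in CUDA C?"

• Little of it is tied to Fortran's own features or to programming methodology

• This talk introduces Modern Fortran features that can be used from CUDA Fortran and shows usage examples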


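CUDA Fortran

• An extension of Fortran for NVIDIA GPUs

• Available in the PGI Fortran compiler

• The latest version as of September 16, 2015 is 15.7 (released July 13)

• It builds on CUDA C, but new CUDA features cannot be used until the Fortran compiler supports them

• Good balance between the effort invested and the gain (performance improvement)

• Reasonable performance can be obtained with knowledge of parallel computing alone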


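PGI Fortran compiler

• A Fortran 77/90/95/2003 compiler

• Partial support for Fortran 2008
  • Coverage is not good

• Which features can be used from CUDA Fortran?

• Information on CUDA Fortran mostly amounts to "how do I write the same processing as in CUDA C?"

• There is little information on using Modern Fortran features from CUDA Fortran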


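Fortran 90/95

• A major evolution from FORTRAN 77

• Main features

• implicit none disables implicit typing

• Array operations
  • An operation on all elements of an array is written as a single expression

• Arrays can be used as function return values

• Encapsulation with module
  • public and private control the visibility of variables and procedures†

†"Procedure" is the collective term for functions and subroutines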

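Fortran 90/95

• Main features

• Flexible memory management
  • Dynamic management of arrays with allocate/deallocate and pointer

• Automatic arrays
  • Arrays whose size is taken from the actual arguments and which are allocated and released automatically

• Recursive procedures

• Derived types
  • Equivalent to C structs

• Overloading of procedures and operators
  • Several procedures are called under a common name
  • Which procedure is called is decided by the types of the actual arguments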


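Fortran 2003

• A major revision of Fortran 90/95

• Support for object-oriented programming

• Stronger interoperability with C [1]

• Main features

• Extended derived types
  • A type can bundle procedures as well as variables

• Inheritance, polymorphism, abstract types, and more

• Enhanced memory management
  • The source specifier creates a clone of a variable

[1] Degawa, "PGI CUDA FortranとGPU最適化ライブラリの一連携法" (A method for linking PGI CUDA Fortran with GPU-optimized libraries), Prometech Simulation Conference 2014.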

Writing code the Fortran 90/95 way

• Quicksort in ascending order

recursive function qsort(data) result(sorted)
  implicit none
  integer,intent(in) :: data(:)
  integer :: sorted(1:size(data))

  if(size(data) > 1)then
    sorted = (/ qsort(pack(data(2:),data(2:)< data(1))), & !changing the pack masks to
                data(1),                                 & !>, <= gives
                qsort(pack(data(2:),data(2:)>=data(1))) /) !descending order
  else
    sorted = data
  end if

end function qsort
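The function above takes an assumed-shape argument, so it needs an explicit interface at the call site. A minimal sketch of one way to try it, assuming it is placed in a module (the names sort_demo and try_qsort and the test data are illustrative, not from the talk):

```fortran
module sort_demo
  implicit none
  private
  public :: qsort
contains
  ! Same algorithm as on the slide: pivot on the first element,
  ! partition the rest with pack(), and recurse on both parts.
  recursive function qsort(data) result(sorted)
    integer, intent(in) :: data(:)
    integer :: sorted(1:size(data))
    if (size(data) > 1) then
      sorted = [ qsort(pack(data(2:), data(2:) <  data(1))), &
                 data(1),                                   &
                 qsort(pack(data(2:), data(2:) >= data(1))) ]
    else
      sorted = data   ! zero or one element: already sorted
    end if
  end function qsort
end module sort_demo

program try_qsort
  use sort_demo, only: qsort
  implicit none
  integer :: values(7)
  values = [5, 3, 8, 1, 9, 2, 3]
  print '(7(i0,1x))', qsort(values)
  ! Self-check: stop with an error if the result is not ascending
  if (any(qsort(values) /= [1, 2, 3, 3, 5, 8, 9])) error stop "not sorted"
end program try_qsort
```

Putting the function in a module gives the compiler the interface automatically, which is what lets data(:) pick up its size from the actual argument.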

A CUDA Fortran implementation of the 1D advection equation (example)


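Governing equation

• 1D advection equation

$$ \frac{\partial f}{\partial t} + c\,\frac{\partial f}{\partial x} = 0 $$

t: time
c: advection velocity
x: spatial coordinate

• Spatial derivative

• Second-order central difference

• Time integration

• First-order Euler method

[Figure: the profile f^n and the profile f^{n+1} plotted along x, offset from each other by c Δt]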

Development and execution environment

• Development environment
  • Microsoft Visual Studio Community 2013
  • PGI Accelerator Compiler 15.7 + CUDA 6.5
  • Compiler options
    • -fast -Mcuda (-Mcuda only when compiling for the GPU)

• Execution environment
  • OS: Windows 8.1
  • CPU: Core i7 920 (2.66 GHz)
  • Memory: 6 GB
  • GPU: NVIDIA GTX Titan


Main routine

program main
  use parameters
  use kernel
  implicit none
  real(8),allocatable :: f     (:)
  real(8),allocatable :: d_f_dx(:)
  integer :: n

  allocate(     f(Nx))
  allocate(d_f_dx(Nx))

  call initialize(f)
  call output(f,"f_start.txt")
  do n=1,Nt
    call computeDifference(d_f_dx,f)
    call integrate(f,d_f_dx)
  end do
  call output(f,"f_end.txt")

  deallocate(     f)
  deallocate(d_f_dx)

end program main


Module (computational parameters)

module parameters
  implicit none
  private
  public :: PI2, Lx, Nx, dx, dx2, conv, dt, Nt

  real(8),parameter :: PI  = 3.1415926535897932384626433832795d0
  real(8),parameter :: PI2 = 6.283185307179586476925286766559d0

  real(8),parameter :: Lx  = 1d0
  integer,parameter :: Nx  = 2**20
  real(8),parameter :: dx  = Lx/dble(Nx-1)
  real(8),parameter :: dx2 = Lx/dble(Nx-1)*2d0

  real(8),parameter :: conv = 1d0
  real(8),parameter :: dt   = 1d-5
  real(8),parameter :: endT = 0.5d0
  integer,parameter :: Nt   = int(endT/dt)

end module parameters

Computational conditions

  Domain length        Lx = 1 m
  Number of points     Nx = 2^20 (maximum)
  Advection velocity   c  = 1 m/s
  Time step            Δt = 10^-5 s
  End time             t  = 0.5 s

Module (subroutines)

module kernel
  use parameters
  implicit none
contains
  !--------------------------------------!
  subroutine initialize(f) !initialize the function values
    :
  end subroutine initialize
  !--------------------------------------!
  subroutine computeDifference(d_f_dx,f) !spatial derivative
    :
  end subroutine computeDifference
  !--------------------------------------!
  subroutine integrate(f,d_f_dx) !time integration
    :
  end subroutine integrate
  !--------------------------------------!
  subroutine output(value,filename) !file output
    :
  end subroutine output
  !--------------------------------------!

end module kernel

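Initializing the function values

subroutine initialize(f)
  use parameters,only:Nx,dx,Lx,PI2
  implicit none

  real(8),intent(inout) :: f(Nx)

  integer :: i

  do i = 1,Nx
    f(i) = ((1d0-cos(PI2*dble(i-1)*dx/Lx))/2d0)**10
  end do

  !equivalent form using an array constructor with an implied do
  !f = (/ ( ((1d0-cos(PI2*dble(i-1)*dx/Lx))/2d0)**10, i=1,Nx ) /)

end subroutine initialize

$$ f(x) = \left[ \frac{1}{2}\left( 1 - \cos\frac{2\pi x}{L_x} \right) \right]^{10} $$

[Figure: initial profile f versus x, for 0 ≤ x ≤ 1 and 0 ≤ f ≤ 1]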

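Spatial derivative

subroutine computeDifference(d_f_dx,f)
  use parameters,only:Nx,dx2
  implicit none

  real(8),intent(out) :: d_f_dx(Nx)
  real(8),intent(in)  :: f     (Nx)
  integer :: i

  i=1
  d_f_dx(i) = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2

  do i=2,Nx-1
    d_f_dx(i) = (f(i+1)-f(i-1))/dx2
  end do

  i=Nx
  d_f_dx(i) = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2

end subroutine computeDifference

Computing the spatial derivative

$$ \left.\frac{df}{dx}\right|_{i} = \begin{cases} \dfrac{-3f_{1} + 4f_{2} - f_{3}}{2\Delta x} & (i = 1) \\[2mm] \dfrac{f_{i+1} - f_{i-1}}{2\Delta x} & (1 < i < N_x) \\[2mm] \dfrac{3f_{N_x} - 4f_{N_x-1} + f_{N_x-2}}{2\Delta x} & (i = N_x) \end{cases} $$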
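One quick way to gain confidence in the three difference formulas is to apply them to a quadratic, for which second-order stencils reproduce the derivative exactly apart from rounding. The following is a minimal sketch under assumed values (the grid size n, spacing h, and test polynomial are illustrative and not from the talk):

```fortran
program check_stencils
  implicit none
  integer, parameter :: n = 11        ! small grid, chosen only for illustration
  real(8), parameter :: h = 0.1d0     ! grid spacing (assumed value)
  real(8) :: x(n), f(n), dfdx(n), exact(n)
  integer :: i

  x     = [( dble(i-1)*h, i = 1, n )]
  f     = 3d0*x**2 - 2d0*x + 1d0      ! test function: a quadratic
  exact = 6d0*x - 2d0                 ! its analytic derivative

  ! The same three formulas as in computeDifference
  dfdx(1) = (-3d0*f(1) + 4d0*f(2) - f(3)) / (2d0*h)
  do i = 2, n-1
    dfdx(i) = (f(i+1) - f(i-1)) / (2d0*h)
  end do
  dfdx(n) = ( 3d0*f(n) - 4d0*f(n-1) + f(n-2)) / (2d0*h)

  print '(a,es10.3)', "max |error| = ", maxval(abs(dfdx - exact))
  ! Only rounding error should remain for a quadratic
  if (maxval(abs(dfdx - exact)) > 1d-10) error stop "stencil mismatch"
end program check_stencils
```

A check like this is also handy after porting, since the GPU kernel is expected to produce the same values as the CPU loop.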

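Time integration

subroutine integrate(f,d_f_dx)
  use parameters,only:Nx,conv,dt
  implicit none

  real(8),intent(inout) :: f     (Nx)
  real(8),intent(in   ) :: d_f_dx(Nx)

  integer :: i

  !integration by the first-order Euler method
  do i = 1,Nx
    f(i) = f(i) - conv*dt*d_f_dx(i)
  end do

  !equivalent form using whole-array operations
  !f = f - conv*dt*d_f_dx

end subroutine integrate

$$ f^{n+1} = f^{n} + \Delta t \left.\frac{df}{dt}\right|^{n} = f^{n} - c\,\Delta t \left.\frac{df}{dx}\right|^{n} $$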

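File output

• Automatic reallocation of allocatable arrays
  • The shape of an allocatable array is adjusted automatically to the size of the array assigned to it

• Variable-length character strings
  • The length is declared with a colon (:)
  • character(:),allocatable :: variable_name
  • Use an asterisk (*) when receiving one as an argument

subroutine output(value,filename)
  use parameters
  implicit none

  real(8),intent(in) :: value(Nx)
  !filename is received as a variable-length string
  character(*) :: filename
  integer :: i

  open(unit=100,file=filename)
  do i=1,Nx
    write(100,*) (i-1)*dx, value(i)
  end do
  close(100)

end subroutine output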
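The two features listed on this slide can be seen in isolation in a few lines. This is a minimal sketch, assuming a compiler with Fortran 2003 reallocation on assignment enabled (the default in current compilers as far as I know); the program and variable names are illustrative:

```fortran
program realloc_demo
  implicit none
  real(8), allocatable      :: a(:)
  character(:), allocatable :: name      ! deferred-length string

  a = [1d0, 2d0, 3d0]                    ! allocated to size 3 by the assignment
  print '(a,i0)', "size(a)   = ", size(a)
  a = [a, 4d0, 5d0]                      ! reallocated to size 5
  print '(a,i0)', "size(a)   = ", size(a)

  name = "f_start.txt"                   ! length becomes 11
  print '(a,i0)', "len(name) = ", len(name)
  name = "f_end.txt"                     ! length becomes 9
  print '(a,i0)', "len(name) = ", len(name)

  call show(name)
  if (size(a) /= 5 .or. len(name) /= 9) error stop "unexpected size"
contains
  subroutine show(filename)
    character(*), intent(in) :: filename ! (*) takes its length from the argument
    print '(3a,i0)', "'", filename, "' has length ", len(filename)
  end subroutine show
end program realloc_demo
```

Neither the array nor the string is ever given an explicit allocate statement here; the assignments size them.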

Results

• f is advected in the +x direction at the constant velocity c

[Figure: f versus x (0 ≤ x ≤ 1) at t = 0, 0.1, 0.2, 0.3, 0.4, and 0.5]

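Porting to the GPU

• Slightly simpler than CUDA C

• If error handling is left aside, the number of changes can be kept small

• GPU control is hidden, so you can concentrate on the numerical computation

• Memory handling in C and Fortran

• C is built around pointers
  • Host variables† and device variables‡ are distinguished by changing the memory allocation function

• Fortran is built around variables
  • Host and device variables are distinguished by adding an attribute to the variable
  • The explicit change of functions is hidden

†Ordinary variables allocated in CPU memory
‡Variables allocated in GPU memory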

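Porting to the GPU

• Change the file extension to .cuf

• Reflect the GPU's requirements
  • Add attributes(global) to the subroutine
  • Add <<<:,:>>> between the subroutine name and the arguments
    • This specifies the degree of parallelism at run time
  • Write in the subroutine what a single thread processes

• Add the device attribute to memory used on the GPU

• The assignment operator (=) can be used to transfer data to and from the GPU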
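The idea that a kernel describes the work of one thread can be emulated on the CPU with two loops standing in for the block and thread indices. The sketch below is plain Fortran, not CUDA Fortran: blockIdx, threadIdx, blockDim, and gridDim are ordinary integers here, and the sizes are illustrative. It also rounds the number of blocks up and adds a bounds guard, which the talk does not need because its Nx is a power of two.

```fortran
program index_mapping
  implicit none
  integer, parameter :: n        = 10   ! array size (illustrative)
  integer, parameter :: blockDim = 4    ! threads per block (illustrative)
  integer, parameter :: gridDim  = (n + blockDim - 1) / blockDim  ! rounded up
  integer :: blockIdx, threadIdx, i
  integer :: visits(n)

  visits = 0
  ! On a GPU these two loops do not exist: each (blockIdx, threadIdx) pair
  ! executes the loop body as one thread.
  do blockIdx = 1, gridDim
    do threadIdx = 1, blockDim
      i = (blockIdx-1)*blockDim + threadIdx   ! thread-to-element mapping
      if (i <= n) visits(i) = visits(i) + 1   ! skip threads past the end
    end do
  end do

  print '(a,i0,a,i0)', "threads launched: ", gridDim*blockDim, ", elements: ", n
  if (any(visits /= 1)) error stop "an element was missed or visited twice"
end program index_mapping
```

Each array element is touched by exactly one thread, which is the property the chevron configuration has to guarantee.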



Main routine (GPU version)

program main
  use cudafor
  use parameters
  use kernel                                !the module itself is rewritten for the GPU
  implicit none

  real(8),allocatable,device :: f     (:)   !the device attribute makes these device variables
  real(8),allocatable,device :: d_f_dx(:)   !
  integer :: n

  allocate(     f(Nx))                      !memory allocation is unchanged
  allocate(d_f_dx(Nx))                      !

  call initialize<<<Blocks,Threads>>>(f)    !specify the degree of parallelism at launch
  do n=1,Nt
    call computeDifference<<<Blocks,Threads>>>(d_f_dx,f)
    call integrate<<<Blocks,Threads>>>(f,d_f_dx)
  end do

  deallocate(     f)                        !memory release is unchanged
  deallocate(d_f_dx)                        !

end program main

Module (GPU-version subroutines)

module kernel
  use cudafor
  use parameters,only:Nx
  implicit none

  type(dim3),parameter :: Threads = dim3(min(Nx,256),1,1)
  type(dim3),parameter :: Blocks  = dim3(Nx/Threads%x,1,1)

contains
  !--------------------------------------!
    :
  !--------------------------------------!

end module kernel

Derived-type (dim3) parameters are declared to specify the degree of parallelism when a kernel† is launched

By listing the values of all defined components, the derived type name (here dim3) can be used as a constructor

†Kernel: the collective term for subroutines executed on the GPU
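The structure-constructor syntax is ordinary Fortran and can be tried without a GPU compiler. In this minimal sketch, dim3_like is a hypothetical stand-in for the dim3 type that the cudafor module provides; the module and program names are also illustrative:

```fortran
module launch_config
  implicit none
  ! Hypothetical stand-in for cudafor's dim3, so no GPU compiler is needed
  type :: dim3_like
    integer :: x, y, z
  end type dim3_like

  integer, parameter :: Nx = 2**20
  ! Structure constructor: the type name followed by every component value
  type(dim3_like), parameter :: Threads = dim3_like(min(Nx, 256), 1, 1)
  type(dim3_like), parameter :: Blocks  = dim3_like(Nx/Threads%x, 1, 1)
end module launch_config

program show_config
  use launch_config
  implicit none
  print '(a,i0)', "threads per block: ", Threads%x
  print '(a,i0)', "blocks           : ", Blocks%x
  ! The launch configuration should cover the array exactly
  if (Threads%x*Blocks%x /= Nx) error stop "grid does not cover the array"
end program show_config
```

Because both values are named constants, the configuration is fixed at compile time and shared by every kernel launch that uses the module.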


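Initializing the function values (GPU version)

attributes(global)&        !attributes(global) is added so that the subroutine
subroutine initialize(f)   !is recognized as a kernel executed on the GPU
  use parameters,only:Nx,dx,Lx,PI2
  implicit none

  real(8),device,intent(inout) :: f(Nx)

  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x  !map the GPU thread to an array index

  f(i) = ((1d0-cos(PI2*dble(i-1)*dx/Lx))/2d0)**10

end subroutine initialize

$$ f(x) = \left[ \frac{1}{2}\left( 1 - \cos\frac{2\pi x}{L_x} \right) \right]^{10} $$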

Spatial derivative (GPU version)

attributes(global) subroutine computeDifference(d_f_dx,f)
  use parameters,only:Nx,dx2
  implicit none

  real(8),device,intent(out) :: d_f_dx(Nx)
  real(8),device,intent(in)  :: f     (Nx)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x

  if(i==1)then
    d_f_dx(i) = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  else if(1<i .and. i<Nx)then
    d_f_dx(i) = (f(i+1)-f(i-1))/dx2
  else if(i==Nx)then
    d_f_dx(i) = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
  end if

end subroutine computeDifference

Time integration (GPU version)

attributes(global) subroutine integrate(f,d_f_dx)
  use parameters,only:Nx,conv,dt
  implicit none

  real(8),device,intent(inout) :: f     (Nx)
  real(8),device,intent(in)    :: d_f_dx(Nx)

  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x

  !the do loop of the CPU version is removed; each thread updates one element
  f(i) = f(i) - dt*conv*d_f_dx(i)

end subroutine integrate


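File output (GPU version)

subroutine output(value,filename)
  use parameters
  implicit none

  real(8),device,intent(in) :: value(Nx)
  character(*) :: filename
  real(8),allocatable :: host_value(:)
  integer :: i

  allocate(host_value, source = value)
  open(unit=100,file=filename)
  do i=1,Nx
    write(100,*) (i-1)*dx, host_value(i)
  end do
  close(100)
  deallocate(host_value)

end subroutine output

Creating a clone of a variable with the source specifier

This single line
  • checks the size of the array value
  • allocates memory for host_value
  • copies the data (GPU → CPU)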
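The cloning behavior of the source specifier can be tried on the host alone. A minimal sketch, with illustrative names and data; note that taking the shape from source= without writing the bounds is, as far as I am aware, a Fortran 2008 relaxation that compilers of this period already accepted:

```fortran
program source_demo
  implicit none
  real(8) :: original(4)
  real(8), allocatable :: copy(:)

  original = [1d0, 2d0, 3d0, 4d0]
  allocate(copy, source = original)   ! shape and values both come from 'original'

  original(1) = -1d0                  ! the clone is independent of the original
  print '(a,i0)',    "size(copy) = ", size(copy)
  print '(a,4f5.1)', "copy       = ", copy
  if (size(copy) /= 4 .or. copy(1) /= 1d0) error stop "clone is wrong"
end program source_demo
```

In the CUDA Fortran version on the slide the same statement additionally crosses the host/device boundary, which this host-only sketch cannot show.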

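File output (GPU version)

subroutine output(value,filename)
  use parameters
  implicit none

  real(8),device,intent(in) :: value(Nx)
  character(*) :: filename
  real(8),allocatable,save :: host_value(:)
  integer :: i

  if(.not.allocated(host_value)) allocate(host_value(Nx))
  host_value = value
  open(unit=100,file=filename)
  do i=1,Nx
    write(100,*) (i-1)*dx, host_value(i)
  end do
  close(100)

end subroutine output

The save attribute keeps the memory state (allocated or not) until the program ends

The function allocated() checks that state, and memory is allocated only when the array is not yet allocated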
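The allocate-once pattern can be observed with a small host-only program. This is a minimal sketch; the counter n_alloc exists only to make the behavior visible and is not part of the pattern, and all names are illustrative:

```fortran
module buffer_demo
  implicit none
contains
  subroutine use_buffer(n)
    integer, intent(in) :: n
    real(8), allocatable, save :: work(:)   ! survives between calls
    integer, save :: n_alloc = 0            ! counts allocations (demo only)

    if (.not. allocated(work)) then         ! true only on the first call
      allocate(work(n))
      n_alloc = n_alloc + 1
    end if
    work = 0d0
    print '(a,i0,a,i0)', "size(work) = ", size(work), &
                         ", allocations so far: ", n_alloc
    if (n_alloc /= 1) error stop "buffer was allocated more than once"
  end subroutine use_buffer
end module buffer_demo

program save_demo
  use buffer_demo
  implicit none
  integer :: k
  do k = 1, 3
    call use_buffer(8)   ! the buffer is allocated on the first call only
  end do
end program save_demo
```

The trade-off is that the buffer stays allocated until the program ends, which is acceptable for an output routine called a handful of times.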

Results (execution time per step)

  Array size Nx    Execution time [ms]
                   CPU        GPU
  2^10             0.0190     0.120
  2^12             0.0720     0.110
  2^14             0.280      0.150
  2^16             1.20       0.160
  2^18             7.00       0.560
  2^20             33.0       1.90

[Figure: execution time [ms] (log scale, 10^-2 to 10^2) versus array size Nx (2^10 to 2^20) for CPU and GPU]

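Coexistence of CPU code and GPU code

• The advection equation is small in scale and simple to process

• The code could be rewritten directly without keeping the CPU version

• When the code is large

• Port from CPU code to the GPU gradually (one subroutine at a time)

• CPU code and GPU code must coexist and be switchable
  • Append to the same source file as the CPU code
  • Create a new source file separate from the CPU code and write there

• Changes to parts not directly related to GPU use should be kept to a minimum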


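Appending to the same source file as the CPU code

• Change the extension of the CPU code file to .cuf

• Add the kernels
  • Naturally, the kernel names differ from the subroutine names

• Procedure overloading
  • The procedure run on the CPU and the kernel run on the GPU are called under a common name
  • The procedure that gets called changes with the argument (host variable or device variable)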


Module with the kernels appended

module kernel
  use cudafor
  use parameters
  implicit none

    : !define the information used when launching kernels
  interface initialize
    module procedure initialize
    module procedure cufInitialize
  end interface
contains
  !--------------------------------------!
  subroutine initialize(f)
    :
  end subroutine initialize
  !--------------------------------------!
    :
  !--------------------------------------!
  attributes(global) subroutine cufInitialize(f)
    :
  end subroutine cufInitialize
  !--------------------------------------!

end module kernel

An interface is defined, and the procedures are overloaded under the name initialize

At least one argument must differ in type or attributes
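Generic resolution itself is standard Fortran and can be tried on the host. In this minimal sketch the two specific procedures differ in argument type (real versus integer) as a stand-in for the host/device distinction, which only a CUDA Fortran compiler can express; all names are illustrative:

```fortran
module init_demo
  implicit none
  private
  public :: initialize

  interface initialize               ! one public name, two specific procedures
    module procedure initialize_r8
    module procedure initialize_int
  end interface
contains
  subroutine initialize_r8(f)
    real(8), intent(out) :: f(:)
    f = 0.5d0
  end subroutine initialize_r8

  subroutine initialize_int(f)
    integer, intent(out) :: f(:)
    f = 1
  end subroutine initialize_int
end module init_demo

program overload_demo
  use init_demo
  implicit none
  real(8) :: a(3)
  integer :: b(3)
  call initialize(a)   ! resolves to initialize_r8
  call initialize(b)   ! resolves to initialize_int
  print '(3f5.1)', a
  print '(3i5)',   b
  if (any(a /= 0.5d0) .or. any(b /= 1)) error stop "wrong procedure selected"
end program overload_demo
```

The call sites are identical; the compiler picks the specific procedure at compile time from the argument declarations.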

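Switching calls through overloading

• When a host variable is passed

  real(8),allocatable :: f(:)
    :
  call initialize(f) !the subroutine initialize is called
    :

• When a device variable is passed

  real(8),allocatable,device :: f(:)
    :
  call initialize(f) !compile error
    :                !the kernel cufInitialize is selected, so the chevrons (<<<,>>>) are required

It would be very welcome if a default degree of parallelism could be defined, so that the default is used when <<<,>>> is absent and the given configuration is used when it is present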

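Writing the kernels in a source file separate from the CPU code

• Create a new file and write the kernels there

• Procedures with the same name can be defined as long as the modules differ

• Using (use) modules that define procedures of the same name causes a name clash
  • We do not want to distinguish function names between the CPU and GPU versions, and we want to limit changes at the call site

• Resolve this by renaming on use

• use module_name, local_name => procedure_name_in_module

• When several local names clash, the one read in later takes effect (this is the behavior observed with the PGI compiler used here; the Fortran standard does not permit referencing a name that is ambiguous between two modules, so portable code should use only one of the two modules at a time)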
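Renaming on use can be tried with two host-only modules that define a procedure of the same name. This minimal sketch gives the two procedures distinct local names so that it stays within the Fortran standard; the module names cpu_kernel and alt_kernel and the fill values are illustrative:

```fortran
module cpu_kernel
  implicit none
contains
  subroutine initialize(f)
    real(8), intent(out) :: f(:)
    f = 1d0
  end subroutine initialize
end module cpu_kernel

module alt_kernel
  implicit none
contains
  subroutine initialize(f)          ! same name as in cpu_kernel
    real(8), intent(out) :: f(:)
    f = 2d0
  end subroutine initialize
end module alt_kernel

program rename_demo
  ! local_name => name_in_module; the original name is not imported
  use cpu_kernel, initializeCpu => initialize
  use alt_kernel, initializeAlt => initialize
  implicit none
  real(8) :: f(4)
  call initializeCpu(f)
  print '(4f5.1)', f
  call initializeAlt(f)
  print '(4f5.1)', f
  if (any(f /= 2d0)) error stop "unexpected values"
end program rename_demo
```

To switch implementations without touching the call sites, one option is to keep a single local name and change only which use statement is active.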


Modules for the CPU version and the GPU version

CPU version

module kernel
  use parameters
  implicit none

    : !define the parameters needed for execution
contains
  !---------------------------------!
  subroutine initialize(f)
    :
    :
  end subroutine initialize
  !---------------------------------!
    :
end module kernel

GPU version

module cufKernel
  use cudafor
  use parameters
  implicit none

    : !define the parameters needed for execution
contains
  !---------------------------------!
  attributes(global)&
  subroutine initialize(f)
    :
  end subroutine initialize
  !---------------------------------!
    :
end module cufKernel

Calling a procedure through a changed local name

program main
  use parameters
  use kernel   ,initializeKernel=>initialize
  use cufKernel,initializeKernel=>initialize
  implicit none

  real(8),allocatable,device :: f     (:)
  real(8),allocatable,device :: d_f_dx(:)
  integer :: n

  allocate(     f(Nx))
  allocate(d_f_dx(Nx))

  call initializeKernel<<<Blocks,Threads>>>(f)
  do n=1,Nt
    call computeDifference<<<Blocks,Threads>>>(d_f_dx,f)
    call integrate<<<Blocks,Threads>>>(f,d_f_dx)
  end do

  deallocate(     f)
  deallocate(d_f_dx)

end program main

Setting a local name changes the procedure name used at the call site

When local names are duplicated, the one defined later takes effect

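Avoiding array copies by using pointers

program main
    :
  call initialize(f)
  call output(f,"f_start.txt")

  do n=1,Nt
    call computeDifference(d_f_dx,f)
    call integrate(f,d_f_dx)
  end do

  call output(f,"f_end.txt")
    :
end program main

Writing the derivative into the variable d_f_dx and reading it back during integration is inefficient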

Fusing differentiation and integration

subroutine computeDifferenceAndIntegrate(fnew,f)
  use parameters
  implicit none

  real(8),intent(out) :: fnew(Nx)
  real(8),intent(in ) :: f   (Nx)

  real(8) :: d_f_dx
  integer :: i

  i=1
  d_f_dx = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  fnew(i) = f(i) - conv*dt*d_f_dx

  do i=2,Nx-1
    d_f_dx = (f(i+1)-f(i-1))/dx2
    fnew(i) = f(i) - conv*dt*d_f_dx
  end do

  i=Nx
  d_f_dx = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
  fnew(i) = f(i) - conv*dt*d_f_dx

end subroutine computeDifferenceAndIntegrate

The derivative is computed and used for integration immediately

Writing into f would make the derivative at neighboring points incorrect, so a variable fnew is added to hold the updated values

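Fusing differentiation and integration

    :
  real(8),allocatable :: f   (:) !values at time step n
  real(8),allocatable :: fnew(:) !values at time step n+1
  integer :: n

  allocate(f   (Nx))
  allocate(fnew(Nx))

  call initialize(f)
  do n=1,Nt
    call computeDifferenceAndIntegrate(fnew, f)
    f = fnew
  end do

  deallocate(f   )
  deallocate(fnew)

end program main

The values of fnew are copied into f in preparation for the next time step

If the addresses of the arrays could be exchanged, copying all array elements would be avoided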

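Fortran pointers

• Declared with the pointer attribute

• Use => to associate a pointer variable with a variable
  • The variable being associated needs the target attribute

• The type and attributes of the pointer variable must correspond strictly to those of the variable it is associated with

• After association, the pointer can be used like an ordinary variable

• To point to a device variable, declare the pointer with the device,pointer attributes

  real(8),allocatable,target :: f   (:)
  real(8),allocatable,target :: fnew(:)
  integer :: n

  real(8),dimension(:),pointer :: ptr_fcrnt
  real(8),dimension(:),pointer :: ptr_fnew
  real(8),dimension(:),pointer :: ptr_swap
    :
  ptr_fcrnt => f
  ptr_fnew  => fnew
  ptr_swap  => null()

Declaring real(8),pointer :: ptr_fcrnt(:) does not give an "array of pointers" either; it gives a "pointer to a one-dimensional real array"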



  call initialize(ptr_fcrnt)
  call output(ptr_fcrnt,"f_start.txt")

  do n=1,Nt
    call computeDifferenceAndIntegrate(ptr_fnew, ptr_fcrnt)
    ptr_swap  => ptr_fcrnt
    ptr_fcrnt => ptr_fnew
    ptr_fnew  => ptr_swap
  end do
  call output(ptr_fcrnt,"f_end.txt")

  ptr_fcrnt => null()
  ptr_fnew  => null()
  ptr_swap  => null()
    :
end program main

Pointers are passed as procedure arguments

A Fortran pointer variable can be used both as a pointer and as an array
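The swap can be checked on a small host-only case. In this minimal sketch the update "add 1 to every element" is a stand-in for computeDifferenceAndIntegrate, and the sizes and names are illustrative:

```fortran
program swap_demo
  implicit none
  integer, parameter :: n = 4, nsteps = 3
  real(8), allocatable, target :: a(:), b(:)
  real(8), pointer :: fcrnt(:), fnew(:), fswap(:)
  integer :: step

  allocate(a(n), b(n))
  a = 0d0
  b = 0d0
  fcrnt => a
  fnew  => b
  fswap => null()

  do step = 1, nsteps
    fnew = fcrnt + 1d0     ! stand-in update; writes through the pointer
    fswap => fcrnt         ! exchange the roles of the two buffers:
    fcrnt => fnew          ! only the associations change,
    fnew  => fswap         ! no array elements are copied
  end do

  print '(a,4f5.1)', "current = ", fcrnt
  print '(a,l1)', "current buffer is b: ", associated(fcrnt, b)
  ! After nsteps updates every element should equal nsteps
  if (any(fcrnt /= dble(nsteps))) error stop "swap went wrong"
end program swap_demo
```

Because the result may end up in either array, later code should read through the current pointer rather than through a or b directly.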

Avoiding array copies by using pointers (GPU version)

program main
  use cudafor
  use parameters
  use kernel
  implicit none

  real(8),allocatable,device,target :: f   (:)
  real(8),allocatable,device,target :: fnew(:)
  integer :: n

  real(8),dimension(:),device,pointer :: ptr_fcrnt
  real(8),dimension(:),device,pointer :: ptr_fnew
  real(8),dimension(:),device,pointer :: ptr_swap
    :
  ptr_fcrnt => f
  ptr_fnew  => fnew
  ptr_swap  => null()

If a pointer to a device variable is declared, it can be associated with a device variable and can also be used as an array

Avoiding array copies by using pointers (GPU version)

  call initialize<<<Blocks,Threads>>>(ptr_fcrnt)
  do n=1,Nt
    call computeDifferenceAndIntegrate<<<Blocks,Threads>>>(ptr_fnew, ptr_fcrnt)

    ptr_swap  => ptr_fcrnt
    ptr_fcrnt => ptr_fnew
    ptr_fnew  => ptr_swap
  end do

  ptr_fcrnt => null()
  ptr_fnew  => null()
  ptr_swap  => null()
    :
end program main

Fusing differentiation and integration (GPU version)

attributes(global) subroutine computeDifferenceAndIntegrate(fnew,f)
  use parameters
  implicit none

  real(8),device,intent(out) :: fnew(Nx)
  real(8),device,intent(in ) :: f   (Nx)
  real(8) :: d_f_dx
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x

  !compute the spatial derivative
  if(i==1)then
    d_f_dx = (-3d0*f(i)+4d0*f(i+1)-f(i+2))/dx2
  else if(1<i .and. i<Nx)then
    d_f_dx = (f(i+1)-f(i-1))/dx2
  else if(i==Nx)then
    d_f_dx = ( 3d0*f(i)-4d0*f(i-1)+f(i-2))/dx2
  end if
  fnew(i) = f(i) - conv*dt*d_f_dx !time integration

end subroutine computeDifferenceAndIntegrate



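  Array size Nx    Execution time [ms] (with pointers)
                   CPU        GPU
  2^10             0.0140     0.0740
  2^12             0.0550     0.0600
  2^14             0.220      0.0600
  2^16             0.800      0.0840
  2^18             4.40       0.290
  2^20             14.0       0.970

[Figure: execution time [ms] (log scale, 10^-2 to 10^2) versus array size Nx (2^10 to 2^20); CPU and GPU curves for the naive implementation and for the pointer-based implementation]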

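A pitfall when handling arrays inside procedures

• In Modern Fortran, the number of elements of a dummy-argument array need not be specified
  • Element count (:): the array size is determined from the actual argument
  • Element count (*): the number of elements is unknown

attributes(global) subroutine integrate(f,d_f_dx)
  implicit none

  real(8),device,intent(inout) :: f     (:) !the element count Nx is not used inside the kernel,
  real(8),device,intent(in)    :: d_f_dx(:) !so it can be left unspecified
    :
  f(i) = f(i) - dt*conv*d_f_dx(i)
    :

This looks perfectly fine, but...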

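Result of declaring the dummy-argument arrays with (:)

[Figure: execution time [ms] (log scale, 10^-2 to 10^2) versus array size Nx (2^10 to 2^20); CPU and GPU curves for the naive implementation and for the version with element count (:)]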

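Result of declaring the dummy-argument arrays with (:)

• CPU (Fortran 90/95/2003)
  • Execution speed is essentially unaffected
  • It tends to be slightly slower, although it can be faster when the array is small

• GPU (CUDA Fortran)
  • Execution speed drops drastically
  • Execution time is almost constant regardless of problem size
  • What is the cause?

• Even when a Modern Fortran feature can be used on the GPU, you need to check that it does not affect execution speed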


Limiting the scope of porting by introducing object-oriented programming


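Object-oriented programming

• Expressing the behavior of things in the world

• Dogs and cats are both mammals, and so on...

• Objects interact with one another by exchanging messages, and so on...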


Object-oriented programming in Fortran

• One programming methodology

• Related data and processing are handled together

• Handling them together brings various benefits

• Subroutines are added to a derived type (type)

Terminology

  Fortran                  Java      C++
  derived type             class     class
  component                field     data member
  type-bound procedure     method    virtual member function


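Numerical computation with object-oriented programming

• Object-oriented programming is costly
  • Compared with procedural programming, the number of operations, memory usage, and processing time all increase

• Maintainability, extensibility, and reusability are valued even at the cost of some redundancy
  • Eliminating human error when writing programs

• Incorporating FORTRAN 77 style programs
  • Seamless extension of existing programs
  • The aim is a framework that brings programs lying unused together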


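Making the advection computation object-oriented

• Procedural

  real(8) :: f(N), d_f_dx(N)
  do n=1, n_end
    call computeDifference(f, d_f_dx, N)
    f(:) = f(:) - dt*c*d_f_dx(:)
  end do

  The physical quantity and its derivative are declared separately

  The derivative value and the derivative computation are separate

• Object-oriented programming (how we would like to write it)

  type(Field) :: f
  do n=1, n_end
    f = f - dt*c*f%x()
  end do

  A derived type that bundles the physical quantity, its derivative, and the derivative computation

  The code is made to resemble the formulas written in textbooks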

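One way of looking at the advection equation

• A field is made up of several physical quantities

• The value of a physical quantity and its derivatives are inseparable

• Differentiation is an operation on each physical quantity

• Integration is an operation on the field (all physical quantities)

$$ \frac{\partial f}{\partial t} + c\,\frac{\partial f}{\partial x} = 0 $$

[Diagram: a field contains physical quantities (here f); each physical quantity holds a value and its derivative values]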

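Defining the array type that holds values

type :: array

  real(8),allocatable,private :: array(:)

contains

  procedure,public,pass :: construct      !dynamically allocate the component array
  procedure,public,pass :: destruct       !release the allocated array

  procedure,public,pass :: all            !return a pointer to the component array
  procedure,public,pass :: getPointer     !return a pointer to the object itself

  procedure,public,pass :: assign
  procedure,public,pass :: add
  procedure,public,pass :: multiplyScalar
  procedure,public,pass :: divideScalar
  generic :: assignment(=) => assign
  generic :: operator(+)   => add
  generic :: operator(*)   => multiplyScalar
  generic :: operator(/)   => divideScalar

end type array

The arithmetic operations are defined by operator overloading

Operators needed
  =  assignment of arrays
  +  addition of two arrays
  *  multiplication of an array by a scalar
  /  division of an array by a scalar

The component cannot be given the target attribute

It can be given the pointer attribute, but the behavior is questionable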



subroutine assign(lhs,rhs)
  implicit none
  class(array),intent(inout) :: lhs
  class(array),intent(in   ) :: rhs

  lhs%array(:) = rhs%array(:)

end subroutine assign

function add(term1,term2) result(sum)
  use parameters
  implicit none
  class(array),intent(in) :: term1
  class(array),intent(in) :: term2
  class(array),allocatable :: sum

  allocate(sum)
  call sum%construct(Nx)
  sum%array(:) = term1%array(:)+term2%array(:)

end function add

Procedure that performs assignment: it is called when the assignment operator (=) is applied to a variable of type array

Procedure that performs addition: it is called when the addition operator (+) is written between two objects of type array
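A stripped-down version of the same idea can be compiled on its own. This is a minimal sketch, not the author's class: the type vec, its component v, and the procedures construct and total are illustrative, and intrinsic assignment is used instead of a user-defined assignment(=):

```fortran
module vec_demo
  implicit none
  private
  public :: vec

  type :: vec
    real(8), allocatable, private :: v(:)
  contains
    procedure, public,  pass :: construct
    procedure, public,  pass :: total
    procedure, private, pass :: add
    procedure, private, pass :: multiplyScalar
    generic, public :: operator(+) => add
    generic, public :: operator(*) => multiplyScalar
  end type vec
contains
  subroutine construct(this, n, val)
    class(vec), intent(inout) :: this
    integer, intent(in) :: n
    real(8), intent(in) :: val
    if (allocated(this%v)) deallocate(this%v)
    allocate(this%v(n))
    this%v = val
  end subroutine construct

  function add(lhs, rhs) result(res)        ! called for vec + vec
    class(vec), intent(in) :: lhs, rhs
    type(vec) :: res
    res%v = lhs%v + rhs%v                   ! res is a temporary object
  end function add

  function multiplyScalar(lhs, s) result(res)   ! called for vec * real(8)
    class(vec), intent(in) :: lhs
    real(8), intent(in) :: s
    type(vec) :: res
    res%v = lhs%v * s
  end function multiplyScalar

  function total(this) result(s)
    class(vec), intent(in) :: this
    real(8) :: s
    s = sum(this%v)
  end function total
end module vec_demo

program oop_demo
  use vec_demo
  implicit none
  type(vec) :: f, g
  call f%construct(4, 1d0)
  call g%construct(4, 2d0)
  f = f + g*0.5d0          ! reads like the formula; two temporaries are created
  print '(a,f5.1)', "sum = ", f%total()
  if (abs(f%total() - 8d0) > 1d-12) error stop "unexpected result"
end program oop_demo
```

Every operator call returns a freshly allocated object, which is the source of the overhead discussed later in the talk.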



function all(this) result(realPtr)
  use iso_c_binding
  use parameters
  implicit none
  class(array),intent(in) :: this
  real(8),dimension(:),pointer :: realPtr

  call c_f_pointer( c_ptr(c_loc(this%array)),&
                    realPtr, (/Nx/) )

end function all

!procedure that returns a pointer to the array-type object
!that called getPointer
function getPointer(this) result(ptr)
  implicit none
  class(array),intent(in),target :: this
  type(array),pointer :: ptr
  ptr=>this

end function getPointer

all: procedure that returns a pointer to the component array (a real array) of the derived type array

A component of a derived type cannot have the target attribute, so a C pointer is created first and then converted into a Fortran pointer

  c_loc        extracts the address of a variable
  c_ptr        constructor of the C pointer type type(c_ptr)
  c_f_pointer  converts a C pointer into a Fortran pointer

This makes it possible to overwrite a variable that was hidden with private!
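The round trip through a C pointer can be tried on the host. The sketch below is an assumption-laden variant rather than the slide's code: it puts the target attribute on the whole object (which is allowed, unlike on a component) and passes the result of c_loc directly, since c_loc already returns a type(c_ptr). The type holder and the sizes are illustrative:

```fortran
program cptr_demo
  use iso_c_binding, only: c_ptr, c_loc, c_f_pointer
  implicit none

  type :: holder
    real(8), allocatable :: array(:)
  end type holder

  type(holder), target :: h        ! target on the object, not on the component
  type(c_ptr)          :: address
  real(8), pointer     :: view(:)

  allocate(h%array(5))
  h%array = 0d0

  address = c_loc(h%array)                 ! address of the component's data
  call c_f_pointer(address, view, [5])     ! Fortran pointer of shape (5)

  view(3) = 42d0                           ! writes straight into h%array
  print '(5f6.1)', h%array
  if (h%array(3) /= 42d0) error stop "pointer does not alias the component"
end program cptr_demo
```

When the object itself is a target, the plain association view => h%array gives the same aliasing without going through C pointers; the C route matters when the target attribute is not available.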

Defining the ScalarVariable type that represents a physical quantity

type :: ScalarVariable

  type(array),public :: value
  type(array),public :: d_v_dx
  type(array),public :: d_v_dt
  logical,public :: d_v_dxCalculated = .false.
  logical,public :: d_v_dtCalculated = .false.
  logical,public :: updated = .false.

contains
  procedure,public,pass :: construct  !calls the constructors of the value and derivatives
  procedure,public,pass :: destruct   !calls the destructors of the value and derivatives
  procedure,public,pass :: initialize
  procedure,public,pass :: x
  procedure,public,pass :: update

  procedure,public,pass :: assign
  procedure,public,pass :: addArray
  generic :: assignment(=) => assign
  generic :: operator(+) => addArray

end type ScalarVariable

The value of the physical quantity, its spatial derivative, and its time derivative are defined as components

Initialization and spatial differentiation are defined as operations on the physical quantity

Defining the ScalarVariable type that represents a physical quantity

subroutine initialize(this)
  use kernel, initializeKernel=>initialize
  implicit none
  class(ScalarVariable) :: this

  call initializeKernel(this%value%all())
end subroutine initialize

function x(this) result(d_v_dx)
  use parameters,only:Nx
  use kernel,computeDifferenceKernel=>computeDifference
  implicit none
  class(ScalarVariable) :: this
  type(array),pointer :: d_v_dx

  call computeDifferenceKernel(this%d_v_dx%all(),this%value%all())

  d_v_dx => this%d_v_dx%getPointer()
  this%d_v_dxCalculated = .true.
  this%updated = .false.

end function x

initialize: procedure that performs initialization

To make it easy to switch implementations, it calls a procedure defined in another module

x: procedure that performs spatial differentiation

It also calls a procedure in another module

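Defining the Field type that represents the field

type :: Field

  type(ScalarVariable),private :: f

contains

  procedure,public,pass :: construct  !calls the constructor of each physical quantity
  procedure,public,pass :: destruct   !calls the destructor of each physical quantity

  procedure,public, pass :: initialize
  procedure,private,pass :: x
  procedure,public ,pass :: t
  procedure,private,pass :: update

  procedure,public,pass :: assign
  procedure,public,pass :: addArray
  generic :: assignment(=) => assign
  generic :: operator(+) => addArray

end type Field

The physical quantities are held as components (here only f)

Procedures that initialize the field and compute the spatial and time derivatives are defined

In practice they call the initialization and spatial-derivative procedures of each physical-quantity type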

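Defining the Field type that represents the field

function t(this) result(d_f_dt)
  use parameters
  use class_array
  implicit none
  class(Field) :: this
  type(array),pointer :: d_f_dt

  this%f%d_v_dt = this%x()*(-conv)

  d_f_dt => this%f%d_v_dt%getPointer()
  this%f%d_v_dtCalculated = .true.
  this%f%updated = .false.

end function t

function x(this) result(d_v_dx)
  use class_array
  implicit none
  class(Field) :: this
  type(array),pointer :: d_v_dx

  d_v_dx=>this%f%x()
end function x

t: procedure that computes the time derivative of the field

$$ \frac{\partial f}{\partial t} = -c\,\frac{\partial f}{\partial x} $$

This procedure expresses the advection equation

x: computes the spatial derivative of the field by calling the spatial-derivative procedure of each physical quantity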

Main routine

program main
  use parameters
  use class_Field
  implicit none

  type(Field) :: f
  integer :: n

  call f%initialize()
  do n=1,Nt

    print *,n

    f = f + f%t()*dt

  end do

end program main

The update is written in the same form as the definition of the Euler method found in textbooks

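The derived types and their procedure calls

[Diagram: the main-loop statement f = f + f%t()*dt and the three derived types below, connected by arrows showing type usage and procedure calls]

Field
  Components: physical quantity
  Procedures: field initialization, time-derivative computation, spatial-derivative computation, assignment operator, addition operator

ScalarVariable
  Components: value, time derivative, spatial derivative
  Procedures: value initialization, spatial-derivative computation, assignment operator, addition operator

array
  Components: value
  Procedures: assignment operator, addition operator, multiplication operator, division operator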

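Main routine (changed to the modified Euler method)

    :
  type(Field) :: f
  type(Field) :: f05
  integer :: n
    :
    print *,n

    f05 = f + f%t()*dt
    f   = f + (f%t()+f05%t())/2d0*dt

  end do

end program main

The time integration is changed to the modified Euler method

The change is made in the same form as the textbook formulas, without adding a single procedure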
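For reference, the two statements in the loop correspond to the following predictor and corrector steps. The notation is assumed here: F(f) stands for the right-hand side returned by f%t(), and f^{*} corresponds to the variable f05.

```latex
\begin{align}
  F(f)    &= -c\,\frac{\partial f}{\partial x} \\
  f^{*}   &= f^{n} + \Delta t\, F(f^{n}) \\
  f^{n+1} &= f^{n} + \frac{\Delta t}{2}\left[ F(f^{n}) + F(f^{*}) \right]
\end{align}
```

The first step is the same Euler update as before, so the only new code is the second statement.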

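Porting to the GPU

• The array type, which handles numerical values
  • Write kernels that perform the arithmetic operations on the GPU

• The ScalarVariable type, which represents a variable
  • Write kernels that perform initialization and differentiation on the GPU
  • Existing kernels can be reused
    • Reusing them requires changing only two lines
  • In the first place, a kernel cannot be defined as a type-bound procedure
    • Whether this will become possible in the future is unknown
  • A procedure in an external module must always be called

• The Field type, which represents the field
  • No changes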


The array type (GPU version)

type :: array

  real(8),allocatable,private,device :: array(:)

contains

  procedure,public,pass :: construct      !dynamically allocate the component array
  procedure,public,pass :: destruct       !release the allocated array

  procedure,public,pass :: all            !return a pointer to the component array
  procedure,public,pass :: getPointer     !return a pointer to the object itself

  procedure,public,pass :: assign
  procedure,public,pass :: add
  procedure,public,pass :: multiplyScalar
  procedure,public,pass :: divideScalar
  generic :: assignment(=) => assign
  generic :: operator(+)   => add
  generic :: operator(*)   => multiplyScalar
  generic :: operator(/)   => divideScalar

end type array

Operators of the array type (GPU version)

subroutine assign(lhs,rhs)
  use parameters,only:Nx
  implicit none
  class(array),intent(inout) :: lhs
  class(array),intent(in   ) :: rhs
  integer :: stat
  stat = cudaMemcpy(lhs%array,rhs%array,Nx,cudaMemcpyDeviceToDevice)

end subroutine assign

function add(term1,term2) result(sum)
  use parameters,only:Nx
  use cufParameters,only:Blocks,Threads
  use arrayOperator,only:addKernel=>addArrayKernel
  implicit none
  class(array),intent(in) :: term1
  class(array),intent(in) :: term2
  class(array),allocatable :: sum

  allocate(sum)
  call sum%construct(Nx)
  call addKernel<<<Blocks, Threads>>>(sum%array,term1%array,term2%array)

end function add

The assignment is changed to cudaMemcpy

A kernel that performs the addition is written and called from inside the procedure that overloads the addition operator

Kernel that performs the addition

attributes(global) subroutine addArrayKernel(result,term1,term2)
  use parameters,only:Nx
  implicit none
  real(8),intent(out),device :: result(Nx)
  real(8),intent(in ),device :: term1(Nx)
  real(8),intent(in ),device :: term2(Nx)
  integer :: i
  i = (blockIdx%x-1)*blockDim%x + threadIdx%x

  result(i) = term1(i) + term2(i)

end subroutine addArrayKernel

Changes to the ScalarVariable type (GPU version)

subroutine initialize(this)
  use cufKernel, only:initializeKernel=>cufinitialize
  use cufParameters
  implicit none
  class(ScalarVariable) :: this

  call initializeKernel<<<Blocks, Threads>>>(this%value%all())
end subroutine initialize

function x(this) result(d_v_dx)
  use parameters,only:Nx
  use cufKernel,only:computeDifferenceKernel=>cufComputeDifference
  use cufParameters
  implicit none
  class(ScalarVariable) :: this
  type(array),pointer :: d_v_dx

  call computeDifferenceKernel<<<Blocks, Threads>>>(this%d_v_dx%all(),this%value%all())

  d_v_dx => this%d_v_dx%getPointer()
end function x

initialize calls the existing initialization kernel (already written on an earlier slide)

x calls the existing spatial-derivative kernel (already written on an earlier slide)

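Main routine (unchanged)

    :
    print *,n

    f = f + f%t()*dt

  end do

end program main

[Diagram: Field (physical quantity), ScalarVariable, and array (value; addition, multiplication, division; value initialization), connected by procedure calls and annotated with the changes needed to run on the GPU]

To run on the GPU
  • Kernels are written, or existing kernels are reused
  • Procedures are changed slightly so that they call the kernels
  • The device attribute is added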



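  Array size Nx    Execution time [ms] (OOP)
                   CPU        GPU
  2^10             0.190      3.10
  2^12             0.800      2.50
  2^14             3.90       2.70
  2^16             19.0       8.00
  2^18             98.0       17.0
  2^20             415        36.5

[Figure: execution time [ms] (log scale, 10^-2 to 10^3) versus array size Nx (2^10 to 2^20); CPU and GPU curves for the OOP implementation and for the pointer-based implementation]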

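Conclusion

• The features of Fortran 90/95/2003 (Modern Fortran) were introduced briefly

• Modern Fortran features were used to compute the 1D advection equation, and the code was ported to the GPU
  • Most of the features usable in CPU code can also be used on the GPU
  • With CUDA Fortran, execution time can change drastically in some cases

• The 1D advection program was written with object-oriented programming and ported to the GPU
  • The scope of the changes needed for GPU porting can be limited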


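Summary

• Extremely useful
  • Memory management: allocate/deallocate, pointer, source specifier
    • When arrays are taken as arguments, care is needed in how the element count is specified

• Useful in places
  • Subroutine overloading, renaming on use
  • Object-oriented programming (execution control, encapsulation)

• Not usable in practice
  • Object-oriented programming (defining operations on types)
    • Creating and destroying temporary objects is expensive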

