Parallel Programming with CUDA FortranCUDA is a scalable programming model for parallel computing...
Transcript of Parallel Programming with CUDA FortranCUDA is a scalable programming model for parallel computing...
Parallel Programming with CUDA Fortran
© NVIDIA Corporation 2011
Outline
What is CUDA FortranSimple ExamplesCUDA Fortran FeaturesUsing CUBLAS with CUDA FortranCompilation
© NVIDIA Corporation 2011
CUDA Fortran
CUDA is a scalable programming model for parallel computing
CUDA Fortran is the Fortran analog of CUDA CProgram host and device code similar to CUDA CHost code is based on Runtime APIFortran language extensions to simplify data management
Co-defined by NVIDIA and PGI, implemented in the PGI Fortran compiler
Separate from PGI AcceleratorDirective-based, OpenMP-like interface to CUDA
© NVIDIA Corporation 2011
CUDA Programming
Heterogeneous programming modelCPU and GPU are separate devices with separate memory spacesHost code runs on the CPU
Handles data management for both the host and deviceLaunches kernels which are subroutines executed on the GPU
Device code runs on the GPUExecuted by many GPU threads in parallel
Allows for incremental development
© NVIDIA Corporation 2011
F90 Example
module simpleOps_mcontains subroutine inc(a, b) implicit none integer :: a(:) integer :: b integer :: i, n
n = size(a) do i = 1, n a(i) = a(i)+b enddo end subroutine incend module simpleOps_m
program incTest use simpleOps_m implicit none integer, parameter :: n = 256 integer :: a(n), b
a = 1 ! array assignment b = 3 call inc(a, b)
if (all(a == 4)) then write(*,*) 'Success' endif
end program incTest
© NVIDIA Corporation 2011
CUDA Fortran - Host Code
CUDA Fortranprogram incTest use cudafor use simpleOps_m implicit none integer, parameter :: n = 256 integer :: a(n), b integer, device :: a_d(n)
a = 1 b = 3
a_d = a call inc<<<1,n>>>(a_d, b) a = a_d
if (all(a == 4)) then write(*,*) 'Success' endifend program incTest
F90program incTest use simpleOps_m implicit none integer, parameter :: n = 256 integer :: a(n), b
a = 1 b = 3
call inc(a, b)
if (all(a == 4)) then write(*,*) 'Success' endifend program incTest
© NVIDIA Corporation 2011
CUDA Fortran - Device Code
CUDA Fortran
module simpleOps_mcontains attributes(global) subroutine inc(a, b) implicit none integer :: a(:) integer, value :: b integer :: i
i = threadIdx%x a(i) = a(i)+b
end subroutine incend module simpleOps_m
F90
module simpleOps_mcontains subroutine inc(a, b) implicit none integer :: a(:) integer :: b integer :: i, n
n = size(a) do i = 1, n a(i) = a(i)+b enddo end subroutine incend module simpleOps_m
© NVIDIA Corporation 2011
Extending to Larger Arrays
Previous example works for small arrays
Limit of n=1024 (Fermi) or n=512 (pre-Fermi)
For larger arrays, change the first Execution Configuration parameter (<<<1,n>>>)
call inc<<<1,n>>>(a_d,b)
© NVIDIA Corporation 2011
Execution Model
...
ThreadThread
Processor
ThreadBlock Multiprocessor
GridDevice
SoftwareThreads are executed by thread processors
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on a multiprocessor
Hardware
A kernel is launched on a device as a grid of thread blocks
© NVIDIA Corporation 2011
Execution Configuration
Execution configuration specified on host code
Previous example used a single thread block
Multiple threads blocks
call inc<<<blocksPerGrid, threadsPerBlock>>>(a_d,b)
call inc<<<1,n>>>(a_d,b)
tPB = 256call inc<<<ceiling(real(n)/tPB),tPB>>>(a_d,b)
© NVIDIA Corporation 2011
Large Array - Host Codeprogram incTest use cudafor use simpleOps_m implicit none integer, parameter :: n = 1024*1024 integer, parameter :: tPB = 256 integer :: a(n), b integer, device :: a_d(n)
a = 1 b = 3
a_d = a call inc<<<ceiling(real(n)/tPB),tPB>>>(a_d, b) a = a_d
if (all(a == 4)) then write(*,*) 'Success' endifend program incTest
© NVIDIA Corporation 2011
Large Array - Device Code
module simpleOps_mcontains attributes(global) subroutine inc(a, b) implicit none integer :: a(:) integer, value :: b integer :: i, n
i = (blockIdx%x-1)*blockDim%x + threadIdx%x n = size(a) if (i <= n) a(i) = a(i)+b
end subroutine incend module simpleOps_m
© NVIDIA Corporation 2011
Multidimensional Arrays - Host
Execution Configuration
Grid dimensions in blocks (blocksPerGrid) and block dimensions (threadsPerBlock) can be either integer or of type dim3
call inc<<<blocksPerGrid, threadsPerBlock>>>(a_d,b)
type (dim3) integer (kind=4) :: x, y, zend type
© NVIDIA Corporation 2011
Multidimensional Arrays - Device
Predefined variables in device subroutinesGrid and block dimensions - gridDim, blockDimBlock and thread indices - blockIdx, threadIdxOf type dim3
blockIdx and threadIdx fields have unit offset
type (dim3) integer (kind=4) :: x, y, zend type
1 <= blockIdx%x <= gridDim%x
© NVIDIA Corporation 2011
2D Example - Host Codeprogram incTest use cudafor use simpleOps_m implicit none integer, parameter :: nx=1024, ny=512 real :: a(nx,ny), b real, device :: a_d(nx,ny) type(dim3) :: grid, tBlock
a = 1; b = 3 tBlock = dim3(32,8,1) grid = dim3(ceiling(real(nx)/tBlock%x), ceiling(real(ny)/tBlock%y), 1) a_d = a call inc<<<grid,tBlock>>>(a_d, b) a = a_d
write(*,*) 'Max error: ', maxval(abs(a-4))end program incTest
© NVIDIA Corporation 2011
2D Example - Device Code
module simpleOps_mcontains attributes(global) subroutine inc(a, b) implicit none real :: a(:,:) real, value :: b integer :: i, j
i = (blockIdx%x-1)*blockDim%x + threadIdx%x j = (blockIdx%y-1)*blockDim%y + threadIdx%y if (i<=size(a,1) .and. j<=size(a,2)) & a(i,j) = a(i,j) + b
end subroutine incend module simpleOps_m
© NVIDIA Corporation 2011
CUDA Fortran Features
Variable QualifiersSubroutine/Function QualifiersKernel Loop Directives (CUF Kernels)
© NVIDIA Corporation 2011
Variable Qualifiers
Analogous to CUDA Cdeviceconstant
Read-only memory (device code) cached on-chipshared
On-chip, shared between threads of a thread block
Additionalpinned
Page-locked host memoryvalue
Pass-by-value dummy arguments in device code
Textures will be available in 12.0
© NVIDIA Corporation 2011
Function/Subroutine Qualifiers
Designated by attributes() specifierattributes(host)
called from host and runs on host (default)attributes(global)
kernel, called from host runs on devicesubroutine onlyno other prefixes allowed (recursive, elemental, or pure)
attributes(device)called from and runs on devicecan only appear within a Fortran moduleonly additional prefix allowed is function return type
© NVIDIA Corporation 2011
Kernel Loop Directives (CUF Kernels)
Automatic kernel generation and invocation of host code region containing tightly nested loops
Can specify parts of execution configuration
!$cuf kernel do(2) <<< *,* >>>do j=1, ny do i = 1, nx a_d(i,j) = b_d(i,j) + c_d(i,j) enddoenddo
!$cuf kernel do(2) <<<(*,*),(32,4)>>>
© NVIDIA Corporation 2011
Reduction using CUF Kernels
Compiler recognizes use of scalar reduction and generates one result
rsum = 0.0!$cuf kernel do <<<*,*>>>do i = 1, nx rsum = rsum + a_d(i)enddo
© NVIDIA Corporation 2011
Calling CUBLAS from CUDA Fortran
Module which defines interfaces to CUBLAS from CUDA Fortranuse cublas
Interfaces in three formsOverloaded BLAS interfaces that take device array arguments
call saxpy(n, a_d, x_d, incx, y_d, incy)
Legacy CUBLAS interfacescall cublasSaxpy(n, a_d, x_d, incx, y_d, incy)
Multi-GPU version (CUDA 4.0) that utilizes a handle histat = cublasSaxpy_v2(h, n, a_d, x_d, incx, y_d, incy)
Mixing the three forms is allowed
© NVIDIA Corporation 2011
Calling CUBLAS from CUDA Fortranprogram cublasTest use cublas implicit none
real, allocatable :: a(:,:),b(:,:),c(:,:) real, device, allocatable :: a_d(:,:),b_d(:,:),c_d(:,:) integer :: k=4, m=4, n=4 real :: alpha=1.0, beta=2.0, maxError
allocate(a(m,k), b(k,n), c(m,n), a_d(m,k), b_d(k,n), c_d(m,n))
a = 1; a_d = a b = 2; b_d = b c = 3; c_d = c
call cublasSgemm('N','N',m,n,k,alpha,a_d,m,b_d,k,beta,c_d,m)
c=c_d write(*,*) 'Maximum error: ', maxval(abs(c-14.0))
deallocate (a,b,c,a_d,b_d,c_d)
end program cublasTest
© NVIDIA Corporation 2011
Compilation
Source-to-source compilation (generates CUDA C)pgfortran - PGI’s Fortran compilerAll source code with .cuf or .CUF is compiled as CUDA Fortran enabled automaticallyFlag to target architecture (eg. -Mcuda=cc20)
-Mcuda=emu specifies emulation modeFlag to target toolkit version (eg. -Mcuda=cuda4.0)-Mcuda=fastmath enables faster intrinsics (__sinf())-Mcuda=nofma turns off fused multiply-add-Mcuda=maxregcount:<n> limits register use per thread-Mcuda=ptxinfo prints memory usage per kernel
© NVIDIA Corporation 2011
Summary
CUDA Fortran provides a convenient interface for parallel programmingFortran analog to CUDA C
CUDA Fortran has strong typing that allows simplified data managementFortran 90’s array features carried to GPU
More info available at http://www.pgroup.com/cudafortran
Parallel Programming with CUDA Fortran
© NVIDIA Corporation 2011
Runtime API (Host)
Runtime API defined in cudafor moduleDevice management (cudaGetDeviceCount, cudaSetDevice, ...)Host-device synchronization (cudaDeviceSynchronize)Memory management (cudaMalloc/cudaFree, cudaMemcpy, cudaMemcpyAsync, ...)
Mixing cudaMalloc/cudaFree with Fortran allocate/deallocate on a given array is not supportedFor device data, counts are in units of elements, not bytes
Stream management (cudaStreamCreate, cudaStreamSynchronize, ...)Event management (cudaEventCreate, cudaEventRecord, ...)Error handling (cudaGetLastError, ...)
© NVIDIA Corporation 2011
Device Intrinsics
syncthreads subroutineBarrier synchronization for all threads in thread block
gpu_time subroutineReturns value of clock cycle counter on GPU
Atomic functions