Languages, Libraries and Development Tools for GPU...
Transcript of Languages, Libraries and Development Tools for GPU...
![Page 1: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/1.jpg)
Languages, Libraries and Development Tools for GPU Computing
![Page 2: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/2.jpg)
GPU CPU
![Page 3: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/3.jpg)
GPUs have evolved to the point where many real-world
applications are easily implemented on them and run
significantly faster than on multi-core systems.
Future computing architectures will be hybrid systems with
parallel-core GPUs working in tandem with multi-core CPUs.
“
” Jack Dongarra Professor, University of Tennessee Director of the Innovative Computing Laboratory Author of LINPACK
![Page 4: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/4.jpg)
Small Changes, Big Speed-up
Application Code
+
GPU CPU Use GPU to Parallelize
Compute-Intensive Functions Rest of Sequential
CPU Code
![Page 5: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/5.jpg)
In testing our key applications, the Tesla GPUs
delivered speed-ups that we had never seen before,
sometimes even orders of magnitude.
“ ”
Satoshi Matsuoka Professor, Global Scientific Information and Computing Center, Tokyo Institute of Technology Technical Lead for the TSUBAME Supercomputer
![Page 6: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/6.jpg)
CUDA Parallel Computing Platform
Hardware
Capabilities
GPUDirect SMX Dynamic Parallelism HyperQ
Programming
Approaches Libraries
“Drop-in” Acceleration
Programming
Languages
OpenACC
Directives
Maximum Flexibility Easily Accelerate Apps
Development
Environment
Nsight IDE Linux, Mac and Windows
GPU Debugging and Profiling
CUDA-GDB debugger
NVIDIA Visual Profiler
Open Compiler
Tool Chain Enables compiling new languages to CUDA platform, and
CUDA languages to other architectures
![Page 7: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/7.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in”
Acceleration
Programming
Languages
Maximum
Flexibility
OpenACC
Directives
Easily Accelerate
Applications
Libraries
![Page 8: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/8.jpg)
NVIDIA cuFFT NVIDIA cuSPARSE
GPU Accelerated Libraries “Drop-in” Acceleration for your Applications
NVIDIA cuBLAS
NVIDIA cuRAND
NVIDIA NPP
Vector Signal Image Processing
Matrix Algebra on GPU and Multicore
C++ Templated Parallel Algorithms IMSL Library
GPU Accelerated Linear Algebra
Building-block Algorithms CenterSpace NMath
![Page 9: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/9.jpg)
x
1x
2x
3x
4x
5x
6x
7x
8x
9x
10x
Speedup over MKL
cuBLAS Level 2 Performance Up to 1 TFLOPS sustained performance and 7x faster than Intel MKL
• cuBLAS 5.0 on K20X, input and output data on device
• MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz
0
500
1000
1500
2000
2500
3000
GFLOPS
Performance may vary based on system configuration
![Page 10: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/10.jpg)
ZGEMM Performance vs. Matrix Size
• cuBLAS 5.0 on K20X, input and output data on device
• MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz
0
200
400
600
800
1000
1200
0 512 1024 1536 2048 2560 3072 3584 4096
GF
LO
PS
Matrix Size
cuBLAS MKL
Performance may vary based on system configuration
![Page 11: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/11.jpg)
cuSPARSE: Sparse linear algebra routines
Sparse matrix-vector multiplication & triangular solve
Tri-diagonal solver up to 16x faster vs. Intel MKL
Incomplete-LU & -Cholesky preconditioners (ilu0 and ic0)
Format conversion: HYB, CSR, BlockCSR, COO, CSC, dense
1.0
2.0
3.0
4.0
y1
y2
y3
y4
\alpha + \beta
1.0
6.0
4.0
7.0
3.0 2.0
5.0
y1
y2
y3
y4
developer.nvidia.com/cusparse
![Page 12: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/12.jpg)
cuSPARSE: up to 12x Faster than MKL Sparse Matrix x 6 Dense Vectors
(useful for block iterative solvers)
Performance may vary based on system configuration
• Average of s/d/c/z routines
• cuSPARSE 5.0 on K20X, input and output data on device
• MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz
x
2x
4x
6x
8x
10x
12x
14x
Sp
ee
du
p o
ve
r M
KL
![Page 13: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/13.jpg)
NVIDIA Performance Primitives
Thousands of GPU-accelerated image & signal processing functions
Arithmetic, Logic, Conversions, Filters, Statistics, etc.
Up to 40x faster performance than Intel IPP
* NPP 4.1, NVIDIA C2050 (Fermi)
* IPP 6.1, Dual Socket Core™ i7 920 @ 2.67GHz
developer.nvidia.com/content/graphcuts-using-npp
![Page 14: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/14.jpg)
CenterSpace NMath GPU-accelerated math for .NET languages
NMath 5.3 Premium Edition
Dense matrix operations & Fast Fourier Transfoms
Singular Value Decomposition (SVD) and QR decomposition
LU factorization, Least squares, and Eigenvalue routines
No GPU programming experience required
No changes to existing NMath code required
Automatic fallback when there’s no GPU
2-4x faster than CPU-only
Sign up at:
centerspace.net/nmath-premium-beta
![Page 15: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/15.jpg)
GPUDirect™ RDMA GPUDirect™ P2P Transfers
GPU-Aware MPI Libraries Integrated Support for GPU Computing
developer.nvidia.com/gpudirect
OpenMPI IBM Platform MPI MVAPICH
PCI-e
CPU
Chip
set
System
Memory
GPU
GPU
Memory
IB
Cray MPI
![Page 16: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/16.jpg)
// generate 32M random numbers on host thrust::host_vector<int> h_vec(32 << 20); thrust::generate(h_vec.begin(), h_vec.end(), rand); // transfer data to device (GPU) thrust::device_vector<int> d_vec = h_vec; // sort data on device thrust::sort(d_vec.begin(), d_vec.end()); // transfer data back to host thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
Thrust: Rapid Parallel C++ Development
Similar to C++ STL
High-level interface
Enhances developer productivity
Enables performance portability between GPUs and multicore CPUs
Flexible
Backends for CUDA, OpenMP, TBB
Extensible and customizable
Integrates with existing software
Open source
thrust.github.com or developer.nvidia.com/thrust
![Page 17: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/17.jpg)
HEMI: Fully Portable C++ GPU Computing
Functions compile and run on either CPU or GPU
C++ classes can be used on CPU and GPU
Simplified memory management
Minimize platform-specific code
HEMI_KERNEL(solve)(float *out, float *n, int N) { … }
HEMI_KERNEL_LAUNCH(solve, gDim, bDim, 0, 0, output, input, N);
CPU
GPU
NVCC
compiler
MS/GNU
compiler
github.com/harrism/hemi
![Page 18: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/18.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in”
Acceleration
Programming
Languages OpenACC
Directives
Maximum
Flexibility
Easily Accelerate
Applications
![Page 19: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/19.jpg)
Focus on Expressing Parallelism
With Directives, tuning work focuses on expressing parallelism,
which makes codes inherently better
Example: Application tuning work using directives for new Titan system at ORNL
S3D Research more efficient combustion with next-generation fuels
CAM-SE Answer questions about specific climate change adaptation and mitigation scenarios
• Tuning top 3 kernels (90% of runtime) • 3 to 6x faster on CPU+GPU vs. CPU+CPU • But also improved all-CPU version by 50%
• Tuning top key kernel (50% of runtime) • 6.5x faster on CPU+GPU vs. CPU+CPU • Improved performance of CPU version by 100%
![Page 20: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/20.jpg)
OpenACC Open Programming Standard for Parallel Computing
“OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO OpenMP Directives Board
OpenACC Standard
www.nvidia.com/gpudirectives
![Page 21: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/21.jpg)
subroutine saxpy(n, a, x, y)
real :: x(n), y(n), a
integer :: n, i
do i=1,n
y(i) = a*x(i)+y(i)
enddo
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d,
y_d)
...
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
A Very Simple Exercise: SAXPY SAXPY in C SAXPY in Fortran
![Page 22: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/22.jpg)
subroutine saxpy(n, a, x, y)
real :: x(n), y(n), a
integer :: n, i
!$omp parallel do
do i=1,n
y(i) = a*x(i)+y(i)
enddo
!$omp end parallel do
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d,
y_d)
...
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma omp parallel for
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
A Very Simple Exercise: SAXPY OpenMP SAXPY in C SAXPY in Fortran
![Page 23: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/23.jpg)
subroutine saxpy(n, a, x, y)
real :: x(n), y(n), a
integer :: n, i
!$acc parallel loop
do i=1,n
y(i) = a*x(i)+y(i)
enddo
!$acc end parallel loop
end subroutine saxpy
...
! Perform SAXPY on 1M elements
call saxpy(2**20, 2.0, x_d,
y_d)
...
void saxpy(int n,
float a,
float *x,
float *restrict y)
{
#pragma acc parallel loop
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
...
// Perform SAXPY on 1M elements
saxpy(1<<20, 2.0, x, y);
...
A Very Simple Exercise: SAXPY OpenACC SAXPY in C SAXPY in Fortran
![Page 24: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/24.jpg)
Real-Time Object Detection
Global Manufacturer of Navigation Systems
Valuation of Stock Portfolios using Monte Carlo
Global Technology Consulting Company
Interaction of Solvents and Biomolecules
University of Texas at San Antonio
Directives: Easy & Powerful
Optimizing code with directives is quite easy, especially compared to CPU threads or writing CUDA kernels. The most important thing is avoiding restructuring of existing code for production applications. ”
-- Developer at the Global Manufacturer of Navigation Systems
“ 5x in 40 Hours 2x in 4 Hours 5x in 8 Hours
![Page 25: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/25.jpg)
3 Ways to Accelerate Applications
Applications
Libraries
“Drop-in”
Acceleration
Programming
Languages OpenACC
Directives
Maximum
Flexibility
Easily Accelerate
Applications
![Page 26: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/26.jpg)
Opening the CUDA Platform with LLVM
CUDA compiler source contributed
to open source LLVM compiler project
SDK includes specification documentation,
examples, and verifier
Anyone can add CUDA support to new languages and processors
Learn more at
developer.nvidia.com/cuda-llvm-compiler
CUDA C, C++, Fortran
LLVM Compiler For CUDA
NVIDIA GPUs
x86 CPUs
New Language Support
New Processor Support
![Page 27: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/27.jpg)
void saxpy_serial(int n,
float a,
float *x,
float *y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
// Perform SAXPY on 1M elements
saxpy_serial(4096*256, 2.0, x, y);
__global__
void saxpy_parallel(int n,
float a,
float *x,
float *y)
{
int i = blockIdx.x * blockDim.x +
threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
// Perform SAXPY on 1M elements
saxpy_parallel<<<4096,256>>>(n,2.0,x,y);
CUDA C
Standard C Code Parallel C Code
developer.nvidia.com/cuda-toolkit
![Page 28: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/28.jpg)
CUDA C++: Develop Generic Parallel Code
Class hierarchies
__device__ methods
Templates
Operator overloading
Functors (function objects)
Device-side new/delete
More…
template <typename T>
struct Functor {
__device__ Functor(_a) : a(_a) {}
__device__ T operator(T x) { return a*x; }
T a;
}
template <typename T, typename Oper>
__global__ void kernel(T *output, int n) {
Oper op(3.7);
output = new T[n]; // dynamic allocation
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n)
output[i] = op(i); // apply functor
} developer.nvidia.com/cuda-toolkit
CUDA C++ features enable sophisticated and flexible
applications and middleware
![Page 29: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/29.jpg)
PGI CUDA Fortran
Simple Fortran extensions
Kernel functions
Thread / block IDs
Device & data
management
Parallel loop directives
Familiar syntax
Use allocate, deallocate
Copy CPU-to-GPU with
assignment (=)
module mymodule contains attributes(global) subroutine saxpy(n,a,x,y) real :: x(:), y(:), a, integer n, i attributes(value) :: a, n i = threadIdx%x+(blockIdx%x-1)*blockDim%x
if (i<=n) y(i) = a*x(i) + y(i);
end subroutine saxpy end module mymodule program main use cudafor; use mymodule real, device :: x_d(2**20), y_d(2**20) x_d = 1.0; y_d = 2.0 call saxpy<<<4096,256>>>(2**20,3.0,x_d,y_d,) y = y_d write(*,*) 'max error=', maxval(abs(y-5.0)) end program main
developer.nvidia.com/cuda-fortran
![Page 30: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/30.jpg)
Compile Python for Parallel Architectures
Anaconda Accelerate from Continuum Analytics
NumbaPro array-oriented compiler for Python & NumPy
Compile for CPUs or GPUs (uses LLVM + NVIDIA Compiler SDK)
Fast Development + Fast Execution: Ideal Combination
http://continuum.io
Free Academic License
![Page 31: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/31.jpg)
10242 Mandelbrot Time Speedup v. Pure Python
Pure Python 4.85s --
NumbaPro (CPU) 0.11s 44x
CUDA Python (K20) .004s 1221x
@cuda.jit(restype=uint8, argtypes=[f8, f8, uint32], device=True)
def mandel(x, y, max_iters):
zr, zi = 0.0, 0.0
for i in range(max_iters):
newzr = (zr*zr-zi*zi)+x
zi = 2*zr*zi+y
zr = newzr
if (zr*zr+zi*zi) >= 4:
return i
return 255
@cuda.jit(argtypes=[uint8[:,:], f8, f8, f8, f8, uint32])
def mandel_kernel(img, xmin, xmax ymin, ymax, iters):
x, y = cuda.grid(2)
if x < img.shape[0] and y < img.shape[1]:
img[y, x] = mandel(min_x+x*((max_x-min_x)/img.shape[0]),
min_y+y*((max_y-min_y)/img.shape[1]), iters)
gimage = np.zeros((1024, 1024), dtype = np.uint8)
d_image = cuda.to_device(gimage)
mandel_kernel[(32,32), (32,32)](d_image, -2.0, 1.0, -1.0, 1.0, 20)
d_image.to_host()
CUDA Python
with NumbaPro CUDA Programming,
Python Syntax
![Page 32: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/32.jpg)
GPU Programming Languages
OpenACC, CUDA Fortran Fortran
OpenACC, CUDA C C
CUDA C++, Thrust, Hemi, ArrayFire C++
Anaconda Accelerate, PyCUDA Python
MATLAB, Mathematica, LabVIEW Numerical analytics
developer.nvidia.com/language-solutions
CUDAfy.NET, Alea.cuBase .NET
![Page 33: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/33.jpg)
Development
Tools
![Page 34: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/34.jpg)
NVIDIA® Nsight™ Eclipse Edition for Linux and MacOS
CUDA-Aware Editor
Automated CPU to GPU code refactoring
Semantic highlighting of CUDA code
Integrated code samples & docs
Nsight Debugger
Simultaneously debug CPU and GPU
Inspect variables across CUDA threads
Use breakpoints & single-step debugging
Nsight Profiler
Quickly identifies performance issues
Integrated expert system
Source line correlation
,
developer.nvidia.com/nsight
![Page 35: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/35.jpg)
NVIDIA® Nsight™ Visual Studio Edition
System Trace
Review CUDA activities across CPU and GPU
Perform deep kernel analysis to detect factors limiting maximum performance
CUDA Profiler
Advanced experiments to measure memory utilization, instruction throughput and stalls
CUDA Debugger
Debug CUDA kernels directly on GPU hardware
Examine thousands of threads executing in parallel
Use on-target conditional breakpoints to locate errors
CUDA Memory Checker
Enables precise error detection
,
![Page 36: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/36.jpg)
NVIDIA CUDA-MEMCHECK for Linux & Mac
NVIDIA Nsight Eclipse & Visual Studio Editions
Allinea DDT with CUDA Distributed Debugging Tool
TotalView for CUDA for Linux Clusters
NVIDIA CUDA-GDB for Linux & Mac
Debugging Solutions Command Line to Cluster-Wide
developer.nvidia.com/debugging-solutions
![Page 37: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/37.jpg)
NVIDIA Visual Profiler NVIDIA Nsight
Eclipse & Visual Studio Editions Vampir Trace Collector
Under Development PAPI CUDA Component TAU Performance System
Performance Analysis Tools Single Node to Hybrid Cluster Solutions
developer.nvidia.com/performance-analysis-tools
![Page 38: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/38.jpg)
Univa Grid Engine
LSF, HPC, Cluster Manager
NVML Plugin for GPUs
Bright Cluster Manager
PBS Professional
Job Scheduling & Cluster Management
developer.nvidia.com/cuda-tools-ecosystem
IBM Platform Computing
![Page 39: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/39.jpg)
GPU Management: nvidia-smi
Command-line access to: - Approximate GPU utilization
- Approximate memory footprint
- Number of GPUs
- ECC state
- Driver version
Inspect and modify GPU state - ECC mode
- Exclusive use mode
Included in the NVIDIA driver, also available via NVML API
Thu Nov 1 09:10:29 2012
+------------------------------------------------------+
| NVIDIA-SMI 4.304.51 Driver Version: 304.51 |
|-------------------------------+----------------------+----------------------+
| GPU Name | Bus-Id Disp. | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K20X | 0000:03:00.0 Off | Off |
| N/A 30C P8 28W / 235W | 0% 12MB / 6143MB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K20X | 0000:85:00.0 Off | Off |
| N/A 28C P8 26W / 235W | 0% 12MB / 6143MB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Compute processes: GPU Memory |
| GPU PID Process name Usage |
|=============================================================================|
| No running compute processes found |
+-----------------------------------------------------------------------------+
developer.nvidia.com/nvidia-management-library-nvml
![Page 40: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/40.jpg)
Online Resources
www.udacity.com
docs.nvidia.com developer.nvidia.com
devtalk.nvidia.com
www.stackoverflow.com
![Page 41: Languages, Libraries and Development Tools for GPU …on-demand.gputechconf.com/gtc/2013/presentations/S... · Tokyo Institute of Technology ... SDK includes specification documentation,](https://reader031.fdocuments.net/reader031/viewer/2022022601/5b4c52f07f8b9afe4d8b87be/html5/thumbnails/41.jpg)
Thank You!