Transcript of "Introduction to Productive GPU Programming" | GTC 2014

  • Introduction to Productive GPU Programming

    Umar Arshad

  • ArrayFire

    ● World’s leading GPU experts
      ○ In the industry since 2007
      ○ NVIDIA Partner

    ● Deep experience working with thousands of customers
      ○ Analysis
      ○ Acceleration
      ○ Algorithm development

    ● GPU Training
      ○ Hands-on course with a CUDA engineer
      ○ Customized to meet your needs

  • Productivity

    ● Software Development
      ○ Development costs
      ○ Features
      ○ User experience

    ● Limited resources

    ● Tools and Libraries
      ○ Reduce R&D costs
      ○ Lower testing and deployment time
      ○ More time for features

  • GPU Libraries

    ● Programmed by GPU experts
      ○ Years of experience

    ● Abstract low-level details

    ● Target multiple architectures
      ○ Some kernels might run better on older hardware

    ● Free improvements on new hardware
      ○ Update to the latest version

    ● No need to reinvent the wheel

  • Library Types

    ● Specialized GPU libs
      ○ Targeted at a specific set of operations
      ○ C interface
      ○ Raw pointer interface

    ● General GPU libs
      ○ Manage GPU resources using containers
      ○ Targeted at general computation
      ○ Higher-level functions
      ○ C++ interface

  • Specialized GPU Libraries

    ● Fast Fourier transforms
      ○ cuFFT

    ● Random number generation
      ○ cuRAND

    ● Linear algebra
      ○ cuBLAS
      ○ CULA Tools
      ○ MAGMA

    ● Signal and image processing
      ○ NPP

  • Specialized GPU Libraries

    ● C interface
      ○ Use pointers to reference data

    ● Do not manage memory

    ● Mimic existing libraries
      ○ cuBLAS ≈ BLAS
      ○ CULA ≈ BLAS + LAPACK
      ○ cuFFT ≈ FFTW

    ● Minimizes the amount of code necessary to integrate into existing projects
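    To illustrate the raw-pointer style, here is a minimal sketch of a cuBLAS SAXPY call (cublas_v2 API; error checking omitted; d_x and d_y are assumed to already hold n floats in device memory):

    #include <cublas_v2.h>

    // y = alpha * x + y, computed on the device by cuBLAS
    void saxpy_cublas(int n, float alpha, const float *d_x, float *d_y)
    {
        cublasHandle_t handle;
        cublasCreate(&handle);                          // set up the library context

        // raw device pointers go straight into the BLAS-like call
        cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);

        cublasDestroy(handle);                          // release the library context
    }

    Note the resemblance to the classic BLAS saxpy(n, alpha, x, incx, y, incy) signature, which is what keeps integration costs low.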

  • cuFFT

    ● 1D, 2D and 3D transforms
    ● Both real and complex data types supported
    ● Single and double precision supported
    ● Batch execution for multiple transforms
    ● Available as part of the CUDA Toolkit
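    A minimal sketch of the cuFFT workflow (plan, execute, destroy; d_data is assumed to hold n cufftComplex values in device memory):

    #include <cufft.h>

    // single-precision 1D complex-to-complex forward FFT, in place
    void fft_1d(cufftComplex *d_data, int n)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, 1);               // batch of 1 transform of length n
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD); // execute the plan
        cufftDestroy(plan);                                // free plan resources
    }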

  • cuRAND

    ● Bulk random number generation on the GPU
    ● Usable from both host code and kernel code
    ● Single and double precision support
    ● Four different RNG algorithms
      ○ MRG32k3a
      ○ MTGP Mersenne Twister
      ○ XORWOW
      ○ Sobol' quasi-RNG

    ● Multiple RNG distributions (uniform, normal, log-normal, Poisson)
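    A minimal sketch of the host-side API, filling a device buffer with uniform floats (d_out is assumed to be a device allocation of at least n floats):

    #include <curand.h>

    void fill_uniform(float *d_out, size_t n)
    {
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW); // one of the four RNG algorithms
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);      // fixed seed for reproducibility
        curandGenerateUniform(gen, d_out, n);                  // n uniform floats in (0, 1]
        curandDestroyGenerator(gen);
    }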

  • cuBLAS, CULA and MAGMA

    ● Support most popular linear algebra routines
    ● Real and complex data type support
    ● Single and double precision support

  • NPP

    ● Signal and image processing functions
    ● Avoid unnecessary data copies
      ○ Can process data that is already on the GPU
      ○ Keeps processed data on the GPU for further processing

    ● Arithmetic and logical operations
    ● Color conversions
    ● Filtering
    ● Geometric transforms
    ● Statistical functions

  • General-Purpose GPU Libraries

    ● Thrust
    ● OpenCV
    ● ArrayFire

  • Thrust

    ● GPU library resembling the C++ STL
      ○ STL-like data structures
      ○ Iterators
      ○ Fully interoperable with CUDA C

    ● Parallel vector operations
      ○ Reductions
      ○ Sorting
      ○ Prefix-sum

    ● Customizable GPU kernels using functors

  • Thrust - Data Structures

    ● Two types of containers
      ○ host_vector stores data on the host
      ○ device_vector stores data on the device

    ● Supports the same data types as C++
      ○ host_vector<int> foo(2e6, 4);
      ○ device_vector<int> bar(2e6);

    ● Explicit data transfer
      ○ bar = foo;
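    Putting the pieces together as a runnable sketch (the element type int is illustrative):

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>

    int main()
    {
        thrust::host_vector<int>   foo(2e6, 4); // 2M elements on the host, all set to 4
        thrust::device_vector<int> bar(2e6);    // 2M elements on the device

        bar = foo;                              // explicit host-to-device copy

        return 0;
    }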

  • Thrust iterators

    ● Like the C++ STL, Thrust uses iterators to define the range an operation works on

    ● Thrust functions require a begin and an end iterator
      ○ The begin iterator points to the first element of the range
      ○ The end iterator points one past the last element of the range

    int sum = thrust::reduce(hdata.begin(), hdata.end()); // sum of the hdata vector

  • Thrust Functions

    ● Thrust includes many basic algorithms for general computation
      ○ Reductions
      ○ Sorting
      ○ Prefix-sum
      ○ Scan
      ○ Reordering
      ○ Transformation
      ○ Generation
      ○ Random number generation
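    For example, a few of these algorithms applied to a device_vector (a minimal sketch; the function name is illustrative):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/scan.h>
    #include <thrust/reduce.h>

    int thrust_algorithms_example(thrust::device_vector<int> &d)
    {
        thrust::sort(d.begin(), d.end());                      // sorting
        thrust::inclusive_scan(d.begin(), d.end(), d.begin()); // prefix-sum, in place
        return thrust::reduce(d.begin(), d.end());             // reduction (sum)
    }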

  • Thrust Functions

    ● Many Thrust functions can be customized using built-in function objects or custom functors

    ● Basic operations
      ○ plus, minus, multiplies, etc.

    ● Custom functors
      ○ Create your own function objects
      ○ Overload the operator() function
        ■ decorated with __host__ and __device__

  • Thrust Functors

    struct my_plus
    {
        __host__ __device__
        float operator()(const float& x, const float& y) const { return x + y; }
    };

    void thrust_functor_example()
    {
        // Define input vectors x and y and output vector z with equal lengths
        ...

        // z = x + y, element-wise, applying my_plus across the ranges
        thrust::transform(x.begin(), x.end(), y.begin(), z.begin(), my_plus());
    }

  • Thrust Functors

    ● Caveats
      ○ Cannot use shared memory
      ○ No control over block/grid size
      ○ No stream support

  • OpenCV (GPU)

    ● Manipulate matrices
    ● Perform complex image processing operations
      ○ Image filtering
      ○ Resizing

    ● Dozens of available computer vision algorithms
      ○ Object recognition
      ○ Human detection

  • OpenCV (GPU) - Data Structures

    ● GpuMat container to store data
    ● Signed/unsigned 8-, 16- and 32-bit integers, single and double precision floating-point numbers
    ● Multi-channel support
    ● Only 2D matrices (no arbitrary dimensions)

  • OpenCV (GPU) - Usage Example

    void opencv_gpu_example()
    {
        float f[4] = {100, 200, 400, 800};
        cv::Mat h(4, 1, CV_32F, f);         // 4x1 single-channel float matrix on the host
        cv::gpu::GpuMat d(h);               // constructor uploads the data to the device
        cv::Scalar sum = cv::gpu::sum(d);   // sum-reduction performed on the GPU
    }
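    Results can be brought back explicitly as well; a small follow-up sketch using the OpenCV 2.x gpu module (the function name and the squaring step are illustrative):

    #include <opencv2/gpu/gpu.hpp>

    void opencv_gpu_roundtrip(const cv::gpu::GpuMat &d)
    {
        cv::gpu::GpuMat d_sq;
        cv::gpu::multiply(d, d, d_sq); // element-wise multiply on the GPU
        cv::Mat h_sq;
        d_sq.download(h_sq);           // explicit device-to-host copy
    }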

  • ArrayFire

    ● Hundreds of parallel functions
      ○ Targeting image processing, machine learning, etc.

    ● Support for multiple languages
      ○ C/C++, Fortran, Java and R

    ● Linux, Windows, Mac OS X
    ● OpenGL-based graphics
    ● Based around one data structure
    ● JIT
      ○ Combines multiple operations into one kernel

    ● GFOR, the only data-parallel loop
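    A minimal GFOR sketch (illustrative names; A is an m x n array, w an m x 1 weight vector): every iteration is independent, so ArrayFire batches all n column updates into one set of parallel GPU work.

    #include <arrayfire.h>
    using namespace af;

    void gfor_example(array &A, const array &w)
    {
        gfor (seq i, A.dims(1)) {        // data-parallel loop over columns
            A(span, i) = A(span, i) * w; // element-wise scale of column i
        }
    }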

  • ArrayFire Functions

    ● Hundreds of parallel functions
      ○ Building blocks
        ■ Reductions
        ■ Scan
        ■ Set operations
        ■ Sorting
        ■ Statistics
        ■ Basic matrix manipulation

  • ArrayFire Functions

    ● Hundreds of parallel functions
      ○ Signal/image processing
        ■ Convolution
        ■ FFT
        ■ Histograms
        ■ Interpolation
        ■ Connected components

      ○ Linear algebra
        ■ Matrix multiply
        ■ Linear system solving
        ■ Factorization

  • ArrayFire - Data Structures

    ● Built around a flexible data structure named "array"
      ○ Lightweight wrapper around the data on the compute device
      ○ Manages the data and basic metadata such as size, type and dimensions

    ● You can transfer data into an array object using one of its constructors

    float hA[] = {0, 1, 2, 3, 4, 5};
    array A(2, 3, hA); // 2 rows x 3 columns, filled column-major from hA

  • ArrayFire - Indexing

    #include <arrayfire.h>
    #include <af/utils.h> // required for print()

    void af_example()
    {
        float f[4] = {100, 200, 400, 800};
        array a(4, f);    // 4 rows x 1 col array initialized with f values
        array b = sum(a); // performs reduce-sum over all elements of a
    }

  • Case Study 1 — Dot Product

    ● Have two N-dimensional vectors, x and y
    ● Multiply the corresponding components 1 to N of x and y
    ● Sum up all N results

  • Case Study 1 — Dot Product — Thrust

    double dotproduct_thrust()
    {
        // float vectors in device memory
        thrust::device_vector<float> x(samples);
        thrust::device_vector<float> y(samples);
        thrust::device_vector<float> z(samples);

        // generate sequences starting from 0.0f, with steps of 1.0f and 0.001f
        thrust::sequence(x.begin(), x.end(), 0.f, 1.f);
        thrust::sequence(y.begin(), y.end(), 0.f, 0.001f);

        // multiplies vectors x and y element-wise, storing the result in z
        thrust::transform(x.begin(), x.end(), y.begin(), z.begin(),
                          thrust::multiplies<float>());

        // returns the sum-reduction of z
        return thrust::reduce(z.begin(), z.end());
    }

  • Case Study 1 — Dot Product — ArrayFire

    static double dotproduct_af()
    {
        // array in device memory, set to the sequence {0.0f, 1.0f, ..., samples-1}
        array x = seq(samples);
        // the same sequence scaled by 0.001f
        array y(seq(samples) * 0.001f);

        // multiplies x and y element-wise and returns the sum-reduction of the result
        return sum<float>(x * y);
    }
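    For comparison, the specialized-library route would use the cuBLAS raw-pointer interface for the same computation (a sketch; the handle and the device vectors d_x and d_y are assumed to exist already):

    #include <cublas_v2.h>

    float dotproduct_cublas(cublasHandle_t handle, int n,
                            const float *d_x, const float *d_y)
    {
        float result = 0.f;
        cublasSdot(handle, n, d_x, 1, d_y, 1, &result); // result copied back to the host
        return result;
    }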

  • Case Study 2 — Pi Estimation

    ● Generate millions of uniformly distributed random samples
    ● Each sample has an x and a y coordinate
    ● Estimate the rate of samples that fall within the unit-radius circle

    Since the quarter of the unit circle inside the unit square covers π/4 of its area, π ≈ 4 × (hits / samples).

  • Case Study 2 — Pi Estimation — ArrayFire

    double pi_af()
    {
        // two vectors of uniformly distributed random samples in [0, 1]
        array x = randu(samples, f32), y = randu(samples, f32);
        // fraction of samples inside the unit circle, scaled by 4
        return 4 * sum<float>(x*x + y*y <= 1) / samples;
    }
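    The same estimator in Thrust takes noticeably more machinery; a minimal sketch (the functor and the simplistic per-sample seeding are illustrative, not production-quality RNG practice):

    #include <thrust/transform_reduce.h>
    #include <thrust/iterator/counting_iterator.h>
    #include <thrust/functional.h>
    #include <thrust/random.h>

    struct in_circle
    {
        __host__ __device__
        float operator()(unsigned int i) const
        {
            // seed a small per-sample RNG from the sample index
            thrust::default_random_engine rng(i);
            thrust::uniform_real_distribution<float> u(0.f, 1.f);
            float x = u(rng), y = u(rng);
            return (x * x + y * y <= 1.f) ? 1.f : 0.f; // 1 if inside the unit circle
        }
    };

    double pi_thrust(unsigned int samples)
    {
        float hits = thrust::transform_reduce(
            thrust::counting_iterator<unsigned int>(0),
            thrust::counting_iterator<unsigned int>(samples),
            in_circle(), 0.f, thrust::plus<float>());
        return 4.0 * hits / samples;
    }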

  • Speedups With ArrayFire

    Field                     Application                  Speedup
    ------------------------  ---------------------------  -------
    Academia                  Power Systems Simulations    35x
    Finance                   Option Pricing               52x
    Government                Radar Image Formation        45x
    Life Sciences             Pathology Advances           > 100x
    Manufacturing             Tomography of Vegetation     10x
    Media & Computer Vision   Digital Holography           17x
    Oil & Gas                 Ground Water Simulations     > 20x

  • OpenACC — Programming Standard

    ● Write (almost) serial code
    ● Use directives to tell the compiler what should execute in parallel
    ● Let the compiler do the parallel work for you
    ● Under constant development, with constant performance improvements!
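    As a minimal sketch, a serial SAXPY loop becomes parallel with a single directive (data clauses shown for clarity; a capable compiler can often infer them):

    void saxpy_openacc(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }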

  • Implementation From Scratch — When?

    ● You are writing a novel algorithm implementation
    ● Your code uses a modified version of the standard algorithm in question
    ● You want to learn how parallel code works

  • Serial to Parallel

    ● Check whether serial -> parallel is feasible (e.g., loops whose iterations do not rely on previous results); see the sketch after this list

    ● Identify performance bottlenecks (e.g., memory bandwidth, register usage)

    ● Profile the code (e.g., check where your code is spending most of its time and resources)

    ● Optimize the code (e.g., use shared memory where possible)
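    A minimal sketch of the first step: each iteration of this loop depends only on its own inputs, so it maps directly to one GPU thread per element (names are illustrative).

    // serial version
    void vadd_serial(const int *a, const int *b, int *c, int length)
    {
        for (int i = 0; i < length; i++)
            c[i] = a[i] + b[i];
    }

    // parallel CUDA version: one thread per element
    __global__ void vadd_parallel(const int *a, const int *b, int *c, int length)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if (idx < length) // guard against the partial last block
            c[idx] = a[idx] + b[idx];
    }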

  • Tools To Improve Productivity

    ● Debugging, memory correctness checking and profiling tools
      ○ Debugging: cuda-gdb
      ○ Memory correctness checking: cuda-memcheck
      ○ Profiling: nvprof

    ● Integrated Development Environments (IDEs)
      ○ NVIDIA Nsight

  • cuda-gdb

    ● Break host and device code on a condition
    ● Check variable values
    ● Do everything else that GNU gdb is capable of

  • cuda-gdb - Example Code

    #include <cstdio> // include for debugging

    __global__ void vadd(int * a, int * b, int * c, int length)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if (idx < length) c[idx] = a[idx] + b[idx]; // Line 6
    }

    int main()
    {
        int samples = 1000;
        ...
        // d_A = {0, 1, 2, 3, ..., 999};
        // d_B = {0, 2, 4, 6, ..., 1998};
        vadd<<<2, 512>>>(d_A, d_B, d_C, samples);

        return 0;
    }

  • cuda-gdb - Usage

    $ cuda-gdb ./vadd
    (cuda-gdb) break vadd.cu:6 if idx == 900
    Breakpoint 1 (vadd.cu:6 if idx == 900) pending.
    (cuda-gdb) run
    [Launch of CUDA Kernel 0 (vadd) on Device 0]
    [Switching focus to CUDA kernel 0, grid 1, block (1,0,0), thread (388,0,0), device 0, sm 6, warp 12, lane 4]

    Breakpoint 1, vadd(int * @global, int * @global, int * @global, int)
        (a=0x500140000, b=0x500141000, c=0x500142000, length=1000) at vadd.cu:6
    6   if (idx < length) c[idx] = a[idx] + b[idx];
    (cuda-gdb) print a[idx]
    $1 = 900
    (cuda-gdb) print b[idx]
    $2 = 1800
    (cuda-gdb) print c[idx]
    $3 = 2700

  • cuda-gdb - CUDA Information

    ● Get information on the device running your code

      (cuda-gdb) info cuda devices
        Dev Description SM Type SMs Warps/SM Lanes/Warp Max Regs/Lane Active SMs Mask
      *   0 gk104       sm_30     8       64         32            64     0x000000c0

    ● Get information on how your kernel is called

      (cuda-gdb) info cuda kernels
        Kernel Parent Dev Grid Status SMs Mask   GridDim BlockDim  Invocation
      *      0      -   0    1 Active 0x000000c0 (2,1,1) (512,1,1) vadd(a=0x500140000, b=0x500141000, c=0x500142000, length=1000)

    ● Get information on the threads running your kernel

      (cuda-gdb) info cuda threads
        BlockIdx ThreadIdx To BlockIdx ThreadIdx Count Virtual PC         Filename Line
      Kernel 0
      * (0,0,0) (0,0,0)       (1,0,0) (511,0,0) 1024  0x0000000000678f88 vadd.cu  6

    ● There is even more: SMs, warps, lanes, blocks, ...

  • cuda-memcheck

    ● Check for out-of-bounds memory accesses
    ● Errors are generated when reading or writing unallocated memory

  • cuda-memcheck - Example Code

    #include <cstdio> // include for debugging

    __global__ void vadd(int * a, int * b, int * c, int length)
    {
        int idx = blockDim.x * blockIdx.x + threadIdx.x;
        if (idx < length) c[idx] = a[idx] + b[idx]; // Line 6
    }

    int main()
    {
        int samples = 1000;
        ...
        // d_A = {0, 1, 2, 3, ..., 999};
        // d_B = {0, 2, 4, 6, ..., 1998};
        vadd<<<2, 512>>>(d_A, d_B, d_C, samples);
    }

  • cuda-memcheck - Usage

    $ cuda-memcheck ./vadd
    ========= CUDA-MEMCHECK
    ========= Invalid __global__ read of size 4
    =========     at 0x000000e0 in vadd.cu:9:vadd(int*, int*, int*, int)
    =========     by thread (511,0,0) in block (1,0,0)
    =========     Address 0x500140ffc is out of bounds
    =========
    ========= Invalid __global__ read of size 4
    =========     at 0x000000e0 in vadd.cu:9:vadd(int*, int*, int*, int)
    =========     by thread (510,0,0) in block (1,0,0)
    =========     Address 0x500140ff8 is out of bounds
    ...
    ========= Program hit error 4 on CUDA API call to cudaMemcpy
    ========= Saved host backtrace up to driver entry point at error
    =========
    ========= ERROR SUMMARY: 25 errors

  • nvprof

    ● Measures the time consumed by each call to an operation involving the device
    ● Useful information for identifying bottlenecks

  • nvprof - Usage

    $ nvprof ./vadd
    ==10506== NVPROF is profiling process 10506, command: ./vadd
    ==10506== Profiling application: ./vadd
    ==10506== Profiling result:
    Time(%)     Time  Calls       Avg       Min       Max  Name
     38.03% 2.5920us      2  1.2960us  1.2800us  1.3120us  [CUDA memcpy HtoD]
     37.09% 2.5280us      1  2.5280us  2.5280us  2.5280us  [CUDA memcpy DtoH]
     24.88% 1.6960us      1  1.6960us  1.6960us  1.6960us  vadd(int*, int*, int*, int)