Transcript of Intro to CUDA C (NVIDIA, developer.download.nvidia.com/GTC/PDF/1060_Woolley.pdf)

  • © NVIDIA Corporation 2011

    Intro to CUDA C

    Cliff Woolley, NVIDIA Corporation

    GTC Asia 2011

  • © NVIDIA Corporation 2011

    What is CUDA?

    CUDA Architecture

    Expose GPU computing for general purpose

    Retain performance

    CUDA C/C++

    Based on industry-standard C/C++

    Small set of extensions to enable heterogeneous programming

    Straightforward APIs to manage devices, memory etc.

    This session introduces CUDA C/C++

  • © NVIDIA Corporation 2011

    Introduction to CUDA C/C++

    What will you learn in this session?

    Start from “Hello World!”

    Write and launch CUDA C/C++ kernels

    Manage GPU memory

    Manage communication and synchronization

  • © NVIDIA Corporation 2011

    Prerequisites

    You (probably) need experience with C or C++

    You don’t need GPU experience

    You don’t need parallel programming experience

    You don’t need graphics experience

  • © NVIDIA Corporation 2011

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    CONCEPTS

  • © NVIDIA Corporation 2011

    HELLO WORLD!

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    CONCEPTS

  • © NVIDIA Corporation 2011

    Heterogeneous Computing

    Terminology:

    Host The CPU and its memory (host memory)

    Device The GPU and its memory (device memory)


  • © NVIDIA Corporation 2011

    Heterogeneous Computing

    #include <iostream>

    #include <algorithm>

    using namespace std;

    #define N 1024

    #define RADIUS 3

    #define BLOCK_SIZE 16

    __global__ void stencil_1d(int *in, int *out) {

    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

    int gindex = threadIdx.x + blockIdx.x * blockDim.x;

    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory

    temp[lindex] = in[gindex];

    if (threadIdx.x < RADIUS) {

    temp[lindex - RADIUS] = in[gindex - RADIUS];

    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];

    }

    // Synchronize (ensure all the data is available)

    __syncthreads();

    // Apply the stencil

    int result = 0;

    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)

    result += temp[lindex + offset];

    // Store the result

    out[gindex] = result;

    }

  • © NVIDIA Corporation 2011

    Simple Processing Flow

    1. Copy input data from CPU memory to GPU memory (over the PCI bus)

  • © NVIDIA Corporation 2011

    Simple Processing Flow

    1. Copy input data from CPU memory to GPU memory (over the PCI bus)

    2. Load GPU program and execute, caching data on chip for performance

  • © NVIDIA Corporation 2011

    Simple Processing Flow

    1. Copy input data from CPU memory to GPU memory (over the PCI bus)

    2. Load GPU program and execute, caching data on chip for performance

    3. Copy results from GPU memory to CPU memory

  • © NVIDIA Corporation 2011

    Hello World!

    int main(void) {

    printf("Hello World!\n");

    return 0;

    }

    Standard C that runs on the host

    NVIDIA compiler (nvcc) can be used to compile programs with no device code

    Output:

    $ nvcc hello_world.cu

    $ a.out

    Hello World!

    $

  • © NVIDIA Corporation 2011

    Hello World! with Device Code

    __global__ void mykernel(void) {

    }

    int main(void) {

    mykernel<<<1,1>>>();

    printf("Hello World!\n");

    return 0;

    }

    Two new syntactic elements…

  • © NVIDIA Corporation 2011

    Hello World! with Device Code

    __global__ void mykernel(void) {

    }

    CUDA C/C++ keyword __global__ indicates a function that:

    Runs on the device

    Is called from host code

    nvcc separates source code into host and device components

    Device functions (e.g. mykernel()) processed by NVIDIA compiler

    Host functions (e.g. main()) processed by standard host compiler

    - gcc, cl.exe

  • © NVIDIA Corporation 2011

    Hello World! with Device Code

    mykernel<<<1,1>>>();

    Triple angle brackets mark a call from host code to device code

    Also called a “kernel launch”

    We’ll return to the parameters (1,1) in a moment

    That’s all that is required to execute a function on the GPU!

  • © NVIDIA Corporation 2011

    Hello World! with Device Code

    __global__ void mykernel(void) {

    }

    int main(void) {

    mykernel<<<1,1>>>();

    printf("Hello World!\n");

    return 0;

    }

    But mykernel() does not do anything yet...

    Output:

    $ nvcc hello.cu

    $ a.out

    Hello World!

    $
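    As a side note not in the original deck: on devices of compute capability 2.0 or later, the kernel body itself can call printf, so a hedged sketch of a variant that actually prints from the GPU might look like this:

    #include <stdio.h>

    // Sketch only: device-side printf requires compute capability >= 2.0
    // (compile with, e.g., nvcc -arch=sm_20 hello_gpu.cu)
    __global__ void mykernel(void) {
        printf("Hello World from the GPU!\n");
    }

    int main(void) {
        mykernel<<<1,1>>>();
        cudaDeviceSynchronize();   // wait for the kernel and flush its printf output
        return 0;
    }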

  • © NVIDIA Corporation 2011

    Parallel Programming in CUDA C/C++

    But wait… GPU computing is about massive parallelism!

    We need a more interesting example…

    We’ll start by adding two integers and build up to vector addition


  • © NVIDIA Corporation 2011

    Addition on the Device

    A simple kernel to add two integers

    __global__ void add(int *a, int *b, int *c) {

    *c = *a + *b;

    }

    As before __global__ is a CUDA C/C++ keyword meaning

    add() will execute on the device

    add() will be called from the host

  • © NVIDIA Corporation 2011

    Addition on the Device

    Note that we use pointers for the variables

    __global__ void add(int *a, int *b, int *c) {

    *c = *a + *b;

    }

    add() runs on the device, so a, b and c must point to device memory

    We need to allocate memory on the GPU

  • © NVIDIA Corporation 2011

    Memory Management

    Host and device memory are separate entities

    Device pointers point to GPU memory

    May be passed to/from host code

    May not be dereferenced in host code

    Host pointers point to CPU memory

    May be passed to/from device code

    May not be dereferenced in device code

    Simple CUDA API for handling device memory

    cudaMalloc(), cudaFree(), cudaMemcpy()

    Similar to the C equivalents malloc(), free(), memcpy()
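    The slides that follow use these calls directly; as a side note (not on the slide), each of them returns a cudaError_t, so a minimal sketch of the call pattern with error checking might look like this:

    #include <stdio.h>

    int main(void) {
        int h_x[4] = {1, 2, 3, 4};        // host copy
        int *d_x = 0;                     // device pointer
        size_t size = sizeof(h_x);

        cudaError_t err = cudaMalloc((void **)&d_x, size);
        if (err != cudaSuccess) {
            printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }

        cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);   // host -> device
        // ... launch kernels that read/write d_x here ...
        cudaMemcpy(h_x, d_x, size, cudaMemcpyDeviceToHost);   // device -> host

        cudaFree(d_x);                    // release device memory
        return 0;
    }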

  • © NVIDIA Corporation 2011

    Addition on the Device: add()

    Returning to our add() kernel

    __global__ void add(int *a, int *b, int *c) {

    *c = *a + *b;

    }

    Let’s take a look at main()…

  • © NVIDIA Corporation 2011

    Addition on the Device: main()

    int main(void) {

    int a, b, c; // host copies of a, b, c

    int *d_a, *d_b, *d_c; // device copies of a, b, c

    int size = sizeof(int);

    // Allocate space for device copies of a, b, c

    cudaMalloc((void **)&d_a, size);

    cudaMalloc((void **)&d_b, size);

    cudaMalloc((void **)&d_c, size);

    // Setup input values

    a = 2;

    b = 7;

  • © NVIDIA Corporation 2011

    Addition on the Device: main()

    // Copy inputs to device

    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);

    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU

    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host

    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    return 0;

    }

  • © NVIDIA Corporation 2011

    RUNNING IN PARALLEL

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    CONCEPTS

  • © NVIDIA Corporation 2011

    Moving to Parallel

    GPU computing is about massive parallelism

    So how do we run code in parallel on the device?

    add<<< 1, 1 >>>();

    add<<< N, 1 >>>();

    Instead of executing add() once, execute N times in parallel

  • © NVIDIA Corporation 2011

    Vector Addition on the Device

    With add() running in parallel we can do vector addition

    Terminology: each parallel invocation of add() is referred to as a block

    The set of blocks is referred to as a grid

    Each invocation can refer to its block index using blockIdx.x

    __global__ void add(int *a, int *b, int *c) {

    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

    }

    By using blockIdx.x to index into the array, each block handles a different index

  • © NVIDIA Corporation 2011

    Vector Addition on the Device

    __global__ void add(int *a, int *b, int *c) {

    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

    }

    On the device, each block can execute in parallel:

    Block 0: c[0] = a[0] + b[0];    Block 1: c[1] = a[1] + b[1];

    Block 2: c[2] = a[2] + b[2];    Block 3: c[3] = a[3] + b[3];

  • © NVIDIA Corporation 2011

    Vector Addition on the Device: add()

    Returning to our parallelized add() kernel

    __global__ void add(int *a, int *b, int *c) {

    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

    }

    Let’s take a look at main()…

  • © NVIDIA Corporation 2011

    Vector Addition on the Device: main()

    #define N 512

    int main(void) {

    int *a, *b, *c; // host copies of a, b, c

    int *d_a, *d_b, *d_c; // device copies of a, b, c

    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c

    cudaMalloc((void **)&d_a, size);

    cudaMalloc((void **)&d_b, size);

    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values

    a = (int *)malloc(size); random_ints(a, N);

    b = (int *)malloc(size); random_ints(b, N);

    c = (int *)malloc(size);

  • © NVIDIA Corporation 2011

    Vector Addition on the Device: main()

    // Copy inputs to device

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);

    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N blocks

    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup

    free(a); free(b); free(c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    return 0;

    }
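    The random_ints() helper used above is never shown in the deck; a minimal sketch of such a helper (name and signature taken from its usage, body assumed) could be:

    #include <stdlib.h>

    // Hypothetical helper: fill x[0..n-1] with random integers
    void random_ints(int *x, int n)
    {
        for (int i = 0; i < n; ++i)
            x[i] = rand() % 100;   // small values keep results easy to verify
    }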

  • © NVIDIA Corporation 2011

    INTRODUCING THREADS

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    CONCEPTS

  • © NVIDIA Corporation 2011

    __global__ void add(int *a, int *b, int *c) {

    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];

    }

    CUDA Threads

    Terminology: a block can be split into parallel threads

    Let’s change add() to use parallel threads instead of parallel blocks

    __global__ void add(int *a, int *b, int *c) {

    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];

    }

    We use threadIdx.x instead of blockIdx.x

    Need to make one change in main()…

  • © NVIDIA Corporation 2011

    Vector Addition Using Threads: main()

    #define N 512

    int main(void) {

    int *a, *b, *c; // host copies of a, b, c

    int *d_a, *d_b, *d_c; // device copies of a, b, c

    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c

    cudaMalloc((void **)&d_a, size);

    cudaMalloc((void **)&d_b, size);

    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values

    a = (int *)malloc(size); random_ints(a, N);

    b = (int *)malloc(size); random_ints(b, N);

    c = (int *)malloc(size);

  • © NVIDIA Corporation 2011

    Vector Addition Using Threads: main()

    // Copy inputs to device

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);

    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU with N threads

    add<<<1,N>>>(d_a, d_b, d_c);

    // Copy result back to host

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup

    free(a); free(b); free(c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    return 0;

    }

  • © NVIDIA Corporation 2011

    COMBINING THREADS AND BLOCKS

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    CONCEPTS

  • © NVIDIA Corporation 2011

    Combining Blocks and Threads

    We’ve seen parallel vector addition using:

    Many blocks with one thread each

    One block with many threads

    Let’s adapt vector addition to use both blocks and threads

    Why? We’ll come to that…

    First let’s discuss data indexing…

  • © NVIDIA Corporation 2011

    Indexing Arrays with Blocks and Threads

    No longer as simple as using blockIdx.x and threadIdx.x

    Consider indexing an array with one element per thread (8 threads/block)

    With M threads/block a unique index for each thread is given by:

    int index = threadIdx.x + blockIdx.x * M;

    [Figure: 32 array elements split across four blocks; within each block threadIdx.x runs 0..7, and the blocks have blockIdx.x = 0, 1, 2, 3]

  • © NVIDIA Corporation 2011

    Indexing Arrays: Example

    Which thread will operate on the red element?

    int index = threadIdx.x + blockIdx.x * M;

    = 5 + 2 * 8;

    = 21;

    [Figure: the red element is at global index 21, which lies in the block with blockIdx.x = 2 at threadIdx.x = 5, with M = 8 threads per block]

  • © NVIDIA Corporation 2011

    Vector Addition with Blocks and Threads

    Use the built-in variable blockDim.x for threads per block

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    Combined version of add() to use parallel threads and parallel blocks

    __global__ void add(int *a, int *b, int *c) {

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    c[index] = a[index] + b[index];

    }

    What changes need to be made in main()?

  • © NVIDIA Corporation 2011

    Addition with Blocks and Threads: main()

    #define N (2048*2048)

    #define THREADS_PER_BLOCK 512

    int main(void) {

    int *a, *b, *c; // host copies of a, b, c

    int *d_a, *d_b, *d_c; // device copies of a, b, c

    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c

    cudaMalloc((void **)&d_a, size);

    cudaMalloc((void **)&d_b, size);

    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c and setup input values

    a = (int *)malloc(size); random_ints(a, N);

    b = (int *)malloc(size); random_ints(b, N);

    c = (int *)malloc(size);

  • © NVIDIA Corporation 2011

    Addition with Blocks and Threads: main()

    // Copy inputs to device

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);

    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU

    add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup

    free(a); free(b); free(c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);

    return 0;

    }

  • © NVIDIA Corporation 2011

    Handling Arbitrary Vector Sizes

    Typical problems are not friendly multiples of blockDim.x

    Avoid accessing beyond the end of the arrays:

    __global__ void add(int *a, int *b, int *c, int n) {

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    if (index < n)

    c[index] = a[index] + b[index];

    }

    Update the kernel launch: add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
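    A short sketch of the corresponding change in main(), assuming the N and THREADS_PER_BLOCK macros from the earlier listing:

    // Round the block count up so every element is covered even when N is not
    // a multiple of THREADS_PER_BLOCK; the if (index < n) test in add() keeps
    // the extra threads of the last block from writing out of bounds.
    int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<blocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);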

  • © NVIDIA Corporation 2011

    Why Bother with Threads?

    Threads seem unnecessary

    They add a level of complexity

    What do we gain?

    Unlike parallel blocks, threads have mechanisms to:

    Communicate

    Synchronize

    To look closer, we need a new example…

  • © NVIDIA Corporation 2011

    COOPERATING THREADS

    Heterogeneous Computing

    Blocks

    Threads

    Indexing

    Shared memory

    __syncthreads()

    CONCEPTS

  • © NVIDIA Corporation 2011

    1D Stencil

    Consider applying a 1D stencil to a 1D array of elements

    Each output element is the sum of input elements within a radius

    If radius is 3, then each output element is the sum of 7 input elements:

    [Figure: an "in" array and an "out" array; each output element sums the input element at the same position plus "radius" neighbors on each side]

  • © NVIDIA Corporation 2011

    Implementing Within a Block

    Each thread processes one output element

    blockDim.x elements per block

    Input elements are read several times

    With radius 3, each input element is read seven times

    [Figure: threads 0 through 7 within a block, each reading a window of the "in" array and writing one element of the "out" array]

  • © NVIDIA Corporation 2011

    Sharing Data Between Threads

    Terminology: within a block, threads share data via shared memory

    Extremely fast on-chip memory, user-managed

    Declare using __shared__, allocated per block

    Data is not visible to threads in other blocks

  • © NVIDIA Corporation 2011

    Implementing With Shared Memory

    Cache data in shared memory

    Read (blockDim.x + 2 * radius) input elements from global memory to shared memory

    Compute blockDim.x output elements

    Write blockDim.x output elements to global memory

    Each block needs a halo of radius elements at each boundary

    [Figure: the "in" array region read by one block, with a halo of RADIUS elements to the left and right of its blockDim.x elements; the "out" array receives blockDim.x output elements]

  • © NVIDIA Corporation 2011

    __global__ void stencil_1d(int *in, int *out) {

    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

    int gindex = threadIdx.x + blockIdx.x * blockDim.x;

    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory

    temp[lindex] = in[gindex];

    if (threadIdx.x < RADIUS) {

    temp[lindex - RADIUS] = in[gindex - RADIUS];

    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];

    }

    Stencil Kernel

  • © NVIDIA Corporation 2011

    Stencil Kernel

    // Apply the stencil

    int result = 0;

    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)

    result += temp[lindex + offset];

    // Store the result

    out[gindex] = result;

    }

  • © NVIDIA Corporation 2011

    Data Race!

    The stencil example will not work…

    Suppose thread 15 reads the halo before thread 0 has fetched it…

    temp[lindex] = in[gindex];                        // Thread 15: store at temp[18]

    if (threadIdx.x < RADIUS) {                       // Thread 15: body skipped, since threadIdx.x > RADIUS

    temp[lindex - RADIUS] = in[gindex - RADIUS];

    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];

    }

    int result = 0;

    result += temp[lindex + 1];                       // Thread 15: load from temp[19] before thread 0 has stored it

  • © NVIDIA Corporation 2011

    __syncthreads()

    void __syncthreads();

    Synchronizes all threads within a block

    Used to prevent data hazards (e.g., read-after-write)

    All threads must reach the barrier

    In conditional code, the condition must be uniform across the block
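    To make the last point concrete, a small sketch (not from the slides) contrasting a barrier placed inside a divergent branch with the correct placement:

    // WRONG (sketch): only threads with threadIdx.x < RADIUS reach the barrier,
    // so the block can deadlock or behave unpredictably
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        __syncthreads();
    }

    // RIGHT: every thread in the block executes the same __syncthreads() call
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
    }
    __syncthreads();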

  • © NVIDIA Corporation 2011

    __global__ void stencil_1d(int *in, int *out) {

    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];

    int gindex = threadIdx.x + blockIdx.x * blockDim.x;

    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory

    temp[lindex] = in[gindex];

    if (threadIdx.x < RADIUS) {

    temp[lindex - RADIUS] = in[gindex - RADIUS];

    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];

    }

    // Synchronize (ensure all the data is available)

    __syncthreads();

    Stencil Kernel

  • © NVIDIA Corporation 2011

    Stencil Kernel

    // Apply the stencil

    int result = 0;

    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)

    result += temp[lindex + offset];

    // Store the result

    out[gindex] = result;

    }

  • © NVIDIA Corporation 2011

    Introduction to CUDA C/C++

    What have we learned?

    Write and launch CUDA C/C++ kernels

    - __global__, blockIdx.x, threadIdx.x,

    Manage GPU memory

    - cudaMalloc(), cudaMemcpy(), cudaFree()

    Manage communication and synchronization

    - __shared__, __syncthreads()

  • © NVIDIA Corporation 2011

    Topics we skipped

    We skipped some details; you can learn more from:

    Other talks here at GTC Asia

    CUDA C Programming Guide

    CUDA Zone – tools, training, webinars and more

    http://developer.nvidia.com/cuda

  • © NVIDIA Corporation 2011

    Questions?

  • © NVIDIA Corporation 2011

    The compute capability of a device describes its architecture, e.g.

    Number of registers

    Sizes of memories

    Features & capabilities

    The following presentations concentrate on Fermi devices

    Compute Capability >= 2.0

    Compute Capability

    Compute Capability | Selected Features (see CUDA C Programming Guide for complete list) | Tesla models

    1.0 | Fundamental CUDA support | 870

    1.3 | Double precision, improved memory accesses, atomics | 10-series

    2.0 | Caches, fused multiply-add, 3D grids, surfaces, ECC, P2P, concurrent kernels/copies, function pointers, recursion | 20-series
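    A hedged sketch of querying a device's compute capability at runtime with the CUDA runtime API:

    #include <stdio.h>

    int main(void) {
        int dev = 0;
        cudaDeviceProp prop;
        cudaGetDevice(&dev);                   // index of the current device
        cudaGetDeviceProperties(&prop, dev);   // fill in its properties
        printf("Device %d: %s, compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
        return 0;
    }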

  • © NVIDIA Corporation 2011

    A kernel is launched as a grid of blocks of threads

    - blockIdx and threadIdx are 3D

    - We showed only one dimension (x)

    Built-in variables: threadIdx

    blockIdx

    blockDim

    gridDim

    IDs and Dimensions

    [Figure: a Device executing Grid 1, a 3 x 2 arrangement of Blocks (0,0,0) through (2,1,0); Block (1,1,0) is expanded to show a 5 x 3 arrangement of Threads (0,0,0) through (4,2,0)]
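    A brief sketch (kernel name hypothetical) of launching a multi-dimensional grid with the built-in dim3 type:

    dim3 grid(3, 2);      // 3 x 2 blocks, as in the figure above
    dim3 block(5, 3);     // 5 x 3 threads per block
    mykernel2d<<<grid, block>>>();   // the kernel indexes with blockIdx.x/.y and threadIdx.x/.y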

  • © NVIDIA Corporation 2011

    Read-only object

    Dedicated cache

    Dedicated filtering hardware

    (Linear, bilinear, trilinear)

    Addressable as 1D, 2D or 3D

    Out-of-bounds address handling

    (Wrap, clamp)

    Textures

    [Figure: a small 2D texture with fetches at coordinates (2.5, 0.5) and (1.0, 1.0)]