PROGRAMMING GPGPUS USING CUDA
Fan Zhu, 2012-11-20
Source: …research.nesc.ac.uk/files/CUDA_PROGRAMMING.pdf

Page 1: PROGRAMMING GPGPUS USING CUDA

Page 2

WHY GPGPUS

•  GPGPUs - General Purpose Computing on Graphics Processing Units (GPUs)

From NVIDIA: CUDA C Programming Guide

Page 3

GPUS VS. CPUS

•  NVIDIA claims speedups of 10x to 1000x over CPUs for suitable workloads

•  Intel's own comparison found more modest speedups of about 2.5x on average

Page 4

CUDA

•  CUDA - Compute Unified Device Architecture
•  CUDA C (originally "C for CUDA") is the main programming language
•  CUDA Fortran is also available
•  Version 1.0 released in 2007
•  Version 5.0 released in 2012

•  Shared Memory Architecture

Page 5

CUDA CODE PORTABILITY

•  Hardware independent: the same CUDA code runs on any CUDA-capable NVIDIA GPU
•  Tune the launch configuration (grid and block dimensions) for each device to achieve the best performance

Page 6

CUDA WORKFLOW

1.  A CPU thread copies data from main memory to GPU memory.

2.  A CPU thread instructs GPU threads to start processing.

3.  GPU threads execute in parallel on different GPU cores.

3.∗ The CPU thread and any idle GPU threads wait for the running GPU threads to complete. This step overlaps with step 3.

4.  The CPU thread copies the results from GPU memory to main memory.

5.  The CPU thread acts on the results, and may return to step 1 in order to execute another GPU function.
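The workflow above can be sketched as host code (a minimal sketch: the kernel name `scale`, the array size, and the launch configuration are illustrative, and error checking is omitted):

```cuda
#include <stdio.h>

// Step 3: GPU threads execute in parallel, one element per thread.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main(void) {
    const int n = 1024;
    size_t size = n * sizeof(float);
    float h_data[1024];
    for (int i = 0; i < n; i++) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, size);

    // Step 1: copy input from main memory to GPU memory.
    cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel; the call returns immediately.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);

    // Step 3*: the CPU thread waits for the GPU threads to finish.
    cudaDeviceSynchronize();

    // Step 4: copy the results from GPU memory back to main memory.
    cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

    // Step 5: the CPU thread acts on the results.
    printf("h_data[2] = %f\n", h_data[2]);

    cudaFree(d_data);
    return 0;
}
```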

Page 7

FUNCTION TYPES

•  __host__
   •  Executed on the host (CPU)
   •  Callable from the host only
   •  The default when no qualifier is given

•  __global__
   •  Defines a kernel: executed on the device (GPU)
   •  Callable from the host only, via the <<<grid, block>>> launch syntax
   •  Must return void

•  __device__
   •  Executed on the device
   •  Callable from the device only
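The three qualifiers might be combined as follows (an illustrative sketch; `square`, `squareAll`, and `launch` are invented names):

```cuda
// __device__: runs on the GPU, callable only from GPU code.
__device__ float square(float x) { return x * x; }

// __global__: a kernel - runs on the GPU, launched from the CPU.
__global__ void squareAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}

// __host__ (the default): runs on the CPU and launches the kernel.
__host__ void launch(float *d_data, int n) {
    squareAll<<<(n + 255) / 256, 256>>>(d_data, n);
}
```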

Page 8

FUNCTIONS: MEMORY COPY

•  Executed on CPU

•  Allocate and free GPU memory •  cudaMalloc() and cudaFree()

•  Copy CPU memory to GPU memory •  cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);

•  Copy GPU memory to CPU memory •  cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);
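Using the slide's names `h_A`, `d_A`, `h_B`, and `d_B`, a typical allocate/copy/free sequence looks like this (a sketch; the array size is illustrative and error checking is omitted):

```cuda
#include <stdlib.h>

float *h_A, *h_B, *d_A, *d_B;
size_t size = 1024 * sizeof(float);

h_A = (float *)malloc(size);           // CPU memory
h_B = (float *)malloc(size);
cudaMalloc(&d_A, size);                // GPU memory
cudaMalloc(&d_B, size);

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // CPU -> GPU
// ... launch a kernel that reads d_A and writes d_B ...
cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);  // GPU -> CPU

cudaFree(d_A);  cudaFree(d_B);         // free GPU memory
free(h_A);      free(h_B);             // free CPU memory
```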

Page 9

FUNCTIONS

•  __syncthreads()
   •  Called from device code
   •  Barrier: waits until every thread in the block has reached it

•  clock(); clock64();
   •  Called from device code
   •  Read a per-multiprocessor cycle counter, e.g. to time code sections
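A common use of __syncthreads() is a per-block reduction (a sketch; `sumBlock` is an invented name, and it assumes blockDim.x is 256, a power of two):

```cuda
__global__ void sumBlock(const float *in, float *out) {
    __shared__ float partial[256];          // shared within the block
    int t = threadIdx.x;
    partial[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                        // all loads done before any reads

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();                    // each round must finish first
    }
    if (t == 0) out[blockIdx.x] = partial[0];
}
```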

Page 10

CUDA EXAMPLES: VECTOR ADD

•  The slide compares a vector-add implementation on the GPU with one on the CPU.

•  Memory can be requested inside the kernel, but shared memory is limited to 16 KB per block on early CUDA devices.
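A vector-add kernel along the lines of the slide's example (a sketch based on the canonical example in the CUDA C Programming Guide; `vecAdd` and `vecAddCPU` are illustrative names):

```cuda
// GPU version: each thread adds one pair of elements.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];   // guard: n may not divide evenly
}

// CPU version of the same loop, for comparison.
void vecAddCPU(const float *A, const float *B, float *C, int n) {
    for (int i = 0; i < n; i++) C[i] = A[i] + B[i];
}
```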

Page 11

GRID AND BLOCK

[Figure: a grid of 2x2 blocks — Block(0,0), Block(0,1), Block(1,0), Block(1,1) — where each block is a 4x4 array of threads indexed (0,0) through (3,3).]

•  Grid
   •  Blocks in a grid share (global) memory

•  Block (<= 1024 threads)
   •  Threads in a block share cache (on-chip shared memory)
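A thread recovers its global position from its block and thread coordinates (a sketch; `whereAmI` is an invented name, and the commented launch matches a 2x2 grid of 4x4 blocks):

```cuda
// 2D indexing: compute a thread's global (row, col).
__global__ void whereAmI(int *rows, int *cols, int width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    rows[row * width + col] = row;
    cols[row * width + col] = col;
}

// Launch with a 2x2 grid of 4x4 blocks, covering an 8x8 domain:
// dim3 block(4, 4);
// dim3 grid(2, 2);
// whereAmI<<<grid, block>>>(d_rows, d_cols, 8);
```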

Page 12

BLOCKS

[Figure: a single block shown as a 4x4 array of threads, indexed (0,0) through (3,3).]

Page 13

CUDA EXAMPLE: MATRIX ADD

•  The slide compares two launch configurations: 1x1 blocks (one thread per block) versus 16x16 blocks (256 threads per block).
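A matrix-add kernel using 16x16 blocks might look like this (a sketch; `matAdd` and the matrix size N are illustrative, and N is assumed divisible by 16):

```cuda
#define N 1024

// Each thread adds one matrix element.
__global__ void matAdd(const float *A, const float *B, float *C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N)
        C[row * N + col] = A[row * N + col] + B[row * N + col];
}

// 16x16 blocks give 256 threads per block, enough to keep the
// GPU cores busy; 1x1 blocks waste almost all of the hardware.
// dim3 block(16, 16);
// dim3 grid(N / 16, N / 16);
// matAdd<<<grid, block>>>(d_A, d_B, d_C);
```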

Page 14

COMPLETE CODE

Host code and device code live in the same .cu file, compiled with nvcc.

Page 15

THANK YOU.

•  CUDA C Programming Guide
•  http://docs.nvidia.com/cuda/index.html