Page 1

GPGPU Programming with CUDA

Leandro Avila - University of Northern Iowa

Mentor:

Dr. Paul Gray 

Computer Science Department 

University of Northern Iowa

Page 2

Outline

Introduction

Architecture Description

Introduction to CUDA API

Page 3

Introduction

Shift in the traditional paradigm of sequential programming toward parallel processing.

Scientific computing needs to change in order to deal with vast amounts of data.

Hardware changes have contributed to the move toward parallel processing.

Page 4

Three Walls of Serial Performance

Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems

Memory Wall

Discrepancy between memory and CPU performance

Instruction Level Parallelism Wall

Increasing effort put into ILP yields diminishing returns

Power Wall

Clock frequency vs. heat dissipation efforts

Page 5

Accelerators

In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload.

In the old days (the 1980s), supercomputers had array processors, for vector operations on arrays, and floating point accelerators.

More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware.

Courtesy of Henry Neeman - http://www.oscer.ou.edu/

Page 6

Accelerators

Advantages

They make your code run faster

Disadvantages

More expensive

Harder to program

Code is not portable from one accelerator to another (OpenCL attempts to change this)

Courtesy of Henry Neeman - http://www.oscer.ou.edu/

Page 7

Introducing GPGPU 

General Purpose Computing on Graphics Processing Units

Great example of the trend of moving away from the traditional model.

Page 8

Why GPUs?

Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering.

They became very popular with videogamers, because they have produced better and better images, lightning fast.

And prices have been extremely good, ranging from three figures at the low end to four figures at the high end.

GPUs mostly do work like rendering images, and this is done mostly through floating point arithmetic - the same arithmetic people use supercomputing for!

Courtesy of Henry Neeman - http://www.oscer.ou.edu/

Page 9

GPU vs. CPU Flop Rate 

From the Nvidia CUDA Programming Guide

Page 10

Architecture

Page 11

Architecture Comparison

                                Nvidia Tesla C1060        Intel i7 975 Extreme

Processing Cores                240                       4

Memory                          4 GB                      L1 Cache: 32 KB/core
                                                          L2 Cache: 256 KB/core
                                                          L3 Cache: 8 MB (shared)

Clock Speed                     1.3 GHz                   3.33 GHz

Memory Bandwidth                102 GB/sec                25 GB/sec

Floating Point Ops/sec (GFLOPs) 933 Single Precision      70 Double Precision
                                78 Double Precision

Page 12

CPU vs. GPU

From the Nvidia CUDA Programming Guide

Page 13

Components

Texture Processor Clusters

Streaming Multiprocessors

Streaming Processor

From http://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954-7.html

Page 14

Streaming Multiprocessors

Blocks of threads are assigned to SMs

An SM contains 8 Scalar Processors

Tesla C1060: Number of SMs = 30, Number of Cores = 240

The more SMs you have, the better

Page 15

Hardware Hierarchy

Stream Processor Array - contains 10 Texture Processor Clusters

Texture Processor Cluster - contains 3 Streaming Multiprocessors

Streaming Multiprocessor - contains 8 Scalar Processors

Scalar Processors - they do the work :)

Page 16

Connecting some dots...

Great! We see that the GPU architecture is different from that of a traditional CPU.

So... now what? What does all this mean? How do we use it?

Page 17

Glossary

The HOST - the machine executing the main program

The DEVICE - the card with the GPU

The KERNEL - the routine that runs on the GPU

A THREAD - the basic execution unit on the GPU

A BLOCK - a group of threads

A GRID - a group of blocks

A WARP - a group of 32 threads

Page 18

CUDA Kernel Execution

Recall that threads are organized in BLOCKS, and BLOCKS in turn are organized in a GRID.

The GRID can have 2 dimensions, x and y. Maximum size of each dimension of a grid: 65535 x 65535 x 1

A BLOCK can have 3 dimensions, x, y, and z. Maximum size of each dimension of a block: 512 x 512 x 64

Prior to kernel execution, we set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS, as in the sketch below.
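As a minimal sketch (the kernel name myKernel and the device array d_data are placeholders, not from these slides), here is how one might size a one-dimensional launch so that blocks of 256 threads cover N elements:

// Hypothetical 1D launch covering N elements with 256 threads per block.
int N = 100000;                        // problem size (assumed)
dim3 dimBlock(256, 1, 1);              // threads per block
dim3 dimGrid((N + 255) / 256, 1, 1);   // enough blocks to cover all N elements

myKernel<<< dimGrid, dimBlock >>>(d_data, N);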

Page 19

Scheduling in Hardware

Grid is launched

Blocks are distributed to the necessary SMs

SM initiates processing of warps

SM schedules warps that are ready

As warps finish and resources are freed, new warps are scheduled.

An SM can take up to 1024 threads

Ex: 256 threads x 4 blocks OR 128 threads x 8 blocks

[Figure: the host launches Kernel 1 into Grid 1 on the device, a 3 x 2 arrangement of blocks, Block (0,0) through Block (2,1), and then launches Kernel 2 into Grid 2; a single block, e.g. Block (1,1), contains a 5 x 3 arrangement of threads, Thread (0,0) through Thread (4,2).]

Kirk & Hwu - University of Illinois Urbana-Champaign

Page 20

Memory Layout

Registers and shared memory are the fastest

Local Memory is virtual memory (it actually resides in Global Memory)

Global Memory is the slowest.

From the Nvidia CUDA Programming Guide

Page 21

Thread Memory Access

Threads access memory as follows:

Registers - Read & Write

Local Memory - Read & Write

Shared Memory - Read & Write (block level)

Global Memory - Read & Write (grid level)

Constant Memory - Read only (grid level)

Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory.
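Since Constant Memory is read-only from the device, it is written from the host. A small sketch (the symbol coeffs and the host array h_coeffs are placeholders):

__constant__ float coeffs[16];   // read-only from kernels, visible grid-wide

// Host side: copy 16 floats from the host array h_coeffs into the symbol.
cudaMemcpyToSymbol(coeffs, h_coeffs, 16 * sizeof(float));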

Page 22

CUDA API

Page 23

Programming Pattern

Host reads input and allocates memory on the device

Host copies data to the device

Host invokes a kernel that executes in parallel, using the data and hardware on the device, to do some useful work

Host copies the results back from the device for post processing (a sketch of the full pattern follows)
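A minimal end-to-end sketch of this pattern, assuming a hypothetical kernel doubleElements and placeholder names (h_a for the host buffer, d_a for its device copy):

__global__ void doubleElements(float *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) a[i] *= 2.0f;                         // guard against overrun
}

int n = 1024;
size_t size = n * sizeof(float);

// 1. Read input and allocate memory on the device.
float *h_a = (float*)malloc(size);   // host buffer, filled with input data
float *d_a;
cudaMalloc((void**)&d_a, size);

// 2. Copy the data to the device.
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

// 3. Invoke the kernel; it runs in parallel on the device.
doubleElements<<< (n + 255) / 256, 256 >>>(d_a, n);

// 4. Copy the results back for post processing.
cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);

cudaFree(d_a);
free(h_a);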

Page 24

Kernel Setup

__global__ void myKernel(float *d_b, float *d_a);   // declaration

dim3 dimGrid(2, 2, 1);

dim3 dimBlock(4, 8, 8);

myKernel<<< dimGrid, dimBlock >>>( d_b, d_a );

Page 25

Device Memory Allocation

cudaMalloc(&myDataPointer, sizeOfData)

Takes the address of a pointer that will point to the allocated data, and the size of that data.

cudaFree(myDataPointer)

Used to free the allocated memory on the device.

Also check cudaMallocHost() and cudaFreeHost() in the CUDA Reference Manual.
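Putting cudaMalloc() and cudaFree() together (a sketch; d_elements and its size are placeholders):

float *d_elements;
size_t size = 256 * sizeof(float);

cudaMalloc((void**)&d_elements, size);   // pass the ADDRESS of the pointer
/* ... use d_elements in kernels ... */
cudaFree(d_elements);                    // release the device memory when done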

Page 26

Device Data Transfer

cudaMemcpy()

Requires: pointer to destination, pointer to source, size, type of transfer

Examples:

cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);

cudaMemcpy(elements_h, elements_d, size, cudaMemcpyDeviceToHost);
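Like most CUDA API calls, cudaMemcpy returns a cudaError_t, which a careful host program can check (a sketch, assuming <cstdio> is included for fprintf):

cudaError_t err = cudaMemcpy(elements_d, elements_h, size, cudaMemcpyHostToDevice);

if (err != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
}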

Page 27

Function Declaration

__global__ is used to declare a kernel. Its return type must be void.

                                   Executes On    Callable From

__device__ float myDeviceFunc()    Device         Device

__host__ float myHostFunc()        Host           Host

__global__ void myKernel()         Device         Host
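A small sketch combining these qualifiers (the names are placeholders): a __device__ helper that only device code can call, used from a __global__ kernel launched by the host:

__device__ float square(float x) {   // runs on the device, called from device code
    return x * x;
}

__global__ void squareAll(float *a, int n) {   // runs on the device, launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] = square(a[i]);
}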

Page 28

Useful Variables

gridDim.(x|y) - dimensions of the grid in x and y

blockDim.(x|y|z) - number of threads in a block, per dimension

blockIdx.(x|y) - block index within the grid

threadIdx.(x|y|z) - thread index within the block
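These variables are typically combined to compute a thread's global position; a standard idiom (not specific to these slides):

// Global index of this thread across the whole grid (1D launch).
int i = blockIdx.x * blockDim.x + threadIdx.x;

// 2D launch: global (x, y) coordinates of this thread.
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;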

Page 29

Variable Type Qualifiers

Variable type qualifiers specify where in the device's memory a variable resides

__device__ - declares a variable that resides on the device

__constant__ - declares a constant that resides on the device

__shared__ - declares a variable in shared memory, visible to all threads of a block

Note: all dynamically allocated (extern __shared__) variables start at the same address, so you must use offsets if multiple variables are declared in shared memory; see the sketch below.
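A sketch of that offset technique (the kernel twoArrays and the lengths nA and nB are placeholders): one extern __shared__ allocation is carved into two logical arrays by hand:

extern __shared__ float sdata[];   // single dynamic shared memory allocation

__global__ void twoArrays(int nA) {
    float *A = sdata;          // first nA floats
    float *B = &sdata[nA];     // B starts at an offset just past A
    /* ... use A and B as two separate shared arrays ... */
}

// The launch passes the TOTAL shared bytes as the third <<< >>> argument:
// twoArrays<<< grid, block, (nA + nB) * sizeof(float) >>>(nA);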