Introduction to CUDA (1 of n*)
Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011
* Where n is 2 or 3
Agenda
- GPU architecture review
- CUDA: first of two or three dedicated classes
Acknowledgements
Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/
GPU Architecture Review
GPUs are: parallel, multithreaded, many-core. GPUs have: tremendous computational horsepower and high memory bandwidth.
GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation (graphics!). Transistors are devoted to processing, not to data caching and flow control.
GPU Architecture Review
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Transistor Usage
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Threading Hardware in G80
Sources
Slides from ECE 498 AL, Programming Massively Parallel Processors, by Wen-mei Hwu
John Nickolls, NVIDIA
[Figure: the fixed-function graphics pipeline. The 3D application or game sends 3D API commands (OpenGL or Direct3D) and a vertex index stream across the CPU-GPU boundary (AGP/PCIe); the GPU front end and primitive assembly turn pre-transformed vertices into assembled primitives, rasterization and interpolation produce fragments, and raster operations write pixel updates to the frame buffer.]
Fixed-function pipeline
[Figure: the programmable pipeline. The same stages as above, but programmable vertex processors transform the vertices and programmable fragment processors shade the fragments.]
Programmable pipeline
[Figure: the unified programmable pipeline. A single unified vertex, fragment, and geometry processor replaces the separate programmable stages.]
Unified programmable pipeline
General Diagram (6800/NV40)
TurboCache
- Uses PCI-Express bandwidth to render directly to system memory
- The card needs less memory, boosting performance while lowering cost
- The TurboCache Manager dynamically allocates from main memory
- Local memory is used to cache data and to deliver peak performance when needed
NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.
NV40 Fragment Processors
Early termination from mini-z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units.
Why the NV40 series was better
- Massive parallelism
- Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units
- Computational power: 222 million transistors; first to comply with Microsoft's DirectX 9 spec
- Dynamic branching in pixel shaders
Dynamic Branching
Helps detect whether a pixel needs shading. Instruction flow is handled in groups of pixels; the branch granularity (the number of consecutive pixels that take the same branch) can be specified, giving better distribution of blocks of pixels between the different quad engines.
General Diagram (7800/G70)
General Diagram (6800/NV40)
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, so notebook designers do not have to change much about their thermal designs. Dynamic clock scaling can run as slow as 16 MHz; this applies to the engine, memory, and pixel clocks. It makes heavier use of clock gating than the desktop version and runs at lower voltages than any other mobile performance part. Regardless, you won't get much battery-based runtime for a 3D game.
[Figure: GeForce 7800 GTX parallelism: 8 vertex engines feed triangle setup/raster and Z-cull; shader instruction dispatch drives 24 pixel shaders; a fragment crossbar routes results to 16 raster operation pipelines backed by four memory partitions.]
[Figure: G80 unified architecture: the host and input assembler feed vertex, geometry, and pixel thread issue; a thread processor schedules work across eight pairs of streaming processors (SP), each pair with an L1 cache and texture filter (TF) unit, backed by L2 caches and frame buffer (FB) partitions.]
The future of GPUs is programmable processing, so build the architecture around the processor.
G80 – Graphics Mode
G80 CUDA mode – A Device Example
Processors execute computing threads; a new operating mode and hardware interface for computing.
[Figure: G80 in CUDA mode: the host and input assembler feed a thread execution manager; streaming processors with parallel data caches and texture units reach global memory through load/store units.]
The GPU has evolved into a very flexible and powerful processor: it is programmable using high-level languages, it supports 32-bit floating-point precision, it offers lots of GFLOPS, and there is a GPU in every PC and workstation.
[Chart: GFLOPS growth across GPU generations:]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
Why Use the GPU for Computing?
What is behind such an evolution? The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. The fast-growing video game industry exerts strong economic pressure that forces constant innovation.
[Figure: CPU vs. GPU transistor allocation: the CPU devotes large areas to control and cache alongside a few ALUs; the GPU devotes most of its area to many ALUs, with minimal control and cache; both are backed by DRAM.]
What is (Historical) GPGPU?
General-purpose computation using the GPU and graphics APIs in applications other than 3D graphics; the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, and low-latency floating-point (FP) computation. Applications (see GPGPU.org): game effects (FX) physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting.
Previous GPGPU Constraints
Dealing with the graphics API meant working with its corner cases:
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Instruction sets: lack of integer & bit ops
- Communication: limited between pixels; no scatter (a[i] = p)
[Figure: the GPGPU fragment-program model: a fragment program reads input registers, constants, textures, and temp registers (scoped per thread, per shader, and per context) and writes output registers to FB memory.]
An Example of Physical Reality Behind CUDA
A CPU (the host) paired with a GPU with local DRAM (the device).
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads. All threads run the same code (SPMD); each thread has an ID that it uses to compute memory addresses and make control decisions.
[Figure: an array of eight threads (threadID 0-7), grouped into thread blocks 0 through N-1, each running:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
]
Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate.
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks; all threads share the data memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution (for hazard-free shared memory accesses) and by efficiently sharing data through low-latency shared memory. Two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 as Grid 1 (a 3x2 array of blocks) and Kernel 2 as Grid 2 on the device; Block (1, 1) contains a 5x3 array of threads. Courtesy: NVIDIA]
Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on. Block ID: 1D or 2D; thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, and so on.
CUDA Device Memory Space Overview
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories.
[Figure: the device memory spaces: a grid of blocks, each with shared memory plus per-thread registers and local memory, over global, constant, and texture memory.]
Global, Constant, and Texture Memories (Long-Latency Accesses)
Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads. Texture and constant memories are initialized by the host; their contents are also visible to all threads.
[Figure 3.2: an example of CUDA thread organization: the host launches Kernel 1 as Grid 1 (2x2 blocks) and Kernel 2 as Grid 2; Block (1, 1) is a 4x2x2 array of threads. Courtesy: NVIDIA]
Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on. Block ID: 1D or 2D; thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, and so on.
CUDA Memory Model Overview
Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads, and access has long latency. We will focus on global memory for now; constant and texture memory will come later.
[Figure: the CUDA memory model: a grid of blocks, each with shared memory and per-thread registers; all threads access global memory, which the host also reads and writes.]
Parallel Computing on a GPU
- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications, and are available in laptops, desktops, and clusters
- GPU parallelism is doubling every year, and the programming model scales transparently
- Programmable in C with CUDA tools; the multithreaded SPMD model uses application data parallelism and thread parallelism
GeForce 8800
Tesla S870
Tesla D870
Single-Program Multiple-Data (SPMD)
CUDA integrates the CPU and GPU in one application C program: serial C code executes on the CPU, and parallel kernel C code executes on GPU thread blocks:
    CPU serial code
    KernelA<<< nBlk, nTid >>>(args);   // GPU parallel kernel, Grid 0
    CPU serial code
    KernelB<<< nBlk, nTid >>>(args);   // GPU parallel kernel, Grid 1
Grids and Blocks
A kernel is executed as a grid of thread blocks; all threads share the global memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution using barriers and by efficiently sharing data through low-latency shared memory. Two threads from two different blocks cannot cooperate.
CUDA Thread Block
The programmer declares a (thread) block with a block size of 1 to 512 concurrent threads, a block shape of 1D, 2D, or 3D, and block dimensions in threads. All threads in a block execute the same thread program. Threads share data and synchronize while doing their share of the work. Threads have thread ID numbers within the block, and the thread program uses the thread ID to select work and address shared data.
[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, ..., m running one thread program. Courtesy: John Nickolls, NVIDIA]
GeForce-8 Series HW Overview
[Figure: a streaming processor array (SPA) of texture processor clusters (TPCs); each TPC contains a texture unit (TEX) and two streaming multiprocessors (SMs); each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 streaming processors (SPs), and 2 super function units (SFUs).]
CUDA Processor Terminology
- SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs); a multithreaded processor core, the fundamental processing unit for a CUDA thread block
- SP: Streaming Processor; a scalar ALU for a single CUDA thread
Streaming Multiprocessor (SM)
- 8 streaming processors (SPs) and 2 super function units (SFUs)
- Multithreaded instruction dispatch: 1 to 512 threads active; shared instruction fetch per 32 threads; covers the latency of texture/memory loads
- 20+ GFLOPS; 16 KB shared memory; texture and global memory access
[Figure: the SM: instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, 2 SFUs.]
G80 Thread Computing Pipeline
Processors execute computing threads in an alternative operating mode created specifically for computing.
[Figure: G80 in computing mode: the host and input assembler feed a thread execution manager; streaming processors with parallel data caches and texture units reach global memory through load/store units.]
[Figure: the G80 diagram again; the host interface generates thread grids based on kernel calls.]
Thread Life Cycle in HW
A grid is launched on the SPA. Thread blocks are serially distributed to all the SMs, potentially more than one thread block per SM. Each SM launches warps of threads, giving two levels of parallelism. The SM schedules and executes warps that are ready to run. As warps and thread blocks complete, resources are freed and the SPA can distribute more thread blocks.
SM Executes Blocks
Threads are assigned to SMs at block granularity: up to 8 blocks per SM, as resources allow. An SM in G80 can take up to 768 threads: 256 threads/block x 3 blocks, or 128 threads/block x 6 blocks, etc. Threads run concurrently; the SM assigns and maintains thread IDs and manages and schedules thread execution.
[Figure: two SMs (SM 0 and SM 1), each with an MT issue unit, SPs, and shared memory, running blocks of threads t0, t1, t2, ..., tm; a shared texture unit (TF), texture L1, and L2 sit between the SMs and memory.]
Thread Scheduling/Execution
Each thread block is divided into 32-thread warps. (This is an implementation decision, not part of the CUDA programming model.) Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 x 3 = 24 warps. At any point in time, only one of the 24 warps will be selected for instruction fetch and execution.
[Figure: block 1 and block 2 warps (threads t0 ... t31 each) feeding the SM's instruction fetch/dispatch, SPs, SFUs, and shared memory.]
SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling: warps whose next instruction has its operands ready for consumption are eligible for execution, eligible warps are selected for execution by a prioritized scheduling policy, and all threads in a warp execute the same instruction when it is selected. In G80, 4 clock cycles are needed to dispatch the same instruction for all threads in a warp. If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency.
[Figure: the SM multithreaded warp scheduler interleaving instructions over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96.]
SM Instruction Buffer: Warp Scheduling
- Fetch one warp instruction per cycle from the instruction L1 cache into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per cycle from any warp-instruction buffer slot; operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin/age of warp
- The SM broadcasts the same instruction to the 32 threads of a warp
[Figure: the SM pipeline: instruction L1 (I$ L1), multithreaded instruction buffer, register file (RF), constant cache (C$ L1), shared memory, operand select, MAD and SFU units.]
Scoreboarding
All register operands of all instructions in the instruction buffer are scoreboarded: an instruction becomes ready after the needed values are deposited, hazards are prevented, and cleared instructions become eligible for issue. Scoreboarding decouples the memory and processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, which allows memory and processor ops to proceed in the shadow of other waiting memory and processor ops.
[Figure: zero-overhead scheduling timeline (TB = thread block, W = warp): as TB1 W1 stalls the SM switches to TB2 W1, then TB3 W1, TB3 W2, back to TB1 W1, then TB1 W2, TB1 W3, TB3 W2, issuing instructions from whichever warp is ready.]
Granularity Considerations
For matrix multiplication, should I use 4x4, 8x8, 16x16, or 32x32 tiles?
- For 4x4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full.
- For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension.
- For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension.
- For 32x32, we have 1024 threads per block. Not even one can fit into an SM!
Memory Hardware in G80
CUDA Device Memory Space: Review
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories.
[Figure: the device memory spaces: a grid of blocks with shared memory, per-thread registers, and local memory, over global, constant, and texture memory.]
Parallel Memory Sharing
- Local memory: per-thread; private per thread; used for auto variables and register spill
- Shared memory: per-block; shared by threads of the same block; used for inter-thread communication
- Global memory: per-application; shared by all threads; used for inter-grid communication (sequential grids in time)
SM Memory Architecture
Threads in a block share data and results in memory and shared memory, and synchronize at barrier instructions. Per-block shared memory allocation keeps data close to the processor and minimizes trips to global memory; shared memory is dynamically allocated to blocks and is one of the limiting resources.
[Figure: the two-SM diagram again, each SM with an MT issue unit, SPs, and shared memory, running blocks of threads. Courtesy: John Nickolls, NVIDIA]
SM Register File
The register file (RF) holds 32 KB (8K entries) for each SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX), and the load/store pipe can also read/write the RF.
[Figure: the SM pipeline: I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD and SFU.]
Programmer View of Register File
There are 8192 registers in each SM in G80. (This is an implementation decision, not part of CUDA.) Registers are dynamically partitioned across all blocks assigned to the SM; once assigned to a block, a register is NOT accessible by threads in other blocks, and each thread in a block can only access registers assigned to itself.
[Figure: the register file partitioned across 4 blocks vs. 3 blocks.]
Matrix Multiplication Example
If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10 * 256 = 2560 registers; 8192 = 3 * 2560 + change, so three blocks can run on an SM as far as registers are concerned. What if each thread increases its register use by 1? Each block now requires 11 * 256 = 2816 registers, and 8192 < 2816 * 3, so only two blocks can run on an SM: a 1/3 reduction of parallelism!
More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers and programmers: one can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each. This allows for finer-grained threading than traditional CPU threading models; the compiler can trade off between instruction-level parallelism and thread-level parallelism.
Let’s program this thing!
GPU Computing History
- 2001/2002: researchers see the GPU as a data-parallel coprocessor; the GPGPU field is born
- 2007: NVIDIA releases CUDA (Compute Unified Device Architecture); GPGPU shifts to GPU computing
- 2008: Khronos releases the OpenCL specification
CUDA Abstractions
A hierarchy of thread groups Shared memories Barrier synchronization
CUDA Terminology
- Host: typically the CPU; code written in ANSI C
- Device: typically the GPU (data-parallel); code written in extended ANSI C
- Host and device have separate memories
- A CUDA program contains both host and device code
CUDA Terminology
- Kernel: a data-parallel function
- Invoking a kernel creates lightweight threads on the device; threads are generated and scheduled with hardware
Does a kernel remind you of a shader in OpenGL?
CUDA Kernels
A kernel is executed N times in parallel by N different CUDA threads. A kernel is marked with a declaration specifier (__global__), launched with an execution configuration (<<<blocks, threads>>>), and distinguishes its threads by thread ID.
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
- Grid: one or more thread blocks; 1D or 2D
- Block: an array of threads; 1D, 2D, or 3D; each block in a grid has the same number of threads; each thread in a block can synchronize and access shared memory
Thread Hierarchies
Block: 1D, 2D, or 3D. Example: index into a vector, matrix, or volume.
Thread Hierarchies
- Thread ID: scalar thread identifier; thread index: threadIdx
- 1D: thread ID == thread index
- 2D block of size (Dx, Dy): thread ID of index (x, y) == x + y Dx
- 3D block of size (Dx, Dy, Dz): thread ID of index (x, y, z) == x + y Dx + z Dx Dy
Thread Hierarchies
[Figure: a 2D index into one 2D thread block.]
Thread Hierarchies
Thread block: a group of threads. G80 and GT200: up to 512 threads; Fermi: up to 1024 threads. Threads in a block reside on the same processor core and share the memory of that core.
Thread Hierarchies
Block index: blockIdx; block dimension: blockDim (1D or 2D).
Thread Hierarchies
A 2D thread block: 16x16 threads per block.
Thread Hierarchies
Example: N = 32 with 16x16 threads per block (independent of N):
- threadIdx ranges over ([0, 15], [0, 15])
- 2x2 thread blocks in the grid, so blockIdx ranges over ([0, 1], [0, 1]) and blockDim = 16
- i = [0, 1] * 16 + [0, 15]
Thread Hierarchies
Thread blocks execute independently, in any order (in parallel or in series), scheduled in any order by any number of cores. This allows code to scale with core count.
Thread Hierarchies
Threads in a block share (limited) low-latency memory and synchronize execution to coordinate memory accesses. __syncthreads() is a barrier: threads in the block wait until all threads reach it. It is lightweight.
CUDA Memory Transfers
CUDA Memory Transfers
The host can transfer data to and from the device's global memory and constant memory.
CUDA Memory Transfers
- cudaMalloc(): allocates global memory on the device
- cudaFree(): frees it
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
cudaMalloc() takes two parameters: the address of a pointer, which it sets to the allocated device memory, and the allocation size in bytes.
CUDA Memory Transfers
cudaMemcpy() transfers memory: host to host, host to device, device to host, or device to device, through global memory on the device. Does this remind you of VBOs in OpenGL?
Note: plain cudaMemcpy() is synchronous with respect to the host; asynchronous transfers use cudaMemcpyAsync().
CUDA Memory Transfers
In the example code, a host-to-device copy passes the destination (a device pointer) first and the source (a host pointer) second, along with the size in bytes and the direction flag cudaMemcpyHostToDevice.
Matrix Multiply
P = M * N. Assume M and N are square for simplicity. Is this data-parallel?
Matrix Multiply
A 1,000 x 1,000 matrix product requires 1,000,000 dot products, each with 1,000 multiplies and 1,000 adds.
Matrix Multiply: CPU Implementation
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            float sum = 0;
            for (int k = 0; k < width; ++k) {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Matrix Multiply: CUDA Skeleton
Matrix Multiply
Step 1: add CUDA memory transfers to the skeleton
Matrix Multiply: Data Transfer
The host code allocates the device input matrices, allocates the device output matrix, copies the inputs to the device, and later reads the result back from the device. Does this remind you of GPGPU with GLSL?
Matrix Multiply
Step 2: implement the kernel in CUDA C
Matrix Multiply: CUDA Kernel
- Accessing a matrix, so we use a 2D block
- Each thread computes one output element
- Where did the two outer for loops in the CPU implementation go?
- No locks or synchronization. Why?
Matrix Multiply
Step 3: invoke the kernel in CUDA C
Matrix Multiply: Invoke Kernel
The kernel is launched with one block of width by width threads.
Matrix Multiply
- One block of threads computes the matrix Pd; each thread computes one element of Pd
- Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements
- The compute to off-chip memory access ratio is close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block
[Figure: thread (2, 2) of block 1 computing one element of Pd from a row of Md and a column of Nd (WIDTH x WIDTH matrices).]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
What is the major performance problem with our implementation?
What is the major limitation?