
Page 1:

Introduction to CUDA (1 of n*)

Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011

* Where n is 2 or 3

Page 2:

Agenda

- GPU architecture review
- CUDA

First of two or three dedicated classes

Page 3:

Acknowledgements

Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work:
http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/

Page 4:

GPU Architecture Review

GPUs are: parallel, multithreaded, many-core

GPUs have: tremendous computational horsepower and high memory bandwidth

Page 5:

GPU Architecture Review

GPUs are specialized for compute-intensive, highly parallel computation (graphics!)

Transistors are devoted to processing, not to data caching and flow control

Page 6:

GPU Architecture Review

Transistor Usage

Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

Pages 7-17: Slides from http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf

Page 18:

Threading Hardware in G80

Page 19:

Sources

Slides by ECE 498 AL: Programming Massively Parallel Processors (Wen-mei Hwu)

John Nickolls, NVIDIA

Page 20:

Fixed-function pipeline

[Diagram: 3D Application or Game → 3D API (OpenGL or Direct3D) → GPU Front End → Primitive Assembly → Rasterization and Interpolation → Raster Operations → Frame Buffer. Pre-transformed vertices and fragments enter the vertex and fragment stages; transformed vertices and fragments exit. The GPU command & data stream crosses the CPU-GPU boundary (AGP/PCIe).]

Page 21:

Programmable pipeline

[Diagram: the same pipeline as above, with the vertex and fragment stages replaced by a Programmable Vertex Processor and a Programmable Fragment Processor.]

Page 22:

Unified Programmable pipeline

[Diagram: the same pipeline, with the separate vertex and fragment processors replaced by a Unified Vertex, Fragment, Geometry Processor.]

Page 23:

General Diagram (6800/NV40)

Page 24:

TurboCache

- Uses PCI-Express bandwidth to render directly to system memory
- Card needs less memory
- Performance boost while lowering cost
- TurboCache Manager dynamically allocates from main memory
- Local memory used to cache data and to deliver peak performance when needed

Page 25:

NV40 Vertex Processor

An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle

Page 26:

NV40 Fragment Processors

Early termination from mini z-buffer and z-buffer checks; resulting sets of 4 pixels (quads) are passed on to fragment units

Page 27:

Why NV40 series was better

- Massive parallelism
- Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units
- Computation power: 222 million transistors
- First to comply with Microsoft's DirectX 9 spec
- Dynamic branching in pixel shaders

Page 28:

Dynamic Branching

- Helps detect if a pixel needs shading
- Instruction flow handled in groups of pixels
- Specify branch granularity (the number of consecutive pixels that take the same branch)
- Better distribution of blocks of pixels between the different quad engines

Page 29:

General Diagram (7800/G70)

Page 30:

General Diagram (7800/G70)

General Diagram (6800/NV40)

Page 31:

Page 32:

GeForce Go 7800 – Power Issues

- Power consumption and package are the same as the 6800 Ultra chip, meaning notebook designers do not have to change very much about their thermal designs
- Dynamic clock scaling can run as slow as 16 MHz; this is true for the engine, memory, and pixel clocks
- Heavier use of clock gating than the desktop version
- Runs at voltages lower than any other mobile performance part
- Regardless, you won't get much battery-based runtime for a 3D game

Page 33:

GeForce 7800 GTX Parallelism

[Diagram: 8 Vertex Engines; Triangle Setup/Raster; Z-Cull; Shader Instruction Dispatch; 24 Pixel Shaders; Fragment Crossbar; 16 Raster Operation Pipelines; 4 Memory Partitions.]

Page 34:

G80 – Graphics Mode

The future of GPUs is programmable processing, so build the architecture around the processor.

[Diagram: Host → Input Assembler → Vtx / Geom / Pixel Thread Issue with Setup / Rstr / ZCull, feeding eight pairs of streaming processors (SP) with L1 caches and texture filter (TF) units, backed by six L2 cache / framebuffer (FB) partitions.]

Page 35:

G80 CUDA mode – A Device Example

- Processors execute computing threads
- New operating mode/HW interface for computing

[Diagram: Host → Input Assembler → Thread Execution Manager → eight groups of streaming processors, each with a Parallel Data Cache and texture unit → Load/store paths → Global Memory.]

Page 36:

Why Use the GPU for Computing?

The GPU has evolved into a very flexible and powerful processor:
- It's programmable using high-level languages
- It supports 32-bit floating point precision
- It offers lots of GFLOPS

GPU in every PC and workstation

[Chart: GFLOPS over time - G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800.]

Page 37:

What is Behind such an Evolution?

The GPU is specialized for compute-intensive, highly data parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control.

The fast-growing video game industry exerts strong economic pressure that forces constant innovation.

[Diagram: CPU die dominated by control logic and cache with a few ALUs vs. GPU die dominated by ALUs; both backed by DRAM.]

Page 38:

What is (Historical) GPGPU?

General-purpose computation using the GPU and graphics API in applications other than 3D graphics; the GPU accelerates the critical path of the application.

Data parallel algorithms leverage GPU attributes:
- Large data arrays, streaming throughput
- Fine-grain SIMD parallelism
- Low-latency floating point (FP) computation

Applications (see GPGPU.org):
- Game effects (FX), physics, image processing
- Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Page 39:

Previous GPGPU Constraints

Dealing with the graphics API; working with the corner cases of the graphics API:
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Instruction sets: lack of integer & bit ops
- Communication limited: between pixels; scatter a[i] = p

[Diagram: fragment-program model - Input Registers, Constants, and Texture feed a Fragment Program with Temp Registers (per thread, per shader, per context), writing Output Registers to FB Memory.]

Page 40:

An Example of Physical Reality Behind CUDA

[Diagram: CPU (host) connected to a GPU with local DRAM (device).]

Page 41:

Arrays of Parallel Threads

- A CUDA kernel is executed by an array of threads
  - All threads run the same code (SPMD)
  - Each thread has an ID that it uses to compute memory addresses and make control decisions

[Diagram: threads 0-7, each running, with its own threadID:]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
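A minimal, self-contained sketch of a kernel with this shape; applyFunc and the doubling func are hypothetical stand-ins for the slide's generic names:

    // Hypothetical stand-in for the slide's generic func(x).
    __device__ float func(float x) { return 2.0f * x; }

    // Each of the N threads in the array handles one element.
    __global__ void applyFunc(const float* input, float* output)
    {
        int threadID = threadIdx.x;   // this thread's ID within its block
        float x = input[threadID];
        float y = func(x);
        output[threadID] = y;
    }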

Page 42:

Thread Blocks: Scalable Cooperation

- Divide a monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
- Threads in different blocks cannot cooperate

[Diagram: Thread Block 0 through Thread Block N - 1, each with threads 0-7 running the same snippet:]

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
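Because every block runs the same code, a thread combines its block ID and thread ID to get a unique global index; a sketch reusing the hypothetical func above:

    __global__ void applyFuncBlocks(const float* input, float* output)
    {
        // Global index = this block's offset plus the thread's offset within it.
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        output[threadID] = func(input[threadID]);
    }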

Page 43:

Thread Batching: Grids and Blocks

- A kernel is executed as a grid of thread blocks
  - All threads share data memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution, for hazard-free shared memory accesses
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate

[Diagram: Host launches Kernel 1 on Grid 1 (blocks (0,0) through (2,1)) and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 5x3 array of threads (0,0) through (4,2). Courtesy: NVIDIA.]

Page 44:

Block and Thread IDs

- Threads and blocks have IDs, so each thread can decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - ...

[Diagram: Device with Grid 1, blocks (0,0) through (2,1); Block (1, 1) expanded into threads (0,0) through (4,2). Courtesy: NVIDIA.]

Page 45:

CUDA Device Memory Space Overview

Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory

The host can R/W global, constant, and texture memories.

[Diagram: (Device) Grid containing Blocks (0, 0) and (1, 0), each with Shared Memory and per-thread Registers and Local Memory; Global, Constant, and Texture Memory shared across the grid and accessible from the Host.]

Page 46:

Global, Constant, and Texture Memories (Long-Latency Accesses)

- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
- Texture and constant memories
  - Constants initialized by the host
  - Contents visible to all threads

[Diagram: same device memory layout as on page 45. Courtesy: NVIDIA.]

Page 47:

Block IDs and Thread IDs

- Each thread uses IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes
  - ...

[Figure 3.2: An Example of CUDA Thread Organization - Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1, 1) is expanded into a 4x2x2 array of threads (0,0,0) through (3,0,1). Courtesy: NVIDIA.]

Page 48:

CUDA Memory Model Overview

- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long-latency access
- We will focus on global memory for now; constant and texture memory will come later

[Diagram: Grid with Blocks (0, 0) and (1, 0), each with Shared Memory and per-thread Registers; Global Memory shared across the grid and accessible from the Host.]

Page 49:

Parallel Computing on a GPU

- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
  - Available in laptops, desktops, and clusters
- GPU parallelism is doubling every year
- Programming model scales transparently
- Programmable in C with CUDA tools
- Multithreaded SPMD model uses application data parallelism and thread parallelism

[Images: GeForce 8800, Tesla S870, Tesla D870.]

Page 50:

Single-Program Multiple-Data (SPMD)

CUDA integrated CPU + GPU application C program:
- Serial C code executes on the CPU
- Parallel kernel C code executes on GPU thread blocks

[Diagram: CPU serial code, then GPU parallel kernel KernelA<<< nBlk, nTid >>>(args); on Grid 0, then more CPU serial code, then KernelB<<< nBlk, nTid >>>(args); on Grid 1.]

Page 51:

Grids and Blocks

- A kernel is executed as a grid of thread blocks
  - All threads share global memory space
- A thread block is a batch of threads that can cooperate with each other by:
  - Synchronizing their execution using barrier
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate

[Figure 3.2: An Example of CUDA Thread Organization, as on page 47. Courtesy: NVIDIA.]

Page 52:

CUDA Thread Block

- Programmer declares (thread) block:
  - Block size: 1 to 512 concurrent threads
  - Block shape: 1D, 2D, or 3D
  - Block dimensions in threads
- All threads in a block execute the same thread program
- Threads share data and synchronize while doing their share of the work
- Threads have thread ID numbers within the block
- The thread program uses the thread ID to select work and address shared data

[Diagram: CUDA thread block with thread IDs 0, 1, 2, 3, ..., m running a thread program. Courtesy: John Nickolls, NVIDIA.]

Page 53:

GeForce-8 Series HW Overview

[Diagram: Streaming Processor Array of Texture Processor Clusters (TPC); each TPC holds TEX units and two Streaming Multiprocessors (SM); each SM has Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs.]

Page 54:

CUDA Processor Terminology

- SPA: Streaming Processor Array (variable across the GeForce 8-series; 8 in GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs); multi-threaded processor core; fundamental processing unit for a CUDA thread block
- SP: Streaming Processor; scalar ALU for a single CUDA thread

Page 55:

Streaming Multiprocessor (SM)

- 8 Streaming Processors (SP)
- 2 Super Function Units (SFU)
- Multi-threaded instruction dispatch
  - 1 to 512 threads active
  - Shared instruction fetch per 32 threads
  - Covers latency of texture/memory loads
- 20+ GFLOPS
- 16 KB shared memory
- Texture and global memory access

[Diagram: SM with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs.]

Page 56:

G80 Thread Computing Pipeline

- Processors execute computing threads
- Alternative operating mode specifically for computing
- Generates thread grids based on kernel calls

The future of GPUs is programmable processing, so build the architecture around the processor.

[Diagrams: the graphics-mode pipeline (Host → Input Assembler → Vtx / Geom / Pixel Thread Issue, Setup / Rstr / ZCull → SP arrays with L1 / TF → L2 / FB partitions) shown next to the CUDA-mode pipeline (Host → Input Assembler → Thread Execution Manager → SPs with Parallel Data Caches and texture units → Load/store → Global Memory).]

Page 57:

Thread Life Cycle in HW

- Grid is launched on the SPA
- Thread blocks are serially distributed to all the SMs
  - Potentially >1 thread block per SM
- Each SM launches warps of threads
  - 2 levels of parallelism
- SM schedules and executes warps that are ready to run
- As warps and thread blocks complete, resources are freed
  - SPA can distribute more thread blocks

[Diagram: Host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2; Block (1, 1) is expanded into threads (0,0) through (4,2).]

Page 58:

SM Executes Blocks

- Threads are assigned to SMs at block granularity
  - Up to 8 blocks per SM, as resources allow
  - An SM in G80 can take up to 768 threads: could be 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc.
- Threads run concurrently
  - SM assigns/maintains thread IDs
  - SM manages/schedules thread execution

[Diagram: SM 0 and SM 1, each with MT IU, SP, and Shared Memory, holding blocks of threads t0 t1 t2 ... tm; both reach memory through TF, Texture L1, and L2.]

Page 59:

Thread Scheduling/Execution

- Each thread block is divided into 32-thread warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in an SM?
  - Each block is divided into 256/32 = 8 warps
  - There are 8 * 3 = 24 warps
  - At any point in time, only one of the 24 warps will be selected for instruction fetch and execution

[Diagram: Block 1 and Block 2 warps (t0 t1 t2 ... t31) feeding an SM with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs.]

Page 60:

SM Warp Scheduling

- SM hardware implements zero-overhead warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible warps are selected for execution on a prioritized scheduling policy
  - All threads in a warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a warp in G80
  - If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency

[Diagram: SM multithreaded warp scheduler issuing, over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96.]

Page 61:

SM Instruction Buffer – Warp Scheduling

- Fetch one warp instruction/cycle
  - From instruction L1 cache
  - Into any instruction buffer slot
- Issue one "ready-to-go" warp instruction/cycle
  - From any warp - instruction buffer slot
  - Operand scoreboarding used to prevent hazards
- Issue selection based on round-robin/age of warp
- SM broadcasts the same instruction to the 32 threads of a warp

[Diagram: I$ L1 → Multithreaded Instruction Buffer → RF, C$ L1, Shared Mem → Operand Select → MAD and SFU.]

Page 62:

Scoreboarding

- All register operands of all instructions in the instruction buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - Prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops

[Diagram: warp issue timeline - TB1 W1 issues instructions 1-6 and stalls; TB2 W1, TB3 W1, and TB3 W2 issue while it waits; TB1 W1 resumes with instruction 7, followed by TB1 W2, TB1 W3, TB3 W2 (TB = thread block, W = warp).]

Page 63:

Granularity Considerations

For matrix multiplication, should I use 4x4, 8x8, 16x16, or 32x32 tiles?

- For 4x4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM!
  - There are 8 warps but each warp is only half full.
- For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM!
  - There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension.
- For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity unless other resource considerations overrule.
  - There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension.
- For 32x32, we have 1024 threads per block. Not even one can fit into an SM!

Page 64:

Memory Hardware in G80

Page 65:

CUDA Device Memory Space: Review

Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory

The host can R/W global, constant, and texture memories.

[Diagram: same device memory layout as on page 45.]

Page 66:

Parallel Memory Sharing

- Local memory: per-thread
  - Private per thread
  - Auto variables, register spill
- Shared memory: per-block
  - Shared by threads of the same block
  - Inter-thread communication
- Global memory: per-application
  - Shared by all threads
  - Inter-grid communication

[Diagram: Thread ↔ Local Memory; Block ↔ Shared Memory; sequential grids in time (Grid 0, Grid 1, ...) ↔ Global Memory.]

Page 67:

SM Memory Architecture

- Threads in a block share data & results
  - In memory and shared memory
  - Synchronize at barrier instruction
- Per-block shared memory allocation
  - Keeps data close to the processor
  - Minimizes trips to global memory
  - Shared memory is dynamically allocated to blocks, one of the limiting resources

[Diagram: SM 0 and SM 1 with MT IU, SP, Shared Memory, holding blocks of threads t0 t1 t2 ... tm; TF, Texture L1, L2, Memory. Courtesy: John Nickolls, NVIDIA.]

Page 68:

SM Register File

- Register File (RF): 32 KB (8K entries) for each SM in G80
- TEX pipe can also read/write the RF
  - 2 SMs share 1 TEX
- Load/Store pipe can also read/write the RF

[Diagram: I$ L1 → Multithreaded Instruction Buffer → RF, C$ L1, Shared Mem → Operand Select → MAD and SFU.]

Page 69:

Programmer View of Register File

- There are 8192 registers in each SM in G80
  - This is an implementation decision, not part of CUDA
  - Registers are dynamically partitioned across all blocks assigned to the SM
  - Once assigned to a block, a register is NOT accessible by threads in other blocks
  - Each thread in the same block can only access registers assigned to itself

[Diagram: register file divided among 4 blocks vs. among 3 blocks.]

Page 70:

Matrix Multiplication Example

- If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
  - Each block requires 10 * 256 = 2560 registers
  - 8192 = 3 * 2560 + change
  - So, three blocks can run on an SM as far as registers are concerned
- How about if each thread increases its use of registers by 1?
  - Each block now requires 11 * 256 = 2816 registers
  - 8192 < 2816 * 3
  - Only two blocks can run on an SM: a 1/3 reduction of parallelism!

Page 71:

More on Dynamic Partitioning

- Dynamic partitioning gives more flexibility to compilers/programmers
  - One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
  - This allows for finer-grain threading than traditional CPU threading models
- The compiler can trade off between instruction-level parallelism and thread-level parallelism

Page 72:

Let’s program this thing!

Page 73:

GPU Computing History

- 2001/2002 – researchers see the GPU as a data-parallel coprocessor
  - The GPGPU field is born
- 2007 – NVIDIA releases CUDA
  - CUDA – Compute Unified Device Architecture
  - GPGPU shifts to GPU Computing
- 2008 – Khronos releases the OpenCL specification

Page 74:

CUDA Abstractions

- A hierarchy of thread groups
- Shared memories
- Barrier synchronization

Page 75:

CUDA Terminology

- Host – typically the CPU
  - Code written in ANSI C
- Device – typically the GPU (data-parallel)
  - Code written in extended ANSI C
- Host and device have separate memories
- CUDA program: contains both host and device code

Page 76:

CUDA Terminology

- Kernel – data-parallel function
  - Invoking a kernel creates lightweight threads on the device
  - Threads are generated and scheduled with hardware

Does a kernel remind you of a shader in OpenGL?

Page 77:

CUDA Kernels

Executed N times in parallel by N different CUDA threads

[Code figure annotations: Thread ID, Execution Configuration, Declaration Specifier.]
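The missing code figure is very likely the vector-addition kernel from the CUDA C Programming Guide; a sketch with the three annotated parts marked in comments:

    // Declaration specifier: __global__ marks a kernel that runs on the device.
    __global__ void VecAdd(const float* A, const float* B, float* C)
    {
        int i = threadIdx.x;    // thread ID picks this thread's element
        C[i] = A[i] + B[i];
    }

    // Execution configuration: <<<1, N>>> launches one block of N threads.
    void launch(const float* A, const float* B, float* C, int N)
    {
        VecAdd<<<1, N>>>(A, B, C);
    }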

Page 78:

CUDA Program Execution

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 79:

Thread Hierarchies

- Grid – one or more thread blocks
  - 1D or 2D
- Block – array of threads
  - 1D, 2D, or 3D
  - Each block in a grid has the same number of threads
  - Each thread in a block can synchronize and access shared memory
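A sketch of how a launch declares these hierarchies (myKernel is a hypothetical empty kernel; unspecified dim3 components default to 1):

    __global__ void myKernel() { }   // hypothetical kernel

    void launchExample()
    {
        dim3 dimGrid(4, 4);          // grid: 4x4 = 16 thread blocks (2D)
        dim3 dimBlock(8, 8, 4);      // block: 8x8x4 = 256 threads (3D)
        myKernel<<<dimGrid, dimBlock>>>();
    }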

Page 80:

Thread Hierarchies

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 81:

Thread Hierarchies

Block – 1D, 2D, or 3D. Example: index into vector, matrix, volume

Page 82:

Thread Hierarchies

- Thread ID: scalar thread identifier
- Thread index: threadIdx
- 1D: thread ID == thread index
- 2D block with size (Dx, Dy): thread ID of index (x, y) == x + y Dx
- 3D block with size (Dx, Dy, Dz): thread ID of index (x, y, z) == x + y Dx + z Dx Dy
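Inside a kernel, the same flattening can be computed from the built-in variables; a sketch:

    // Scalar thread ID inside a 3D block of size (Dx, Dy, Dz) = blockDim:
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;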

Page 83:

Thread Hierarchies

[Figure: 1 thread block - a 2D block addressed by a 2D thread index.]

Page 84:

Thread Hierarchies

- Thread block: group of threads
  - G80 and GT200: up to 512 threads
  - Fermi: up to 1024 threads
  - Threads reside on the same processor core and share the memory of that core

Page 85:

Thread Hierarchies

Thread block: group of threads (same content as the previous slide, illustrated by the figure below)

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 86:

Thread Hierarchies

- Block index: blockIdx
- Dimension: blockDim
  - 1D or 2D

Page 87:

Thread Hierarchies

2D thread block: 16x16 threads per block

Page 88:

Thread Hierarchies

Example: N = 32
- 16x16 threads per block (independent of N)
  - threadIdx ([0, 15], [0, 15])
- 2x2 thread blocks in grid
  - blockIdx ([0, 1], [0, 1])
  - blockDim = 16
- i = [0, 1] * 16 + [0, 15]
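In kernel code this index arithmetic looks like the following sketch (the row computation j is an assumption by symmetry; the slide shows only i):

    int i = blockIdx.x * blockDim.x + threadIdx.x;   // [0,1]*16 + [0,15] -> [0,31]
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // same for the second dimension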

Page 89:

Thread Hierarchies

- Thread blocks execute independently
  - In any order: parallel or series
  - Scheduled in any order by any number of cores
- Allows code to scale with core count

Page 90:

Thread Hierarchies

Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf

Page 91:

Thread Hierarchies

- Threads in a block
  - Share (limited) low-latency memory
  - Synchronize execution, to coordinate memory accesses
- __syncthreads()
  - Barrier: threads in the block wait until all threads reach this point
  - Lightweight

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
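A minimal sketch of the barrier in use, assuming a hypothetical kernel launched with 256-thread blocks that sums adjacent pairs through shared memory:

    __global__ void sumPairs(const float* in, float* out)
    {
        __shared__ float tile[256];            // low-latency per-block memory
        int t = threadIdx.x;                   // assumes blockDim.x == 256
        tile[t] = in[blockIdx.x * blockDim.x + t];
        __syncthreads();     // barrier: every thread's write to tile is visible
        if (t < 128)
            out[blockIdx.x * 128 + t] = tile[2 * t] + tile[2 * t + 1];
    }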

Page 92:

CUDA Memory Transfers

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 93:

CUDA Memory Transfers

Host can transfer to/from device: global memory and constant memory

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 94:

CUDA Memory Transfers

- cudaMalloc(): allocate global memory on the device
- cudaFree(): free allocated memory

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
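In the style of the textbook chapter the missing code figure is taken from, the two calls look roughly like this (Md and Width are that chapter's device-matrix names, reused here as assumptions):

    int size = Width * Width * sizeof(float);  // size in bytes
    float* Md;                                 // pointer to device memory
    cudaMalloc((void**)&Md, size);             // allocate device global memory
    // ... use Md on the device ...
    cudaFree(Md);                              // release it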

Page 95:

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 96:

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Code figure annotation: pointer to device memory]

Page 97:

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Code figure annotation: size in bytes]

Page 98:

CUDA Memory Transfers

- cudaMemcpy(): memory transfer
  - Host to host
  - Host to device
  - Device to host
  - Device to device

[Diagram: Host ↔ Device Global Memory.]

Does this remind you of VBOs in OpenGL?

Pages 99-102:

CUDA Memory Transfers

cudaMemcpy() memory transfer, with each direction highlighted in turn: host to host, host to device, device to host, device to device.

[Diagram: Host ↔ Device Global Memory.]

Note: cudaMemcpy() blocks until the transfer completes; asynchronous transfers use cudaMemcpyAsync()
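A sketch of the two common directions (M and P on the host, Md and Pd on the device, carried over from the textbook example as assumptions):

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);  // host source -> device dest
    cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);  // device source -> host dest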

Page 103:

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Code figure: the host-to-device cudaMemcpy highlighted; Host ↔ Device Global Memory.]

Page 104:

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Code figure: destination (device) and source (host) arguments highlighted; Host ↔ Device Global Memory.]

Page 105:

CUDA Memory Transfers

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

[Diagram: Host ↔ Device Global Memory.]

Page 106:

Matrix Multiply

Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

- P = M * N
- Assume M and N are square for simplicity
- Is this data-parallel?

Page 107:

Matrix Multiply

- 1,000 x 1,000 matrix
- 1,000,000 dot products
- Each: 1,000 multiplies and 1,000 adds

Page 108:

Matrix Multiply: CPU Implementation

Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt

    void MatrixMulOnHost(float* M, float* N, float* P, int width)
    {
        for (int i = 0; i < width; ++i)
            for (int j = 0; j < width; ++j)
            {
                float sum = 0;
                for (int k = 0; k < width; ++k)
                {
                    float a = M[i * width + k];   // element of row i of M
                    float b = N[k * width + j];   // element of column j of N
                    sum += a * b;
                }
                P[i * width + j] = sum;           // dot product of row i, column j
            }
    }

Pages 109-111:

Matrix Multiply: CUDA Skeleton

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 112:

Matrix Multiply

Step 1: Add CUDA memory transfers to the skeleton

Page 113:

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Allocate input

Page 114:

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Allocate output

Page 115:

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Page 116:

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Read back from device

Page 117:

Matrix Multiply: Data Transfer

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Does this remind you of GPGPU with GLSL?
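Putting pages 113-117 together, the missing code figure plausibly resembled this textbook-style host function; a hedged reconstruction, not the exact original:

    void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
    {
        int size = Width * Width * sizeof(float);
        float *Md, *Nd, *Pd;

        // Allocate input matrices on the device and copy them over
        cudaMalloc((void**)&Md, size);
        cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Nd, size);
        cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

        // Allocate output matrix on the device
        cudaMalloc((void**)&Pd, size);

        // Kernel invocation goes here (steps 2 and 3)

        // Read back from device, then free device memory
        cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
        cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
    }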

Page 118:

Matrix Multiply

Step 2: Implement the kernel in CUDA C

Page 119:

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Accessing a matrix, so using a 2D block
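The code figure itself is missing; a reconstruction in the style of the textbook chapter cited above (Md, Nd, Pd as on page 125):

    __global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
    {
        // 2D thread index picks one element of the output matrix Pd
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // One dot product of a row of Md and a column of Nd per thread
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += Md[ty * Width + k] * Nd[k * Width + tx];

        Pd[ty * Width + tx] = Pvalue;
    }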

Page 120:

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Each thread of the kernel computes one output element

Page 121:

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

Where did the two outer for loops in the CPU implementation go?

Page 122:

Matrix Multiply: CUDA Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

No locks or synchronization, why?

Page 123:

Matrix Multiply

Step 3: Invoke the kernel in CUDA C

Page 124:

Matrix Multiply: Invoke Kernel

Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf

One block with width by width threads
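The invocation, reconstructed in the same style:

    dim3 dimGrid(1, 1);                // one block...
    dim3 dimBlock(Width, Width);       // ...with Width x Width threads
    MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);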

Page 125:

Matrix Multiply


- One block of threads computes matrix Pd
  - Each thread computes one element of Pd
- Each thread:
  - Loads a row of matrix Md
  - Loads a column of matrix Nd
  - Performs one multiply and one addition for each pair of Md and Nd elements
  - Compute to off-chip memory access ratio close to 1:1 (not very high)
- Size of matrix limited by the number of threads allowed in a thread block

[Diagram: Grid 1, Block 1 - Thread (2, 2) computes Pd element 48 from a row of Md (3 2 5 4) and a column of Nd (2 4 2 6); WIDTH marks the matrix dimension.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009. ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign.

Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt

Page 126:

Matrix Multiply

What is the major performance problem with our implementation?

What is the major limitation?