Introduction to CUDA (1 of n*)
Joseph Kider, University of Pennsylvania, CIS 565 - Spring 2011
* Where n is 2 or 3
Agenda
- GPU architecture review
- CUDA: first of two or three dedicated classes
Acknowledgements
Many slides are from Kayvon Fatahalian's From Shader Code to a Teraflop: How GPU Shader Cores Work: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
David Kirk and Wen-mei Hwu's UIUC course: http://courses.engr.illinois.edu/ece498/al/
GPU Architecture Review
GPUs are: parallel, multithreaded, many-core. GPUs have: tremendous computational horsepower and high memory bandwidth.
GPU Architecture Review
GPUs are specialized for compute-intensive, highly parallel computation (graphics!). Transistors are devoted to processing, not to data caching and flow control.
GPU Architecture Review
Image from: http://developer.download.nvidia.com/compute/cuda/3_2_prod/toolkit/docs/CUDA_C_Programming_Guide.pdf
Transistor Usage
Slide from: http://bps10.idav.ucdavis.edu/talks/03-fatahalian_gpuArchTeraflop_BPS_SIGGRAPH2010.pdf
Threading Hardware in G80
Sources
Slides from ECE 498 AL, Programming Massively Parallel Processors, by Wen-mei Hwu
John Nickolls, NVIDIA
[Figure: the fixed-function graphics pipeline. The 3D application or game sends 3D API commands (OpenGL or Direct3D) and a vertex index stream across the CPU-GPU boundary (AGP/PCIe); the GPU front end and primitive assembly turn pre-transformed vertices into assembled primitives, rasterization and interpolation produce fragments, and raster operations write pixel updates to the frame buffer.]
Fixed-function pipeline
[Figure: the programmable pipeline. The same stages as above, but programmable vertex processors transform the vertices and programmable fragment processors shade the fragments.]
Programmable pipeline
[Figure: the unified programmable pipeline. A single unified vertex, fragment, and geometry processor replaces the separate programmable stages.]
Unified programmable pipeline
General Diagram (6800/NV40)
TurboCache
- Uses PCI-Express bandwidth to render directly to system memory
- The card needs less memory, boosting performance while lowering cost
- The TurboCache Manager dynamically allocates from main memory
- Local memory is used to cache data and to deliver peak performance when needed
NV40 Vertex Processor
An NV40 vertex processor can execute one vector operation (up to four FP32 components), one scalar FP32 operation, and one texture access per clock cycle.
NV40 Fragment Processors
Early termination from mini-z-buffer and z-buffer checks; the resulting sets of 4 pixels (quads) are passed on to the fragment units.
Why the NV40 series was better
- Massive parallelism
- Scalability: lower-end products have fewer pixel pipes and fewer vertex shader units
- Computational power: 222 million transistors; first to comply with Microsoft's DirectX 9 spec
- Dynamic branching in pixel shaders
Dynamic Branching
Helps detect whether a pixel needs shading. Instruction flow is handled in groups of pixels; the branch granularity (the number of consecutive pixels that take the same branch) can be specified, giving better distribution of blocks of pixels between the different quad engines.
General Diagram (7800/G70)
General Diagram (6800/NV40)
GeForce Go 7800 – Power Issues
Power consumption and package are the same as the 6800 Ultra chip, so notebook designers do not have to change much about their thermal designs. Dynamic clock scaling can run as slow as 16 MHz; this applies to the engine, memory, and pixel clocks. It makes heavier use of clock gating than the desktop version and runs at lower voltages than any other mobile performance part. Regardless, you won't get much battery-based runtime for a 3D game.
[Figure: GeForce 7800 GTX parallelism: 8 vertex engines feed triangle setup/raster and Z-cull; shader instruction dispatch drives 24 pixel shaders; a fragment crossbar routes results to 16 raster operation pipelines backed by four memory partitions.]
[Figure: G80 unified architecture: the host and input assembler feed vertex, geometry, and pixel thread issue; a thread processor schedules work across eight pairs of streaming processors (SP), each pair with an L1 cache and texture filter (TF) unit, backed by L2 caches and frame buffer (FB) partitions.]
The future of GPUs is programmable processing, so build the architecture around the processor.
G80 – Graphics Mode
G80 CUDA mode – A Device Example
Processors execute computing threads; a new operating mode and hardware interface for computing.
[Figure: G80 in CUDA mode: the host and input assembler feed a thread execution manager; streaming processors with parallel data caches and texture units reach global memory through load/store units.]
The GPU has evolved into a very flexible and powerful processor: it is programmable using high-level languages, it supports 32-bit floating-point precision, it offers lots of GFLOPS, and there is a GPU in every PC and workstation.
[Chart: GFLOPS growth across GPU generations:]
G80 = GeForce 8800 GTX
G71 = GeForce 7900 GTX
G70 = GeForce 7800 GTX
NV40 = GeForce 6800 Ultra
NV35 = GeForce FX 5950 Ultra
NV30 = GeForce FX 5800
Why Use the GPU for Computing?
What is behind such an evolution? The GPU is specialized for compute-intensive, highly data-parallel computation (exactly what graphics rendering is about), so more transistors can be devoted to data processing rather than data caching and flow control. The fast-growing video game industry exerts strong economic pressure that forces constant innovation.
[Figure: CPU vs. GPU transistor allocation: the CPU devotes large areas to control and cache alongside a few ALUs; the GPU devotes most of its area to many ALUs, with minimal control and cache; both are backed by DRAM.]
What is (Historical) GPGPU?
General-purpose computation using the GPU and graphics APIs in applications other than 3D graphics; the GPU accelerates the critical path of the application. Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput, fine-grain SIMD parallelism, and low-latency floating-point (FP) computation. Applications (see GPGPU.org): game effects (FX) physics, image processing, physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting.
Previous GPGPU Constraints
Dealing with the graphics API meant working with its corner cases:
- Addressing modes: limited texture size/dimension
- Shader capabilities: limited outputs
- Instruction sets: lack of integer & bit ops
- Communication: limited between pixels; no scatter (a[i] = p)
[Figure: the GPGPU fragment-program model: a fragment program reads input registers, constants, textures, and temp registers (scoped per thread, per shader, and per context) and writes output registers to FB memory.]
An Example of Physical Reality Behind CUDA
A CPU (the host) paired with a GPU with local DRAM (the device).
Arrays of Parallel Threads
A CUDA kernel is executed by an array of threads. All threads run the same code (SPMD); each thread has an ID that it uses to compute memory addresses and make control decisions.
[Figure: an array of eight threads (threadID 0-7), grouped into thread blocks 0 through N-1, each running:
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
]
Thread Blocks: Scalable Cooperation
Divide the monolithic thread array into multiple blocks. Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization; threads in different blocks cannot cooperate.
Thread Batching: Grids and Blocks
A kernel is executed as a grid of thread blocks; all threads share the data memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution (for hazard-free shared memory accesses) and by efficiently sharing data through low-latency shared memory. Two threads from two different blocks cannot cooperate.
[Figure: the host launches Kernel 1 as Grid 1 (a 3x2 array of blocks) and Kernel 2 as Grid 2 on the device; Block (1, 1) contains a 5x3 array of threads. Courtesy: NVIDIA]
Block and Thread IDs
Threads and blocks have IDs, so each thread can decide what data to work on. Block ID: 1D or 2D; thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, and so on.
CUDA Device Memory Space Overview
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories.
[Figure: the device memory spaces: a grid of blocks, each with shared memory plus per-thread registers and local memory, over global, constant, and texture memory.]
Global, Constant, and Texture Memories (Long-Latency Accesses)
Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads. Texture and constant memories are initialized by the host; their contents are also visible to all threads.
[Figure 3.2: an example of CUDA thread organization: the host launches Kernel 1 as Grid 1 (2x2 blocks) and Kernel 2 as Grid 2; Block (1, 1) is a 4x2x2 array of threads. Courtesy: NVIDIA]
Block IDs and Thread IDs
Each thread uses its IDs to decide what data to work on. Block ID: 1D or 2D; thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, and so on.
CUDA Memory Model Overview
Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads, and access has long latency. We will focus on global memory for now; constant and texture memory will come later.
[Figure: the CUDA memory model: a grid of blocks, each with shared memory and per-thread registers; all threads access global memory, which the host also reads and writes.]
Parallel Computing on a GPU
- 8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications, and are available in laptops, desktops, and clusters
- GPU parallelism is doubling every year, and the programming model scales transparently
- Programmable in C with CUDA tools; the multithreaded SPMD model uses application data parallelism and thread parallelism
GeForce 8800
Tesla S870
Tesla D870
Single-Program Multiple-Data (SPMD)
CUDA integrates the CPU and GPU in one application C program: serial C code executes on the CPU, and parallel kernel C code executes on GPU thread blocks:
    CPU serial code
    KernelA<<< nBlk, nTid >>>(args);   // GPU parallel kernel, Grid 0
    CPU serial code
    KernelB<<< nBlk, nTid >>>(args);   // GPU parallel kernel, Grid 1
Grids and Blocks
A kernel is executed as a grid of thread blocks; all threads share the global memory space. A thread block is a batch of threads that can cooperate with each other by synchronizing their execution using barriers and by efficiently sharing data through low-latency shared memory. Two threads from two different blocks cannot cooperate.
CUDA Thread Block
The programmer declares a (thread) block with a block size of 1 to 512 concurrent threads, a block shape of 1D, 2D, or 3D, and block dimensions in threads. All threads in a block execute the same thread program. Threads share data and synchronize while doing their share of the work. Threads have thread ID numbers within the block, and the thread program uses the thread ID to select work and address shared data.
[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, ..., m running one thread program. Courtesy: John Nickolls, NVIDIA]
GeForce-8 Series HW Overview
[Figure: a streaming processor array (SPA) of texture processor clusters (TPCs); each TPC contains a texture unit (TEX) and two streaming multiprocessors (SMs); each SM has an instruction L1, a data L1, instruction fetch/dispatch, shared memory, 8 streaming processors (SPs), and 2 super function units (SFUs).]
CUDA Processor Terminology
- SPA: Streaming Processor Array (variable across the GeForce 8 series; 8 TPCs in the GeForce 8800)
- TPC: Texture Processor Cluster (2 SMs + TEX)
- SM: Streaming Multiprocessor (8 SPs); a multithreaded processor core, the fundamental processing unit for a CUDA thread block
- SP: Streaming Processor; a scalar ALU for a single CUDA thread
Streaming Multiprocessor (SM)
- 8 streaming processors (SPs) and 2 super function units (SFUs)
- Multithreaded instruction dispatch: 1 to 512 threads active; shared instruction fetch per 32 threads; covers the latency of texture/memory loads
- 20+ GFLOPS; 16 KB shared memory; texture and global memory access
[Figure: the SM: instruction L1, data L1, instruction fetch/dispatch, shared memory, 8 SPs, 2 SFUs.]
G80 Thread Computing Pipeline
Processors execute computing threads in an alternative operating mode created specifically for computing.
[Figure: G80 in computing mode: the host and input assembler feed a thread execution manager; streaming processors with parallel data caches and texture units reach global memory through load/store units.]
[Figure: the G80 diagram again; the host interface generates thread grids based on kernel calls.]
Thread Life Cycle in HW
A grid is launched on the SPA. Thread blocks are serially distributed to all the SMs, potentially more than one thread block per SM. Each SM launches warps of threads, giving two levels of parallelism. The SM schedules and executes warps that are ready to run. As warps and thread blocks complete, resources are freed and the SPA can distribute more thread blocks.
SM Executes Blocks
Threads are assigned to SMs at block granularity: up to 8 blocks per SM, as resources allow. An SM in G80 can take up to 768 threads: 256 threads/block x 3 blocks, or 128 threads/block x 6 blocks, etc. Threads run concurrently; the SM assigns and maintains thread IDs and manages and schedules thread execution.
[Figure: two SMs (SM 0 and SM 1), each with an MT issue unit, SPs, and shared memory, running blocks of threads t0, t1, t2, ..., tm; a shared texture unit (TF), texture L1, and L2 sit between the SMs and memory.]
Thread Scheduling/Execution
Each thread block is divided into 32-thread warps. (This is an implementation decision, not part of the CUDA programming model.) Warps are the scheduling units in an SM. If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM? Each block is divided into 256/32 = 8 warps, so there are 8 x 3 = 24 warps. At any point in time, only one of the 24 warps will be selected for instruction fetch and execution.
[Figure: block 1 and block 2 warps (threads t0 ... t31 each) feeding the SM's instruction fetch/dispatch, SPs, SFUs, and shared memory.]
SM Warp Scheduling
SM hardware implements zero-overhead warp scheduling: warps whose next instruction has its operands ready for consumption are eligible for execution, eligible warps are selected for execution by a prioritized scheduling policy, and all threads in a warp execute the same instruction when it is selected. In G80, 4 clock cycles are needed to dispatch the same instruction for all threads in a warp. If one global memory access is needed for every 4 instructions, a minimum of 13 warps is needed to fully tolerate 200-cycle memory latency.
[Figure: the SM multithreaded warp scheduler interleaving instructions over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, ..., warp 3 instruction 96.]
SM Instruction Buffer: Warp Scheduling
- Fetch one warp instruction per cycle from the instruction L1 cache into any instruction buffer slot
- Issue one "ready-to-go" warp instruction per cycle from any warp-instruction buffer slot; operand scoreboarding is used to prevent hazards
- Issue selection is based on round-robin/age of warp
- The SM broadcasts the same instruction to the 32 threads of a warp
[Figure: the SM pipeline: instruction L1 (I$ L1), multithreaded instruction buffer, register file (RF), constant cache (C$ L1), shared memory, operand select, MAD and SFU units.]
Scoreboarding
All register operands of all instructions in the instruction buffer are scoreboarded: an instruction becomes ready after the needed values are deposited, hazards are prevented, and cleared instructions become eligible for issue. Scoreboarding decouples the memory and processor pipelines: any thread can continue to issue instructions until scoreboarding prevents issue, which allows memory and processor ops to proceed in the shadow of other waiting memory and processor ops.
[Figure: zero-overhead scheduling timeline (TB = thread block, W = warp): as TB1 W1 stalls the SM switches to TB2 W1, then TB3 W1, TB3 W2, back to TB1 W1, then TB1 W2, TB1 W3, TB3 W2, issuing instructions from whichever warp is ready.]
Granularity Considerations
For matrix multiplication, should I use 4x4, 8x8, 16x16, or 32x32 tiles?
- For 4x4, we have 16 threads per block. Since each SM can take up to 768 threads, the thread capacity allows 48 blocks. However, each SM can only take up to 8 blocks, so there will be only 128 threads in each SM! There are 8 warps, but each warp is only half full.
- For 8x8, we have 64 threads per block. Since each SM can take up to 768 threads, it could take up to 12 blocks. However, each SM can only take up to 8 blocks, so only 512 threads will go into each SM! There are 16 warps available for scheduling in each SM; each warp spans four slices in the y dimension.
- For 16x16, we have 256 threads per block. Since each SM can take up to 768 threads, it can take up to 3 blocks and achieve full capacity, unless other resource considerations overrule. There are 24 warps available for scheduling in each SM; each warp spans two slices in the y dimension.
- For 32x32, we have 1024 threads per block. Not even one can fit into an SM!
Memory Hardware in G80
CUDA Device Memory Space: Review
Each thread can:
- R/W per-thread registers
- R/W per-thread local memory
- R/W per-block shared memory
- R/W per-grid global memory
- Read only per-grid constant memory
- Read only per-grid texture memory
The host can R/W global, constant, and texture memories.
[Figure: the device memory spaces: a grid of blocks with shared memory, per-thread registers, and local memory, over global, constant, and texture memory.]
Parallel Memory Sharing
- Local memory: per-thread; private per thread; used for auto variables and register spill
- Shared memory: per-block; shared by threads of the same block; used for inter-thread communication
- Global memory: per-application; shared by all threads; used for inter-grid communication (sequential grids in time)
SM Memory Architecture
Threads in a block share data and results in memory and shared memory, and synchronize at barrier instructions. Per-block shared memory allocation keeps data close to the processor and minimizes trips to global memory; shared memory is dynamically allocated to blocks and is one of the limiting resources.
[Figure: the two-SM diagram again, each SM with an MT issue unit, SPs, and shared memory, running blocks of threads. Courtesy: John Nickolls, NVIDIA]
SM Register File
The register file (RF) holds 32 KB (8K entries) for each SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX), and the load/store pipe can also read/write the RF.
[Figure: the SM pipeline: I$ L1, multithreaded instruction buffer, RF, C$ L1, shared memory, operand select, MAD and SFU.]
Programmer View of Register File
There are 8192 registers in each SM in G80. (This is an implementation decision, not part of CUDA.) Registers are dynamically partitioned across all blocks assigned to the SM; once assigned to a block, a register is NOT accessible by threads in other blocks, and each thread in a block can only access registers assigned to itself.
[Figure: the register file partitioned across 4 blocks vs. 3 blocks.]
Matrix Multiplication Example
If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10 * 256 = 2560 registers; 8192 = 3 * 2560 + change, so three blocks can run on an SM as far as registers are concerned. What if each thread increases its register use by 1? Each block now requires 11 * 256 = 2816 registers, and 8192 < 2816 * 3, so only two blocks can run on an SM: a 1/3 reduction of parallelism!
More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers and programmers: one can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each. This allows for finer-grained threading than traditional CPU threading models; the compiler can trade off between instruction-level parallelism and thread-level parallelism.
Let’s program this thing!
GPU Computing History
- 2001/2002: researchers see the GPU as a data-parallel coprocessor; the GPGPU field is born
- 2007: NVIDIA releases CUDA (Compute Unified Device Architecture); GPGPU shifts to GPU computing
- 2008: Khronos releases the OpenCL specification
CUDA Abstractions
A hierarchy of thread groups Shared memories Barrier synchronization
CUDA Terminology
- Host: typically the CPU; code written in ANSI C
- Device: typically the GPU (data-parallel); code written in extended ANSI C
- Host and device have separate memories
- A CUDA program contains both host and device code
CUDA Terminology
- Kernel: a data-parallel function
- Invoking a kernel creates lightweight threads on the device; threads are generated and scheduled with hardware
Does a kernel remind you of a shader in OpenGL?
CUDA Kernels
A kernel is executed N times in parallel by N different CUDA threads. A kernel is marked with a declaration specifier (__global__), launched with an execution configuration (<<<blocks, threads>>>), and distinguishes its threads by thread ID.
CUDA Program Execution
Image from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
Thread Hierarchies
- Grid: one or more thread blocks; 1D or 2D
- Block: an array of threads; 1D, 2D, or 3D; each block in a grid has the same number of threads; each thread in a block can synchronize and access shared memory
Thread Hierarchies
Block: 1D, 2D, or 3D. Example: index into a vector, matrix, or volume.
Thread Hierarchies
- Thread ID: scalar thread identifier; thread index: threadIdx
- 1D: thread ID == thread index
- 2D block of size (Dx, Dy): thread ID of index (x, y) == x + y Dx
- 3D block of size (Dx, Dy, Dz): thread ID of index (x, y, z) == x + y Dx + z Dx Dy
Thread Hierarchies
[Figure: a 2D index into one 2D thread block.]
Thread Hierarchies
Thread block: a group of threads. G80 and GT200: up to 512 threads; Fermi: up to 1024 threads. Threads in a block reside on the same processor core and share the memory of that core.
Thread Hierarchies
Block index: blockIdx; block dimension: blockDim (1D or 2D).
Thread Hierarchies
A 2D thread block: 16x16 threads per block.
Thread Hierarchies
Example: N = 32 with 16x16 threads per block (independent of N):
- threadIdx ranges over ([0, 15], [0, 15])
- 2x2 thread blocks in the grid, so blockIdx ranges over ([0, 1], [0, 1]) and blockDim = 16
- i = [0, 1] * 16 + [0, 15]
Thread Hierarchies
Thread blocks execute independently, in any order (in parallel or in series), scheduled in any order by any number of cores. This allows code to scale with core count.
Thread Hierarchies
Threads in a block share (limited) low-latency memory and synchronize execution to coordinate memory accesses. __syncthreads() is a barrier: threads in the block wait until all threads reach it. It is lightweight.
CUDA Memory Transfers
CUDA Memory Transfers
The host can transfer data to and from the device's global memory and constant memory.
CUDA Memory Transfers
- cudaMalloc(): allocates global memory on the device
- cudaFree(): frees it
CUDA Memory Transfers
Code from: http://courses.engr.illinois.edu/ece498/al/textbook/Chapter2-CudaProgrammingModel.pdf
cudaMalloc() takes two parameters: the address of a pointer, which it sets to the allocated device memory, and the allocation size in bytes.
CUDA Memory Transfers
cudaMemcpy() transfers memory: host to host, host to device, device to host, or device to device, through global memory on the device. Does this remind you of VBOs in OpenGL?
Note: plain cudaMemcpy() is synchronous with respect to the host; asynchronous transfers use cudaMemcpyAsync().
CUDA Memory Transfers
In the example code, a host-to-device copy passes the destination (a device pointer) first and the source (a host pointer) second, along with the size in bytes and the direction flag cudaMemcpyHostToDevice.
Matrix Multiply
P = M * N. Assume M and N are square for simplicity. Is this data-parallel?
Matrix Multiply
A 1,000 x 1,000 matrix product requires 1,000,000 dot products, each with 1,000 multiplies and 1,000 adds.
Matrix Multiply: CPU Implementation
Code from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture3%20cuda%20threads%20spring%202010.ppt
void MatrixMulOnHost(float* M, float* N, float* P, int width)
{
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < width; ++j) {
            float sum = 0;
            for (int k = 0; k < width; ++k) {
                float a = M[i * width + k];
                float b = N[k * width + j];
                sum += a * b;
            }
            P[i * width + j] = sum;
        }
}
Matrix Multiply: CUDA Skeleton
Matrix Multiply
Step 1: add CUDA memory transfers to the skeleton
Matrix Multiply: Data Transfer
The host code allocates the device input matrices, allocates the device output matrix, copies the inputs to the device, and later reads the result back from the device. Does this remind you of GPGPU with GLSL?
Matrix Multiply
Step 2: implement the kernel in CUDA C
Matrix Multiply: CUDA Kernel
- Accessing a matrix, so we use a 2D block
- Each thread computes one output element
- Where did the two outer for loops in the CPU implementation go?
- No locks or synchronization. Why?
Matrix Multiply
Step 3: invoke the kernel in CUDA C
Matrix Multiply: Invoke Kernel
The kernel is launched with one block of width by width threads.
Matrix Multiply
- One block of threads computes the matrix Pd; each thread computes one element of Pd
- Each thread loads a row of matrix Md, loads a column of matrix Nd, and performs one multiply and one addition for each pair of Md and Nd elements
- The compute to off-chip memory access ratio is close to 1:1 (not very high)
- The size of the matrix is limited by the number of threads allowed in a thread block
[Figure: thread (2, 2) of block 1 computing one element of Pd from a row of Md and a column of Nd (WIDTH x WIDTH matrices).]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Slide from: http://courses.engr.illinois.edu/ece498/al/lectures/lecture2%20cuda%20spring%2009.ppt
Matrix Multiply
What is the major performance problem with our implementation?
What is the major limitation?