[Harvard CS264] 05 - Advanced-level CUDA Programming
Transcript of [Harvard CS264] 05 - Advanced-level CUDA Programming
Lecture #5: Advanced CUDA | February 22nd, 2011
Nicolas Pinto (MIT, Harvard) [email protected]
Massively Parallel Computing (CS 264 / CSCI E-292)
Administrivia
• HW2: out, due Mon 3/14/11 (not Fri 3/11/11)
• Projects: think about it, consult the staff (*), proposals due ~ Fri 3/25/11
• Guest lectures:
• schedule coming soon
• on Fridays 7.35-9.35pm (March, April) ?
During this course, we’ll try to re-use existing material ;-)
(adapted for CS264)
Today!
Outline
1. Hardware Review
2. Memory/Communication Optimizations
3. Threading/Execution Optimizations
1. Hardware Review
8© NVIDIA Corporation 2008
10-Series Architecture
240 thread processors execute kernel threads
30 multiprocessors, each contains
8 thread processors
One double-precision unit
Shared memory enables thread cooperation
(figure: a multiprocessor with its thread processors, shared memory, and double-precision unit)
© 2008 NVIDIA Corporation.
Execution Model
Software ↔ Hardware
Thread ↔ Thread Processor: threads are executed by thread processors
Thread Block ↔ Multiprocessor: thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)
Grid ↔ Device: a kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time
Threading Hierarchy
10© NVIDIA Corporation 2008
Warps and Half Warps
(figure: a thread block on a multiprocessor is divided into 32-thread warps; each warp splits into two 16-thread half warps; device memory (DRAM) holds global and local memory)
A thread block consists of 32-thread warps
A warp is executed physically in parallel (SIMD) on a multiprocessor
A half-warp of 16 threads can coordinate global memory accesses into a single transaction
11© NVIDIA Corporation 2008
Memory Architecture
(figure: Host side — CPU, chipset, DRAM. Device side — DRAM holding local, global, constant, and texture memory; the GPU contains multiprocessors, each with registers and shared memory, plus constant and texture caches)
© 2008 NVIDIA Corporation.
Kernel Memory Access
• Per-thread: registers (on-chip) and local memory (off-chip, uncached)
• Per-block: shared memory (on-chip, small, fast)
• Per-device: global memory (off-chip, large, uncached; persistent across kernel launches; kernel I/O)
Global Memory
• Per-device: off-chip, large, uncached; persistent across kernel launches; kernel I/O
• Different types of “global memory”:
• Linear Memory
• Texture Memory
• Constant Memory
12© NVIDIA Corporation 2008
Memory Architecture
Memory Location Cached Access Scope Lifetime
Register On-chip N/A R/W One thread Thread
Local Off-chip No R/W One thread Thread
Shared On-chip N/A R/W All threads in a block Block
Global Off-chip No R/W All threads + host Application
Constant Off-chip Yes R All threads + host Application
Texture Off-chip Yes R All threads + host Application
2. Memory/Communication Optimizations
2.1 Host/Device Transfer Optimizations
Review
Thread blocks (recap):
• A thread block may have up to 512 threads
• All threads in a thread block run on the same multiprocessor
• Thus they can communicate via shared memory
• And synchronize
• Threads of a block are multiplexed onto a multiprocessor as warps

Review

PC Architecture (modified from Matthew Bolitho)
• The CPU connects to the Northbridge over the Front Side Bus (3+ Gb/s)
• The Northbridge connects to DRAM over the Memory Bus (8 GB/s), and to the graphics card / CUDA device over the PCI-Express bus (25+ GB/s); the GPU reaches its VRAM at 160+ GB/s
• The Southbridge handles SATA, Ethernet, and the legacy PCI bus

PCI-Express:
• Replaced AGP
• Point-to-point, full-duplex, symmetric serial bus; 250 MB/s of bandwidth in each direction per lane
• Bandwidth scales with multi-lane configurations, e.g. PCI-E 16x = 16 lanes, i.e. 16x the bandwidth (4 GB/s)

The CUDA specification has been updated:
• Version 1.0: initial release
• Version 1.1: update for newer hardware; backwards compatible
• Expected updates in the near future (versions 1.2 / 2.0): 64-bit floating point support (i.e. double)
• Version 1.1 added some important, useful features:
• Software: asynchronous memory copies, asynchronous GPU program launch
• Hardware: atomic memory instructions
Review
The PCI-“not-so”-e Bus
• PCIe bus is slow
• Try to minimize/group transfers
• Use pinned memory on host whenever possible
• Try to perform copies asynchronously (e.g. Streams)
• Use “Zero-Copy” when appropriate
• Examples in the SDK (e.g. bandwidthTest)
2.2 Device Memory Optimizations
Definitions
• gmem: global memory
• smem: shared memory
• tmem: texture memory
• cmem: constant memory
• bmem: binary code (cubin) memory ?!? (covered next week)
Performance Analysis: e.g. Matrix Transpose
22© NVIDIA Corporation 2008
Matrix Transpose
Transpose 2048x2048 matrix of floats
Performed out-of-place
Separate input and output matrices
Use tile of 32x32 elements, block of 32x8 threads
Each thread processes 4 matrix elements
In general tile and block size are fair game for optimization
Process
Get the right answer
Measure effective bandwidth (relative to theoretical or reference case)
Address global memory coalescing, shared memory bank conflicts, and partition camping while repeating above steps
23© NVIDIA Corporation 2008
Theoretical Bandwidth
Device Bandwidth of GTX 280
1107 * 10^6 * (512 / 8) * 2 / 1024^3 = 131.9 GB/s
Specs report 141 GB/s
Use 10^9 B/GB conversion rather than 1024^3
Whichever you use, be consistent
(theoretical bandwidth = memory clock (Hz) × memory interface width (bytes) × 2 for DDR)
24© NVIDIA Corporation 2008
Effective Bandwidth
Transpose Effective Bandwidth
2048^2 * 4 B/element / 1024^3 * 2 / (time in secs) = GB/s
Reference Case - Matrix Copy
Transpose operates on tiles, so we need a better comparison than raw device bandwidth
Look at effective bandwidth of a copy that uses tiles
(effective bandwidth = matrix size in bytes × 2 for read and write, divided by time)
25© NVIDIA Corporation 2008
Matrix Copy Kernel
__global__ void copy(float *odata, float *idata, int width, int height)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index  = xIndex + width*yIndex;

  for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
    odata[index+i*width] = idata[index+i*width];
  }
}

TILE_DIM = 32, BLOCK_ROWS = 8
32x32 tile, 32x8 thread block
idata and odata
in global memory
idata odata
Elements copied by a half-warp of threads
26© NVIDIA Corporation 2008
Matrix Copy Kernel Timing
Measure elapsed time over loop
Looping/timing done in two ways:
Over kernel launches (nreps = 1)
Includes launch/indexing overhead
Within the kernel over loads/stores (nreps > 1)
Amortizes launch/indexing overhead
__global__ void copy(float *odata, float *idata, int width, int height, int nreps)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index  = xIndex + width*yIndex;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index+i*width] = idata[index+i*width];
    }
  }
}
27© NVIDIA Corporation 2008
Naïve Transpose
Similar to copy
Input and output matrices have different indices
__global__ void transposeNaive(float *odata, float *idata, int width, int height, int nreps)
{
  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;

  int index_in  = xIndex + width * yIndex;
  int index_out = yIndex + height * xIndex;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index_out+i] = idata[index_in+i*width];
    }
  }
}
idata odata
28© NVIDIA Corporation 2008
Effective Bandwidth
Effective Bandwidth (GB/s)
2048x2048, GTX 280
                    Loop over kernel   Loop in kernel
Simple Copy              96.9               81.6
Naïve Transpose           2.2                2.2
gmem coalescing
Memory Coalescing
GPU memory controller granularity is 64 or 128 bytes
Must also be 64 or 128 byte aligned
Suppose thread loads a float (4 bytes)
Controller loads 64 bytes, throws 60 bytes away
Memory Coalescing
Memory controller actually more intelligent
Consider half-warp (16 threads)
Suppose each thread reads a consecutive float
Memory controller will perform one 64 byte load
This is known as coalescing
Make threads read consecutive locations
30© NVIDIA Corporation 2008
Coalescing
Global Memory
Half-warp of threads
} 64B aligned segment (16 floats)
Global memory access of 32, 64, or 128-bit words by a half-warp of threads can result in as few as one (or two)transaction(s) if certain access requirements are met
Depends on compute capability
1.0 and 1.1 have stricter access requirements
Examples – float (32-bit) data
}128B aligned segment (32 floats)
31© NVIDIA Corporation 2008
Coalescing: Compute capability 1.0 and 1.1
k-th thread must access k-th word in the segment (or k-th word in 2 contiguous 128B segments for 128-bit words); not all threads need to participate
Coalesces – 1 transaction
Out of sequence – 16 transactions Misaligned – 16 transactions
Memory Coalescing
GT200 has hardware coalescer
Inspects memory requests from each half-warp
Determines minimum set of transactions which are
64 or 128 bytes long
64 or 128 byte aligned
32© NVIDIA Corporation 2008
Coalescing: Compute capability 1.2 and higher
1 transaction - 64B segment
2 transactions - 64B and 32B segments 1 transaction - 128B segment
Coalescing is achieved for any pattern of addresses that fits into a segment of size: 32B for 8-bit words, 64B for 16-bit words, 128B for 32- and 64-bit words
Smaller transactions may be issued to avoid wasted bandwidth due to unused words
(e.g. GT200 like the C1060)
© NVIDIA Corporation 2010
Coalescing: Compute capability 2.0 (Fermi, Tesla C2050)
Memory transactions are handled per warp (32 threads)
L1 cache ON: always issues 128B segment transactions, cached in the 16KB or 48KB L1 cache per multiprocessor
2 transactions for 2 x 128B segments, but the next warp likely costs only 1 extra transaction, due to the L1 cache
L1 cache OFF: always issues 32B segment transactions
32 transactions for 32 x 32B segments, instead of 32 x 128B segments; e.g. an advantage for widely scattered thread accesses
Coalescing Summary
Coalescing dramatically speeds global memory access
Strive for perfect coalescing:
Align starting address (may require padding)
A warp should access within a contiguous region
33© NVIDIA Corporation 2008
Coalescing in Transpose
Naïve transpose coalesces reads, but not writes
idata odata
Elements transposed by a half-warp of threads
Q: How to coalesce writes ?
smem as a cache
Shared Memory
SMs can access gmem at 80+ GiB/sec
but have hundreds of cycles of latency
Each SM has 16 kiB ‘shared’ memory
Essentially user-managed cache
Speed comparable to registers
Accessible to all threads in a block
Reduces load/stores to device memory
34© NVIDIA Corporation 2008
Shared Memory
~Hundred times faster than global memory
Cache data to reduce global memory accesses
Threads can cooperate via shared memory
Use it to avoid non-coalesced access
Stage loads and stores in shared memory to re-order non-coalesceable addressing
© 2008 NVIDIA Corporation
A Common Programming Strategy
! Partition data into subsets that fit into shared memory
© 2008 NVIDIA Corporation
A Common Programming Strategy
! Handle each data subset with one thread block
© 2008 NVIDIA Corporation
A Common Programming Strategy
! Load the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism
© 2008 NVIDIA Corporation
A Common Programming Strategy
! Perform the computation on the subset from shared memory
© 2008 NVIDIA Corporation
A Common Programming Strategy
! Copy the result from shared memory back to global memory
35© NVIDIA Corporation 2008
Coalescing through shared memory
Access columns of a tile in shared memory to write contiguous data to global memory
Requires __syncthreads() since threads write data read by other threads
Elements transposed by a half-warp of threads
idata odata
tile
36
__global__ void transposeCoalesced(float *odata, float *idata, int width, int height, int nreps)
{
  __shared__ float tile[TILE_DIM][TILE_DIM];

  int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
  int index_in = xIndex + yIndex*width;

  xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
  yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
  int index_out = xIndex + yIndex*height;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }

    __syncthreads();

    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
  }
}
© NVIDIA Corporation 2008
Coalescing through shared memory
37© NVIDIA Corporation 2008
Effective Bandwidth
Effective Bandwidth (GB/s)
2048x2048, GTX 280
                              Loop over kernel   Loop in kernel
Simple Copy                        96.9              81.6
Shared Memory Copy                 80.9              81.1
Naïve Transpose                     2.2               2.2
Coalesced Transpose                16.5              17.1

The coalesced transpose uses a shared memory tile and __syncthreads()
smem bank conflicts
39© NVIDIA Corporation 2008
Shared Memory Architecture
Many threads accessing memory, therefore memory is divided into banks
Successive 32-bit words are assigned to successive banks
Each bank can service one address per cycle; a memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
Conflicting accesses are serialized
Shared Memory Banks
Shared memory divided into 16 ‘banks’
Shared memory is (almost) as fast as registers (...)
Exception is in case of bank conflicts
(figure: successive 4-byte words 0–31 are assigned round-robin to banks 0–15, so word k and word k+16 fall in the same bank)
40© NVIDIA Corporation 2008
Bank Addressing Examples
No Bank Conflicts
Linear addressing, stride == 1
No Bank Conflicts
Random 1:1 Permutation
(figure: in both cases, each of threads 0–15 maps to a distinct bank)
41© NVIDIA Corporation 2008
Bank Addressing Examples
2-way Bank Conflicts: linear addressing, stride == 2
8-way Bank Conflicts: linear addressing, stride == 8
(figure: with stride 2, two threads land on each even bank; with stride 8, eight threads land on each of banks 0 and 8)
42© NVIDIA Corporation 2008
Shared memory bank conflicts
Shared memory is ~ as fast as registers if there are no bank conflicts
warp_serialize profiler signal reflects conflicts
The fast case:
If all threads of a half-warp access different banks, there is no bank conflict
If all threads of a half-warp read the identical address, there is no bank conflict (broadcast)
The slow case:
Bank Conflict: multiple threads in the same half-warp access the same bank
Must serialize the accesses
Cost = max # of simultaneous accesses to a single bank
43© NVIDIA Corporation 2008
Bank Conflicts in Transpose
32x32 shared memory tile of floats
Data in columns k and k+16 are in same bank
16-way bank conflict reading half columns in tile
Solution: pad the shared memory array
__shared__ float tile[TILE_DIM][TILE_DIM+1];
Data in anti-diagonals are in same bank
idata odata
tile
Q: How to avoid bank conflicts ?
Illustration (NVIDIA 2010): Shared Memory Bank Conflicts
• 32x32 SMEM array
• Warp accesses a column: 32-way bank conflict (threads in a warp access the same bank)

Illustration: Avoiding Bank Conflicts
• Add a column for padding: 32x33 SMEM array
• Warp accesses a column: 32 different banks, no bank conflicts
44© NVIDIA Corporation 2008
Effective Bandwidth
Effective Bandwidth (GB/s)
2048x2048, GTX 280
                              Loop over kernel   Loop in kernel
Simple Copy                        96.9              81.6
Shared Memory Copy                 80.9              81.1
Naïve Transpose                     2.2               2.2
Coalesced Transpose                16.5              17.1
Bank Conflict Free Transpose       16.6              17.2
Need a pause?
Unrelated: Thatcher Illusion
gmem partition camping
46© NVIDIA Corporation 2008
Partition Camping
Global memory accesses go through partitions
6 partitions on 8-series GPUs, 8 partitions on 10-series GPUs
Successive 256-byte regions of global memory are assigned to successive partitions
For best performance:
Simultaneous global memory accesses GPU-wide should be distributed evenly amongst partitions
Partition Camping occurs when global memory accesses at an instant use a subset of partitions
Directly analogous to shared memory bank conflicts, but on a larger scale
47© NVIDIA Corporation 2008
Partition Camping in Transpose
(figure: tile indices laid out row-major in idata and column-major in odata; colors = partitions)
blockId = gridDim.x * blockIdx.y + blockIdx.x
Partition width = 256 bytes = 64 floats
Twice width of tile
On GTX280 (8 partitions), data 2KB apart map to same partition
2048 floats divide evenly by 2KB => columns of matrices map to same partition
48© NVIDIA Corporation 2008
Partition Camping Solutions
blockId = gridDim.x * blockIdx.y + blockIdx.x
Pad matrices (by two tiles)
In general might be expensive/prohibitive memory-wise
Diagonally reorder blocks
Interpret blockIdx.y as different diagonal slices and blockIdx.x as distance along a diagonal
(figure: with diagonal block ordering, successive blocks fall into different partitions in both idata and odata)
49
__global__ void transposeDiagonal(float *odata, float *idata, int width, int height, int nreps)
{
  __shared__ float tile[TILE_DIM][TILE_DIM+1];

  int blockIdx_y = blockIdx.x;
  int blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;

  int xIndex = blockIdx_x * TILE_DIM + threadIdx.x;
  int yIndex = blockIdx_y * TILE_DIM + threadIdx.y;
  int index_in = xIndex + yIndex*width;

  xIndex = blockIdx_y * TILE_DIM + threadIdx.x;
  yIndex = blockIdx_x * TILE_DIM + threadIdx.y;
  int index_out = xIndex + yIndex*height;

  for (int r = 0; r < nreps; r++) {
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      tile[threadIdx.y+i][threadIdx.x] = idata[index_in+i*width];
    }
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS) {
      odata[index_out+i*height] = tile[threadIdx.x][threadIdx.y+i];
    }
  }
}
© NVIDIA Corporation 2008
Diagonal Transpose
Add lines to map diagonal to Cartesian coordinates
Replace blockIdx.x with blockIdx_x, and blockIdx.y with blockIdx_y
50
if (width == height) {
  blockIdx_y = blockIdx.x;
  blockIdx_x = (blockIdx.x+blockIdx.y)%gridDim.x;
} else {
  int bid = blockIdx.x + gridDim.x*blockIdx.y;
  blockIdx_y = bid%gridDim.y;
  blockIdx_x = ((bid/gridDim.y)+blockIdx_y)%gridDim.x;
}
© NVIDIA Corporation 2008
Diagonal Transpose
Previous slide for square matrices (width == height)
More generally:
51© NVIDIA Corporation 2008
Effective Bandwidth
Effective Bandwidth (GB/s)
2048x2048, GTX 280
                              Loop over kernel   Loop in kernel
Simple Copy                        96.9              81.6
Shared Memory Copy                 80.9              81.1
Naïve Transpose                     2.2               2.2
Coalesced Transpose                16.5              17.1
Bank Conflict Free Transpose       16.6              17.2
Diagonal Transpose                 69.5              78.3
52© NVIDIA Corporation 2008
Order of Optimizations
Larger optimization issues can mask smaller ones
Proper order of some optimization techniques is not known a priori
E.g. partition camping is problem-size dependent
Don’t dismiss an optimization technique as ineffective until you know it was applied at the right time
(diagram: Naïve 2.2 GB/s → Coalescing 16.5 GB/s → Bank Conflicts 16.6 GB/s → Partition Camping 69.5 GB/s; in the other order, Coalescing → Partition Camping 48.8 GB/s → Bank Conflicts 69.5 GB/s)
53© NVIDIA Corporation 2008
Transpose Summary
Coalescing and shared memory bank conflicts are small-scale phenomena
Deal with memory access within a half-warp
Problem-size independent
Partition camping is a large-scale phenomenon
Deals with simultaneous memory accesses by warps on different multiprocessors
Problem-size dependent
Wouldn’t see it in a (2048+32)^2 matrix
Coalescing is generally the most critical
SDK Transpose Example: http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html
tmem
55© NVIDIA Corporation 2008
Textures in CUDA
Texture is an object for reading data
Benefits:
• Data is cached (optimized for 2D locality)
• Helpful when coalescing is a problem
• Filtering: linear / bilinear / trilinear
• Dedicated hardware
• Wrap modes (for “out-of-bounds” addresses): clamp to edge / repeat
• Addressable in 1D, 2D, or 3D, using integer or normalized coordinates
Usage:
• CPU code binds data to a texture object
• Kernel reads data by calling a fetch function
Other goodies:
• Optional “format conversion”
• {char, short, int, half (16-bit)} to float (32-bit)
• “for free”
• useful for *mem compression (see later)
56© NVIDIA Corporation 2008
Texture Addressing
Wrap: out-of-bounds coordinate is wrapped (modulo arithmetic)
Clamp: out-of-bounds coordinate is replaced with the closest boundary
(figure: fetching (5.5, 1.5) from a 5x4 texture returns the wrapped texel under wrap mode, and the nearest edge texel under clamp mode; (2.5, 0.5) is an in-bounds fetch)
57© NVIDIA Corporation 2008
Two CUDA Texture Types
Bound to linear memory:
• Global memory address is bound to a texture
• Only 1D; integer addressing; no filtering, no addressing modes
Bound to CUDA arrays:
• CUDA array is bound to a texture
• 1D, 2D, or 3D; float addressing (size-based or normalized); filtering; addressing modes (clamping, repeat)
Both:
• Return either element type or normalized float
58© NVIDIA Corporation 2008
CUDA Texturing Steps
Host (CPU) code:
• Allocate/obtain memory (global linear, or CUDA array)
• Create a texture reference object
• Currently must be at file-scope
• Bind the texture reference to memory/array
• When done: unbind the texture reference, free resources
Device (kernel) code:
• Fetch using texture reference
• Linear memory textures: tex1Dfetch()
• Array textures: tex1D(), tex2D(), or tex3D()
cmem
Constant Memory (NVIDIA 2010)
• Ideal for coefficients and other data that is read uniformly by warps
• Data is stored in global memory, read through a constant-cache
• __constant__ qualifier in declarations
• Can only be read by GPU kernels
• Limited to 64KB
• Fermi adds uniform accesses:
• Kernel pointer argument qualified with const:
__global__ void kernel( const float *g_a ) ...
• Compiler must determine that all threads in a threadblock will dereference the same address
• No limit on array size; can use any global memory pointer
• Constant cache throughput:
• 32 bits per warp per 2 clocks per multiprocessor
• To be used when all threads in a warp read the same address
• Serializes otherwise

...
float x = g_a[15];             // uniform
float y = g_a[blockIdx.x+5];   // uniform
float z = g_a[threadIdx.x];    // non-uniform
...

Constant Memory (example)
• Kernel executes 10K threads (320 warps) per SM during its lifetime
• All threads access the same 4B word
• Using GMEM:
• Each warp fetches 32B, i.e. 10KB of bus traffic
• Caching loads potentially worse: a 128B line is very likely to be evicted multiple times
• Using constant/uniform accesses:
• First warp fetches 64 bytes
• All others hit in the constant cache, i.e. 64 bytes of total bus traffic
• Unlikely to be evicted over kernel lifetime: other loads do not go through this cache
(figure: addresses from a warp, all falling on the same word)
*mem compression
Optimizing with Compression
• When all else has been optimized and the kernel is limited by the number of bytes needed: consider compression
• Approaches:
• Int: conversion between 8-bit, 16-bit, 32-bit integers is 1 instruction (64-bit requires a couple)
• FP: conversion between fp16, fp32, fp64 is one instruction
• fp16 (half) is storage only, no math instructions
• Range-based:
• Lower and upper limits are kernel arguments
• Data is an index for interpolation
• Application in practice:
• Clark et al., “Solving Lattice QCD systems of equations using mixed precision solvers on GPUs”
• http://arxiv.org/abs/0911.3191
Accelerating GPU computation through
mixed-precision methods
Michael Clark
Harvard-Smithsonian Center for Astrophysics
Harvard University
SC’10
caching ... too much?
bank conflicts, coalescing, partition camping, clamping, mixed precision, broadcasting, streams, zero-copy

Parallel Programming is Hard
(but you’ll pick it up)
(you are not alone)
3. Threading/Execution Optimizations
3.1 Execution Configuration Optimizations
60© NVIDIA Corporation 2008
Occupancy
Thread instructions are executed sequentially, so executing other warps is the only way to hide latencies and keep the hardware busy
Occupancy = number of warps running concurrently on a multiprocessor divided by maximum number of warps that can run concurrently
Limited by resource usage:
Registers
Shared memory
61© NVIDIA Corporation 2008
Grid/Block Size Heuristics
# of blocks > # of multiprocessors, so all multiprocessors have at least one block to execute
# of blocks / # of multiprocessors > 2, so multiple blocks can run concurrently on a multiprocessor
Blocks that aren’t waiting at a __syncthreads() keep the hardware busy
Subject to resource availability: registers, shared memory
# of blocks > 100 to scale to future devices; blocks are executed in pipeline fashion
1000 blocks per grid will scale across multiple generations
62© NVIDIA Corporation 2008
Register Dependency
Read-after-write register dependency: an instruction’s result can be read ~24 cycles later
Scenarios (CUDA → PTX):

x = y + 5;       →  add.f32 $f3, $f1, $f2
z = x + 3;       →  add.f32 $f5, $f3, $f4

s_data[0] += 3;  →  ld.shared.f32 $f3, [$r31+0]
                    add.f32 $f3, $f3, $f4

To completely hide the latency: run at least 192 threads (6 warps) per multiprocessor
At least 25% occupancy (1.0/1.1), 18.75% (1.2/1.3)
Threads do not have to belong to the same thread block
63© NVIDIA Corporation 2008
Register Pressure
Hide latency by using more threads per SM
Limiting factors:
Number of registers per kernel
8K/16K per SM, partitioned among concurrent threads
Amount of shared memory
16KB per SM, partitioned among concurrent thread blocks
Compile with the --ptxas-options=-v flag
Use the -maxrregcount=N flag to NVCC
N = desired maximum registers / kernel
At some point “spilling” into local memory may occur
Reduces performance – local memory is slow
64© NVIDIA Corporation 2008
Occupancy Calculator
65© NVIDIA Corporation 2008
Optimizing threads per block
Choose threads per block as a multiple of warp size; avoid wasting computation on under-populated warps
Want to run as many warps as possible per multiprocessor (hide latency)
Multiprocessor can run up to 8 blocks at a time
Heuristics:
Minimum: 64 threads per block, and only if there are multiple concurrent blocks
192 or 256 threads are a better choice
Usually still enough registers to compile and invoke successfully
This all depends on your computation, so experiment!
66© NVIDIA Corporation 2008
Occupancy != Performance
Increasing occupancy does not necessarily increase performance
BUT …
Low-occupancy multiprocessors cannot adequately hide latency on memory-bound kernels
(It all comes down to arithmetic intensity and available parallelism)
“Better Performance at Lower Occupancy”
Vasily Volkov (UC Berkeley)
September 22, 2010 (GTC’10)
3.2 Instruction Optimizations
69© NVIDIA Corporation 2008
CUDA Instruction Performance
Instruction cycles (per warp) = sum of:
Operand read cycles
Instruction execution cycles
Result update cycles
Therefore instruction throughput depends on:
Nominal instruction throughput
Memory latency
Memory bandwidth
“Cycle” refers to the multiprocessor clock rate (1.3 GHz on the Tesla C1060, for example)
70© NVIDIA Corporation 2008
Maximizing Instruction Throughput
Maximize use of high-bandwidth memory:
Maximize use of shared memory
Minimize accesses to global memory
Maximize coalescing of global memory accesses
Optimize performance by overlapping memory accesses with HW computation:
High arithmetic intensity programs (i.e. high ratio of math to memory transactions)
Many concurrent threads
71© NVIDIA Corporation 2008
Arithmetic Instruction Throughput
int and float add, shift, min, max and float mul, mad: 4 cycles per warp
int multiply (*) is by default 32-bit and requires multiple cycles / warp
Use the __mul24() / __umul24() intrinsics for 4-cycle 24-bit int multiply
Integer divide and modulo are more expensive
Compiler will convert literal power-of-2 divides to shifts, but we have seen it miss some cases
Be explicit in cases where the compiler can’t tell that the divisor is a power of 2!
Useful trick: foo % n == foo & (n-1) if n is a power of 2
72© NVIDIA Corporation 2008
Runtime Math Library
There are two types of runtime math operations in single-precision:
__funcf(): direct mapping to hardware ISA
Fast but lower accuracy (see programming guide for details)
Examples: __sinf(x), __expf(x), __powf(x,y)
funcf(): compiles to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sinf(x), expf(x), powf(x,y)
The -use_fast_math compiler option forces every funcf() to compile to __funcf()
73© NVIDIA Corporation 2008
GPU results may not match CPU
Many variables: hardware, compiler, optimization settings
CPU operations aren’t strictly limited to 0.5 ulp; sequences of operations can be more accurate due to 80-bit extended-precision ALUs
Floating-point arithmetic is not associative!
74© NVIDIA Corporation 2008
FP Math is Not Associative!
In symbolic math, (x+y)+z == x+(y+z)
This is not necessarily true for floating-point addition
Try x = 10^30, y = -10^30 and z = 1 in the above equation
When you parallelize computations, you potentially change the order of operations
Parallel results may not exactly match sequential results
This is not specific to GPU or CUDA; it is an inherent part of parallel execution
75© NVIDIA Corporation 2008
Control Flow Instructions
Main performance concern with branching is divergence
Threads within a single warp take different paths
Different execution paths must be serialized
Avoid divergence when branch condition is a function of thread ID
Example with divergence:
if (threadIdx.x > 2) { }
Branch granularity < warp size
Example without divergence:
if (threadIdx.x / WARP_SIZE > 2) { }
Branch granularity is a whole multiple of warp size
Scared ?
Howwwwww?! (do I start)
Scared ?
Profiler
Analysis with Profiler
Profiler counters:
instructions_issued / instructions_executed
Both incremented by 1 per warp
“issued” includes replays; “executed” does not
gld_request / gst_request
Incremented by 1 per warp for each load/store instruction
© NVIDIA 2010
Instruction may be counted if it is “predicated out”
l1_global_load_miss, l1_global_load_hit, global_store_transaction
Incremented by 1 per L1 line (a line is 128B)
uncached_global_load_transaction
Incremented by 1 per group of 1, 2, 3, or 4 transactions
Compare:
32 * instructions_issued (32 = warp size)
128B * (global_store_transaction + l1_global_load_miss)
© NVIDIA Corporation 2010
CUDA Visual Profiler data for memory transfers
Memory transfer type and direction (D=Device, H=Host, A=cuArray)
e.g. H to D: Host to Device
Synchronous / Asynchronous
Memory transfer size, in bytes
Stream ID
© NVIDIA Corporation 2010
CUDA Visual Profiler data for kernels
© NVIDIA Corporation 2010
CUDA Visual Profiler computed data for kernels
Instruction throughput: Ratio of achieved instruction rate to peak single issue instruction rate
Global memory read throughput (Gigabytes/second)
Global memory write throughput (Gigabytes/second)
Overall global memory access throughput (Gigabytes/second)
Global memory load efficiency
Global memory store efficiency
© NVIDIA Corporation 2010
CUDA Visual Profiler data analysis views
Views:
Summary table
Kernel table
Memcopy table
Summary plot
GPU Time Height plot
GPU Time Width plot
Profiler counter plot
Profiler table column plot
Multi-device plot
Multi-stream plot
Analyze profiler counters
Analyze kernel occupancy
© NVIDIA Corporation 2010
CUDA Visual Profiler Misc.
Multiple sessions
Compare views for different sessions
Comparison Summary plot
Profiler projects save & load
Import/Export profiler data (.CSV format)
meh!!!! I don’t like to profile
Scared ?
Modified source code
Analysis with Modified Source Code
Time memory-only and math-only versions of the kernel
Easier for codes that don’t have data-dependent control-flow or addressing
Gives you good estimates for:
Time spent accessing memory
Time spent executing instructions
© NVIDIA 2010
Comparing the times for modified kernels
Helps decide whether the kernel is mem or math bound
Shows how well memory operations are overlapped with arithmetic
Compare the sum of mem-only and math-only times to full-kernel time
I want to believe...
Scared ?
[Figure: timelines comparing mem-only, math-only, and full-kernel times for four scenarios]
© NVIDIA 2010
Memory and latency bound:
Poor mem-math overlap: latency is a problem
Math-bound:
Good mem-math overlap: latency not a problem
(assuming instruction throughput is not low compared to HW theory)
Memory-bound:
Good mem-math overlap: latency not a problem
(assuming memory throughput is not low compared to HW theory)
Balanced:
Good mem-math overlap: latency not a problem
(assuming memory/instruction throughput is not low compared to HW theory)
Memory bound ?
Math bound ?
Latency bound ?
Argn&%#$... too many optimizations !!!
67© NVIDIA Corporation 2008
Parameterize Your Application
Parameterization helps adaptation to different GPUs
GPUs vary in many ways:
# of multiprocessors
Memory bandwidth
Shared memory size
Register file size
Max. threads per block
You can even make apps self-tuning (like FFTW and ATLAS)
“Experiment” mode discovers and saves optimal configuration
More?
• Next week: GPU “Scripting”, Meta-programming, Auto-tuning
• Tue 3/29/11: Algorithm Strategies (W. Hwu, UIUC)
• Thu 3/31/11: PyOpenCL (A. Klöckner, NYU), ahh (C. Omar, CMU)
• Tue 4/5/11: Analysis-driven Optimization (C. Woolley, NVIDIA)
• Thu 4/7/11: Irregular Parallelism & Efficient Data Structures (J. Owens, UC Davis)
• Thu 4/14/11: Optimization for Ninjas (D. Merrill, UVirginia)
• ...
iPhD: one more thing, or two...
Life/Code Hacking #2.x: Speed {listen,read,writ}ing
accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
Life/Code Hacking #2.2: Speed writing
accelerated e-learning (c) / massively parallel {learn,programm}ing (c)
http://steve-yegge.blogspot.com/2008/09/programmings-dirtiest-little-secret.html
Typing tutors: gtypist, ktouch, typingweb.com, etc.
Kinesis Advantage (QWERTY/DVORAK)
Demo
COME