CUDA Lecture 11 Performance Considerations
Transcript of CUDA Lecture 11 Performance Considerations
Prepared 10/11/2011 by T. O’Neil for 3460:677, Fall 2011, The University of Akron.
CUDA Lecture 11: Performance Considerations
- Always measure where your time is going!
- Even if you think you know where it is going
- Start coarse, go fine-grained as need be
- Keep in mind Amdahl's Law when optimizing any part of your code
- Don't continue to optimize once a part is only a small fraction of overall execution time (see the formula below)
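As a reminder (not spelled out on the slide), Amdahl's Law bounds the overall speedup when only a fraction $p$ of the execution time is accelerated by a factor $s$:

$$\text{speedup} = \frac{1}{(1-p) + p/s}$$

So if a kernel accounts for only 10% of total runtime, even an infinite speedup of that kernel yields at most a 1.11x overall improvement.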
Performance Considerations – Slide 2
Preliminaries
Performance consideration issues:
- Memory coalescing
- Shared memory bank conflicts
- Control-flow divergence
- Occupancy
- Kernel launch overheads
Performance Considerations – Slide 3
Outline
- Off-chip memory is accessed in chunks
- Even if you read only a single word
- If you don't use the whole chunk, bandwidth is wasted
- Chunks are aligned to multiples of 32/64/128 bytes
- Unaligned accesses will cost more
- When accessing global memory, peak performance utilization occurs when all threads in a half warp access contiguous memory locations
Performance Considerations – Slide 4
Performance Topic A: Memory Coalescing
Performance Considerations – Slide 5
Memory Layout of a Matrix in C
[Figure: a 4x4 matrix M and its linearized layout in memory: M0,0 M1,0 M2,0 M3,0, M0,1 M1,1 M2,1 M3,1, M0,2 M1,2 M2,2 M3,2, M0,3 M1,3 M2,3 M3,3]
Performance Considerations – Slide 6
Memory Layout of a Matrix in C
[Figure: the same linearized matrix with the access direction in kernel code overlaid. Threads T1-T4 are shown in time period 1 and time period 2; when each thread walks along its own row, the addresses touched together in any one time period are far apart in memory, so the accesses do not coalesce.]
Performance Considerations – Slide 7
Memory Layout of a Matrix in C
[Figure: the linearized matrix again. When the access direction in kernel code runs down the columns, threads T1-T4 touch adjacent memory locations in each time period, so the accesses coalesce.]
Performance Considerations – Slide 8
Memory Layout of a Matrix in C
[Figure: matrices Md and Nd, each of width WIDTH, as accessed by Thread 1 and Thread 2: the row-wise walk through Md is not coalesced, while the column-wise walk through Nd is coalesced.]
Performance Considerations – Slide 9
Use Shared Memory to Improve Coalescing
[Figure: original access pattern versus tiled access pattern for Md and Nd. In the tiled version, threads first copy a tile into scratchpad (shared) memory, then perform the multiplication with the scratchpad values.]
- Threads 0-15 access 4-byte words at addresses 116-176
- Thread 0 is lowest active, accesses address 116
- 128-byte segment: 0-127
Performance Considerations – Slide 10
Second Example
[Figure: address line 0-288 in 32-byte increments; the 128B segment 0-127 is highlighted, with threads t0, t1, t2, t3, ..., t15 spanning addresses 116-176.]
- Threads 0-15 access 4-byte words at addresses 116-176
- Thread 0 is lowest active, accesses address 116
- 128-byte segment: 0-127 (reduce to 64B)
Performance Considerations – Slide 11
Second Example (cont.)
[Figure: the same address line, now with the 64B segment 64-127 highlighted.]
- Threads 0-15 access 4-byte words at addresses 116-176
- Thread 0 is lowest active, accesses address 116
- 128-byte segment: 0-127 (reduce to 32B)
Performance Considerations – Slide 12
Second Example (cont.)
[Figure: the same address line, now with the 32B segment 96-127 highlighted.]
- Threads 0-15 access 4-byte words at addresses 116-176
- Thread 3 is lowest active, accesses address 128
- 128-byte segment: 128-255
Performance Considerations – Slide 13
Second Example (cont.)
[Figure: the address line with the 128B segment 128-255 highlighted for the remaining threads.]
- Threads 0-15 access 4-byte words at addresses 116-176
- Thread 3 is lowest active, accesses address 128
- 128-byte segment: 128-255 (reduce to 64B)
Performance Considerations – Slide 14
Second Example (cont.)
[Figure: the address line with the 64B segment 128-191 highlighted, covering the remaining threads through t15 at address 176.]
Performance Considerations – Slide 15
Consider the stride of your accesses
__global__ void foo(int* input, float3* input2)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Stride 1
    int a = input[i];
    // Stride 2, half the bandwidth is wasted
    int b = input[2 * i];
    // Stride 3, 2/3 of the bandwidth wasted
    float c = input2[i].x;
}
Performance Considerations – Slide 16
Example: Array of Structures (AoS)
struct record {
    int key;
    int value;
    int flag;
};

record *d_records;
cudaMalloc((void**)&d_records, ...);
Performance Considerations – Slide 17
Example: Structure of Arrays (SoA)
struct SoA {
    int *keys;
    int *values;
    int *flags;
};

SoA d_SoA_data;
cudaMalloc((void**)&d_SoA_data.keys, ...);
cudaMalloc((void**)&d_SoA_data.values, ...);
cudaMalloc((void**)&d_SoA_data.flags, ...);
Performance Considerations – Slide 18
Example: SoA versus AoS

__global__ void bar(record *AoS_data, SoA SoA_data)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // AoS wastes bandwidth
    int key = AoS_data[i].key;
    // SoA efficient use of bandwidth
    int key_better = SoA_data.keys[i];
}
- Structure of arrays is often better than array of structures
- Very clear win on regular, stride 1 access patterns
- Unpredictable or irregular access patterns are case-by-case
Performance Considerations – Slide 19
Example: SoA versus AoS (cont.)
- As seen, each SM has 16 KB of shared memory
- 16 banks of 32-bit words (Tesla)
- CUDA uses shared memory as shared storage visible to all threads in a thread block
- Read and write access
- Not used explicitly for pixel shader programs; we dislike pixels talking to each other
Performance Considerations – Slide 20
Performance Topic B: Shared Memory Bank Conflicts
[Figure: streaming multiprocessor block diagram: L1 instruction cache (I$), multithreaded instruction buffer, register file (RF), L1 constant cache (C$), shared memory, operand select, and MAD/SFU execution units.]
- So shared memory is banked
- Only matters for threads within a warp
- Full performance with some restrictions
  - Threads can each access different banks
  - Or can all access the same value
- Consecutive words are in different banks
- If two or more threads access the same bank but different values, you get bank conflicts
Performance Considerations – Slide 21
Shared Memory
- In a parallel machine, many threads access memory
- Therefore, memory is divided into banks
- Essential to achieve high bandwidth
- Each bank can service one address per cycle
- A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
- Conflicting accesses are serialized
Performance Considerations – Slide 22
Details: Parallel Memory Architecture
[Figure: two conflict-free bank addressing patterns across banks 0-15: linear addressing with stride == 1 (thread i maps to bank i) and a random 1:1 permutation of threads to banks. No bank conflicts in either case.]
Performance Considerations – Slide 23
Bank Addressing Examples
[Figure: two conflicting patterns across threads 0-15 and banks 0-15: linear addressing with stride == 2 gives two-way bank conflicts; linear addressing with stride == 8 gives eight-way bank conflicts.]
Performance Considerations – Slide 24
Bank Addressing Examples (cont.)
[Figure: threads 0-15 mapped onto banks 0-15, with the stride-8 case concentrating eight threads (x8) on bank 0 and eight (x8) on bank 8.]
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks, so bank = address % 16
- Same as the size of a half-warp
- No bank conflicts between different half-warps, only within a single half-warp
Performance Considerations – Slide 25
How addresses map to banks on G80
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads in the same half-warp access the same bank
  - Must serialize the accesses
  - Cost = max # of simultaneous accesses to a single bank
Performance Considerations – Slide 26
Shared memory bank conflicts
- Change all shared memory reads to the same value
- All broadcasts = no conflicts
- Will show how much performance could be improved by eliminating bank conflicts
- The same doesn't work for shared memory writes
- So, replace shared memory array indices with threadIdx.x (a sketch follows below)
- Can also be done to the reads
Performance Considerations – Slide 27
Trick to Assess Impact On Performance
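A minimal sketch of this probe, assuming a hypothetical kernel whose shared-memory index pattern (perm, here) may cause conflicts; the names are illustrative, not from the slides:

__shared__ float s_data[256];
// Original version: the index pattern may hit the same bank repeatedly.
// float v = s_data[perm[threadIdx.x]];
// s_data[perm[threadIdx.x]] = 2.0f * v;
// Probe version: reads all broadcast one word, writes go to consecutive
// banks, so it is conflict-free. The speedup of the probe over the
// original bounds what eliminating bank conflicts could buy you.
float v = s_data[0];
s_data[threadIdx.x] = 2.0f * v;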
Given the linear addressing code shown below:
- This is only bank-conflict-free if s shares no common factors with the number of banks
- There are 16 banks on G80, so s must be odd
Performance Considerations – Slide 28
Linear Addressing
__shared__ float shared[256];
float foo = shared[baseIndex + s * threadIdx.x];
Performance Considerations – Slide 29
Linear Addressing Examples
[Figure: linear addressing examples for s=1 and s=3: with 16 banks, both strides map threads 0-15 to 16 distinct banks, since gcd(s, 16) = 1.]
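A common remedy worth noting here (an addition, not on the slides): when an even stride is unavoidable, padding the array changes the effective stride. A minimal sketch for a 16x16 shared tile accessed by column:

__shared__ float tile[16][16 + 1];  // pad each row by one word
// Without the +1 pad, tile[threadIdx.x][0] maps to bank (16*threadIdx.x) % 16 = 0
// for every thread: a 16-way conflict. With the pad the row stride is 17,
// which shares no factor with 16, so column accesses hit 16 distinct banks.
float v = tile[threadIdx.x][0];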
- texture and __constant__
- Read-only
- Data resides in global memory
- Different read path: includes specialized caches
Performance Considerations – Slide 30
Additional “memories”
- Data stored in global memory, read through a constant-cache path
- __constant__ qualifier in declarations
- Can only be read by GPU kernels
- Limited to 64KB
- To be used when all threads in a warp read the same address
- Serializes otherwise
- Throughput: 32 bits per warp per clock per multiprocessor
Performance Considerations – Slide 31
Constant Memory
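A minimal sketch of the declaration and host-side upload; the names are illustrative, but cudaMemcpyToSymbol is the standard API for filling __constant__ memory:

__constant__ float d_coeffs[16];  // lives in constant memory (64KB total limit)

__global__ void scale(float *data)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    // Every thread in the warp reads the same address: one broadcast
    // through the constant cache rather than 32 separate loads.
    data[i] *= d_coeffs[0];
}

// Host side:
// float h_coeffs[16] = { ... };
// cudaMemcpyToSymbol(d_coeffs, h_coeffs, sizeof(h_coeffs));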
- Immediate address constants
- Indexed address constants
- Constants stored in DRAM, and cached on chip
  - L1 per SM
- A constant value can be broadcast to all threads in a warp
- Extremely efficient way of accessing a value that is common for all threads in a block!
Performance Considerations – Slide 32
Constants

[Figure: the SM block diagram again, highlighting the constant cache (C$ L1) alongside the instruction cache, register file, shared memory, operand select, and MAD/SFU units.]
Objectives:
- To understand the implications of control flow on
  - Branch divergence overhead
  - SM execution resource utilization
- To learn better ways to write code with control flow
- To understand compiler/HW predication designed to reduce the impact of control flow
  - There is a cost involved
Performance Considerations – Slide 33
Performance Topic C: Control Flow Divergence
- Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads); the unit of parallelism in CUDA
- Warp: a group of threads executed physically in parallel in G80
- Block: a group of threads that are executed together and form the unit of resource assignment
- Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect
Performance Considerations – Slide 34
Quick terminology review
- Thread blocks are partitioned into warps, with instructions issued per 32 threads (warp)
- Thread IDs within a warp are consecutive and increasing
- Warp 0 starts with Thread ID 0
- Partitioning is always the same
- Thus you can use this knowledge in control flow
- The exact size of warps may change from generation to generation
Performance Considerations – Slide 35
How thread blocks are partitioned
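As a small illustration (not on the slide) of exploiting the fixed partitioning, assuming the G80 warp size of 32:

// Deterministic partitioning means each thread can compute its warp and lane:
int warpId = threadIdx.x / 32;  // warp 0 = threads 0-31, warp 1 = threads 32-63, ...
int lane   = threadIdx.x % 32;  // position within the warp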
- However, DO NOT rely on any ordering between warps
- If there are any dependencies between threads, you must __syncthreads() to get correct results
Performance Considerations – Slide 36
How thread blocks are partitioned (cont.)
- Main performance concern with branching is divergence
- Threads within a single warp take different paths: if-else, ...
- Different execution paths within a warp are serialized in G80
- The control paths taken by the threads in a warp are traversed one at a time until there are none left
- Different warps can execute different code with no impact on performance
Performance Considerations – Slide 37
Control Flow Instructions
- A common case: avoid diverging within a warp, i.e. when branch condition is a function of thread ID
- Example with divergence (below):
- This creates two different control paths for threads in a block
- Branch granularity < warp size; threads 0, 1 and 2 follow a different path than the rest of the threads in the first warp
Performance Considerations – Slide 38
Control Flow Divergence (cont.)
if (threadIdx.x > 2) { ... }
else { ... }
- A common case: avoid diverging within a warp, i.e. when branch condition is a function of thread ID
- Example without divergence (below):
- Also creates two different control paths for threads in a block
- Branch granularity is a whole multiple of warp size; all threads in any given warp follow the same path
Performance Considerations – Slide 39
Control Flow Divergence (cont.)
if (threadIdx.x / WARP_SIZE > 2) { ... }
else { ... }
- Given an array of values, "reduce" them to a single value in parallel
- Examples: sum reduction (sum of all values in the array); max reduction (maximum of all values in the array)
- Typical parallel implementation:
  - Recursively halve # threads, add two values per thread
  - Takes log(n) steps for n elements, requires n/2 threads
Performance Considerations – Slide 40
Parallel Reduction
Performance Considerations – Slide 41
Example: Divergent Iteration

__global__ void per_thread_sum(int *indices, float *data, float *sums)
{
    ...  // (setup of i and sum elided in the original)
    // number of loop iterations is data dependent
    for (int j = indices[i]; j < indices[i+1]; j++) {
        sum += data[j];
    }
    sums[i] = sum;
}
- Assume an in-place reduction using shared memory
- The original vector is in device global memory
- The shared memory is used to hold a partial sum vector
- Each iteration brings the partial sum vector closer to the final sum
- The final solution will be in element 0
Performance Considerations – Slide 42
A Vector Reduction Example
Assume we have already loaded the array into
__shared__ float partialSum[]
Performance Considerations – Slide 43
A simple implementation
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t+stride];
}
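For context, a minimal sketch of the load step the slides assume has already happened; the kernel name and block size are illustrative:

__global__ void reduce(float *input)
{
    __shared__ float partialSum[256];  // assumes 256-thread blocks
    unsigned int t = threadIdx.x;
    // Stage one element per thread from global memory into shared memory.
    partialSum[t] = input[blockIdx.x * blockDim.x + t];
    __syncthreads();
    // ... reduction loop from the slide goes here; result left in partialSum[0] ...
}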
Performance Considerations – Slide 44
Vector Reduction with Bank Conflicts
[Figure: reduction tree over array elements 0-11. Iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms 0...3, 4..7, 8..11; iteration 3 forms 0..7 and 8..15.]
Performance Considerations – Slide 45
Vector Reduction with Branch Divergence
[Figure: the same reduction tree, annotated with the threads doing the work: threads 0, 2, 4, 6, 8, 10 are active in iteration 1, so only even-indexed threads in each warp take the addition path.]
- In each iteration, two control flow paths will be sequentially traversed for each warp
- Threads that perform addition and threads that do not
- Threads that do not perform addition may cost extra cycles, depending on the implementation of divergence
Performance Considerations – Slide 46
Some Observations
- No more than half of the threads will be executing at any time
- All odd-index threads are disabled right from the beginning!
- On average, less than ¼ of the threads will be activated for all warps over time
- After the 5th iteration, entire warps in each block will be disabled: poor resource utilization but no divergence
- This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire
Performance Considerations – Slide 47
Some Observations (cont.)
Assume we have already loaded the array into
__shared__ float partialSum[]
Performance Considerations – Slide 48
Shortcomings of the implementation
unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2) {
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t+stride];
}
BAD: Divergence due to interleaved branch decisions
Assume we have already loaded the array into
__shared__ float partialSum[]
Performance Considerations – Slide 49
A better implementation
unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x / 2; stride >= 1; stride >>= 1) {
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t+stride];
}
Performance Considerations – Slide 50
Less Divergence than Original

[Figure: the improved access pattern. In the first iteration, thread 0 computes 0+16 and thread 15 computes 15+31, keeping the active threads contiguous; later iterations (3, 4, ...) shrink the active range further.]
- Only the last 5 iterations will have divergence
- Entire warps will be shut down as iterations progress
- For a 512-thread block, 4 iterations to shut down all but one warp in each block
- Better resource utilization; will likely retire warps, and thus blocks, faster
- Recall, no bank conflicts either
Performance Considerations – Slide 51
Some Observations About the New Implementation
- For the last 6 loops only one warp is active (i.e. tid's 0..31)
- Shared reads and writes are SIMD synchronous within a warp, so skip __syncthreads() and unroll the last 6 iterations
Performance Considerations – Slide 52
A Potential Further Refinement (but a bad idea)
unsigned int tid = threadIdx.x;
for (unsigned int d = n>>1; d > 32; d >>= 1) {
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) { // unroll last 6 predicated steps
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}
This would not work properly if warp size decreases; you would need __syncthreads() between each statement!
However, having __syncthreads() in an if statement is problematic.
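One further caveat, an addition not on the slides: on this hardware generation, warp-synchronous code like this must also stop the compiler from caching shared-memory values in registers, conventionally by going through a volatile pointer:

volatile float *vsmem = shared;  // force a re-read of shared memory each statement
if (tid < 32) {
    vsmem[tid] += vsmem[tid + 32];
    // ... remaining unrolled steps as above ...
}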
- A single thread can drag a whole warp with it for a long time
- Know your data patterns
- If data is unpredictable, try to flatten peaks by letting threads work on multiple data items
Performance Considerations – Slide 53
Conclusion: Iteration Divergence
Predication example: <p1> LDR r1, r2, 0
- If p1 is TRUE, instruction executes normally
- If p1 is FALSE, instruction treated as NOP
Performance Considerations – Slide 54
Predicated Execution Concept
…
if (x == 10)
    c = c + 1;
…

becomes:

…
    LDR r5, X
    p1 <- r5 eq 10
<p1> LDR r1 <- C
<p1> ADD r1, r1, 1
<p1> STR r1 -> C
…
Performance Considerations – Slide 55
Predication very helpful for if-else
[Figure: control-flow graph in which block A branches to either B or C, both rejoining at D; under predication the four blocks are emitted as the straight-line sequence A, B, C, D.]
The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.
Performance Considerations – Slide 56
If-else example…
Unscheduled:

p1,p2 <- r5 eq 10
<p1> inst 1 from B
<p1> inst 2 from B
<p1> …
…
<p2> inst 1 from C
<p2> inst 2 from C
…

After the schedule interleaves the two predicated paths:

p1,p2 <- r5 eq 10
<p1> inst 1 from B
<p2> inst 1 from C
<p1> inst 2 from B
<p2> inst 2 from C
<p1> …
…
- Comparison instructions set condition codes (CC)
- Instructions can be predicated to write results only when CC meets criterion (CC != 0, CC >= 0, etc.)
- The compiler tries to predict if a branch condition is likely to produce many divergent warps
  - If guaranteed not to diverge: only predicates if < 4 instructions
  - If not guaranteed: only predicates if < 7 instructions
- May replace branches with instruction predication
Performance Considerations – Slide 57
Instruction Predication in G80
- ALL predicated instructions take execution cycles
- Those with false conditions don't write their output, or invoke memory loads and stores
- Saves branch instructions, so can be cheaper than serializing divergent paths
Performance Considerations – Slide 58
Instruction Predication in G80 (cont.)
S. A. Mahlke, R. E. Hank, J. E. McCormick, D. I. August, and W. W. Hwu, "A Comparison of Full and Partial Predicated Execution Support for ILP Processors," Proceedings of the 22nd International Symposium on Computer Architecture, June 1995, pp. 138-150. http://www.crhc.uiuc.edu/IMPACT/ftp/conference/isca-95-partial-pred.pdf
Also available in Readings in Computer Architecture, edited by Hill, Jouppi, and Sohi, Morgan Kaufmann, 2000.
Performance Considerations – Slide 59
For more information on instruction predication
- Recall that the streaming multiprocessor implements zero-overhead warp scheduling
- At any time, only one of the warps is executed by an SM
- Warps whose next instruction has its inputs ready for consumption are eligible for execution
- Eligible warps are selected for execution on a prioritized scheduling policy
- All threads in a warp execute the same instruction when selected
Performance Considerations – Slide 60
Performance Topic D: Occupancy
[Figure: zero-overhead warp scheduling timeline (TB = Thread Block, W = Warp). Warps TB1 W1, TB2 W1, TB3 W1, TB3 W2, TB1 W2, TB1 W3, ... are interleaved over time: whenever one stalls (TB1 W1 stall, TB2 W1 stall, TB3 W2 stall), the scheduler issues instructions from another eligible warp.]
- What happens if all warps are stalled? No instruction issued, so performance is lost
- Most common reason for stalling? Waiting on global memory
- If your code reads global memory every couple of instructions, you should try to maximize occupancy
- What determines occupancy? Register usage per thread and shared memory per thread block
Performance Considerations – Slide 61
Thread Scheduling
- Pool of registers and shared memory per streaming multiprocessor
- Each thread block grabs registers and shared memory
Performance Considerations – Slide 62
Resource Limits
[Figure: thread blocks TB 0, TB 1, TB 2 each occupying a slice of the SM's register file and shared memory.]
- Pool of registers and shared memory per streaming multiprocessor
- If one or the other is fully utilized, no more thread blocks can be scheduled
Performance Considerations – Slide 63
Resource Limits (cont.)
[Figure: the same pools with only TB 0 and TB 1 resident; one resource is fully utilized, so a third thread block cannot be scheduled.]
- Can only have eight thread blocks per streaming multiprocessor
- If they're too small, they can't fill up the SM
- Need 128 threads/thread block (gt200), 192 threads/TB (gf100)
- Higher occupancy has diminishing returns for hiding latency
Performance Considerations – Slide 64
Resource Limits (cont.)
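To make the limits concrete, a worked example with illustrative numbers (not from the slides): suppose an SM offers 16384 registers and 16 KB of shared memory, and a kernel needs 20 registers per thread and 4 KB of shared memory per 256-thread block. Registers permit floor(16384 / (20 × 256)) = 3 resident blocks, while shared memory permits 16 / 4 = 4; registers are the binding limit, so 3 blocks (768 threads) are resident.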
Performance Considerations – Slide 65
Hiding latency with more threads
- Use nvcc -Xptxas -v to get register and shared memory usage
- Plug those numbers into the CUDA Occupancy Calculator: http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
Performance Considerations – Slide 66
How do you know what you’re using?
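For reference, the compiler reports per-kernel resource usage when given that flag; the exact format varies across toolkit versions, but the output looks roughly like this (illustrative, not captured from a real build):

    nvcc -Xptxas -v reduce.cu
    ptxas info : Used 10 registers, 1024+16 bytes smem, 4 bytes cmem[1]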
Performance Considerations – Slide 67
CUDA GPU Occupancy Calculator Example
Performance Considerations – Slide 68
CUDA GPU Occupancy Calculator Example (cont.)
Performance Considerations – Slide 69
CUDA GPU Occupancy Calculator Example (cont.)
Performance Considerations – Slide 70
CUDA GPU Occupancy Calculator Example (cont.)
- Pass the option -maxrregcount=X to nvcc
- This isn't magic; you won't get occupancy for free
- Use this very carefully when you are right on the edge
Performance Considerations – Slide 71
How to influence how many registers you use
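For instance, to cap register usage at 32 per thread (the compiler spills the rest to local memory, so watch for a new bottleneck):

    nvcc -maxrregcount=32 -Xptxas -v kernel.cu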
- Kernel launches aren't free (a timing sketch follows below)
- A null kernel launch will take non-trivial time
- The actual number changes with hardware generations and driver software, so I can't give you one number
- Independent kernel launches are cheaper than dependent kernel launches
- Dependent launch: some read-back to the CPU
- If you are launching lots of small grids, you will lose substantial performance due to this effect
Performance Considerations – Slide 72
Performance Topic E: Kernel Launch Overhead
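A minimal sketch of measuring the overhead yourself with CUDA events; the kernel and iteration count are illustrative:

#include <cstdio>

__global__ void null_kernel() {}

int main()
{
    // Time many back-to-back null launches; a single launch is too
    // short to measure reliably.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < 10000; ++i)
        null_kernel<<<1, 1>>>();  // no work: all cost is launch overhead
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("per-launch overhead: %.2f us\n", ms * 1000.0f / 10000.0f);
    return 0;
}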
- If you are reading back data to the CPU for control decisions, consider doing it on the GPU
- Even though the GPU is slow at serial tasks, it can do surprising amounts of work before you've used up the kernel launch overhead
Performance Considerations – Slide 73
Kernel Launch Overhead (cont.)
- Measure, measure, then measure some more!
- Once you identify bottlenecks, apply judicious tuning
- What is most important depends on your program
- You'll often have a series of bottlenecks, where each optimization gives a smaller boost than expected
Performance Considerations – Slide 74
In Conclusion…
Reading: Chapter 6, "Programming Massively Parallel Processors" by Kirk and Hwu.
Based on original material from:
- The University of Illinois at Urbana-Champaign: David Kirk, Wen-mei W. Hwu
- Stanford University: Jared Hoberock, David Tarjan
Revision history: last updated 10/11/2011. Previous revisions: 9/9/2011.
Performance Considerations – Slide 75
End Credits