GPU Computation Strategies & Tricks
Ian Buck
NVIDIA
2
Recent Trends
3
Compute is Cheap
• parallelism
  • to keep 100s of ALUs per chip busy
• shading is highly parallel
  • millions of fragments per frame
[Figure: 90nm chip (12mm) with a 64-bit FPU (0.5mm) drawn to scale; courtesy of Bill Dally]
4
...but Bandwidth is Expensive
• latency tolerance
  • to cover 500 cycle remote memory access time
• locality
  • to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth
[Figure: 90nm chip (12mm), 0.5mm, 1 clock; courtesy of Bill Dally]
5
Optimizing for GPUs
• shading is compute intensive
  • 100s of floating point operations
  • output 1 32-bit color value
• compute to bandwidth ratio
  • arithmetic intensity
[Figure: 90nm chip (12mm), 0.5mm, 1 clock; courtesy of Bill Dally]
6
Compute vs. Bandwidth
[Chart: GFLOPS and GFloats/sec for R300, R360, R420]
7
Arithmetic Intensity
[Chart: GFLOPS and GFloats/sec for R300, R360, R420; 7x gap]
8
Arithmetic Intensity
GPU wins when…
• Arithmetic intensity
[Chart: Segment, 3.7 ops per word, 11 GFLOPS; GeForce 7800 GTX vs. Pentium 4 3.0 GHz]
9
Arithmetic Intensity
• Overlapping computation with communication
10
Memory Bandwidth
GPU wins when…
• Streaming memory bandwidth
[Chart: SAXPY and FFT; GeForce 7800 GTX vs. Pentium 4 3.0 GHz]
11
Memory Bandwidth
• Streaming Memory System
  • Optimized for sequential performance
• GPU cache is limited
  • Optimized for texture filtering
  • Read-only
  • Small
• Local storage
  • CPU >> GPU
[Chart: GeForce 7800 GTX vs. Pentium 4]
12
Computational Intensity
• Considering GPU transfer costs: $T_r$
[Diagram: GPU, Memory, Memory, CPU]
13
Computational Intensity
• Considering GPU transfer costs: $T_r$
• Computational intensity: $K_{gpu} / T_r$ (work per word transferred)
• To outperform the CPU: $K_{gpu} / T_r > \frac{1}{s - 1}$, where the speedup is $s \equiv K_{cpu} / K_{gpu}$
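The condition on this slide follows from comparing total times: the GPU wins when K_gpu + T_r < K_cpu, which rearranges to K_gpu / T_r > 1 / (s - 1). A minimal Python sketch of that check is below. It assumes K_cpu and K_gpu are kernel execution times and T_r is the transfer time, all in the same unit; the function names and sample values are illustrative and do not come from the slides.

```python
def computational_intensity(k_gpu, t_r):
    """Work per word transferred: GPU kernel time over transfer time."""
    return k_gpu / t_r


def gpu_outperforms_cpu(k_cpu, k_gpu, t_r):
    """True when the GPU is faster even after paying the transfer cost."""
    s = k_cpu / k_gpu  # raw kernel speedup, ignoring transfers
    if s <= 1.0:
        # The GPU kernel is not faster on its own, so transfers cannot help.
        return False
    return computational_intensity(k_gpu, t_r) > 1.0 / (s - 1.0)


if __name__ == "__main__":
    # Hypothetical timings (arbitrary units), chosen only for illustration.
    print(gpu_outperforms_cpu(k_cpu=10.0, k_gpu=2.0, t_r=4.0))
    print(gpu_outperforms_cpu(k_cpu=10.0, k_gpu=2.0, t_r=9.0))
```

The second call has the same 5x kernel speedup as the first, but the transfer cost is large enough that the CPU still finishes sooner.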
14
Kernel Overhead
• Considering CPU cost of issuing a kernel
  • Generating compute geometry
  • Graphics driver
[Chart regions: CPU limited, GPU limited]
15
Floating Point Precision
• NVIDIA FP32
  • s23e8
• ATI 24-bit float
  • s16e7
• NVIDIA FP16
  • s10e5
[Bit layout: s | exponent | mantissa]
sign * 1.mantissa * 2^(exponent+bias)
16
Floating Point Precision
• Common Bug
  • Pack large 1D array in 2D texture
  • Compute 1D address in shader
  • Convert 1D address into 2D
  • FP precision will leave unaddressable texels!
Largest Counting Number
NVIDIA FP32: 16,777,217
ATI 24-bit float: 131,073
NVIDIA FP16: 2,049
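Each number listed here equals 2^(m+1) + 1 for a format with m mantissa bits, which is the first integer the format cannot represent exactly. The Python sketch below checks this for FP32 and FP16 by rounding through the standard `struct` module, and shows how a rounded 1D address lands on the wrong texel. The helper names and the 4096-wide texture are assumptions for illustration. The ATI 24-bit format has no `struct` equivalent, so only the formula covers it.

```python
import struct


def round_to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def round_to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", float(x)))[0]


def first_unrepresentable_integer(mantissa_bits):
    """Smallest positive integer that needs more than mantissa_bits + 1 bits."""
    return 2 ** (mantissa_bits + 1) + 1


def address_1d_to_2d(index, width):
    """Convert a 1D array index into (x, y) texel coordinates."""
    return index % width, index // width


if __name__ == "__main__":
    width = 4096  # hypothetical texture width
    wanted = first_unrepresentable_integer(23)   # FP32 has 23 mantissa bits
    computed = int(round_to_fp32(wanted))        # what an FP32 shader would hold
    print(address_1d_to_2d(wanted, width))       # texel the program meant
    print(address_1d_to_2d(computed, width))     # texel it actually reads
```

Keeping the 1D index below the limit for the format in use, or computing the x and y coordinates separately, avoids the unaddressable texels.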
17
Scatter Techniques
• Problem: a[i] = p
  • Indirect write
  • Can’t set the x,y of fragment in pixel shader
  • Often want to do: a[i] += p
18
Scatter Techniques
• Solution 1: Convert to Gather
[Diagram: masses m1, m2 connected by springs with forces f1, f2, f3]
for each spring
    f = computed force
    mass_force[left] += f;
    mass_force[right] -= f;
19
Scatter Techniques
• Solution 1: Convert to Gather
[Diagram: masses m1, m2 connected by springs with forces f1, f2, f3]
for each spring
    f = computed force
for each mass
    mass_force = f[left] - f[right];
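The scatter-to-gather conversion can be checked on a 1D chain of masses. The Python sketch below is an illustration under assumptions, not the author's implementation: the linear spring law, the parameter names, and the chain layout (spring s joins mass s and mass s+1) are all made up for the example. The sign of each gathered term depends on how f is oriented. With the orientation used by the scatter loop (left mass gets +f, right mass gets -f), each mass gathers the force of its right-hand spring minus that of its left-hand spring.

```python
def spring_forces(positions, rest_length=1.0, stiffness=1.0):
    """First pass: one force per spring (assumed linear spring law)."""
    return [
        stiffness * ((positions[s + 1] - positions[s]) - rest_length)
        for s in range(len(positions) - 1)
    ]


def scatter_forces(f, n_masses):
    """Scatter form: each spring writes into the two masses it touches."""
    mass_force = [0.0] * n_masses
    for s, fs in enumerate(f):
        mass_force[s] += fs       # left mass of spring s
        mass_force[s + 1] -= fs   # right mass of spring s
    return mass_force


def gather_forces(f, n_masses):
    """Gather form: each mass reads the springs on either side of it."""
    mass_force = []
    for m in range(n_masses):
        right_spring = f[m] if m < len(f) else 0.0   # no spring past the end
        left_spring = f[m - 1] if m > 0 else 0.0     # no spring before the start
        mass_force.append(right_spring - left_spring)
    return mass_force


if __name__ == "__main__":
    positions = [0.0, 1.5, 2.0, 4.0]
    f = spring_forces(positions)
    print(scatter_forces(f, len(positions)) == gather_forces(f, len(positions)))
```

The gather form writes exactly one output per mass at a fixed location, which is what a fragment program can do.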
20
Scatter Techniques
• Solution 2: Address Sorting
  • Sort & Search
    • Shader outputs destination address and data
    • Bitonic sort based on address
    • Run binary search shader over destination buffer
      • Each fragment searches for source data
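The sort-and-search steps can be emulated on the CPU as a rough illustration. In the Python sketch below, the built-in `sorted` stands in for the bitonic sort and `bisect_left` stands in for the binary search shader; both are substitutions made for brevity, and the function name and parameters are hypothetical. Each destination element searches the sorted list for its own address and accumulates every match, which also covers the a[i] += p case.

```python
from bisect import bisect_left


def scatter_by_sort_and_search(pairs, size, default=0.0):
    """Emulate scatter with a sort pass followed by a per-destination search.

    pairs: list of (destination_address, data) produced by the first pass.
    size:  number of elements in the destination buffer.
    """
    # Sort pass (stand-in for a GPU bitonic sort keyed on address).
    ordered = sorted(pairs, key=lambda pair: pair[0])
    addresses = [address for address, _ in ordered]

    out = []
    for dest in range(size):  # one "fragment" per destination element
        k = bisect_left(addresses, dest)  # binary search for own address
        total = default
        # Accumulate every entry that targets this destination.
        while k < len(addresses) and addresses[k] == dest:
            total += ordered[k][1]
            k += 1
        out.append(total)
    return out


if __name__ == "__main__":
    writes = [(3, 1.0), (0, 2.0), (3, 0.5)]
    print(scatter_by_sort_and_search(writes, size=5))
```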
21
Scatter Techniques
• Solution 3: Vertex processor
  • Render points
    • Use vertex shader to set destination
    • or just read back the data and re-issue
  • Vertex Textures
    • Render data and address to texture
    • Issue points, set point x,y in vertex shader using address texture
    • Requires texld instruction in vertex program
22
Strategies & Tricks: Conditionals
23
Conditionals
• Problem:
  • Limited fragment shader conditional support
if (a)
    b = f();
else
    b = g();
24
Pre-computation
• Pre-compute anything that will not change every iteration!
• Example: static obstacles in fluid sim
  • When user draws obstacles, compute texture containing boundary info for cells
  • Reuse that texture until obstacles are modified
  • Combine with Z-cull for higher performance!
25
Static Branch Resolution
• Avoid branches where outcome is fixed
  • One region is always true, another false
  • Separate fragment programs (FPs) for each region, no branches
• Example: boundaries
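The boundary example can be sketched on a 1D grid. In the Python illustration below, one version tests "am I on the boundary?" for every cell, while the other runs a branch-free interior kernel and a separate boundary kernel over fixed regions. The averaging stencil and the zero boundary value are assumptions picked for the example, not taken from the slides.

```python
def step_with_branch(u):
    """Single kernel: every cell evaluates the boundary test."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        if i == 0 or i == n - 1:
            out[i] = 0.0                          # boundary rule
        else:
            out[i] = 0.5 * (u[i - 1] + u[i + 1])  # interior rule
    return out


def step_static_regions(u):
    """Two kernels over fixed regions: no per-cell branch in either one."""
    n = len(u)
    out = [0.0] * n
    for i in range(1, n - 1):   # interior region, interior program only
        out[i] = 0.5 * (u[i - 1] + u[i + 1])
    for i in (0, n - 1):        # boundary region, boundary program only
        out[i] = 0.0
    return out


if __name__ == "__main__":
    u = [1.0, 2.0, 4.0, 8.0, 16.0]
    print(step_with_branch(u) == step_static_regions(u))
```

On the GPU the two regions would correspond to two draw calls with different fragment programs bound, for example a quad for the interior and lines for the border.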
26
Branching with Occlusion Query
• Use it for iteration termination
do
{ // outer loop on CPU
    BeginOcclusionQuery
    {
        // Render with fragment program that
        // discards fragments that satisfy
        // termination criteria
    }
    EndQuery
} while query returns > 0
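The control flow of this loop can be emulated on the CPU. In the Python sketch below, `render_pass` plays the role of one render with the occlusion query active: it skips ("discards") elements that already meet the termination criterion and returns how many elements it actually wrote, which stands in for the query result. The halving update and the threshold test are placeholders chosen for illustration.

```python
def render_pass(values, threshold):
    """One pass: update active elements, return (new_values, fragments_drawn)."""
    out = list(values)
    drawn = 0
    for i, v in enumerate(values):
        if v <= threshold:
            continue          # termination criterion met: fragment discarded
        out[i] = v / 2.0      # placeholder for the real per-element update
        drawn += 1            # counted by the (emulated) occlusion query
    return out, drawn


def iterate_until_done(values, threshold):
    """Outer CPU loop: keep rendering while the query reports drawn fragments."""
    passes = 0
    while True:
        values, drawn = render_pass(values, threshold)
        if drawn == 0:        # query returned 0: every fragment was discarded
            break
        passes += 1
    return values, passes


if __name__ == "__main__":
    print(iterate_until_done([8.0, 1.0, 3.0], threshold=1.0))
```

The final pass does no useful work; it only confirms that nothing is left to update, which is the price of letting the GPU decide when to stop.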
27
Conditionals
• Using the depth buffer
  • Set Z buffer to a
  • Z-test can prevent shader execution
    • glEnable(GL_DEPTH_TEST)
  • Locality in conditional
if (a)
    b = f();
else
    b = g();
28
Conditionals
• Using the depth buffer
  • Optimization disabled with:
    ATI:
    • Writing Z in shader
    • Enabling Alpha test
    • Using texkill in shader
    NVIDIA:
    • Changing depth test direction in frame
    • Writing stencil while rejecting based on stencil
    • Changing stencil func/ref/mask in frame
29
Depth Conditionals
[Chart: GeForce 7800 GTX]
30
Conditionals
• Predication
  • Execute both f and g
  • Use CMP instruction
    • CMP b, -a, f, g
  • Executes all conditional code
if (a)
    b = f();
else
    b = g();
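A scalar Python emulation of predication with CMP is sketched below. It relies on the ARB_fragment_program definition of CMP (the result is the second operand where the first operand is negative, otherwise the third) and assumes the condition `a` is encoded as 1.0 for true and 0.0 for false. The helper names are illustrative, and the real instruction operates per component on 4-vectors.

```python
def cmp_instr(src0, src1, src2):
    """Scalar model of CMP: src1 if src0 < 0, otherwise src2."""
    return src1 if src0 < 0 else src2


def predicated_select(a, f, g):
    """Evaluate both sides, then select; no branch decides what runs."""
    f_val = f()  # always executed
    g_val = g()  # always executed
    return cmp_instr(-a, f_val, g_val)


if __name__ == "__main__":
    calls = []
    f = lambda: calls.append("f") or 10.0
    g = lambda: calls.append("g") or 20.0
    print(predicated_select(1.0, f, g), calls)
```

Both `f` and `g` appear in `calls` regardless of `a`, which is the cost the slide points out: all conditional code executes.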
31
Conditionals
• Predication
  • Use DP4 instruction
    • DP4 b.x, a, f
  • Executes all conditional code
if (a.x) b = x;
else if (a.y) b = y;
else if (a.z) b = z;
else if (a.w) b = w;
a = (0, 1, 0, 0)
f = (x, y, z, w)
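The DP4 trick works because a 4-component dot product with a one-hot mask returns exactly the selected component. A small Python sketch follows; the `dp4` helper is a plain model of the instruction, and the sample values are made up.

```python
def dp4(u, v):
    """4-component dot product, as computed by the DP4 instruction."""
    assert len(u) == 4 and len(v) == 4
    return sum(x * y for x, y in zip(u, v))


if __name__ == "__main__":
    a = (0.0, 1.0, 0.0, 0.0)      # one-hot condition mask: a.y is set
    f = (10.0, 20.0, 30.0, 40.0)  # candidate results (x, y, z, w)
    print(dp4(a, f))
```

This replaces the whole if/else chain only when exactly one component of `a` is 1 and the rest are 0. With several components set, the dot product returns their sum instead of the first match.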
32
Conditionals
• Conditional Instructions
  • Available with NV_fragment_program2
MOVC CC, R0;
IF GT.x;
    MOV R0, R1; # executes if R0.x > 0
ELSE;
    MOV R0, R2; # executes if R0.x <= 0
ENDIF;
33
GeForce 6+ Series Branching
• True, SIMD branching
  • Lots of incoherent branching can hurt performance
  • Should have coherent regions of ~1000 pixels
    • That is only about 30x30 pixels, so still very usable!
• Don’t ignore overhead of branch instructions
  • Branching over < 5 instructions may not be worth it
• Use branching for early exit from loops
  • Save a lot of computation
34
Conditional Instructions
[Chart: GeForce 7800 GTX]
35
Branching Techniques
• Fragment program branches can be expensive
  • No true fragment branching on GeForce FX or Radeon 9x00-X850
  • SIMD branching on GeForce 6/7 Series
    • Incoherent branching hurts performance
• Sometimes better to move decisions up the pipeline
  • Pre-computation
  • Replace with math
  • Occlusion Query
  • Static Branch Resolution
  • Depth Buffer