GPU Computation Strategies & Tricks

Ian Buck

NVIDIA

Recent Trends

Compute is Cheap

• parallelism
  • to keep 100s of ALUs per chip busy

• shading is highly parallel
  • millions of fragments per frame

[Figure: 90nm chip (12mm) with a 64-bit FPU (0.5mm) drawn to scale; courtesy of Bill Dally]

...but Bandwidth is Expensive

• latency tolerance
  • to cover 500 cycle remote memory access time

• locality
  • to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth

[Figure: 90nm chip, 12mm; 0.5mm; 1 clock]

Optimizing for GPUs

• shading is compute intensive
  • 100s of floating point operations
  • output 1 32-bit color value

• compute to bandwidth ratio
  • arithmetic intensity

[Figure: 90nm chip, 12mm; 0.5mm; 1 clock]
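
As a rough illustration of the compute-to-bandwidth ratio, the sketch below computes arithmetic intensity as operations per word of memory traffic. The function name and the sample numbers (a hypothetical shader doing 100 operations on 4 input words, and SAXPY counted as 2 operations over 3 words) are assumptions chosen for illustration, not measurements.

```python
def arithmetic_intensity(flops, words_read, words_written):
    """Floating point operations per word of memory traffic."""
    return flops / (words_read + words_written)

# Hypothetical shader: 100 operations, 4 words in, 1 word (color) out.
shader = arithmetic_intensity(100, 4, 1)

# SAXPY, y[i] = a * x[i] + y[i]: 2 operations, 2 words in, 1 word out.
saxpy = arithmetic_intensity(2, 2, 1)

print(f"shader: {shader:.1f} ops/word, saxpy: {saxpy:.2f} ops/word")
```

Under these assumptions the shader does far more work per word than SAXPY, which is the property the slide calls arithmetic intensity.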


Compute vs. Bandwidth

[Chart: GFLOPS and GFloats/sec for R300, R360, R420]

Arithmetic Intensity

[Chart: GFLOPS and GFloats/sec for R300, R360, R420; 7x gap]

GPU wins when…

• Arithmetic intensity
  • Segment: 3.7 ops per word, 11 GFLOPS

[Chart: GeForce 7800 GTX vs. Pentium 4 3.0 GHz]

• Overlapping computation with communication

Memory Bandwidth

GPU wins when…

• Streaming memory bandwidth
  • SAXPY
  • FFT

[Chart: GeForce 7800 GTX vs. Pentium 4 3.0 GHz]

Memory Bandwidth

• Streaming Memory System
  • Optimized for sequential performance

• GPU cache is limited
  • Optimized for texture filtering
  • Read-only
  • Small

• Local storage
  • CPU >> GPU

[Chart: GeForce 7800 GTX vs. Pentium 4]

Computational Intensity

• Considering GPU transfer costs: $T_r$

[Diagram: GPU with its memory, CPU with its memory]

Computational Intensity

• Considering GPU transfer costs: $T_r$

• Computational intensity: $K_{gpu} / T_r$
  • work per word transferred

• to outperform the CPU: $K_{gpu} / T_r > \dfrac{1}{s - 1}$
  • speedup: $s = K_{cpu} / K_{gpu}$
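
The break-even condition can be checked numerically. The sketch below assumes the GPU pays $T_r + K_{gpu}$ per word and the CPU pays $K_{cpu}$; dividing $T_r + K_{gpu} < K_{cpu}$ by $K_{gpu}$ gives the threshold $1/(s-1)$ on $K_{gpu}/T_r$. The function names and timings are hypothetical.

```python
def gpu_wins(k_cpu, k_gpu, t_r):
    """True when transfer plus GPU compute time per word beats the CPU time."""
    return t_r + k_gpu < k_cpu

def intensity_threshold(k_cpu, k_gpu):
    """Minimum K_gpu / T_r needed to outperform the CPU (requires s > 1)."""
    s = k_cpu / k_gpu
    if s <= 1:
        raise ValueError("the GPU kernel must be faster than the CPU kernel")
    return 1.0 / (s - 1.0)

# Hypothetical timings in arbitrary units per word: speedup s = 4.
k_cpu, k_gpu = 8.0, 2.0
threshold = intensity_threshold(k_cpu, k_gpu)

for t_r in (1.0, 5.0, 7.0):
    intensity = k_gpu / t_r
    # Both tests should agree for every transfer cost.
    print(t_r, intensity > threshold, gpu_wins(k_cpu, k_gpu, t_r))
```

With these numbers the GPU stops paying off once the transfer cost per word exceeds 6 units, even though its kernel is four times faster.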

Kernel Overhead

• Considering the CPU cost of issuing a kernel
  • Generating compute geometry
  • Graphics driver

[Chart: CPU limited vs. GPU limited regions]

Floating Point Precision

• NVIDIA FP32
  • s23e8

• ATI 24-bit float
  • s16e7

• NVIDIA FP16
  • s10e5

[Diagram: bit fields labeled s (sign), exponent, mantissa]

sign * 1.mantissa * 2^(exponent+bias)

Floating Point Precision

• Common Bug
  • Pack large 1D array in 2D texture
  • Compute 1D address in shader
  • Convert 1D address into 2D
  • FP precision will leave unaddressable texels!

Largest Counting Number
  NVIDIA FP32: 16,777,217
  ATI 24-bit float: 131,073
  NVIDIA FP16: 2,049
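
The listed figures equal 2^(m+1) + 1 for m explicit mantissa bits (23, 16, and 10). That is the first integer the format cannot store exactly, while every integer up to 2^(m+1) can be. The sketch below checks this with a hypothetical `quantize` helper that rounds to m explicit mantissa bits. It assumes round-to-nearest-even and ignores exponent range limits, which may not match the actual hardware.

```python
import math

def quantize(value, mantissa_bits):
    """Round value to a float with the given number of explicit mantissa bits."""
    if value == 0:
        return 0.0
    frac, exp = math.frexp(value)          # value = frac * 2**exp, 0.5 <= |frac| < 1
    scale = 2 ** (mantissa_bits + 1)       # explicit bits plus the implicit leading 1
    return math.ldexp(round(frac * scale) / scale, exp)

def first_uncountable(mantissa_bits):
    """Smallest positive integer the format cannot represent exactly."""
    return 2 ** (mantissa_bits + 1) + 1

for name, bits in (("NVIDIA FP32", 23), ("ATI 24-bit", 16), ("NVIDIA FP16", 10)):
    n = first_uncountable(bits)
    # n - 1 survives quantization; n collapses onto a neighboring value.
    print(name, n, quantize(n - 1, bits) == n - 1, quantize(n, bits) == n)
```

A 1D address at or beyond this limit gets rounded before it is converted to 2D, so some texels can never be addressed.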

Scatter Techniques

• Problem: a[i] = p
  • Indirect write
  • Can’t set the x,y of fragment in pixel shader
  • Often want to do: a[i] += p

• Solution 1: Convert to Gather

[Diagram: masses m1, m2 joined by springs with forces f1, f2, f3]

for each spring
    f = computed force
    mass_force[left] += f;
    mass_force[right] -= f;

• Solution 1: Convert to Gather

[Diagram: masses m1, m2 joined by springs with forces f1, f2, f3]

for each spring
    f = computed force

for each mass
    mass_force = f[left] - f[right];
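
A minimal sketch of the scatter-to-gather conversion on a 1D chain of masses follows. The chain layout, the linear spring law, and all function names are assumptions for illustration. The sign convention is chosen so that the two versions agree, and it may differ from the one implied by the pseudocode above.

```python
def spring_forces(positions, rest_length=1.0, stiffness=1.0):
    """Force in each spring; spring i joins mass i (left) and mass i + 1 (right)."""
    return [stiffness * ((positions[i + 1] - positions[i]) - rest_length)
            for i in range(len(positions) - 1)]

def scatter(forces, num_masses):
    """Per-spring loop with indirect writes (what a fragment shader cannot do)."""
    mass_force = [0.0] * num_masses
    for i, f in enumerate(forces):
        mass_force[i] += f        # left mass of spring i
        mass_force[i + 1] -= f    # right mass of spring i
    return mass_force

def gather(forces, num_masses):
    """Per-mass loop: each output only reads the springs attached to it."""
    mass_force = []
    for j in range(num_masses):
        right_spring = forces[j] if j < num_masses - 1 else 0.0  # j is its left mass
        left_spring = forces[j - 1] if j > 0 else 0.0            # j is its right mass
        mass_force.append(right_spring - left_spring)
    return mass_force

positions = [0.0, 1.5, 2.0, 3.5]
f = spring_forces(positions)
print(scatter(f, len(positions)), gather(f, len(positions)))
```

The gather version writes each output exactly once at a fixed location, which is the access pattern a fragment program supports.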

• Solution 2: Address Sorting
  • Sort & Search
    • Shader outputs destination address and data
    • Bitonic sort based on address
    • Run binary search shader over destination buffer
      – Each fragment searches for source data
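
The sort-and-search steps can be sketched on the CPU as follows. Python's `sorted` stands in for the bitonic sort and `bisect_left` for the binary search shader. The function name, the default value for untouched elements, and the assumption that each address is written at most once are all illustrative choices.

```python
from bisect import bisect_left

def scatter_by_sort_and_search(writes, dest_size, default=0.0):
    """Emulate a[i] = p using a sort followed by per-destination searches.

    writes: list of (address, data) pairs, as a shader would output them.
    """
    # Step 1: sort by address (sorted() stands in for a GPU bitonic sort).
    ordered = sorted(writes, key=lambda pair: pair[0])
    addresses = [addr for addr, _ in ordered]

    # Step 2: every destination element binary-searches for its own address.
    dest = []
    for i in range(dest_size):
        pos = bisect_left(addresses, i)
        if pos < len(addresses) and addresses[pos] == i:
            dest.append(ordered[pos][1])
        else:
            dest.append(default)   # nothing was scattered to this element
    return dest

print(scatter_by_sort_and_search([(3, 0.5), (0, 1.5), (5, 2.5)], 6))
```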

• Solution 3: Vertex processor
  • Render points
    • Use vertex shader to set destination
    • or just read back the data and re-issue
  • Vertex Textures
    • Render data and address to texture
    • Issue points, set point x,y in vertex shader using address texture
    • Requires texld instruction in vertex program

Strategies & Tricks: Conditionals

• Problem:
  • Limited fragment shader conditional support

if (a)
    b = f();
else
    b = g();

Pre-computation

• Pre-compute anything that will not change every iteration!

• Example: static obstacles in fluid sim
  • When user draws obstacles, compute texture containing boundary info for cells
  • Reuse that texture until obstacles are modified
  • Combine with Z-cull for higher performance!

Static Branch Resolution

• Avoid branches where outcome is fixed
  • One region is always true, another false
  • Separate fragment programs (FPs) for each region, no branches

• Example: boundaries
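
The boundary example can be sketched on a 1D grid. Below, `step_with_branch` tests every cell, while `step_static` runs a branch-free interior loop and a separate boundary step, standing in for two fragment programs drawn over different regions. The averaging rule, the zero boundary value, and the function names are assumptions for illustration. The grid needs at least two cells.

```python
def step_with_branch(cells):
    """Every cell evaluates the boundary test."""
    out = []
    for i in range(len(cells)):
        if i == 0 or i == len(cells) - 1:
            out.append(0.0)                                   # boundary rule
        else:
            out.append((cells[i - 1] + cells[i + 1]) / 2.0)   # interior rule
    return out

def step_static(cells):
    """Same result with the decision resolved once, outside the per-cell work."""
    n = len(cells)
    out = [0.0] * n
    # "Interior program": runs only over interior cells, no branch.
    for i in range(1, n - 1):
        out[i] = (cells[i - 1] + cells[i + 1]) / 2.0
    # "Boundary program": runs only over the two edge cells, no branch.
    out[0] = 0.0
    out[n - 1] = 0.0
    return out

grid = [1.0, 2.0, 4.0, 8.0]
print(step_with_branch(grid), step_static(grid))
```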

Branching with Occlusion Query

• Use it for iteration termination

do
{ // outer loop on CPU
    BeginOcclusionQuery
    {
        // Render with fragment program that
        // discards fragments that satisfy
        // termination criteria
    }
    EndQuery
} while query returns > 0
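
A CPU-side emulation of this loop is sketched below. The count returned by `render_pass` plays the role of the occlusion query result, that is, the number of fragments that were not discarded. The halving update, the tolerance, and the names are hypothetical stand-ins for a real iterative solver.

```python
def render_pass(values, tolerance):
    """Update every unconverged element and count the ones that were drawn."""
    survivors = 0
    for i, v in enumerate(values):
        if v <= tolerance:        # termination criterion met: fragment discarded
            continue
        values[i] = v / 2.0       # hypothetical per-element update
        survivors += 1
    return survivors

values = [8.0, 1.0, 0.25, 3.0]
passes = 0
while True:                        # outer loop on the CPU
    drawn = render_pass(values, tolerance=0.5)
    if drawn == 0:                 # query returned 0: every element has converged
        break
    passes += 1

print(passes, values)
```

The CPU only ever sees a single count per pass, so it can decide when to stop without reading the data buffer back.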

• Using the depth buffer
  • Set Z buffer to a
  • Z-test can prevent shader execution
    • glEnable(GL_DEPTH_TEST)
  • Locality in conditional

if (a)
    b = f();
else
    b = g();

• Using the depth buffer
  • Optimization disabled with:

ATI:
  • Writing Z in shader
  • Enabling Alpha test
  • Using texkill in shader

NVIDIA:
  • Changing depth test direction in frame
  • Writing stencil while rejecting based on stencil
  • Changing stencil func/ref/mask in frame

Depth Conditionals

[Chart: GeForce 7800 GTX]

• Predication
  • Execute both
    • f and g
  • Use CMP instruction
    • CMP b, -a, f, g
  • Executes all conditional code

if (a)
    b = f();
else
    b = g();
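
The select performed by `CMP b, -a, f, g` can be emulated as below. `cmp_select` follows the ARB fragment program rule of picking the second operand where the first is negative. It is assumed here that `a` is a number that is positive when the condition holds. The helper names are hypothetical.

```python
def cmp_select(src0, src1, src2):
    """CMP semantics: src1 where src0 < 0, otherwise src2."""
    return src1 if src0 < 0 else src2

def predicated(a, f, g):
    """Evaluate both sides, then keep one result; no branch on the data path."""
    f_result = f()
    g_result = g()
    return cmp_select(-a, f_result, g_result)

calls = []

def f():
    calls.append("f")
    return 1.0

def g():
    calls.append("g")
    return 2.0

b = predicated(1.0, f, g)
print(b, calls)   # both f and g were evaluated
```

The cost is always that of f plus g, which is why predication pays off only when both sides are short.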

• Predication
  • Use DP4 instruction
    • DP4 b.x, a, f
  • Executes all conditional code

if (a.x) b = x;
else if (a.y) b = y;
else if (a.z) b = z;
else if (a.w) b = w;

a = (0, 1, 0, 0)
f = (x, y, z, w)
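
The dot-product selection can be written out as a short sketch. The numeric values for x, y, z, w are hypothetical, and `a` is assumed to be a one-hot mask as in the example above.

```python
def dp4(u, v):
    """Four-component dot product, as the DP4 instruction computes."""
    return sum(ui * vi for ui, vi in zip(u, v))

a = (0, 1, 0, 0)                 # one-hot mask: the a.y case is selected
f = (10.0, 20.0, 30.0, 40.0)     # hypothetical values for x, y, z, w

b = dp4(a, f)
print(b)
```

Because the mask has a single 1, the dot product returns exactly the selected component and replaces the whole if/else chain with one instruction.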

• Conditional Instructions
  • Available with NV_fragment_program2

MOVC CC, R0;
IF GT.x;
    MOV R0, R1; # executes if R0.x > 0
ELSE;
    MOV R0, R2; # executes if R0.x <= 0
ENDIF;

GeForce 6+ Series Branching

• True, SIMD branching
  • Lots of incoherent branching can hurt performance
  • Should have coherent regions of ~1000 pixels
    • That is only about 30x30 pixels, so still very usable!

• Don’t ignore overhead of branch instructions
  • Branching over < 5 instructions may not be worth it

• Use branching for early exit from loops
  • Save a lot of computation
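
The effect of coherence can be illustrated with a toy cost model. It assumes pixels execute in lockstep groups and that a group pays for a branch side whenever any of its pixels takes that side. The group size and the costs are hypothetical, and the model is not a description of the actual hardware.

```python
def simd_cost(mask, group_size, cost_true, cost_false):
    """Total cost when pixels run in lockstep groups of group_size."""
    total = 0
    for start in range(0, len(mask), group_size):
        group = mask[start:start + group_size]
        if any(group):            # at least one pixel takes the true side
            total += cost_true
        if not all(group):        # at least one pixel takes the false side
            total += cost_false
    return total

coherent = [True] * 8 + [False] * 8     # large uniform regions
incoherent = [True, False] * 8          # outcome flips every pixel

print(simd_cost(coherent, 4, 10, 10), simd_cost(incoherent, 4, 10, 10))
```

In this model the incoherent mask costs twice as much for the same number of pixels, because every group ends up executing both sides.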

Conditional Instructions

[Chart: GeForce 7800 GTX]

Branching Techniques

• Fragment program branches can be expensive
  • No true fragment branching on GeForce FX or Radeon 9x00-X850
  • SIMD branching on GeForce 6/7 Series
    • Incoherent branching hurts performance

• Sometimes better to move decisions up the pipeline
  • Pre-computation
  • Replace with math
  • Occlusion Query
  • Static Branch Resolution
  • Depth Buffer