DirectCompute Accelerated Separable Filtering

Transcript of DirectCompute Accelerated Separable Filtering

Page 1: DirectCompute Accelerated Separable Filtering
Page 2: DirectCompute Accelerated Separable Filtering

DirectCompute Accelerated Separable Filtering

28th February 2011, AMD's Favorite Effects

Page 3: DirectCompute Accelerated Separable Filtering

Separable Filters
• Much faster than executing the full 2D kernel in a single pass
• Classically performed by the Pixel Shader
• Consists of a horizontal and a vertical pass
• Source image over-sampling increases with kernel size
  – Shader is usually TEX instruction limited
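As a reference point for the DirectCompute kernels later in the deck, the classical Pixel Shader horizontal pass might look like the minimal sketch below. The resource and constant names (g_txSource, g_Weights, g_InvTextureSize) and the fixed KERNEL_RADIUS are assumptions for illustration, not code from the presentation; note the one Sample() per tap, which is exactly why the shader ends up TEX instruction limited.

```hlsl
Texture2D    g_txSource : register( t0 );
SamplerState g_Sampler  : register( s0 );

#define KERNEL_RADIUS 7

cbuffer cbFilter : register( b0 )
{
    float  g_Weights[KERNEL_RADIUS * 2 + 1];  // normalized filter weights
    float2 g_InvTextureSize;                  // 1 / source RT dimensions
};

float4 HorizontalPassPS( float4 pos : SV_POSITION, float2 uv : TEXCOORD0 ) : SV_TARGET
{
    float4 result = 0.0f;

    // One TEX instruction per tap: source over-sampling grows with kernel size
    [unroll]
    for ( int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; ++i )
    {
        float2 offset = float2( i, 0 ) * g_InvTextureSize;
        result += g_txSource.Sample( g_Sampler, uv + offset ) * g_Weights[i + KERNEL_RADIUS];
    }
    return result;
}
```

The vertical pass is identical apart from a float2( 0, i ) offset, with the intermediate RT as its source.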


Page 4: DirectCompute Accelerated Separable Filtering

Separable? Who Cares?
• In many cases developers use this technique even though the filter may not actually be separable
  – Results are often still acceptable
  – Much faster than performing the real, non-separated filter
  – Accelerates many bilateral cases


Page 5: DirectCompute Accelerated Separable Filtering

Typical Pipeline Steps


[Diagram] Source RT → Horizontal Pass → Intermediate RT → Vertical Pass → Destination RT

Page 6: DirectCompute Accelerated Separable Filtering

Use Bilinear HW Filtering?
• Bilinear filter HW can halve the number of ALU and TEX instructions
  – Just need to compute the correct sampling offsets
• Not possible with more advanced filters
  – Usually because weighting is a dynamic operation
  – Think about bilateral cases...
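A minimal sketch of the idea, using a hypothetical helper that is not from the deck: with static weights, two neighbouring taps can be folded into one bilinear fetch by placing the sample between the two texel centres, halving the number of Sample() calls.

```hlsl
// Merge two adjacent taps (texel centres t0, t1 with static weights w0, w1)
// into a single bilinear fetch. Only valid when the weights are known up
// front; dynamic weighting (e.g. bilateral filters) breaks this optimization.
void MergeTapsForBilinear( float t0, float w0, float t1, float w1,
                           out float combinedWeight, out float sampleCoord )
{
    combinedWeight = w0 + w1;
    // Place the sample between the two texel centres, biased by their weights,
    // so the HW bilinear blend reproduces w0 * texel0 + w1 * texel1
    sampleCoord = ( t0 * w0 + t1 * w1 ) / combinedWeight;
}
```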


Page 7: DirectCompute Accelerated Separable Filtering

Where to start with DirectCompute

• Is the Pixel Shader version TEX or ALU limited?
  – You need to know what to optimize for!
  – Use IHV tools to establish this
• Achieving peak performance is not easy, so write a highly configurable kernel
  – Will allow you to easily experiment and fine tune
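One way to keep the kernel "highly configurable" is to drive every tuning knob from preprocessor defines and compile variants while profiling with the IHV tools. The macro names and values below are purely illustrative; they simply anticipate the parameters the rest of the deck experiments with.

```hlsl
#define KERNEL_RADIUS     16   // filter radius in texels
#define RUN_LENGTH        128  // texels filtered per thread group line
#define PIXELS_PER_THREAD 4    // see "Multiple Pixels per Thread"
#define LINES_PER_GROUP   2    // see "Multiple Lines per Thread Group"
#define PACK_TGSM         1    // pack cached texels with f32tof16()

// Derived thread group size: keep it a multiple of the HW wavefront/warp size
#define THREADS_PER_GROUP ( ( RUN_LENGTH / PIXELS_PER_THREAD ) * LINES_PER_GROUP )
```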


Page 8: DirectCompute Accelerated Separable Filtering

Thread Group Shared Memory (TGSM)
• TGSM can be used to reduce TEX ops
• TGSM can also be used to cache results
  – Thus saving ALU ops too
• Load a sensible run length
  – Base this on HW wavefront/warp size (AMD = 64, NVIDIA = 32)
  – Choose a good common factor (multiples of 64)
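A minimal sketch of the idea, with illustrative names: each texel is fetched from the texture once into TGSM and then re-read many times by the filter, trading TEX ops (and, when results are cached, ALU ops) for cheap shared memory reads.

```hlsl
#define RUN_LENGTH 128                    // common multiple of 64 (AMD) and 32 (NVIDIA)

groupshared float4 g_Cache[RUN_LENGTH];   // texels loaded once, re-read per tap

[numthreads( RUN_LENGTH, 1, 1 )]
void CSFilterH( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    // 1. Each thread loads one texel of its run into g_Cache
    // 2. GroupMemoryBarrierWithGroupSync()
    // 3. Each thread filters by reading its neighbours from g_Cache
    //    (the following slides flesh out exactly how the threads are used)
}
```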


Page 9: DirectCompute Accelerated Separable Filtering

Kernel #1

• Redundant compute threads

[Diagram] A run of 128 texels: 128 threads load 128 texels, but only 128 – (Kernel Radius * 2) threads compute results; the Kernel Radius threads at each end only provide the apron.
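A minimal sketch of what Kernel #1 might look like. The resource names, weight layout and the 16-texel radius are all assumptions; the deck only specifies the thread counts. Every thread loads one texel, but only the interior threads produce an output.

```hlsl
#define RUN_LENGTH    128
#define KERNEL_RADIUS 16

Texture2D<float4>   g_txSource : register( t0 );
RWTexture2D<float4> g_uavDest  : register( u0 );

cbuffer cbFilter : register( b0 )
{
    float g_Weights[2 * KERNEL_RADIUS + 1];
};

groupshared float4 g_Cache[RUN_LENGTH];

[numthreads( RUN_LENGTH, 1, 1 )]
void CSFilterH_Kernel1( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    // Each group only advances by RUN_LENGTH - 2 * KERNEL_RADIUS output texels,
    // because the outer threads of the run are needed as the filter apron
    int x = (int)( gid.x * ( RUN_LENGTH - 2 * KERNEL_RADIUS ) + gtid.x ) - KERNEL_RADIUS;

    // 128 threads load 128 texels
    g_Cache[gtid.x] = g_txSource.Load( int3( x, gid.y, 0 ) );
    GroupMemoryBarrierWithGroupSync();

    // ...but only 128 - ( Kernel Radius * 2 ) threads compute results;
    // the rest are the redundant threads the next slide warns about
    if ( gtid.x >= KERNEL_RADIUS && gtid.x < RUN_LENGTH - KERNEL_RADIUS )
    {
        float4 result = 0.0f;
        [unroll]
        for ( int i = -KERNEL_RADIUS; i <= KERNEL_RADIUS; ++i )
        {
            result += g_Cache[(int)gtid.x + i] * g_Weights[i + KERNEL_RADIUS];
        }
        g_uavDest[uint2( x, gid.y )] = result;
    }
}
```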

Page 10: DirectCompute Accelerated Separable Filtering

Avoid Redundant Threads
• Ensure that all threads in a group have useful work to do, wherever possible
• Redundant threads will not be reassigned work from another group
• This would involve a lot of redundancy for a large kernel diameter

Page 11: DirectCompute Accelerated Separable Filtering

Kernel #2


• No redundant compute threads

[Diagram] 128 threads load 128 texels, and Kernel Radius * 2 threads load 1 extra texel each to fill the apron; all 128 threads compute results.
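Under the same illustrative assumptions as the Kernel #1 sketch, Kernel #2 might look like this: the cache grows by the apron, the first 2 * KERNEL_RADIUS threads each load one extra texel, and every thread now produces a result.

```hlsl
#define RUN_LENGTH    128
#define KERNEL_RADIUS 16

Texture2D<float4>   g_txSource : register( t0 );
RWTexture2D<float4> g_uavDest  : register( u0 );

cbuffer cbFilter : register( b0 )
{
    float g_Weights[2 * KERNEL_RADIUS + 1];
};

// Room for the run plus the apron on both sides
groupshared float4 g_Cache[RUN_LENGTH + 2 * KERNEL_RADIUS];

[numthreads( RUN_LENGTH, 1, 1 )]
void CSFilterH_Kernel2( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    // Each group now covers a full RUN_LENGTH of output texels on row gid.y
    int runStart = (int)( gid.x * RUN_LENGTH ) - KERNEL_RADIUS;

    // 128 threads load 128 texels...
    g_Cache[gtid.x] = g_txSource.Load( int3( runStart + (int)gtid.x, gid.y, 0 ) );

    // ...and Kernel Radius * 2 threads load 1 extra texel each to fill the apron
    if ( gtid.x < 2 * KERNEL_RADIUS )
    {
        g_Cache[RUN_LENGTH + gtid.x] =
            g_txSource.Load( int3( runStart + RUN_LENGTH + (int)gtid.x, gid.y, 0 ) );
    }
    GroupMemoryBarrierWithGroupSync();

    // No redundant compute threads: all 128 threads write a result
    float4 result = 0.0f;
    [unroll]
    for ( int i = 0; i <= 2 * KERNEL_RADIUS; ++i )
    {
        result += g_Cache[gtid.x + i] * g_Weights[i];
    }
    g_uavDest[uint2( gid.x * RUN_LENGTH + gtid.x, gid.y )] = result;
}
```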

Page 12: DirectCompute Accelerated Separable Filtering

Multiple Pixels per Thread
• Allows for natural vectorization
  – 4 works well on AMD HW
  – Doesn't hurt performance on scalar HW
• Possible to cache TGSM reads in General Purpose Registers (GPRs)
  – Quartering TGSM reads – absolute winner!!
  – See the Kernel #3 sketch below for how this looks in code


Page 13: DirectCompute Accelerated Separable Filtering

Kernel #3

• Compute threads not a multiple of 64

[Diagram] 32 threads load 128 texels (4 each), and Kernel Radius * 2 threads load 1 extra texel each; 32 threads compute 128 results (4 per thread).
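Putting the last two slides together, a Kernel #3 sketch under the same illustrative assumptions (KERNEL_RADIUS = 16, so the 2 * KERNEL_RADIUS extra loads fit into 32 threads): each thread loads and filters 4 consecutive texels, keeping the four partial sums in GPRs so that every TGSM entry is read only once per thread.

```hlsl
#define THREADS_PER_GROUP 32
#define PIXELS_PER_THREAD 4
#define RUN_LENGTH        ( THREADS_PER_GROUP * PIXELS_PER_THREAD )   // 128
#define KERNEL_RADIUS     16

Texture2D<float4>   g_txSource : register( t0 );
RWTexture2D<float4> g_uavDest  : register( u0 );

cbuffer cbFilter : register( b0 )
{
    float g_Weights[2 * KERNEL_RADIUS + 1];
};

groupshared float4 g_Cache[RUN_LENGTH + 2 * KERNEL_RADIUS];

[numthreads( THREADS_PER_GROUP, 1, 1 )]
void CSFilterH_Kernel3( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    int runStart = (int)( gid.x * RUN_LENGTH ) - KERNEL_RADIUS;

    // 32 threads load 128 texels (4 each)...
    [unroll]
    for ( int p = 0; p < PIXELS_PER_THREAD; ++p )
    {
        uint cacheIdx = gtid.x * PIXELS_PER_THREAD + p;
        g_Cache[cacheIdx] = g_txSource.Load( int3( runStart + (int)cacheIdx, gid.y, 0 ) );
    }
    // ...and Kernel Radius * 2 threads load 1 extra apron texel each
    if ( gtid.x < 2 * KERNEL_RADIUS )
    {
        g_Cache[RUN_LENGTH + gtid.x] =
            g_txSource.Load( int3( runStart + RUN_LENGTH + (int)gtid.x, gid.y, 0 ) );
    }
    GroupMemoryBarrierWithGroupSync();

    // 32 threads compute 128 results: the four partial sums live in GPRs and
    // each TGSM entry is read only once, quartering the TGSM reads per result
    float4 result[PIXELS_PER_THREAD] = { (float4)0, (float4)0, (float4)0, (float4)0 };
    uint   cacheBase = gtid.x * PIXELS_PER_THREAD;

    [unroll]
    for ( int i = 0; i < 2 * KERNEL_RADIUS + PIXELS_PER_THREAD; ++i )
    {
        float4 texel = g_Cache[cacheBase + i];     // one TGSM read, shared 4 ways
        [unroll]
        for ( int p = 0; p < PIXELS_PER_THREAD; ++p )
        {
            int tap = i - p;                       // this texel's position in pixel p's kernel
            if ( tap >= 0 && tap <= 2 * KERNEL_RADIUS )
                result[p] += texel * g_Weights[tap];
        }
    }

    [unroll]
    for ( int p = 0; p < PIXELS_PER_THREAD; ++p )
    {
        g_uavDest[uint2( gid.x * RUN_LENGTH + cacheBase + p, gid.y )] = result[p];
    }
}
```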

Page 14: DirectCompute Accelerated Separable Filtering

Multiple Lines per Thread Group
• Process multiple lines per thread group
  – Better than one long line
  – 2 or 4 works well
• Improved texture cache efficiency
• Compute threads back to a multiple of 64
  – See the Kernel #4 sketch below


Page 15: DirectCompute Accelerated Separable Filtering

Kernel #4


[Diagram] Two lines per thread group: 64 threads load 256 texels (4 each), and Kernel Radius * 4 threads load 1 extra texel each; 64 threads compute 256 results.
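Finally, a Kernel #4 sketch under the same assumptions: the per-line filtering is identical to the Kernel #3 sketch, but with two image lines per thread group the group is back to 64 threads (a multiple of the wavefront size) and texture cache locality improves.

```hlsl
#define THREADS_PER_LINE  32
#define LINES_PER_GROUP   2
#define PIXELS_PER_THREAD 4
#define RUN_LENGTH        ( THREADS_PER_LINE * PIXELS_PER_THREAD )    // 128
#define KERNEL_RADIUS     16

Texture2D<float4>   g_txSource : register( t0 );
RWTexture2D<float4> g_uavDest  : register( u0 );

cbuffer cbFilter : register( b0 )
{
    float g_Weights[2 * KERNEL_RADIUS + 1];
};

// One cached run (plus apron) per line handled by the group
groupshared float4 g_Cache[LINES_PER_GROUP][RUN_LENGTH + 2 * KERNEL_RADIUS];

[numthreads( THREADS_PER_LINE, LINES_PER_GROUP, 1 )]
void CSFilterH_Kernel4( uint3 gtid : SV_GroupThreadID, uint3 gid : SV_GroupID )
{
    uint lineIdx  = gtid.y;                              // which of the group's lines
    uint row      = gid.y * LINES_PER_GROUP + lineIdx;   // image row for this thread
    int  runStart = (int)( gid.x * RUN_LENGTH ) - KERNEL_RADIUS;

    // 64 threads load 256 texels (4 each)...
    [unroll]
    for ( int p = 0; p < PIXELS_PER_THREAD; ++p )
    {
        uint cacheIdx = gtid.x * PIXELS_PER_THREAD + p;
        g_Cache[lineIdx][cacheIdx] = g_txSource.Load( int3( runStart + (int)cacheIdx, row, 0 ) );
    }
    // ...and Kernel Radius * 4 threads (here: all 64) load 1 extra apron texel each
    if ( gtid.x < 2 * KERNEL_RADIUS )
    {
        g_Cache[lineIdx][RUN_LENGTH + gtid.x] =
            g_txSource.Load( int3( runStart + RUN_LENGTH + (int)gtid.x, row, 0 ) );
    }
    GroupMemoryBarrierWithGroupSync();

    // 64 threads compute 256 results, exactly as in the Kernel #3 sketch
    float4 result[PIXELS_PER_THREAD] = { (float4)0, (float4)0, (float4)0, (float4)0 };
    uint   cacheBase = gtid.x * PIXELS_PER_THREAD;

    [unroll]
    for ( int i = 0; i < 2 * KERNEL_RADIUS + PIXELS_PER_THREAD; ++i )
    {
        float4 texel = g_Cache[lineIdx][cacheBase + i];
        [unroll]
        for ( int p = 0; p < PIXELS_PER_THREAD; ++p )
        {
            int tap = i - p;
            if ( tap >= 0 && tap <= 2 * KERNEL_RADIUS )
                result[p] += texel * g_Weights[tap];
        }
    }

    [unroll]
    for ( int p = 0; p < PIXELS_PER_THREAD; ++p )
    {
        g_uavDest[uint2( gid.x * RUN_LENGTH + cacheBase + p, row )] = result[p];
    }
}
```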

Page 16: DirectCompute Accelerated Separable Filtering

Kernel Diameter
• Kernel diameter needs to be > 7 to see a DirectCompute win
  – Otherwise the overhead cancels out the advantage
• The larger the kernel diameter, the greater the win


Page 17: DirectCompute Accelerated Separable Filtering

Use Packing in TGSM
• Use packing to reduce the storage space required in TGSM
  – Only have 32k per SIMD
• Reduces reads/writes from TGSM
• Often a uint is sufficient for color filtering
• Use the SM5.0 instructions f32tof16() and f16tof32()
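As a sketch of the packing idea (helper names are assumptions; the intrinsics are the SM5.0 ones named on the slide): a float4 colour can be stored in TGSM as two uints holding four 16-bit halves, halving the TGSM footprint and traffic compared with a raw float4. For LDR colour a single uint at 8 bits per channel is often enough.

```hlsl
// Packed TGSM cache: 8 bytes per texel instead of 16
groupshared uint2 g_PackedCache[128];

uint2 PackColor( float4 color )
{
    return uint2( f32tof16( color.r ) | ( f32tof16( color.g ) << 16 ),
                  f32tof16( color.b ) | ( f32tof16( color.a ) << 16 ) );
}

float4 UnpackColor( uint2 packed )
{
    return float4( f16tof32( packed.x ), f16tof32( packed.x >> 16 ),
                   f16tof32( packed.y ), f16tof32( packed.y >> 16 ) );
}
```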

Page 18: DirectCompute Accelerated Separable Filtering

High Definition Ambient Occlusion


[Diagram] Depth + Normals → HDAO buffer; Original Scene * HDAO buffer = Final Scene

Page 19: DirectCompute Accelerated Separable Filtering

Perform at Half Resolution
• HDAO at full resolution is expensive
• Running at half resolution captures more occlusion – and is obviously much faster
• Problem: artifacts are introduced when combined with the full resolution scene


Page 20: DirectCompute Accelerated Separable Filtering

Bilateral Dilate & Blur


The HDAO buffer doesn't match the scene

A bilateral dilate & blur fixes the issue

Page 21: DirectCompute Accelerated Separable Filtering

New Pipeline...


[Diagram] ½ res HDAO buffer → Bilinear Upsample → Horizontal Pass → Intermediate UAV → Vertical Pass → Dilated & Blurred buffer. Still much faster than performing at full res!

Page 22: DirectCompute Accelerated Separable Filtering

Pixel Shader vs DirectCompute


*Tested on a range of AMD and NVIDIA DX11 HW; DirectCompute is between ~2.53x and ~3.17x faster than the Pixel Shader

Page 23: DirectCompute Accelerated Separable Filtering

Depth of Field
• Many techniques exist to solve this problem
• A common technique is to figure out how blurry a pixel should be
  – Often called the Circle of Confusion (CoC)
• A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect (see the sketch below)
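The deck doesn't spell out its exact weighting scheme, so the sketch below is just one plausible way to express the idea: compute a 0-to-1 CoC per pixel, scale each Gaussian tap weight by the tap's CoC, and renormalise. All names and the CoC formula are assumptions for illustration.

```hlsl
// One simple CoC estimate from linear view-space depth (0 = in focus, 1 = max blur)
float ComputeCoC( float viewDepth, float focusDistance, float focusRange )
{
    return saturate( abs( viewDepth - focusDistance ) / focusRange );
}

// Per-tap weighting inside the blur loop: the static Gaussian weight is scaled
// by the tap's CoC, making the weights dynamic (so the bilinear-fetch trick
// from earlier no longer applies), and the sum is renormalised afterwards
float4 AccumulateCoCTap( float4 tapColor, float tapCoC, float gaussWeight,
                         inout float totalWeight )
{
    float w = gaussWeight * tapCoC;
    totalWeight += w;
    return tapColor * w;
}

// Usage in the filter loop:
//     colorSum += AccumulateCoCTap( color, coc, g_Weights[i], totalWeight );
// and after the loop:
//     result = colorSum / max( totalWeight, 1e-4f );
```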


Page 24: DirectCompute Accelerated Separable Filtering

The Pipeline...


[Diagram] CoC buffer → Horizontal Pass → Intermediate UAV → Vertical Pass

Page 25: DirectCompute Accelerated Separable Filtering


Shogun 2: DoF OFF

Page 26: DirectCompute Accelerated Separable Filtering


Shogun 2: DoF ON

Page 27: DirectCompute Accelerated Separable Filtering

Pixel Shader vs DirectCompute


*Tested on a range of AMD and NVIDIA DX11 HW; DirectCompute is between ~1.48x and ~1.86x faster than the Pixel Shader

Page 28: DirectCompute Accelerated Separable Filtering

Summary
• DirectCompute greatly accelerates larger kernel diameter filters
• Allows for filtering at full resolution
• For access to source code:
  – HDAO11: [email protected]
  – DoF11: [email protected]


Page 29: DirectCompute Accelerated Separable Filtering

[email protected]

[email protected]
[email protected]

Please fill in the feedback forms!