Advanced Visual Effects with - Home - AMD · 2013-10-25 · Advanced Visual Effects with OpenGL ......

Advanced Visual Effects with OpenGL

10:00-11:00  Intro & Updates   Bill Licea-Kane, ATI Research
11:00-11:15  Coffee Break
11:15-12:15  What’s Next       Bill Licea-Kane, ATI Research; Michael Gold, NVIDIA
12:15-12:30  Morning Q/A
12:30-14:00  Lunch
14:00-14:45  Performance       Evan Hart, ATI Research
14:45-15:15  Tools             Jeff Kiel, NVIDIA; Derek Cornish, NVIDIA
15:15-16:00  Tools             Yaki Tebeka, Graphic Remedy; Avi Shapira, Graphic Remedy
16:00-16:15  Coffee Break
16:15-17:00  NVIDIA OpenGL     Simon Green, NVIDIA
17:00-17:45  GPGPU             Mark Harris, NVIDIA
17:45-18:00  Closing Q/A

OpenGL Performance Tuning

Back to the Basics

Evan Hart – ATI Research
Ehart @ ati.com

Performance Roadmap

• Pipeline refresher
• Finding bottlenecks
• Ride the pipeline
• Hot topics

OpenGL Graphics Pipeline

Pipeline Continued

• Enables parallelism
  • Designed to thrive in a multiprocessor system
• Complicates measurement
  • Performance limited by slowest stage
  • Must isolate stages to find the limit
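To see why stages must be isolated, consider an idealized model in which all stages overlap perfectly, so the frame time equals the time of the slowest stage. The sketch below is illustrative only: the function name and the microsecond timings are hypothetical, and real pipelines overlap less cleanly than this.

```c
#include <stddef.h>

/* Idealized pipeline model (an assumption, not a measurement method):
 * with perfect overlap, the frame time is the time of the slowest stage.
 * Returns 0 for an empty stage list. */
unsigned pipelined_frame_time_us(const unsigned *stage_us, size_t n)
{
    unsigned worst = 0;
    for (size_t i = 0; i < n; ++i) {
        if (stage_us[i] > worst)
            worst = stage_us[i];
    }
    return worst;
}
```

In this model, speeding up any stage other than the slowest one leaves the result unchanged, which is why reducing the load on one stage at a time reveals the limit.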

Bottleneck Identification

• Remove pipeline stages
  • ‘Useless’ API functions help
• Reduce workload on suspect stages
  • Walk up or down the pipeline
• Use performance counters
  • Direct hardware insight

Bottleneck Identification

[Flowchart: a chain of experiments, each followed by the question "FPS varies?" with Yes/No branches. Experiments: vary FB; vary texture size/filtering; vary resolution; vary instructions; vary vertex instructions; vary vertex size/AGP rate. Outcomes: FB limited; texture limited; fragment/vertex limited; raster/setup limited; transfer limited; CPU limited.]
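The flowchart can be read as a chain of yes/no experiments. The following C sketch encodes one plausible reading of it. The branch order is an assumption, because the arrows are not recoverable from this transcript; the combined "Fragment/Vertex limited" outcome is split into two results; and all type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum {
    FB_LIMITED,
    TEXTURE_LIMITED,
    FRAGMENT_LIMITED,
    RASTER_SETUP_LIMITED,
    VERTEX_LIMITED,
    TRANSFER_LIMITED,
    CPU_LIMITED
} Bottleneck;

/* Each flag records whether FPS changed when that one factor was varied. */
typedef struct {
    bool fb;                    /* framebuffer settings */
    bool texture;               /* texture size / filtering */
    bool resolution;            /* render resolution */
    bool fragment_instructions; /* fragment program length */
    bool vertex_instructions;   /* vertex program length */
    bool vertex_size;           /* vertex size / AGP rate */
} FpsVaries;

/* Walks the experiments in the assumed order; the first one that
 * changes FPS determines the result. */
Bottleneck classify_bottleneck(const FpsVaries *v)
{
    if (v->fb)
        return FB_LIMITED;
    if (v->texture)
        return TEXTURE_LIMITED;
    if (v->resolution)
        return v->fragment_instructions ? FRAGMENT_LIMITED
                                        : RASTER_SETUP_LIMITED;
    if (v->vertex_instructions)
        return VERTEX_LIMITED;
    if (v->vertex_size)
        return TRANSFER_LIMITED;
    return CPU_LIMITED; /* nothing on the GPU side changed the frame rate */
}
```

If none of the experiments changes the frame rate, the sketch falls through to the CPU-limited case, matching the idea that a system bottleneck leaves performance fixed.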

System Performance

• CPU time spent on graphics
  • Largest problem in graphics performance
  • Problem grows over the lifecycle
    • Graphics scales easier than CPU
  • Enhancing quality can create balance
    • Shaders and anti-aliasing

System Bottlenecks

• Hardest bottleneck to find
• Performance stays fixed
  • Increased graphics power does not help
  • Reduced loads do not help
• Performance counters can help

Software Bottlenecks

System Perf. Components

• Data Transmission
  • Vertex data, textures, etc.
  • Large isolated component
  • Often easiest gains
• State management
  • Too much state
  • State thrashing

Data Transmission

• Largest single chunks
  • Big data requires big efficiency
  • Poor performance on most of the data means poor performance
• Types of submissions
  • Geometry submission
  • Image submission

Geometry Submission

• Relative performance (best to worst)
  • Display lists
  • Vertex buffer objects
  • Vertex arrays, preferably ranged
  • Immediate mode (glBegin/glEnd)
  • glArrayElement

Display Lists

• Excellent method for static geometry
• Allows the driver to correct app mistakes
  • Merge small draw calls
  • Format data types to be hardware friendly
  • Reformat primitive types
• Fairly large software penalty for compile

VBOs

• Performance equivalent to display lists
  • Application must not make mistakes
• Supports both static and dynamic data
  • Cheaper to update than display lists
• Significantly more flexible than DLs
  • More control over memory usage
• May not be efficient for small draws

Vertex Arrays

• Extremely flexible
• Reasonably efficient method of data submission
  • Few calls, lots of data
• No data caching
  • Likely to run into CPU bottlenecks

Miserable Performers

• Immediate mode evils
  • Stream arbitrarily hard to parse
  • Potentially poor cache performance
  • Each call involves function pointer indirection
    • Like a virtual function for each attribute
• Further glArrayElement evils
  • Fools you into believing it is a vertex array

Image Transfer

• Avoid in critical paths
• Utilize methods that do not require memory management (sub-image)
• Match format as closely as possible
  • Use a hardware native format
• Avoid synchronization (glReadPixels)
  • Asynch behavior being developed
• Utilize GPU friendly memory when available
  • Pixel Buffer Object

General Transfer Tips

• Bigger is better
  • More efficient to transfer lots of data together
• Know the native formats
  • Avoid GLdouble (processing is in floats)
  • Avoid GL(u)int (indices are ok)
  • Avoid unnecessary conversions
  • Avoid odd sizes (24-bit color)
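As one illustration of avoiding odd sizes, an application can widen 24-bit RGB pixels to 32-bit RGBA once, ahead of time, instead of leaving the conversion to each upload. This is a minimal sketch under the assumption that 32-bit RGBA is a native format on the target hardware; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Expands tightly packed 8-bit RGB pixels to 8-bit RGBA.
 * The added alpha byte is set to 255 (opaque).
 * `rgba` must have room for 4 * pixel_count bytes. */
void expand_rgb_to_rgba(const uint8_t *rgb, uint8_t *rgba, size_t pixel_count)
{
    for (size_t i = 0; i < pixel_count; ++i) {
        rgba[4 * i + 0] = rgb[3 * i + 0];
        rgba[4 * i + 1] = rgb[3 * i + 1];
        rgba[4 * i + 2] = rgb[3 * i + 2];
        rgba[4 * i + 3] = 0xFF;
    }
}
```

Doing this at load or build time keeps the conversion out of the critical path.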

State Management

• Too much state
  • Try to sort for efficient state transitions
  • Use shaders instead of fixed function
• State thrashing
  • Toggling state back and forth
  • Scene-graph centric problem
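Sorting for efficient state transitions can be sketched without any OpenGL calls: give each draw a state key, sort by it, and compare how many state changes the submission order implies. The `DrawItem` fields and function names below are hypothetical, and counting shader and texture switches equally is a simplifying assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-draw record; real applications track more state. */
typedef struct {
    unsigned shader_id;
    unsigned texture_id;
    unsigned mesh_id;
} DrawItem;

/* Orders draws by shader first, then by texture. */
static int compare_by_state(const void *a, const void *b)
{
    const DrawItem *x = a, *y = b;
    if (x->shader_id != y->shader_id)
        return x->shader_id < y->shader_id ? -1 : 1;
    if (x->texture_id != y->texture_id)
        return x->texture_id < y->texture_id ? -1 : 1;
    return 0;
}

void sort_by_state(DrawItem *items, size_t n)
{
    qsort(items, n, sizeof *items, compare_by_state);
}

/* Counts shader and texture switches between consecutive draws.
 * The initial state setup for the first draw is not counted. */
size_t count_state_changes(const DrawItem *items, size_t n)
{
    size_t changes = 0;
    for (size_t i = 1; i < n; ++i) {
        if (items[i].shader_id != items[i - 1].shader_id)
            ++changes;
        if (items[i].texture_id != items[i - 1].texture_id)
            ++changes;
    }
    return changes;
}
```

A scene graph traversal that alternates between two materials toggles state on every draw; after sorting, the same draws need a single transition between the two groups.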

Other State Evils

• Context switching
  • Expensive software operation
  • Use FBOs instead of Pbuffers
• glPushAttrib / glPopAttrib
  • Hits a lot of state at once
  • Use sparingly for compatibility with 3rd party code

State Thrashing Example

glEnableClientState( … );
glVertexPointer( … );
glEnable( GL_TEXTURE_GEN* );
glMaterial( … );
glDrawElements( … );
glDisable( GL_TEXTURE_GEN* );
glDisableClientState( … );

//Next object
glEnableClientState( … );
glVertexPointer( … );
glEnable( GL_TEXTURE_GEN* );
glMaterial( … );
glDrawElements( … );
glDisable( GL_TEXTURE_GEN* );
glDisableClientState( … );

Vertex Performance

• Vertex fetch performance
  • How fast does the GPU get it?
• Vertex compute performance
  • How fast does it evaluate?
• Vertex efficiency
  • Is it wasting time?

Vertex Bottlenecks

Vertex Efficiency

• Indexed primitives
  • Utilize the generalized post-transform vertex cache
  • Avoids fetch and compute costs
  • Ideal for maximal mesh efficiency
• Other vertex reuse
  • Strips, fans, and loops
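The benefit of indexed primitives can be estimated offline by simulating a post-transform cache over an index list and counting misses, since each miss stands for one vertex fetched and transformed. The sketch below assumes a simple FIFO replacement policy and an arbitrary upper bound of 64 entries; actual cache sizes and policies vary by hardware, and the function name is hypothetical.

```c
#include <stddef.h>

#define MAX_CACHE_SIZE 64 /* arbitrary upper bound for this sketch */

/* Counts how many indices miss a FIFO cache of `cache_size` entries.
 * A cache_size of 0 means every index misses; sizes above
 * MAX_CACHE_SIZE are clamped. */
size_t count_cache_misses(const unsigned short *indices, size_t index_count,
                          size_t cache_size)
{
    unsigned short cache[MAX_CACHE_SIZE];
    size_t used = 0, next = 0, misses = 0;

    if (cache_size == 0)
        return index_count;
    if (cache_size > MAX_CACHE_SIZE)
        cache_size = MAX_CACHE_SIZE;

    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (!hit) {
            ++misses;
            cache[next] = indices[i];        /* overwrite the oldest entry */
            next = (next + 1) % cache_size;
            if (used < cache_size)
                ++used;
        }
    }
    return misses;
}
```

For a quad drawn as two indexed triangles (0, 1, 2, 2, 1, 3), only four vertices miss the cache, whereas six unique, non-shared vertices would all be processed.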

Vertex Fetch Performance

• Minimize vertex size
  • Utilize bytes/ubytes/shorts/ushorts
• Interleave vertex data
  • Single vertex fits in a cache line
• Maximize locality of reference
  • Indices 0, 1, 2 are faster than 3, 8, 13
• Pay attention to natural boundaries
  • Aim for 32 or 64 byte vertices
• Use ushorts for indices
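A minimal sketch of these guidelines in C: an interleaved vertex packed into exactly 32 bytes using small data types, plus a helper that narrows 32-bit indices to ushorts when they fit. The attribute choice and all names are hypothetical, and whether a given GPU fetches each of these types natively is an assumption to verify on the target hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex, 32 bytes in total. */
typedef struct {
    float    position[3];  /* 12 bytes, offset 0 */
    int16_t  normal[4];    /*  8 bytes, offset 12; 4th component is padding */
    uint16_t texcoord0[2]; /*  4 bytes, offset 20 */
    uint16_t texcoord1[2]; /*  4 bytes, offset 24 */
    uint8_t  color[4];     /*  4 bytes, offset 28 */
} PackedVertex;

_Static_assert(sizeof(PackedVertex) == 32, "vertex should be exactly 32 bytes");

/* Copies 32-bit indices into 16-bit storage.
 * Returns 1 on success, or 0 as soon as an index above 65535 is found
 * (in which case `out` is only partially written). */
int narrow_indices(const uint32_t *in, uint16_t *out, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (in[i] > UINT16_MAX)
            return 0;
        out[i] = (uint16_t)in[i];
    }
    return 1;
}
```

The static assertion makes the size target explicit, so adding an attribute later fails at compile time instead of silently growing the vertex past the boundary.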

Vertex Compute Performance

• Turn off anything you don’t need
  • Avoid the universal shader
• Try custom shortcuts
  • If it is only a 2x2 matrix, use a mat2
• Send fewer vertices
  • Efficient app level culling is always desirable

Primitive Performance

• Rare to have problems here
• Possible issues
  • Clipping
  • Interpolator overload
  • Culling
    • Frustum, not back face

Fragment Performance

• Second most common bottleneck
• Often easy to address
  • Reduce total fragments
  • Reduce per-fragment cost
  • Turn on multisampling
• Contains two subcomponents
  • Texture performance
  • ALU performance

Fragment Performance

Texture Bottleneck

• Fragment pipe is starved reading textures
• Expensive filtering
  • Anisotropic
  • Trilinear
  • Deep formats (RGBA FP32)
• Texture cache abuse
  • Improper use of mipmaps
    • Negative LOD bias
    • No mipmaps
  • ‘Noisy’ dependent texture fetches
• Textures oversized
  • Use texture compression
  • Utilize smaller formats where appropriate
  • Fill ‘unused’ components (there is no 24-bit format)
• Trade off ALU instructions

ALU Bottleneck

• Too much computation
• Switch computations to textures
  • Transcendental functions (some hardware)
  • Normalize (some hardware)
  • Only if texture is not a bottleneck
  • Becoming less effective
• Utilize dynamic flow control when applicable
• Avoid universal shaders

Reducing Fragments

• Scissor to the area of interest
  • Scissor is essentially free
• Use occlusion culling
  • Render roughly front to back
• Early depth testing
  • Avoid discard, alpha test, and alpha to coverage
• Hierarchical depth testing
  • Use GL_LESS or GL_LEQUAL
  • Use reasonable projections
  • Clear the depth buffer
• Pre-fill the depth buffer with depth only pass
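Rendering roughly front to back only requires a coarse sort of objects by distance along the view direction, so that nearer objects fill the depth buffer first. The sketch below is one way to do that on the CPU; the `Renderable` structure and function names are hypothetical, and using the object center as the sort key is a simplifying assumption.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical renderable with a world-space center point. */
typedef struct {
    float    center[3];
    float    view_depth; /* filled in by sort_front_to_back */
    unsigned id;
} Renderable;

static int compare_depth(const void *a, const void *b)
{
    float da = ((const Renderable *)a)->view_depth;
    float db = ((const Renderable *)b)->view_depth;
    return (da > db) - (da < db); /* ascending: nearest first */
}

/* Computes each object's distance along the (unit-length) view
 * direction `forward` from `eye`, then sorts nearest first. */
void sort_front_to_back(Renderable *items, size_t n,
                        const float eye[3], const float forward[3])
{
    for (size_t i = 0; i < n; ++i) {
        float dx = items[i].center[0] - eye[0];
        float dy = items[i].center[1] - eye[1];
        float dz = items[i].center[2] - eye[2];
        items[i].view_depth =
            dx * forward[0] + dy * forward[1] + dz * forward[2];
    }
    qsort(items, n, sizeof *items, compare_depth);
}
```

The ordering does not need to be exact: it only has to let early and hierarchical depth tests reject most hidden fragments, and it competes with state sorting for the submission order.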

Backend Performance

• Many ops do not impact performance directly
  • Alpha test
  • Fog
  • Dithering
• Typically heavily memory limited
  • Blending
  • Depth read/write
  • Multisampling

Backend Performance

Optimizing the Backend

• Utilize blending sparingly
  • Collapse multiple passes into one
• Avoid unnecessary use of higher bit depths
• Ensure that occlusion culling operations can be used
• Clear color, depth, and stencil buffers
  • Can maximize compression
  • Clear together if possible
• Avoid accumulating unnecessary junk
  • Set unused alpha to identity value (0 or 1)
• Utilize write masks
  • If you don’t need it, don’t write it
  • Not applicable to single color channels

Thanks

• ATI ISV teams & 3DArg
• NVIDIA ISV team
• John Spitzer @ NVIDIA
  • Performance analysis flowchart

Questions?
