Advanced Visual Effects with - Home - AMD · 2013-10-25 · Advanced Visual Effects with OpenGL ......
Advanced Visual Effects with OpenGL
10:00-11:00  Intro & Updates  Bill Licea-Kane, ATI Research
11:00-11:15  Coffee Break
11:15-12:15  What’s Next  Bill Licea-Kane, ATI Research; Michael Gold, NVIDIA
12:15-12:30  Morning Q/A
12:30-14:00  Lunch
14:00-14:45  Performance  Evan Hart, ATI Research
14:45-15:15  Tools  Jeff Kiel, NVIDIA; Derek Cornish, NVIDIA
15:15-16:00  Tools  Yaki Tebeka, Graphic Remedy; Avi Shapira, Graphic Remedy
16:00-16:15  Coffee Break
16:15-17:00  NVIDIA OpenGL  Simon Green, NVIDIA
17:00-17:45  GPGPU  Mark Harris, NVIDIA
17:45-18:00  Closing Q/A
OpenGL Performance Tuning
Back to the Basics
Evan Hart – ATI Research
Ehart @ ati.com
Performance Roadmap
• Pipeline refresher
• Finding bottlenecks
• Ride the pipeline
• Hot topics
OpenGL Graphics Pipeline
Pipeline Continued
• Enables parallelism
  • Designed to thrive in a multiprocessor system
• Complicates measurement
  • Performance limited by slowest stage
  • Must isolate stages to find the limit
Bottleneck Identification
• Remove pipeline stages
  • ‘Useless’ API functions help
• Reduce workload on suspect stages
  • Walk up or down the pipeline
• Use performance counters
  • Direct hardware insight
Bottleneck Identification
[Flowchart: performance analysis decision tree. Tests, each followed by an “FPS varies?” decision: vary FB; vary texture size/filtering; vary resolution; vary instructions; vary vertex instructions; vary vertex size/AGP rate. Outcomes: FB limited, texture limited, fragment/vertex limited, raster/setup limited, transfer limited, CPU limited.]
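
One way to read the flowchart is as a chain of yes/no experiments, each asking whether the frame rate changes when one workload is varied. The sketch below encodes such a chain in C. The branch order, the struct and enum names, and the split of the "fragment/vertex limited" outcome into two separate results are assumptions, since the chart's arrows are not reproduced in this transcript.

```c
#include <assert.h>
#include <stdbool.h>

/* Result of each experiment: did the frame rate change when that
   workload was varied? All names here are hypothetical. */
typedef struct {
    bool fb;              /* vary framebuffer */
    bool texture;         /* vary texture size / filtering */
    bool resolution;      /* vary resolution */
    bool fragment_instr;  /* vary (fragment) instructions */
    bool vertex_instr;    /* vary vertex instructions */
    bool vertex_transfer; /* vary vertex size / AGP rate */
} FpsVaries;

typedef enum {
    LIMIT_FB, LIMIT_TEXTURE, LIMIT_FRAGMENT, LIMIT_RASTER_SETUP,
    LIMIT_VERTEX, LIMIT_TRANSFER, LIMIT_CPU
} Limit;

/* Walk the experiments in an assumed order and report the first
   stage whose workload change moved the frame rate. */
Limit classify_bottleneck(const FpsVaries *r)
{
    assert(r);
    if (r->fb)      return LIMIT_FB;
    if (r->texture) return LIMIT_TEXTURE;
    if (r->resolution)
        return r->fragment_instr ? LIMIT_FRAGMENT : LIMIT_RASTER_SETUP;
    if (r->vertex_instr)    return LIMIT_VERTEX;
    if (r->vertex_transfer) return LIMIT_TRANSFER;
    return LIMIT_CPU; /* nothing on the GPU side changed the frame rate */
}
```

If no experiment changes the frame rate, the sketch falls through to the CPU-limited case, which matches the later observation that system bottlenecks show up as performance that stays fixed.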
System Performance
• CPU time spent on graphics
  • Largest problem in graphics performance
  • Problem grows over the lifecycle
• Graphics scales easier than CPU
  • Enhancing quality can create balance
    • Shaders and anti-aliasing
System Bottlenecks
• Hardest bottleneck to find
• Performance stays fixed
  • Increased graphics power does not help
  • Reduced loads do not help
• Performance counters can help
Software Bottlenecks
System Perf. Components
• Data Transmission
  • Vertex data, textures, etc.
  • Large isolated component
  • Often easiest gains
• State management
  • Too much state
  • State thrashing
Data Transmission
• Largest single chunks
  • Big data requires big efficiency
  • Poor performance on most of the data means poor performance
• Types of submissions
  • Geometry submission
  • Image submission
Geometry Submission
• Relative performance (best to worst)
  • Display lists
  • Vertex buffer objects
  • Vertex arrays, preferably ranged
  • Immediate mode (glBegin/glEnd)
  • glArrayElement
Display Lists
• Excellent method for static geometry
• Allows the driver to correct app mistakes
  • Merge small draw calls
  • Format data types to be hardware friendly
  • Reformat primitive types
• Fairly large software penalty for compile
VBOs
• Performance equivalent to display lists
  • Application must not make mistakes
• Supports both static and dynamic data
  • Cheaper to update than display lists
• Significantly more flexible than DLs
  • More control over memory usage
• May not be efficient for small draws
Vertex Arrays
• Extremely flexible
• Reasonably efficient method of data submission
  • Few calls, lots of data
• No data caching
  • Likely to run into CPU bottlenecks
Miserable Performers
• Immediate mode evils
  • Stream arbitrarily hard to parse
  • Potentially poor cache performance
  • Each call involves function pointer indirection
    • Like a virtual function for each attribute
• Further glArrayElement evils
  • Fools you into believing it is a vertex array
Image Transfer
• Avoid in critical paths
• Utilize methods that do not require memory management (sub-image)
• Match format as closely as possible
  • Use a hardware native format
• Avoid synchronization (glReadPixels)
  • Asynchronous behavior being developed
• Utilize GPU friendly memory when available
  • Pixel Buffer Object
General Transfer Tips
• Bigger is better
  • More efficient to transfer lots of data together
• Know the native formats
  • Avoid GLdouble (processing is in floats)
  • Avoid GL(u)int (indices are ok)
  • Avoid unnecessary conversions
  • Avoid odd sizes (24-bit color)
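
As a small illustration of avoiding odd sizes, the following sketch expands tightly packed 24-bit RGB pixels into 32-bit RGBA on the application side, so the data handed to the API already has a 4-byte pixel stride. The function name `expand_rgb_to_rgba` is hypothetical, and whether RGBA or some other component order is the native layout depends on the hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Expand `count` packed RGB pixels (3 bytes each) into RGBA pixels
   (4 bytes each), filling the extra byte with a fixed alpha value.
   `dst` must have room for 4 * count bytes. */
void expand_rgb_to_rgba(const uint8_t *src, uint8_t *dst,
                        size_t count, uint8_t alpha)
{
    assert(count == 0 || (src && dst));
    for (size_t i = 0; i < count; ++i) {
        dst[4 * i + 0] = src[3 * i + 0]; /* red   */
        dst[4 * i + 1] = src[3 * i + 1]; /* green */
        dst[4 * i + 2] = src[3 * i + 2]; /* blue  */
        dst[4 * i + 3] = alpha;          /* filled "unused" component */
    }
}
```

Doing the expansion once at load time keeps the conversion out of the per-frame transfer path.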
State Management
• Too much state
  • Try to sort for efficient state transitions
  • Use shaders instead of fixed function
• State thrashing
  • Toggling state back and forth
  • Scene-graph centric problem
Other State Evils
• Context switching
  • Expensive software operation
  • Use FBOs instead of Pbuffers
• glPushAttrib / glPopAttrib
  • Hits a lot of state at once
  • Use sparingly for compatibility with 3rd party code
State Thrashing Example
glEnableClientState( … );
glVertexPointer( … );
glEnable( GL_TEXTURE_GEN* );
glMaterial( … );
glDrawElements( … );
glDisable( GL_TEXTURE_GEN* );
glDisableClientState( … );
// Next object
glEnableClientState( … );
glVertexPointer( … );
glEnable( GL_TEXTURE_GEN* );
glMaterial( … );
glDrawElements( … );
glDisable( GL_TEXTURE_GEN* );
glDisableClientState( … );
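
The usual remedy for this pattern is to sort draws so that objects sharing the same state are submitted together and the state is set once per group. The sketch below is a CPU-only model of that idea, with no OpenGL calls: `DrawItem` and `state_key` are hypothetical names, and `state_key` stands for whatever identifier an application assigns to a complete state configuration.

```c
#include <assert.h>
#include <stdlib.h>

/* One queued draw: an application-defined state identifier plus the
   mesh to draw with it. Both fields are hypothetical. */
typedef struct {
    unsigned state_key;
    unsigned mesh_id;
} DrawItem;

static int compare_by_state(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;
    return (x->state_key > y->state_key) - (x->state_key < y->state_key);
}

/* Group draws with identical state next to each other. */
void sort_by_state(DrawItem *items, size_t n)
{
    assert(n == 0 || items);
    qsort(items, n, sizeof *items, compare_by_state);
}

/* Count how many times state would have to be (re)applied if the
   queue were submitted in its current order. */
size_t count_state_changes(const DrawItem *items, size_t n)
{
    size_t changes = 0;
    for (size_t i = 0; i < n; ++i)
        if (i == 0 || items[i].state_key != items[i - 1].state_key)
            ++changes;
    return changes;
}
```

For a queue that alternates between two states, sorting reduces the number of state changes from one per draw to two in total.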
Vertex Performance
• Vertex fetch performance
  • How fast does the GPU get it?
• Vertex compute performance
  • How fast does it evaluate?
• Vertex efficiency
  • Is it wasting time?
Vertex Bottlenecks
Vertex Efficiency
• Indexed primitives
  • Utilize the generalized post-transform vertex cache
  • Avoids fetch and compute costs
  • Ideal for maximal mesh efficiency
• Other vertex reuse
  • Strips, fans, and loops
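
To see why indexed primitives avoid fetch and compute costs, it can help to model the post-transform cache directly. The sketch below counts cache misses for an index stream, assuming a simple FIFO replacement policy; real hardware cache sizes and policies vary, and the function name and the `MAX_CACHE` limit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CACHE 64 /* upper bound for this model only */

/* Count the vertices that must be fetched and transformed when
   `indices` is run through a FIFO cache of `cache_size` entries.
   A hit reuses an earlier result; a miss evicts the oldest entry. */
size_t count_vertex_cache_misses(const unsigned short *indices,
                                 size_t index_count, size_t cache_size)
{
    unsigned short cache[MAX_CACHE];
    size_t used = 0, next = 0, misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE);
    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) { hit = 1; break; }
        }
        if (!hit) {
            ++misses;
            cache[next] = indices[i];       /* overwrite oldest slot */
            next = (next + 1) % cache_size;
            if (used < cache_size) ++used;
        }
    }
    return misses;
}
```

A quad drawn as two indexed triangles (0, 1, 2 and 2, 1, 3) costs four vertex transforms in this model, where the same quad submitted as six unindexed vertices would cost six.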
Vertex Fetch Performance
• Minimize vertex size
  • Utilize bytes/ubytes/shorts/ushorts
• Interleave vertex data
  • Single vertex fits in a cache line
• Maximize locality of reference
  • Indices 0, 1, 2 are faster than 3, 8, 13
• Pay attention to natural boundaries
  • Aim for 32 or 64 byte vertices
• Use ushorts for indices
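
The fetch guidelines above can be made concrete with an interleaved vertex layout. The sketch below shows one possible 32-byte vertex built from floats, shorts, and bytes. The struct name, the attribute set, and the choice of types are assumptions for illustration only; which compact types a given GPU fetches natively is hardware dependent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One interleaved vertex, padded to a 32-byte boundary. */
typedef struct {
    float    position[3]; /* 12 bytes, offset  0 */
    int16_t  normal[4];   /*  8 bytes, offset 12 (4th component unused) */
    uint16_t texcoord[2]; /*  4 bytes, offset 20 */
    uint8_t  color[4];    /*  4 bytes, offset 24 */
    uint8_t  pad[4];      /*  4 bytes, offset 28, explicit padding */
} PackedVertex;

/* Returns 1 if the compiler laid the struct out as intended. */
int packed_vertex_layout_ok(void)
{
    return sizeof(PackedVertex) == 32
        && offsetof(PackedVertex, normal)   == 12
        && offsetof(PackedVertex, texcoord) == 20
        && offsetof(PackedVertex, color)    == 24;
}

/* 16-bit indices can address vertices 0..65535. */
int fits_ushort_indices(size_t vertex_count)
{
    return vertex_count <= 65536u;
}
```

Checking the layout at startup guards against a compiler inserting padding that would silently change the stride passed to the vertex pointer calls.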
Vertex Compute Performance
• Turn off anything you don’t need
  • Avoid the universal shader
• Try custom shortcuts
  • If it is only a 2x2 matrix, use a mat2
• Send fewer vertices
  • Efficient app-level culling is always desirable
Primitive Performance
• Rare to have problems here
• Possible issues
  • Clipping
  • Interpolator overload
  • Culling
    • Frustum, not back face
Fragment Performance
• Second most common bottleneck
• Often easy to address
  • Reduce total fragments
  • Reduce per-fragment cost
  • Turn on multisampling
• Contains two subcomponents
  • Texture performance
  • ALU performance
Fragment Performance
Texture Bottleneck
• Fragment pipe is starved reading textures
• Expensive filtering
  • Anisotropic
  • Trilinear
  • Deep formats (RGBA FP32)
• Texture cache abuse
  • Improper use of mipmaps
    • Negative LOD bias
    • No mipmaps
  • ‘Noisy’ dependent texture fetches
• Textures oversized
  • Use texture compression
  • Utilize smaller formats where appropriate
  • Fill ‘unused’ components (there is no 24-bit format)
• Trade off ALU instructions
ALU Bottleneck
• Too much computation
• Switch computations to textures
  • Transcendental functions (some hardware)
  • Normalize (some hardware)
  • Only if texture is not a bottleneck
  • Becoming less effective
• Utilize dynamic flow control when applicable
• Avoid universal shaders
Reducing Fragments
• Scissor to the area of interest
  • Scissor is essentially free
• Use occlusion culling
  • Render roughly front to back
• Early depth testing
  • Avoid discard, alpha test, and alpha to coverage
• Hierarchical depth testing
  • Use GL_LESS or GL_LEQUAL
  • Use reasonable projections
  • Clear the depth buffer
• Pre-fill the depth buffer with depth only pass
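
The benefit of rendering roughly front to back can be shown with a one-pixel model of the depth test. The sketch below counts how many fragments pass a GL_LESS comparison for a given submission order. It assumes early depth testing is active, so that a rejected fragment costs no shading work, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Count fragments that pass a GL_LESS depth test at a single pixel
   when surfaces with the given depths (0 = near, 1 = far) arrive in
   array order. Each pass stands for one shaded fragment. */
size_t count_shaded_fragments(const float *depths, size_t n)
{
    float nearest = 1.0f; /* depth buffer cleared to the far plane */
    size_t shaded = 0;

    assert(n == 0 || depths);
    for (size_t i = 0; i < n; ++i) {
        if (depths[i] < nearest) { /* GL_LESS */
            nearest = depths[i];
            ++shaded;
        }
    }
    return shaded;
}
```

With three overlapping surfaces, front-to-back order shades one fragment at the pixel while back-to-front order shades all three. A depth-only pre-pass reaches the same result regardless of submission order, at the cost of drawing the geometry twice.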
Backend Performance
• Many ops do not impact performance directly
  • Alpha test
  • Fog
  • Dithering
• Typically heavily memory limited
  • Blending
  • Depth read/write
  • Multisampling
Backend Performance
Optimizing the Backend
• Utilize blending sparingly
  • Collapse multiple passes into one
• Avoid unnecessary use of higher bit depths
• Ensure that occlusion culling operations can be used
• Clear color, depth, and stencil buffers
  • Can maximize compression
  • Clear together if possible
• Avoid accumulating unnecessary junk
  • Set unused alpha to identity value (0 or 1)
• Utilize write masks
  • If you don’t need it, don’t write it
  • Not applicable to single color channels
Thanks
• ATI ISV teams & 3DArg
• NVIDIA ISV team
• John Spitzer @ NVIDIA
  • Performance analysis flowchart
Questions?