OpenGL Performance TuningOpenGL Performance Tuning
OpenGL Performance in a OpenGL Performance in a ShaderShader--Centric WorldCentric World
Evan Hart – ATI Research Inc.
OverviewOverview
•• Why Why ‘‘Shader CentricShader Centric’’??•• Performance analysis overviewPerformance analysis overview•• Shader Language performance tipsShader Language performance tips•• Managing stateManaging state•• NonNon--shader issuesshader issues•• MultiprocessingMultiprocessing
Why all these shaders?Why all these shaders?
•• Obvious increase in visual fidelityObvious increase in visual fidelity•• Hardware design trendsHardware design trends•• PerformancePerformance•• Cool factorCool factor
Hardware DesignHardware Design
•• Drive for new featuresDrive for new features•• Limited die spaceLimited die space–– Roughly 200 million transistorsRoughly 200 million transistors
•• Fixed function gets bumped outFixed function gets bumped out–– Collection of special shadersCollection of special shaders–– Still highly tuned Still highly tuned
Improved PerformanceImproved Performance
•• Shaders provide direct controlShaders provide direct control–– Implement the exact algorithmImplement the exact algorithm
•• Avoid external processingAvoid external processing–– Less CPU workLess CPU work
Why focus on shaders?Why focus on shaders?
•• BetterBetter•• FasterFaster•• CoolerCooler
Performance AnalysisPerformance Analysis
•• Find the bottleneckFind the bottleneck•• Balance the performanceBalance the performance•• Rinse, lather, and repeatRinse, lather, and repeat
Possible Pipeline BottlenecksPossible Pipeline Bottlenecks
CPU transfer transform raster fragment framebuffer
Geometry Storage
Geometry Processor
FramebufferCPU Rasterizer Shading
CPU/BusBound Vertex Bound Pixel Bound
Finding the BottleneckFinding the Bottleneck
•• Reduce the workload of different Reduce the workload of different stagesstages
•• If performance does not changeIf performance does not change–– Move on, this is not itMove on, this is not it
•• If performance does change If performance does change significantlysignificantly–– This is the bottleneckThis is the bottleneck–– Look to reduce the workloadLook to reduce the workload
Pixel BottleneckPixel Bottleneck
•• Easiest to detectEasiest to detect–– Does performance scale with resolution?Does performance scale with resolution?
•• Multiple causesMultiple causes–– Memory bandwidthMemory bandwidth
•• Disable blendingDisable blending•• Reduce texture bit depthReduce texture bit depth
–– Shader PerformanceShader Performance•• Try a simplified shaderTry a simplified shader
–– Texture filteringTexture filtering•• Turn off trilinear or anisotropic filteringTurn off trilinear or anisotropic filtering
Vertex BottleneckVertex Bottleneck
•• Harder to detect, less frequentHarder to detect, less frequent–– Render Render ½½ the triangles of each objectthe triangles of each object–– Reduce the complexity of vertex Reduce the complexity of vertex
shadersshaders–– If both scale performanceIf both scale performance•• Vertex bottleneckVertex bottleneck
–– If only reduced triangle count scalesIf only reduced triangle count scales•• Submission/fetch bottleneckSubmission/fetch bottleneck
CPU BottleneckCPU Bottleneck
•• Most common today (on highMost common today (on high--end)end)–– Use profilers like Use profilers like VTuneVTune®®–– Find API versus application timeFind API versus application time–– If application limitedIf application limited•• Make it prettier, its essentially freeMake it prettier, its essentially free
BottlenecksBottlenecks
•• Find major areaFind major area–– Fragment/PixelFragment/Pixel–– VertexVertex–– CPUCPU
•• Tune to find the exact causeTune to find the exact cause
Shading Language PerformanceShading Language Performance
•• General shader tuningGeneral shader tuning•• Shader conditionalsShader conditionals•• Marking items as constMarking items as const•• Reducing register pressureReducing register pressure•• Utilizing vector opsUtilizing vector ops•• Understanding the architecturesUnderstanding the architectures
General Tuning AdviceGeneral Tuning Advice
•• Remember floating point and Remember floating point and compiler basicscompiler basics
•• Keep hardware limitations in mindKeep hardware limitations in mind•• Balance resourcesBalance resources•• Write clear codeWrite clear code
Floating Point BasicsFloating Point Basics
•• Operations are not commutativeOperations are not commutative–– (a*b)*c != a*(b*c)(a*b)*c != a*(b*c)
•• Limits the ability of a compiler to Limits the ability of a compiler to reorder codereorder code
•• Likely to be fairly aggressiveLikely to be fairly aggressive•• Do not rely on extreme reorderingDo not rely on extreme reordering
Remember Hardware LimitationsRemember Hardware Limitations
•• Falling to software is the ultimate Falling to software is the ultimate performance penaltyperformance penalty–– Dynamic addressing not available Dynamic addressing not available
everywhereeverywhere–– Vertex texturing is limited Vertex texturing is limited
Balance ResourcesBalance Resources
•• Three distinct categoriesThree distinct categories–– ComputationComputation•• Multiply accumulateMultiply accumulate•• Transcendental operationsTranscendental operations
–– TextureTexture–– InterpolationInterpolation
•• See IHV information for more See IHV information for more detaildetail
Resource Balancing HintsResource Balancing Hints
•• Bandwidth is scarceBandwidth is scarce–– Texture fetch can absorb 32 GB/sTexture fetch can absorb 32 GB/s
•• Bias toward ALU operationsBias toward ALU operations–– No bandwidth consumedNo bandwidth consumed–– MAD cheaper than MAD cheaper than sqrtsqrt (in general)(in general)
•• Using more varying almost always betterUsing more varying almost always better–– Moving computation back to the vertex Moving computation back to the vertex
shadershader
Write Clear CodeWrite Clear Code
•• Compilers pretty good todayCompilers pretty good today•• GPU compilers are getting much GPU compilers are getting much
betterbetter•• Convoluted code may hurt the Convoluted code may hurt the
ability to optimize ability to optimize
Shader ConditionalsShader Conditionals
•• Powerful featurePowerful feature•• Performance implications are Performance implications are
complexcomplex–– SIMD processing requires coherencySIMD processing requires coherency–– Can increase instruction countCan increase instruction count–– DonDon’’t assume it will be faster to skip t assume it will be faster to skip
workwork
Using Const QualifierUsing Const Qualifier
•• Compile time constants highly Compile time constants highly efficientefficient
•• Uniforms require JIT optimizationsUniforms require JIT optimizations•• Consider specializing shaders for Consider specializing shaders for
uniforms that take 2uniforms that take 2--3 values3 values–– if (if (doShadowdoShadow > 0.0)> 0.0) likely likely
inefficientinefficient
Reduce Register PressureReduce Register Pressure
•• Less important than in the pastLess important than in the past–– Compilers betterCompilers better–– Hardware has evolvedHardware has evolved
•• Minimize array sizesMinimize array sizes•• Avoid storing scalars in vectorsAvoid storing scalars in vectors
Utilize Vector InstructionsUtilize Vector Instructions
•• Specifies explicit parallelismSpecifies explicit parallelism–– Helps compiler outHelps compiler out
•• Reliably accesses Reliably accesses ‘‘specialspecial’’instructionsinstructions–– DP3/DP4DP3/DP4–– Normalize on NVIDIA hardwareNormalize on NVIDIA hardware
Understand the ArchitecturesUnderstand the Architectures
•• Unpublished instruction setsUnpublished instruction sets–– Difficult to understand and properly tuneDifficult to understand and properly tune–– Change from generation to generationChange from generation to generation
•• ATIATI–– Vector + ScalarVector + Scalar
•• NVIDIANVIDIA–– Vector SuperscalarVector Superscalar
•• 3DLabs3DLabs–– ScalarScalar
State ManagementState Management
•• Avoid unnecessary state changeAvoid unnecessary state change•• Avoid extra shader compilesAvoid extra shader compiles•• Avoid frequent sampler remappingAvoid frequent sampler remapping•• Minimize queriesMinimize queries–– Ask once and cache the valueAsk once and cache the value
•• Test Test üüberber shaders carefullyshaders carefully–– Switching shaders may be fasterSwitching shaders may be faster
Fixed Function BitsFixed Function Bits
•• Depth OptimizationsDepth Optimizations•• Use Modern MethodsUse Modern Methods–– Extensions are often performance Extensions are often performance
focusedfocused•• Good state managementGood state management
Depth OptimizationsDepth Optimizations
•• Most modern GPUs have occlusion Most modern GPUs have occlusion cullingculling–– Can massively improve performanceCan massively improve performance–– No cost to the applicationNo cost to the application–– Simple guidelines to maximize Simple guidelines to maximize
resultsresults
Depth GuidelinesDepth Guidelines
•• Clear depth and stencil togetherClear depth and stencil togetherglClear(GL_DEPTHglClear(GL_DEPTH | GL_STENCIL)| GL_STENCIL)
•• Clear the depth bufferClear the depth buffer•• Draw roughly front to backDraw roughly front to back
–– Opaque objectsOpaque objects•• Consider a depth fill passConsider a depth fill pass
–– Only if shaders are expensiveOnly if shaders are expensive•• Avoid killing pixels while updating depthAvoid killing pixels while updating depth
–– Alpha test or DiscardAlpha test or Discard•• Avoid reading the depth bufferAvoid reading the depth buffer
–– glReadPixelsglReadPixels or or glCopyTexSubImageglCopyTexSubImage•• Use a consistent depth functionUse a consistent depth function
–– Preferably directional (GL_LEQUAL)Preferably directional (GL_LEQUAL)
Use ExtensionsUse Extensions
•• Most address specific bottlenecksMost address specific bottlenecks–– Frame Buffer ObjectFrame Buffer Object•• Context switch overheadContext switch overhead
–– Vertex Buffer ObjectVertex Buffer Object•• Vertex cachingVertex caching
–– Pixel Buffer ObjectPixel Buffer Object•• Simple pixel transfer speedsSimple pixel transfer speeds
State ManagementState Management
•• Attempt to minimize state changesAttempt to minimize state changes–– Toggling states costs CPU timeToggling states costs CPU time–– Avoid resetting to default Avoid resetting to default ‘‘just becausejust because’’
•• Some changes are worse than othersSome changes are worse than others–– Blend enable/disable cheapBlend enable/disable cheap–– Shader change expensiveShader change expensive–– Texture format change expensiveTexture format change expensive
Preparing for MultiPreparing for Multi--CPUCPU
•• Understand OpenGL's Understand OpenGL's synchronization semanticssynchronization semantics–– MulticontextMulticontext gotchasgotchas•• When does something delete?When does something delete?
•• Possible strategiesPossible strategies–– Single graphics worker threadSingle graphics worker thread–– Single context multiple threadsSingle context multiple threads–– Multiple contexts multiple threadsMultiple contexts multiple threads
OpenGL Multithreading Basics OpenGL Multithreading Basics
•• Context is limited to one threadContext is limited to one thread•• Limited context synchronizationLimited context synchronization–– Primarily Primarily glFinishglFinish–– Also Also glBindglBind* to a limited degree* to a limited degree
•• Easy to run into trouble Easy to run into trouble
Object LifetimesObject Lifetimes
•• Delete does not mean deleteDelete does not mean delete–– Other contexts could be using itOther contexts could be using it–– Synchronize when destroying an Synchronize when destroying an
objectobject–– OpenGL spec is a little fuzzy hereOpenGL spec is a little fuzzy here•• Expect slightly different behaviorExpect slightly different behavior
Single Graphics ThreadSingle Graphics Thread
•• One context working on a single threadOne context working on a single thread–– No graphics synchronization overheadNo graphics synchronization overhead–– Services rendering requestsServices rendering requests
•• DrawingDrawing•• Resource managementResource management
–– Other CPU free to handle physics, etc.Other CPU free to handle physics, etc.–– Best overall model presentlyBest overall model presently
MulticontextMulticontext with One Threadwith One Thread
•• AdvantagesAdvantages–– Keep state Keep state ‘‘cleanclean’’ on each contexton each context–– Synchronization logic is easySynchronization logic is easy
•• DisadvantagesDisadvantages–– Context switch is expensiveContext switch is expensive•• Unlikely to be successfulUnlikely to be successful
MultiContext/MultiThreadMultiContext/MultiThread
•• Possibly allow background loadingPossibly allow background loading–– Thread A rendersThread A renders–– Thread B loads texturesThread B loads textures
•• Synchronization is complexSynchronization is complex–– May require May require glFinishglFinish–– Driver may have to do extra internal Driver may have to do extra internal
synchronizationsynchronization•• Not likely to be fast todayNot likely to be fast today
MultiMulti--GPU HintsGPU Hints
•• Minimize Minimize interframeinterframe dependencedependence–– Clear buffersClear buffers–– Avoid Avoid glCopyglCopy* operations* operations–– Avoid rendering to textures every Avoid rendering to textures every
other frameother frame•• Simon has additional detailsSimon has additional details……
ThanksThanks
•• ATI Dev TeamATI Dev Team•• NVIDIA Dev TeamNVIDIA Dev Team•• John SpitzerJohn Spitzer–– Original pipeline slidesOriginal pipeline slides
Top Related