
OpenGL Performance Tuning

OpenGL Performance in a Shader-Centric World

Evan Hart – ATI Research Inc.

Overview

• Why ‘Shader Centric’?
• Performance analysis overview
• Shader Language performance tips
• Managing state
• Non-shader issues
• Multiprocessing

Why all these shaders?

• Obvious increase in visual fidelity
• Hardware design trends
• Performance
• Cool factor

Hardware Design

• Drive for new features
• Limited die space
  – Roughly 200 million transistors
• Fixed function gets bumped out
  – Collection of special shaders
  – Still highly tuned

Improved Performance

• Shaders provide direct control
  – Implement the exact algorithm
• Avoid external processing
  – Less CPU work

Why focus on shaders?

• Better
• Faster
• Cooler

Performance Analysis

• Find the bottleneck
• Balance the performance
• Rinse, lather, and repeat

Possible Pipeline Bottlenecks

[Figure: pipeline diagram. Stages: CPU → Geometry Storage → Geometry Processor → Rasterizer → Shading → Framebuffer. Operations: CPU, transfer, transform, raster, fragment, framebuffer. Regions: CPU/Bus Bound, Vertex Bound, Pixel Bound.]

Finding the Bottleneck

• Reduce the workload of different stages
• If performance does not change
  – Move on, this is not it
• If performance does change significantly
  – This is the bottleneck
  – Look to reduce the workload
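The procedure above can be sketched as a small helper that compares a baseline frame time against frame times measured with one stage's workload reduced. This is only an illustration: the `StageTrial` type, the `find_bottleneck` function, and the idea of a fixed "significant change" threshold are assumptions made for this sketch, not something the slides prescribe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: frame time measured after reducing one stage's work. */
typedef struct {
    const char *stage;   /* label chosen by the caller, e.g. "pixel" */
    double reduced_ms;   /* frame time with that stage's workload reduced */
} StageTrial;

/* Returns the stage label whose trial changed the frame time the most,
   or NULL when no trial changed it by more than `threshold`
   (expressed as a fraction of the baseline frame time). */
const char *find_bottleneck(double baseline_ms, const StageTrial *trials,
                            size_t count, double threshold)
{
    assert(baseline_ms > 0.0);
    const char *best = NULL;
    double best_change = threshold;
    for (size_t i = 0; i < count; ++i) {
        double diff = baseline_ms - trials[i].reduced_ms;
        if (diff < 0.0)
            diff = -diff;
        double change = diff / baseline_ms;
        if (change > best_change) {
            best_change = change;
            best = trials[i].stage;
        }
    }
    return best;
}
```

A NULL result corresponds to "move on, this is not it" for every stage tried so far; the threshold value itself is a judgment call for the application.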

Pixel Bottleneck

• Easiest to detect
  – Does performance scale with resolution?
• Multiple causes
  – Memory bandwidth
    • Disable blending
    • Reduce texture bit depth
  – Shader Performance
    • Try a simplified shader
  – Texture filtering
    • Turn off trilinear or anisotropic filtering

Vertex Bottleneck

• Harder to detect, less frequent
  – Render ½ the triangles of each object
  – Reduce the complexity of vertex shaders
  – If both scale performance
    • Vertex bottleneck
  – If only reduced triangle count scales
    • Submission/fetch bottleneck
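The two experiments above form a small decision table, which might be written down as follows. The function and enum names, and the use of a fractional improvement threshold, are assumptions for this sketch. The slide does not say what to conclude when only the simpler vertex shader helps, so that case is reported as inconclusive here.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    GEOMETRY_INCONCLUSIVE,        /* case not covered by the slide */
    VERTEX_BOUND,
    SUBMISSION_OR_FETCH_BOUND
} GeometryDiagnosis;

/* True when the trial is faster than the baseline by more than `threshold`
   (a fraction of the baseline frame time). */
bool frame_time_improved(double baseline_ms, double trial_ms, double threshold)
{
    return (baseline_ms - trial_ms) / baseline_ms > threshold;
}

/* half_triangles_ms: frame time when rendering half the triangles.
   simple_shader_ms:  frame time with a simplified vertex shader. */
GeometryDiagnosis diagnose_geometry(double baseline_ms,
                                    double half_triangles_ms,
                                    double simple_shader_ms,
                                    double threshold)
{
    assert(baseline_ms > 0.0);
    bool triangles_scale = frame_time_improved(baseline_ms, half_triangles_ms, threshold);
    bool shader_scales = frame_time_improved(baseline_ms, simple_shader_ms, threshold);

    if (triangles_scale && shader_scales)
        return VERTEX_BOUND;
    if (triangles_scale)
        return SUBMISSION_OR_FETCH_BOUND;
    return GEOMETRY_INCONCLUSIVE;
}
```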

CPU Bottleneck

• Most common today (on high-end)
  – Use profilers like VTune®
  – Find API versus application time
  – If application limited
    • Make it prettier, it's essentially free

Bottlenecks

• Find major area
  – Fragment/Pixel
  – Vertex
  – CPU
• Tune to find the exact cause

Shading Language Performance

• General shader tuning
• Shader conditionals
• Marking items as const
• Reducing register pressure
• Utilizing vector ops
• Understanding the architectures

General Tuning Advice

• Remember floating point and compiler basics
• Keep hardware limitations in mind
• Balance resources
• Write clear code

Floating Point Basics

• Operations are not associative
  – (a*b)*c != a*(b*c)
• Limits the ability of a compiler to reorder code
• Likely to be fairly aggressive
• Do not rely on extreme reordering

Remember Hardware Limitations

• Falling to software is the ultimate performance penalty
  – Dynamic addressing not available everywhere
  – Vertex texturing is limited

Balance Resources

• Three distinct categories
  – Computation
    • Multiply accumulate
    • Transcendental operations
  – Texture
  – Interpolation
• See IHV information for more detail

Resource Balancing Hints

• Bandwidth is scarce
  – Texture fetch can absorb 32 GB/s
• Bias toward ALU operations
  – No bandwidth consumed
  – MAD cheaper than sqrt (in general)
• Using more varyings is almost always better
  – Moving computation back to the vertex shader

Write Clear Code

• Compilers pretty good today
• GPU compilers are getting much better
• Convoluted code may hurt the ability to optimize

Shader Conditionals

• Powerful feature
• Performance implications are complex
  – SIMD processing requires coherency
  – Can increase instruction count
  – Don’t assume it will be faster to skip work
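The coherency point can be illustrated with a deliberately simplified cost model, written here in C rather than shader code. The assumption is that a SIMD batch of pixels executes every side of a branch that at least one pixel in the batch takes, plus some fixed overhead for the branch itself. The function name, the cost units, and the model are all illustrative; real hardware differs in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of one SIMD batch running `if (cond) expensive(); else cheap();`.
   Every side taken by at least one pixel is executed for the whole batch. */
unsigned batch_cost(const bool *takes_expensive_path, size_t batch_size,
                    unsigned expensive_cost, unsigned cheap_cost,
                    unsigned branch_overhead)
{
    assert(takes_expensive_path != NULL && batch_size > 0);
    bool any_expensive = false;
    bool any_cheap = false;
    for (size_t i = 0; i < batch_size; ++i) {
        if (takes_expensive_path[i])
            any_expensive = true;
        else
            any_cheap = true;
    }
    unsigned cost = branch_overhead;
    if (any_expensive)
        cost += expensive_cost;
    if (any_cheap)
        cost += cheap_cost;
    return cost;
}
```

Under this model, a batch in which a single pixel takes the expensive path pays for both sides plus the overhead, which is more than running the expensive path unconditionally. Only coherent batches actually skip work.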

Using Const Qualifier

• Compile time constants highly efficient
• Uniforms require JIT optimizations
• Consider specializing shaders for uniforms that take 2-3 values
  – if (doShadow > 0.0) likely inefficient
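One possible way to specialize a shader on the host side is to prepend a preprocessor define to a shared shader body and compile one variant per value, so the shader can test `#if DO_SHADOW` instead of branching on a uniform. The sketch below only builds the source string; `build_specialized_source` and the `DO_SHADOW` macro are names invented for this example, and it assumes the shared body does not begin with a `#version` line (which would have to stay first).

```c
#include <assert.h>
#include <stdio.h>

/* Writes "#define DO_SHADOW <0|1>\n" followed by the shared shader body
   into `out`. Returns the number of characters the full string needs,
   as snprintf does, so the caller can detect truncation. */
int build_specialized_source(char *out, size_t out_size,
                             int do_shadow, const char *body)
{
    assert(out != NULL && body != NULL);
    return snprintf(out, out_size, "#define DO_SHADOW %d\n%s",
                    do_shadow ? 1 : 0, body);
}
```

The resulting string would then be handed to the normal shader compile path, once for each variant, and the application would switch programs instead of setting the uniform.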

Reduce Register Pressure

• Less important than in the past
  – Compilers better
  – Hardware has evolved
• Minimize array sizes
• Avoid storing scalars in vectors

Utilize Vector Instructions

• Specifies explicit parallelism
  – Helps compiler out
• Reliably accesses ‘special’ instructions
  – DP3/DP4
  – Normalize on NVIDIA hardware

Understand the Architectures

• Unpublished instruction sets
  – Difficult to understand and properly tune
  – Change from generation to generation
• ATI
  – Vector + Scalar
• NVIDIA
  – Vector Superscalar
• 3DLabs
  – Scalar

State Management

• Avoid unnecessary state change
• Avoid extra shader compiles
• Avoid frequent sampler remapping
• Minimize queries
  – Ask once and cache the value
• Test über shaders carefully
  – Switching shaders may be faster

Fixed Function Bits

• Depth Optimizations
• Use Modern Methods
  – Extensions are often performance focused
• Good state management

Depth Optimizations

• Most modern GPUs have occlusion culling
  – Can massively improve performance
  – No cost to the application
  – Simple guidelines to maximize results

Depth Guidelines

• Clear depth and stencil together
  – glClear(GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT)
• Clear the depth buffer
• Draw roughly front to back
  – Opaque objects
• Consider a depth fill pass
  – Only if shaders are expensive
• Avoid killing pixels while updating depth
  – Alpha test or discard
• Avoid reading the depth buffer
  – glReadPixels or glCopyTexSubImage
• Use a consistent depth function
  – Preferably directional (GL_LEQUAL)
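Drawing opaque objects roughly front to back can be approximated by sorting the draw list by view-space depth before submission. The `OpaqueDraw` record and the function names below are assumptions for this sketch; an exact sort is more than the guideline requires, and coarse depth buckets would serve the same purpose.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical draw record: an object id and its distance from the camera. */
typedef struct {
    int id;
    float view_depth;
} OpaqueDraw;

/* qsort comparator: nearer objects first. */
int by_depth_ascending(const void *lhs, const void *rhs)
{
    float a = ((const OpaqueDraw *)lhs)->view_depth;
    float b = ((const OpaqueDraw *)rhs)->view_depth;
    return (a > b) - (a < b);
}

/* Orders opaque draws so that near objects fill the depth buffer first,
   letting the hardware reject hidden pixels of farther objects early. */
void sort_front_to_back(OpaqueDraw *draws, size_t count)
{
    assert(draws != NULL);
    qsort(draws, count, sizeof *draws, by_depth_ascending);
}
```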

Use Extensions

• Most address specific bottlenecks
  – Frame Buffer Object
    • Context switch overhead
  – Vertex Buffer Object
    • Vertex caching
  – Pixel Buffer Object
    • Simple pixel transfer speeds

State Management

• Attempt to minimize state changes
  – Toggling states costs CPU time
  – Avoid resetting to default ‘just because’
• Some changes are worse than others
  – Blend enable/disable cheap
  – Shader change expensive
  – Texture format change expensive
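Since some state changes cost more than others, one common approach is to sort draws so the expensive state varies least often. The sketch below orders by shader first, then texture format, then blend state, and counts the resulting shader switches. The `DrawState` record and the sort priority are assumptions for illustration, and the sort ignores ordering constraints such as blended geometry that must be drawn back to front.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-draw state key. */
typedef struct {
    unsigned shader;
    unsigned texture_format;
    int blend_enabled;
} DrawState;

/* qsort comparator: most expensive state is the most significant key. */
int by_state_cost(const void *lhs, const void *rhs)
{
    const DrawState *a = lhs;
    const DrawState *b = rhs;
    if (a->shader != b->shader)
        return (a->shader > b->shader) - (a->shader < b->shader);
    if (a->texture_format != b->texture_format)
        return (a->texture_format > b->texture_format)
             - (a->texture_format < b->texture_format);
    return (a->blend_enabled > b->blend_enabled)
         - (a->blend_enabled < b->blend_enabled);
}

void sort_by_state(DrawState *draws, size_t count)
{
    assert(draws != NULL);
    qsort(draws, count, sizeof *draws, by_state_cost);
}

/* Number of times the shader differs from the previous draw's shader. */
size_t count_shader_changes(const DrawState *draws, size_t count)
{
    size_t changes = 0;
    for (size_t i = 1; i < count; ++i)
        if (draws[i].shader != draws[i - 1].shader)
            ++changes;
    return changes;
}
```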

Preparing for Multi-CPU

• Understand OpenGL's synchronization semantics
  – Multicontext gotchas
    • When does something delete?
• Possible strategies
  – Single graphics worker thread
  – Single context multiple threads
  – Multiple contexts multiple threads

OpenGL Multithreading Basics

• Context is limited to one thread
• Limited context synchronization
  – Primarily glFinish
  – Also glBind* to a limited degree
• Easy to run into trouble

Object Lifetimes

• Delete does not mean delete
  – Other contexts could be using it
  – Synchronize when destroying an object
  – OpenGL spec is a little fuzzy here
    • Expect slightly different behavior

Single Graphics Thread

• One context working on a single thread
  – No graphics synchronization overhead
  – Services rendering requests
    • Drawing
    • Resource management
  – Other CPU free to handle physics, etc.
  – Best overall model presently

Multicontext with One Thread

• Advantages
  – Keep state ‘clean’ on each context
  – Synchronization logic is easy
• Disadvantages
  – Context switch is expensive
• Unlikely to be successful

MultiContext/MultiThread

• Possibly allow background loading
  – Thread A renders
  – Thread B loads textures
• Synchronization is complex
  – May require glFinish
  – Driver may have to do extra internal synchronization
• Not likely to be fast today

Multi-GPU Hints

• Minimize interframe dependence
  – Clear buffers
  – Avoid glCopy* operations
  – Avoid rendering to textures every other frame
• Simon has additional details…

Thanks

• ATI Dev Team
• NVIDIA Dev Team
• John Spitzer
  – Original pipeline slides