GCN Performance FTW

Stephan Hodes, Developer Technology Engineer, AMD
AMD and Microsoft Game Developer Day, June 2 2014, Stockholm

Agenda
- GCN architecture explained
- Top 10: GCN Performance Advice
- Questions

AMD Graphics Core Next

What is GCN?
- Non-VLIW architecture
  - Less dependent on manual vectorization of shaders
  - Susceptible to register pressure

Architecture used in:
- AMD discrete GPUs since 2012 (HD7700 and better)
- Kabini and Kaveri APUs
- Future AMD hardware
- New consoles

GCN hardware is required for Mantle and DirectX 12 API support



Product Specifications: AMD Radeon R9 290 Series

                            R9 290X                R9 290
Compute Units               44                     40
Engine Clock                Up to 1 GHz            Up to 950 MHz
Compute Performance         5.6 TFLOPS             4.9 TFLOPS
Memory Configuration        4GB GDDR5 / 512-bit    4GB GDDR5 / 512-bit
Memory Speed                5.0 Gbps               5.0 Gbps
AMD TrueAudio Technology    Yes                    Yes
API Support                 DirectX 11.2           DirectX 11.2
                            OpenGL 4.3             OpenGL 4.3
                            Mantle                 Mantle
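The compute performance figures in the table can be cross-checked from the compute unit count and engine clock. The following is a minimal sketch, not an official formula: it uses the 4 SIMDs of 16 lanes per compute unit described later in the deck, and it assumes that each lane retires one fused multiply-add per clock, counted as two floating-point operations. The function name and parameters are made up for this illustration.

```python
def peak_sp_tflops(compute_units, clock_ghz,
                   simds_per_cu=4, lanes_per_simd=16,
                   flops_per_lane_per_clock=2):
    """Estimate peak single-precision throughput in TFLOPS.

    Assumption: one FMA per lane per clock, counted as 2 FLOPs.
    """
    lanes = compute_units * simds_per_cu * lanes_per_simd
    gflops = lanes * flops_per_lane_per_clock * clock_ghz
    return gflops / 1000.0


if __name__ == "__main__":
    print("R9 290X:", round(peak_sp_tflops(44, 1.0), 1), "TFLOPS")
    print("R9 290: ", round(peak_sp_tflops(40, 0.95), 1), "TFLOPS")
```

Under these assumptions the estimate rounds to the 5.6 and 4.9 TFLOPS listed for the two boards.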



GCN Compute Unit Specifics

Non-VLIW instruction set architecture

4 [16-lane] Vector ALUs (SIMD)
- One wavefront is 64 threads
- 1 SP (single-precision) op: 4 clocks
- 1 DP (double-precision) ADD: 8 clocks
- 1 DP MUL/FMA & transcendental: 16 clocks
- 64KB Vector GPRs

1 fully programmable scalar ALU
- Shared by all threads of a wavefront
- Used for flow control, pointer arithmetic, etc.
- 8KB Scalar GPRs, scalar data cache, etc.

[Diagram: compute unit block diagram. Blocks: Scheduler; Branch & Message Unit; Scalar Unit; Scalar Registers (SGPRs, 8KB); Vector Units (4x SIMD-16); Vector Registers (VGPRs, 4x 64KB); Local Data Share (LDS, 64KB); Texture Filter Units (4); Texture Fetch Load/Store Units (16); L1 Cache (16KB)]



GCN Compute Unit Specifics

Distributed programmable scheduler (up to 2560 threads)
- Each compute unit can execute instructions from multiple kernels
- Separate decode/issue for:
  - 1 Vector Arithmetic Logic Unit (ALU)
  - 1 Scalar ALU or Scalar Memory Read
  - or 1 Branch/Message
  - 1 Vector memory access (Read/Write/Atomic)
  - 1 Local Data Share operation (LDS)
  - 1 Export or Global Data Share operation (GDS)
- Plus 1 Special/Internal [no functional unit] (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio)




GCN Compute Unit Specifics

64KB Local Data Share (LDS)
- 32 banks, with conflict resolution
- Bandwidth amplification

16KB read/write L1 vector data cache

Texture Units (utilize L1)
- 16 Load/Store units
- 4 Filter units

1 Branch & Message Unit
- Executes branch instructions (as dispatched by Scalar Unit)




GCN Compute Unit Latency Hiding

Up to 10 Wavefronts/SIMD
- Used to hide latency
- Round Robin scheduling
- Independent kernels
- Often limited by GPR or LDS usage

[Figure: timeline with Time (clocks) on the horizontal axis. Batch 1 to Batch 4 each alternate between Stall and Runnable until Done.]



GCN Compute Unit Register Pressure

Vector GPRs
- 64KB / 64 threads / 4 Byte / 10 wavefronts = 25.6 VGPR/thread
- => Max 24 VGPR per thread to keep 10 wavefronts per SIMD

Scalar GPRs
- 8KB / 4 SIMD / 4 Byte / 10 wavefronts = 51.2 SGPR/wavefront
- => Max 48 SGPR per wavefront to keep 10 wavefronts per SIMD

LDS
- 32KB/threadgroup and threadgroup size 64 => 2 wavefronts/CU max.
- 32KB/threadgroup and threadgroup size 256 => 8 wavefronts/CU max.
- 16KB/threadgroup and threadgroup size 256 => 16 wavefronts/CU max.
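The arithmetic on this slide can be reproduced with a short calculator. This is a rough sketch using only the sizes quoted in the deck (64KB of VGPRs per SIMD, 8KB of SGPRs and 64KB of LDS per compute unit, 4 SIMDs, 64 threads per wavefront, at most 10 wavefronts per SIMD). The VGPR allocation granularity of 4 registers is an assumption introduced here to explain why 25.6 becomes 24; all function and constant names are hypothetical.

```python
VGPR_BYTES_PER_SIMD = 64 * 1024
SGPR_BYTES_PER_CU = 8 * 1024
LDS_BYTES_PER_CU = 64 * 1024
SIMDS_PER_CU = 4
THREADS_PER_WAVE = 64
BYTES_PER_REGISTER = 4
MAX_WAVES_PER_SIMD = 10
VGPR_GRANULARITY = 4  # assumption: VGPRs are allocated in blocks of 4


def max_threads_per_cu():
    """Upper bound on resident threads per compute unit."""
    return SIMDS_PER_CU * MAX_WAVES_PER_SIMD * THREADS_PER_WAVE


def vgpr_budget_per_thread(waves_per_simd):
    """VGPRs available to each thread at a given wavefront count."""
    return VGPR_BYTES_PER_SIMD / THREADS_PER_WAVE / BYTES_PER_REGISTER / waves_per_simd


def sgpr_budget_per_wave(waves_per_simd):
    """SGPRs available to each wavefront at a given wavefront count."""
    return SGPR_BYTES_PER_CU / SIMDS_PER_CU / BYTES_PER_REGISTER / waves_per_simd


def waves_per_simd_for_vgprs(vgprs_per_thread):
    """Wavefronts per SIMD that fit for a shader using this many VGPRs."""
    if vgprs_per_thread < 1:
        raise ValueError("vgprs_per_thread must be at least 1")
    # Round the request up to the assumed allocation granularity.
    blocks = -(-vgprs_per_thread // VGPR_GRANULARITY)
    allocated = blocks * VGPR_GRANULARITY
    registers_per_lane = VGPR_BYTES_PER_SIMD // (THREADS_PER_WAVE * BYTES_PER_REGISTER)
    return min(MAX_WAVES_PER_SIMD, registers_per_lane // allocated)


def waves_per_cu_for_lds(lds_bytes_per_group, group_size):
    """Wavefronts per compute unit that fit given LDS use per thread group."""
    cap = SIMDS_PER_CU * MAX_WAVES_PER_SIMD
    waves_per_group = -(-group_size // THREADS_PER_WAVE)
    if lds_bytes_per_group <= 0:
        return cap
    groups = LDS_BYTES_PER_CU // lds_bytes_per_group
    return min(cap, groups * waves_per_group)


if __name__ == "__main__":
    print(max_threads_per_cu())                    # scheduler limit from the deck
    print(vgpr_budget_per_thread(10), sgpr_budget_per_wave(10))
    for vgprs in (24, 25, 32, 64, 128):
        print(vgprs, "VGPRs ->", waves_per_simd_for_vgprs(vgprs), "waves/SIMD")
    print(waves_per_cu_for_lds(32 * 1024, 64))
    print(waves_per_cu_for_lds(32 * 1024, 256))
    print(waves_per_cu_for_lds(16 * 1024, 256))
```

The point of the sketch is the threshold behaviour: under the stated granularity assumption, going from 24 to 25 VGPRs costs a full wavefront per SIMD, which is why a shader that is only slightly over a threshold is worth a second look.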



GCN Shader Optimization Strategies

Try reducing GPR count if you are slightly over a waves-per-SIMD threshold
- Deep nesting
- Local array declarations
- Long-lived temporary variables

Reducing GPRs is not always optimal
- The shader compiler might use GPRs to reduce latency
- A high number of threads/CU can thrash your caches

Listing 1 (four loads in flight, results held in v6 to v9):

    image_load  v6, v[35:38], s[4:11]
    v_mov_b32   v3, v35
    image_load  v7, v[3:6], s[4:11]
    v_mov_b32   v38, v36
    image_load  v8, v[37:40], s[4:11]
    v_mov_b32   v3, v37
    image_load  v9, v[3:6], s[4:11]
    s_waitcnt   vmcnt(2)
    v_min_f32   v6, v6, v7
    s_waitcnt   vmcnt(1)
    v_min_f32   v6, v6, v8
    s_waitcnt   vmcnt(0)
    v_min_f32   v40, v6, v9

Listing 2 (results reuse v6 and v7, with a full wait before each later load):

    image_load  v6, v[35:38], s[4:11]
    v_mov_b32   v3, v35
    image_load  v7, v[3:6], s[4:11]
    v_mov_b32   v38, v36
    v_mov_b32   v3, v37
    s_waitcnt   vmcnt(0)
    v_min_f32   v6, v6, v7
    image_load  v7, v[37:40], s[4:11]
    s_waitcnt   vmcnt(0)
    v_min_f32   v6, v6, v7
    image_load  v7, v[3:6], s[4:11]
    s_waitcnt   vmcnt(0)
    v_min_f32   v6, v6, v7

Always profile your changes!
http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-tools-sdks/codexl/
http://developer.amd.com/community/blog/2014/05/16/codexl-game-developers-analyze-hlsl-gcn



Top 10 Performance Advice


Use the power of DirectCompute
- Thread group size should be a multiple of 64
  - 256 is often a good choice.
- Don't underestimate the benefits of LDS
- Use asynchronous compute
- Don't switch between Compute/Rasterization too frequently
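The advice on thread group size follows from the 64-thread wavefront. As a small illustration, assuming a partially filled wavefront still occupies all 64 lanes, the fraction of lanes doing useful work can be estimated as below. The helper names are hypothetical.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront, from the architecture slides


def wavefronts_needed(group_size):
    """Number of wavefronts a thread group of this size occupies."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return -(-group_size // WAVEFRONT_SIZE)  # ceiling division


def lane_utilization(group_size):
    """Fraction of allocated lanes that map to real threads."""
    return group_size / (wavefronts_needed(group_size) * WAVEFRONT_SIZE)


if __name__ == "__main__":
    for size in (64, 100, 256):
        print(size, wavefronts_needed(size), lane_utilization(size))
```

A group of 100 threads, for example, occupies two wavefronts while leaving 28 of the 128 lanes idle, whereas 64 or 256 threads fill every lane.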



Top 10 Performance Advice

Don't over-tessellate
- Small triangles result in poor quad occupancy
- Use [maxtessfactor(X)] in Hull Shader declaration
  - Recommended value is 15 or less
- Implement culling in Hull Shader
- Use Adaptive Tessellation
  - Distance Adaptive
  - Screen Space Adaptive
  - Orientation Adaptive

Especially when rendering shadow maps!


Top 10 Performance Advice

Keep your pipeline short
- Avoid large expansion in the Geometry Shader
- Often a Vertex Shader-only solution can replace Geometry Shader usage
  - Bokeh expansion
  - Point sprites
- Disable tessellation pipeline if unused

Pack shader stage output
- Limit Vertex and Domain Shader output size to 4 float4/int4 attributes for best performance.

    struct PS_INPUT
    {
        float3 vPosition;
        float3 vNormal;
        float2 vTexcoord1;
        float2 vTexcoord2;
        float2 vTexcoord3;
    }; // Suboptimal

    struct PS_INPUT
    {
        float4 vPositionTexcoord1U;
        float4 vNormalTexcoord1V;
        float4 vTexcoords23;
    }; // Good


Top 10 Performance Advice

Update your data using Map/Unmap
- Avoid MAP_WRITE_DISCARD
- Prefer MAP_WRITE_NO_OVERWRITE

Avoid UpdateSubresource
- Prefer Map and/or CopyResource instead
- UpdateSubresource is ok for small ( [text lost in extraction]

    [text lost in extraction] b, establish VCC
    s_mov_b64       s0, exec         // Save current exec mask
    s_and_b64       exec, vcc, exec  // Do if
    s_cbranch_vccz  label0           // Branch if all lanes fail
    v_sub_f32       r2, r0, r1       // result = a - b
    v_mul_f32       r2, r2, r0       // result = result * a
    label0:
    s_andn2_b64     exec, s0, exec   // Do else (s0 & !exec)
    s_cbranch_execz label1           // Branch if all lanes fail
    v_sub_f32       r2, r1, r0       // result = b - a
    v_mul_f32       r2, r2, r1       // result = result * b
    label1:
    s_mov_b64       exec, s0         // Restore exec mask

    // Branching code example
    float fn0(float a, float b)
    {
        if (a > b)
            return ((a - b) * a);
        else
            return ((b - a) * b);
    }
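The masked branching in the listing can be emulated on the CPU to see when both sides of the branch are executed. The sketch below is an illustration only, not a model of real hardware timing: each list element stands for one lane of a wavefront, boolean lists stand in for the exec and VCC masks, and the function names are invented for this example.

```python
def fn0_scalar(a, b):
    """Reference result for a single thread, mirroring the HLSL fn0."""
    return (a - b) * a if a > b else (b - a) * b


def fn0_wavefront(a_vals, b_vals):
    """Emulate the masked if/else across all lanes of one wavefront.

    Returns (results, paths), where paths lists the sides that were run.
    """
    lanes = len(a_vals)
    result = [0.0] * lanes
    paths = []

    saved_exec = [True] * lanes                        # save current exec mask
    vcc = [a > b for a, b in zip(a_vals, b_vals)]      # compare a > b, establish VCC
    exec_mask = [v and s for v, s in zip(vcc, saved_exec)]  # exec = vcc & exec

    if any(vcc):                                       # skipped when all lanes fail
        paths.append("if")
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (a_vals[i] - b_vals[i]) * a_vals[i]

    # exec = saved & !exec selects the lanes that did not take the "if" side
    exec_mask = [s and not e for s, e in zip(saved_exec, exec_mask)]
    if any(exec_mask):                                 # skipped when no lane is left
        paths.append("else")
        for i in range(lanes):
            if exec_mask[i]:
                result[i] = (b_vals[i] - a_vals[i]) * b_vals[i]

    return result, paths                               # exec mask restored on exit


if __name__ == "__main__":
    print(fn0_wavefront([3.0, 4.0], [1.0, 1.0]))  # coherent: only one side runs
    print(fn0_wavefront([3.0, 1.0], [1.0, 3.0]))  # divergent: both sides run
```

When every lane agrees on the condition, one side is skipped entirely; as soon as a single lane disagrees, the wavefront pays for both sides.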


Top 10 Performance Advice

Pack your G-Buffer using RGBA16_UINT
- Fetches from RGBA16 are full rate (without filtering)
- Bilinear fetches to RGBA16 are half rate
- Exports to RGBA16_INT are full rate (without blending)
- Caution: Blended exports to RGBA16_INT are [value lost in extraction] speed

Depth buffer: don't render after read
- Binding a depth buffer as a texture will decompress it, which makes subsequent Z ops more expensive.
- Critical for shadow map atlas rendering!
- Consider exporting depth to the G-Buffer


Top 10 Performance Advice

Batch, Batch, Batch!
- Add support for geometry instancing
- Pool & batch your updates
- Less important with Mantle/DirectX12
- Reduces draw call overhead
- Allows better scheduling

(DX11) Prefer engine threading over Deferred Contexts
- Deferred contexts are a software feature
- or move to Mantle/DirectX12



Top 10 Performance Advice

Avoid LDS bank conflicts
- Accessing LDS from different threads with addresses that are 32 DWORDs apart will cause bank conflicts
- Unless it's the same address
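A quick way to reason about the rule above is to map each access to a bank and count how many distinct addresses land in the same one. This sketch assumes the 32 LDS banks are one DWORD wide and that the bank index is simply the DWORD address modulo 32, which matches the "32 DWORDs apart" rule on the slide but is otherwise a simplification. The function name is hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 32  # LDS bank count from the compute unit slides


def worst_bank_conflict(dword_addresses):
    """Return the largest number of distinct DWORD addresses sharing a bank.

    1 means conflict-free; N means some bank has to serve N different
    addresses. Identical addresses are counted once.
    """
    banks = defaultdict(set)
    for address in dword_addresses:
        banks[address % NUM_BANKS].add(address)  # assumed bank mapping
    return max((len(addresses) for addresses in banks.values()), default=0)


if __name__ == "__main__":
    threads = range(32)
    print(worst_bank_conflict([t for t in threads]))        # consecutive DWORDs
    print(worst_bank_conflict([t * 2 for t in threads]))    # stride of 2 DWORDs
    print(worst_bank_conflict([t * 32 for t in threads]))   # stride of 32 DWORDs
    print(worst_bank_conflict([7 for _ in threads]))        # same address everywhere
```

Padding a 32-DWORD-wide array row to 33 DWORDs is a common way to turn the stride-32 case into a conflict-free one.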

Bonus Advice

Don't use gather with offsets
- This will result in 4 image_gather4 instructions

Without offsets:

    float4 PsExample( PsInput Input ) : SV_Target
    {
        return tex.GatherCmpRed( g_SamplePointCmp, Input.vTex, Input.depth );
    }

    image_gather4_c_lz    v0, v[2:5], s[4:11], s[12:15]
    s_waitcnt             vmcnt(0)

With offsets:

    float4 PsExample( PsInput Input ) : SV_Target
    {
        return tex.GatherCmpRed( g_SamplePointCmp, Input.vTex, Input.depth,
                                 int2(0,0), int2(1,0), int2(0,1), int2(1,1) );
    }

    image_gather4_c_lz    v4, v[12:15], s[4:11], s[12:15]
    v_mov_b32             v11, 1
    image_gather4_c_lz_o  v5, v[11:14], s[4:11], s[12:15]
    v_mov_b32             v11, 0x00000100
    image_gather4_c_lz_o  v7, v[11:14], s[4:11], s[12:15]
    v_mov_b32             v11, 0x00000101
    image_gather4_c_lz_o  v0, v[11:14], s[4:11], s[12:15]
    s_waitcnt             vmcnt(0)


Questions?

[email protected]

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

[Chart: Dilate & Blur @ 2560x1600, Kernel Diameter 25. Pixel Shader: 1.0x; DirectCompute: 3.17x]
