GDC 2013: Powering The Next Generation Of Graphics AMD GCN...
1| AMD Radeon™ HD 7900 Series Graphics | December 2011 | Confidential – NDA Required
POWERING THE NEXT GENERATION OF GRAPHICS: AMD GCN ARCHITECTURE
Layla Mah – [email protected] Relations Engineer, AMD
@MissQuickstep
2| Powering the Next Generation of Graphics: The AMD GCN Architecture | Layla Mah | March 2013 | Game Developers Conference |
Agenda
Part 1: AMD Graphics Core Next Architecture (GCN)
Part 2: Partially Resident Textures (PRT)
AMD GRAPHICS CORE NEXT
GPU Evolution
1st Era: Fixed Function
2nd Era: Simple Shaders
3rd Era: Graphics Parallel Core
[Diagram labels: Lighting; 3D Geometry Transformation; VLIW5; VLIW4; General Purpose Registers; Stream Processing Units; FMAD + Special Functions; Branch Unit]
Prior to 2002
Graphics-specific hardware
– Texture mapping/filtering
– Multi-texturing
– “T&L Engines”: Geometry processing (Transform), Rasterization (Lighting)
– Dedicated texture and pixel caches
Dot product and scalar multiply-add
– Sufficient for basic graphics tasks
– No general purpose compute capability
[Diagram labels: Memory Interface; 8 Vertex Pipes; Setup Engine; Pixel Shader Core; 16 Pixel Pipes]
The Rise of Shaders
Shader Models 1.0 - 2.0
– VS and PS are distinct
– Minimal Instruction Sets
– Limited Instruction Slots
– Limited Shader Lengths
– No Dynamic Flow Control
– No Looping Constructs
– No Vertex Texture Fetch
– No Bitwise Operators
– No Native Integer ALU
– Etc.
2002-2006
Graphics-focused programmability
– DirectX 8/9, OpenGL 2.0
– Floating point processing (IEEE not required)
– Different precision per IHV: ATI 24-bit full-speed, NV 16-bit full-speed, NV 32-bit half-speed
– Specialized ALUs for vertex & pixel processing
Added dedicated caches
The Rise of The Unified Shader (VLIW-5)
5-Element Very-Long-Instruction-Word (XYZWT)
– Began with Xenos and utilized from R600 until “Cayman”
– Flexible and optimized for Graphics workloads
Ideal for 4-element Vector and 4x4 Matrix Operations
– Vector/Vector math in a single instruction
Plus One Transcendental-Unit function per Instruction
More advanced caching
– Instruction, constant, multi-level texture/data, & later: LDS/GDS
Single Precision 32-bit IEEE-Compliant Floating Point ALUs
More flexible: Unified ALU, Branch Unit, Dynamic Flow Control, Vertex Texture, Geometry Shader, Tessellation Engines, etc.
Optimized For Die Area Efficiency (VLIW-4)
4-Element Very-Long-Instruction-Word (XYZW)
– Profiling showed average VLIW utilization was < 3.4/5
Removed dedicated T-Unit
– Optimized die area usage
– Each ALU has a smaller LUT, combined using 3-term Lagrange polynomial interpolation (transcendental/clock/VLIW4)
– Better optimized for combination of Graphics & Compute
Graphics is still the primary focus, but compute is gaining attention
Still ideal for 4-element Vector and 4x4 Matrix Operations
Fewer ALU bubbles in transcendental-light code, better utilization
– Simplified programming and optimization relative to VLIW-5
– Multiple dispatch processors & separate command queues
Improved support for DirectCompute™ and OpenCL™
Graphics Core Next Architecture
A new GPU design for a new era of computing
– Cutting-edge graphics performance and features
– High compute density with multi-tasking
– Built for power efficiency
– Optimized for heterogeneous computing
– Enabling the Heterogeneous System Architecture (HSA)
– Amazing scalability and flexibility
Graphics Core Next Architecture
A new GPU design for a new era of computing
– Unlimited Resources & Samplers (Including Unlimited UAV/SRV at any shader stage)
– All UAV formats can be read/write (vs. just single uint32 in D3D11 API spec)
– Simpler Assembly Language
– Simpler Shader Code (No More Clauses)
– Ability to support C/C++ (like)
– Architectural support for traps, exceptions & debugging
– Ability to share virtual x86-64 address space with CPU cores
Graphics Core Next Architecture
A new GPU design for a new era of next generation computing…
GCN Compute Unit
Basic GPU building block
– New instruction set architecture: Non-VLIW, Vector unit + scalar co-processor, Distributed programmable scheduler
– Each compute unit can execute instructions from multiple kernels at once
– Increased instructions per clock per mm2
Designed for high utilization, high throughput, and multi-tasking
[Diagram labels: Branch & Message Unit; Scalar Unit; Vector Units (4x SIMD-16); Vector Registers (4x 64KB); Texture Filter Units (4); Local Data Share (64KB); L1 Cache (16KB); Scheduler; Texture Fetch Load / Store Units (16); Scalar Registers (8KB)]
GCN Compute Unit – Specifics
1 Fully Programmable Scalar ALU
– Shared by all threads of a wavefront
– Used for flow control, pointer arithmetic, etc.
– Has own GPRs, scalar data cache, etc.
1 Branch & Message Unit
– Executes branch instructions (as dispatched by Scalar unit)
4 [16-lane] Vector ALU (SIMD)
– CU Total Throughput: 64 SP ops/clock
– 1 SP (Single-Precision) op per 4 clocks
– 1 DP (Double-Precision) ADD in 8 clocks
– 1 DP MUL/FMA/Transcendental per 16 clocks
GCN Compute Unit – Specifics
64 KB Local Data Share (LDS)
– 2x Larger than D3D11 TGSM Limit (32 KB/thread group)
– 32 banks, with conflict resolution
– Bandwidth amplification
– Separate Instruction Decode
16 KB read/write L1 vector data cache
Texture Units (Utilize L1)
– 4 Filter, 16 Load/Store
Scheduler (2560 Threads)
– Separate decode/issue for VALU, SALU/SMEM, VMEM, LDS, GDS/Export
– + Special instructions (NOPs, barriers, etc.) and branch instructions
GCN Compute Unit – SIMD Specifics
Each SIMD unit is assigned its own 40-bit program counter and instruction buffer for 10 wavefronts
– The whole CU can have 40 wavefronts in flight
– Each potentially from a different work-group or kernel
Each SIMD is a 16-lane ALU
– IEEE-754 SP and DP
– Full-speed denormals + All Rounding Modes
– 32-bit FMA and 24-bit INT at full-speed
– DP and 32-bit INT at reduced rate (1/2 to 1/16)
– 64 KB vector register file
– Issue 1 SP instruction per lane per clock
– Retire 64 lanes (1 wavefront) of SP ALU in 4 clocks
A GCN GPU with 32 CUs, such as the AMD Radeon™ HD 7970, can be working on up to 81,920 work items at a time!
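The 81,920 figure follows from the per-SIMD wavefront limit. A minimal sketch of that arithmetic, using only numbers stated on this slide (the constant and function names are illustrative and not part of any AMD API):

```python
# Figures taken from the slide; the names are illustrative only.
WAVEFRONT_SIZE = 64   # work items per wavefront
WAVES_PER_SIMD = 10   # wavefronts each SIMD's instruction buffer can track
SIMDS_PER_CU = 4      # SIMD units per compute unit


def waves_in_flight_per_cu():
    """Maximum wavefronts resident on one compute unit."""
    return WAVES_PER_SIMD * SIMDS_PER_CU


def work_items_in_flight(num_cus):
    """Maximum work items resident across num_cus compute units."""
    return num_cus * waves_in_flight_per_cu() * WAVEFRONT_SIZE


if __name__ == "__main__":
    print(waves_in_flight_per_cu())    # 40
    print(work_items_in_flight(32))    # 81920
```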
GCN Compute Unit – Scheduler Specifics
On GCN, each CU has its own dedicated Scheduler unit, supporting up to 2560 threads per CU
– Schedules this work between the 4 SIMDs in groups called “wavefronts”
– Each wavefront is a grouping of 64 “threads” which live together on a single SIMD
– One wavefront is executed on each SIMD every four cycles
Total CU throughput: 4 wavefronts / 4 cycles
That’s 256 threads executed every 4 cycles!
Separate protected virtual address spaces
Programmed in a purely scalar way
– Scheduler Limits: 40 wavefronts (theoretical max) per CU, 10 wavefronts per SIMD
These ideal limits may not be attained in practice
– Limited by number of available GPRs
– Limited by size of available LDS
GCN Compute Unit – Scheduler Specifics Cont.
Work should be grouped to support collaborative tasks
– All threads within a workgroup are guaranteed to be scheduled at the same time
– A set of synchronization primitives and shared memory (LDS) allows data to be passed between threads in a workgroup
– 16 Work Group Barriers supported per CU
– Global and Shared memory atomics
– Don’t forget about the L1 cache: “Group discount” on memory reads
– As long as all threads are local to a CU!
– Optimized for throughput – latency is hidden by overlapping execution of wavefronts
Workgroup size should be carefully chosen to balance the collaborative gain against hardware limitations such as GPR count and LDS size
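The trade-off between workgroup size and LDS usage can be estimated with a rough model. The sketch below assumes only the 64 KB LDS and 40-wavefront limits quoted in these slides; it ignores GPR limits, the 16-barrier limit, and any LDS allocation granularity, and the function name is hypothetical:

```python
import math

# Limits quoted in the slides; everything else here is a simplifying assumption.
LDS_BYTES_PER_CU = 64 * 1024
MAX_WAVES_PER_CU = 40
WAVEFRONT_SIZE = 64


def resident_workgroups(group_size, lds_bytes_per_group):
    """Estimate how many workgroups fit on one CU at the same time.

    Considers only the wavefront-slot limit and the LDS capacity.
    """
    waves_per_group = math.ceil(group_size / WAVEFRONT_SIZE)
    by_waves = MAX_WAVES_PER_CU // waves_per_group
    if lds_bytes_per_group == 0:
        return by_waves
    by_lds = LDS_BYTES_PER_CU // lds_bytes_per_group
    return min(by_waves, by_lds)


if __name__ == "__main__":
    # A 256-thread group using the full 32 KB D3D11 TGSM budget.
    print(resident_workgroups(256, 32 * 1024))   # 2
    # The same group using 4 KB of LDS is limited by wavefront slots instead.
    print(resident_workgroups(256, 4 * 1024))    # 10
```

Under this model a workgroup that uses the full 32 KB leaves room for only two groups (8 wavefronts) per CU, which limits how much latency can be hidden by overlapping wavefronts.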
GCN Scheduler Arbitration and Decode
A CU is guaranteed to issue instructions for a wavefront sequentially
– Predication & control flow enables any single work-item a unique execution path
For a given CU, every clock cycle, waves on one SIMD are considered for instruction issue
– Round robin scheduling algorithm
At most, one instruction from each category may be issued
At most, one instruction per wave may be issued
Up to a maximum of 5 instructions can issue per cycle, not including “internal” instructions
– 1 Vector Arithmetic Logic Unit (ALU)
– 1 Scalar ALU or Scalar Memory Read
– 1 Vector memory access (Read/Write/Atomic)
– 1 Branch/Message - s_branch and s_cbranch_<cond>
– 1 Local Data Share (LDS)
– 1 Export or Global Data Share (GDS)
– 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]
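The three issue rules above (one instruction per category, one per wave, at most five per cycle) can be illustrated with a toy selector. This is only a sketch: the slides do not describe how the hardware prioritizes waves within a SIMD, so the input order is assumed to be the priority order, and the category strings are placeholders.

```python
MAX_ISSUE_PER_CYCLE = 5  # from the slide, excluding "internal" instructions


def select_issue(candidates):
    """Pick the instructions to issue this cycle for one SIMD.

    candidates: list of (wave_id, category) tuples, one per ready wave,
    given in priority order (an assumption of this sketch).
    """
    issued = []
    used_categories = set()
    used_waves = set()
    for wave_id, category in candidates:
        if len(issued) == MAX_ISSUE_PER_CYCLE:
            break
        # At most one instruction per category and per wave.
        if category in used_categories or wave_id in used_waves:
            continue
        used_categories.add(category)
        used_waves.add(wave_id)
        issued.append((wave_id, category))
    return issued


if __name__ == "__main__":
    ready = [(0, "VALU"), (1, "VALU"), (2, "LDS"), (3, "VMEM"),
             (4, "SALU/SMEM"), (5, "BRANCH/MSG"), (6, "EXPORT/GDS")]
    # Wave 1 loses the VALU slot to wave 0; wave 6 exceeds the 5-issue cap.
    print(select_issue(ready))
```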
GCN Branch and Message Unit
Independent scalar assist unit to handle special classes of instructions concurrently
– Branch
Unconditional Branch (s_branch)
Conditional Branch (s_cbranch_<cond>)
– Condition: SCC==0, SCC==1, EXEC==0, EXEC!=0, VCC==0, VCC!=0
16-bit signed immediate dword offset from PC provided
– Messages
s_sendmsg: CPU interrupt with optional halt (with shader supplied code and source), debug message (perf trace data, halt, etc), special graphics synchronization messages
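To make the "16-bit signed immediate dword offset from PC" concrete, the sketch below computes a branch target from a byte PC. It assumes the offset is applied relative to the instruction following the 4-byte branch, which the slide does not state; treat that base as an assumption and check the ISA manual before relying on it. Both function names are illustrative.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, simm16):
    """Byte address reached by a taken branch located at byte address pc.

    Assumption: the dword offset is relative to the next instruction (pc + 4).
    """
    return pc + 4 + 4 * sign_extend_16(simm16)


if __name__ == "__main__":
    print(hex(branch_target(0x100, 0x0002)))   # 0x10c, skips two dwords
    print(hex(branch_target(0x100, 0xFFFF)))   # 0x100, offset -1 loops to itself
```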
GCN Vector Units
VLIW4 SIMD
– 64 Single Precision multiply-add
– 1 VLIW inst × 4 ALU ops, dependency limited
– Compiler manages register port conflicts
– Specialized, complex compiler scheduling
– Difficult assembly creation, analysis, and debug
– Complicated tool chain support
– Careful optimization req. for peak performance
GCN Quad SIMD
– 64 Single Precision multiply-add
– 4 SIMDs × 1 ALU op, occupancy limited
– No register port conflicts
– Standard compiler scheduling & optimizations
– Simplified assembly creation, analysis, & debug
– Simplified tool chain development and support
– Stable and predictable performance
GCN Vector Units
VLIW4 SIMD: Dependency Limited
– Instruction level parallelism: need to fill VLIW with four (or five) independent ops that can be run in parallel from the same program, each cycle!
GCN Quad SIMD: Occupancy Limited
– Data level parallelism: need to be able to run the same single instruction on 64 items of data
– Thread level parallelism: 4x as many wavefronts to occupy all SIMDs
GCN Vector ALU Characteristics
FMA (Fused Multiply Add): IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero and full de-normal support in hardware for SP and DP
MULADD: single cycle issue instruction without truncation, enabling a MULieee followed by ADDieee to be combined with round and normalization after both multiplication and subsequent addition
VCMP: A full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates
IEEE Rounding Modes: (Round to nearest even, Round toward +Infinity, Round toward –Infinity, Round toward zero) supported under program control anywhere in the shader. Double and single precision modes are controlled separately.
De-normal Programmable Mode: control for SP and DP independently. Separate control for input flush to zero and underflow flush to zero.
GCN Vector ALU Characteristics (Cont . . .)
DIVIDE ASSIST OPS: IEEE 0.5 ULP Division accomplished with a macro (SP/DP ~15/41 Instruction Slots respectively)
FP Conversion Ops: between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision and rounding
Exceptions Support: in hardware for floating point numbers with software recording and reporting mechanism. Inexact, Underflow, Overflow, division by zero, de-normal, invalid operation, and integer divide by zero operation
64-bit Transcendental Approximation: Hardware based double precision approximation for reciprocal, reciprocal square root and square root
24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
– Heavy use for Integer thread group address calculation
– 32-bit Integer MUL/MULADD @ DP FP Mul/FMA rate
GCN Scalar Unit
Fully Programmable Scalar Unit replaces FF Branch Logic
Operations such as JMP [GPR] are now supported
– Opens the door to e.g. virtual function calls
Has its own GPR pool and can execute normal ALU code
– 64-bit bitwise ops to mask thread execution
– 32-bit bitwise and integer arithmetic operations at full-speed
– Potential to offload scalar code (Vector ALU → Scalar ALU)
A GCN CU can dispatch 1 scalar op/clock (4 ops / 4 clocks)
GCN Scalar Unit
Natively a 64-bit integer ALU
Independent arbitration and instruction decode
– One ALU, memory or control flow op per cycle
512 Scalar GPR per SIMD shared between waves
– {SGPRn+1, SGPRn} pair provides a 64-bit register
4 CU Shared Read Only Scalar Data Cache: 16 KB
– 64B lines, 4 Way Assoc, LRU replacement policy
– Peak Bandwidth per CU is 16 bytes/cycle
[Diagram labels: Scalar Unit (8KB Registers, Integer ALU); Scalar Decode; 4 CU Shared 16KB Scalar R/O L1; R/W L2]
GCN Compute Unit – Hardware View
A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clks
– Each lane can dispatch 1 SP ALU operation per clock
– Each SP ALU operation takes 4 clocks to complete
The scheduler dispatches from a different wavefront each cycle
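The hardware view can be worked through numerically: a 64-wide wavefront on a 16-lane SIMD needs four passes, and four SIMDs together retire 256 operations in those four clocks. The sketch below assumes work items are processed in index order, 16 per clock; the slides show that layout in a diagram but do not spell out the mapping, and the function names are illustrative.

```python
SIMD_LANES = 16
SIMDS_PER_CU = 4
WAVEFRONT_SIZE = 64


def lane_and_pass(work_item):
    """Map a work item index (0..63) to (lane, pass) on a 16-lane SIMD.

    Assumes items are issued in index order, 16 per clock.
    """
    if not 0 <= work_item < WAVEFRONT_SIZE:
        raise ValueError("work_item must be in 0..63")
    return work_item % SIMD_LANES, work_item // SIMD_LANES


def clocks_per_wavefront_op():
    """Clocks one SIMD needs to push a whole wavefront through one SP op."""
    return WAVEFRONT_SIZE // SIMD_LANES


def cu_vector_ops_retired(clocks):
    """Peak SP vector ALU ops retired by one CU over the given clocks."""
    return SIMDS_PER_CU * SIMD_LANES * clocks


if __name__ == "__main__":
    print(lane_and_pass(63))            # (15, 3)
    print(clocks_per_wavefront_op())    # 4
    print(cu_vector_ops_retired(4))     # 256
```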
GCN Compute Unit – Programmer View
A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
– Each lane can dispatch 1 SP ALU operation per clock
– Each SP ALU operation still takes 4 clocks to complete
But you can PRETEND your code runs 1 op on 64-threads at once
[Diagram labels: WAVEFRONT 0 to WAVEFRONT 9; lanes 0 to 63; CLOCK 0, 4, 8, 12, 16, 20; Scalar Unit]
GCN Shader Code Example
float fn0(float a, float b)
{
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}

//Registers r0 contains “a”, r1 contains “b”
//Value is returned in r2
    v_cmp_gt_f32    r0,r1           //a > b, establish VCC
    s_mov_b64       s0,exec         //Save current exec mask
    s_and_b64       exec,vcc,exec   //Do “if”
    s_cbranch_vccz  label0          //Branch if all lanes fail
    v_sub_f32       r2,r0,r1        //result = a – b
    v_mul_f32       r2,r2,r0        //result = result * a
label0:
    s_andn2_b64     exec,s0,exec    //Do “else” (s0 & !exec)
    s_cbranch_execz label1          //Branch if all lanes fail
    v_sub_f32       r2,r1,r0        //result = b – a
    v_mul_f32       r2,r2,r1        //result = result * b
label1:
    s_mov_b64       exec,s0         //Restore exec mask

The s_cbranch_vccz and s_cbranch_execz instructions are optional: use them based on the number of instructions in the conditional section. They are executed in the branch unit.
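The exec-mask sequence in the listing can be emulated on the CPU to see how both sides of the branch run under predication. This is a sketch only: Python lists of booleans stand in for the 64-bit exec and VCC masks, and `fn0_wavefront` is a hypothetical name.

```python
def fn0_wavefront(a, b):
    """Emulate the predicated if/else from the listing for one wavefront.

    a and b are equal-length lists holding each lane's inputs.
    """
    n = len(a)
    exec_mask = [True] * n                 # all lanes active on entry
    r2 = [0.0] * n

    vcc = [exec_mask[i] and a[i] > b[i] for i in range(n)]   # v_cmp_gt_f32
    s0 = exec_mask[:]                                        # s_mov_b64 s0, exec
    exec_mask = [vcc[i] and exec_mask[i] for i in range(n)]  # s_and_b64

    if any(vcc):                           # s_cbranch_vccz skips when no lane passed
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (a[i] - b[i]) * a[i]

    exec_mask = [s0[i] and not exec_mask[i] for i in range(n)]  # s_andn2_b64

    if any(exec_mask):                     # s_cbranch_execz skips when no lane is left
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (b[i] - a[i]) * b[i]

    exec_mask = s0                         # s_mov_b64 exec, s0 (restore)
    return r2


if __name__ == "__main__":
    print(fn0_wavefront([3.0, 1.0, 2.0], [1.0, 3.0, 2.0]))   # [6.0, 6.0, 0.0]
```

When every lane takes the same path, one of the two `any(...)` checks fails and that side is skipped entirely, which is why coherent workloads branch cheaply.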
GCN Shader Authoring Tips
GCN VGPR Count    <=24   28   32   36   40   48   64   84   <=128   >128
Max Waves/SIMD      10    9    8    7    6    5    4    3       2      1
GCN has greatly improved branch performance, and it continues to improve
– Don’t be afraid to use it!
– But, remember: use it wisely – improved != free
– It’s at its best for highly coherent workloads (where most threads take the same path)
But, the new architecture is more susceptible to register pressure
– Using too many registers within a shader can reduce the maximum waves per SIMD!
– NOTE: A WAVEFRONT CAN ALLOCATE 104 USER SCALAR REGISTERS AS SEVERAL SCALAR REGISTERS ARE RESERVED FOR ARCHITECTURAL STATE
– Take caution with respect to the following:
Excessive nested branching/looping
– Loop Unrolling
Variable declarations (especially arrays)
Excessive function calls requiring storing of results
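The VGPR table on this slide is consistent with a simple budget: a 64 KB register file per SIMD, shared by 64 lanes of 4-byte registers, gives 256 VGPRs per lane to divide among resident waves. The sketch below reproduces the table under that reading; the allocation granularity of 4 registers is an assumption inferred from the table's column boundaries, and the function name is illustrative.

```python
VGPR_BUDGET_PER_LANE = (64 * 1024) // (64 * 4)   # 256 registers per lane
MAX_WAVES_PER_SIMD = 10                          # scheduler limit from the slides
ALLOC_GRANULARITY = 4                            # assumed, inferred from the table


def max_waves_per_simd(vgprs_per_thread):
    """Estimate the waves per SIMD allowed by a shader's VGPR count."""
    if not 1 <= vgprs_per_thread <= VGPR_BUDGET_PER_LANE:
        raise ValueError("VGPR count must be in 1..256")
    # Round the request up to the allocation granularity.
    blocks = -(-vgprs_per_thread // ALLOC_GRANULARITY)
    allocated = blocks * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET_PER_LANE // allocated)


if __name__ == "__main__":
    for count in (24, 28, 32, 36, 40, 48, 64, 84, 128, 129):
        print(count, max_waves_per_simd(count))
```

Scalar register and LDS limits are not modeled here, so the result is an upper bound on occupancy.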
Cache Hierarchy
Each CU has its own registers and local data share
L1 read/write caches
– 64 Bytes per clock L1 bandwidth per CU
32 KB instruction cache (I$) + 16 KB scalar data cache (K$), shared per 4 CUs, with L2 backing
L2 read/write cache partitions
– 64 Bytes per clock L2 bandwidth per partition
Global data share (GDS, 64 KB) facilitates synchronization between CUs
[Diagram labels: per-CU L1 caches; I$ and K$; GDS; L2 partitions, each shown with a 64-bit Dual Channel Memory Controller]
GCN Vector Memory Instructions
MUBUF – read from or perform write/atomic to an un-typed memory buffer/address
– Data type/size is specified by the instruction operation
MTBUF – read from or write to a typed memory buffer/address
– Data type is specified in the resource constant
MIMG – read/write/atomic operations on elements from an image surface
– Image objects (1-4 dimensional addresses and 1-4 dwords of homogeneous data)
– Image objects use resource and sampler constants for access and filtering
VECTOR MEMORY INSTRUCTIONS SUPPORT VARIABLE GRANULARITY FOR ADDRESSES AND DATA, RANGING FROM 32-BIT DATA TO 128-BIT PIXEL QUADS
GCN Export Memory Instruction
Exports move data from 1-4 VGPRs to Graphic Pipeline
– Color (MRT0-7), Depth, Position, and Parameter
Global Shared Memory Ops (Utilize GDS)
GCN LOW-LEVEL TIPS – How GCN Exports Work
The export unit writes results from the programmable stages of the graphics pipeline to the fixed function ones, such as tessellation, rasterization and the render back-ends, via the GDS
The GDS is identical to the local data shares, except that it is shared by all compute units, so it acts as an explicit global synchronization point between all wavefronts.
The atomic units in the GDS additionally support ordered count operations
GCN Local Data Share (LDS)
64 KB, 32 bank (or 16 bank) Shared Memory, fully decoupled from ALU instructions
Direct mode
– Vector Instruction Operand: 32/16/8 bit broadcast value
– Graphics Interpolation @ rate, no bank conflicts
Index Mode – Load/Store/Atomic Operations
– Bandwidth Amplification, up to 32 lanes of 32 bits serviced per clock (peak)
– Direct decoupled return to VGPRs
– Hardware conflict detection with auto scheduling
Software consistency/coherency for thread groups via hardware barrier
Fast & low power vector load return from R/W L1
GCN Local Data Share (LDS)
An LDS bank is 512 entries, each 32-bits wide
– A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units
– This means that several threads can read the same LDS location at the same time for FREE
– Writing to the same address from multiple threads also occurs at rate, last thread to write wins
Typically, the LDS will coalesce 32 lanes from one SIMD each cycle
– One wavefront is serviced completely every 2 cycles
– Conflicts automatically detected across 32 lanes from a wavefront and resolved in hardware
– An instruction which accesses different elements in the same bank takes additional cycles
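A small model helps when reasoning about which access patterns conflict. The sketch below counts how many passes a group of 32 lanes would need, assuming consecutive 32-bit words map to consecutive banks (a common layout, but not stated in the slides) and that lanes reading the same word are served together, as described above. The function names are illustrative.

```python
NUM_BANKS = 32
BANK_WIDTH_BYTES = 4   # each bank entry is 32 bits wide


def bank_of(byte_address):
    """Bank holding the given LDS byte address (assumed interleaved mapping)."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS


def conflict_passes(byte_addresses):
    """Passes needed to service one group of lane addresses.

    Lanes hitting the same word share a pass; different words in the
    same bank are serialized.
    """
    words_per_bank = {}
    for address in byte_addresses:
        word = address // BANK_WIDTH_BYTES
        words_per_bank.setdefault(bank_of(address), set()).add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


if __name__ == "__main__":
    print(conflict_passes([4 * lane for lane in range(32)]))     # 1, unit stride
    print(conflict_passes([0] * 32))                             # 1, broadcast
    print(conflict_passes([8 * lane for lane in range(32)]))     # 2, stride of 2 words
    print(conflict_passes([128 * lane for lane in range(32)]))   # 32, all in bank 0
```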
GCN R/W CACHE
Reads and writes cached
– Bandwidth amplification
– Improved behavior on more memory access patterns
– Improved write to read reuse performance
Relaxed consistency memory model
– Consistency controls available to control locality of load/store
GPU Coherent
– Acquire/Release semantics control data visibility across the machine (GLC bit on load/store)
– L2 coherent = all CUs can have the same view of data
Global atomics
– Performed in L2 cache
GCN L1 R/W Cache Architecture
– Each CU has its own L1 Cache
16 KB L1, 64B lines, 4 sets x 64 way
~64B/CLK per compute unit bandwidth
Write-through – alloc on write (no read) w/dirty byte mask
– Write-through at end of wavefront
– Decompression on cache read out
– Instruction GLC bit defines cache behavior
GLC = 0;
– Local caching (full lines left valid)
– Shader write back invalidate instructions
GLC = 1;
– Global coherent (hits within wavefront boundaries)
GCN L2 R/W Cache Architecture
– 64-128KB L2 per Memory Controller Channel
64B lines, 16 way set associative
~64B/CLK per channel for L2/L1 bandwidth
Write-back - alloc on write (no read) w/ dirty byte mask
Acquire/Release semantics control data visibility across CUs
– L2 coherent = all CUs can have the same view of data
Remote Atomic Operations
– Common Integer set & float Min/Max/CmpSwap
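The cache geometries quoted for the L1, L2, and scalar data caches can be cross-checked with the usual relation: sets = capacity / (line size × ways). The sketch below does that; `set_index` assumes plain modulo indexing on the line address, which is an illustration and not a documented property of GCN.

```python
def num_sets(cache_bytes, line_bytes, ways):
    """Number of sets in a set-associative cache."""
    lines, remainder = divmod(cache_bytes, line_bytes)
    if remainder or lines % ways:
        raise ValueError("capacity must be a whole number of sets")
    return lines // ways


def set_index(byte_address, line_bytes, sets):
    """Set selected by an address, assuming simple modulo indexing."""
    return (byte_address // line_bytes) % sets


if __name__ == "__main__":
    print(num_sets(16 * 1024, 64, 64))    # 4, the vector L1 (4 sets x 64 way)
    print(num_sets(64 * 1024, 64, 16))    # 64, the smaller L2 partition size
    print(num_sets(128 * 1024, 64, 16))   # 128, the larger L2 partition size
    print(num_sets(16 * 1024, 64, 4))     # 64, the shared scalar data cache
```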
GCN Latency & Bandwidth
– Each CU has 64 bytes per cycle of L1 bandwidth, shared with the GDS
– Per L2 there’s 64 bytes of data per cycle as well
– Peak Scalar Data Cache Bandwidth per CU is 16 bytes/cycle
– Peak I-Cache Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions)
– LDS Peak Bandwidth is 128 bytes of data per cycle via bandwidth amplification
– That’s nearly 4 TB/s of LDS BW, 2 TB/s of L1 BW, and 700 GB/s of L2 BW!
– 384-bit GDDR5 Main Memory has over 264 GB/sec bandwidth
– PCI Express 3.0 x16 bus interface to system (32GBps)
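The aggregate figures above can be reproduced from the per-unit numbers. The slides do not give the clock rates, so the sketch below assumes the Radeon HD 7970 reference values of a 925 MHz engine clock and a 5.5 Gbps GDDR5 data rate per pin, and it assumes 12 L2 partitions (768 KB of L2 at 64 KB per channel). All names are illustrative.

```python
# Assumptions not stated in the slides: HD 7970 reference clocks and
# the number of L2 partitions.
CORE_CLOCK_HZ = 925_000_000
GDDR5_BITS_PER_SEC_PER_PIN = 5_500_000_000
NUM_CUS = 32
NUM_L2_PARTITIONS = 12
MEMORY_BUS_BITS = 384


def aggregate_bandwidth(units, bytes_per_clock, clock_hz=CORE_CLOCK_HZ):
    """Peak bytes per second across identical units."""
    return units * bytes_per_clock * clock_hz


def dram_bandwidth():
    """Peak DRAM bytes per second for the 384-bit GDDR5 interface."""
    return MEMORY_BUS_BITS // 8 * GDDR5_BITS_PER_SEC_PER_PIN


if __name__ == "__main__":
    print(aggregate_bandwidth(NUM_CUS, 128) / 1e12)           # LDS, about 3.79 TB/s
    print(aggregate_bandwidth(NUM_CUS, 64) / 1e12)            # L1, about 1.89 TB/s
    print(aggregate_bandwidth(NUM_L2_PARTITIONS, 64) / 1e9)   # L2, about 710 GB/s
    print(dram_bandwidth() / 1e9)                             # DRAM, 264 GB/s
```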
GCN L1 Texture Cache
The memory hierarchy is re-used for graphics
– Some dedicated graphics hardware added
Address-gen unit receives 4 texture addr/clock
– Calculates 16 sample addr (nearest neighbors)
– Reads samples from L1 data cache
– Decompresses samples in Texture Mapping Unit (TMU)
TMU filters adjacent samples, produces <= 4 interpolated texels/clock
TMU output undergoes format conversion and is written into the vector register file
The format conversion hardware is also used for writing certain formats to memory from graphics shaders
GCN Virtual Memory and x86
The GCN cache hierarchy was designed to integrate with x86 microprocessors
The GCN virtual memory system can support 4KB pages
– Natural mapping granularity for the x86 address space
– Paves the way for a shared address space in the future
– IOMMU used for DMA transfers can already translate requests into x86 address space
GCN caches use 64B lines, which is the same size as x86 processors use
The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!
AMD Radeon™ HD 7900 Series – Codename “Tahiti”
Graphics Core Next Architecture
Up to 32 Compute Units
Dual Geometry Engines
8 Render Back-ends
– 32 color ROPs per clock
– 128 Z/stencil ROPs per clock
Up to 768KB read/write L2 cache
Fast 384-bit GDDR5 memory interface
– Up to 264 GB/sec
PCI Express 3.0 x16 bus interface
4.3 billion 28nm transistors
3.79 Peak Single-Precision TFLOPS
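The headline figures on this slide follow from simple arithmetic. The sketch below is an illustration only: it assumes the HD 7970 reference clocks (925 MHz engine, 5.5 Gbps effective GDDR5 data rate per pin), 64 ALU lanes per Compute Unit, and one fused multiply-add (2 FLOPs) per lane per clock. None of those values appear on the slide, and the function names are hypothetical.

```python
def peak_sp_tflops(compute_units, lanes_per_cu=64, engine_ghz=0.925):
    """Peak single-precision TFLOPS, counting one FMA (2 FLOPs) per lane per clock.

    lanes_per_cu and engine_ghz are assumed values, not taken from the slide.
    """
    flops_per_clock = compute_units * lanes_per_cu * 2
    return flops_per_clock * engine_ghz / 1000.0


def peak_memory_gb_per_s(bus_width_bits, gbps_per_pin=5.5):
    """Peak memory bandwidth in GB/s for a bus width and an assumed per-pin data rate."""
    return bus_width_bits * gbps_per_pin / 8.0


print(round(peak_sp_tflops(32), 2))   # 32 Compute Units
print(peak_memory_gb_per_s(384))      # 384-bit GDDR5 interface
```

Under these assumptions the two results line up with the 3.79 TFLOPS and 264 GB/sec quoted above.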
AMD Radeon™ HD 7900 Series – Compute Architecture
Dual Asynchronous Compute Engines (ACE)
– Operate in parallel with graphics command processor
– Independent scheduling and work item dispatch for efficient multi-tasking
3 Devices with 3 Command Queues!
– Fast context switching
– Exposed in OpenCL™
Dual DMA engines
– Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional)
AMD Radeon™ HD 7900 Series – Compute Architecture
High performance double precision floating point processing
– Up to 947 DP GFLOPS
– Higher utilization = more usable FLOPS
– IEEE compliant
More efficient flow control & branching
Full ECC protection for DRAM & SRAM
First GPU to fully support OpenCL 1.2, Direct3D + Compute 11.1, and C++ AMP
New compute instructions
GCN Architecture – ACE Intimate Details
ACEs are responsible for compute shader scheduling & resource allocation
Each ACE fetches commands from cache or memory & forms task queues
Tasks have a priority level for scheduling
– Background to Realtime
ACEs dispatch tasks to shader arrays as resources permit
Tasks complete out-of-order, tracked by ACE for correctness
Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs
GCN Architecture – ACE Intimate Details
ACEs are independent
– But, can synchronize and communicate via cache/mem/GDS
Can form task graphs
– Individual tasks can have dependencies on one another
– Can depend on another ACE
– Can depend on part of the graphics pipeline
Can control task switching
– Stop and start tasks and dispatch work to shader engines
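The idea of prioritized tasks with dependencies can be illustrated with a toy model. This is a minimal sketch and not how the hardware is implemented: it runs tasks one at a time, whereas real ACE tasks run concurrently and complete out-of-order. The task names, the priority numbers, and the `dispatch_order` function are all hypothetical.

```python
import heapq


def dispatch_order(tasks):
    """tasks: dict name -> (priority, [dependency names]); higher priority goes first.

    Returns the order in which a single toy queue would dispatch the tasks,
    releasing a task only after all of its dependencies have completed.
    """
    remaining = {name: set(deps) for name, (_, deps) in tasks.items()}
    ready = [(-prio, name) for name, (prio, deps) in tasks.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, name = heapq.heappop(ready)
        order.append(name)
        # Completing a task may unblock tasks that were waiting on it.
        for other, deps in remaining.items():
            if name in deps:
                deps.remove(name)
                if not deps:
                    heapq.heappush(ready, (-tasks[other][0], other))
    return order


# Hypothetical workload: "gbuffer" stands in for a graphics-pipeline dependency.
example = {
    "physics": (2, []),
    "particles": (1, ["physics"]),
    "background_bake": (0, []),
    "gbuffer": (1, []),
    "post_fx": (3, ["gbuffer"]),
}
print(dispatch_order(example))
```

In this model a high-priority task such as `post_fx` still waits for its dependency, which mirrors the "tasks can have dependencies on one another" point above.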
GCN Architecture – Enabling Compute Workloads
The focus in GPU hardware is shifting away from graphics-specific units, towards general-purpose compute units
7900 Series GCN-based ASICs already have “3:1” ratio of ACE : Graphics CP
– Graphics CP can dispatch compute
– ACE cannot dispatch graphics
If you aren’t writing Compute Shaders, you’re probably not getting the absolute most out of modern GPUs
– Control of LDS, barriers, thread layout, etc.
Future Trends:
More Compute Units
– ALU outpaces BW
CPU + GPU Flat Mem
– APU + dGPU
Less FF Graphics
– Can you write a Compute-based graphics pipeline? Start thinking about it…
Utilization and Efficiency
Higher utilization = higher performance per sq.mm
[Chart: AMD Radeon HD 7970 vs. AMD Radeon HD 6970 (0x to 5x) on Mandelbrot DP, AES256, SHA256, LuxMark, and SmallptGPU; bars are split into GFLOPS increase (1.4x) and utilization improvement]
GCN Geometry Engine
GS in conjunction with Tessellation is faster than before…
However… memory is still the bottleneck!
– Minimize the number of inputs and outputs for best performance…
– Small expansions can be done in LDS!
Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
– Caveat: tiny triangles can mean that we don’t reach this potential, and become raster-bound!
[Image from Battlefield 3, EA DICE: tessellation off vs. tessellation on]
GCN Tessellation
Latest iteration of hardware tessellation units
– Increased vertex re-use
– Off-chip buffering improvements
– Larger parameter caches
Improves performance at all tessellation factors
– Up to 4x throughput of AMD Radeon HD 6900 series (Gen 8)
[Image from Battlefield 3, EA DICE: tessellation off vs. tessellation on]
GCN Tessellation – Performance
[Chart: tessellation rate of AMD Radeon™ HD 7970 relative to AMD Radeon™ HD 6970 (0.0x to 4.5x) versus tessellation factor (1 to 31)]
[Chart: relative performance (0 to 2.5) in Crysis 2, Total War: Shogun 2, Lost Planet 2, Unigine Heaven, and S.T.A.L.K.E.R.: Call of Pripyat; bar labels: 60% faster, 139% faster, 62% faster, 55% faster, 82% faster]
GCN Tessellation – Best Practices
While performance is much improved, it is still a potential bottleneck!
– Produces a great deal of ring bus traffic, starving other parts of the pipeline
Best performance achieved with tessellation factors less than 15!
Continue to Optimize:
– Pre-triangulate
– Frustum Culling
– Backface Culling
– Distance-adaptive
– Screen-space adaptive
– Orientation-adaptive
– Etc…
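As one illustration of the distance-adaptive item, the sketch below computes a tessellation factor that falls off linearly with camera distance and stays below the factor-15 guideline. It is written in Python for clarity rather than as hull-shader code. The `near` and `far` distances, the cap of 14, and the linear falloff are assumptions chosen for the example.

```python
def adaptive_tess_factor(distance, near=5.0, far=100.0, max_factor=14.0):
    """Tessellation factor that falls off linearly with distance.

    The result is clamped to [1, max_factor]; max_factor defaults to 14 so the
    factor stays below 15. near, far and the linear ramp are assumed values.
    """
    if distance <= near:
        return max_factor
    if distance >= far:
        return 1.0
    t = (distance - near) / (far - near)   # 0 at near, 1 at far
    return max_factor + t * (1.0 - max_factor)


for d in (0.0, 52.5, 200.0):
    print(d, adaptive_tess_factor(d))
```

Patches outside the frustum or facing away from the camera would typically be given a factor of 0 so they are culled before tessellation; that step is omitted here.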
Render Back-End (RBE) ON GCN ASICS
Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)
– 16KB Color Cache: up to 8 color samples (i.e. 8x MSAA)
– 4KB Depth Cache: up to 16 coverage samples (i.e. 16x EQAA)
– Write out through the memory controllers
Logic Operations as alternative to Blending
– Exposed in DX11.1
– Already available in OpenGL
Dual-Source Color Blending with MRTs
– Only available in OpenGL
DEPTH IMPROVEMENTS ON GCN ASICS
Allows fast accept of fully-visible triangles spanning one or more tiles
– If a triangle is fully covering a tile then cost is only 1 clock/tile
Depth Bounds Testing Extension
– Exposed in OpenGL: GL_EXT_depth_bounds_test
– Also exposed in Direct3D via an extension – ask us if you’d like to try it
24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS
STENCIL IMPROVEMENTS ON GCN ASICS
GCN has support for new extended stencil ops compared to prior ASICs
– Only available in OpenGL: GL_AMD_stencil_operation_extended
– Additional stencil ops: AND, XOR, NOR, REPLACE_VALUE_AMD, etc.
– Also exposes additional stencil op source value
  Can be used as an alternative to stencil ref value
Stencil ref and op source value can now be exported from pixel shader
– Only available in OpenGL: GL_AMD_shader_stencil_value_export
GCN LOW-LEVEL TIPS – GPR Utilization
GPRs and GPR pressure
General Purpose Registers (GPR) are a limited resource
– Separate banks of GPRs for Vector and Scalar (per SIMD)
– Maximum of 256 VGPRs and 512 SGPRs shared across all waves (up to 10) owned by a SIMD
– Organized as 64 words of 32-bits – two adjacent GPRs can be combined for 64-bit (4 for 128-bit)
– Number of GPRs required by a shader affects SIMD scheduling and execution efficiency
– Shader tools can be used to determine how many GPRs are used
GPR pressure is affected by:
– Loop Unrolling
– Long lifetime of temporary variables
– Nested Dynamic Flow Control instructions
– Fetch dependencies (e.g. indexed constants)
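The link between GPR count and scheduling can be made concrete with a rough occupancy estimate based on the limits quoted above (256 VGPRs and 512 SGPRs per SIMD, at most 10 waves). This is a simplified sketch: it ignores register allocation granularity, per-wave register caps, and LDS limits, all of which real hardware also applies. The function name is hypothetical.

```python
def max_waves_per_simd(vgprs_per_wave, sgprs_per_wave,
                       vgpr_budget=256, sgpr_budget=512, wave_limit=10):
    """Wavefronts one SIMD can keep resident, limited by whichever register file runs out first.

    Simplified: no allocation granularity and no LDS limit are modelled.
    """
    by_vgpr = vgpr_budget // max(vgprs_per_wave, 1)
    by_sgpr = sgpr_budget // max(sgprs_per_wave, 1)
    return min(wave_limit, by_vgpr, by_sgpr)


for vgprs in (24, 32, 64, 128):
    print(vgprs, "VGPRs ->", max_waves_per_simd(vgprs, sgprs_per_wave=32), "waves")
```

In this model a shader needs to stay at or below 25 VGPRs to reach the full 10 waves, and every doubling of register use beyond that roughly halves the number of waves available to hide latency.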
GCN LOW-LEVEL TIPS – Texture Filtering
– All shader stages can fetch textures
– Point sampling is full-rate on all formats
– Trilinear costs up to 2x the bilinear filtering cost
– Anisotropic (N taps) costs <= ( N * bilinear )
– Avoid cache thrashing
  Use MIPmapping
  Use Gather() where applicable
Exploit implicit neighbouring pixel shader thread CU locality:
– Remember that sampling from neighbouring texels has a lower cost for a shader running within the same hardware tile, because it is more likely to experience a cache hit within the Compute Unit’s local texture cache
– Exploit this explicitly by using Compute Shaders
GCN BILINEAR COSTS
Quarter-rate
• RGBA32 and RGBA32F
Half-rate
• RG32, RG32F
• RGBA16, RGBA16F
• BC6
Full-rate
• Everything else!
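The filtering rules and the bilinear cost table can be combined into a small worst-case cost model. This is a sketch only: the unit is "one full-rate bilinear fetch", the format strings are taken from the table, and the function and dictionary names are hypothetical. Real costs also depend on cache behaviour, which is not modelled.

```python
# Relative bilinear cost per format, from the table: 4 = quarter-rate, 2 = half-rate.
BILINEAR_COST = {
    "RGBA32": 4, "RGBA32F": 4,
    "RG32": 2, "RG32F": 2, "RGBA16": 2, "RGBA16F": 2, "BC6": 2,
}


def worst_case_filter_cost(fmt, mode="bilinear", aniso_taps=1):
    """Upper bound on filtering cost, in units of one full-rate bilinear fetch."""
    base = BILINEAR_COST.get(fmt, 1)      # everything else is full-rate
    if mode == "point":
        return 1                          # point sampling is full-rate on all formats
    if mode == "bilinear":
        return base
    if mode == "trilinear":
        return 2 * base                   # up to 2x the bilinear cost
    if mode == "anisotropic":
        return aniso_taps * base          # at most N * bilinear
    raise ValueError(f"unknown filter mode: {mode}")


print(worst_case_filter_cost("RGBA8", "trilinear"))
print(worst_case_filter_cost("RGBA16F", "trilinear"))
print(worst_case_filter_cost("RGBA32F", "anisotropic", aniso_taps=16))
```

The model makes the trade-off visible: moving from a full-rate format to a quarter-rate one multiplies every filtered fetch by up to four before trilinear or anisotropic costs are added.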
GCN LOW-LEVEL TIPS – Color Output
PS output: each additional color output increases export cost
Export cost can be more costly than PS execution
– Each (fast) export is equivalent to 64 ALU ops on 7970
– If shader is export-bound then use “free” ALU for packing instead
Watch out for those cases
– E.g. G-Buffer parameter writes
– MINIMIZE SHADER INPUTS AND OUTPUTS!
– Pack, pack, pack, pack!
– Discard/clip allow the shader hardware to skip the rest of the work!
Costs of outputting and blending various formats
Quarter-rate
• RGBA16 with blending
• RGBA32F with blending
Half-rate
• R16, RG16 with blending
• RG32F with blending
• RGBA32, RGBA32F
Full-rate
• Everything else!
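The "64 ALU ops per export" figure suggests a quick test for export-bound shaders, and packing is the usual remedy. The sketch below assumes every output is a fast (full-rate) export and uses a simple 4x8-bit UNORM packing as the example; both function names are hypothetical and the packing layout is an illustrative choice, not a recommendation from the slide.

```python
ALU_OPS_PER_FAST_EXPORT = 64   # figure quoted on the slide for the 7970


def is_export_bound(alu_ops, fast_exports):
    """True when the shader's ALU work is shorter than the time its exports take."""
    return alu_ops < fast_exports * ALU_OPS_PER_FAST_EXPORT


def pack_unorm8x4(r, g, b, a):
    """Pack four [0, 1] floats into one 32-bit word so they share a single channel."""
    def to_u8(x):
        return int(round(min(max(x, 0.0), 1.0) * 255.0))
    return to_u8(r) | (to_u8(g) << 8) | (to_u8(b) << 16) | (to_u8(a) << 24)


print(is_export_bound(alu_ops=100, fast_exports=4))   # 100 ALU ops vs. 4 exports
print(hex(pack_unorm8x4(1.0, 0.0, 1.0, 0.0)))
```

When the check reports an export-bound shader, spending a few extra ALU instructions on packing is effectively free, which is the point of the "Pack, pack, pack, pack!" advice.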
GCN Media Processing Instructions
SAD = Sum of Absolute Differences
Critical to many video & image processing algorithms
– Motion detection
– Gesture recognition
– Video & image search
– Stereo depth extraction
– Computer vision
SAD (4x1) and QSAD (four 4x1) instructions
– New QSAD combines SAD with alignment ops for higher performance and reduced power draw
– Evaluate up to 256 pixels per CU per clock cycle!
Maskable MQSAD instruction
– Allows background pixels to be ignored
– Accelerated isolation of moving objects
[Figure: SAD block-matching example; candidate blocks score SAD = 59, SAD = 45, SAD = 58, and SAD = 22; closest match: SAD = 22]
AMD Radeon HD 7970 can evaluate 7.6 Terapixels/sec *
* Peak theoretical performance for 8-bit integer pixels
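For readers unfamiliar with SAD, the sketch below shows the operation in plain Python and reproduces the peak-rate arithmetic. The pixel values are made up for the example, the function names are hypothetical, and the throughput calculation assumes 32 Compute Units at a 925 MHz engine clock (the clock is not stated on the slide).

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def closest_match(reference, candidates):
    """Index and SAD score of the candidate block most similar to the reference."""
    scores = [sad(reference, c) for c in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def peak_sad_pixels_per_second(compute_units=32, pixels_per_cu_clock=256,
                               engine_hz=925e6):
    """Peak rate from the 256 pixels per CU per clock figure; engine_hz is assumed."""
    return compute_units * pixels_per_cu_clock * engine_hz


reference = [10, 200, 35, 90]                      # one 4x1 block of 8-bit pixels
candidates = [[12, 190, 40, 100], [10, 201, 35, 88], [0, 0, 0, 0]]
print(closest_match(reference, candidates))
print(round(peak_sad_pixels_per_second() / 1e12, 1), "Terapixels/sec")
```

A motion-estimation search repeats `closest_match` over many candidate positions per block, which is why a dedicated instruction for the inner sum pays off.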
GCN Video Codec Engine
Video Codec Engine (VCE)
– Hardware H.264 Compression and Decompression
Ultra-low-power, fully fixed-function mode
– Capable of 1080p @ 60 frames / second
– Programmable for Ultra High Quality and/or Speed
Entropy encoding block fully accessible to software
– AMD Accelerated Parallel Programming SDK
– OpenCL™
Create hybrid faster-than-real-time encoders!
– Custom motion estimation
– Inverse DCT and motion compensation
– Combine with hardware entropy encoding!
AMD Radeon HD 7970 can compress Realtime+ 1080p H.264
IMPORTANT GCN ARCHITECTURE IMPROVEMENTS
– Increased flexibility and efficiency, with reduced complexity!
  Non-VLIW Architecture improves efficiency while reducing programmer burden
  Constants/resources are just address + offset now in the hardware
  UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
  GPU has virtual memory, forward looking towards x86 CPU + GPU flat memory
– Strong forward-looking focus on Compute
  Scalar ALU for complex dynamic control flow + branch & message unit
  64k LDS/CU, 64k GDS, atomics at every stage, coherent cache hierarchy
  Multiple Asynchronous Compute Engines (ACE) for multitasking compute
MAIN GCN ARCHITECTURE TAKEAWAYS
– GCN generally simplifies your life as a programmer
  Don’t: fret too much about instruction grouping, or vectorization
  Do: Think about GPR utilization & LDS usage (impacts max # of wavefronts)
  Do: Think about thread/cache locality when you structure your algorithm
  Do: Pack shader inputs and outputs – aim to be as IO/bandwidth thin as possible!
– Unlimited number of addressable constants/resources
  N constants aren’t free anymore – each consumes resources, use sparingly!
– Compute is the future – exploit its power for GPGPU work & graphics!
THANK YOU
IF WE HAVE TIME REMAINING, WE CAN COVER PARTIALLY RESIDENT TEXTURES
Layla [email protected] @MissQuickstep
Partially Resident Textures (PRT)
MegaTexture in id Tech5
Partially Resident Textures (PRT) – Introduction
Enables application to manage more texture data than can physically fit in a fixed footprint
– A.k.a. “Virtual texturing” and “Sparse texturing”
The principle behind PRT is that not all texture contents are likely to be needed at any given time
– Current render view may only require selected portions of a texture to be resident in memory
– Or, only selected MIPMap levels…
PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
– Texture data can be streamed in on-demand
– Texture sizes up to 32TB (16k x 16k x 8k x 128-bit)
OpenGL extension – GL_AMD_sparse_texture
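The 32TB figure can be checked directly from the maximum dimensions. A minimal sketch, with a hypothetical helper name and sizes expressed in binary units (1 TB = 2^40 bytes):

```python
def texture_bytes(width, height, depth, bits_per_texel):
    """Size in bytes of the top MIP level of a texture (MIP chain not included)."""
    return width * height * depth * bits_per_texel // 8


K = 1024
largest = texture_bytes(16 * K, 16 * K, 8 * K, 128)
print(largest // 2**40, "TB")                       # 16k x 16k x 8k x 128-bit

single_2d = texture_bytes(16 * K, 16 * K, 1, 32)
print(single_2d // 2**30, "GB")                     # one 16k x 16k 32-bit 2D texture
```

Even a single 16k x 16k 32-bit texture occupies 1 GB at full resolution, which is why keeping only the needed portions resident matters.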
Partially Resident Textures (PRT) – TEXTURE TILES
The PRT texture is chunked into 64 KB tiles
– Fixed memory size
– Not dependent on texture type or format
[Figure: three panels. Highlighted areas represent texture data that needs highest resolution; chunked texture; texture tiles needing to be resident in GPU memory]
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
PRT – Translation Table
The GPU virtual memory page table translates 64 KB tiles into a resident texture tile pool
[Figure: Texture Map, Page Table (mapped and unmapped page entries), and Texture Tile Pool (video memory, linear storage) holding 64 KB tiles]
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
PRT – Translation Table – Mip Maps
Not all tiles from the texture map are actually resident in video memory
PRT hardware page table stores virtual-to-physical mappings
[Figure: Texture Map with MIP levels, Page Table (mapped and unmapped page entries), and Texture Tile Pool (video memory) holding 64 KB tiles]
Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
PRT – TILE MANAGEMENT
The Application is responsible for uploading/releasing new PRT tiles!
A common scenario is to upload lower MIPMaps to texture tile pool
– This allows a full representation of the PRT contents to be resident in memory (albeit at lower resolution)
– e.g. MIP LOD 6 and above for a 16kx16k 32-bit texture is about 650 KB (256x256 resolution)
Texture tiles corresponding to higher resolution areas are uploaded by the application as needed
– e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen pixels ratio increases, thus higher LOD tiles need uploading
PRT – “FAILED” FETCH
How does the application know which texture tiles to upload?
Answer: PRT-specific texture fetch instructions in pixel shader
– Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not in the pool
– OpenGL example: int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel );
This information is then stored in render target or UAV
– Texel fetch failed for a given (x,y) tile location
...and then copied to the CPU so that application can upload required tiles
– App chooses what to render until missing data gets uploaded
PRT – “LOD WARNING” TEXEL FETCH CONDITION
PRT fetch condition code can also indicate an “LOD Warning”
The minimum LOD warning is specified by the application on a per texture basis
– OpenGL example: glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );
If a fetched pixel’s LOD is < the specified LOD warning value then the condition code is returned
This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
– E.g. Camera getting closer to PRT-mapped geometry
PRT – Example Usage
1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API
2. App uploads MIP levels using API calls
3. Shader fetches PRT data at specified texcoords
Two possibilities:
3.a. Texel data belongs to a resident (64KB) tile
- Valid color returned, no error code
3.b. Texel data points to non-resident tile or specified LOD
- Error/LOD Warning code returned
- Shader writes tile location and error code to RT or UAV
4. App reads RT or UAV and uploads/releases new tiles as needed
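The steps above can be sketched as a CPU-side simulation of the feedback loop. This is a toy model, not the PRT API: the `SparseTexture` class, `render_pass` function, and single-value tiles are hypothetical, and the 128x128 tile footprint assumes 32-bit texels laid out in square 64 KB tiles.

```python
TILE_TEXELS = 128   # assumed: 128 x 128 texels of 32 bits = 64 KB per tile


class SparseTexture:
    """Toy PRT: a page table mapping tile coordinates to resident tile data."""

    def __init__(self):
        self.page_table = {}          # (tile_x, tile_y) -> data for resident tiles

    def upload_tile(self, tile, data):
        self.page_table[tile] = data

    def release_tile(self, tile):
        self.page_table.pop(tile, None)

    def fetch(self, x, y):
        """Return (resident, value), mimicking a fetch that can report failure."""
        tile = (x // TILE_TEXELS, y // TILE_TEXELS)
        if tile in self.page_table:
            return True, self.page_table[tile]
        return False, None


def render_pass(texture, sample_points, fallback=0):
    """'Shader' pass: sample the texture and record every tile whose fetch failed."""
    feedback = set()                  # stands in for the RT/UAV written by the shader
    colors = []
    for x, y in sample_points:
        resident, value = texture.fetch(x, y)
        if resident:
            colors.append(value)
        else:
            colors.append(fallback)   # app chooses what to show until data arrives
            feedback.add((x // TILE_TEXELS, y // TILE_TEXELS))
    return colors, feedback


tex = SparseTexture()
tex.upload_tile((0, 0), 7)                        # step 2: initial upload
points = [(5, 5), (130, 10), (140, 20)]
first_colors, missing = render_pass(tex, points)  # step 3: some fetches fail
for tile in missing:                              # step 4: app uploads what was missed
    tex.upload_tile(tile, 9)
second_colors, still_missing = render_pass(tex, points)
print(first_colors, missing, second_colors, still_missing)
```

The second pass returns valid data for every sample, since the only tile that failed in the first pass has been uploaded in between.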
PRT Types, Formats and Dimensions
All texture types and formats supported
– 1D, 2D, cube, arrays and 3D volume textures
– All common texture formats, including compressed formats
– Maximum dimensions: 16K x 16K x 8K x 128 bit textures
Hardware PRT > Software Implementation
PRT (compared with SW Implementation)
Ease of implementation
• Complexity hidden behind HW & API
Full filtering support
• Includes anisotropic filtering
Full-speed filtering
• SW solution requires “manual” filtering
• Software anisotropic is very costly
Don’t go overboard with PRT allocation!
• Page table entry size is 4 DWORDs
• Have to be resident in video memory
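The warning about over-allocation becomes clearer with a rough estimate of page table size. The sketch assumes one 4-DWORD (16-byte) entry per 64 KB tile and ignores MIP levels; that one-entry-per-tile mapping is an assumption, and the function name is hypothetical.

```python
PTE_BYTES = 4 * 4            # 4 DWORDs per page table entry
TILE_BYTES = 64 * 1024       # one PRT tile


def page_table_bytes(virtual_texture_bytes):
    """Video memory used by page table entries for a PRT of the given virtual size.

    Assumes one entry per 64 KB tile; MIP levels are not counted.
    """
    tiles = -(-virtual_texture_bytes // TILE_BYTES)   # ceiling division
    return tiles * PTE_BYTES


print(page_table_bytes(2**30) // 1024, "KB")          # 1 GB virtual texture
print(page_table_bytes(32 * 2**40) // 2**30, "GB")    # 32 TB virtual texture
```

Under these assumptions a maximum-size PRT would need more page table memory than the board has video memory, which is why allocations should be sized to what the application can actually use.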
Questions?
^_^
Layla [email protected] @MissQuickstep
Trademark AttributionAMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.
©2012 Advanced Micro Devices, Inc. All rights reserved.
Backup Slides
SHADER CODE EXAMPLE 2
float fn0(float a, float b)
{
    float c = 0.0;
    float d = 0.0;
    for (int i = 0; i < 100; i++)
    {
        if (c > 113.0)
            break;
        c = c * a + b;
        d = d + 1.0;
    }
    return (d);
}

// Registers r0 contains “a”, r1 contains “b”, r2 contains “c”
// and r3 contains “d”
// Value is returned in r3

        v_mov_b32       r2, #0.0          // float c = 0.0
        v_mov_b32       r3, #0.0          // float d = 0.0
        s_mov_b64       s0, exec          // Save execution mask
        s_mov_b32       s2, #0            // i=0
label0:
        s_cmp_lt_s32    s2, #100          // i<100
        s_cbranch_sccz  label1            // Exit loop if not true
        v_cmp_le_f32    r2, #113.0        // vcc = (c <= 113.0), i.e. lanes that do not break
        s_and_b64       exec, vcc, exec   // Update exec mask on fail
        s_branch_execz  label1            // Exit if no lanes remain active
        v_mul_f32       r2, r2, r0        // c = c*a
        v_add_f32       r2, r2, r1        // c = c+b
        v_add_f32       r3, r3, #1.0      // d = d+1.0
        s_add_s32       s2, s2, #1        // i++
        s_branch        label0            // Jump to start of loop
label1:
        s_mov_b64       exec, s0          // Restore exec mask
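The way the execution mask lets lanes leave the loop independently can be mimicked on the CPU. The sketch below is an illustrative emulation, not generated ISA: it runs `fn0` for several lanes in lock-step with a shared loop counter, using a per-lane boolean list in place of the `exec` register. The lane count and input values are arbitrary.

```python
def fn0(a, b):
    """Scalar reference: the shader function from the listing."""
    c = 0.0
    d = 0.0
    for _ in range(100):
        if c > 113.0:
            break
        c = c * a + b
        d += 1.0
    return d


def fn0_wavefront(a_lanes, b_lanes):
    """Run fn0 for every lane in lock-step, masking out lanes that have hit `break`."""
    n = len(a_lanes)
    c = [0.0] * n
    d = [0.0] * n
    exec_mask = [True] * n
    saved_exec = list(exec_mask)                   # save execution mask
    i = 0
    while i < 100:                                 # scalar counter shared by all lanes
        vcc = [c[lane] <= 113.0 for lane in range(n)]
        exec_mask = [e and v for e, v in zip(exec_mask, vcc)]
        if not any(exec_mask):                     # every lane has left the loop
            break
        for lane in range(n):
            if exec_mask[lane]:                    # vector ops only touch active lanes
                c[lane] = c[lane] * a_lanes[lane] + b_lanes[lane]
                d[lane] += 1.0
        i += 1
    exec_mask = saved_exec                         # restore execution mask
    return d


a = [1.0, 2.0, 0.5, 1.0]
b = [1.0, 1.0, 1.0, 50.0]
print(fn0_wavefront(a, b))
print([fn0(x, y) for x, y in zip(a, b)])
```

Both lines print the same per-lane results. The wavefront as a whole keeps iterating until its slowest lane finishes, which is the cost of divergent loop exits.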