GDC 2013: Powering The Next Generation Of Graphics AMD GCN...

1| AMD Radeon™ HD 7900 Series Graphics | December 2011 | Confidential – NDA Required

POWERING THE NEXT GENERATION OF GRAPHICS: AMD GCN ARCHITECTURE

Layla Mah – [email protected] Relations Engineer, AMD

@MissQuickstep



AMD GRAPHICS CORE NEXT


GPU Evolution

1ST ERA: Fixed Function
2ND ERA: Simple Shaders
3RD ERA: Graphics Parallel Core

[Diagram labels: Lighting; 3D Geometry Transformation; VLIW5; VLIW4; General Purpose Registers; Stream Processing Units; FMAD + Special Functions; Branch Unit]








Prior to 2002: Graphics-specific hardware
– Texture mapping/filtering
  Multi-texturing
– “T&L Engines”
  Geometry processing (Transform)
  Rasterization (Lighting)
– Dedicated texture and pixel caches

Dot product and scalar multiply-add
– Sufficient for basic graphics tasks
– No general purpose compute capability






[Diagram labels: Memory Interface; 8 Vertex Pipes; Setup Engine; Pixel Shader Core; 16 Pixel Pipes]


The Rise of Shaders

2002-2006: Graphics-focused programmability
– DirectX 8/9, OpenGL 2.0
– Floating point processing (IEEE not required)
  Different precision per IHV
  – ATI 24-bit full-speed
  – NV 16-bit full-speed
  – NV 32-bit half-speed
– Specialized ALUs for vertex & pixel processing
Added dedicated caches

Shader Models 1.0 - 2.0
– VS and PS are distinct
– Minimal Instruction Sets
– Limited Instruction Slots
– Limited Shader Lengths
– No Dynamic Flow Control
– No Looping Constructs
– No Vertex Texture Fetch
– No Bitwise Operators
– No Native Integer ALU
– Etc.








The Rise of The Unified Shader (VLIW-5)

5-Element Very-Long-Instruction-Word (XYZWT)
– Began with Xenos and utilized from R600 until “Cayman”
– Flexible and optimized for Graphics workloads
  Ideal for 4-element Vector and 4x4 Matrix Operations
  – Vector/Vector math in a single instruction
  Plus One Transcendental-Unit function per Instruction

More advanced caching
– Instruction, constant, multi-level texture/data, & later: LDS/GDS

Single Precision 32-bit IEEE-Compliant Floating Point ALUs

More flexible: Unified ALU, Branch Unit, Dynamic Flow Control, Vertex Texture, Geometry Shader, Tessellation Engines, etc.








Optimized For Die Area Efficiency (VLIW-4)

4-Element Very-Long-Instruction-Word (XYZW)
– Profiling showed average VLIW utilization was < 3.4/5

Removed dedicated T-Unit
– Optimized die area usage
– Each ALU has a smaller LUT, combined using 3-term Lagrange polynomial interpolation (transcendental/clock/VLIW4)
– Better optimized for combination of Graphics & Compute
  Graphics is still the primary focus, but compute is gaining attention

Still ideal for 4-element Vector and 4x4 Matrix Operations

Fewer ALU bubbles in transcendental-light code, better utilization
– Simplified programming and optimization relative to VLIW-5
– Multiple dispatch processors & separate command queues

Improved support for DirectCompute™ and OpenCL™


Graphics Core Next Architecture

A new GPU design for a new era of computing
– Cutting-edge graphics performance and features
– High compute density with multi-tasking
– Built for power efficiency
– Optimized for heterogeneous computing
– Enabling the Heterogeneous System Architecture (HSA)
– Amazing scalability and flexibility

A new GPU design for a new era of next generation computing…
– Unlimited Resources & Samplers (Including Unlimited UAV/SRV at any shader stage)
– All UAV formats can be read/write (vs. just single uint32 in D3D11 API spec)
– Simpler Assembly Language
– Simpler Shader Code (No More Clauses)
– Ability to support C/C++ (like)
– Architectural support for traps, exceptions & debugging
– Ability to share virtual x86-64 address space with CPU cores


GCN Compute Unit

Basic GPU building block
– New instruction set architecture
  Non-VLIW
  Vector unit + scalar co-processor
  Distributed programmable scheduler
– Each compute unit can execute instructions from multiple kernels at once
– Increased instructions per clock per mm²

Designed for high utilization, high throughput, and multi-tasking

[Diagram labels: Branch & Message Unit; Scalar Unit; Scheduler; Vector Units (4x SIMD-16); Vector Registers (4x 64KB); Scalar Registers (8KB); Local Data Share (64KB); L1 Cache (16KB); Texture Filter Units (4); Texture Fetch Load / Store Units (16)]


GCN Compute Unit – Specifics

1 Fully Programmable Scalar ALU
– Shared by all threads of a wavefront
– Used for flow control, pointer arithmetic, etc.
– Has own GPRs, scalar data cache, etc.

1 Branch & Message Unit
– Executes branch instructions (as dispatched by Scalar unit)

4 [16-lane] Vector ALU (SIMD)
– CU Total Throughput: 64 SP ops/clock
– 1 SP (Single-Precision) op per 4 clocks
– 1 DP (Double-Precision) ADD in 8 clocks
– 1 DP MUL/FMA/Transcendental per 16 clocks



GCN Compute Unit – Specifics (Cont.)

64 KB Local Data Share (LDS)
– 2x larger than D3D11 TGSM limit (32 KB per thread group)
– 32 banks, with conflict resolution
– Bandwidth amplification
– Separate Instruction Decode

16 KB read/write L1 vector data cache

Texture Units (Utilize L1)
– 4 Filter, 16 Load/Store

Scheduler (2560 Threads)
– Separate decode/issue for VALU, SALU/SMEM, VMEM, LDS, GDS/Export
– + Special instructions (NOPs, barriers, etc.) and branch instructions



GCN Compute Unit – SIMD Specifics

Each SIMD unit is assigned its own 40-bit program counter and instruction buffer for 10 wavefronts
– The whole CU can have 40 wavefronts in flight
– Each potentially from a different work-group or kernel

Each SIMD is a 16-lane ALU
– IEEE-754 SP and DP
  Full-speed denormals + All Rounding Modes
  32-bit FMA and 24-bit INT at full-speed
  DP and 32-bit INT at reduced rate (1/2 to 1/16)
– 64 KB vector register file
– Issue 1 SP instruction per lane per clock
  Retire 64 lanes (1 wavefront) of SP ALU in 4 clocks

A GCN GPU with 32 CUs, such as the AMD Radeon™ HD 7970, can be working on up to 81,920 work items at a time!
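The 81,920 figure follows from the limits quoted on this slide: 32 CUs, 4 SIMDs per CU, 10 wavefronts per SIMD, and 64 work items per wavefront. A minimal Python sketch of that arithmetic (the function and parameter names are illustrative, not part of any AMD API):

```python
def max_work_items_in_flight(num_cus, simds_per_cu=4, waves_per_simd=10, wave_size=64):
    """Upper bound on resident work items, using the per-CU limits from the slides."""
    return num_cus * simds_per_cu * waves_per_simd * wave_size


# A single CU holds 4 * 10 * 64 = 2560 work items; a 32-CU part holds 32 times that.
print(max_work_items_in_flight(1))
print(max_work_items_in_flight(32))
```

This is a theoretical ceiling only; register and LDS usage can lower the number of resident wavefronts.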




GCN Compute Unit – Scheduler Specifics

On GCN, each CU has its own dedicated Scheduler unit, supporting up to 2560 threads per CU
– Schedules this work between the 4 SIMDs in groups called “wavefronts”
– Each wavefront is a grouping of 64 “threads” which live together on a single SIMD
– One wavefront is executed on each SIMD every four cycles
  Total CU throughput: 4 wavefronts / 4 cycles
  That’s 256 threads executed every 4 cycles!
  Separate protected virtual address spaces
  Programmed in a purely scalar way
– Scheduler Limits:
  40 wavefronts (theoretical max) per CU
  10 wavefronts per SIMD
  These ideal limits may not be attained in practice
  – Limited by number of available GPRs
  – Limited by size of available LDS


GCN Compute Unit – Scheduler Specifics Cont.

Work should be grouped to support collaborative tasks
– All threads within a workgroup are guaranteed to be scheduled at the same time
– A set of synchronization primitives and shared memory (LDS) allows data to be passed between threads in a workgroup
  16 Work Group Barriers supported per CU
  Global and Shared memory atomics
– Don’t forget about the L1 cache
  “Group discount” on memory reads
  – As long as all threads are local to a CU!
– Optimized for throughput – latency is hidden by overlapping execution of wavefronts

Workgroup size should be carefully chosen to balance the collaborative gain against hardware limitations such as GPR count and LDS size



GCN Scheduler Arbitration and Decode

A CU is guaranteed to issue instructions for a wavefront sequentially
– Predication & control flow enables any single work-item a unique execution path

For a given CU, every clock cycle, waves on one SIMD are considered for instruction issue
– Round robin scheduling algorithm

At most, one instruction from each category may be issued
At most, one instruction per wave may be issued
Up to a maximum of 5 instructions can issue per cycle, not including “internal” instructions
– 1 Vector Arithmetic Logic Unit (ALU)
– 1 Scalar ALU or Scalar Memory Read
– 1 Vector memory access (Read/Write/Atomic)
– 1 Branch/Message - s_branch and s_cbranch_<cond>
– 1 Local Data Share (LDS)
– 1 Export or Global Data Share (GDS)
– 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio) – [no functional unit]
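The issue rules above (one instruction per category, one per wave, at most five per cycle) can be sketched as a small selection function. This is a toy model under stated assumptions, not the hardware arbiter: the category names are shorthand for the slide's list, internal instructions are left out, and the oldest-first pick order among the waves of the selected SIMD is an assumption.

```python
from typing import Dict, List, Optional

MAX_ISSUE_PER_CYCLE = 5  # from the slide, excluding "internal" instructions

# Shorthand labels for the issue categories listed on the slide (names are ours).
CATEGORIES = ("VALU", "SALU_SMEM", "VMEM", "BRANCH_MSG", "LDS", "EXPORT_GDS")


def select_issue(next_instr: List[Optional[str]]) -> Dict[int, str]:
    """Pick instructions to issue this cycle for the waves of one SIMD.

    next_instr[i] is the category of wave i's next instruction, or None if
    the wave has nothing ready. Each wave offers a single instruction, so
    the one-instruction-per-wave rule holds by construction.
    """
    issued: Dict[int, str] = {}
    used_categories = set()
    for wave, category in enumerate(next_instr):  # assumed oldest-first order
        if len(issued) == MAX_ISSUE_PER_CYCLE:
            break
        if category is None or category in used_categories:
            continue
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        issued[wave] = category
        used_categories.add(category)
    return issued


waves = ["VALU", "VALU", "LDS", "SALU_SMEM", "VMEM", "BRANCH_MSG", "EXPORT_GDS"]
print(select_issue(waves))
```

In the example, the second wave stalls because the VALU slot is taken, and the last wave stalls because five instructions have already issued.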


GCN Branch and Message Unit

Independent scalar assist unit to handle special classes of instructions concurrently
– Branch
  Unconditional Branch (s_branch)
  Conditional Branch (s_cbranch_<cond>)
  – Condition: SCC==0, SCC==1, EXEC==0, EXEC!=0, VCC==0, VCC!=0
  16-bit signed immediate dword offset from PC provided
– Messages
  s_sendmsg: CPU interrupt with optional halt (with shader supplied code and source), debug message (perf trace data, halt, etc.), special graphics synchronization messages


GCN Vector Units

VLIW4 SIMD
– 64 Single Precision multiply-add
– 1 VLIW inst × 4 ALU ops, dependency limited
– Compiler manages register port conflicts
– Specialized, complex compiler scheduling
– Difficult assembly creation, analysis, and debug
– Complicated tool chain support
– Careful optimization req. for peak performance

GCN Quad SIMD
– 64 Single Precision multiply-add
– 4 SIMDs × 1 ALU op, occupancy limited
– No register port conflicts
– Standard compiler scheduling & optimizations
– Simplified assembly creation, analysis, & debug
– Simplified tool chain development and support
– Stable and predictable performance

[Diagram: one VLIW4 SIMD with lanes 0 to 15, versus GCN SIMD 0 to SIMD 3, each with lanes 0 to 15]


GCN Vector Units

VLIW4 SIMD: Dependency Limited
– Instruction level parallelism
  Need to fill VLIW with four (or five) independent ops that can be run in parallel from the same program, each cycle!

GCN Quad SIMD: Occupancy Limited
– Data level parallelism
  Need to be able to run the same single instruction on 64 items of data
– Thread level parallelism
  4x as many wavefronts to occupy all SIMDs


GCN Vector ALU Characteristics

FMA (Fused Multiply Add): IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero and full de-normal support in hardware for SP and DP

MULADD: single cycle issue instruction without truncation, enabling a MULieee followed by ADDieee to be combined with round and normalization after both multiplication and subsequent addition

VCMP: A full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates

IEEE Rounding Modes: (Round to nearest even, Round toward +Infinity, Round toward –Infinity, Round toward zero) supported under program control anywhere in the shader. Double and single precision modes are controlled separately.

De-normal Programmable Mode: control for SP and DP independently. Separate control for input flush to zero and underflow flush to zero.


GCN Vector ALU Characteristics (Cont.)

DIVIDE ASSIST OPS: IEEE 0.5 ULP division accomplished with a macro (SP/DP: ~15/41 instruction slots respectively)

FP Conversion Ops: between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision and rounding

Exceptions Support: in hardware for floating point numbers with software recording and reporting mechanism. Inexact, Underflow, Overflow, division by zero, de-normal, invalid operation, and integer divide by zero operation

64-bit Transcendental Approximation: Hardware based double precision approximation for reciprocal, reciprocal square root and square root

24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates
– Heavy use for Integer thread group address calculation
– 32-bit Integer MUL/MULADD @ DP FP Mul/FMA rate


GCN Scalar Unit

Fully Programmable Scalar Unit replaces FF Branch Logic
– Operations such as JMP [GPR] are now supported
  Opens the door to e.g. virtual function calls
– Has its own GPR pool and can execute normal ALU code
  64-bit bitwise ops to mask thread execution
  32-bit bitwise and integer arithmetic operations at full-speed
  Potential to offload scalar code (Vector ALU → Scalar ALU)

A GCN CU can dispatch 1 scalar op/clock (4 ops / 4 clocks)

[Diagram labels: Scalar Unit; Scalar Decode; 4 CU Shared 16KB Scalar R/O L1; R/W L2]

GCN Scalar Unit (Cont.)

Natively a 64-bit integer ALU
– Independent arbitration and instruction decode
– One ALU, memory or control flow op per cycle

512 Scalar GPRs per SIMD, shared between waves
– An {SGPRn+1, SGPRn} pair provides a 64-bit register

4 CU Shared Read Only Scalar Data Cache
– 16 KB, 64B lines, 4 Way Assoc, LRU replacement policy
– Peak Bandwidth per CU is 16 bytes/cycle

[Diagram labels: Scalar Unit; 8KB Registers; Integer ALU]

GCN Compute Unit – Hardware View

A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clks
– Each lane can dispatch 1 SP ALU operation per clock
– Each SP ALU operation takes 4 clocks to complete

The scheduler dispatches from a different wavefront each cycle

[Diagram: GCN Hardware View. WAVEFRONT 0 shown as work items 0-15, 16-31, 32-47, and 48-63 across the 16 lanes; clock labels CLOCK 0, CLOCK 4, CLOCK 8, CLOCK 12]

GCN Compute Unit – Programmer View

A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
– Each lane can dispatch 1 SP ALU operation per clock
– Each SP ALU operation still takes 4 clocks to complete

But you can PRETEND your code runs 1 op on 64 threads at once

[Diagram: GCN Programmer View. WAVEFRONT 0 through WAVEFRONT 9; clock labels CLOCK 16, CLOCK 20]
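The two views describe the same computation. A toy Python model, assuming only the 16-lane SIMD and 64-item wavefront sizes from the slides (the function names are illustrative), shows that stepping a wavefront through 16 lanes over four clocks produces the same result as the "one op on 64 threads" mental model:

```python
LANES_PER_SIMD = 16
WAVE_SIZE = 64


def hardware_view(op, wavefront):
    """Apply op to a wavefront 16 lanes at a time; return (results, clocks used)."""
    if len(wavefront) != WAVE_SIZE:
        raise ValueError("a wavefront holds exactly 64 work items")
    results = []
    clocks = 0
    for start in range(0, WAVE_SIZE, LANES_PER_SIMD):
        quarter = wavefront[start:start + LANES_PER_SIMD]
        results.extend(op(x) for x in quarter)  # one clock: 16 lanes in lock-step
        clocks += 1
    return results, clocks


def programmer_view(op, wavefront):
    """Pretend the op runs on all 64 work items at once."""
    return [op(x) for x in wavefront]


data = list(range(WAVE_SIZE))
hw_results, hw_clocks = hardware_view(lambda x: 2 * x + 1, data)
print(hw_clocks, hw_results == programmer_view(lambda x: 2 * x + 1, data))
```

The model ignores pipelining and the interleaving of the four SIMDs; it only illustrates why the four-clock cadence is invisible to shader code.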


GCN Shader Code Example

float fn0(float a, float b)
{
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}

//Registers r0 contains “a”, r1 contains “b”
//Value is returned in r2

        v_cmp_gt_f32    r0, r1            //a > b, establish VCC
        s_mov_b64       s0, exec          //Save current exec mask
        s_and_b64       exec, vcc, exec   //Do “if”
        s_cbranch_vccz  label0            //Branch if all lanes fail
        v_sub_f32       r2, r0, r1        //result = a - b
        v_mul_f32       r2, r2, r0        //result = result * a
label0:
        s_andn2_b64     exec, s0, exec    //Do “else” (s0 & !exec)
        s_cbranch_execz label1            //Branch if all lanes fail
        v_sub_f32       r2, r1, r0        //result = b - a
        v_mul_f32       r2, r2, r1        //result = result * b
label1:
        s_mov_b64       exec, s0          //Restore exec mask

Callout (branch instructions): Optional. Use based on the number of instructions in the conditional section. Executed in the branch unit.
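The exec-mask sequence in the listing can be traced in ordinary code. The following Python sketch is a lane-by-lane model of that listing, written under the assumption that every lane starts active; the list-of-booleans mask and the function name are illustrative stand-ins for the 64-bit EXEC and VCC registers.

```python
def fn0_wavefront(a, b):
    """Model the predicated if/else from the listing for one wavefront."""
    n = len(a)
    exec_mask = [True] * n                       # assume all lanes start active
    r2 = [0.0] * n

    vcc = [a[i] > b[i] for i in range(n)]        # v_cmp_gt_f32: establish VCC
    saved = list(exec_mask)                      # s_mov_b64 s0, exec
    exec_mask = [v and e for v, e in zip(vcc, exec_mask)]  # s_and_b64: do "if"

    if any(vcc):                                 # s_cbranch_vccz skips when VCC is all zero
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (a[i] - b[i]) * a[i]     # v_sub_f32, v_mul_f32

    exec_mask = [s and not e for s, e in zip(saved, exec_mask)]  # s_andn2_b64: do "else"

    if any(exec_mask):                           # s_cbranch_execz skips when EXEC is all zero
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (b[i] - a[i]) * b[i]     # v_sub_f32, v_mul_f32

    exec_mask = saved                            # s_mov_b64 exec, s0: restore
    return r2


print(fn0_wavefront([2.0, 1.0, 5.0], [1.0, 4.0, 5.0]))
```

Both sides of the branch run whenever the lanes diverge, which is why coherent wavefronts, where one of the two skips is taken, are cheaper.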


GCN Shader Authoring Tips

GCN VGPR Count:   <=24   28   32   36   40   48   64   84   <=128   >128
Max Waves/SIMD:     10    9    8    7    6    5    4    3       2      1

GCN has greatly improved branch performance, and it continues to improve
– Don’t be afraid to use it!
  But, remember: use it wisely – improved != free
  It’s at its best for highly coherent workloads (where most threads take the same path)

But, the new architecture is more susceptible to register pressure
– Using too many registers within a shader can reduce the maximum waves per SIMD!
– NOTE: A WAVEFRONT CAN ALLOCATE 104 USER SCALAR REGISTERS AS SEVERAL SCALAR REGISTERS ARE RESERVED FOR ARCHITECTURAL STATE
– Take caution with respect to the following:
  Excessive nested branching/looping
  – Loop Unrolling
  Variable declarations (especially arrays)
  Excessive function calls requiring storing of results
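The VGPR table can be reproduced with a short calculation. The sketch below assumes a budget of 256 VGPRs per SIMD lane shared by all resident waves and an allocation granularity of 4 registers; both values are assumptions chosen because they match every column of the table, and the function name is illustrative.

```python
MAX_WAVES_PER_SIMD = 10   # scheduler limit from the slides
VGPR_BUDGET = 256         # assumed per-lane VGPR budget shared by resident waves
ALLOC_GRANULE = 4         # assumed allocation granularity, inferred from the table steps


def max_waves_per_simd(vgprs_per_work_item):
    """Estimate resident waves per SIMD from a shader's VGPR count."""
    if not 1 <= vgprs_per_work_item <= VGPR_BUDGET:
        raise ValueError("VGPR count must be between 1 and 256")
    # Round the request up to the next multiple of the allocation granule.
    allocated = -(-vgprs_per_work_item // ALLOC_GRANULE) * ALLOC_GRANULE
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // allocated)


for count in (24, 28, 32, 36, 40, 48, 64, 84, 128, 129):
    print(count, max_waves_per_simd(count))
```

Scalar register and LDS usage can lower occupancy further; this estimate covers VGPR pressure only.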


Cache Hierarchy

– L1 read/write caches
  64 Bytes per clock L1 bandwidth per CU
– L2 read/write cache partitions
  64 Bytes per clock L2 bandwidth per partition
– Each CU has its own registers and local data share
– 32 KB instruction cache (I$) + 16 KB scalar data cache (K$), shared per 4 CUs, with L2 backing
– Global data share (GDS, 64 KB) facilitates synchronization between CUs

[Diagram labels: L1 (one per CU); I$; K$; GDS; L2; 64-bit Dual Channel Memory Controller]


GCN Vector Memory Instructions

MUBUF – read from or perform write/atomic to an un-typed memory buffer/address
– Data type/size is specified by the instruction operation

MTBUF – read from or write to a typed memory buffer/address
– Data type is specified in the resource constant

MIMG – read/write/atomic operations on elements from an image surface
– Image objects (1-4 dimensional addresses and 1-4 dwords of homogeneous data)
– Image objects use resource and sampler constants for access and filtering

VECTOR MEMORY INSTRUCTIONS SUPPORT VARIABLE GRANULARITY FOR ADDRESSES AND DATA, RANGING FROM 32-BIT DATA TO 128-BIT PIXEL QUADS


GCN Export Memory Instruction

Exports move data from 1-4 VGPRs to the graphics pipeline
– Color (MRT0-7), Depth, Position, and Parameter

Global Shared Memory Ops (Utilize GDS)


GCN LOW-LEVEL TIPS – How GCN Exports Work

The export unit writes results from the programmable stages of the graphics pipeline to the fixed function ones, such as tessellation, rasterization and the render back-ends, via the GDS

The GDS is identical to the local data shares, except that it is shared by all compute units, so it acts as an explicit global synchronization point between all wavefronts.

The atomic units in the GDS additionally support ordered count operations


GCN Local Data Share (LDS)

64 KB, 32 bank (or 16 bank) Shared Memory, fully decoupled from ALU instructions

Direct mode
– Vector Instruction Operand: 32/16/8 bit broadcast value
– Graphics Interpolation @ rate, no bank conflicts

Index Mode – Load/Store/Atomic Operations
– Bandwidth Amplification, up to 32 32-bit lanes serviced per clock peak
– Direct decoupled return to VGPRs
– Hardware conflict detection with auto scheduling

Software consistency/coherency for thread groups via hardware barrier

Fast & low power vector load return from R/W L1

An LDS bank is 512 entries, each 32 bits wide
– A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units
– This means that several threads can read the same LDS location at the same time for FREE
– Writing to the same address from multiple threads also occurs at rate, last thread to write wins

Typically, the LDS will coalesce 32 lanes from one SIMD each cycle
– One wavefront is serviced completely every 2 cycles
– Conflicts automatically detected across 32 lanes from a wavefront and resolved in hardware
– An instruction which accesses different elements in the same bank takes additional cycles
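The bank-conflict rule can be estimated offline for a given access pattern. The sketch below is a simplified model, not the hardware algorithm: it assumes that consecutive 32-bit words map to consecutive banks (word index modulo 32) and that the cost of a 32-lane group is the largest number of distinct words requested from any single bank. Lanes reading the same word count once, matching the "FREE" broadcast case.

```python
from collections import defaultdict

NUM_BANKS = 32         # from the slides
BANK_WIDTH_BYTES = 4   # each bank entry is 32 bits wide


def lds_conflict_passes(byte_addresses):
    """Estimate the passes needed to service one group of up to 32 lanes.

    Assumes bank = (byte address // 4) % 32, which is an illustrative mapping.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // BANK_WIDTH_BYTES
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


lanes = range(32)
print(lds_conflict_passes([4 * lane for lane in lanes]))    # unit stride
print(lds_conflict_passes([0 for _ in lanes]))              # broadcast read
print(lds_conflict_passes([128 * lane for lane in lanes]))  # every lane hits bank 0
```

Padding a shared array so that its row stride is not a multiple of 32 words is the usual way to avoid the worst case shown on the last line.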


GCN R/W CACHE

Reads and writes cached
– Bandwidth amplification
– Improved behavior on more memory access patterns
– Improved write to read reuse performance

Relaxed consistency memory model
– Consistency controls available to control locality of load/store

GPU Coherent
– Acquire/Release semantics control data visibility across the machine (GLC bit on load/store)
– L2 coherent = all CUs can have the same view of data

Global atomics
– Performed in L2 cache


GCN L1 R/W Cache Architecture

– Each CU has its own L1 Cache
  16 KB L1, 64B lines, 4 sets x 64 way
  ~64B/CLK per compute unit bandwidth
  Write-through – alloc on write (no read) w/dirty byte mask
  – Write-through at end of wavefront
  – Decompression on cache read out
– Instruction GLC bit defines cache behavior
  GLC = 0;
  – Local caching (full lines left valid)
  – Shader write back invalidate instructions
  GLC = 1;
  – Global coherent (hits within wavefront boundaries)


GCN L2 R/W Cache Architecture

– 64-128KB L2 per Memory Controller Channel
  64B lines, 16 way set associative
  ~64B/CLK per channel for L2/L1 bandwidth
  Write-back – alloc on write (no read) w/ dirty byte mask
  Acquire/Release semantics control data visibility across CUs
– L2 coherent = all CUs can have the same view of data
  Remote Atomic Operations
  – Common Integer set & float Min/Max/CmpSwap


GCN Latency & Bandwidth

– Each CU has 64 bytes per cycle of L1 bandwidth
  Shared with the GDS
– Per L2 there’s 64 bytes of data per cycle as well
– Peak Scalar Data Cache Bandwidth per CU is 16 bytes/cycle
– Peak I-Cache Bandwidth per CU is 32 bytes/cycle (Optimally 8 instructions)
– LDS Peak Bandwidth is 128 bytes of data per cycle via bandwidth amplification
– That’s nearly 4 TB/s of LDS BW, 2 TB/s of L1 BW, and 700 GB/s of L2 BW!
– 384-bit GDDR5 Main Memory has over 264 GB/sec bandwidth
– PCI Express 3.0 x16 bus interface to system (32GBps)
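The aggregate figures on this slide can be cross-checked from the per-unit rates. The slides do not state the clock, L2 partition count, or memory data rate, so the sketch below assumes a 925 MHz engine clock, 12 L2 partitions, and 5.5 Gbps per pin GDDR5 for a Radeon HD 7970; treat all three as assumptions, and the helper name as illustrative.

```python
CORE_CLOCK_GHZ = 0.925      # assumed HD 7970 engine clock (not stated in the slides)
NUM_CUS = 32                # from the slides
L2_PARTITIONS = 12          # assumed number of L2 partitions
GDDR5_GBPS_PER_PIN = 5.5    # assumed memory data rate
MEMORY_BUS_BITS = 384       # from the slides


def aggregate_gb_per_s(bytes_per_clock, units, clock_ghz=CORE_CLOCK_GHZ):
    """Aggregate bandwidth in GB/s for `units` blocks moving bytes_per_clock each."""
    return bytes_per_clock * units * clock_ghz


lds_bw = aggregate_gb_per_s(128, NUM_CUS)        # about 3789 GB/s, "nearly 4 TB/s"
l1_bw = aggregate_gb_per_s(64, NUM_CUS)          # about 1894 GB/s, "2 TB/s"
l2_bw = aggregate_gb_per_s(64, L2_PARTITIONS)    # about 710 GB/s, "700 GB/s"
dram_bw = MEMORY_BUS_BITS * GDDR5_GBPS_PER_PIN / 8  # 264 GB/s

print(round(lds_bw), round(l1_bw), round(l2_bw), round(dram_bw))
```

Under these assumptions each level of the hierarchy offers roughly two to three times the bandwidth of the level below it.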


GCN L1 Texture Cache

The memory hierarchy is re-used for graphics

Some dedicated graphics hardware added

Address-gen unit receives 4 texture addr/clock
– Calculates 16 sample addr (nearest neighbors)
– Reads samples from L1 data cache
– Decompresses samples in Texture Mapping Unit (TMU)

TMU filters adjacent samples, produces <= 4 interpolated texels/clock

TMU output undergoes format conversion and is written into the vector register file

The format conversion hardware is also used for writing certain formats to memory from graphics shaders


GCN Virtual Memory and x86

The GCN cache hierarchy was designed to integrate with x86 microprocessors

The GCN virtual memory system can support 4KB pages
– Natural mapping granularity for the x86 address space
– Paves the way for a shared address space in the future
– IOMMU used for DMA transfers can already translate requests into x86 address space

GCN caches use 64B lines, which is the same size as x86 processors use

The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!



AMD Radeon™ HD 7900 Series – Codename “Tahiti”

Graphics Core Next Architecture
– Up to 32 Compute Units
– Dual Geometry Engines
– 8 Render Back-ends
  32 color ROPs per clock
  128 Z/stencil ROPs per clock
– Up to 768KB read/write L2 cache

Fast 384-bit GDDR5 memory interface
– Up to 264 GB/sec

PCI Express 3.0 x16 bus interface

4.3 billion 28nm transistors

3.79 Peak Single-Precision TFLOPS


AMD Radeon™ HD 7900 Series – Compute Architecture

Dual Asynchronous Compute Engines (ACE)
– Operate in parallel with graphics command processor
– Independent scheduling and work item dispatch for efficient multi-tasking
  3 Devices with 3 Command Queues!
– Fast context switching
– Exposed in OpenCL™

Dual DMA engines
– Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional)


AMD Radeon™ HD 7900 Series – Compute Architecture (Cont.)

High performance double precision floating point processing
– Up to 947 DP GFLOPS
– Higher utilization = more usable FLOPS
– IEEE compliant

More efficient flow control & branching

Full ECC protection for DRAM & SRAM

First GPU to fully support OpenCL 1.2, Direct3D + Compute 11.1, and C++ AMP

New compute instructions
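The peak FLOPS figures quoted for Tahiti follow from the compute unit layout described earlier: 64 SP lanes per CU, with a fused multiply-add counted as two floating-point operations per lane per clock. The sketch below assumes a 925 MHz engine clock, which the slides do not state, and derives the DP rate from the quoted issue cadence (one DP FMA per 16 clocks against one SP op per 4 clocks, so a quarter of the SP rate).

```python
ASSUMED_CLOCK_GHZ = 0.925  # assumption: HD 7970 engine clock, not given in the slides


def peak_gflops(num_cus, clock_ghz, lanes_per_cu=64, flops_per_lane_per_clock=2.0):
    """Peak GFLOPS, counting one FMA as two floating-point operations."""
    return num_cus * lanes_per_cu * flops_per_lane_per_clock * clock_ghz


sp_gflops = peak_gflops(32, ASSUMED_CLOCK_GHZ)   # about 3789 GFLOPS, i.e. 3.79 TFLOPS
dp_gflops = sp_gflops / 4                        # DP FMA issues at a quarter of the SP rate

print(round(sp_gflops / 1000, 2), int(dp_gflops))
```

These are theoretical peaks; the slides' point about utilization is that achieved throughput depends on keeping all four SIMDs occupied.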


GCN Architecture – ACE Intimate Details

ACEs are responsible for compute shader scheduling & resource allocation

Each ACE fetches commands from cache or memory & forms task queues

Tasks have a priority level for scheduling
– Background → Realtime

ACEs dispatch tasks to shader arrays as resources permit

Tasks complete out-of-order, tracked by ACE for correctness

Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs


GCN Architecture – ACE Intimate Details (Cont.)

ACEs are independent
– But, can synchronize and communicate via cache/mem/GDS

Can form task graphs
– Individual tasks can have dependencies on one another
  Can depend on another ACE
  Can depend on part of the graphics pipeline

Can control task switching
– Stop and start tasks and dispatch work to shader engines


GCN Architecture – Enabling Compute Workloads

The focus in GPU hardware is shifting away from graphics-specific units, towards general-purpose compute units

7900 Series GCN-based ASICs already have “3:1” ratio of ACE : Graphics CP
– Graphics CP can dispatch compute
– ACE cannot dispatch graphics

If you aren’t writing Compute Shaders, you’re probably not getting the absolute most out of modern GPUs
– Control of LDS, barriers, thread layout, etc.

Future Trends:
– More Compute Units
  ALU outpaces BW
– CPU + GPU Flat Mem
  APU + dGPU
– Less FF Graphics
  Can you write a Compute-based graphics pipeline? Start thinking about it…


Utilization and Efficiency

Higher utilization = higher performance per sq. mm

[Chart: AMD Radeon HD 7970 relative to AMD Radeon HD 6970 on Mandelbrot DP, AES256, SHA256, LuxMark, and SmallptGPU; axis from 0x to 5x; each bar split into GFLOPS increase (1.4x) and utilization improvement]


GCN Geometry Engine

GS in conjunction with Tessellation is faster than before…

However… memory is still the bottleneck!
– Minimize the number of inputs and outputs for best performance…
  Small expansions can be done in LDS!

Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
– Caveat: tiny triangles can mean that we don’t reach this potential, and become raster-bound!

[Images: Tessellation off / Tessellation on. Image from Battlefield 3, EA DICE]


GCN Tessellation

Latest iteration of hardware tessellation units
– Increased vertex re-use
– Off-chip buffering improvements
– Larger parameter caches

Improves performance at all tessellation factors
– Up to 4x throughput of AMD Radeon HD 6900 series (Gen 8)


GCN Tessellation – Performance

[Chart: Tessellation Rate (0.0x to 4.5x) versus Tessellation Factor (1 to 31), AMD Radeon™ HD 7970 relative to AMD Radeon™ HD 6970]

[Chart: Relative Performance (0 to 2.5) in Crysis 2, Total War: Shogun 2, Lost Planet 2, Unigine Heaven, and S.T.A.L.K.E.R.: Call of Pripyat; bar labels as extracted: 60% faster, 139% faster, 62% faster, 55% faster, 82% faster]


GCN Tessellation – Best Practices

While performance is much improved, it is still a potential bottleneck!
– Produces a great deal of ring bus traffic, starving other parts of the pipeline

Best performance achieved with tessellation factors less than 15!

Continue to Optimize:
– Pre-triangulate
– Frustum Culling
– Backface Culling
– Distance-adaptive
– Screen-space adaptive
– Orientation-adaptive
– Etc…


Render Back-End (RBE) ON GCN ASICS

Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)
– 16KB Color Cache
  Up to 8 color samples (i.e. 8x MSAA)
– 4KB Depth Cache
  Up to 16 coverage samples (i.e. 16x EQAA)
– Write out through the memory controllers

Logic Operations as alternative to Blending
– Exposed in DX11.1
– Already available in OpenGL

Dual-Source Color Blending with MRTs
– Only available in OpenGL


DEPTH IMPROVEMENTS ON GCN ASICS

Allows fast accept of fully-visible triangles spanning one or more tiles
– If a triangle is fully covering a tile then cost is only 1 clock/tile

Depth Bounds Testing Extension
– Exposed in OpenGL: GL_EXT_depth_bounds_test
– Also exposed in Direct3D via an extension – ask us if you’d like to try it

24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS


STENCIL IMPROVEMENTS ON GCN ASICS

GCN has support for new extended stencil ops compared to prior ASICs
– Only available in OpenGL: GL_AMD_stencil_operation_extended
– Additional stencil ops:
  AND, XOR, NOR
  REPLACE_VALUE_AMD
  etc.
– Also exposes additional stencil op source value
  Can be used as an alternative to stencil ref value

Stencil ref and op source value can now be exported from pixel shader
– Only available in OpenGL: GL_AMD_shader_stencil_value_export


GCN LOW-LEVEL TIPS – GPR Utilization

GPRs and GPR pressure

General Purpose Registers (GPR) are a limited resource
– Separate banks of GPRs for Vector and Scalar (per SIMD)
– Maximum of 256 VGPRs and 512 SGPRs shared across all waves (up to 10) owned by a SIMD
– Organized as 64 words of 32-bits – two adjacent GPRs can be combined for 64-bit (4 for 128-bit)
– Number of GPRs required by a shader affects SIMD scheduling and execution efficiency
– Shader tools can be used to determine how many GPRs are used

GPR pressure is affected by:
– Loop Unrolling
– Long lifetime of temporary variables
– Nested Dynamic Flow Control instructions
– Fetch dependencies (e.g. indexed constants)


GCN LOW-LEVEL TIPS – Texture Filtering

– All shader stages can fetch textures
– Point sampling is full-rate on all formats
– Trilinear costs up to 2x the bilinear filtering cost
– Anisotropic (N taps) costs <= (N * bilinear)
– Avoid cache thrashing
  Use MIPmapping
  Use Gather() where applicable
  Exploit implicit neighbouring pixel shader thread CU locality:
  – Remember that sampling from neighbouring texels has a lower cost for a shader running within the same hardware tile, because it is more likely to experience a cache hit within the Compute Unit’s local texture cache
  Exploit this explicitly by using Compute Shaders

GCN BILINEAR COSTS
Quarter-rate
• RGBA32 and RGBA32F
Half-rate
• RG32, RG32F
• RGBA16, RGBA16F
• BC6
Full-rate
• Everything else!
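The filtering rules above can be combined into a rough upper-bound estimate for budgeting texture fetches. The sketch below expresses cost in units of one full-rate bilinear fetch. It assumes that the format rate and the filter mode multiply together, which the slide does not state, and the function name, mode strings, and rate labels are all illustrative.

```python
# Cost multiplier for a bilinear fetch by format class, from the table above.
FORMAT_RATE_COST = {"full": 1, "half": 2, "quarter": 4}


def filter_cost_upper_bound(filter_mode, format_rate="full", aniso_taps=1):
    """Upper bound on fetch cost, in units of one full-rate bilinear fetch.

    Assumption: format rate and filter mode compose multiplicatively.
    """
    if filter_mode == "point":
        return 1  # point sampling is full-rate on all formats
    bilinear = FORMAT_RATE_COST[format_rate]
    if filter_mode == "bilinear":
        return bilinear
    if filter_mode == "trilinear":
        return 2 * bilinear           # "up to 2x" the bilinear cost
    if filter_mode == "anisotropic":
        if aniso_taps < 1:
            raise ValueError("aniso_taps must be at least 1")
        return aniso_taps * bilinear  # "<= N * bilinear"
    raise ValueError(f"unknown filter mode: {filter_mode}")


print(filter_cost_upper_bound("trilinear", "half"))
print(filter_cost_upper_bound("anisotropic", "quarter", aniso_taps=8))
```

Actual cost is often lower than the bound, since the slide gives trilinear and anisotropic costs as maxima.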


GCN LOW-LEVEL TIPS – Color Output

PS output: each additional color output increases export cost

Export cost can be more costly than PS execution
– Each (fast) export is equivalent to 64 ALU ops on 7970
– If shader is export-bound then use “free” ALU for packing instead

Watch out for those cases
– E.g. G-Buffer parameter writes
– MINIMIZE SHADER INPUTS AND OUTPUTS!
– Pack, pack, pack, pack!

Costs of outputting and blending various formats
– Discard/clip allow the shader hardware to skip the rest of the work!

Quarter-rate
• RGBA16 with blending
• RGBA32F with blending
Half-rate
• R16, RG16 with blending
• RG32F with blending
• RGBA32, RGBA32F
Full-rate
• Everything else!


GCN Media Processing Instructions

SAD = Sum of Absolute Differences

Critical to many video & image processing algorithms
– Motion detection
– Gesture recognition
– Video & image search
– Stereo depth extraction
– Computer vision

SAD (4x1) and QSAD (4 4x1) instructions
– New QSAD combines SAD with alignment ops for higher performance and reduced power draw
– Evaluate up to 256 pixels per CU per clock cycle!

Maskable MQSAD instruction
– Allows background pixels to be ignored
– Accelerated isolation of moving objects

[Figure: SAD block-matching example. Candidate blocks score SAD = 59, 45, 58, and 22; the closest match is SAD = 22]

AMD Radeon HD 7970 can evaluate 7.6 Terapixels/sec *

* Peak theoretical performance for 8-bit integer pixels
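For reference, the operation these instructions accelerate is simple to state in software. The sketch below is a plain Python version of SAD block matching, not a model of the SAD/QSAD instructions themselves; the pixel values are made up for illustration. The last lines check the 7.6 terapixels/sec figure from the quoted 256 pixels per CU per clock, assuming 32 CUs at a 925 MHz engine clock (the clock is an assumption not stated in the slides).

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def best_match(reference, candidates):
    """Return (index, score) of the candidate block with the lowest SAD."""
    scores = [sad(reference, candidate) for candidate in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Illustrative 4x1 blocks of 8-bit pixels (made-up values).
reference = [10, 20, 30, 40]
candidates = [[0, 0, 0, 0], [10, 21, 29, 40], [40, 30, 20, 10]]
print(best_match(reference, candidates))

# Peak rate check: 256 pixels/CU/clock * 32 CUs * 0.925 GHz (assumed clock).
peak_gigapixels_per_s = 256 * 32 * 0.925
print(round(peak_gigapixels_per_s / 1000, 1))
```

Motion estimation repeats this search for every block of a frame over a window of candidate offsets, which is why a dedicated instruction pays off.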


GCN Video Codec Engine

Video Codec Engine (VCE)
– Hardware H.264 Compression and Decompression
  Ultra-low-power, fully fixed-function mode
  – Capable of 1080p @ 60 frames / second
– Programmable for Ultra High Quality and/or Speed
  Entropy encoding block fully accessible to software
  – AMD Accelerated Parallel Programming SDK
  – OpenCL™

Create hybrid faster-than-real-time encoders!
– Custom motion estimation
– Inverse DCT and motion compensation
– Combine with hardware entropy encoding!

AMD Radeon HD 7970 can compress Realtime+ 1080p H.264


IMPORTANT GCN ARCHITECTURE IMPROVEMENTS

– Increased flexibility and efficiency, with reduced complexity!
• Non-VLIW Architecture improves efficiency while reducing programmer burden
• Constants/resources are just address + offset now in the hardware
• UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
• GPU has virtual memory, forward looking towards x86 CPU + GPU flat memory

– Strong forward-looking focus on Compute
• Scalar ALU for complex dynamic control flow + branch & message unit
• 64 KB LDS/CU, 64 KB GDS, atomics at every stage, coherent cache hierarchy
• Multiple Asynchronous Compute Engines (ACE) for multitasking compute
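The comparison of any-format resource access to reinterpret_cast and static_cast can be illustrated on the CPU. The sketch below is only an analogy in Python: it shows the difference between reinterpreting the same 32 bits as another type and converting the value, and says nothing about how a shader actually declares or accesses the resource. The helper names are hypothetical.

```python
import struct


def float_bits(x: float) -> int:
    """Reinterpret a 32-bit float's bit pattern as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float(u: int) -> float:
    """Reinterpret a 32-bit unsigned integer's bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", u))[0]


print(hex(float_bits(1.0)))   # same bits, viewed as an integer
print(int(1.0))               # value conversion, the static_cast analogue
```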


MAIN GCN ARCHITECTURE TAKEAWAYS

– GCN generally simplifies your life as a programmer
• Don’t: fret too much about instruction grouping or vectorization
• Do: think about GPR utilization & LDS usage (impacts max # of wavefronts)
• Do: think about thread/cache locality when you structure your algorithm
• Do: pack shader inputs and outputs – aim to be as IO/bandwidth thin as possible!

– Unlimited number of addressable constants/resources
• N constants aren’t free anymore – each consumes resources, use sparingly!

– Compute is the future – exploit its power for GPGPU work & graphics!
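The link between GPR/LDS usage and the maximum number of wavefronts can be estimated with simple division. The sketch below assumes 256 vector GPRs per lane per SIMD and a cap of 10 wavefronts per SIMD, figures that are not stated in this section, and takes the 64 KB LDS per CU from the slides. It ignores allocation granularity and scalar GPR limits, so treat the results as upper bounds rather than exact occupancy.

```python
VGPRS_PER_SIMD_LANE = 256        # assumption: vector GPR budget per lane
MAX_WAVES_PER_SIMD = 10          # assumption: hardware wavefront cap per SIMD
LDS_BYTES_PER_CU = 64 * 1024     # from the slides: 64 KB LDS per CU


def waves_per_simd_by_vgpr(vgprs_per_thread: int) -> int:
    """Upper bound on resident wavefronts per SIMD given VGPR usage."""
    if vgprs_per_thread <= 0:
        raise ValueError("a shader uses at least one VGPR")
    return min(MAX_WAVES_PER_SIMD, VGPRS_PER_SIMD_LANE // vgprs_per_thread)


def groups_per_cu_by_lds(lds_bytes_per_group: int) -> int:
    """Upper bound on resident work-groups per CU given LDS usage."""
    if lds_bytes_per_group <= 0:
        raise ValueError("use this bound only for groups that allocate LDS")
    return LDS_BYTES_PER_CU // lds_bytes_per_group


for vgprs in (24, 32, 64, 128):
    print(vgprs, "VGPRs ->", waves_per_simd_by_vgpr(vgprs), "waves per SIMD")
print(groups_per_cu_by_lds(16 * 1024), "groups per CU at 16 KB LDS each")
```

Under these assumptions, a shader that grows from 32 to 64 VGPRs halves the number of wavefronts available to hide memory latency.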


THANK YOU

IF WE HAVE TIME REMAINING, WE CAN COVER PARTIALLY RESIDENT TEXTURES

Layla [email protected] @MissQuickstep



Partially Resident Textures (PRT)

MegaTexture in id Tech5


Partially Resident Textures (PRT) – Introduction

Enables application to manage more texture data than can physically fit in a fixed footprint
– A.k.a. “Virtual texturing” and “Sparse texturing”

The principle behind PRT is that not all texture contents are likely to be needed at any given time
– Current render view may only require selected portions of a texture to be resident in memory
– Or, only selected MIPMap levels…

PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
– Texture data can be streamed in on-demand
– Texture sizes up to 32 TB (16k x 16k x 8k x 128-bit)

OpenGL extension – GL_AMD_sparse_texture
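The 32 TB maximum follows directly from the quoted dimensions. The short check below multiplies them out, reading "16k" and "8k" as 16384 and 8192 and counting 128 bits as 16 bytes per texel; it covers the top MIP level only.

```python
WIDTH = 16 * 1024
HEIGHT = 16 * 1024
DEPTH = 8 * 1024
BYTES_PER_TEXEL = 128 // 8

total_bytes = WIDTH * HEIGHT * DEPTH * BYTES_PER_TEXEL
total_tib = total_bytes // 2**40   # binary terabytes

print(total_bytes, "bytes =", total_tib, "TB")
```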


Partially Resident Textures (PRT) – TEXTURE TILES

The PRT texture is chunked into 64 KB tiles
– Fixed memory size
– Not dependent on texture type or format

[Figure: three panels. Highlighted areas represent texture data that needs highest resolution; the chunked texture; the texture tiles needing to be resident in GPU memory.]

Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
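Because the tile size is fixed in bytes, the number of texels per tile depends on the format. The sketch below works out texels per 64 KB tile and the tile count for one MIP level. The helper names are hypothetical, and the calculation assumes densely packed data; it ignores the actual tile shape and any alignment padding the hardware applies.

```python
TILE_BYTES = 64 * 1024


def texels_per_tile(bits_per_texel: int) -> int:
    """How many texels fit in one 64 KB tile for a given format."""
    return TILE_BYTES * 8 // bits_per_texel


def tiles_for_level(width: int, height: int, bits_per_texel: int) -> int:
    """Tiles needed to cover one MIP level, rounded up."""
    level_bytes = width * height * bits_per_texel // 8
    return -(-level_bytes // TILE_BYTES)   # ceiling division


print(texels_per_tile(32))                    # 32-bit RGBA8-style format
print(texels_per_tile(4))                     # 4 bits per texel, e.g. DXT1
print(tiles_for_level(16384, 16384, 32))      # top level of a 16k x 16k texture
```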


PRT – Translation Table

The GPU virtual memory page table translates 64 KB tiles into a resident texture tile pool

[Figure: Texture Map, Page Table, and Texture Tile Pool (Video Memory, linear storage). Legend: unmapped page entry, mapped page entry, 64 KB tile.]

Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008


PRT – Translation Table – Mip Maps

Not all tiles from the texture map are actually resident in video memory

PRT hardware page table stores virtual-to-physical mappings

[Figure: Texture Map with MIP Levels, Page Table, and Texture Tile Pool (Video Memory). Legend: unmapped page entry, mapped page entry, 64 KB tile.]

Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008
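The translation-table idea can be modeled on the CPU as a dictionary from virtual tile coordinates to slots in a fixed-size pool. The class below is a toy model with hypothetical names; the real page table lives in hardware and its entry layout is not described here.

```python
class TilePageTable:
    """Toy model: maps (mip, tile_x, tile_y) to a slot in a fixed tile pool."""

    def __init__(self, pool_slots: int):
        self.free_slots = list(range(pool_slots))
        self.entries = {}                      # virtual tile -> pool slot

    def map_tile(self, key):
        """Make a tile resident and return its pool slot."""
        if key in self.entries:
            return self.entries[key]
        if not self.free_slots:
            raise MemoryError("tile pool is full; release a tile first")
        slot = self.free_slots.pop(0)
        self.entries[key] = slot
        return slot

    def unmap_tile(self, key):
        """Release a tile and return its slot to the pool."""
        self.free_slots.append(self.entries.pop(key))

    def translate(self, key):
        """Return the pool slot, or None for an unmapped page entry."""
        return self.entries.get(key)


table = TilePageTable(pool_slots=2)
table.map_tile((0, 0, 0))
print(table.translate((0, 0, 0)), table.translate((0, 1, 0)))
```

An unmapped entry (None here) corresponds to the case where a fetch cannot return valid data because the tile is not in the pool.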


PRT – TILE MANAGEMENT

The application is responsible for uploading/releasing new PRT tiles!

A common scenario is to upload lower MIPMaps to texture tile pool
– This allows a full representation of the PRT contents to be resident in memory (albeit at lower resolution)
– e.g. MIP LOD 6 and above for 16kx16k 32-bit texture is about 650 KB (256x256 resolution)

Texture tiles corresponding to higher resolution areas are uploaded by the application as needed
– e.g. As camera gets closer to a PRT-textured polygon the requirement for texels:screen pixels ratio increases, thus higher LOD tiles need uploading


PRT – “FAILED” FETCH

How does the application know which texture tiles to upload?

Answer: PRT-specific texture fetch instructions in pixel shader
– Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not in the pool
– OpenGL example: int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel );

This information is then stored in render target or UAV
– Texel fetch failed for a given (x,y) tile location

...and then copied to the CPU so that application can upload required tiles

App chooses what to render until missing data gets uploaded


PRT – “LOD WARNING” TEXEL FETCH CONDITION

PRT fetch condition code can also indicate an “LOD Warning”

The minimum LOD warning is specified by the application on a per texture basis
– OpenGL example: glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );

If a fetched pixel’s LOD is < the specified LOD warning value then the condition code is returned

This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
– E.g. Camera getting closer to PRT-mapped geometry


PRT – Example Usage

1. App allocates PRT (e.g. 16kx16k DXT1) using PRT API

2. App uploads MIP levels using API calls

3. Shader fetches PRT data at specified texcoords

Two possibilities:
3.a. Texel data belongs to a resident (64 KB) tile
- Valid color returned, no error code

3.b. Texel data points to non-resident tile or specified LOD
- Error/LOD Warning code returned
- Shader writes tile location and error code to RT or UAV

4. App reads RT or UAV and uploads/releases new tiles as needed
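The steps above form a feedback loop between the shader and the application, which can be sketched on the CPU as follows. Everything here is a stand-in: `shader_pass` plays the role of the PRT fetch plus the write to an RT or UAV, `app_update` plays the role of the application reading that data back, and the status names and tile keys are hypothetical.

```python
FAILED = "failed"
LOD_WARNING = "lod_warning"


def shader_pass(resident, samples, min_warning_lod):
    """Record a condition code per tile, like the shader writing to an RT/UAV."""
    feedback = {}
    for lod, tile_x, tile_y in samples:
        key = (lod, tile_x, tile_y)
        if key not in resident:
            feedback[key] = FAILED            # step 3.b: non-resident tile
        elif lod < min_warning_lod:
            feedback[key] = LOD_WARNING       # step 3.b: below warning LOD
        # otherwise step 3.a: valid color, nothing recorded
    return feedback


def app_update(resident, feedback):
    """Step 4: upload every tile whose fetch failed; return what was uploaded."""
    uploaded = [key for key, status in feedback.items() if status == FAILED]
    resident.update(uploaded)
    return uploaded


resident = {(6, 0, 0)}                        # low-resolution MIP preloaded
samples = [(6, 0, 0), (2, 3, 1), (2, 3, 1)]
first = shader_pass(resident, samples, min_warning_lod=3)
print(first, app_update(resident, first))
print(shader_pass(resident, samples, min_warning_lod=3))
```

After the upload, the same samples no longer fail, but the tile at LOD 2 still raises the LOD warning, which the application could use as a hint to start streaming finer levels.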


PRT Types, Formats and Dimensions

All texture types and formats supported
– 1D, 2D, cube, arrays and 3D volume textures
– All common texture formats
• Including compressed formats
– Maximum dimensions: 16K x 16K x 8K x 128 bit textures


Hardware PRT > Software Implementation

PRT

Ease of implementation
• Complexity hidden behind HW & API

Full filtering support
• Includes anisotropic filtering

Full-speed filtering
• SW solution requires “manual” filtering
• Software anisotropic is very costly

SW Implementation

Don’t go overboard with PRT allocation!
• Page table entry size is 4 DWORDs
• Page table entries have to be resident in video memory
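A quick estimate shows why large PRT allocations are not free. The sketch below assumes one page table entry per 64 KB tile, which the slides imply but do not state outright, and uses the 4-DWORD (16-byte) entry size given above. The function name is hypothetical and MIP levels are not counted.

```python
TILE_BYTES = 64 * 1024
PTE_BYTES = 4 * 4                  # 4 DWORDs per page table entry


def page_table_bytes(texture_bytes: int) -> int:
    """Page table size, assuming one entry per 64 KB tile of virtual texture."""
    tiles = -(-texture_bytes // TILE_BYTES)   # ceiling division
    return tiles * PTE_BYTES


top_level_16k = 16384 * 16384 * 4             # 16k x 16k, 32 bits per texel
print(page_table_bytes(top_level_16k) // 1024, "KB")
print(page_table_bytes(2**45) // 2**30, "GB for a maximum-size 32 TB PRT")
```

Under that assumption the page table for a maximum-size texture would itself run to gigabytes, so virtual texture sizes are best kept to what the content actually needs.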


问题？Questions? 質問がありますか？

^_^

Layla [email protected] @MissQuickstep



Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2012 Advanced Micro Devices, Inc. All rights reserved.


Backup Slides


SHADER CODE EXAMPLE 2

float fn0(float a, float b)
{
    float c = 0.0;
    float d = 0.0;

    for (int i = 0; i < 100; i++)
    {
        if (c > 113.0)
            break;
        c = c * a + b;
        d = d + 1.0;
    }
    return d;
}

// Registers: r0 contains “a”, r1 contains “b”, r2 contains “c”
// and r3 contains “d”
// Value is returned in r3

        v_mov_b32       r2, #0.0          // float c = 0.0
        v_mov_b32       r3, #0.0          // float d = 0.0
        s_mov_b64       s0, exec          // Save execution mask
        s_mov_b32       s2, #0            // i = 0
label0:
        s_cmp_lt_s32    s2, #100          // i < 100
        s_cbranch_sccz  label1            // Exit loop if not true
        v_cmp_le_f32    r2, #113.0        // c <= 113.0, i.e. lane does not break
        s_and_b64       exec, vcc, exec   // Mask off lanes where c > 113.0
        s_branch_execz  label1            // Exit if every lane has hit the break
        v_mul_f32       r2, r2, r0        // c = c*a
        v_add_f32       r2, r2, r1        // c = c+b
        v_add_f32       r3, r3, #1.0      // d = d+1.0
        s_add_s32       s2, s2, #1        // i++
        s_branch        label0            // Jump to start of loop
label1:
        s_mov_b64       exec, s0          // Restore exec mask
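The way the exec mask turns a per-thread `break` into wavefront-wide control flow can be mimicked on the CPU. The sketch below is a simplified model, not an ISA emulator: `fn0_scalar` is a direct transcription of the shader, and `fn0_wavefront` runs several lanes in lockstep with a boolean list standing in for the exec mask. The lane count and input values are arbitrary.

```python
def fn0_scalar(a: float, b: float) -> float:
    """Direct transcription of the shader function for one thread."""
    c, d = 0.0, 0.0
    for _ in range(100):
        if c > 113.0:
            break
        c = c * a + b
        d += 1.0
    return d


def fn0_wavefront(a_lanes, b_lanes):
    """Run all lanes in lockstep, masking off lanes that have hit the break."""
    n = len(a_lanes)
    c = [0.0] * n
    d = [0.0] * n
    exec_mask = [True] * n
    saved_mask = list(exec_mask)                  # save execution mask
    i = 0
    while i < 100:                                # scalar loop counter
        vcc = [c[lane] <= 113.0 for lane in range(n)]
        exec_mask = [e and v for e, v in zip(exec_mask, vcc)]
        if not any(exec_mask):                    # every lane has broken out
            break
        for lane in range(n):
            if exec_mask[lane]:                   # only active lanes execute
                c[lane] = c[lane] * a_lanes[lane] + b_lanes[lane]
                d[lane] += 1.0
        i += 1
    exec_mask = saved_mask                        # restore execution mask
    return d


a = [1.0, 2.0, 0.5, 1.0]
b = [1.0, 1.0, 1.0, 50.0]
print(fn0_wavefront(a, b))
```

The loop counter stays scalar because it is the same for every lane; only the data-dependent `break` needs per-lane masking.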
