GDC 2013: Powering The Next Generation Of Graphics AMD GCN...

1 AMD Radeon™ HD 7900 Series Graphics | December 2011 | Confidential – NDA Required POWERING THE NEXT GENERATION OF GRAPHICS: AMD GCN ARCHITECTURE Layla Mah – [email protected] Developer Relations Engineer, AMD @MissQuickstep

Transcript of GDC 2013: Powering The Next Generation Of Graphics AMD GCN...

Page 1: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics



2| Powering the Next Generation of Graphics: The AMD GCN Architecture | Layla Mah | March 2013 | Game Developers Conference |

Agenda

Part 1: AMD Graphics Core Next Architecture (GCN)

Part 2: Partially Resident Textures (PRT)

AMD GRAPHICS CORE NEXT

GPU Evolution

[Figure: GPU evolution timeline. Eras: 1st era, Fixed Function; 2nd era, Simple Shaders; 3rd era, Graphics Parallel Core. Diagram labels: Lighting, 3D Geometry Transformation, VLIW5, VLIW4, General Purpose Registers, Stream Processing Units, FMAD + Special Functions, Branch Unit.]

Prior to 2002
Graphics-specific hardware
– Texture mapping/filtering: multi-texturing
– “T&L Engines”: geometry processing (Transform), rasterization (Lighting)
– Dedicated texture and pixel caches
Dot product and scalar multiply-add
– Sufficient for basic graphics tasks
– No general purpose compute capability

[Figure: block diagram of a simple-shader-era GPU. Labels: Memory Interface, 8 Vertex Pipes, Setup Engine, Pixel Shader Core, 16 Pixel Pipes.]

The Rise of Shaders

Shader Models 1.0 - 2.0
– VS and PS are distinct
– Minimal Instruction Sets
– Limited Instruction Slots
– Limited Shader Lengths
– No Dynamic Flow Control
– No Looping Constructs
– No Vertex Texture Fetch
– No Bitwise Operators
– No Native Integer ALU
– Etc.

2002-2006
Graphics-focused programmability
– DirectX 8/9, OpenGL 2.0
– Floating point processing (IEEE not required), with different precision per IHV: ATI 24-bit full-speed; NV 16-bit full-speed; NV 32-bit half-speed
– Specialized ALUs for vertex & pixel processing
Added dedicated caches


The Rise of The Unified Shader (VLIW-5)

5-Element Very-Long-Instruction-Word (XYZWT)
– Began with Xenos and utilized from R600 until “Cayman”
– Flexible and optimized for Graphics workloads
Ideal for 4-element Vector and 4x4 Matrix Operations
– Vector/Vector math in a single instruction
Plus One Transcendental-Unit function per Instruction
More advanced caching
– Instruction, constant, multi-level texture/data, & later: LDS/GDS
Single Precision 32-bit IEEE-Compliant Floating Point ALUs
More flexible: Unified ALU, Branch Unit, Dynamic Flow Control, Vertex Texture, Geometry Shader, Tessellation Engines, etc.


Optimized For Die Area Efficiency (VLIW-4)

4-Element Very-Long-Instruction-Word (XYZW)
– Profiling showed average VLIW utilization was < 3.4/5
Removed dedicated T-Unit
– Optimized die area usage
– Each ALU has a smaller LUT, combined using 3-term Lagrange polynomial interpolation (transcendental/clock/VLIW4)
– Better optimized for combination of Graphics & Compute: graphics is still the primary focus, but compute is gaining attention
Still ideal for 4-element Vector and 4x4 Matrix Operations
Fewer ALU bubbles in transcendental-light code, better utilization
– Simplified programming and optimization relative to VLIW-5
– Multiple dispatch processors & separate command queues
Improved support for DirectCompute™ and OpenCL™


Graphics Core Next Architecture

A new GPU design for a new era of computing
– Cutting-edge graphics performance and features
– High compute density with multi-tasking
– Built for power efficiency
– Optimized for heterogeneous computing
– Enabling the Heterogeneous System Architecture (HSA)
– Amazing scalability and flexibility


Graphics Core Next Architecture

A new GPU design for a new era of computing
– Unlimited Resources & Samplers (including unlimited UAV/SRV at any shader stage)
– All UAV formats can be read/write (vs. just single uint32 in D3D11 API spec)
– Simpler Assembly Language
– Simpler Shader Code (No More Clauses)
– Ability to support C/C++ (like)
– Architectural support for traps, exceptions & debugging
– Ability to share virtual x86-64 address space with CPU cores


Graphics Core Next Architecture

A new GPU design for a new era of next generation computing…


GCN Compute Unit

Basic GPU building block
– New instruction set architecture: non-VLIW, vector unit + scalar co-processor, distributed programmable scheduler
– Each compute unit can execute instructions from multiple kernels at once
– Increased instructions per clock per mm²
Designed for high utilization, high throughput, and multi-tasking

[Figure: GCN Compute Unit block diagram. Blocks: Branch & Message Unit, Scalar Unit, Vector Units (4x SIMD-16), Vector Registers (4x 64KB), Texture Filter Units (4), Local Data Share (64KB), L1 Cache (16KB), Scheduler, Texture Fetch Load/Store Units (16), Scalar Registers (8KB).]


GCN Compute Unit – Specifics

1 Fully Programmable Scalar ALU
– Shared by all threads of a wavefront
– Used for flow control, pointer arithmetic, etc.
– Has own GPRs, scalar data cache, etc.
1 Branch & Message Unit
– Executes branch instructions (as dispatched by Scalar unit)
4 [16-lane] Vector ALU (SIMD)
– CU Total Throughput: 64 SP ops/clock
– 1 SP (Single-Precision) op per 4 clocks
– 1 DP (Double-Precision) ADD in 8 clocks
– 1 DP MUL/FMA/Transcendental per 16 clocks


GCN Compute Unit – Specifics (Cont.)

64 KB Local Data Share (LDS)
– 2x larger than D3D11 TGSM limit (32 KB/thread group)
– 32 banks, with conflict resolution
– Bandwidth amplification
– Separate instruction decode
16 KB read/write L1 vector data cache
Texture Units (utilize L1)
– 4 Filter, 16 Load/Store
Scheduler (2560 threads)
– Separate decode/issue for VALU, SALU/SMEM, VMEM, LDS, GDS/Export
– Plus special instructions (NOPs, barriers, etc.) and branch instructions


GCN Compute Unit – SIMD Specifics

Each SIMD unit is assigned its own 40-bit program counter and instruction buffer for 10 wavefronts
– The whole CU can have 40 wavefronts in flight
– Each potentially from a different work-group or kernel
Each SIMD is a 16-lane ALU
– IEEE-754 SP and DP: full-speed denormals + all rounding modes; 32-bit FMA and 24-bit INT at full speed; DP and 32-bit INT at reduced rate (1/2 to 1/16)
– 64 KB vector register file
– Issue 1 SP instruction per lane per clock; retire 64 lanes (1 wavefront) of SP ALU in 4 clocks

A GCN GPU with 32 CUs, such as the AMD Radeon™ HD 7970, can be working on up to 81,920 work items at a time!
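The 81,920 figure follows directly from the per-SIMD limits quoted on this slide. A minimal sketch of the arithmetic; the function and parameter names are illustrative and do not come from the talk:

```python
def max_work_items_in_flight(num_cus, simds_per_cu=4, waves_per_simd=10,
                             lanes_per_wave=64):
    """Upper bound on resident work items: CUs x SIMDs x wavefronts x lanes."""
    return num_cus * simds_per_cu * waves_per_simd * lanes_per_wave


# Radeon HD 7970: 32 CUs, per the slide.
print(max_work_items_in_flight(32))  # 81920
```

The same product with a single CU gives the 2560 threads per CU quoted for the scheduler.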


GCN Compute Unit – Scheduler Specifics

On GCN, each CU has its own dedicated Scheduler unit, supporting up to 2560 threads per CU
– Schedules this work between the 4 SIMDs in groups called “wavefronts”
– Each wavefront is a grouping of 64 “threads” which live together on a single SIMD
– One wavefront is executed on each SIMD every four cycles. Total CU throughput: 4 wavefronts / 4 cycles. That’s 256 threads executed every 4 cycles!
– Separate protected virtual address spaces
– Programmed in a purely scalar way
Scheduler limits: 40 wavefronts (theoretical max) per CU, 10 wavefronts per SIMD. These ideal limits may not be attained in practice
– Limited by number of available GPRs
– Limited by size of available LDS


GCN Compute Unit – Scheduler Specifics (Cont.)

Work should be grouped to support collaborative tasks
– All threads within a workgroup are guaranteed to be scheduled at the same time
– A set of synchronization primitives and shared memory (LDS) allows data to be passed between threads in a workgroup: 16 work group barriers supported per CU; global and shared memory atomics
– Don’t forget about the L1 cache: “group discount” on memory reads, as long as all threads are local to a CU!
– Optimized for throughput: latency is hidden by overlapping execution of wavefronts
Workgroup size should be carefully chosen to balance the collaborative gain against hardware limitations such as GPR count and LDS size


GCN Scheduler Arbitration and Decode

A CU is guaranteed to issue instructions for a wavefront sequentially
– Predication & control flow enables any single work-item a unique execution path
For a given CU, every clock cycle, waves on one SIMD are considered for instruction issue
– Round robin scheduling algorithm
At most, one instruction from each category may be issued
At most, one instruction per wave may be issued
Up to a maximum of 5 instructions can issue per cycle, not including “internal” instructions
– 1 Vector Arithmetic Logic Unit (ALU)
– 1 Scalar ALU or Scalar Memory Read
– 1 Vector memory access (Read/Write/Atomic)
– 1 Branch/Message: s_branch and s_cbranch_<cond>
– 1 Local Data Share (LDS)
– 1 Export or Global Data Share (GDS)
– 1 Special/Internal (s_nop, s_sleep, s_waitcnt, s_barrier, s_setprio): [no functional unit]
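The issue rules above can be sketched as a small selection routine. This is a toy model, not the hardware arbiter: the category names, the "first wave in the list wins" priority, and the function names are all assumptions made for illustration.

```python
# Functional-unit categories from the slide (names are illustrative).
ISSUE_CATEGORIES = ("VALU", "SALU_SMEM", "VMEM", "BRANCH_MSG", "LDS", "EXPORT_GDS")
MAX_ISSUE_PER_CYCLE = 5


def simd_considered(cycle, num_simds=4):
    """Round robin: one SIMD's waves are considered each clock."""
    return cycle % num_simds


def pick_issues(next_instr_by_wave):
    """Choose instructions to issue this cycle.

    next_instr_by_wave: list of (wave_id, category), one entry per wave,
    in priority order (the ordering policy is an assumption).
    """
    issued = []
    used_categories = set()
    for wave_id, category in next_instr_by_wave:
        if category == "INTERNAL":
            # s_nop, s_waitcnt, ... use no functional unit and are not counted.
            continue
        if category in used_categories or len(issued) == MAX_ISSUE_PER_CYCLE:
            continue
        used_categories.add(category)
        issued.append((wave_id, category))
    return issued


waves = [(0, "VALU"), (1, "VALU"), (2, "SALU_SMEM"), (3, "INTERNAL"),
         (4, "VMEM"), (5, "LDS"), (6, "BRANCH_MSG"), (7, "EXPORT_GDS")]
print(pick_issues(waves))
```

In this run wave 1 loses the VALU slot to wave 0, and wave 7 is held back because five instructions have already been selected.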


GCN Branch and Message Unit

Independent scalar assist unit to handle special classes of instructions concurrently
– Branch: unconditional branch (s_branch), conditional branch (s_cbranch_<cond>)
– Condition: SCC==0, SCC==1, EXEC==0, EXEC!=0, VCC==0, VCC!=0
– 16-bit signed immediate dword offset from PC provided
– Messages: s_sendmsg. CPU interrupt with optional halt (with shader supplied code and source), debug message (perf trace data, halt, etc.), special graphics synchronization messages
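To make the "16-bit signed immediate dword offset" concrete, here is a hedged sketch of how a branch target could be computed. The slide only states that the offset is signed, 16 bits wide and counted in dwords; taking it relative to the instruction after the branch (the `+ 4`) is an assumption based on the published Southern Islands ISA documentation, and the function names are illustrative.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, simm16):
    """Byte address reached by a taken branch located at byte address pc.

    Assumes the dword offset is applied relative to the next instruction.
    """
    return pc + 4 + sign_extend16(simm16) * 4


print(hex(branch_target(0x100, 0x0002)))  # 0x10c
print(hex(branch_target(0x100, 0xFFFF)))  # 0x100 (offset -1 branches to itself)
```

Under these assumptions the reachable range is roughly ±32K dwords (±128 KB) around the branch.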


GCN Vector Units

VLIW4 SIMD (one SIMD, lanes 0 to 15)
– 64 Single Precision multiply-add
– 1 VLIW inst × 4 ALU ops: dependency limited
– Compiler manages register port conflicts
– Specialized, complex compiler scheduling
– Difficult assembly creation, analysis, and debug
– Complicated tool chain support
– Careful optimization req. for peak performance

GCN Quad SIMD (SIMD 0 to SIMD 3, lanes 0 to 15 each)
– 64 Single Precision multiply-add
– 4 SIMDs × 1 ALU op: occupancy limited
– No register port conflicts
– Standard compiler scheduling & optimizations
– Simplified assembly creation, analysis, & debug
– Simplified tool chain development and support
– Stable and predictable performance


GCN Vector Units (Cont.)

VLIW4 SIMD: Dependency Limited
– Instruction level parallelism
– Need to fill VLIW with four (or five) independent ops that can be run in parallel from the same program, each cycle!

GCN Quad SIMD: Occupancy Limited
– Data level parallelism: need to be able to run the same single instruction on 64 items of data
– Thread level parallelism: 4x as many wavefronts to occupy all SIMDs


GCN Vector ALU Characteristics

– FMA (Fused Multiply Add): IEEE 754-2008 precise with all round modes, proper handling of NaN/Inf/Zero and full de-normal support in hardware for SP and DP
– MULADD: single cycle issue instruction without truncation, enabling a MUL_ieee followed by ADD_ieee to be combined with round and normalization after both multiplication and subsequent addition
– VCMP: a full set of operations designed to fully implement all the IEEE 754-2008 comparison predicates
– IEEE Rounding Modes (round to nearest even, round toward +Infinity, round toward –Infinity, round toward zero): supported under program control anywhere in the shader. Double and single precision modes are controlled separately.
– De-normal Programmable Mode: control for SP and DP independently. Separate control for input flush to zero and underflow flush to zero.
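The practical difference between a fused FMA (one rounding) and a MULADD that rounds after both the multiply and the add can be shown on the CPU. This is a sketch in double precision using exact rational arithmetic to emulate the fused result; it illustrates the IEEE behaviour only and is not GCN code. The helper names are illustrative.

```python
from fractions import Fraction


def muladd(a, b, c):
    """Unfused: the product is rounded to double before the add."""
    return a * b + c


def fma(a, b, c):
    """Fused: exact product and sum, then a single round-to-nearest-even."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

print(muladd(a, b, c))  # 0.0: the product 1 - 2**-60 rounds to 1.0 first
print(fma(a, b, c))     # equals -2**-60: the small term survives
```

The exact product here is 1 − 2⁻⁶⁰, which is closer to 1.0 than to any other double, so the unfused version loses the entire result to the intermediate rounding.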


GCN Vector ALU Characteristics (Cont.)

– DIVIDE ASSIST OPS: IEEE 0.5 ULP division accomplished with a macro (SP/DP in ~15/41 instruction slots respectively)
– FP Conversion Ops: between 16-bit, 32-bit, and 64-bit floats with full IEEE 754 precision and rounding
– Exceptions: support in hardware for floating point numbers with software recording and reporting mechanism. Inexact, underflow, overflow, division by zero, de-normal, invalid operation, and integer divide by zero operation
– 64-bit Transcendental Approximation: hardware based double precision approximation for reciprocal, reciprocal square root and square root
– 24 BIT INT MUL/MULADD/LOGICAL/SPECIAL @ full SP rates: heavy use for integer thread group address calculation; 32-bit integer MUL/MULADD @ DP FP MUL/FMA rate


GCN Scalar Unit

– Fully programmable Scalar Unit replaces FF (fixed-function) branch logic
– Operations such as JMP [GPR] are now supported; opens the door to e.g. virtual function calls
– Has its own GPR pool and can execute normal ALU code
– 64-bit bitwise ops to mask thread execution
– 32-bit bitwise and integer arithmetic operations at full speed
– Potential to offload scalar code (Vector ALU → Scalar ALU)
– A GCN CU can dispatch 1 scalar op/clock (4 ops / 4 clocks)


GCN Scalar Unit (Cont.)

– Natively a 64-bit integer ALU
– Independent arbitration and instruction decode
– One ALU, memory or control flow op per cycle
– 512 scalar GPRs per SIMD, shared between waves
– {SGPRn+1, SGPRn} pair provides a 64-bit register
– 4 CU shared read-only scalar data cache: 16 KB, 64B lines, 4-way associative, LRU replacement policy
– Peak bandwidth per CU is 16 bytes/cycle

[Figure: Scalar Unit block diagram. Labels: Scalar Decode, Integer ALU, 8KB Registers, 16KB scalar read-only L1 shared by 4 CUs, R/W L2.]
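Two of the scalar-unit numbers can be checked with a few lines of arithmetic. The sketch below combines a register pair into one 64-bit value and recomputes the 8 KB scalar register file size from 512 SGPRs per SIMD. Treating the lower-numbered register as the low dword and assuming 4-byte SGPRs are assumptions on my part, as are the function names.

```python
def sgpr_pair_to_u64(sgpr_lo, sgpr_hi):
    """Combine {SGPRn+1, SGPRn} into one 64-bit value.

    Assumes SGPRn holds the low 32 bits and SGPRn+1 the high 32 bits.
    """
    return ((sgpr_hi & 0xFFFFFFFF) << 32) | (sgpr_lo & 0xFFFFFFFF)


def scalar_register_file_bytes(sgprs_per_simd=512, simds_per_cu=4,
                               bytes_per_sgpr=4):
    """Total scalar register storage per CU."""
    return sgprs_per_simd * simds_per_cu * bytes_per_sgpr


print(hex(sgpr_pair_to_u64(0xFFFFFFFF, 0x1)))  # 0x1ffffffff
print(scalar_register_file_bytes())            # 8192 bytes = 8 KB
```

A 64-bit pair is exactly wide enough to hold one bit per lane of a 64-wide wavefront, which is how masks such as exec and vcc are used in the shader example later in the deck.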


GCN Compute Unit – Hardware View

– A GCN Compute Unit can retire 256 SP Vector ALU ops in 4 clks
– Each lane can dispatch 1 SP ALU operation per clock
– Each SP ALU operation takes 4 clocks to complete
– The scheduler dispatches from a different wavefront each cycle

[Figure: GCN Hardware View. Scalar Unit plus SIMD 0 to SIMD 3, each with lanes 0 to 15.]


GCN Compute Unit – Programmer View

– A GCN Compute Unit can perform 64 SP Vector ALU ops / clock
– Each lane can dispatch 1 SP ALU operation per clock
– Each SP ALU operation still takes 4 clocks to complete
– But you can PRETEND your code runs 1 op on 64 threads at once

[Figure: GCN Programmer View. One wavefront drawn as 64 lanes (0 to 63) in four groups of 16, with the Scalar Unit alongside; clock labels 0, 4, 8, 12, 16, 20; wavefronts 0 through 9.]
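The gap between the hardware view and the programmer view can be modelled in a few lines: one 64-wide vector operation is carried out as four passes over a 16-lane SIMD. This is a simplified sketch that ignores pipelining and interleaving with other wavefronts; the names are illustrative.

```python
WAVEFRONT_SIZE = 64
SIMD_LANES = 16


def execute_wave_op(op, src_a, src_b):
    """Apply one vector op to a 64-wide wavefront as four 16-lane passes.

    Returns the per-thread results and the number of clocks (passes) used.
    """
    assert len(src_a) == len(src_b) == WAVEFRONT_SIZE
    result = [None] * WAVEFRONT_SIZE
    clocks = 0
    for first_lane in range(0, WAVEFRONT_SIZE, SIMD_LANES):
        # One clock: the 16 physical lanes process 16 of the 64 threads.
        for thread in range(first_lane, first_lane + SIMD_LANES):
            result[thread] = op(src_a[thread], src_b[thread])
        clocks += 1
    return result, clocks


values, clocks = execute_wave_op(lambda x, y: x * y, list(range(64)), [2] * 64)
print(clocks)      # 4
print(values[:4])  # [0, 2, 4, 6]
```

From the shader author's side only `values` is visible, which is why the slide suggests pretending the operation ran on all 64 threads at once.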


GCN Shader Code Example

float fn0(float a, float b)
{
    if (a > b)
        return ((a - b) * a);
    else
        return ((b - a) * b);
}

//Registers r0 contains “a”, r1 contains “b”
//Value is returned in r2

    v_cmp_gt_f32    r0, r1            //a > b, establish VCC
    s_mov_b64       s0, exec          //Save current exec mask
    s_and_b64       exec, vcc, exec   //Do “if”
    s_cbranch_vccz  label0            //Branch if all lanes fail
    v_sub_f32       r2, r0, r1        //result = a - b
    v_mul_f32       r2, r2, r0        //result = result * a
label0:
    s_andn2_b64     exec, s0, exec    //Do “else” (s0 & !exec)
    s_cbranch_execz label1            //Branch if all lanes fail
    v_sub_f32       r2, r1, r0        //result = b - a
    v_mul_f32       r2, r2, r1        //result = result * b
label1:
    s_mov_b64       exec, s0          //Restore exec mask

Slide callouts on the two s_cbranch instructions (targets label0 and label1): Optional. Use based on the number of instructions in the conditional section. Executed in the branch unit.
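The listing handles the if/else by masking lanes rather than by branching per thread. The sketch below mimics that flow on the CPU, one list element per lane, and checks it against the scalar `fn0`. It is an emulation for illustration only: the exec mask is modelled as a list of booleans, all lanes are assumed active on entry, and the helper names are not from the talk.

```python
def fn0_reference(a, b):
    """Scalar version of the shader function."""
    return (a - b) * a if a > b else (b - a) * b


def fn0_wavefront(r0, r1):
    """Lane-masked version: r0 holds a, r1 holds b, the result lands in r2."""
    n = len(r0)
    exec_mask = [True] * n                      # assumption: all lanes active
    r2 = [0.0] * n
    # v_cmp_gt_f32: per-lane compare, only meaningful for active lanes
    vcc = [exec_mask[i] and r0[i] > r1[i] for i in range(n)]
    s0 = list(exec_mask)                        # s_mov_b64 s0, exec
    exec_mask = [vcc[i] and exec_mask[i] for i in range(n)]  # s_and_b64
    if any(vcc):                                # s_cbranch_vccz skips if none
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (r0[i] - r1[i]) * r0[i]
    # s_andn2_b64: lanes that were active on entry but did not take the "if"
    exec_mask = [s0[i] and not exec_mask[i] for i in range(n)]
    if any(exec_mask):                          # s_cbranch_execz skips if none
        for i in range(n):
            if exec_mask[i]:
                r2[i] = (r1[i] - r0[i]) * r1[i]
    exec_mask = s0                              # s_mov_b64 exec, s0 (restore)
    return r2


a = [float(i) for i in range(64)]
b = [32.0] * 64
assert fn0_wavefront(a, b) == [fn0_reference(x, y) for x, y in zip(a, b)]
```

When the lanes diverge, as in this input, both sides of the branch are executed with complementary masks, which is why coherent workloads fare better.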


GCN Shader Authoring Tips

GCN VGPR Count    <=24   28   32   36   40   48   64   84   <=128   >128
Max Waves/SIMD      10    9    8    7    6    5    4    3       2      1

GCN has greatly improved branch performance, and it continues to improve
– Don’t be afraid to use it! But, remember: use it wisely. Improved != free
– It’s at its best for highly coherent workloads (where most threads take the same path)
But, the new architecture is more susceptible to register pressure
– Using too many registers within a shader can reduce the maximum waves per SIMD!
– NOTE: A wavefront can allocate 104 user scalar registers, as several scalar registers are reserved for architectural state
– Take caution with respect to the following: excessive nested branching/looping; loop unrolling; variable declarations (especially arrays); excessive function calls requiring storing of results
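The VGPR table can be reproduced from numbers given earlier in the deck: a 64 KB vector register file per SIMD, 64 lanes per wavefront and 4 bytes per register give 256 VGPRs per lane, shared by at most 10 waves. The sketch below is a hedged reconstruction; rounding the per-thread count up to a multiple of 4 is an assumption that happens to match the table's column steps, and the names are illustrative.

```python
VGPR_BUDGET = 64 * 1024 // (64 * 4)  # 64 KB / (64 lanes x 4 bytes) = 256
MAX_WAVES_PER_SIMD = 10
ALLOC_GRANULARITY = 4                # assumed allocation step


def max_waves_per_simd(vgprs_per_thread):
    """Waves that fit on one SIMD for a shader using this many VGPRs."""
    if not 1 <= vgprs_per_thread <= VGPR_BUDGET:
        raise ValueError("VGPR count must be between 1 and 256")
    # Round up to the allocation granularity.
    allocated = -(-vgprs_per_thread // ALLOC_GRANULARITY) * ALLOC_GRANULARITY
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // allocated)


for count in (24, 28, 32, 36, 40, 48, 64, 84, 128, 129):
    print(count, max_waves_per_simd(count))
```

This only models the VGPR limit. As the scheduler slides note, LDS usage can reduce the achievable wave count further.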

Cache Hierarchy

– L1 read/write caches: 64 bytes per clock L1 bandwidth per CU
– L2 read/write cache partitions: 64 bytes per clock L2 bandwidth per partition
– Each CU has its own registers and local data share
– 32 KB instruction cache (I$) + 16 KB scalar data cache (K$) shared per 4 CUs, with L2 backing
– Global data share (GDS, 64 KB) facilitates synchronization between CUs

[Figure: cache hierarchy diagram. Labels: per-CU L1 caches, shared I$ and K$ blocks, GDS, and L2 partitions, each drawn with a 64-bit Dual Channel Memory Controller.]


GCN Vector Memory Instructions

MUBUF: read from or perform write/atomic to an un-typed memory buffer/address
– Data type/size is specified by the instruction operation
MTBUF: read from or write to a typed memory buffer/address
– Data type is specified in the resource constant
MIMG: read/write/atomic operations on elements from an image surface
– Image objects (1-4 dimensional addresses and 1-4 dwords of homogeneous data)
– Image objects use resource and sampler constants for access and filtering

Vector memory instructions support variable granularity for addresses and data, ranging from 32-bit data to 128-bit pixel quads


GCN Export Memory Instruction

Exports move data from 1-4 VGPRs to the graphics pipeline
– Color (MRT0-7), Depth, Position, and Parameter

Global Shared Memory Ops (utilize GDS)

Page 37: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN LOW-LEVEL TIPS – How GCN Exports Work

The export unit writes results from the programmable stages of the graphics pipeline to the fixed function ones, such as tessellation, rasterization and the render back-ends, via the GDS

The GDS is identical to the local data shares, except that it is shared by all compute units, so it acts as an explicit global synchronization point between all wavefronts.

The atomic units in the GDS additionally support ordered count operations

Page 38: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Local Data Share (LDS)

64 KB, 32 bank (or 16 bank) shared memory, fully decoupled from ALU instructions

Direct mode
– Vector instruction operand: 32/16/8-bit broadcast value
– Graphics interpolation at rate, no bank conflicts

Index mode – load/store/atomic operations
– Bandwidth amplification: up to 32 32-bit lanes serviced per clock (peak)
– Direct decoupled return to VGPRs
– Hardware conflict detection with auto scheduling

Software consistency/coherency for thread groups via hardware barrier

Fast & low power vector load return from R/W L1

Page 39: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Local Data Share (LDS)

An LDS bank is 512 entries, each 32 bits wide
– A bank can read and write a 32-bit value across an all-to-all crossbar and swizzle unit that includes 32 atomic integer units
– This means that several threads can read the same LDS location at the same time for FREE
– Writing to the same address from multiple threads also occurs at rate; the last thread to write wins

Typically, the LDS will coalesce 32 lanes from one SIMD each cycle
– One wavefront is serviced completely every 2 cycles
– Conflicts are automatically detected across 32 lanes from a wavefront and resolved in hardware
– An instruction which accesses different elements in the same bank takes additional cycles
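
The bank-conflict rules above can be illustrated with a small model. This is a sketch only: it assumes consecutive dwords map to consecutive banks (a common layout, not stated on the slide), and the names `bank_of` and `conflict_passes` are hypothetical. It counts how many distinct dwords each bank must serve for one group of 32 lanes, treating identical addresses as a free broadcast.

```python
from collections import defaultdict

NUM_BANKS = 32        # from the slide: 32-bank LDS
ENTRIES_PER_BANK = 512
BANK_WIDTH_BYTES = 4  # each bank entry is 32 bits wide

def bank_of(byte_address):
    """Bank index, ASSUMING consecutive dwords map to consecutive banks."""
    return (byte_address // BANK_WIDTH_BYTES) % NUM_BANKS

def conflict_passes(byte_addresses):
    """Estimate the passes needed to service one group of 32 lanes.

    Lanes that touch the same dword share one access (broadcast), so only
    distinct dwords within a bank count against that bank.
    """
    dwords_per_bank = defaultdict(set)
    for addr in byte_addresses:
        dwords_per_bank[bank_of(addr)].add(addr // BANK_WIDTH_BYTES)
    return max(len(dwords) for dwords in dwords_per_bank.values())

stride_1 = [lane * 4 for lane in range(32)]        # one dword per bank
broadcast = [0] * 32                               # all lanes read one dword
stride_2 = [lane * 2 * 4 for lane in range(32)]    # two dwords per used bank
stride_32 = [lane * 32 * 4 for lane in range(32)]  # every lane hits bank 0

for name, pattern in [("stride 1", stride_1), ("broadcast", broadcast),
                      ("stride 2", stride_2), ("stride 32", stride_32)]:
    print(name, conflict_passes(pattern))
```

Under these assumptions, unit-stride and broadcast accesses need a single pass, while a stride of 32 dwords serializes all 32 lanes on one bank. Note that 32 banks x 512 entries x 4 bytes matches the 64 KB capacity quoted for the LDS.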

Page 40: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Local Data Share (LDS)

Page 41: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN R/W CACHE

Reads and writes cached
– Bandwidth amplification
– Improved behavior on more memory access patterns
– Improved write to read reuse performance

Relaxed consistency memory model
– Consistency controls available to control locality of load/store

GPU Coherent
– Acquire/Release semantics control data visibility across the machine (GLC bit on load/store)
– L2 coherent = all CUs can have the same view of data

Global atomics
– Performed in L2 cache

Page 42: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN L1 R/W Cache Architecture

– Each CU has its own L1 cache
  16 KB L1, 64B lines, 4 sets x 64 way
  ~64B/CLK per compute unit bandwidth
  Write-through – alloc on write (no read) w/ dirty byte mask
    – Write-through at end of wavefront
    – Decompression on cache read out

– Instruction GLC bit defines cache behavior
  GLC = 0
    – Local caching (full lines left valid)
    – Shader write back invalidate instructions
  GLC = 1
    – Global coherent (hits within wavefront boundaries)

Page 43: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN L2 R/W Cache Architecture

– 64-128 KB L2 per memory controller channel
  64B lines, 16 way set associative
  ~64B/CLK per channel for L2/L1 bandwidth
  Write-back – alloc on write (no read) w/ dirty byte mask
  Acquire/Release semantics control data visibility across CUs
    – L2 coherent = all CUs can have the same view of data
  Remote atomic operations
    – Common integer set & float Min/Max/CmpSwap

Page 44: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Latency & Bandwidth

– Each CU has 64 bytes per cycle of L1 bandwidth
  Shared with the GDS
– Per L2 there’s 64 bytes of data per cycle as well

– Peak scalar data cache bandwidth per CU is 16 bytes/cycle
– Peak I-cache bandwidth per CU is 32 bytes/cycle (optimally 8 instructions)
– LDS peak bandwidth is 128 bytes of data per cycle via bandwidth amplification
– That’s nearly 4 TB/s of LDS BW, 2 TB/s of L1 BW, and 700 GB/s of L2 BW!

– 384-bit GDDR5 main memory has over 264 GB/sec bandwidth
– PCI Express 3.0 x16 bus interface to system (32 GB/s)
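
The aggregate figures above can be roughly reproduced from the per-clock numbers. The sketch below is only a back-of-the-envelope check: the 925 MHz engine clock, the 64 KB L2 partition size (which gives 12 partitions for 768 KB), and the 5.5 Gbps GDDR5 pin rate are assumptions, not values quoted on these slides.

```python
# Figures taken from the slides
NUM_CUS = 32               # "Up to 32 Compute Units"
L1_BYTES_PER_CLK = 64      # per CU
LDS_BYTES_PER_CLK = 128    # per CU, peak
L2_BYTES_PER_CLK = 64      # per L2 partition
L2_TOTAL_KB = 768          # "Up to 768KB read/write L2 cache"
BUS_WIDTH_BITS = 384       # GDDR5 interface width

# ASSUMPTIONS, not stated on the slides
ENGINE_CLOCK_HZ = 925e6    # assumed engine clock
L2_PARTITION_KB = 64       # assumed partition size (slides say 64-128 KB)
GDDR5_GBPS_PER_PIN = 5.5   # assumed effective data rate per pin

l2_partitions = L2_TOTAL_KB // L2_PARTITION_KB

lds_bw = NUM_CUS * LDS_BYTES_PER_CLK * ENGINE_CLOCK_HZ
l1_bw = NUM_CUS * L1_BYTES_PER_CLK * ENGINE_CLOCK_HZ
l2_bw = l2_partitions * L2_BYTES_PER_CLK * ENGINE_CLOCK_HZ
mem_bw = BUS_WIDTH_BITS / 8 * GDDR5_GBPS_PER_PIN * 1e9

print(f"LDS : {lds_bw / 1e12:.2f} TB/s")
print(f"L1  : {l1_bw / 1e12:.2f} TB/s")
print(f"L2  : {l2_bw / 1e9:.0f} GB/s")
print(f"DRAM: {mem_bw / 1e9:.0f} GB/s")
```

With these assumed inputs the results land close to the quoted "nearly 4 TB/s", "2 TB/s", "700 GB/s", and "264 GB/sec" figures.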

Page 45: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN L1 Texture Cache

The memory hierarchy is re-used for graphics

Some dedicated graphics hardware added

Address-gen unit receives 4 texture addr/clock
– Calculates 16 sample addr (nearest neighbors)
– Reads samples from L1 data cache
– Decompresses samples in Texture Mapping Unit (TMU)

TMU filters adjacent samples, produces <= 4 interpolated texels/clock

TMU output undergoes format conversion and is written into the vector register file

The format conversion hardware is also used for writing certain formats to memory from graphics shaders

Page 46: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Virtual Memory and x86

The GCN cache hierarchy was designed to integrate with x86 microprocessors

The GCN virtual memory system can support 4KB pages
– Natural mapping granularity for the x86 address space
– Paves the way for a shared address space in the future
– IOMMU used for DMA transfers can already translate requests into x86 address space

GCN caches use 64B lines, which is the same size as x86 processors use

The stage is set for heterogeneous systems to transparently share data between the GPU and CPU through the traditional caching system, without explicit programmer control!

Page 53: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

AMD Radeon™ HD 7900 Series – Codename “Tahiti”

Graphics Core Next Architecture
Up to 32 Compute Units
Dual Geometry Engines
8 Render Back-ends
– 32 color ROPs per clock
– 128 Z/stencil ROPs per clock
Up to 768 KB read/write L2 cache
Fast 384-bit GDDR5 memory interface
– Up to 264 GB/sec
PCI Express 3.0 x16 bus interface
4.3 billion 28nm transistors
3.79 Peak Single-Precision TFLOPS

Page 54: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

AMD Radeon™ HD 7900 Series – Compute Architecture

Dual Asynchronous Compute Engines (ACE)
– Operate in parallel with graphics command processor
– Independent scheduling and work item dispatch for efficient multi-tasking
  3 Devices with 3 Command Queues!
– Fast context switching
– Exposed in OpenCL™

Dual DMA engines
– Can saturate PCIe 3.0 x16 bus bandwidth (16 GB/sec bidirectional)

Page 55: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

AMD Radeon™ HD 7900 Series – Compute Architecture

High performance double precision floating point processing
– Up to 947 DP GFLOPS
– Higher utilization = more usable FLOPS
– IEEE compliant

More efficient flow control & branching

Full ECC protection for DRAM & SRAM

First GPU to fully support OpenCL 1.2, Direct3D + Compute 11.1, and C++ AMP

New compute instructions
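
The peak FLOPS figures quoted in the deck (3.79 single-precision TFLOPS, 947 double-precision GFLOPS) can be sanity-checked with a short calculation. Treat this as a sketch: the 64 lanes per CU (4 SIMDs of 16 lanes), the 925 MHz clock, and the 1/4 double-precision rate are assumptions that are not stated on these slides.

```python
NUM_CUS = 32             # from the slides: "Up to 32 Compute Units"

# ASSUMPTIONS, not stated on these slides
LANES_PER_CU = 64        # assumed: 4 SIMDs x 16 lanes per CU
FLOPS_PER_LANE_CLK = 2   # a fused multiply-add counted as 2 FLOPs
ENGINE_CLOCK_HZ = 925e6  # assumed engine clock
DP_RATE = 1 / 4          # assumed double:single precision throughput ratio

sp_flops = NUM_CUS * LANES_PER_CU * FLOPS_PER_LANE_CLK * ENGINE_CLOCK_HZ
dp_flops = sp_flops * DP_RATE

print(f"Single precision: {sp_flops / 1e12:.2f} TFLOPS")
print(f"Double precision: {dp_flops / 1e9:.0f} GFLOPS")
```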

Page 56: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Architecture – ACE Intimate Details

ACEs are responsible for compute shader scheduling & resource allocation

Each ACE fetches commands from cache or memory & forms task queues

Tasks have a priority level for scheduling
– Background to Realtime

ACEs dispatch tasks to shader arrays as resources permit

Tasks complete out-of-order, tracked by ACE for correctness

Every cycle, an ACE can create a workgroup and dispatch one wavefront from the workgroup to the CUs

Page 57: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Architecture – ACE Intimate Details

ACEs are independent
– But can synchronize and communicate via cache/mem/GDS

Can form task graphs
– Individual tasks can have dependencies on one another
  Can depend on another ACE
  Can depend on part of the graphics pipeline

Can control task switching
– Stop and start tasks and dispatch work to shader engines
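
To make the idea of prioritized task graphs concrete, here is a toy software model. It is not how the ACE hardware is implemented: the function `dispatch_order`, the task names, and the priority values are all hypothetical, and each task is treated as complete the moment it is dispatched. It only shows how priorities and dependencies together determine a dispatch order.

```python
import heapq

def dispatch_order(tasks):
    """Return a dispatch order for `tasks`.

    `tasks` maps a task name to (priority, set_of_dependencies). A larger
    priority value is more urgent. A task becomes ready only once all of
    its dependencies have been dispatched (modelled as already complete).
    """
    remaining = {name: set(deps) for name, (_, deps) in tasks.items()}
    ready = [(-tasks[n][0], n) for n, deps in remaining.items() if not deps]
    heapq.heapify(ready)
    order = []
    while ready:
        _, name = heapq.heappop(ready)   # most urgent ready task first
        order.append(name)
        del remaining[name]
        for other, deps in remaining.items():
            if name in deps:
                deps.remove(name)
                if not deps:             # last dependency satisfied
                    heapq.heappush(ready, (-tasks[other][0], other))
    if remaining:
        raise ValueError("dependency cycle or unknown dependency")
    return order

# Hypothetical workload: names and priorities are illustrative only.
tasks = {
    "physics":  (2, set()),
    "culling":  (3, set()),
    "lighting": (3, {"culling"}),
    "post":     (1, {"lighting", "physics"}),
}
print(dispatch_order(tasks))
```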

Page 58: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Architecture – Enabling Compute Workloads

The focus in GPU hardware is shifting away from graphics-specific units, towards general-purpose compute units

7900 Series GCN-based ASICs already have “3:1” ratio of ACE : Graphics CP
– Graphics CP can dispatch compute
– ACE cannot dispatch graphics

If you aren’t writing Compute Shaders, you’re probably not getting the absolute most out of modern GPUs
– Control of LDS, barriers, thread layout, etc.

Future Trends:
More Compute Units
– ALU outpaces BW
CPU + GPU Flat Mem
– APU + dGPU
Less FF Graphics
– Can you write a Compute-based graphics pipeline? Start thinking about it…

Page 59: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

Utilization and Efficiency

Higher utilization = higher performance per sq. mm

[Chart: AMD Radeon HD 7970 vs. AMD Radeon HD 6970 on Mandelbrot DP, AES256, SHA256, LuxMark, and SmallptGPU; axis from 0x to 5x; bars are split into GFLOPS increase (1.4x) and utilization improvement]

Page 60: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Geometry Engine

GS in conjunction with Tessellation is faster than before…

However… memory is still the bottleneck!
– Minimize the number of inputs and outputs for best performance…
  Small expansions can be done in LDS!

Each rasterizer can read in a single triangle per cycle, and write out 16 pixels
– Caveat: tiny triangles can mean that we don’t reach this potential, and become raster-bound!

[Images: tessellation off vs. tessellation on. Image from Battlefield 3, EA DICE]

Page 61: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Tessellation

Latest iteration of hardware tessellation units
– Increased vertex re-use
– Off-chip buffering improvements
– Larger parameter caches

Improves performance at all tessellation factors
– Up to 4x throughput of AMD Radeon HD 6900 series (Gen 8)

[Images: tessellation off vs. tessellation on. Image from Battlefield 3, EA DICE]

Page 62: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Tessellation – Performance

[Chart: tessellation rate of AMD Radeon™ HD 7970 relative to AMD Radeon™ HD 6970; x-axis: Tessellation Factor (1 to 31); y-axis: Tessellation Rate (0.0x to 4.5x)]

[Chart: Relative Performance (axis 0 to 2.5) in Crysis 2, Total War: Shogun 2, Lost Planet 2, Unigine Heaven, and S.T.A.L.K.E.R.: Call of Pripyat; bar labels: 60% faster, 139% faster, 62% faster, 55% faster, 82% faster]

Page 63: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Tessellation – Best Practices

While performance is much improved, it is still a potential bottleneck!
– Produces a great deal of ring bus traffic, starving other parts of the pipeline

Best performance achieved with tessellation factors less than 15!

Continue to Optimize:
– Pre-triangulate
– Frustum Culling
– Backface Culling
– Distance-adaptive
– Screen-space adaptive
– Orientation-adaptive
– Etc…
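
As one illustration of the distance-adaptive approach combined with the "less than 15" guidance, the sketch below computes a tessellation factor on the CPU side. The function name `tess_factor`, the linear falloff, and the `near`, `far`, and `max_factor` defaults are all hypothetical choices for illustration; a real implementation would typically live in a hull shader and be tuned per title.

```python
def tess_factor(distance, near=5.0, far=100.0, max_factor=14.0):
    """Linearly fade the tessellation factor with camera distance.

    Returns `max_factor` (kept below 15) at or inside `near`, and 1.0
    (no subdivision) at or beyond `far`. All defaults are hypothetical.
    """
    if distance <= near:
        return max_factor
    if distance >= far:
        return 1.0
    t = (distance - near) / (far - near)   # 0 at near, 1 at far
    return max_factor + t * (1.0 - max_factor)

for d in (0.0, 5.0, 52.5, 100.0, 250.0):
    print(d, tess_factor(d))
```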

Page 64: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

Render Back-End (RBE) ON GCN ASICS

Once the pixel fragments in a tile have been shaded, they flow to the Render Back-Ends (RBEs)
– 16 KB Color Cache
  Up to 8 color samples (i.e. 8x MSAA)
– 4 KB Depth Cache
  Up to 16 coverage samples (i.e. 16x EQAA)
– Write out through the memory controllers

Logic Operations as alternative to Blending
– Exposed in DX11.1
– Already available in OpenGL

Dual-Source Color Blending with MRTs
– Only available in OpenGL

Page 65: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

DEPTH IMPROVEMENTS ON GCN ASICS

Allows fast accept of fully-visible triangles spanning one or more tiles
– If a triangle fully covers a tile, the cost is only 1 clock/tile

Depth Bounds Testing Extension
– Exposed in OpenGL: GL_EXT_depth_bounds_test
– Also exposed in Direct3D via an extension – ask us if you’d like to try it

24-BIT DEPTH FORMATS ARE INTERNALLY REPRESENTED AS 32-BITS

Page 66: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

STENCIL IMPROVEMENTS ON GCN ASICS

GCN has support for new extended stencil ops compared to prior ASICs
– Only available in OpenGL: GL_AMD_stencil_operation_extended
– Additional stencil ops:
  AND, XOR, NOR
  REPLACE_VALUE_AMD
  etc.
– Also exposes additional stencil op source value
  Can be used as an alternative to stencil ref value

Stencil ref and op source value can now be exported from pixel shader
– Only available in OpenGL: GL_AMD_shader_stencil_value_export

Page 67: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN LOW-LEVEL TIPS – GPR Utilization

GPRs and GPR pressure

General Purpose Registers (GPR) are a limited resource
– Separate banks of GPRs for Vector and Scalar (per SIMD)
– Maximum of 256 VGPRs and 512 SGPRs shared across all waves (up to 10) owned by a SIMD
– Organized as 64 words of 32 bits – two adjacent GPRs can be combined for 64-bit (4 for 128-bit)
– Number of GPRs required by a shader affects SIMD scheduling and execution efficiency
– Shader tools can be used to determine how many GPRs are used

GPR pressure is affected by:
– Loop unrolling
– Long lifetime of temporary variables
– Nested dynamic flow control instructions
– Fetch dependencies (e.g. indexed constants)
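
The effect of register usage on the number of waves a SIMD can hold can be estimated from the limits above. This is a simplified sketch: the helper name `max_waves` is hypothetical, and the model ignores register allocation granularity, per-wave register caps, and other occupancy limits such as LDS usage.

```python
MAX_WAVES_PER_SIMD = 10  # from the slide: up to 10 waves per SIMD
VGPRS_PER_SIMD = 256     # VGPR budget shared by the waves on one SIMD
SGPRS_PER_SIMD = 512     # SGPR budget shared by the waves on one SIMD

def max_waves(vgprs_per_wave, sgprs_per_wave):
    """Upper bound on resident waves per SIMD from register usage alone.

    Simplified: ignores allocation granularity and non-register limits.
    """
    if vgprs_per_wave < 1 or sgprs_per_wave < 1:
        raise ValueError("register counts must be positive")
    return min(MAX_WAVES_PER_SIMD,
               VGPRS_PER_SIMD // vgprs_per_wave,
               SGPRS_PER_SIMD // sgprs_per_wave)

for vgprs, sgprs in [(24, 48), (26, 48), (64, 32), (128, 16), (20, 100)]:
    print(f"{vgprs:3d} VGPRs, {sgprs:3d} SGPRs -> {max_waves(vgprs, sgprs)} waves")
```

In this model a shader using 64 VGPRs is limited to 4 waves per SIMD, which leaves fewer waves available to hide memory latency than the maximum of 10.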

Page 68: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN LOW-LEVEL TIPS – Texture Filtering

– All shader stages can fetch textures
– Point sampling is full-rate on all formats
– Trilinear costs up to 2x the bilinear filtering cost
– Anisotropic (N taps) costs <= ( N * bilinear )
– Avoid cache thrashing
  Use MIPmapping
  Use Gather() where applicable
  Exploit implicit neighbouring pixel shader thread CU locality:
    – Remember that sampling from neighbouring texels has a lower cost for a shader running within the same hardware tile, because it is more likely to experience a cache hit within the Compute Unit’s local texture cache
  Exploit this explicitly by using Compute Shaders

GCN BILINEAR COSTS

Quarter-rate
• RGBA32 and RGBA32F

Half-rate
• RG32, RG32F
• RGBA16, RGBA16F
• BC6

Full-rate
• Everything else!
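
The filtering rules and the bilinear rate table can be combined into a rough upper-bound estimate. The sketch below expresses cost in units of one full-rate bilinear fetch; the function name `filter_cost_upper_bound` and the mode strings are hypothetical, and real costs also depend on cache behaviour, which is not modelled here.

```python
# Bilinear cost relative to a full-rate format (full = 1, half = 2, quarter = 4)
BILINEAR_COST = {
    "RGBA32": 4, "RGBA32F": 4,                       # quarter-rate
    "RG32": 2, "RG32F": 2,                           # half-rate
    "RGBA16": 2, "RGBA16F": 2, "BC6": 2,             # half-rate
}

def filter_cost_upper_bound(fmt, mode, taps=1):
    """Worst-case relative cost of one filtered fetch.

    Formats not listed above are treated as full-rate ("everything else").
    """
    if mode == "point":
        return 1                      # point sampling is full-rate on all formats
    bilinear = BILINEAR_COST.get(fmt, 1)
    if mode == "bilinear":
        return bilinear
    if mode == "trilinear":
        return 2 * bilinear           # up to 2x bilinear
    if mode == "anisotropic":
        return taps * bilinear        # at most N * bilinear
    raise ValueError(f"unknown filter mode: {mode}")

print(filter_cost_upper_bound("RGBA8", "bilinear"))
print(filter_cost_upper_bound("RGBA16F", "trilinear"))
print(filter_cost_upper_bound("RGBA32F", "anisotropic", taps=4))
```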

Page 69: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN LOW-LEVEL TIPS – Color Output

PS output: each additional color output increases export cost

Exports can be more costly than PS execution
– Each (fast) export is equivalent to 64 ALU ops on 7970
– If shader is export-bound then use “free” ALU for packing instead

Watch out for those cases
– E.g. G-Buffer parameter writes
– MINIMIZE SHADER INPUTS AND OUTPUTS!
– Pack, pack, pack, pack!

Costs of outputting and blending various formats
– Discard/clip allow the shader hardware to skip the rest of the work!

Quarter-rate
• RGBA16 with blending
• RGBA32F with blending

Half-rate
• R16, RG16 with blending
• RG32F with blending
• RGBA32, RGBA32F

Full-rate
• Everything else!
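
As an example of the "pack, pack, pack" advice, the sketch below packs two normalized values into one 32-bit word, the kind of ALU work that can replace an extra color output. It is written in Python purely for illustration; in a shader this would be done with the shading language's own pack intrinsics. The function names are hypothetical, chosen to mirror the common "unorm 2x16" convention.

```python
def pack_unorm2x16(x, y):
    """Pack two floats in [0, 1] into one 32-bit integer (16 bits each)."""
    def quantize(v):
        clamped = min(max(v, 0.0), 1.0)
        return int(round(clamped * 65535.0))
    return quantize(x) | (quantize(y) << 16)

def unpack_unorm2x16(packed):
    """Inverse of pack_unorm2x16, up to quantization error."""
    return ((packed & 0xFFFF) / 65535.0, (packed >> 16) / 65535.0)

packed = pack_unorm2x16(0.25, 0.75)
x, y = unpack_unorm2x16(packed)
print(hex(packed), x, y)
```

Two 16-bit channels packed this way fit in a single 32-bit channel of an output, at the price of a few ALU instructions and 16-bit precision.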

Page 70: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Media Processing Instructions

SAD = Sum of Absolute Differences

Critical to many video & image processing algorithms
– Motion detection
– Gesture recognition
– Video & image search
– Stereo depth extraction
– Computer vision

SAD (4x1) and QSAD (4 4x1) instructions
– New QSAD combines SAD with alignment ops for higher performance and reduced power draw
– Evaluate up to 256 pixels per CU per clock cycle!

Maskable MQSAD instruction
– Allows background pixels to be ignored
– Accelerated isolation of moving objects

[Figure: example pixel blocks compared by SAD, with results SAD = 59, SAD = 45, SAD = 58, and SAD = 22; the closest match is SAD = 22]

AMD Radeon HD 7970 can evaluate 7.6 Terapixels/sec *

* Peak theoretical performance for 8-bit integer pixels
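
For readers unfamiliar with SAD, the following sketch shows the computation in plain Python and uses it to pick the closest of several candidate blocks. The pixel values are made up for illustration and are not the ones from the slide's figure; the helper names `sad` and `closest_match` are hypothetical. The hardware instructions perform the same arithmetic on packed 8-bit pixels.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized pixel blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))

def closest_match(reference, candidates):
    """Return (index, score) of the candidate with the lowest SAD."""
    scores = [sad(reference, c) for c in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical 4x1 blocks of 8-bit pixel values
reference = [10, 20, 30, 40]
candidates = [
    [12, 25, 30, 41],
    [10, 21, 29, 40],
    [50, 60, 70, 80],
]
print(closest_match(reference, candidates))
```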

Page 71: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

GCN Video Codec Engine

Video Codec Engine (VCE)
– Hardware H.264 compression and decompression
  Ultra-low-power, fully fixed-function mode
– Capable of 1080p @ 60 frames / second
– Programmable for ultra high quality and/or speed
  Entropy encoding block fully accessible to software
– AMD Accelerated Parallel Processing SDK
– OpenCL™

Create hybrid faster-than-real-time encoders!
– Custom motion estimation
– Inverse DCT and motion compensation
– Combine with hardware entropy encoding!

AMD Radeon HD 7970 can compress Realtime+ 1080p H.264

Page 72: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

IMPORTANT GCN ARCHITECTURE IMPROVEMENTS

– Increased flexibility and efficiency, with reduced complexity!
  Non-VLIW Architecture improves efficiency while reducing programmer burden
  Constants/resources are just address + offset now in the hardware
  UAV/SRV/SUV read/write any format – like CPU C++ reinterpret_cast & static_cast
  GPU has virtual memory, forward looking towards x86 CPU + GPU flat memory

– Strong forward-looking focus on Compute
  Scalar ALU for complex dynamic control flow + branch & message unit
  64k LDS/CU, 64k GDS, atomics at every stage, coherent cache hierarchy
  Multiple Asynchronous Compute Engines (ACE) for multitasking compute

Page 73: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

MAIN GCN ARCHITECTURE TAKEAWAYS

– GCN generally simplifies your life as a programmer
  Don’t: fret too much about instruction grouping, or vectorization
  Do: think about GPR utilization & LDS usage (impacts max # of wavefronts)
  Do: think about thread/cache locality when you structure your algorithm
  Do: pack shader inputs and outputs – aim to be as IO/bandwidth thin as possible!

– Unlimited number of addressable constants/resources
  N constants aren’t free anymore – each consumes resources, use sparingly!

– Compute is the future – exploit its power for GPGPU work & graphics!

Page 74: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

THANK YOU

IF WE HAVE TIME REMAINING, WE CAN COVER PARTIALLY RESIDENT TEXTURES

Layla [email protected]
@MissQuickstep

Page 75: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

Partially Resident Textures (PRT)

MegaTexture in id Tech5

Page 76: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

Partially Resident Textures (PRT) – Introduction

Enables application to manage more texture data than can physically fit in a fixed footprint
– A.k.a. “Virtual texturing” and “Sparse texturing”

The principle behind PRT is that not all texture contents are likely to be needed at any given time
– Current render view may only require selected portions of a texture to be resident in memory
– Or, only selected MIPMap levels…

PRT textures only have a portion of their data mapped into GPU-accessible memory at a time
– Texture data can be streamed in on-demand
– Texture sizes up to 32 TB (16k x 16k x 8k x 128-bit)

OpenGL extension – GL_AMD_sparse_texture
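
The 32 TB figure follows directly from the quoted dimensions, as the short calculation below shows. It also counts how many 64 KB tiles such a texture would span and, as an assumption for illustration only, the side length of a square tile of 32-bit texels.

```python
import math

width = height = 16 * 1024   # 16k x 16k
depth = 8 * 1024             # 8k slices
bytes_per_texel = 128 // 8   # 128-bit texels
TILE_BYTES = 64 * 1024       # 64 KB PRT tile

total_bytes = width * height * depth * bytes_per_texel
total_tib = total_bytes / 2**40
tile_count = total_bytes // TILE_BYTES

# ASSUMPTION for illustration: a square 2D tile of 32-bit (4-byte) texels
tile_side_32bit = math.isqrt(TILE_BYTES // 4)

print(f"{total_tib:.0f} TB, {tile_count} tiles of 64 KB")
print(f"A 64 KB tile of 32-bit texels is {tile_side_32bit} x {tile_side_32bit}")
```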

Page 77: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

Partially Resident Textures (PRT) – TEXTURE TILES

The PRT texture is chunked into 64 KB tiles
– Fixed memory size
– Not dependent on texture type or format

[Figure: a chunked texture; highlighted areas represent texture data that needs highest resolution, i.e. the texture tiles needing to be resident in GPU memory. Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008]

Page 78: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

PRT – Translation Table

The GPU virtual memory page table translates 64 KB tiles into a resident texture tile pool

[Diagram: Texture Map, Page Table (unmapped and mapped page entries), and Texture Tile Pool in video memory (linear storage of 64 KB tiles). Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008]

Page 79: GDC 2013: Powering The Next Generation Of Graphics AMD GCN ...developer.amd.com/wordpress/media/2012/...AMD_GCN_Architecture2.ppsxGDC 2013: Powering The Next Generation Of Graphics

PRT – Translation Table – Mip Maps

Not all tiles from the texture map are actually resident in video memory

PRT hardware page table stores virtual-to-physical mappings

[Diagram: Texture Map with MIP levels, Page Table (unmapped and mapped page entries), and Texture Tile Pool in video memory (64 KB tiles). Smiley texture courtesy of Sparse Virtual Texturing, GDC 2008]


PRT – TILE MANAGEMENT

The application is responsible for uploading/releasing new PRT tiles!

A common scenario is to upload the lower-resolution MIP levels to the texture tile pool
– This allows a full representation of the PRT contents to be resident in memory (albeit at lower resolution)
– e.g. MIP LOD 6 and above for a 16K x 16K 32-bit texture is about 650 KB (256x256 resolution)

Texture tiles corresponding to higher-resolution areas are uploaded by the application as needed
– e.g. as the camera gets closer to a PRT-textured polygon, the required texel-to-screen-pixel ratio increases, so higher-resolution tiles need uploading
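The footprint of such a low-resolution set can be estimated with a few lines of Python. The function below is an illustrative sketch, not an API: it reports the tightly packed size of the raw texels and the size when every level is rounded up to whole 64 KB tiles. For MIP 6 and above of a 16K x 16K 32-bit texture these come to 341 KB and 768 KB; the slide's figure of about 650 KB lies between the two, and the source does not say how the small levels are packed.

```python
TILE_BYTES = 64 * 1024  # tile size from the slides


def mip_tail_sizes(base: int, first_level: int, bytes_per_texel: int):
    """Sizes in bytes of mips first_level..last of a square base x base texture.

    Returns (packed, tile_rounded): raw texel bytes, and bytes when each
    level is rounded up to whole 64 KB tiles (an assumed packing).
    """
    packed = 0
    tile_rounded = 0
    size = base >> first_level
    while size >= 1:
        level_bytes = size * size * bytes_per_texel
        packed += level_bytes
        tile_rounded += -(-level_bytes // TILE_BYTES) * TILE_BYTES
        size >>= 1
    return packed, tile_rounded


packed, tile_rounded = mip_tail_sizes(16384, 6, 4)
print(16384 >> 6)            # 256: edge length of MIP 6
print(packed // 1024)        # 341 (KB, tightly packed)
print(tile_rounded // 1024)  # 768 (KB, one or more whole tiles per level)
```

Either way the always-resident set is well under 1 MB, against 1 GB for the full-resolution top level.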


PRT – “FAILED” FETCH

How does the application know which texture tiles to upload?

Answer: PRT-specific texture fetch instructions in pixel shader
– Return a “Failed” texel fetch condition when sampling a PRT pixel whose tile is currently not in the pool
– OpenGL example: int glSparseTexture( gsampler2D sampler, vec2 P, inout gvec4 texel );

This information is then stored in render target or UAV
– Texel fetch failed for a given (x,y) tile location

...and then copied to the CPU so that application can upload required tiles

App chooses what to render until missing data gets uploaded
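One way an application might "choose what to render" is to fall back to a coarser resident MIP level while recording the miss. The Python sketch below models that on the CPU; `sparse_fetch`, `shade` and the dictionary standing in for the tile pool are hypothetical names for illustration, not the GLSL built-ins or any real API.

```python
from typing import Dict, Optional, Set, Tuple

Tile = Tuple[int, int, int]  # (mip level, tile x, tile y); hypothetical key layout


def sparse_fetch(resident: Dict[Tile, str], tile: Tile) -> Tuple[bool, Optional[str]]:
    """Return (ok, texel); ok is False when the tile is not in the pool."""
    if tile in resident:
        return True, resident[tile]
    return False, None


def shade(resident: Dict[Tile, str], mip: int, tx: int, ty: int,
          coarsest_mip: int, feedback: Set[Tile]) -> str:
    """Fetch at the wanted mip; on failure record the tile and try coarser mips."""
    ok, texel = sparse_fetch(resident, (mip, tx, ty))
    if ok:
        return texel
    feedback.add((mip, tx, ty))  # stands in for the render target / UAV write
    for m in range(mip + 1, coarsest_mip + 1):
        shift = m - mip          # tile coordinates halve with each coarser level
        ok, texel = sparse_fetch(resident, (m, tx >> shift, ty >> shift))
        if ok:
            return texel
    return "fallback-color"      # nothing resident at all


resident = {(2, 0, 0): "mip2-texel"}  # only a coarse tile has been uploaded
feedback: Set[Tile] = set()
print(shade(resident, 0, 3, 2, 2, feedback))  # mip2-texel
print(feedback)                               # {(0, 3, 2)}
```

The feedback set is what the application would read back to decide which tiles to upload.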


PRT – “LOD WARNING” TEXEL FETCH CONDITION

PRT fetch condition code can also indicate an “LOD Warning”

The minimum LOD warning is specified by the application on a per-texture basis
– OpenGL example: glTexParameteri( <target>, MIN_WARNING_LOD_AMD, <LOD warning value> );

If a fetched pixel’s LOD is < the specified LOD warning value then the condition code is returned

This functionality is typically used to try to predict when higher-resolution MIP levels will be needed
– E.g. Camera getting closer to PRT-mapped geometry
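The rule above reduces to a single comparison per fetch. In the Python sketch below the two flag bits are an invented encoding, chosen only to show that a fetch can succeed and still raise the warning; the real condition code layout is not given in the slides.

```python
RESIDENT = 0x1     # hypothetical flag: the tile was in the pool
LOD_WARNING = 0x2  # hypothetical flag: fetched LOD is below the warning threshold


def fetch_code(lod: float, tile_resident: bool, min_warning_lod: float) -> int:
    """Condition code for one fetch, following the rule on the slide."""
    code = 0
    if tile_resident:
        code |= RESIDENT
    if lod < min_warning_lod:  # finer than the level the app said to warn at
        code |= LOD_WARNING
    return code


# Warn whenever sampling drops below LOD 3.0.
print(fetch_code(3.5, True, 3.0))   # 1: resident, no warning
print(fetch_code(2.4, True, 3.0))   # 3: resident, but finer MIPs will be wanted soon
print(fetch_code(2.4, False, 3.0))  # 2: failed fetch with warning
```

A code of 3 is the useful case for prefetching: the frame still renders correctly, and the application learns that the next finer level should be queued.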


PRT – Example Usage

1. App allocates PRT (e.g. 16K x 16K DXT1) using PRT API

2. App uploads MIP levels using API calls

3. Shader fetches PRT data at specified texcoords

Two possibilities:
3.a. Texel data belongs to a resident (64 KB) tile
- Valid color returned, no error code

3.b. Texel data points to a non-resident tile or specified LOD
- Error/LOD Warning code returned
- Shader writes tile location and error code to RT or UAV

4. App reads RT or UAV and uploads/releases tiles as needed
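Steps 3 and 4 repeat every frame until the visible tiles are resident. A rough Python model of that loop follows; `render_frame`, `stream` and the per-frame upload budget are assumptions added for illustration, and plain sets stand in for the tile pool and the RT/UAV feedback.

```python
from typing import Set, Tuple

Tile = Tuple[int, int, int]  # (mip level, tile x, tile y); hypothetical key layout


def render_frame(wanted: Set[Tile], resident: Set[Tile]) -> Set[Tile]:
    """Step 3: the 'shader' pass. Returns the tiles whose fetch failed."""
    return {tile for tile in wanted if tile not in resident}


def stream(wanted: Set[Tile], resident: Set[Tile], uploads_per_frame: int) -> int:
    """Step 4, repeated: read feedback and upload a limited number of tiles.

    Returns the number of frames rendered until no fetch fails.
    """
    frames = 0
    while True:
        frames += 1
        feedback = render_frame(wanted, resident)
        if not feedback:
            return frames
        for tile in sorted(feedback)[:uploads_per_frame]:
            resident.add(tile)  # upload into the tile pool


resident = {(6, 0, 0)}                                    # step 2: low MIP uploaded
wanted = {(6, 0, 0), (0, 1, 1), (0, 1, 2), (0, 2, 1)}     # tiles the camera now sees
print(stream(wanted, resident, uploads_per_frame=2))      # 3
```

With three missing tiles and two uploads per frame, the third frame is the first with no failed fetches. A real application would also release tiles that are no longer wanted.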


PRT Types, Formats and Dimensions

All texture types and formats supported
– 1D, 2D, cube, arrays and 3D volume textures
– All common texture formats, including compressed formats
– Maximum dimensions: 16K x 16K x 8K x 128-bit textures


Hardware PRT > Software Implementation

PRT vs. SW Implementation

Ease of implementation
• Complexity hidden behind HW & API

Full filtering support
• Includes anisotropic filtering

Full-speed filtering
• SW solution requires “manual” filtering
• Software anisotropic is very costly

Don’t go overboard with PRT allocation!
• Page table entry size is 4 DWORDs
• Page table entries have to be resident in video memory
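The cost of over-allocating can be estimated from the entry size. The Python sketch below assumes one page table entry per 64 KB tile of virtual texture space, which the slides imply but do not state outright; the function name is illustrative.

```python
TILE_BYTES = 64 * 1024  # tile size from the slides
PTE_BYTES = 4 * 4       # 4 DWORDs per page table entry, from the slide


def page_table_bytes(virtual_bytes: int) -> int:
    """Page table size, assuming one entry per 64 KB tile (an assumption)."""
    tiles = -(-virtual_bytes // TILE_BYTES)  # ceiling division
    return tiles * PTE_BYTES


top_level = 16384 * 16384 * 4                  # 16K x 16K, 32-bit, top MIP only
print(page_table_bytes(top_level) // 1024)     # 256 (KB)
print(page_table_bytes(TILE_BYTES))            # 16 (bytes for a single tile)
```

Under that assumption each 16K x 16K 32-bit PRT costs about 256 KB of video memory for its top level alone, whether or not any tile is mapped.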


问题?Questions? 質問がありますか?

^_^

Layla [email protected] @MissQuickstep


Trademark Attribution

AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names used in this presentation are for identification purposes only and may be trademarks of their respective owners.

©2012 Advanced Micro Devices, Inc. All rights reserved.


Backup Slides


SHADER CODE EXAMPLE 2

float fn0(float a, float b)
{
    float c = 0.0;
    float d = 0.0;

    for (int i = 0; i < 100; i++)
    {
        if (c > 113.0)
            break;
        c = c * a + b;
        d = d + 1.0;
    }
    return (d);
}

// Registers r0 contains “a”, r1 contains “b”, r2 contains “c”
// and r3 contains “d”
// Value is returned in r3

        v_mov_b32       r2, #0.0            // float c = 0.0
        v_mov_b32       r3, #0.0            // float d = 0.0
        s_mov_b64       s0, exec            // Save execution mask
        s_mov_b32       s2, #0              // i = 0
label0:
        s_cmp_lt_s32    s2, #100            // i < 100
        s_cbranch_sccz  label1              // Exit loop if not true
        v_cmp_le_f32    r2, #113.0          // vcc = (c <= 113.0), i.e. lanes that do not break
        s_and_b64       exec, vcc, exec     // Mask off lanes where c > 113.0
        s_branch_execz  label1              // Exit loop if no lanes remain active
        v_mul_f32       r2, r2, r0          // c = c * a
        v_add_f32       r2, r2, r1          // c = c + b
        v_add_f32       r3, r3, #1.0        // d = d + 1.0
        s_add_s32       s2, s2, #1          // i++
        s_branch        label0              // Jump to start of loop
label1:
        s_mov_b64       exec, s0            // Restore exec mask
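The effect of the exec mask is easier to see when several lanes run side by side. The Python model below is a sketch only: it uses a list of booleans for the mask and an arbitrary lane count, and the function names are made up for illustration. Each lane leaves the loop at its own iteration while the wavefront keeps stepping until every lane is masked off or the counter reaches 100.

```python
from typing import List


def fn0(a: float, b: float) -> float:
    """Scalar reference: the shader function from the slide."""
    c, d = 0.0, 0.0
    for _ in range(100):
        if c > 113.0:
            break
        c = c * a + b
        d += 1.0
    return d


def run_wavefront(a: List[float], b: List[float]) -> List[float]:
    """Run fn0 for all lanes in lockstep using an execution mask."""
    lanes = len(a)
    c = [0.0] * lanes
    d = [0.0] * lanes
    saved_exec = [True] * lanes        # s_mov_b64 s0, exec
    exec_mask = list(saved_exec)
    i = 0                              # scalar loop counter, shared by all lanes
    while i < 100:                     # s_cmp_lt_s32 and scalar branch
        vcc = [c[l] <= 113.0 for l in range(lanes)]            # v_cmp_le_f32
        exec_mask = [e and v for e, v in zip(exec_mask, vcc)]  # s_and_b64
        if not any(exec_mask):         # leave once no lane is active
            break
        for l in range(lanes):         # vector ops touch active lanes only
            if exec_mask[l]:
                c[l] = c[l] * a[l] + b[l]
                d[l] += 1.0
        i += 1
    exec_mask = list(saved_exec)       # restore exec mask after the loop
    return d


print(run_wavefront([0.0, 1.0, 2.0], [1.0, 50.0, 1.0]))  # [100.0, 3.0, 7.0]
```

The first lane never exceeds 113.0, so the whole wavefront runs all 100 iterations even though the other two lanes were masked off after 3 and 7.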