ATI Stream Computing ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview Micah Villmow May 30, 2008


Outline

• ATI Radeon™ HD 3800 Series GPU – What changed.

• ATI Radeon™ HD 3400/3600 Series and X2 GPU variants

• ATI Radeon™ HD 4800 – A new architecture?

• Compute Shader – A new paradigm


ATI Radeon™ HD 3800 Series GPU – What Changed

• Double Precision

• Memory Controller Modifications

• Tex Modifications

• Linear Memory

• Global Buffer support

• Limited Render backends


ALU Hardware – Double Precision

• Combine thin pipes together to produce double

• Combines two F32 components for F64

MSB is in y/w component

LSB is in x/z component

• Two pipe instructions:

DADD

Double-Float Conversion Ops

DLDEXP

DFRAC

Double Comparison Ops

• Four pipe instructions:

DFREXP

DMUL

DMAD

IL:
dmad r10.xy__, r0.xy, r5.xy, r10.xy

ISA:
21 x: MULADD_64 T0.x, R5.y, R1.y, T0.y
   y: MULADD_64 T0.y, R5.y, R1.y, T0.y
   z: MULADD_64 ____, R5.y, R1.y, T0.y
   w: MULADD_64 ____, R5.x, R1.x, T0.x
   t: MULADD R4.y, R5.z, R3.z, T0.z

IL:
dadd r10.xy__, r0.xy, r5.xy
dadd r10.__zw, r0.zw, r5.zw

ISA:
20 x: ADD_64 T3.x, R3.y, R1.y
   y: ADD_64 T3.y, R3.x, R1.x
   z: ADD_64 T3.z, R3.w, R1.w VEC_120
   w: ADD_64 T3.w, R3.z, R1.z
   t: MULADD T0.w, R4.y, R1.x, T0.w
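The register layout on this slide (one F64 held in two F32 components, LSB in x/z and MSB in y/w) can be illustrated on the host side. The following is a minimal Python sketch, not ATI code; the helper names `split_double` and `join_double` are hypothetical and only show how a 64-bit value divides into the two 32-bit words.

```python
import struct

def split_double(value):
    """Split an IEEE-754 double into (lsb, msb) 32-bit words."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    lsb = bits & 0xFFFFFFFF   # would occupy the x (or z) component
    msb = bits >> 32          # would occupy the y (or w) component
    return lsb, msb

def join_double(lsb, msb):
    """Rebuild the double from its two 32-bit halves."""
    bits = (msb << 32) | lsb
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

x, y = split_double(1.0)
print(hex(x), hex(y))  # 0x0 0x3ff00000
```

Because each double occupies a component pair, a four-component register holds two doubles (xy and zw), which matches the two `dadd` instructions shown above.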


Memory Hardware – Memory Controller

• Die-shrink from 80nm to 55nm

• 512-bit ring bus, 256r/256w

• 72 GB/s bandwidth peak

• 32-bit memory channels
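The peak bandwidth figure can be related to the channel width with simple arithmetic. The sketch below is illustrative only: the slide gives the 32-bit channel width and the 72 GB/s peak, while the channel count of 8 and the 2.25 GT/s data rate are assumptions chosen here because they reproduce that peak.

```python
def peak_bandwidth_gb_s(bus_width_bits, data_rate_gt_s):
    """Peak bandwidth in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * data_rate_gt_s

CHANNEL_WIDTH_BITS = 32   # from the slide
NUM_CHANNELS = 8          # assumption, not stated on the slide
DATA_RATE_GT_S = 2.25     # assumption, not stated on the slide

bus_width = NUM_CHANNELS * CHANNEL_WIDTH_BITS
print(bus_width, peak_bandwidth_gb_s(bus_width, DATA_RATE_GT_S))  # 256 72.0
```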


Memory Hardware – Texture Unit

• Four 32KB four-way associative L1 caches

• L1 cache size is 4x8KB per SIMD engine

• Data is split across all four 8KB L1 caches

• L1 cacheline is 128 bytes or 2 quads of data

• 256KB unified cache over all SIMDs


Memory Hardware – Linear Layout

Tiled Layout

[Figure: tiled texture with Width, Pitch, and Height labeled]

• Possible wasted space between width and pitch

• Euclidean coordinates for addressing

• Macro-micro tiling format is non-linear

• Outputs through color buffer backend

Linear Layout

[Figure: linear buffer with Pitch and Height labeled]

• Addressable space is pitch * height

• No wasted space in allocated texture

• Linear macro tiling format

• Outputs through SMX
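The difference between the two layouts comes down to how much of the pitch * height allocation is addressable. The following Python sketch illustrates that arithmetic with hypothetical helper names; it ignores the macro/micro tiling details, which the slide does not specify.

```python
def wasted_elements(width, pitch, height):
    """Padding elements per allocation when only `width` of each
    `pitch`-wide row is addressable (the tiled, 2D-addressed case)."""
    return (pitch - width) * height

def linear_offset(x, y, pitch):
    """Row-major offset into a linear buffer of pitch * height elements."""
    return y * pitch + x

# A 100-wide surface padded to a pitch of 128, 4 rows high.
print(wasted_elements(100, 128, 4))  # 112
# In the linear layout every one of the 128 * 4 elements is addressable.
print(linear_offset(5, 2, 128))      # 261
```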


Memory Hardware – RB Changes

[Figure: block diagrams showing the Memory Controller and DPP Array for the ATI Radeon™ HD 2900 Series GPU and the ATI Radeon™ HD 3800 Series GPU]


ATI Radeon™ HD 4800 Series GPUs - Improvements

• 2.5x more floating point compute power than ATI Radeon™ HD 3800 Series GPUs

• Includes all the features added to ATI Radeon™ HD 3800 Series GPUs

• Higher bandwidths w/ GDDR5 memory

• 115GB/s memory bandwidth

• 1.2 Teraflops peak ALU performance

• New compute shader paradigm

• Inter- and Intra- thread sharing


ATI Radeon™ HD 4800 Series GPUs – Architecture Features

• ALU Improvements

– 10 SIMD engines

– 16 TPs per SIMD

– 5 streaming cores per TP

– 800 total streaming cores

– Shared global registers

• TEX Improvements

– 4 TEX units per SIMD

– 40 total TEX units

– Local data share

– Global data share

• MEM Improvements

– 8KB L1 cache per SIMD

– 480 GB/s L1 BW

– Four 32KB L2 caches

– 384 GB/s L2->L1 BW
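The totals on this slide follow directly from the per-SIMD counts. The Python sketch below reproduces them; the 750 MHz engine clock and the count of 2 floating-point operations per core per clock (one multiply-add) are assumptions that do not appear on the slides, chosen because they are consistent with the 1.2 teraflops peak quoted earlier.

```python
SIMD_ENGINES = 10
TPS_PER_SIMD = 16
CORES_PER_TP = 5
TEX_UNITS_PER_SIMD = 4

total_cores = SIMD_ENGINES * TPS_PER_SIMD * CORES_PER_TP
total_tex_units = SIMD_ENGINES * TEX_UNITS_PER_SIMD
print(total_cores, total_tex_units)  # 800 40

# Assumptions (not on the slide): engine clock and flops per core per clock.
CLOCK_GHZ = 0.75
FLOPS_PER_CORE_PER_CLOCK = 2  # one multiply-add counted as two operations

peak_gflops = total_cores * FLOPS_PER_CORE_PER_CLOCK * CLOCK_GHZ
print(peak_gflops)  # 1200.0
```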


ATI Radeon™ HD 4800 Series GPUs - Hardware Layout

• Optimized for distributed memory layout and GDDR5

• Various Sections:

ALU – Red

TEX – Brown

MEM – Orange

RAM – Green

PCIE – Blue

Display - Yellow


ATI Radeon™ HD 4800 Series GPUs - ALU Units

• Same ALUs as ATI Radeon™ HD 3800 Series GPUs, just more

• Integer shifts on all streaming cores

• Improved double and integer performance

• 16KB on-chip local data share with write private-read anywhere memory model

• Global R/W registers per SIMD

• 32KB on-chip global data share


ATI Radeon™ HD 4800 Series GPUs – Memory Hardware – TEX Units


ATI Radeon™ HD 4800 Series GPU – Memory Hardware – Memory Controller


ATI Radeon™ HD 4800 Series GPU – Memory Hardware – Render Backends

• 4 Render backends

• 256 bit memory lines

• Write combining cache

• Global buffer via DB instead of SMX

• Scratch buffer bandwidth doubled

• Scatter bandwidth in line with color writes


Compute Shader – A New Paradigm

• A general approach to the compute paradigm

• Disconnect the output domain from the problem domain

• Gives more control to the shader writer

• Read anywhere, write anywhere

• The new terminology – threads and groups

• Data sharing – shared registers and local data share

• Linear memory format


Compute Shader – A New Paradigm (cont’d)

• Removes graphics-centric terminology and ideas

• An array of parallel processing elements

• Removes graphics pipeline from the picture (no ES, PS, GS, VS etc.)

• Inputs and outputs are disconnected from the output domain

• Domain is now specified by the number of threads to run in a 2D fashion.


Compute Shader - Terminology

• Thread – A single invocation of the kernel

• Group – A set number of threads that can share data and run together on a single SIMD. Multiple groups can run on a single SIMD if registers allow

• Shared Registers – Registers that are global to a SIMD

• Local Data Share – 16KB on-chip memory per SIMD shared between threads in a group

• Wavefront – Group of 64 threads run concurrently on a SIMD

• Fence – Synchronization mechanism for threads within a group

– _threads – Generic barrier that synchronizes all threads to a point

– _memory – Synchronize threads on global memory accesses

– _sr – Synchronize on Shared Register access

– _lds – Synchronize on local data share
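The role of a thread-level fence can be illustrated with a CPU analogy. The sketch below uses Python's `threading.Barrier` to stand in for a barrier across a group; it is an analogy only, not GPU code, and the group size of 8 and the names `shared`, `result`, and `kernel` are hypothetical. Each thread writes its own slot, waits at the barrier, and only then reads a neighbor's slot.

```python
import threading

GROUP_SIZE = 8                       # hypothetical, small for illustration
shared = [None] * GROUP_SIZE         # stands in for memory shared by the group
result = [None] * GROUP_SIZE
barrier = threading.Barrier(GROUP_SIZE)

def kernel(tid):
    shared[tid] = tid * 10           # each thread writes only its own slot
    barrier.wait()                   # no thread proceeds until all have written
    result[tid] = shared[(tid + 1) % GROUP_SIZE]  # now safe to read a neighbor

threads = [threading.Thread(target=kernel, args=(t,)) for t in range(GROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(result)  # [10, 20, 30, 40, 50, 60, 70, 0]
```

Without the barrier, a thread could read its neighbor's slot before the neighbor has written it, which is the hazard the fence exists to prevent.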


Data Sharing and Synchronization

• SR – Globally shared registers

– Sharing between all wavefronts in a SIMD

– Column sharing on the SIMD

– Persistent registers

– Atomicity guaranteed in same instruction

• LDS – Local Data Share

– Write local, read global system

– Share between all threads in a group

– Synchronization required

• New Indexing Values – No more vPos/vWinCoord

– vTid – ID of thread within a group

– vaTid – ID of thread within a domain

– vTgroupid – ID of group within a domain
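The three indexing values are related to each other. The Python sketch below shows one plausible relationship for a one-dimensional domain; the formula is an assumption based on the definitions above rather than something the slide states, and the function names are hypothetical.

```python
WAVEFRONT_SIZE = 64  # from the terminology slide

def absolute_thread_id(group_id, thread_id, threads_per_group):
    """Assumed 1D relation: vaTid = vTgroupid * threads_per_group + vTid."""
    return group_id * threads_per_group + thread_id

def wavefront_in_group(thread_id):
    """Which wavefront of its group a thread falls in, assuming
    consecutive thread IDs are packed into wavefronts of 64."""
    return thread_id // WAVEFRONT_SIZE

print(absolute_thread_id(group_id=2, thread_id=5, threads_per_group=64))  # 133
print(wavefront_in_group(130))  # 2
```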


Shared Registers

[Figure: SIMD 0 through SIMD N, each running Wavefronts 0–7 that all access that SIMD's block of Shared Registers]

• Data is shared between columns of a wavefront per SIMD

– Accesses in the same ALU clause are atomic; indexing is not allowed

– Shared registers are carved out of the register pool

– Same as accessing normal registers


IL SR Usage

il_cs_2_0

dcl_cb cb0[1]

dcl_shared_temp sr1

add sr0, sr0, r0.1111

mov g[vaTid0.x], sr0

ret

end

•Atomic Read-Modify-Write

•Uses:

Reductions

– Max

– Min

– Sum

– Average

Order Agnostic Data Updates

– Histogram

– Global Counters

– Semaphores
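The reduction and counter uses listed above can be modeled on the host. The following Python sketch simulates column sharing under stated assumptions: one shared register holds 64 lanes per SIMD, lane i of every wavefront on that SIMD accumulates into lane i, and each read-modify-write is atomic. The function name `run_on_simd` is hypothetical, and the sequential loop stands in for hardware atomicity.

```python
WAVEFRONT_SIZE = 64

def run_on_simd(wavefronts):
    """Simulate `add sr0, sr0, r0` executed by every wavefront on one SIMD.

    `wavefronts` is a list of 64-element lists holding each lane's r0 value.
    Returns the final per-lane contents of the shared register sr0.
    """
    sr0 = [0] * WAVEFRONT_SIZE
    for r0 in wavefronts:
        for lane in range(WAVEFRONT_SIZE):
            sr0[lane] += r0[lane]   # atomic read-modify-write on hardware
    return sr0

# Eight wavefronts each add 1 per lane, so every column counts to 8.
counts = run_on_simd([[1] * WAVEFRONT_SIZE for _ in range(8)])
print(counts[0], sum(counts))  # 8 512
```

Summing the 64 lanes afterwards gives the SIMD-wide total, so a full reduction would still need a final pass across columns and across SIMDs.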


Local Data Share

• 16KB of memory per SIMD, 4 banks of 1k Dwords, max 128 per thread.

• Write address is based on thread ID, and offsets are static

• Reads are done by thread ID + offset.

• Dispatches one write command every cycle

• Dispatches read over four cycles with waterfall

• 40-44 cycle latency that needs to be hidden by ALU
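The write-private, read-anywhere model described above can be sketched as a small host-side simulation. This Python model is illustrative only: the class name `LocalDataShare` and the addressing formula `tid * dwords_per_thread + offset` are assumptions derived from the bullets, not a description of the hardware's actual bank mapping.

```python
LDS_DWORDS = 4 * 1024  # 4 banks of 1k dwords, i.e. 16 KB at 4 bytes per dword

class LocalDataShare:
    """Toy model: each thread owns a fixed-size region it alone may write."""

    def __init__(self, dwords_per_thread):
        self.dwords_per_thread = dwords_per_thread
        self.mem = [0] * LDS_DWORDS

    def write(self, tid, offset, value):
        # Writes are confined to the calling thread's own region.
        if not 0 <= offset < self.dwords_per_thread:
            raise IndexError("offset outside this thread's region")
        self.mem[tid * self.dwords_per_thread + offset] = value

    def read(self, owner_tid, offset):
        # Reads may target any thread's region: thread ID plus offset.
        return self.mem[owner_tid * self.dwords_per_thread + offset]

lds = LocalDataShare(dwords_per_thread=4)
for tid in range(1024):
    lds.write(tid, 0, tid * 2)
print(lds.read(owner_tid=7, offset=0))  # 14
```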


LDS

[Figure: LDS memory for SIMD 0 through SIMD N. Each SIMD runs Group 1 and Group 2, each made up of Wavefronts 0–3. Each wavefront writes only to its own LDS region ("Write self only") and can read from any region ("Read Any").]


IL LDS Usage

il_cs_2_0

dcl_cb cb0[1]

dcl_num_thread_per_group 1024

dcl_lds_size_per_thread 4

dcl_lds_sharing_mode _wavefrontRel

dcl_literal l0, 0x0, 0x04, 0x8, 0x1

mov r0, cb0[0].xxxx

lds_write_vec mem, vTid0.x

iadd r0, r0, vTid0.x0xx

lds_read_vec_sharingMode(abs) r2, r0.x0

mov g[vaTid0.x], r2

ret

end
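The two declarations in this kernel can be checked against the 16KB LDS capacity. The Python sketch below does that arithmetic, assuming `dcl_lds_size_per_thread` is expressed in dwords; the slides do not state the unit, so treat that as an assumption. The helper names are hypothetical.

```python
LDS_BYTES = 16 * 1024   # per-SIMD LDS capacity from the slides
BYTES_PER_DWORD = 4

def lds_bytes_needed(threads_per_group, dwords_per_thread):
    """LDS footprint of one group, assuming the per-thread size is in dwords."""
    return threads_per_group * dwords_per_thread * BYTES_PER_DWORD

def fits_in_lds(threads_per_group, dwords_per_thread):
    return lds_bytes_needed(threads_per_group, dwords_per_thread) <= LDS_BYTES

# Values from the kernel above: 1024 threads per group, 4 per thread.
print(lds_bytes_needed(1024, 4), fits_in_lds(1024, 4))  # 16384 True
print(fits_in_lds(1024, 8))                             # False
```

Under that assumption the example kernel uses the whole LDS for a single group, which would leave no room for a second group on the same SIMD.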


Disclaimer & Attribution

DISCLAIMER

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.