ATI Stream Computing
ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview
Micah Villmow
May 30, 2008
| ATI Stream Computing Update | Confidential | ATI Stream Computing – ATI Radeon™ HD 3800/4800 Series GPU Hardware Overview
Outline
• ATI Radeon™ HD 3800 Series GPU – What changed.
• ATI Radeon™ HD 3400/3600 Series and X2 GPU variants
• ATI Radeon™ HD 4800 – A new architecture?
• Compute Shader – A new paradigm
ATI Radeon™ HD 3800 Series GPU – What Changed
• Double Precision
• Memory Controller Modifications
• Tex Modifications
• Linear Memory
• Global Buffer support
• Limited Render backends
ALU Hardware – Double Precision
• Combine thin pipes together to produce double
• Combines two F32 components for F64
    MSB is in y/w component
    LSB is in x/z component
• Two pipe instructions:
    DADD
    Double-Float Conversion Ops
    DLDEXP
    DFRAC
    Double Comparison Ops
• Four pipe instructions:
    DFREXP
    DMUL
    DMAD
IL:
    dmad r10.xy__, r0.xy, r5.xy, r10.xy
ISA:
    21 x: MULADD_64 T0.x, R5.y, R1.y, T0.y
       y: MULADD_64 T0.y, R5.y, R1.y, T0.y
       z: MULADD_64 ____, R5.y, R1.y, T0.y
       w: MULADD_64 ____, R5.x, R1.x, T0.x
       t: MULADD R4.y, R5.z, R3.z, T0.z
IL:
    dadd r10.xy__, r0.xy, r5.xy
    dadd r10.__zw, r0.zw, r5.zw
ISA:
    20 x: ADD_64 T3.x, R3.y, R1.y
       y: ADD_64 T3.y, R3.x, R1.x
       z: ADD_64 T3.z, R3.w, R1.w VEC_120
       w: ADD_64 T3.w, R3.z, R1.z
       t: MULADD T0.w, R4.y, R1.x, T0.w
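The F64 layout on this slide (two F32 components per double, most significant bits in y/w, least significant bits in x/z) can be illustrated on the host side. This is a minimal Python sketch, not AMD code: the helper names `split_double` and `join_double` are hypothetical, and it assumes the values are standard IEEE-754 doubles.

```python
import struct

def split_double(value):
    """Split an IEEE-754 double into (low_word, high_word) 32-bit integers.

    Assumption: low_word maps to the x (or z) component and high_word
    to the y (or w) component, following the slide.
    """
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    return bits & 0xFFFFFFFF, bits >> 32

def join_double(low_word, high_word):
    """Rebuild the double from its two 32-bit halves."""
    bits = (high_word << 32) | low_word
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

low, high = split_double(1.0)
print(hex(low), hex(high))
```

A host program would pack each double into a pair of 32-bit components this way before handing a buffer to a kernel that uses `dadd` or `dmad`.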
Memory Hardware – Memory Controller
• Die-shrink from 80nm to 55nm
• 512-bit ring bus, 256r/256w
• 72 GB/s bandwidth peak
• 32-bit memory channels
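As a rough check on the 72 GB/s figure, peak bandwidth is the external bus width in bytes times the effective transfer rate. The sketch below is illustrative only: the eight 32-bit channels (a 256-bit interface) and the 2.25 GT/s effective memory rate are assumptions about a Radeon HD 3870 class board and do not come from the slides.

```python
def peak_bandwidth_bytes_per_s(channels, channel_width_bits, transfers_per_s):
    """Peak memory bandwidth = total bus width in bytes * transfers per second."""
    bus_bytes = channels * channel_width_bits // 8
    return bus_bytes * transfers_per_s

# Assumed values (not from the slides): 8 channels of 32 bits, 2.25 GT/s.
peak = peak_bandwidth_bytes_per_s(8, 32, 2_250_000_000)
print(peak / 1e9, "GB/s")
```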
Memory Hardware – Texture Unit
• Four 32KB four-way associative L1 caches
• L1 cache size is 4x8KB per SIMD engine
• Data is split across all four 8KB L1 caches
• L1 cacheline is 128 bytes or 2 quads of data
• 256KB unified cache over all SIMDs
Memory Hardware – Linear Layout
Tiled Layout
[Figure: tiled surface with pitch, width, and height marked]
• Possible wasted space between width and pitch
• Euclidean coordinates for addressing
• Macro-micro tiling format is non-linear
• Outputs through color buffer backend
Linear Layout
[Figure: linear surface with pitch and height marked]
• Addressable space is pitch * height
• No wasted space in allocated texture
• Linear macro tiling format
• Outputs through SMX
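The difference between the two layouts comes down to address arithmetic. The following Python sketch is a simplified model with hypothetical function names and a 4-byte element size chosen for illustration; it does not reproduce the hardware's macro-micro tiling.

```python
def linear_offset(x, y, pitch, bytes_per_element=4):
    """Byte offset of element (x, y) in a linear surface: rows are `pitch` elements apart."""
    if not 0 <= x < pitch:
        raise ValueError("x must lie inside the pitch")
    return (y * pitch + x) * bytes_per_element

def linear_size(pitch, height, bytes_per_element=4):
    """Addressable space of a linear surface is pitch * height elements."""
    return pitch * height * bytes_per_element

def padded_elements(width, pitch, height):
    """Elements lost between width and pitch when only `width` columns are used."""
    return (pitch - width) * height

print(linear_offset(3, 2, 64), linear_size(64, 16), padded_elements(100, 128, 10))
```

With a linear layout the kernel can treat the whole pitch * height range as usable, which is why the slide lists no wasted space for it.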
Memory Hardware – RB Changes
[Figure: memory controller and DPP array arrangement, ATI Radeon™ HD 2900 Series GPU compared with ATI Radeon™ HD 3800 Series GPU]
ATI Radeon™ HD 4800 Series GPUs - Improvements
• 2.5x more floating point compute power than ATI Radeon™ HD 3800 Series GPUs
• Includes all the features added to ATI Radeon™ HD 3800 Series GPUs
• Higher bandwidths w/ GDDR5 memory
• 115GB/s memory bandwidth
• 1.2 Teraflops peak ALU performance
• New compute shader paradigm
• Inter- and Intra- thread sharing
ATI Radeon™ HD 4800 Series GPUs – Architecture Features
• ALU Improvements
    – 10 SIMD engines
    – 16 TPs per SIMD
    – 5 streaming cores per TP
    – 800 total streaming cores
    – Shared global registers
• TEX Improvements
    – 4 TEX units per SIMD
    – 40 total TEX units
    – Local data share
    – Global data share
• MEM Improvements
    – 8KB L1 cache per SIMD
    – 480 GB/s L1 BW
    – 4 32KB L2 caches
    – 384 GB/s L2->L1 BW
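The unit counts on this slide multiply out as shown below. The peak-rate line is a sketch under two assumptions that are not on the slides: a 750 MHz engine clock and one multiply-add (2 floating-point operations) per streaming core per clock.

```python
SIMD_ENGINES = 10
TPS_PER_SIMD = 16
CORES_PER_TP = 5
TEX_PER_SIMD = 4

def total_streaming_cores():
    return SIMD_ENGINES * TPS_PER_SIMD * CORES_PER_TP

def total_tex_units():
    return SIMD_ENGINES * TEX_PER_SIMD

def peak_flops(clock_hz=750_000_000, flops_per_core_per_clock=2):
    """Assumed clock and multiply-add throughput; both are illustrative defaults."""
    return total_streaming_cores() * flops_per_core_per_clock * clock_hz

print(total_streaming_cores(), total_tex_units(), peak_flops() / 1e12)
```

Under those assumptions the result matches the 1.2 Teraflops peak ALU figure quoted earlier in the deck.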
ATI Radeon™ HD 4800 Series GPUs - Hardware Layout
• Optimized for distributed memory layout and GDDR5
• Various Sections:
ALU – Red
TEX – Brown
MEM – Orange
RAM – Green
PCIE – Blue
Display - Yellow
ATI Radeon™ HD 4800 Series GPUs - ALU Units
• Same ALUs as ATI Radeon™ HD 3800 Series GPUs, just more of them
• Integer shifts on all streaming cores
• Improved double and integer performance
• 16KB on-chip local data share with a write-private, read-anywhere memory model
• Global R/W registers per SIMD
• 32KB on-chip global data share
ATI Radeon™ HD 4800 Series GPUs – Memory Hardware – TEX Units
ATI Radeon™ HD 4800 Series GPU – Memory Hardware – Memory Controller
ATI Radeon™ HD 4800 Series GPU – Memory Hardware – Render Backends
• 4 Render backends
• 256 bit memory lines
• Write combining cache
• Global buffer via DB instead of SMX
• Scratch buffer bandwidth doubled
• Scatter bandwidth in line with color writes
Compute Shader – A New Paradigm
• A general approach to the compute paradigm
• Disconnect the output domain from the problem domain
• Gives more control to the shader writer
• Read anywhere, write anywhere
• The new terminology – threads and groups
• Data sharing – shared registers and local data share
• Linear memory format
Compute Shader – A New Paradigm (cont’d)
• Removes graphics-centric terminology and ideas
• An array of parallel processing elements
• Removes graphics pipeline from the picture (no ES, PS, GS, VS etc.)
• Inputs and outputs are disconnected from the output domain
• Domain is now specified by the number of threads to run in a 2D fashion.
Compute Shader - Terminology
• Thread – A single invocation of the kernel
• Group – A set number of threads that can share data and run together on a single SIMD. Multiple groups can run on a single SIMD if registers allow
• Shared Registers – Registers that are global to a SIMD
• Local Data Share – 16KB on-chip memory per SIMD shared between threads in a group
• Wavefront – Group of 64 threads run concurrently on a SIMD
• Fence – Synchronization mechanism for threads within a group
    _threads – Generic barrier that synchronizes all threads to a point
    _memory – Synchronize threads on global memory accesses
    _sr – Synchronize on Shared Register access
    _lds – Synchronize on local data share
Data Sharing and Synchronization
• SR – Globally shared registers
– Sharing between all wavefronts in a SIMD
– Column sharing on the SIMD
– Persistent registers
– Atomicity guaranteed in same instruction
• LDS – Local Data Share
– Write local, read global system
– Share between all threads in a group
– Synchronization required
• New Indexing Values – No more vPos/vWinCoord
– vTid – ID of thread within a group
– vaTid – ID of thread within a domain
– vTgroupid – ID of group within a domain
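How the three indexing values relate can be sketched in a few lines of Python. The flattened one-dimensional relation used here (absolute ID = group ID * group size + ID within the group) is an assumption made for illustration, as are the function names; the slides only define the three values individually.

```python
WAVEFRONT_SIZE = 64  # threads per wavefront, from the terminology slide

def decompose(abs_tid, threads_per_group):
    """Split an absolute thread ID (vaTid) into (vTgroupid, vTid).

    Assumes a flattened 1D domain where groups are numbered consecutively.
    """
    group_id, tid = divmod(abs_tid, threads_per_group)
    return group_id, tid

def wavefronts_per_group(threads_per_group):
    """Number of 64-thread wavefronts needed to hold one group (rounded up)."""
    return -(-threads_per_group // WAVEFRONT_SIZE)

print(decompose(300, 256), wavefronts_per_group(256))
```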
Shared Registers
[Figure: on each SIMD (SIMD 0 through SIMD N), wavefronts 0 to 7 all access one block of shared registers]
Data is shared between columns of a wavefront per SIMD
- Accesses in the same ALU clause are atomic, indexing is not allowed
- Shared registers are carved out of the register pool
- Same as accessing normal registers
IL SR Usage
il_cs_2_0
dcl_cb cb0[1]
dcl_shared_temp sr1
add sr0, sr0, r0.1111
mov g[vaTid0.x], sr0
ret
end
• Atomic Read-Modify-Write
• Uses:
Reductions
– Max
– Min
– Sum
– Average
Order Agnostic Data Updates
– Histogram
– Global Counters
– Semaphores
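The counting pattern in the IL listing (each thread adds 1 to `sr0`) can be modelled on the host to see what a sum reduction through shared registers yields. This Python sketch is a simplified model, not a description of the hardware: it assumes one shared slot per wavefront column (lane) on a SIMD and treats each update as an indivisible read-modify-write.

```python
WAVEFRONT_SIZE = 64

def accumulate_shared_register(wavefronts):
    """Sum per-lane values from every wavefront into one shared slot per column.

    `wavefronts` is a list of lists, one value per lane of each wavefront.
    """
    shared = [0] * WAVEFRONT_SIZE  # modelled shared register, one slot per column
    for lanes in wavefronts:
        if len(lanes) != WAVEFRONT_SIZE:
            raise ValueError("each wavefront must supply 64 lane values")
        for column, value in enumerate(lanes):
            shared[column] += value  # stands in for the atomic read-modify-write
    return shared

# Eight wavefronts, every thread contributes 1, as in the listing.
counts = accumulate_shared_register([[1] * WAVEFRONT_SIZE for _ in range(8)])
print(counts[0], sum(counts))
```

In this model each column ends up holding the number of wavefronts that ran, and a final pass over the columns gives the total thread count; a max or min reduction would swap the `+=` for the corresponding comparison.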
Local Data Share
• 16KB of memory per SIMD, 4 banks of 1k Dwords, max 128 per thread.
• Write address is based on thread ID, and offsets are static
• Reads are done by thread ID + offset.
• Dispatches one write command every cycle
• Dispatches read over four cycles with waterfall
• 40-44 cycle latency that needs to be hidden by ALU
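The capacity figure and the write-private, read-anywhere rule can be captured in a small host-side model. This is a Python sketch with a hypothetical `LocalDataShare` class; it assumes each thread owns a contiguous region of equal size and does not model banks, waterfalling, or latency.

```python
LDS_BYTES = 16 * 1024
DWORD_BYTES = 4
LDS_DWORDS = LDS_BYTES // DWORD_BYTES  # 4096 dwords = 4 banks of 1K dwords

class LocalDataShare:
    """Toy model: each thread writes only its own region, any thread may read any region."""

    def __init__(self, threads_per_group, dwords_per_thread):
        if threads_per_group * dwords_per_thread > LDS_DWORDS:
            raise ValueError("group does not fit in 16KB of LDS")
        self.threads = threads_per_group
        self.dwords_per_thread = dwords_per_thread
        self.mem = [0] * (threads_per_group * dwords_per_thread)

    def _index(self, tid, offset):
        if not (0 <= tid < self.threads and 0 <= offset < self.dwords_per_thread):
            raise IndexError("thread ID or offset out of range")
        return tid * self.dwords_per_thread + offset

    def write(self, own_tid, offset, value):
        # The write address is derived from the writer's own thread ID.
        self.mem[self._index(own_tid, offset)] = value

    def read(self, owner_tid, offset):
        # Reads name the owning thread's ID plus an offset.
        return self.mem[self._index(owner_tid, offset)]

lds = LocalDataShare(threads_per_group=1024, dwords_per_thread=4)
lds.write(5, 2, 99)
print(lds.read(5, 2))
```

A group of 1024 threads with 4 dwords each fills the 16KB exactly; asking for 5 dwords per thread at that group size is rejected by the constructor.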
LDS
[Figure: on each SIMD (SIMD 0 through SIMD N), Group 1 and Group 2 each hold wavefronts 0 to 3, which map onto that SIMD's LDS memory; each wavefront writes only its own region ("Write self only") and can read any region ("Read Any")]
IL LDS Usage
il_cs_2_0
dcl_cb cb0[1]
dcl_num_thread_per_group 1024
dcl_lds_size_per_thread 4
dcl_lds_sharing_mode _wavefrontRel
dcl_literal l0, 0x0, 0x04, 0x8, 0x1
mov r0, cb0[0].xxxx
lds_write_vec mem, vTid0.x
iadd r0, r0, vTid0.x0xx
lds_read_vec_sharingMode(abs) r2, r0.x0
mov g[vaTid0.x], r2
ret
end
Disclaimer & Attribution
DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2010 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI Logo, Radeon, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.