GPU and PC System Architecture UC Santa Cruz BSoE – March 2009

John Tynefield / NVIDIA Corporation

My Goals

- Survey history and direction of GPU/PC system architecture
- Demonstrate the process of system level architectural problem solving
- Motivate some of you to become architects

Disclaimers

- I work for NVIDIA
- Public info
- All numbers and dates approximate
    - Rounding is our friend
    - No bus/processor is 100% efficient, etc.
- All examples are meant to be illustrative
    - Not comprehensive
    - “there were >40 gfx companies in 1995”

About Me

- I love games and graphics
- I love building things

Structure

- Intro to PC and GPU Architecture
- A Sampling of Architectures
    - 1996 - Voodoo Graphics / Pentium
    - 2000 - GeForce 256 / P3
    - 2004 - GeForce 6800 / P4
    - 2008 - GeForce GTX280 / Core2
- Ideas for the future of the platform

What do architects do?

- Impose structure on complex design problems
- Make tradeoffs
- Validate high risk design bets
- Structure verification

Why this is a great time to be an Architect

- Radical design mobility
    - I have contributed to 10 completely new processor designs
    - 7 of which shipped in millions of units
- Steep competition
    - Not for everybody
- Changing the World…no…really!
    - Heterogeneous many-core computing is here to stay and it has changed the nature of computing

Design Tension

- Fixed Function vs. Programmable
- Scalar vs. Vector
- Bandwidth vs. Latency
- In Order vs. Out of Order
- Limited vs. Unlimited (virtualized) resources

Technology Trends

- CPUs get faster
- GPUs get faster
- Interconnects get faster
- Memory gets faster
- Memory gets denser
- Latency increases
- Feature load increases
- Physics intrudes more and more

All at different rates

[Chart: relative growth by year (x-axis: 1996, 2000, 2004, 2008; y-axis: 0% to 30000%). Series: CPU Cores, CPU Interconnect BW, GPU Cores, GPU Interconnect BW, System Memory BW, GPU Memory BW]

The long time horizon

- The awesome ideas of now take 2+ years to reach market
    - Awesome depreciates rapidly
- Predictable
    - Silicon Process Roadmap
    - PC Arch Roadmap
    - 3rd Party Component Roadmap
    - Your capabilities and resources
- Unpredictable
    - Market Shifts (commodity prices, supply shocks)
    - 3rd Party Strategic Errors (OS/platform/partner slips)
    - Innovative Competition (N-way struggle for design initiative)

Ultra Simplified PC Anatomy

[Diagram: blocks labeled CPU, Core Logic, GPU, System Memory, GPU Memory]

Ultra Simplified GPU Anatomy

[Diagram: Host Logic, three Processor blocks, three DRAM MGMT blocks]

Ultra Simplified GPU Anatomy (2)

[Diagram: Host Logic, three Processor blocks, three DRAM MGMT blocks; pipeline stages: Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, Memory]

GPU Prehistory

- 1960s – 1970s
    - Single Purpose BIG IRON
    - E&S, GE, Lockheed, …
- 1980s – 1990s
    - General Purpose BIG IRON
    - Custom ASICs, Workstations
    - SGI, Sun, Intergraph, …
- 1994
    - Maybe we can fit this on a single consumer add-in card?

Enabling Technologies in 1994

- Fast consumer CPUs with floating point
    - Try 3D rendering in fixed point!
- PCI
- VGA and VESA
- Id Software’s DOOM
- Contract fabrication facilities offering 0.6 micron
- ASIC design tools

1996 3dfx - Voodoo Graphics

- PIO Programming Model
- Pure Pipelined Graphics
- Partial Triangle Setup – FP32
- Fixed Point Integer Texture Mapping and Gouraud Shading
- Z Buffer and Full OpenGL Blending
- All at 1 PPC (pixel per clock), all the time, with no caches

- 32-bit PCI – 0.09 GB/s
- 128-bit EDO 50 MHz DRAM – 0.8 GB/s
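The two bandwidth figures above can be sanity-checked with a width-times-clock calculation. The sketch below is illustrative only: the helper name `peak_bandwidth_gb_s` is hypothetical, and it assumes one transfer per clock for both PCI and EDO DRAM. Read this way, the 0.09 GB/s PCI figure looks like an effective rate sitting below the theoretical peak, in line with the earlier disclaimer that no bus is 100% efficient.

```python
def peak_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=1):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# 128-bit EDO DRAM at 50 MHz, as quoted on the slide.
edo_peak = peak_bandwidth_gb_s(128, 50)

# 32-bit PCI at 33 MHz: theoretical peak, before protocol overhead.
pci_peak = peak_bandwidth_gb_s(32, 33)

# The slide quotes 0.09 GB/s for PCI; the ratio is the implied efficiency.
pci_efficiency = 0.09 / pci_peak

print(f"EDO peak: {edo_peak:.2f} GB/s")
print(f"PCI peak: {pci_peak:.3f} GB/s, implied efficiency {pci_efficiency:.0%}")
```

The same helper with `transfers_per_clock=2` would model double data rate memory, which is how the later DDR and GDDR3 figures in this talk appear to be derived.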

Voodoo Graphics System Architecture

[Diagram: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, split between CPU and GPU; blocks: CPU, Core Logic, FBI, FB Memory, TMU, TEX Memory, System Memory]

Arch Decision – Triangle Setup

- Target: 3D triangle with texture and Gouraud shading
    - 3 * (XYW RGBA ST) = 72 bytes/triangle pre setup
    - 32-bit PCI 33 MHz – 90 MB/s
    - 1.25M triangles/second speed of light (1M is magic)
- Observe that post setup
    - 3 * XY, W RGBA ST start values + screen space derivatives + area
    - 76 bytes/triangle – 1.18M tris (still magic)
- Setup can be coded on Pentium in ~100 clocks
    - 1M triangles on P100 (mktg happy)
- Data-limited setup on chip: >10% die cost
- Typical game scenes <<1000 triangles/frame
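The arithmetic behind this decision can be reproduced in a few lines. This is a sketch under stated assumptions: the per-vertex layout (XYW as three 32-bit floats, RGBA packed into 4 bytes, ST as two 32-bit floats) is one layout that happens to match the slide's 72 bytes, not something the slide spells out, and all names are hypothetical. The 76-byte post-setup size, the 90 MB/s PCI rate, and the ~100-clock setup cost are taken from the slide.

```python
PCI_BYTES_PER_SECOND = 90e6    # effective PCI rate quoted on the slide
PENTIUM_CLOCK_HZ = 100e6       # P100
SETUP_CLOCKS_PER_TRIANGLE = 100

# Assumed vertex layout: XYW floats (12 bytes) + packed RGBA (4) + ST floats (8).
BYTES_PER_VERTEX = 3 * 4 + 4 + 2 * 4
PRE_SETUP_BYTES = 3 * BYTES_PER_VERTEX
POST_SETUP_BYTES = 76          # quoted on the slide, not derived here


def triangles_per_second(bytes_per_second, bytes_per_triangle):
    """Bus-limited triangle rate: how many triangles fit through the bus."""
    return bytes_per_second / bytes_per_triangle


pre_setup_rate = triangles_per_second(PCI_BYTES_PER_SECOND, PRE_SETUP_BYTES)
post_setup_rate = triangles_per_second(PCI_BYTES_PER_SECOND, POST_SETUP_BYTES)
cpu_setup_rate = PENTIUM_CLOCK_HZ / SETUP_CLOCKS_PER_TRIANGLE

print(f"pre-setup over PCI:  {pre_setup_rate / 1e6:.2f} M triangles/s")
print(f"post-setup over PCI: {post_setup_rate / 1e6:.2f} M triangles/s")
print(f"CPU setup on P100:   {cpu_setup_rate / 1e6:.2f} M triangles/s")
```

All three limits land at or above the 1M triangles/second target, which is the point of the slide: doing setup on the host costs little bus bandwidth and saves die area.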

2000 Nvidia GeForce 256

- Decoupled input queuing
- Hardware Transform & Lighting
    - FP32 FF Transform
    - FP22 FF Lighting
- Complex fixed function pixel shading
- 4 Pipelines

- AGP4X – 1.06 GB/s
- 256 Bit DDR 300 MHz Memory – 19.2 GB/s

GeForce 256 System Architecture

[Diagram: pipeline stages Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, split between CPU and GPU; blocks: CPU, Core Logic, GPU, GPU Memory, System Memory]

Architecture Detail – Combiners

- Logical fixed function extension of OpenGL Machine
- Surface Color = Diffuse * Texture + Specular

[Diagram: inputs Diffuse Color, Texture, Specular]
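As a rough illustration of the combiner equation, the sketch below evaluates Surface Color = Diffuse * Texture + Specular per channel. The function name `combine` and the sample colors are hypothetical, and clamping the result to 1.0 is an assumption about how a fixed function pipeline would saturate, not something stated on the slide.

```python
def combine(diffuse, texture, specular):
    """Per-channel Diffuse * Texture + Specular, clamped to 1.0 (assumed)."""
    return tuple(
        min(1.0, d * t + s)
        for d, t, s in zip(diffuse, texture, specular)
    )


# Hypothetical RGB inputs in the range [0, 1].
diffuse = (0.5, 0.5, 0.5)
texture = (1.0, 0.5, 0.0)
specular = (0.25, 0.25, 0.25)

surface_color = combine(diffuse, texture, specular)
print(surface_color)
```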

Multi Texture

- If one texture is good, more are better
- Diff * (Tex1 + Tex2) + Spec or Diff * Tex1 * Tex2 or …

[Diagrams: two combiner configurations with inputs Diffuse Color, Texture, Texture2, Specular and the constants 0.0 and 1.0]

Combiners

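- Cascading Mux / SOP / Mux / SOP pipeline (SOP = sum of products)
- Very flexible, harder to program with deeper nesting
- Everything is full speed!

[Diagram: A MUX and B MUX produce the AB partial; C MUX and D MUX produce the CD partial; the results are inputs for the next stage of the pipeline. Labeled inputs: Texture, Fog, Light]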
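One way to picture the cascade is to model each stage as four input selectors (the muxes) feeding a sum of products, AB + CD, with the result available to the next stage. The sketch below is a simplified scalar model, not the actual hardware interface: the register names (`diffuse`, `tex1`, `tex2`, `specular`, `zero`, `one`, `prev`) and both functions are hypothetical. It shows that the two multi-texture expressions from the earlier slide fit in two stages each.

```python
def combiner_stage(regs, a, b, c, d):
    """One stage: four muxes pick inputs by name, then compute A*B + C*D."""
    ab_partial = regs[a] * regs[b]
    cd_partial = regs[c] * regs[d]
    return ab_partial + cd_partial


def run_pipeline(inputs, stages):
    """Run stages in order; each result is visible to later stages as 'prev'."""
    regs = dict(inputs, zero=0.0, one=1.0, prev=0.0)
    for a, b, c, d in stages:
        regs["prev"] = combiner_stage(regs, a, b, c, d)
    return regs["prev"]


# Hypothetical single-channel inputs.
inputs = {"diffuse": 0.5, "tex1": 0.5, "tex2": 0.25, "specular": 0.125}

# Diff * (Tex1 + Tex2) + Spec, expanded as Diff*Tex1 + Diff*Tex2, then + Spec.
additive = run_pipeline(inputs, [
    ("diffuse", "tex1", "diffuse", "tex2"),
    ("prev", "one", "specular", "one"),
])

# Diff * Tex1 * Tex2, using the constant 0.0 to disable the CD product.
modulate = run_pipeline(inputs, [
    ("diffuse", "tex1", "zero", "zero"),
    ("prev", "tex2", "zero", "zero"),
])

print(additive, modulate)
```

The selector tuples are effectively the "program" here, which hints at why deeper nesting gets harder to write by hand.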

Programmable Shading

But the future was obviously RenderMan-like shaders

normal surfaceN;
color C = { 1.0, 0.5, 0.0 };
normal lightDirection;

Ci = C * dot ( surfaceN, lightDirection );
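For readers who want to run the idea, here is a minimal Python rendering of the shader fragment above. It is a sketch, not RenderMan code: the function names and the sample vectors are hypothetical, both vectors are assumed to be unit length, and, like the slide, it does not clamp a negative dot product.

```python
def dot(u, v):
    """Dot product of two 3-vectors."""
    return sum(a * b for a, b in zip(u, v))


def shade(surface_n, light_direction, color=(1.0, 0.5, 0.0)):
    """Ci = C * dot(surfaceN, lightDirection), applied per channel."""
    intensity = dot(surface_n, light_direction)
    return tuple(channel * intensity for channel in color)


# Light aligned with the normal: full surface color.
facing = shade((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))

# Light tilted away from the normal: dimmer surface color.
tilted = shade((0.0, 0.0, 1.0), (0.0, 0.6, 0.8))

print(facing, tilted)
```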

2004 Nvidia GeForce 6800

- Fully general Vertex and Pixel ISA
- 6 Geometry Processors
- 16 Pixel Processors
    - Deep recirculating pipelines to hide latency
- FP32 datapath end to end

- AGP8X – 2.11 GB/s
- 256 Bit 700 MHz GDDR3 – 44 GB/s

GeForce 6800 System Architecture

[Diagram: pipeline stages Physics and AI, Scene Mgmt, Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, split between CPU and GPU; blocks: CPU, Core Logic, GPU, GPU Memory, System Memory]

Architecture Decision – Tex/Shader Structure

- Problem: Build a general programmable pipeline
- Optimize for common workloads
    - TEX – BLEND – FOG
    - Common Game Shaders (e.g. Doom 3)

Plan A – Uncoupled

- Elegant
- Small fundamental unit
- Many “passes” for common shaders

[Diagram: Registers, Texture, Math; labels: TBF, TEX MTH TEX BLND BLND]

Plan B – Coupled

- Less Elegant
- Larger Fundamental Unit
- Single pass for common shaders
- Good scaling for longer shaders
- Big perf / area win given workloads
- Not forward looking

[Diagram: Registers, Math, Texture, Math]

2008 - GeForce GTX280

- Fully unified programmable architecture
- 240 instances of the same processor
- IEEE FP32 and FP64

- Gen2 PCIE – 8 GB/s
- 512 bit 1100 MHz GDDR3 – 144 GB/s
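With all four generations on the table, the "all at different rates" point from the Technology Trends slide can be checked against the talk's own figures. The sketch below only uses the interconnect and GPU memory bandwidths quoted on the slides (which the talk itself calls approximate); the names and the choice of 1996 as the 0% baseline are assumptions made for illustration.

```python
# Bandwidths in GB/s as quoted on the slides for each sampled generation.
GENERATIONS = {
    1996: {"interconnect": 0.09, "gpu_memory": 0.8},    # PCI, EDO DRAM
    2000: {"interconnect": 1.06, "gpu_memory": 19.2},   # AGP4X, DDR
    2004: {"interconnect": 2.11, "gpu_memory": 44.0},   # AGP8X, GDDR3
    2008: {"interconnect": 8.0, "gpu_memory": 144.0},   # PCIE Gen2, GDDR3
}


def increase_percent(value, baseline):
    """Percentage increase over the baseline (baseline itself maps to 0%)."""
    return (value / baseline - 1.0) * 100.0


def growth_table(metric, base_year=1996):
    """Percentage increase of one metric for every year, relative to base_year."""
    baseline = GENERATIONS[base_year][metric]
    return {
        year: increase_percent(figures[metric], baseline)
        for year, figures in GENERATIONS.items()
    }


interconnect_growth = growth_table("interconnect")
memory_growth = growth_table("gpu_memory")

for year in GENERATIONS:
    print(f"{year}: interconnect +{interconnect_growth[year]:.0f}%, "
          f"GPU memory +{memory_growth[year]:.0f}%")
```

On these numbers GPU memory bandwidth grows roughly twice as fast as the CPU-to-GPU interconnect over the period, which is one reason to keep data resident on the GPU side.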

GeForce GTX280 System Architecture

[Diagram: pipeline stages Physics and AI, Scene Mgmt, Geom Gather, Geom Proc, Triangle Proc, Pixel Proc, Z / Blend, split between CPU and GPU; blocks: CPU, Core Logic, GPU, GPU Memory, System Memory]

Architecture Decision – Heterogeneous Computing Support

- Build a bigger chip
- Radically improve ability of GPU to share work with the CPU

[Diagram: Thread with Register File and Local Memory; Block with Shared Memory; Grid 0, Grid 1, . . . (sequential grids in time) with Global Memory]

Computing Support

- Add Efficient Thread Launching
- Add General Load / Store Instructions and Datapath
- Add Shared Memory
- Add computational loads to performance design requirements
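To make the roles of thread launch, general load/store, and shared memory concrete, the sketch below emulates a grid of thread blocks in plain Python. It is a sequential toy model, not GPU code: the function `launch_block_sum` and its parameters are hypothetical, threads run one after another, and the barrier between the load phase and the reduction is implied by loop order rather than by real synchronization.

```python
def launch_block_sum(global_in, block_dim):
    """Emulate a grid of blocks; each block sums its slice via shared memory."""
    grid_dim = (len(global_in) + block_dim - 1) // block_dim  # blocks needed
    global_out = [0.0] * grid_dim

    for block_idx in range(grid_dim):
        shared = [0.0] * block_dim  # per-block shared memory

        # Phase 1: every thread loads one element from global memory.
        for thread_idx in range(block_dim):
            global_idx = block_idx * block_dim + thread_idx
            if global_idx < len(global_in):  # guard the ragged last block
                shared[thread_idx] = global_in[global_idx]

        # (A real kernel would synchronize the block's threads here.)

        # Phase 2: one thread reduces shared memory and stores the result.
        global_out[block_idx] = sum(shared)

    return global_out


data = [float(i) for i in range(10)]
partial_sums = launch_block_sum(data, block_dim=4)
print(partial_sums)
```

The index computation `block_idx * block_dim + thread_idx` is the standard way a thread finds its element in the grid, and the per-block `shared` list stands in for the on-chip memory that lets threads in a block cooperate without a round trip to global memory.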

Future Graphics Directions

- Higher density
- Higher refresh
- Higher dynamic range
- Ubiquity
    - Lower Power
- Shaving off the last burrs
    - Global Illumination
    - Higher quality modeling
    - Virtualized resources at interactive rates

Future PC Architecture Directions

- Highly Integrated – Low Cost
    - Require a minimum visual feature set
    - Web/video/run today’s apps
- And everyone else
    - Differentiated PCs
    - More bandwidth and more parallel horsepower
- More mature unified programming models
    - C on CUDA
    - DX11
    - OpenCL
- More resource virtualization

Q & A