Steve Scott, Evolution of GPU Computing -...
Transcript of Steve Scott, Evolution of GPU Computing -...
![Page 1: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/1.jpg)
The Evolution of GPU Computing
Steve Scott CTO Tesla, NVIDIA
SC12, Salt Lake City
![Page 2: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/2.jpg)
Frame
Buffer
Memory
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign
Early 3D Graphics Pipeline
Application
Scene Management
Geometry
Rasterization
Pixel Processing
ROP/FBI/Display
Host
GPU
![Page 3: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/3.jpg)
All images © their respective owners
Moving Toward Programmability
1998 1999 2000 2001 2002 2003 2004
DirectX 5
Riva 128
DirectX 6
Multitexturing
Riva TNT
DirectX 8
SM 1.x
GeForce 3
Cg
DirectX 7
T&L TextureStageStage
GeForce 256
DirectX 9
SM 2.0
GeForceFX
DirectX 9.0c
SM 3.0
GeForce 6
Quake 3 Giants Halo Far Cry UE3 Half-Life
![Page 4: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/4.jpg)
Copyright © NVIDIA Corporation 2006 Unreal © Epic
Per-Vertex Lighting No Lighting Per-Pixel Lighting
![Page 5: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/5.jpg)
Programmable Shaders: GeForceFX (2002)
Vertex and fragment operations specified in small (macro) assembly
language (separate processors)
User-specified mapping of input data to operations
Limited ability to use intermediate computed values to index input
data (textures and vertex uniforms)
Input 2Input 1Input 0
OP
Temp 2Temp 1Temp 0
ADDR R0.xyz, eyePosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R0.xyzx, R0.xyzx;
RSQR R0.w, R0.w;
MULR R0.xyz, R0.w, R0.xyzx;
ADDR R1.xyz, lightPosition.xyzx, -f[TEX0].xyzx;
DP3R R0.w, R1.xyzx, R1.xyzx;
RSQR R0.w, R0.w;
MADR R0.xyz, R0.w, R1.xyzx, R0.xyzx;
MULR R1.xyz, R0.w, R1.xyzx;
DP3R R0.w, R1.xyzx, f[TEX1].xyzx;
MAXR R0.w, R0.w, {0}.x;
![Page 6: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/6.jpg)
The Pioneers: Early GPGPU (2002)
Early Raytracing
www.gpgpu.org
Ray Tracing on Programmable Graphics Hardware,
Purcell et al.
PDEs in Graphics Hardware, Strzodka, Rumpf
Fast Matrix Multiplies using Graphics Hardware,
Larsen, McAllister
Using Modern Graphics Architectures for General-
Purpose Computing: A Frameworks and Analysis,
Thompson et al.
This was not easy…
![Page 7: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/7.jpg)
Challenges with Early GPGPU Programming
HW challenges
• Limited addressing modes
• Limited communication: inter-pixel,
scatter
• Lack of integer & bit ops
• No branching
SW challenges
• Graphics API (DirectX, OpenGL)
• Very limited GPU computing ecosystem
• Distinct vertex and fragment procs
Input Registers
Fragment Program
Output Registers
Constants
Texture
Temp Registers
Software (DirectX, Open GL) and hardware slowly became more general purpose...
![Page 8: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/8.jpg)
L2
FB
SP SP
L1
TF
Th
read
Pro
cesso
r
Vtx Thread Issue
Setup / Rstr / ZCull
Geom Thread Issue Pixel Thread Issue
Input Assembler
Host
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
L2
FB
L2
FB
L2
FB
L2
FB
L2
FB
GeForce 8800 (G80) Unified Compute Architecture • Unified processor types
• Unified access to mem structures
• DirectX 10 & SM 4.0
![Page 9: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/9.jpg)
CUDA C Programming
void saxpy_serial(int n, float a, float *x, float *y)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);
__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
int i = blockIdx.x*blockDim.x + threadIdx.x;
if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Serial C Code
Parallel C Code
![Page 10: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/10.jpg)
0
10
20
30
40
50
60
June 2007 Nov 2007 June 2008 Nov 2008 June 2009 Nov 2009 June 2010 Nov 2010 June 2011 Nov 2011 June 2012
GPU Computing Comes of Age
2006 – G80 • Unified Graphics/Compute
Architecture
• IEEE math
• CUDA introduced at SC’06
Fermi • ECC
• High DP FP perf
• C++ support Tesla
• Double
precision FP
Kepler • Large perf/W
• Programmability
enhancements # GPU-Accelerated Systems in Top500
![Page 11: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/11.jpg)
KEPLER
SMX
Hyper-Q
Dynamic Parallelism
(programmability and
application coverage)
(3x power efficiency)
![Page 12: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/12.jpg)
Hyper-Q Easier Speedup of Legacy MPI Apps
FERMI 1 Work Queue
KEPLER 32 Concurrent Work Queues
0x
5x
10x
15x
20x
0 5 10 15 20
Sp
ee
du
p v
s.
Du
al C
PU
Number of GPUs
CP2K- Quantum Chemistry
K20 with Hyper-Q K20 without Hyper-Q
2.5x
Strong Scaling: 864 water molecules
16 MPI ranks
per node
1 MPI rank
per node
![Page 13: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/13.jpg)
CPU Fermi GPU CPU Kepler GPU
Dynamic Parallelism Simpler Code, More General, Higher Performance
Code Size Cut by 2x
2x Performance
![Page 14: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/14.jpg)
Titan: World’s #1 Supercomputer
18,688 Tesla K20X GPUs
World leading efficiency
27 Petaflops
![Page 15: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/15.jpg)
1.00 1.71
2.73
1.00 3.46
7.17
1.00 5.41
8.85
1.00 8.00
10.20
1.00 7.00
16.70
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
Dual-CPUSingle-CPU+M2090
Single-CPU+K20X
Dual-CPUSingle-CPU+M2090
Single-CPU+K20X
Dual-CPUSingle-CPU+M2090
Single-CPU+K20X
Dual-CPUSingle-CPU+M2090
Single-CPU+K20X
Dual-CPUSingle-CPU+M2090
Single-CPU+K20XWS-LSMS
Chroma
SPECFEM3D
AMBER
NAMD
Kepler GPU Performance Results
Dual-socket comparison: CPU/GPU node vs. Dual-CPU node CPU = 8 core SandyBridge E5-2687w 3.10 GHz
![Page 16: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/16.jpg)
The Future
![Page 17: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/17.jpg)
The Future of HPC Programming
Computers are not getting faster… just wider
Need to structure all HPC apps as throughput problems
Locality within nodes much more important
Need to expose locality (programming model)
& explicitly manage memory hierarchy
(compiler, runtime, autotuner)
How can we enable programmers to code
for future processors in a portable way?
![Page 18: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/18.jpg)
OpenACC Directives Portable Parallelism
main() {
double pi = 0.0; long i;
#pragma omp parallel for reduction(+:pi)
for (i=0; i<N; i++)
{
double t = (double)((i+0.05)/N);
pi += 4.0/(1.0+t*t);
}
printf(“pi = %f\n”, pi/N);
}
CPU
OpenMP
main() {
double pi = 0.0; long i;
#pragma acc parallel loop reduction(+:pi)
for (i=0; i<N; i++)
{
double t = (double)((i+0.05)/N);
pi += 4.0/(1.0+t*t);
}
printf(“pi = %f\n”, pi/N);
}
CPU GPU
OpenACC Directives
![Page 19: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/19.jpg)
Integration
Further concentration on locality (both HW and SW)
Reducing overheads
Continued convergence with consumer technology
How Are GPUs Likely to Evolve Over This Decade?
![Page 20: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/20.jpg)
Echelon NVIDIA’s Extreme-Scale Computing Project
DARPA UHPC Program
Targeting 2018
Fast Forward Program
![Page 21: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/21.jpg)
Echelon Compute Node & System
System Interconnect
NoC
C0 C7
SM0 SM255
LO
C 0
LO
C 7
L20
256KB
L21023
256KB MC NIC
DRAM
Stacks
DRAM
DIMMs
NV
RAM
Node 0: 16 TF, 2 TB/s, 512+ GB
Cabinet 0: 4 PF, 128 TB Cabinet N-1
Echelon System (up to 1 EF)
Key architectural features:
2018 Vision: Echelon Compute Node & System
• Malleable memory hierarchy
• Hierarchical register files
• Hierarchical thread scheduling
• Place coherency/consistency
• Temporal SIMT & scalarization
• PGAS memory
• HW accelerated queues
• Active messages
• AMOs everywhere
• Collective engines
• Streamlined LOC/TOC
interaction
Node 255
![Page 22: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/22.jpg)
Summary
GPU accelerated computing has come a long way in a very short time
Has become much easier to program and more general purpose
Aligned with technology trends, supported by consumer markets
Future evolution is about:
• Integration
• Increased generality – efficient on any code with high parallelism
• Reducing energy, reducing overheads
This is simply how computers will be built
![Page 23: Steve Scott, Evolution of GPU Computing - Nvidiadeveloper.download.nvidia.com/GTC/PDF/GTC2012/... · 2012. 11. 21. · FB SP SP L1 TF r Vtx Thread Issue Setup / Rstr / ZCull Geom](https://reader035.fdocuments.net/reader035/viewer/2022062604/5fbf95ade3993e2e761d712f/html5/thumbnails/23.jpg)
Thank You