GPU Technology Conference 2014 Keynote
Uploaded by: nvidia
Transcript of GPU Technology Conference 2014 Keynote
[Chart: TeraFLOPS, GPU vs. CPU, 2003 to 2013; y-axis 0 to 5]
GTC — GROWING AND EXPANDING
2010: 397
2012: 429
2014: 729
FASTEST GROWING TOPICS
Big Data Analytics
Machine Learning
Computer Vision
FASTEST GROWING TOPICS
Energy Exploration
Life Science & Genomics
Molecular Dynamics
#1 TOPIC
HPC / Supercomputing
2012 2013 2014
FOSTERING THE GPU ECOSYSTEM Big Data / Cloud / Computer Vision
AudioStreamTV
CUDA EVERYWHERE
Takayuki Aoki, Global Scientific Information and Computing Center, Tokyo Institute of Technology
“Large-scale CFD Applications and a Full GPU Implementation of a Weather Prediction Code on the TSUBAME Supercomputer”
BANDWIDTH BOTTLENECKS
CPU to GPU over PCI Express (PCIe): 16GB/sec
CPU Memory: 60GB/sec
GPU Memory: 288GB/sec
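To make the gap between these three figures concrete, here is a minimal Python sketch that computes ideal transfer times at each quoted bandwidth. The function names and the 12 GB working-set size are hypothetical, chosen only for illustration; real transfers would be slower than these ideal figures because of protocol overhead and latency.

```python
# Bandwidths quoted on the slide, in GB/sec.
BANDWIDTH_GB_PER_SEC = {
    "PCI Express": 16.0,
    "CPU Memory": 60.0,
    "GPU Memory": 288.0,
}


def transfer_seconds(size_gb, path):
    """Ideal time to move size_gb gigabytes over the named path."""
    return size_gb / BANDWIDTH_GB_PER_SEC[path]


def ratio_to_gpu_memory(path):
    """How many times faster GPU memory is than the named path."""
    return BANDWIDTH_GB_PER_SEC["GPU Memory"] / BANDWIDTH_GB_PER_SEC[path]


if __name__ == "__main__":
    working_set_gb = 12.0  # hypothetical data size, chosen for illustration
    for path in BANDWIDTH_GB_PER_SEC:
        print(f"{path}: {transfer_seconds(working_set_gb, path):.3f} s, "
              f"GPU memory is {ratio_to_gpu_memory(path):.1f}x faster")
```

On these numbers, GPU memory is 18 times faster than the PCIe path, which is the bottleneck the slide title refers to.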
INTRODUCING NVLINK CPU GPU
PCIe
Differential with embedded clock
PCIe programming model (w/ DMA+)
Unified Memory
Cache coherency in Gen 2.0
5 to 12X PCIe
5X More Bandwidth for Multi-GPU Scaling
GPU
PCIe SWITCH
CPU GPU GPU GPU
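As a rough illustration of the "5 to 12X PCIe" claim, the sketch below turns the multipliers into absolute bandwidth. It assumes the 16GB/sec PCIe figure from the bandwidth slide as the baseline; the slide itself does not state NVLink bandwidth in GB/sec, so the result is an inference, not a quoted specification.

```python
# Assumption: baseline is the 16 GB/sec PCIe figure from the bandwidth slide.
PCIE_GB_PER_SEC = 16.0

# Multipliers quoted on the NVLink slide ("5 to 12X PCIe").
NVLINK_MULTIPLIERS = (5, 12)


def nvlink_range_gb_per_sec(baseline=PCIE_GB_PER_SEC):
    """Return the (low, high) bandwidth implied by the quoted multipliers."""
    low, high = NVLINK_MULTIPLIERS
    return baseline * low, baseline * high


if __name__ == "__main__":
    low, high = nvlink_range_gb_per_sec()
    print(f"Implied NVLink bandwidth: {low:.0f} to {high:.0f} GB/sec")
```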
3D MEMORY 3D Chip-on-Wafer integration
Many X bandwidth
2.5X capacity
4X energy efficiency
[Chart: Memory Bandwidth, 2008 to 2016; y-axis 0 to 1200]
Blaise Pascal 1623-1662
Mechanical Calculator
Probability Theory
Pascal’s Theorem
Pascal’s Law
PASCAL
NVLink
3D Memory
Module
5 to 12X PCIe 3.0
2 to 4X memory BW & size
1/3 size of PCIe card
GPU ROADMAP
[Chart: SGEMM / W Normalized, 2008 to 2016; y-axis 0 to 20]
Tesla: CUDA
Fermi: FP64
Kepler: Dynamic Parallelism
Maxwell: DX12
Pascal: Unified Memory, 3D Memory, NVLink
MACHINE LEARNING
Branch of Artificial Intelligence
Computers that learn from data
Example image labels: person, car, helmet, motorcycle, bird, frog, person, dog, chair, person, hammer, flower pot, power drill
Machine Learning using Deep Neural Networks
Input Result
Building High-level Features Using Large Scale Unsupervised Learning
Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, A. Ng
Stanford / Google
1 billion connections
10 million 200x200 pixel images
1,000 machines (16,000 cores)
3 days
1,000 CPU Servers 2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
GOOGLE BRAIN
Today’s Largest Networks
1B connections • 10M images • ~3 days • ~30 ExaFLOPS
Human Brain
~100B neurons x 1000 connections • 500M images • 5,000,000X “Google Brain” • ~150 YottaFLOPS • ~40,000 “Google Brain-Years”
SOURCE: Ian Goodfellow
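The human-brain figures follow from scaling the "Google Brain" run by connection count and image count. The sketch below reproduces that arithmetic. It assumes the slide's "ExaFLOPS" and "YottaFLOPS" denote total floating-point operations for a training run rather than a rate, and it assumes a 365-day year; all names are hypothetical.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_YEAR = 365  # assumption: calendar year, no leap days

# "Google Brain" figures from the slide.
GB_CONNECTIONS = 1e9
GB_IMAGES = 10e6
GB_DAYS = 3
GB_TOTAL_FLOP = 30e18  # assumption: "~30 ExaFLOPS" read as total operations

# Human brain figures from the slide.
HUMAN_CONNECTIONS = 100e9 * 1000  # ~100B neurons x 1000 connections
HUMAN_IMAGES = 500e6


def scale_factor():
    """Connection ratio times image ratio (the slide's 5,000,000X)."""
    return (HUMAN_CONNECTIONS / GB_CONNECTIONS) * (HUMAN_IMAGES / GB_IMAGES)


def human_total_flop():
    """Total operations implied for the human-brain-scale run."""
    return GB_TOTAL_FLOP * scale_factor()


def google_brain_years():
    """Years the 3-day run would take if stretched by the scale factor."""
    return GB_DAYS * scale_factor() / DAYS_PER_YEAR


def sustained_flops():
    """Average rate of the original run, in operations per second."""
    return GB_TOTAL_FLOP / (GB_DAYS * SECONDS_PER_DAY)


if __name__ == "__main__":
    print(f"scale factor: {scale_factor():,.0f}X")
    print(f"total work:   {human_total_flop() / 1e24:.0f} x 10^24 operations")
    print(f"duration:     {google_brain_years():,.0f} Google Brain-years")
    print(f"average rate: {sustained_flops() / 1e12:.0f} TFLOPS sustained")
```

The connection ratio (100,000X) times the image ratio (50X) gives the 5,000,000X on the slide, and 30 x 10^18 operations times that factor gives 150 x 10^24, matching "~150 YottaFLOPS".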
Deep Learning with COTS HPC Systems
A. Coates, B. Huval, T. Wang, D. Wu, A. Ng, B. Catanzaro
Stanford / NVIDIA • ICML 2013
STANFORD AI LAB
3 GPU-Accelerated Servers 12 GPUs • 18,432 cores
4 kWatts
$33,000
“Now You Can Build Google’s $1M Artificial Brain on the Cheap”
-Wired
1,000 CPU Servers 2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
GOOGLE BRAIN
DEMO: MACHINE LEARNING, SIMPLE TRAINING SET
Image training set: 1.2M
Classes: 1000
Weeks of training: 2
GPUs: 7
EXAFLOPS total to train: 25
DEMO: MACHINE LEARNING, NYU OVERFEAT
CUDA for MACHINE LEARNING
Talks @ GTC
Image Detection
Face Recognition
Gesture Recognition
Video Search & Analytics
Speech Recognition & Translation
Recommendation Engines
Indexing & Search
Use Cases Early Adopters
Image Analytics for Creative Cloud
Image Classification
Speech/Image Recognition
Recommendation
Hadoop
Search Rankings
Big Data & Infinite Compute Turbocharge Deep Learning
SOURCE: KPCB/Mary Meeker, company data. Unstructured data: IDC's Digital Universe Study.
800M photos uploaded per day 100 hours of video uploaded per minute Unstructured data exploding
[Chart: photos uploaded per day, in millions, 2007 to 2014; series include Snapchat, Flickr]
[Chart: hours of video uploaded per minute (YouTube), 2007 to 2013]
[Chart: exabytes of data; 2010: 1,104; 2015: 5,379]
DEMO: TITAN Z REVEAL
5,760 CUDA cores
12GB memory
8 TeraFLOPS
$2999
STANFORD AI LAB
1 Titan Z-Accelerated Server 3 Titan Zs • 17,280 cores
2 kWatts
$12,000
1,000 CPU Servers 2,000 CPUs • 16,000 cores
600 kWatts
$5,000,000
GOOGLE BRAIN
300X energy efficiency
400X lower cost
Fits next to a desk
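The "300X" and "400X" figures can be checked directly from the power and cost numbers on the comparison slides. The sketch below does that, and also recomputes the Stanford GPU cluster ratios and the Titan Z core count. It assumes, as the slides implicitly do, that each system performs the same training work, so the power ratio stands in for energy efficiency; the dictionary and function names are hypothetical.

```python
# Power and cost figures quoted on the comparison slides.
GOOGLE_BRAIN = {"kilowatts": 600.0, "dollars": 5_000_000.0}
STANFORD_GPU = {"kilowatts": 4.0, "dollars": 33_000.0}
TITAN_Z_SERVER = {"kilowatts": 2.0, "dollars": 12_000.0}


def ratios(system, baseline=GOOGLE_BRAIN):
    """Return (power ratio, cost ratio) of the baseline over the system.

    Assumption: both systems do the same work, so the power ratio is
    treated as the energy-efficiency gain.
    """
    return (baseline["kilowatts"] / system["kilowatts"],
            baseline["dollars"] / system["dollars"])


def titan_z_server_cores(cards=3, cores_per_card=5760):
    """Total CUDA cores in the server: cards times cores per Titan Z."""
    return cards * cores_per_card


if __name__ == "__main__":
    for name, system in (("Stanford GPU servers", STANFORD_GPU),
                         ("Titan Z server", TITAN_Z_SERVER)):
        power, cost = ratios(system)
        print(f"{name}: {power:.0f}X less power, {cost:.0f}X lower cost")
    print(f"Titan Z server cores: {titan_z_server_cores():,}")
```

The power ratio for the Titan Z server comes out at exactly 300X; the cost ratio is about 417X, which the slide rounds down to 400X.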
RenderMan with programmable shading
1.5 hours to render each frame
CCI 6/32 minicomputer
First CGI Film Nominated for
an Academy Award®
State-of-the-art water simulator
48 hours to simulate the base water
250 hours to render each frame
2013 Academy Award® Winner
BEST VISUAL EFFECTS
DEMO: WHALE
DEMO: FLEX
DEMO: FLAMEWORKS
DEMO: UE4
One is a photo, One is Iray…
Bunkspeed Maya
Catia 3ds Max
IRAY VCA SCALABLE GPU RENDERING APPLIANCE
GPUs: 8 Kepler-class
GPU memory: 12GB per GPU
CUDA cores: 23,040
Network: 2 x 1GigE, 2 x 10GigE, 1 x InfiniBand
DEMO: IRAY / HONDA
[Chart: Relative Performance, scale 0 to 80: CPU-only Workstation, Quadro K5000 Workstation, Iray VCA]
IRAY VCA SCALABLE GPU RENDERING APPLIANCE
MSRP $50,000
GRID GPU in the Cloud
Ben Fathi Chief Technology Officer
Horizon DaaS Platform
Mobile CUDA
“10 of the Top 10” Greenest Supercomputers Powered by CUDA GPUs
Unify GPU and Tegra Architecture
192 fully programmable CUDA cores
326 GFLOPS
4X energy efficiency over A15
TEGRA K1 Mobile Super Chip
GPU ARCHITECTURE: Tesla, Fermi, Kepler, Maxwell
MOBILE ARCHITECTURE: Tegra 3, Tegra 4, Tegra K1
Computer Vision on CUDA
Feature Detection / Tracking: ~30 GFLOPS @ 30 Hz
Object Recognition / Tracking: ~180 GFLOPS @ 30 Hz
3D Scene Interpretation: ~280 GFLOPS @ 30 Hz
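To put these workload figures in context, the sketch below converts each one into a per-frame compute budget and compares it with the 326 GFLOPS quoted for Tegra K1 earlier in the deck. Treating 326 GFLOPS as a usable budget is an assumption: it is a peak figure, and sustained throughput on real vision code would be lower. The names are hypothetical.

```python
# Peak figure quoted for Tegra K1 earlier in the deck.
TEGRA_K1_PEAK_GFLOPS = 326.0
FRAME_RATE_HZ = 30.0

# Approximate workload figures quoted on the slide, in GFLOPS at 30 Hz.
WORKLOADS_GFLOPS = {
    "Feature Detection / Tracking": 30.0,
    "Object Recognition / Tracking": 180.0,
    "3D Scene Interpretation": 280.0,
}


def gflop_per_frame(name):
    """Compute budget per frame: sustained GFLOPS divided by frame rate."""
    return WORKLOADS_GFLOPS[name] / FRAME_RATE_HZ


def fraction_of_peak(name):
    """Share of the quoted Tegra K1 peak the workload would consume."""
    return WORKLOADS_GFLOPS[name] / TEGRA_K1_PEAK_GFLOPS


if __name__ == "__main__":
    for name in WORKLOADS_GFLOPS:
        print(f"{name}: {gflop_per_frame(name):.2f} GFLOP per frame, "
              f"{fraction_of_peak(name):.0%} of quoted peak")
```

All three workloads fit under the quoted peak, with 3D scene interpretation consuming roughly 86 percent of it.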
JETSON TK1 1st MOBILE SUPERCOMPUTER FOR EMBEDDED SYSTEMS
192 CUDA cores
326 GFLOPS
VisionWorks SDK
$192
VISIONWORKS COMPUTER VISION ON CUDA
Driver Assistance Computational Photography
Augmented Reality Robotics CUDA
Jetson TK1
VisionWorks Primitives
Your Code
Sample Pipelines
Object Detection / Tracking
Structure from Motion …
Classifier Corner Detection …
TEGRA ROADMAP
[Chart: Single Precision GFLOPS / W Normalized, 2011 to 2015; y-axis 0 to 80]
Tegra 2
Tegra 3
Tegra 4
Tegra K1: Kepler GPU, CUDA, 64b & 32b CPU
Erista: Maxwell GPU
Andreas Reich Head of Audi Pre-Development
VIDEO: AUDI ADAS
CUDA EVERYWHERE PASCAL PC CLOUD MOBILE
DEMO: PORTAL ON SHIELD