Deep Learning Update May 2016
Uploaded by frederic-pariente | Category: Technology
Transcript of Deep Learning Update May 2016
Deep Learning Achieves “Superhuman” Results
A NEW COMPUTING MODEL
Traditional Computer Vision: Experts + Time
Deep Learning Object Detection: DNN + Data + HPC
ImageNet
STATE OF DEEP LEARNING SYSTEMS
100k Deep Learning in 2015
[Chart: DL Systems Sold by Application. Categories: Machine Learning Algorithms, Image Recognition, Object Recognition, Big Data, Natural Language Processing, Action Recognition, Medical, Facial Recognition, Speech Recognition, Other]
[Chart: DL Systems Sold by Geography, TOP 10]
MICROSOFT: “SUPER DEEP NETWORKS”
Microsoft Deep ResNet
http://arxiv.org/pdf/1512.03385v1.pdf
Revolution of Depth
18 LAYERS: 1.8 GF
152 LAYERS: 11.3 GF
>6X MORE FLOPS
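The ">6X" figure follows from the two FLOP counts on the slide. A quick arithmetic check, reading "GF" as billions of floating-point operations per forward pass (an interpretation of the slide's abbreviation, not something the slide spells out):

```python
# FLOP counts as printed on the slide (GF read as 10^9 floating-point operations).
resnet18_gflops = 1.8
resnet152_gflops = 11.3

# Ratio of compute cost versus ratio of depth.
flops_ratio = resnet152_gflops / resnet18_gflops
depth_ratio = 152 / 18

print(f"FLOPs ratio: {flops_ratio:.2f}x")  # 6.28x
print(f"Depth ratio: {depth_ratio:.2f}x")  # 8.44x
```

On these numbers, compute grows more slowly than layer count: roughly 8.4x the layers for roughly 6.3x the FLOPs.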
BAIDU: DL DEVELOPERS NEED HPC
“Investments in computer systems — and I think the bleeding-edge of AI, and deep learning specifically, is shifting to HPC (high performance computing) — can cut down the time to run an experiment, and therefore go around the circle, from a week to a day and sometimes even faster.”
“Those of us that grew up doing machine learning often didn’t grow up with an HPC or computer systems background … partnerships between machine learning researchers and computer systems researchers tend to help both teams drive a lot of machine learning progress.”
— Andrew Ng
NVIDIA DGX-1
WORLD’S FIRST DEEP LEARNING SUPERCOMPUTER
170 TFLOPS FP16
8x Tesla P100 16GB
NVLink Hybrid Cube Mesh
Accelerates Major AI Frameworks
Dual Xeon
7 TB SSD Deep Learning Cache
Dual 10GbE, Quad IB 100Gb
3RU – 3200W
BENEFITS FOR AI RESEARCHERS
Fastest DL Supercomputer
Design Big Networks
Reduce Training Times
DL SDK, Ongoing Updates: cuDNN, NCCL, cuSPARSE, cuBLAS, cuFFT
NVIDIA DGX-1 SOFTWARE STACK
Optimized for Deep Learning Performance
Accelerated Deep Learning
cuDNN NCCL
cuSPARSE cuBLAS cuFFT
Container Based Applications
NVIDIA Cloud Management
Digits DL Frameworks GPU Apps
INTRODUCING TESLA P100
New GPU Architecture to Enable the World’s Fastest Compute Node
Pascal Architecture: Highest Compute Performance
NVLink: GPU Interconnect for Maximum Scalability
CoWoS HBM2: Unifying Compute & Memory in Single Package
Page Migration Engine: Simple Parallel Programming with Virtually Unlimited Memory
[Diagrams: two CPUs with PCIe switches; Unified Memory shared between CPU and Tesla P100]
GIANT LEAPS IN EVERYTHING
PASCAL ARCHITECTURE: 21 Teraflops of FP16 for Deep Learning
[Chart: Teraflops (FP32/FP16) for K40, M40, P100 (FP32), P100 (FP16)]
NVLINK: 5x GPU-GPU Bandwidth
[Chart: Bi-directional BW (GB/sec) for K40, M40, P100]
CoWoS HBM2 Stacked Mem: 3x Higher for Massive Data Workloads
[Chart: Bandwidth (GB/s) for K40, M40, P100]
PAGE MIGRATION ENGINE: Virtually Unlimited Memory Space
[Chart: Addressable Memory (GB) for K40, M40, P100]
HUGE JUMP IN PERFORMANCE
                      DUAL XEON    8X TESLA M40    DGX-1 (8X TESLA P100)
FLOPS (CPU + GPU)     3 TF         58 TF           170 TF
PROC-PROC BW          25 GB/s      64 GB/s         640 GB/s
ALEXNET TRAIN TIME    150 HOURS    9 HOURS         2 HOURS
# NODES FOR 3HR TAT   250          4               1
PERFORMANCE           1X           63X             250X
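The rows above can be cross-checked against each other. The sketch below computes two candidate speedup ratios from the slide's own numbers. The observation that the PERFORMANCE row lines up with the node-count row rather than the train-time row is a reading of the figures, not something the slide states.

```python
# Values copied from the "HUGE JUMP IN PERFORMANCE" figures.
systems = {
    "Dual Xeon":    {"train_hours": 150, "nodes_for_3hr": 250, "claimed": 1},
    "8x Tesla M40": {"train_hours": 9,   "nodes_for_3hr": 4,   "claimed": 63},
    "DGX-1":        {"train_hours": 2,   "nodes_for_3hr": 1,   "claimed": 250},
}

base = systems["Dual Xeon"]
ratios = {}
for name, s in systems.items():
    ratios[name] = {
        # Speedup implied by single-system AlexNet training time.
        "by_train_time": base["train_hours"] / s["train_hours"],
        # Speedup implied by nodes needed for a 3-hour turnaround.
        "by_node_count": base["nodes_for_3hr"] / s["nodes_for_3hr"],
    }
    r = ratios[name]
    print(f"{name}: train-time {r['by_train_time']:.1f}x, "
          f"node-count {r['by_node_count']:.1f}x, slide says {s['claimed']}x")
```

The node-count ratios come out at 62.5x and 250x, which match the slide's 63X and 250X after rounding. The train-time ratios are about 16.7x and 75x.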
HIGHEST ABSOLUTE PERFORMANCE DELIVERED
NVLink for Max Scalability, More than 45x Faster with 8x P100
[Chart: Speed-up vs Dual Socket Haswell (baseline: 2x Haswell CPU) for Caffe/Alexnet, VASP, HOOMD-Blue, COSMO, MILC, Amber, HACC. Series: 2x K80 (M40 for Alexnet), 2x P100, 4x P100, 8x P100]
DATACENTER IN A RACK
1 Rack of Tesla P100 Delivers Performance of 6,000 CPUs
P100 rack: 12 NODES in RACK, 8 P100s per Node (96 P100s / 38 kW)
CPU rack: 36 NODES in RACK, DUAL CPUs Per Node
[Chart: # of Racks of CPU nodes needed to match one P100 rack, per application]
QUANTUM PHYSICS (MILC): 638 CPUs / 186 kW
WEATHER (COSMO): 650 CPUs / 190 kW
DEEP LEARNING (CAFFE/ALEXNET): 6,000 CPUs / 1.8 MW
MOLECULAR DYNAMICS (AMBER): 2,900 CPUs / 850 kW
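The rack counts implied by this slide can be derived from its per-rack figures. A minimal sketch: the pairing of each application with its CPU count and power figure follows the order in which they appear in the transcript, which is an assumption about the original slide layout.

```python
import math

CPUS_PER_RACK = 36 * 2   # 36 nodes per rack, dual CPUs per node
P100_PER_RACK = 12 * 8   # 12 nodes per rack, 8 P100s per node
P100_RACK_KW = 38        # power quoted for the 96-GPU rack

# (CPU count, power in kW) quoted as equivalent to one P100 rack.
workloads = {
    "MILC": (638, 186),
    "COSMO": (650, 190),
    "Caffe/AlexNet": (6000, 1800),
    "AMBER": (2900, 850),
}

# Whole CPU racks needed, and how much more power they draw than the P100 rack.
racks = {name: math.ceil(cpus / CPUS_PER_RACK) for name, (cpus, _) in workloads.items()}
power_ratio = {name: kw / P100_RACK_KW for name, (_, kw) in workloads.items()}

for name in workloads:
    print(f"{name}: {racks[name]} CPU racks, {power_ratio[name]:.1f}x the power")
```

For Caffe/AlexNet this gives 84 CPU racks, consistent with the largest tick that survives from the chart's "# of Racks" axis.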
TESLA P100 ACCELERATOR
Compute: 5.3 TF DP ∙ 10.6 TF SP ∙ 21.2 TF HP
Memory: HBM2: 720 GB/s ∙ 16 GB
Interconnect: NVLink (up to 8 way) + PCIe Gen3
Programmability: Page Migration Engine, Unified Memory
Availability: DGX-1: Order Now
Atos, Cray, Dell, HP, IBM: Q1 2017
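These peak figures tie back to the DGX-1 numbers earlier in the deck. The sketch below checks that eight P100s at the quoted half-precision rate give the 170 TFLOPS FP16 stated for the DGX-1. It also derives a compute-to-bandwidth ratio, a back-of-envelope figure that does not appear on the slides.

```python
# Peak figures from the Tesla P100 slide.
tf_dp, tf_sp, tf_hp = 5.3, 10.6, 21.2   # TFLOPS at double, single, half precision
hbm2_gb_per_s = 720                     # HBM2 memory bandwidth

# Each halving of precision doubles the peak rate.
sp_over_dp = tf_sp / tf_dp
hp_over_sp = tf_hp / tf_sp

# Eight P100s in one DGX-1.
dgx1_fp16_tf = 8 * tf_hp

# Illustrative only: single-precision FLOPs available per byte read from HBM2.
flops_per_byte_sp = (tf_sp * 1e12) / (hbm2_gb_per_s * 1e9)

print(f"SP/DP = {sp_over_dp:.1f}, HP/SP = {hp_over_sp:.1f}")
print(f"8x P100 FP16 peak: {dgx1_fp16_tf:.1f} TFLOPS")
print(f"SP FLOPs per HBM2 byte: {flops_per_byte_sp:.1f}")
```

The aggregate comes to 169.6 TFLOPS, which the DGX-1 slide rounds to 170.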
END-TO-END PRODUCT FAMILY
HYPERSCALE HPC: Tesla M4, M40. Hyperscale deployment for DL training, inference, video & image processing
MIXED-APPS HPC: Tesla K80. HPC data centers running mix of CPU and GPU workloads
STRONG-SCALING HPC: Tesla P100. Hyperscale & HPC data centers running apps that scale to multiple GPUs
FULLY INTEGRATED DL SUPERCOMPUTER: DGX-1. For customers who need to get going now with fully integrated solution
NVIDIA DEEP LEARNING SDK
High Performance GPU-Acceleration for Deep Learning
APPLICATIONS
COMPUTER VISION: Object Detection, Image Classification
SPEECH AND AUDIO: Voice Recognition, Translation
BEHAVIOR: Recommendation Engines, Sentiment Analysis
FRAMEWORKS
Mocha.jl
DEEP LEARNING SDK
DEEP LEARNING: cuDNN
MATH LIBRARIES: cuBLAS, cuSPARSE, cuFFT
MULTI-GPU: NCCL
NVIDIA CUDNN
Building blocks for accelerating deep neural networks on GPUs
High performance deep neural network training
Accelerates Deep Learning: Caffe, CNTK, TensorFlow, Theano, Torch
Performance continues to improve over time
“NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time.”
— Evan Shelhamer, Lead Caffe Developer, UC Berkeley
developer.nvidia.com/cudnn
[Chart: AlexNet training throughput speedup by year: 2014 K40 (cuDNN v1), 2015 M40 (cuDNN v3), 2016 Pascal (cuDNN v5)]
AlexNet training throughput based on 20 iterations, CPU: 1x E5-2680v3 12 Core 2.5GHz.
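The footnote says throughput was measured over 20 iterations. A minimal sketch of that kind of measurement follows. `dummy_step` and the batch size are hypothetical stand-ins for illustration, not the benchmark NVIDIA ran.

```python
import time

def measure_throughput(step, batch_size, iterations=20):
    """Return images per second averaged over `iterations` calls to `step`."""
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    elapsed = time.perf_counter() - start
    return batch_size * iterations / elapsed

def speedup(throughput, baseline_throughput):
    """Relative speedup, as plotted on a chart normalized to a baseline."""
    return throughput / baseline_throughput

def dummy_step():
    # Hypothetical stand-in for one training iteration (forward + backward pass).
    sum(i * i for i in range(10_000))

images_per_sec = measure_throughput(dummy_step, batch_size=128)
print(f"{images_per_sec:.0f} images/s")
```

Averaging over several iterations smooths out per-iteration jitter. A real benchmark would usually also discard a few warm-up iterations before timing.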
WHAT’S NEW IN CUDNN 5?
Pascal GPU, RNNs, Improved Performance
LSTM recurrent neural networks deliver up to 6x speedup in Torch
Improved performance:
• Deep Neural Networks with 3x3 convolutions, like VGG, GoogLeNet and ResNets
• 3D Convolutions
• FP16 routines on Pascal GPUs
5.9x Speedup for char-rnn RNN Layers
2.8x Speedup for DeepSpeech 2 RNN Layers
Performance relative to torch-rnn (https://github.com/jcjohnson/torch-rnn)
DeepSpeech2: http://arxiv.org/abs/1512.02595
Char-rnn: https://github.com/karpathy/char-rnn
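For readers unfamiliar with the LSTM layers these speedups refer to, the sketch below implements one time step of the standard LSTM cell in plain Python. It illustrates the computation being accelerated and is not the cuDNN API. The gate ordering (input, forget, cell, output) and the single stacked weight matrix are assumptions chosen for brevity.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM time step.

    x: input vector; h: previous hidden state; c: previous cell state.
    W: 4*H rows, each of length len(x) + len(h), stacked in the assumed
       order [input gate, forget gate, cell candidate, output gate].
    b: 4*H biases in the same order.
    """
    H = len(h)
    xh = list(x) + list(h)
    # Pre-activations for all four gates in one matrix-vector product.
    pre = [sum(w * v for w, v in zip(row, xh)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:H]]
    f = [sigmoid(p) for p in pre[H:2 * H]]
    g = [math.tanh(p) for p in pre[2 * H:3 * H]]
    o = [sigmoid(p) for p in pre[3 * H:4 * H]]
    # New cell state mixes the old state (forget gate) with the candidate (input gate).
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

# Usage with zero weights: every sigmoid gate is 0.5 and the candidate is 0.
H, X = 2, 3
W = [[0.0] * (X + H) for _ in range(4 * H)]
b = [0.0] * (4 * H)
h_new, c_new = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], W, b)
print(c_new)
```

Each step depends on the previous step's `h` and `c`, so the work across time steps is inherently sequential, which is why optimized RNN kernels matter for training speed.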
Sep 28-29, 2016 | Amsterdam
www.gputechconf.eu #GTC16
DEEP LEARNING & ARTIFICIAL INTELLIGENCE
SELF-DRIVING CARS
VIRTUAL REALITY & AUGMENTED REALITY
SUPERCOMPUTING & HPC
GTC Europe is a two-day conference designed to expose the innovative ways developers, businesses and academics are using parallel computing to transform our world.
EUROPE’S BRIGHTEST MINDS & BEST IDEAS
2 Days | 800 Attendees | 50+ Exhibitors | 50+ Speakers | 15+ Tracks | 15+ Workshops | 1-to-1 Meetings