INSIDE THE VOLTA GPU ARCHITECTURE AND CUDA 9...VOLTA MULTI-PROCESS SERVICE Hardware Accelerated Work...
![Page 1: INSIDE THE VOLTA GPU ARCHITECTURE AND CUDA 9...VOLTA MULTI-PROCESS SERVICE Hardware Accelerated Work Submission Hardware Isolation VOLTA MULTI-PROCESS SERVICE Volta GV100 A B C CUDA](https://reader036.fdocuments.net/reader036/viewer/2022071006/5fc3f28d35eb655bba1c4849/html5/thumbnails/1.jpg)
INSIDE THE VOLTA GPU ARCHITECTURE AND CUDA 9
Axel Koehler, Principal Solution Architect
GPU Technology Conference Europe, October 2017
CONTINUED DEMAND FOR COMPUTE POWER
Ever-increasing compute power demand in HPC:
• Comprehensive Earth system model
• Coupled simulation of entire cells
• Simulation of combustion for new high-efficiency, low-emission engines
• Predictive calculations for supernovae
Neural network complexity is exploding:
• 2015: Microsoft ResNet, superhuman image recognition: 7 ExaFLOPS, 60 million parameters
• 2016: Baidu Deep Speech 2, superhuman voice recognition: 20 ExaFLOPS, 300 million parameters
• 2017: Google Neural Machine Translation, near-human language translation: 100 ExaFLOPS, 8700 million parameters
INTRODUCING TESLA V100
The Fastest and Most Productive GPU for Deep Learning and HPC
• Volta Architecture: Most Productive GPU
• Tensor Core: 120 Programmable TFLOPS Deep Learning
• Improved SIMT Model: New Algorithms
• Volta MPS: Inference Utilization
• Improved NVLink & HBM2: Efficient Bandwidth
NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU
TESLA V100
• 21B transistors, 815 mm²
• 80 SMs*, 5120 CUDA cores, 640 Tensor Cores
• 16 GB HBM2, 900 GB/s HBM2 bandwidth
• 300 GB/s NVLink
* The full GV100 chip contains 84 SMs.
NEW SM MICROARCHITECTURE
VOLTA GV100 SM
| | GP100 | GV100 |
|---|---|---|
| FP32 units | 64 | 64 |
| FP64 units | 32 | 32 |
| INT32 units | NA | 64 |
| Tensor Cores | NA | 8 |
| Register File | 256 KB | 256 KB |
| Unified L1/Shared memory | L1: 24 KB, Shared: 64 KB | 128 KB |
| Active Threads | 2048 | 2048 |
Redesigned for Productivity:
• Completely new ISA
• Twice the schedulers
• Simplified issue logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration
UNIFYING KEY TECHNOLOGIES
Pascal SM: Load/Store Units; Shared Memory 64 KB; L1$ 24 KB; L2$ 4 MB
Volta SM: Load/Store Units; L1$ and Shared Memory 128 KB; L2$ 6 MB
Key technologies unified: Low Latency, Streaming
VOLTA L1 AND SHARED MEMORY
SM: Load/Store Units; L1$ and Shared Memory 128 KB; L2$ 6 MB
Volta Streaming L1$:
• Unlimited cache misses in flight
• Low cache hit latency
• 4x more bandwidth
• 5x more capacity
Volta Shared Memory:
• Unified storage with L1
• Configurable up to 96 KB
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache
Cache vs. shared:
• Easier to use
• 90%+ as good
Shared vs. cache:
• Faster atomics
• More banks
• More predictable
Average Shared Memory Benefit (directed testing: shared in global): Pascal 70%, Volta 93%
INDEPENDENT THREAD SCHEDULING
PRE-VOLTA WARP EXECUTION MODEL
32 thread warp: a single Program Counter (PC) and Stack (S) shared by the warp
    if (threadIdx.x < 4) {
        A;
        B;
    } else {
        X;
        Y;
    }
Pre-Volta execution over time: diverge; A; B; and X; Y; run one branch after the other; reconverge
No Synchronization Permitted
VOLTA WARP EXECUTION MODEL
32 thread warp with independent scheduling: a Program Counter (PC) and Stack (S) for each of the 32 threads
Convergence Optimizer
    if (threadIdx.x < 4) {
        A;
        __syncwarp();
        B;
    } else {
        X;
        __syncwarp();
        Y;
    }
    __syncwarp();
Volta execution over time: diverge; statements from A; B; and X; Y; may be interleaved; synchronize
Synchronization may lead to interleaved scheduling!
Volta Independent Thread Scheduling:
• Enables interleaved execution of statements from divergent branches
• Enables execution of fine-grain parallel algorithms where threads within a warp may synchronize and communicate
• At any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before
• Execution is still SIMT which retains the high throughput
• Use explicit synchronization, don’t rely on implicit convergence
• CUDA 9 provides a fully explicit synchronization model
VOLTA: INDEPENDENT THREAD SCHEDULING
Extended SIMT model enables thread-parallel programs to execute with vector efficiency
Volta: Threads may wait for messages
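Explicit synchronization is expressed in terms of which lanes of a warp take part, so it can help to see how a branch such as `threadIdx.x < 4` splits the 32 lanes. The following host-side C++ sketch only models that split as bit masks; it is an illustration, not device code. It assumes a one-dimensional block whose first warp has lane index equal to `threadIdx.x`, and the names `kWarpSize`, `taken_mask` and `not_taken_mask` are hypothetical.

```cpp
#include <cstdint>

constexpr int kWarpSize = 32;  // threads per warp, as on the slides

// Bit i of the result is set when lane i satisfies `lane < threshold`,
// i.e. when that lane would take the `if` side of the branch.
std::uint32_t taken_mask(int threshold) {
    std::uint32_t mask = 0;
    for (int lane = 0; lane < kWarpSize; ++lane) {
        if (lane < threshold) {
            mask |= (std::uint32_t{1} << lane);
        }
    }
    return mask;
}

// The remaining lanes take the `else` side.
std::uint32_t not_taken_mask(int threshold) {
    return ~taken_mask(threshold);
}
```

For the slide's condition, `taken_mask(4)` is `0x0000000F` and `not_taken_mask(4)` is `0xFFFFFFF0`; together they cover the full warp, which is the default mask `__syncwarp()` uses in CUDA 9.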
VOLTA TENSOR CORE
TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices
New CUDA TensorOp instructions & data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
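The operation D = A * B + C on 4x4 matrices can be written out as a plain reference loop. The sketch below is a host-side C++ model of the arithmetic only, not of the hardware: it stores A and B as `float`, assuming their values are representable in FP16, and accumulates in FP32. The names `Mat4`, `mma_4x4x4` and `fma_per_mma` are hypothetical.

```cpp
#include <array>

using Mat4 = std::array<std::array<float, 4>, 4>;

// Reference for D = A * B + C on 4x4 matrices with FP32 accumulation.
// A and B stand in for the FP16 inputs; C and D are the FP32 matrices.
Mat4 mma_4x4x4(const Mat4& a, const Mat4& b, const Mat4& c) {
    Mat4 d = c;  // start from the accumulator C
    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j) {
            for (int k = 0; k < 4; ++k) {
                d[i][j] += a[i][k] * b[k][j];  // one multiply-add per (i, j, k)
            }
        }
    }
    return d;
}

// The 4x4x4 processing array corresponds to 4 * 4 * 4 = 64 multiply-adds.
constexpr int fma_per_mma() { return 4 * 4 * 4; }
```

With A set to the identity matrix, the result is simply B + C element by element, which makes a convenient sanity check for the loop order.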
USING TENSOR CORES
Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ Warp-Level Matrix Operations:
    __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;
        wmma::fragment<matrix_b, …> Bmat;
        wmma::fragment<matrix_c, …> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);

        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
    }
cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply
[Chart: cuBLAS Mixed Precision (FP16 input, FP32 compute); relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096); P100 (CUDA 8) vs. V100 Tensor Cores (CUDA 9); up to 9.3x]
[Chart: cuBLAS Single Precision (FP32); relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096); P100 (CUDA 8) vs. V100 (CUDA 9); up to 1.8x]
Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.
NEW HBM2 MEMORY ARCHITECTURE
[Chart: STREAM Triad, delivered GB/s: P100 at 76% DRAM utilization, V100 at 95% DRAM utilization; 1.5x delivered bandwidth]
• Unifying compute & memory in a single package
• More bandwidth and more energy efficient
• ECC can be active without a bandwidth or capacity penalty
VOLTA NVLINK
• 6 NVLINKS @ 50 GB/s bidirectional
• Reduce number of lanes for lightly loaded link (Power savings)
• Coherence features for NVLINK enabled CPUs
[Diagrams: POWER9 based node; hybrid cube mesh (e.g. DGX-1V)]
STATE OF UNIFIED MEMORY
High performance, low effort
Allocate Beyond GPU Memory Size: Unified Memory across GPU and CPU
PGI OpenACC on Pascal P100: Unified Memory (automatic data movement for allocatables) vs. explicit data movement
Geometric mean across all 15 SPEC ACCEL™ benchmarks, performance vs. no Unified Memory: 86% PCI-E, 91% NVLink
PGI 17.1 Compilers OpenACC SPEC ACCEL™ 1.1 performance measured March, 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.
VOLTA + UNIFIED MEMORY
VOLTA + NVLINK CPU
VOLTA + PCIE CPU
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B, C; CUDA Multi-Process Service control; hardware accelerated work submission; hardware isolation; GPU execution of A, B, C on Volta GV100]
Volta MPS Enhancements:
• MPS clients submit work directly to the work queues within the GPU
• Reduced launch latency
• Improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
VOLTA MPS FOR INFERENCE
Efficient inference deployment without batching system
[Chart: Resnet50 images/sec at 7 ms latency: single Volta client, no batching, no MPS; multiple Volta clients, no batching, using MPS (7x faster, 60% of perf with batching); Volta with batching system]
V100 measured on pre-production hardware.
GPU PERFORMANCE COMPARISON
| | P100 | V100 | Ratio |
|---|---|---|---|
| Training acceleration | 10 TOPS | 125 TOPS | 12.5x |
| Inference acceleration | 21 TFLOPS | 125 TOPS | 6x |
| FP64/FP32 | 5/10 TFLOPS | 7.8/15.7 TFLOPS | 1.5x |
| HBM2 Bandwidth | 720 GB/s | 900 GB/s | 1.2x |
| NVLink Bandwidth | 160 GB/s | 300 GB/s | 1.9x |
| L2 Cache | 4 MB | 6 MB | 1.5x |
| L1 Caches | 1.3 MB | 10 MB | 7.7x |
REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance
Over 80x DL Training Performance in 3 Years
[Chart: GoogLeNet training performance, speedup vs. K80 (axis 0x to 100x); configurations: 1x K80 cuDNN2, 4x M40 cuDNN3, 8x P100 cuDNN6, 8x V100 cuDNN7; time axis: Q1 15, Q3 15, Q2 16, Q2 17]
85% Scale-Out Efficiency: Scales to 64 GPUs with Microsoft Cognitive Toolkit
[Chart: Multi-node training with NCCL 2.0 (ResNet-50): 8X P100 18 hours, 8X V100 7.4 hours, 64X V100 1 hour]
ResNet50 Training for 90 Epochs with 1.28M images dataset | Cognitive Toolkit with NCCL 2.0 | V100 performance measured on pre-production hardware.
3X Reduction in Time to Train Over P100
[Chart: LSTM training (Neural Machine Translation): 2X CPU 15 days, 1X P100 18 hours, 1X V100 6 hours]
Neural Machine Translation Training for 13 Epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance measured on pre-production hardware.
VOLTA HPC PERFORMANCE
[Chart: application performance relative to Tesla P100]
System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.
INTRODUCING CUDA 9
BUILT FOR VOLTA
• Tesla V100
• New GPU Architecture
• Tensor Cores
• NVLink
• Independent Thread Scheduling
COOPERATIVE THREAD GROUPS
• Flexible Thread Groups
• Efficient Parallel Algorithms
• Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs
FASTER LIBRARIES
• cuBLAS for Deep Learning
• NPP for Image Processing
• cuFFT for Signal Processing
DEVELOPER TOOLS & PLATFORM UPDATES
• Faster Compile Times
• Unified Memory Profiling
• NVLink Visualization
• New OS and Compiler Support
CUDA 9: WHAT’S NEW IN LIBRARIES
Application areas: Deep Learning, Scientific Computing
VOLTA PLATFORM SUPPORT
• Utilize Volta Tensor Cores
• Volta optimized GEMMs (cuBLAS)
• Out-of-box performance on Volta (all libraries)
PERFORMANCE
• GEMM optimizations for RNNs (cuBLAS)
• Faster image processing (NPP)
• FFT optimizations across various sizes (cuFFT)
NEW ALGORITHMS
• Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
• Breadth first search, clustering, triangle counting, extraction & contraction (nvGRAPH)
IMPROVED USER EXPERIENCE
• New install package for CUDA Libraries (library-only meta package)
• Modular NPP with small footprint, support for image batching
CUDA 9: UP TO 5X FASTER LIBRARIES
cuBLAS: 5x to 9x faster GEMM operations speed up deep learning and HPC apps
[Chart: speedup vs. CUDA 8* (axis 0x to 10x) by matrix size (512, 1024, 2048, 2816); series: FP32; FP16 I/O, FP32 compute]
cuFFT: 2x faster library speeds up image, video and signal processing operations
[Chart: speedup vs. CUDA 8* (axis 0x to 3x) by data size (1, 64, 16384, 4194304); series: 1D, 2D, 3D]
NPP: Up to 100x faster than IPP for image processing and computer vision operations
[Chart: speedup vs. IPP** (axis 0x to 100x); categories: Color Proc., Filters, Geometry Transforms, JPEG, Morphological Ops.]
* V100 and CUDA 9 (r384); Intel Xeon Broadwell, dual socket, E5-2698 v4 @ 2.6GHz, 3.5GHz Turbo with Ubuntu 14.04.5 x86_64 with 128GB System Memory
* P100 and CUDA 8 (r361); for cuBLAS CUDA 8 (r361): Intel Xeon Haswell, single-socket, 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 with 128GB System Memory
** CPU system running IPP: Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo Ubuntu 14.04.5 x86_64 with 128GB System Memory
COOPERATIVE GROUPS
COOPERATIVE GROUPS
A flexible model for synchronisation and communication within groups of threads
Levels of cooperation: TODAY
Levels of cooperation: CUDA 9
COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization
Thread groups are explicit objects in your program:
    thread_group block = this_thread_block();
You can synchronize threads in a group:
    block.sync();
Create new groups by partitioning existing groups:
    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4 = tiled_partition(tile32, 4);
Partitioned groups can also synchronize:
    tile4.sync();
Note: thread_group, this_thread_block() and tiled_partition() are part of the cooperative_groups:: namespace
[Diagram: Thread Block Group; Partitioned Thread Groups]
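To make the effect of the two partitioning steps concrete, the index arithmetic can be modelled on the host. The C++ sketch below assumes that a tiled partition groups threads by consecutive rank, so that a thread's tile is its rank divided by the tile size and its rank inside the tile is the remainder. It is a model for illustration, not the cooperative_groups implementation, and the names `TileCoord` and `tile_of` are hypothetical.

```cpp
// Position of a thread after partitioning a group into fixed-size tiles.
struct TileCoord {
    int tile_index;    // which tile the thread lands in
    int rank_in_tile;  // the thread's rank inside that tile
};

// ASSUMPTION: tiles are formed from consecutive thread ranks of the parent group.
TileCoord tile_of(int thread_rank, int tile_size) {
    return {thread_rank / tile_size, thread_rank % tile_size};
}
```

For example, the thread with rank 70 in its block falls into 32-thread tile 2 with rank 6; partitioning that tile again into groups of 4 puts it in sub-tile 1 with rank 2, so a `sync()` on the 4-thread tile would involve only ranks 68 to 71 of the block.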
COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication
Define, synchronize, and partition groups of cooperating threads
Flexible: High-performance API for clean and robust management of thread groups
Scalable: Create and manage groups within warps, across thread blocks, and even across GPUs
Deploy Everywhere (*): Kepler and Newer GPUs
Supported by CUDA developer tools
* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs
DEVELOPER TOOLS
UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source
Page Fault Correlation
NEW UNIFIED MEMORY EVENTS
Page Throttling, Memory Thrashing, Remote Map
Visualize Virtual Memory Activity
FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc
Removes CUDA-specific allocator restrictions
Data movement is transparently handled
Requires operating system support: HMM Linux Kernel Module
CUDA 8 Code with System Allocator:
    void sortfile(FILE *fp, int N) {
        char *data;

        // Allocate memory using any standard allocator
        data = (char *) malloc(N * sizeof(char));

        fread(data, 1, N, fp);

        sort<<<...>>>(data, N, 1, compare);

        // Kernel launches are asynchronous: wait before the CPU reads data
        cudaDeviceSynchronize();

        use_data(data);

        // Free the allocated memory
        free(data);
    }
ADDITIONAL RESOURCES
• Volta
• Whitepaper http://www.nvidia.com/object/volta-architecture-whitepaper.html
• Blog https://devblogs.nvidia.com/parallelforall/inside-volta
• CUDA 9
• Blog https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed
• Download https://developer.nvidia.com/cuda-downloads