INSIDE THE VOLTA GPU ARCHITECTURE AND CUDA 9

Axel Koehler, Principal Solution Architect

GPU Technology Conference Europe, October 2017

2

CONTINUED DEMAND FOR COMPUTE POWER

Ever-increasing compute power demand in HPC

• Comprehensive Earth System Model
• Coupled simulation of entire cells
• Simulation of combustion for new high-efficiency, low-emission engines
• Predictive calculations for supernovae

Neural network complexity is exploding

• 2015: Microsoft ResNet, Superhuman Image Recognition
• 2016: Baidu Deep Speech 2, Superhuman Voice Recognition
• 2017: Google Neural Machine Translation, Near Human Language Translation

100 ExaFLOPS, 8700 Million Parameters

3

INTRODUCING TESLA V100

The Fastest and Most Productive GPU for Deep Learning and HPC

Volta Architecture: Most Productive GPU

Tensor Core: 120 Programmable TFLOPS Deep Learning

Improved SIMT Model: New Algorithms

Volta MPS: Inference Utilization

Improved NVLink & HBM2: Efficient Bandwidth

4

NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU

5

TESLA V100

21B transistors, 815 mm²

80 SMs, 5120 CUDA Cores, 640 Tensor Cores

16 GB HBM2, 900 GB/s HBM2 bandwidth

300 GB/s NVLink

* The full GV100 chip contains 84 SMs

6

NEW SM MICROARCHITECTURE

7

VOLTA GV100 SM

                            GP100                       GV100
FP32 units                  64                          64
FP64 units                  32                          32
INT32 units                 NA                          64
Tensor Cores                NA                          8
Register File               256 KB                      256 KB
Unified L1/Shared memory    L1: 24 KB, Shared: 64 KB    128 KB
Active Threads              2048                        2048

Redesigned for Productivity

• Completely new ISA
• Twice the schedulers
• Simplified issue logic
• Large, fast L1 cache
• Improved SIMT model
• Tensor acceleration

8

UNIFYING KEY TECHNOLOGIES

Pascal SM: Load/Store Units, Shared Memory 64 KB, L1$ 24 KB, L2$ 4 MB

Volta SM: Load/Store Units, L1$ and Shared Memory 128 KB (low latency, streaming), L2$ 6 MB

9

VOLTA L1 AND SHARED MEMORY

[Diagram: SM Load/Store Units, L1$ and Shared Memory 128 KB, L2$ 6 MB]

Volta Streaming L1$:

• Unlimited cache misses in flight
• Low cache hit latency
• 4x more bandwidth
• 5x more capacity

Volta Shared Memory:

• Unified storage with L1
• Configurable up to 96 KB
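A minimal sketch of how a kernel might use the larger shared memory configuration. In CUDA 9, dynamic shared memory above the default 48 KB per block has to be requested explicitly with `cudaFuncSetAttribute`. The kernel `reverse_in_shared` and the helper `run_large_shared_demo` are hypothetical names introduced only for this illustration, and the example assumes a Volta-class GPU.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: stages the whole array in dynamic shared memory,
// then writes it back in reverse order.
__global__ void reverse_in_shared(float *data, int n)
{
    extern __shared__ float tile[];              // size is set at launch time
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        tile[i] = data[i];
    __syncthreads();                             // all loads done before any store
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] = tile[n - 1 - i];
}

// Returns true if one block could use 96 KB of shared memory and the
// result is correct; returns false on GPUs that reject the request.
bool run_large_shared_demo()
{
    const int n = 96 * 1024 / sizeof(float);     // 24576 floats = 96 KB
    const size_t bytes = n * sizeof(float);

    float *data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) return false;
    for (int i = 0; i < n; ++i) data[i] = float(i);

    // Opt in to more than 48 KB of dynamic shared memory for this kernel.
    cudaError_t err = cudaFuncSetAttribute(
        reverse_in_shared, cudaFuncAttributeMaxDynamicSharedMemorySize, (int)bytes);
    if (err != cudaSuccess) { cudaFree(data); return false; }

    reverse_in_shared<<<1, 256, bytes>>>(data, n);
    bool ok = (cudaGetLastError() == cudaSuccess);
    ok = (cudaDeviceSynchronize() == cudaSuccess) && ok;
    for (int i = 0; ok && i < n; ++i)
        ok = (data[i] == float(n - 1 - i));

    cudaFree(data);
    return ok;
}
```

Calling `run_large_shared_demo()` from `main` is enough to try it. Without the `cudaFuncSetAttribute` call, a launch that asks for 96 KB of dynamic shared memory is expected to fail.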

10

NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Cache vs. shared:

• Easier to use
• 90%+ as good

Shared vs. cache:

• Faster atomics
• More banks
• More predictable

[Chart: Average Shared Memory Benefit. Pascal: 70%, Volta: 93%. Directed testing: shared in global]

11

INDEPENDENT THREAD SCHEDULING

12

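PRE-VOLTA WARP EXECUTION MODEL

Pre-Volta: 32 thread warp sharing one Program Counter (PC) and Stack (S)

if (threadIdx.x < 4) {
    A;
    B;
} else {
    X;
    Y;
}

[Diagram, execution over time: the warp diverges, the two branches (A; B; and X; Y;) execute one after the other, and the warp reconverges]

No Synchronization Permitted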

13

VOLTA WARP EXECUTION MODEL

Volta: 32 thread warp with independent scheduling: one Program Counter (PC) and Stack (S) per thread, plus a Convergence Optimizer

if (threadIdx.x < 4) {
    A;
    __syncwarp();
    B;
} else {
    X;
    __syncwarp();
    Y;
}
__syncwarp();

[Diagram, execution over time: the warp diverges, executes A; B; and X; Y;, then synchronizes]

Synchronization may lead to interleaved scheduling!

14

VOLTA: INDEPENDENT THREAD SCHEDULING

Extended SIMT model enables thread-parallel programs to execute with vector efficiency

Volta Independent Thread Scheduling:

• Enables interleaved execution of statements from divergent branches

• Enables execution of fine-grain parallel algorithms where threads within a warp may synchronize and communicate

• At any given clock cycle, CUDA cores execute the same instruction for all active threads in a warp just as before

• Execution is still SIMT which retains the high throughput

• Use explicit synchronization, don’t rely on implicit convergence

• CUDA 9 provides a fully explicit synchronization model

Volta: Threads may wait for messages
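The advice to use explicit synchronization can be illustrated with a warp-level sum through shared memory. This is a sketch rather than code from the talk: the kernel `warp_sum` and the helper `run_warp_sum_demo` are hypothetical names, and the kernel assumes it is launched with exactly one warp of 32 threads. Each `__syncwarp()` replaces an ordering that pre-Volta code often left to implicit lockstep execution.

```cuda
#include <cuda_runtime.h>

// One warp sums 32 integers through shared memory.
// Assumes a launch configuration of <<<1, 32>>>.
__global__ void warp_sum(const int *in, int *out)
{
    __shared__ int buf[32];
    const unsigned lane = threadIdx.x;

    buf[lane] = in[lane];
    __syncwarp();                        // every lane's write is visible to the warp

    for (int offset = 16; offset > 0; offset /= 2) {
        int v = buf[lane];
        if (lane < offset) v += buf[lane + offset];
        __syncwarp();                    // all reads finish before any lane writes
        buf[lane] = v;
        __syncwarp();                    // all writes finish before the next reads
    }
    if (lane == 0) *out = buf[0];
}

// Sums the values 1..32 on the GPU and returns the result.
int run_warp_sum_demo()
{
    int *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, 32 * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < 32; ++i) in[i] = i + 1;
    *out = 0;

    warp_sum<<<1, 32>>>(in, out);
    cudaDeviceSynchronize();

    int result = *out;                   // 1 + 2 + ... + 32 = 528
    cudaFree(in);
    cudaFree(out);
    return result;
}
```

The two barriers inside the loop separate the read phase from the write phase, so the result does not depend on how the scheduler interleaves the lanes.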

15

VOLTA TENSOR CORE

16

TENSOR CORE
Mixed Precision Matrix Math - 4x4 matrices

New CUDA TensorOp instructions & data formats

4x4x4 matrix processing array

D[FP32] = A[FP16] * B[FP16] + C[FP32]

18

USING TENSOR CORES

Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++ Warp-Level Matrix Operations:

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
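For readers who want to try the warp-level matrix operations, here is a self-contained sketch of one 16x16x16 tile multiply. It is written against the `nvcuda::wmma` API as it appears in the released CUDA 9 header `mma.h`, where, to the best of my knowledge, the accumulator fragment is named `wmma::accumulator` and the store layout is `wmma::mem_row_major`, which differs slightly from the spelling on the slide. The fragment shapes and layouts, the kernel names and `run_wmma_demo` are choices made for this illustration. It needs a Volta GPU and compilation with `-arch=sm_70`.

```cuda
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// Fill an FP16 array on the device with a constant value.
__global__ void fill_half(half *p, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(value);
}

// One warp computes D = A * B for a single 16x16x16 tile,
// with FP16 inputs and FP32 accumulation.
__global__ void wmma_16_16_16(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);                 // C = 0
    wmma::load_matrix_sync(a_frag, a, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // acc = A * B + acc
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Multiplies two all-ones 16x16 matrices; every output element should be 16.
bool run_wmma_demo()
{
    const int n = 16 * 16;
    half *a = nullptr, *b = nullptr;
    float *d = nullptr;
    cudaMalloc(&a, n * sizeof(half));
    cudaMalloc(&b, n * sizeof(half));
    cudaMallocManaged(&d, n * sizeof(float));

    fill_half<<<1, n>>>(a, n, 1.0f);
    fill_half<<<1, n>>>(b, n, 1.0f);
    wmma_16_16_16<<<1, 32>>>(d, a, b);                   // exactly one warp

    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (d[i] == 16.0f);

    cudaFree(a);
    cudaFree(b);
    cudaFree(d);
    return ok;
}
```

All 32 threads of the warp must execute the `wmma` calls together, which is why the kernel is launched with a single warp and contains no divergent control flow.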

19

cuBLAS GEMMS FOR DEEP LEARNING
V100 Tensor Cores + CUDA 9: over 9x Faster Matrix-Matrix Multiply

[Chart: cuBLAS Mixed Precision (FP16 Input, FP32 compute). Relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096) for P100 (CUDA 8) and V100 Tensor Cores (CUDA 9). Labeled speedup: 9.3x]

[Chart: cuBLAS Single Precision (FP32). Relative performance vs. matrix size (M=N=K: 512, 1024, 2048, 4096) for P100 (CUDA 8) and V100 (CUDA 9). Labeled speedup: 1.8x]

Note: pre-production Tesla V100 and pre-release CUDA 9. CUDA 8 GA release.

20

NEW HBM2 MEMORY ARCHITECTURE

[Chart: STREAM Triad, delivered GB/s. P100: 76% DRAM utilization. V100: 95% DRAM utilization. 1.5x delivered bandwidth]

• Unifying compute & memory in a single package
• More bandwidth and more energy efficient
• ECC can be active without a bandwidth or capacity penalty

21

VOLTA NVLINK

• 6 NVLINKS @ 50 GB/s bidirectional

• Reduced number of lanes for lightly loaded links (power savings)

• Coherence features for NVLINK-enabled CPUs

[Diagrams: POWER9 based node; hybrid cube mesh (e.g. DGX-1V)]

22

STATE OF UNIFIED MEMORY
High performance, low effort

Allocate Beyond GPU Memory Size

[Diagram: Unified Memory spanning GPU and CPU]

PGI OpenACC on Pascal P100

[Chart: Performance vs no Unified Memory, comparing automatic data movement for allocatables (Unified Memory) against explicit data movement. Geometric mean across all 15 SPEC ACCEL™ benchmarks: 86% with PCI-E, 91% with NVLink]

PGI 17.1 Compilers OpenACC SPEC ACCEL™ 1.1 performance measured March, 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.
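In CUDA C++ the same "low effort" model is available through `cudaMallocManaged`. The sketch below is an illustration rather than part of the talk; `scale` and `run_unified_memory_demo` are hypothetical names, and the buffer is kept small so the example runs anywhere. On Pascal and later GPUs the same allocation call is expected to accept sizes larger than GPU memory, with pages migrating on demand.

```cuda
#include <cuda_runtime.h>

// Multiply every element by a factor using a grid-stride loop.
__global__ void scale(float *x, size_t n, float factor)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= factor;
}

// Returns true if data written by the CPU, scaled by the GPU and read
// back by the CPU has the expected values, with no explicit copies.
bool run_unified_memory_demo()
{
    const size_t n = 1 << 20;
    float *x = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess) return false;

    for (size_t i = 0; i < n; ++i) x[i] = 1.0f;          // first touched on the CPU

    scale<<<256, 256>>>(x, n, 3.0f);                     // pages migrate to the GPU
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);  // wait before the CPU reads

    for (size_t i = 0; ok && i < n; ++i) ok = (x[i] == 3.0f);

    cudaFree(x);
    return ok;
}
```

There is no `cudaMemcpy` in either direction. The only CUDA-specific steps are the allocation call and the synchronization before the CPU touches the data again.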

23

VOLTA + UNIFIED MEMORY

VOLTA + NVLINK CPU

VOLTA + PCIE CPU

24

VOLTA MULTI-PROCESS SERVICE

Hardware Accelerated Work Submission

Hardware Isolation

[Diagram: CPU processes A, B and C submit work through CUDA Multi-Process Service control; GPU execution of A, B and C on Volta GV100]

Volta MPS Enhancements:

• MPS clients submit work directly to the work queues within the GPU

• Reduced launch latency

• Improved launch throughput

• Improved isolation amongst MPS clients

• Address isolation with independent address spaces

• Improved quality of service (QoS)

• 3x more clients than Pascal

25

VOLTA MPS FOR INFERENCE
Efficient inference deployment without batching system

[Chart: ResNet-50 images/sec at 7 ms latency for three configurations]

• Single Volta Client, No Batching, No MPS

• Multiple Volta Clients, No Batching, Using MPS: 7x faster, 60% of perf with batching

• Volta with Batching System

V100 measured on pre-production hardware.

26

GPU PERFORMANCE COMPARISON

                          P100           V100               Ratio
Training acceleration     10 TOPS        125 TOPS           12.5x
Inference acceleration    21 TFLOPS      125 TOPS           6x
FP64/FP32                 5/10 TFLOPS    7.8/15.7 TFLOPS    1.5x
HBM2 Bandwidth            720 GB/s       900 GB/s           1.2x
NVLink Bandwidth          160 GB/s       300 GB/s           1.9x
L2 Cache                  4 MB           6 MB               1.5x
L1 Caches                 1.3 MB         10 MB              7.7x
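As a rough cross-check, the V100 peak rates can be reproduced from the unit counts given earlier in the deck (80 SMs, 64 FP32 and 32 FP64 units per SM, 640 Tensor Cores, a 4x4x4 matrix array per Tensor Core). Two things below are assumptions that do not appear on the slides: a boost clock of about 1.53 GHz, and counting each fused multiply-add as two operations.

```latex
\begin{align*}
\text{FP32 peak} &\approx 80 \times 64 \times 2 \times 1.53\ \text{GHz} \approx 15.7\ \text{TFLOPS} \\
\text{FP64 peak} &\approx 80 \times 32 \times 2 \times 1.53\ \text{GHz} \approx 7.8\ \text{TFLOPS} \\
\text{Tensor Core peak} &\approx 640 \times (4 \times 4 \times 4) \times 2 \times 1.53\ \text{GHz} \approx 125\ \text{TFLOPS}
\end{align*}
```

Under those assumptions the three results line up with the 15.7, 7.8 and 125 entries in the V100 column.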

27

REVOLUTIONARY AI PERFORMANCE
3X Faster DL Training Performance

Over 80x DL Training Performance in 3 Years

[Chart: GoogLeNet Training Performance (Speedup vs K80), axis from 0x to 100x. Configurations: 1x K80 cuDNN2, 4x M40 cuDNN3, 8x P100 cuDNN6, 8x V100 cuDNN7. Time axis: Q1 15, Q3 15, Q2 16, Q2 17]

85% Scale-Out Efficiency
Scales to 64 GPUs with Microsoft Cognitive Toolkit

[Chart: Multi-Node Training with NCCL 2.0 (ResNet-50), time to train]

• 8X P100: 18 Hours
• 8X V100: 7.4 Hours
• 64X V100: 1 Hour

ResNet50 Training for 90 Epochs with 1.28M images dataset | Cognitive Toolkit with NCCL 2.0 | V100 performance measured on pre-production hardware.

3X Reduction in Time to Train Over P100

[Chart: LSTM Training (Neural Machine Translation), time to train]

• 2X CPU: 15 Days
• 1X P100: 18 Hours
• 1X V100: 6 Hours

Neural Machine Translation Training for 13 Epochs | German -> English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance measured on pre-production hardware.

28

VOLTA HPC PERFORMANCE

[Chart: application performance relative to Tesla P100]

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.

29

INTRODUCING CUDA 9

BUILT FOR VOLTA

• Tesla V100
• New GPU Architecture
• Tensor Cores
• NVLink
• Independent Thread Scheduling

COOPERATIVE THREAD GROUPS

• Flexible Thread Groups
• Efficient Parallel Algorithms
• Synchronize Across Thread Blocks in a Single GPU or Multi-GPUs

FASTER LIBRARIES

• cuBLAS for Deep Learning
• NPP for Image Processing
• cuFFT for Signal Processing

DEVELOPER TOOLS & PLATFORM UPDATES

• Faster Compile Times
• Unified Memory Profiling
• NVLink Visualization
• New OS and Compiler Support

30

CUDA 9: WHAT’S NEW IN LIBRARIES

VOLTA PLATFORM SUPPORT

• Utilize Volta Tensor Cores
• Volta optimized GEMMs (cuBLAS)
• Out-of-box performance on Volta (all libraries)

PERFORMANCE

• GEMM optimizations for RNNs (cuBLAS)
• Faster image processing (NPP)
• FFT optimizations across various sizes (cuFFT)

NEW ALGORITHMS

• Multi-GPU dense & sparse solvers, dense eigenvalue & SVD (cuSOLVER)
• Breadth first search, clustering, triangle counting, extraction & contraction (nvGRAPH)

IMPROVED USER EXPERIENCE

• New install package for CUDA Libraries (library-only meta package)
• Modular NPP with small footprint, support for image batching

Legend: Deep Learning, Scientific Computing

31

CUDA 9: UP TO 5X FASTER LIBRARIES

cuBLAS: 5x – 9x faster GEMM operations speed up deep learning and HPC apps

[Chart: Speedup vs. CUDA 8* (0x to 10x) by matrix size (512, 1024, 2048, 2816). Series: FP32; FP16 I/O, FP32 Compute]

cuFFT: 2x faster library speeds up image, video and signal processing operations

[Chart: Speedup vs. CUDA 8* (0X to 3X) by data size (1, 64, 16384, 4194304). Series: 1D, 2D, 3D]

NPP: Up to 100x faster than IPP for image processing and computer vision operations

[Chart: Speedup vs. IPP** (0x to 100x). Categories: Color Proc., Filters, Geometry Transforms, JPEG, Morphological Ops.]

* V100 and CUDA 9 (r384); Intel Xeon Broadwell, dual socket, E5-2698 v4 @ 2.6GHz, 3.5GHz Turbo with Ubuntu 14.04.5 x86_64 with 128GB System Memory

* P100 and CUDA 8 (r361); for cuBLAS, CUDA 8 (r361): Intel Xeon Haswell, single-socket, 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo with CentOS 7.2 x86-64 with 128GB System Memory

** CPU system running IPP: Intel Xeon Haswell single-socket 16-core E5-2698 v3 @ 2.3GHz, 3.6GHz Turbo Ubuntu 14.04.5 x86_64 with 128GB System Memory

32

COOPERATIVE GROUPS

33

COOPERATIVE GROUPS
A flexible model for synchronisation and communication within groups of threads

Levels of cooperation: TODAY

Levels of cooperation: CUDA 9

34

COOPERATIVE GROUPS BASICS
Flexible, Explicit Synchronization

Thread groups are explicit objects in your program

    thread_group block = this_thread_block();

You can synchronize threads in a group

    block.sync();

Create new groups by partitioning existing groups

    thread_group tile32 = tiled_partition(block, 32);
    thread_group tile4 = tiled_partition(tile32, 4);

Partitioned groups can also synchronize

    tile4.sync();

Note: the calls shown in green on the original slide are part of the cooperative_groups:: namespace

[Diagram: Thread Block Group, Partitioned Thread Groups]
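A short sketch of what partitioned groups make possible, assuming the CUDA 9 `cooperative_groups.h` header. It uses the statically sized form `tiled_partition<4>`, which returns a `thread_block_tile<4>` that offers shuffle operations in addition to `sync()`. The kernel `sum_by_tiles` and the helper `run_tile_sum_demo` are hypothetical names for this illustration.

```cuda
#include <cuda_runtime.h>
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Each group of 4 consecutive threads sums its 4 inputs.
__global__ void sum_by_tiles(const int *in, int *out)
{
    cg::thread_block block = cg::this_thread_block();
    cg::thread_block_tile<4> tile4 = cg::tiled_partition<4>(block);

    int v = in[block.thread_rank()];
    // Tree reduction inside the tile; the shuffle never leaves the tile.
    for (int offset = 2; offset > 0; offset /= 2)
        v += tile4.shfl_down(v, offset);

    // Rank 0 of each tile now holds that tile's complete sum.
    if (tile4.thread_rank() == 0)
        out[block.thread_rank() / 4] = v;
}

// Inputs are 0..31, so tile k holds {4k, 4k+1, 4k+2, 4k+3} and sums to 16k + 6.
bool run_tile_sum_demo()
{
    int *in = nullptr, *out = nullptr;
    cudaMallocManaged(&in, 32 * sizeof(int));
    cudaMallocManaged(&out, 8 * sizeof(int));
    for (int i = 0; i < 32; ++i) in[i] = i;
    for (int k = 0; k < 8; ++k) out[k] = -1;

    sum_by_tiles<<<1, 32>>>(in, out);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int k = 0; ok && k < 8; ++k) ok = (out[k] == 16 * k + 6);

    cudaFree(in);
    cudaFree(out);
    return ok;
}
```

Because the tile is an explicit object, the reduction loop never mentions warp masks or lane arithmetic. Changing the template argument to another power of two up to 32, together with the loop's starting offset, is all it takes to reduce over a different tile size.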

35

COOPERATIVE GROUPS
Flexible and Scalable Thread Synchronization and Communication

Define, synchronize, and partition groups of cooperating threads

Flexible: High-performance API for clean and robust management of thread groups

Scalable: Create and manage groups within warps, across thread blocks, and even across GPUs

Deploy Everywhere (*): Kepler and Newer GPUs

Supported by CUDA developer tools

* Note: Multi-Block and Multi-Device Cooperative Groups are only supported on Pascal and above GPUs

Thread Block Group

Partitioned Thread Groups

36

DEVELOPER TOOLS

37

UNIFIED MEMORY PROFILING
Correlate CPU Page Faults with Source

Page Fault Correlation

38

NEW UNIFIED MEMORY EVENTS
Visualize Virtual Memory Activity

Page Throttling, Memory Thrashing, Remote Map

39

FUTURE: UNIFIED SYSTEM ALLOCATOR
Allocate unified memory using standard malloc

Removes CUDA-specific allocator restrictions

Data movement is transparently handled

Requires operating system support:

HMM Linux Kernel Module

CUDA 8 Code with System Allocator

void sortfile(FILE *fp, int N) {
    char *data;

    // Allocate memory using any standard allocator
    data = (char *) malloc(N * sizeof(char));

    fread(data, 1, N, fp);

    sort<<<...>>>(data, N, 1, compare);

    // Wait for the kernel to finish before the CPU reads the data
    cudaDeviceSynchronize();

    use_data(data);

    // Free the allocated memory
    free(data);
}

40

ADDITIONAL RESOURCES

• Volta

• Whitepaper http://www.nvidia.com/object/volta-architecture-whitepaper.html

• Blog https://devblogs.nvidia.com/parallelforall/inside-volta

• CUDA 9

• Blog https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed

• Download https://developer.nvidia.com/cuda-downloads

Axel Koehler, Principal Solution Architect

