SCALING DEEP LEARNING TO EXASCALE
ACM GORDON BELL PRIZE 2018

Pedro Mario Cruz e Silva, Solutions Architect Manager, Latin America | Global Energy Team

Page 1


Page 2

200B CORE HOURS OF LOST SCIENCE
Data Center Throughput is the Most Important Thing for HPC

[Chart: National Science Foundation (NSF XSEDE) supercomputing resources, 2009-2015 — Computing Resources Requested vs. Computing Resources Available, in billions of Normalized Units.]

Source: NSF XSEDE Data: https://portal.xsede.org/#/gallery
NU = Normalized Computing Units, used to compare compute resources across supercomputers; based on the result of the High Performance LINPACK benchmark run on each system.

Page 3

RISE OF GPU COMPUTING

[Chart: 1980-2020, performance on a log scale (10^2 to 10^7). GPU-computing performance grows 1.5X per year — 1000X by 2025. Single-threaded performance grew 1.5X per year historically, slowing to 1.1X per year. The GPU stack: APPLICATIONS, ALGORITHMS, SYSTEMS, CUDA, ARCHITECTURE.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data for 2010-2015 collected by K. Rupp.

Page 4

BEYOND MOORE'S LAW
Progress of the Stack in 6 Years

[Chart: relative performance, 1X to 1000X (log scale), March 2013 to March 2019. GPU-accelerated computing pulls away from the CPU / Moore's Law curve.]

2013 — Accelerated Server with Fermi:
Base OS: CentOS 6.2 | Resource Mgr: r304 | CUDA: 5.0 | NPP: 5.0 | cuSPARSE: 5.0 | cuRAND: 5.0 | cuFFT: 5.0 | cuBLAS: 5.0 | Thrust: 1.5.3

2019 — Accelerated Server with Volta:
Base OS: Ubuntu 16.04 | Resource Mgr: r384 | CUDA: 10.0 | NPP: 10.0 | cuSPARSE: 10.0 | cuSOLVER: 10.0 | cuRAND: 10.0 | cuFFT: 10.0 | cuBLAS: 10.0 | Thrust: 1.9.0

Page 5

DIGITAL SCIENCE
HPC + AI + DATA

Page 6

FUSION OF HPC & AI

VOLTA TENSOR CORE GPU
GPU FUSES HPC & AI COMPUTING

MULTI-PRECISION COMPUTING

HPC (Simulation) – FP64, FP32

AI (Deep Learning) – FP16, INT8

Page 7

AI – A NEW INSTRUMENT FOR SCIENCE
Dramatically Improves Accuracy and Time-to-Solution

AI
> Neural networks that learn patterns from large data sets
> Improve predictive accuracy and response time

HPC
> Algorithms based on first-principles theory
> Proven models for accurate results

Example ambitions: commercially viable fusion energy; understanding cosmological dark energy and dark matter; clinically viable precision medicine; improvement and validation of the Standard Model of physics; climate/weather forecasts with ultra-high fidelity.

Page 8

“ACCELERATING EULERIAN FLUID SIMULATION WITH CONVOLUTIONAL NETWORKS”

Tompson, J., Schlachter, K., Sprechmann, P., & Perlin, K. (2016). Accelerating Eulerian Fluid Simulation With Convolutional Networks. arXiv preprint arXiv:1607.03597.

Page 9

Page 10

AI FOR SCIENCE
Transformative Tool to Accelerate the Pace of Scientific Innovation

Improves Accuracy — enabling realization of full scientific potential
Accelerates Time to Solution — unlocking the use of science in exciting new ways

300,000X faster: predict molecular energetics (drug discovery)
5,000X faster: process LIGO signal (understanding the universe)
Weeks to 10 milliseconds: analyze gravitational lensing (astrophysics)
14X faster: generate Bose-Einstein condensate (physics)
90% accuracy: fusion sustainment (clean energy)
33% faster: track neutrinos (particle physics)
70% accuracy: score protein-ligand binding (drug discovery)
11% higher accuracy: monitor Earth's vitals (climate)

Page 11

THE PROBLEM

Page 12

IMAGE SEGMENTATION
Pattern Detection for Characterizing Extreme Weather

Atmospheric rivers (ARs) are labeled in blue, while tropical cyclones (TCs) are labeled in red

Page 13

CLIMATE DATASET AND GROUND TRUTH LABELS

0.25-degree Community Atmosphere Model (CAM5) output.

Climate variables are stored on a 1152x768 spatial grid, with a temporal resolution of 3 hours.

All 16 available variables are used (water vapor, wind, precipitation, temperature, pressure, etc.).

Climate model output is processed with the Toolkit for Extreme Climate Analysis to identify TCs; a flood-fill algorithm creates the spatial masks of ARs.

There are about 63K high-resolution samples in total, split into 80% training, 10% test, and 10% validation sets.

The pixel mask labels correspond to 3 classes:
1) Tropical Cyclone (TC)
2) Atmospheric River (AR)
3) Background (BG)

The climate data used is currently 3.5 TB.
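The 80/10/10 split above can be sketched in plain Python. This is a minimal illustration; the sample count is from the slide, but the shuffling, seed, and helper name are assumptions, not the paper's pipeline:

```python
import random

def split_dataset(sample_ids, train_frac=0.8, test_frac=0.1, seed=0):
    """Shuffle sample ids deterministically and split into train/test/validation."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    train = ids[:n_train]
    test = ids[n_train:n_train + n_test]
    val = ids[n_train + n_test:]          # the remaining ~10%
    return train, test, val

train, test, val = split_dataset(range(63000))
print(len(train), len(test), len(val))    # 50400 6300 6300
```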

Page 14

THE TEAM

Page 15

NERSC & NVIDIA

Page 16

Page 17

HARDWARE

Page 18

NVIDIA POWERS TODAY'S FASTEST SUPERCOMPUTERS
22 of the Top 25 Greenest

ORNL Summit, World's Fastest: 27,648 GPUs | 149 PF
LLNL Sierra, World's 2nd Fastest: 17,280 GPUs | 95 PF
Piz Daint, Europe's Fastest: 5,704 GPUs | 21 PF
ABCI, Japan's Fastest: 4,352 GPUs | 20 PF
Total Pangea 3, Fastest Industrial: 3,348 GPUs | 18 PF

Page 19

NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER

27,648 Volta Tensor Core GPUs
Summit Becomes First System to Scale the 100 Petaflops Milestone
122 PF HPC | 3 EF AI

Page 20

IBM AC922
6x V100 + 2x P9 (Water Cooled)

Page 21

TESLA V100 TENSOR CORE GPU
World's Most Powerful Data Center GPU

5,120 CUDA cores
640 Tensor Cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM register file | 16MB cache
32 GB HBM2 @ 900GB/s | 300GB/s NVLink

Page 22

TENSOR CORE
4x4x4 matrix multiply and accumulate
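In scalar terms, one Tensor Core operation computes D = A x B + C on 4x4 tiles (on Volta, A and B in FP16 with FP16 or FP32 accumulation). A pure-Python sketch of the math, not the hardware datapath:

```python
def mma_4x4x4(A, B, C):
    """D = A @ B + C for 4x4 matrices: the multiply-accumulate a Tensor Core
    performs in a single operation."""
    D = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            acc = C[i][j]                 # start from the accumulator tile
            for k in range(4):            # 4-deep dot product
                acc += A[i][k] * B[k][j]
            D[i][j] = acc
    return D

I = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
A = [[float(i * 4 + j) for j in range(4)] for i in range(4)]
print(mma_4x4x4(A, I, [[0.0] * 4 for _ in range(4)]) == A)   # True
```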

Page 23

TENSOR CORES FOR SCIENCE
Multi-precision computing

[Chart: V100 TFLOPS — 7.8 FP64, 15.7 FP32, 125 multi-precision Tensor TFLOPS.]

AI-POWERED WEATHER PREDICTION: FP16/FP32, 1.15 ExaOPS
PLASMA FUSION APPLICATION: FP16 solver, 3.5x faster
EARTHQUAKE SIMULATION: FP16-FP21-FP32-FP64, 25x faster

Page 24

SOFTWARE: PERFORMANCE & PRODUCTIVITY

Page 25

POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK Accelerates Every Major Framework

COMPUTER VISION: object detection, image classification
SPEECH & AUDIO: voice recognition, language translation
NATURAL LANGUAGE PROCESSING: recommendation engines, sentiment analysis

DEEP LEARNING FRAMEWORKS
NVIDIA DEEP LEARNING SDK and CUDA

developer.nvidia.com/deep-learning-software

Page 26

Page 27

GPU-ACCELERATED LIBRARIES
“Drop-in” Acceleration for Your Applications

DEEP LEARNING: cuDNN, TensorRT, DeepStream SDK
LINEAR ALGEBRA: cuBLAS, cuSPARSE, cuSOLVER, cuRAND, CUDA Math Library
PARALLEL ALGORITHMS: nvGRAPH, NCCL
SIGNAL, IMAGE & VIDEO: cuFFT, NVIDIA NPP, CODEC SDK

Page 28

NVIDIA DEEP LEARNING SDK
High-Performance GPU Acceleration for Deep Learning

Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications.

High-performance building blocks for training and deploying deep neural networks on NVIDIA GPUs.

Industry-vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks.

Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs.

developer.nvidia.com/deep-learning-software

Components: Deep Learning Primitives | Multi-GPU Communication | Linear Algebra | Sparse Matrix Operations | Programmable Inference Accelerator | Deep Learning for Video Analytics

Page 29

NVIDIA cuDNN
Deep Learning Primitives

developer.nvidia.com/cudnn

High-performance building blocks for deep learning frameworks.

Drop-in acceleration for widely used deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, PyTorch, TensorFlow, and others.

Accelerates industry-vetted deep learning algorithms such as convolution, LSTM RNN, fully connected, and pooling layers.

Fast deep learning training performance, tuned for NVIDIA GPUs.

“NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time.”
— Evan Shelhamer, Lead Caffe Developer, UC Berkeley

[Chart: deep learning training performance, images/second (0 to 12,000): 8x K80 (cuDNN 2), 8x Maxwell (cuDNN 4), DGX-1 (cuDNN 6, NCCL 1.6), DGX-1V (cuDNN 7, NCCL 2).]

Page 30

NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
Multi-GPU and Multi-node Collective Communication Primitives

developer.nvidia.com/nccl

Open-source, high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs.

Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization.

Easy to integrate and MPI-compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink.

Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch, and more.

Multi-node: InfiniBand verbs, IP sockets | Multi-GPU: NVLink, PCIe | Automatic topology detection

Scaling training to 2048 GPUs (ResNet-50 | Dataset: ImageNet | Trained for 90 epochs):
Preferred Networks, Feb '17: 128 TITAN X (Maxwell) — 4.4 hours
Facebook, June '17: 256 Tesla P100 — 60 mins
IBM, Aug '17: 256 Tesla P100 — 48 mins
Preferred Networks, Nov '17: 1024 Tesla P100 — 15 mins
Tencent, Jul '18: 2048 Tesla P40 — 6.6 mins
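The collective at the heart of this scaling is the all-reduce of gradient buffers. The bandwidth-optimal ring algorithm that libraries like NCCL use can be illustrated in pure Python — an illustrative sketch of the algorithm only, not NCCL's implementation:

```python
def ring_allreduce(rank_data):
    """Ring all-reduce: every 'rank' ends with the elementwise sum of all
    ranks' buffers. Each buffer is split into n chunks; n-1 reduce-scatter
    steps are followed by n-1 all-gather steps, and each step moves only
    1/n of the data per rank (what makes the ring bandwidth-optimal)."""
    n = len(rank_data)
    m = len(rank_data[0]) // n                 # chunk size (assume divisible)
    chunks = [[buf[i * m:(i + 1) * m] for i in range(n)] for buf in rank_data]

    # reduce-scatter: circulate partial sums; snapshot sends to simulate
    # all ranks communicating simultaneously
    for s in range(n - 1):
        sends = [(r, (r - s) % n, list(chunks[r][(r - s) % n])) for r in range(n)]
        for r, ci, data in sends:
            dst = (r + 1) % n
            chunks[dst][ci] = [a + b for a, b in zip(chunks[dst][ci], data)]

    # all-gather: circulate the completed chunks
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n, list(chunks[r][(r + 1 - s) % n])) for r in range(n)]
        for r, ci, data in sends:
            chunks[(r + 1) % n][ci] = data

    return [[x for c in ch for x in c] for ch in chunks]

# 4 "ranks", each holding a gradient buffer of 8 values
grads = [[float(r)] * 8 for r in range(4)]
print(ring_allreduce(grads)[0])                # [6.0] * 8 (sum of 0+1+2+3)
```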

Page 31

HOROVOD (UBER)
Horovod: fast and easy distributed deep learning in TensorFlow

Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018).

Page 32

FULLY CONVOLUTIONAL NETWORKS (FCN)
“Fully Convolutional Networks for Semantic Segmentation”, Long, Shelhamer & Darrell, 2015

Page 33

SEMANTIC SEGMENTATION
FCN vs SDS

Page 34

DEEP NEURAL NETWORKS
Tiramisu (left) and DeepLabv3+ (right)

Page 35

INNOVATIONS (1)
Deep Learning Innovations

• Weighted loss
• Layer-wise adaptive rate control (LARC)
• Multi-channel segmentation
• Gradient lag
• Modifications to the neural network architectures
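Of these, layer-wise adaptive rate control is the most algorithmic: each layer's learning rate is scaled by the ratio of its weight norm to its gradient norm, so no layer takes a step out of proportion to its weights. A minimal pure-Python sketch of one SGD step with LARC-style clipping — the trust coefficient `eta` and the exact clipping rule here are illustrative assumptions, not the paper's hyperparameters:

```python
import math

def larc_update(weights, grads, global_lr=0.1, eta=0.02, eps=1e-8):
    """One SGD step with a LARC-style rate: per-layer lr = eta * ||w|| / ||g||,
    clipped so it never exceeds the global learning rate."""
    new_weights = []
    for w, g in zip(weights, grads):          # one (w, g) pair per layer
        w_norm = math.sqrt(sum(x * x for x in w))
        g_norm = math.sqrt(sum(x * x for x in g))
        local_lr = eta * w_norm / (g_norm + eps)
        lr = min(global_lr, local_lr)         # "clipping" variant of LARC
        new_weights.append([x - lr * gx for x, gx in zip(w, g)])
    return new_weights

layers = [[1.0, 2.0], [0.1, 0.1]]
grads = [[10.0, 0.0], [0.01, 0.0]]
print(larc_update(layers, grads))
```

A layer with a huge gradient relative to its weights gets a proportionally smaller step, which is what stabilizes training at very large batch sizes.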

Page 36

INNOVATIONS (2)
System Innovations

• High-speed parallel data staging
• Optimized data ingestion pipeline
• Hierarchical all-reduce

Page 37

INNOVATIONS (2)
System Innovations

• High-speed parallel data staging
  • Read 1500 images per node (250 per GPU)
  • 800GB of high-speed SSD storage on each node
  • A distributed data staging system first divides the data set into disjoint pieces, reads each piece onto some nodes, then copies it to the other nodes
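The point of the disjoint-pieces scheme is that each file is read from the shared file system exactly once, then replicated node-to-node over the fast interconnect. A sketch of the assignment — the round-robin policy, node count, and file names are illustrative assumptions:

```python
def stage_plan(files, n_nodes):
    """Assign each file to exactly one 'reader' node (disjoint pieces);
    every other node later receives a copy from that reader instead of
    hitting the shared file system."""
    pieces = {node: [] for node in range(n_nodes)}
    for i, f in enumerate(files):
        pieces[i % n_nodes].append(f)      # round-robin: disjoint by construction
    # node-to-node copy jobs: reader -> every other node, per file
    copy_jobs = [(reader, dst, f)
                 for reader, fs in pieces.items()
                 for f in fs
                 for dst in range(n_nodes) if dst != reader]
    return pieces, copy_jobs

files = [f"climate_{i:05d}.h5" for i in range(10)]   # hypothetical file names
pieces, jobs = stage_plan(files, 4)
print(pieces[0])    # ['climate_00000.h5', 'climate_00004.h5', 'climate_00008.h5']
```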

Page 38

INNOVATIONS (2)
System Innovations

• Optimized data ingestion pipeline
  • A TensorFlow input pipeline reads the input files and converts them into the tensors that are fed through the network (read and convert to TFRecords)
  • Serialization is eliminated by enabling the prefetching option of TensorFlow datasets
  • The HDF5 library serializes calls
  • Using the Python multiprocessing module, the parallel worker threads become parallel worker processes, each with its own instance of the HDF5 library
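The effect of prefetching can be sketched with a producer that reads ahead into a bounded queue while the consumer trains. This is a stdlib stand-in for tf.data's prefetch; the process-based HDF5 workers described above are replaced by a single thread for brevity:

```python
import queue
import threading

def prefetching_reader(read_fn, items, depth=4):
    """Generator that overlaps read_fn(item) with consumption: a background
    thread keeps up to `depth` decoded batches queued, so the consumer
    rarely waits on I/O."""
    q = queue.Queue(maxsize=depth)
    _DONE = object()

    def producer():
        for it in items:
            q.put(read_fn(it))             # blocks only when the queue is full
        q.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is _DONE:
            return
        yield batch

# toy "read" function standing in for file I/O and decoding
batches = list(prefetching_reader(lambda i: i * i, range(5)))
print(batches)                             # [0, 1, 4, 9, 16]
```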

Page 39

INNOVATIONS (2)
System Innovations

• Hierarchical all-reduce
  • Horovod is a Python module that uses MPI to transform a single-process TensorFlow application into a data-parallel implementation
  • Each MPI rank creates its own identical copy of the TensorFlow operation graph
  • The first issue was a bottleneck on the first rank, which acts as a centralized scheduler for Horovod operations. Solution: organize the ranks in a communication tree
  • The existing Horovod implementation can reduce data residing on GPUs in two ways: a standard MPI_Allreduce, or the NVIDIA Collective Communications Library (NCCL)
  • NCCL is better intra-node (it exploits NVLink); standard MPI is better inter-node
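The hybrid scheme — NCCL within a node, MPI across nodes — amounts to a three-phase reduction. A pure-Python sketch of the structure, with the NCCL and MPI calls replaced by sums over lists:

```python
def hierarchical_allreduce(node_buffers):
    """node_buffers[node][gpu] is that GPU's gradient list.
    Phase 1: intra-node reduce (NCCL over NVLink) onto one 'leader' GPU.
    Phase 2: inter-node all-reduce over the leaders (MPI_Allreduce).
    Phase 3: intra-node broadcast of the result back to every GPU."""
    # phase 1: per-node reduce onto the node leader
    leaders = [[sum(vals) for vals in zip(*gpus)] for gpus in node_buffers]
    # phase 2: all-reduce across node leaders
    total = [sum(vals) for vals in zip(*leaders)]
    # phase 3: broadcast the global sum back to all GPUs on every node
    return [[list(total) for _ in gpus] for gpus in node_buffers]

# 2 nodes x 3 GPUs, each GPU holding a 2-element gradient
grads = [[[1.0, 2.0]] * 3, [[10.0, 20.0]] * 3]
print(hierarchical_allreduce(grads)[0][0])     # [33.0, 66.0]
```

Only the small leader buffers cross the slower inter-node network, which is why the hierarchy pays off at Summit's scale.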

Page 40

RESULTS

Page 41

CLIMATE RESULTS
Atmospheric Rivers (AR) in Blue and Tropical Cyclones (TC) in Red

Page 42

SCALING RESULTS

Page 43

OVERALL RESULTS
Tiramisu (Piz Daint) and DeepLabv3+ (Summit)

| Network           | #GPUs | GPU Arch            | Peak (PFLOPS) | Sustained (PFLOPS) | Efficiency |
| Tiramisu (FP32)   | 5300  | P100 (CUDA)         | 26.6          | 21                 | 79.0%      |
| DeepLabv3+ (FP32) | 27360 | V100 (CUDA)         | 359.2         | 325.8              | 90.7%      |
| DeepLabv3+ (FP16) | 27360 | V100 (Tensor Cores) | 1130.0        | 999                | 88.4%      |
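The efficiency column is simply sustained over peak throughput; as a quick sanity check in pure Python (numbers taken from the table above — computed values land within rounding of the printed column):

```python
def efficiency(sustained_pflops, peak_pflops):
    """Fraction of peak throughput actually sustained, as a percentage."""
    return 100.0 * sustained_pflops / peak_pflops

rows = {
    "Tiramisu (FP32)":   (21.0, 26.6),
    "DeepLabv3+ (FP32)": (325.8, 359.2),
    "DeepLabv3+ (FP16)": (999.0, 1130.0),
}
for name, (sustained, peak) in rows.items():
    print(f"{name}: {efficiency(sustained, peak):.1f}%")
```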

Page 44

PUSHING AI COMPUTING LIMITS

Page 45

Page 46

NVSWITCH
World's Highest Bandwidth On-node Switch

7.2 Terabits/sec (900 GB/sec)
18 NVLink ports | 50GB/s per port, bidirectional
Fully connected crossbar
2 billion transistors | 47.5mm x 47.5mm package

Page 47

NVIDIA DGX-2
THE LARGEST GPU EVER CREATED

2 PFLOPS | 512GB HBM2 | 10 kW | 350 lbs

Page 48

MORE SCIENTIFIC EXAMPLES

Page 49

GALAXY CLASSIFICATION
Merging vs Not-Merging

Page 50

GALAXY CLASSIFICATION
Merging vs Not-Merging

Page 51

1st LATIN AMERICAN SUPERCOMPUTER: PETROBRAS FENIX

Fenix is at the 142nd position in the Top500 list.
576x V100 = 288 nodes with 2 GPUs/node

Source: Top500.org

Page 52

TRAINING SET
Features (Seismic) | Labels

Page 53

TRAINING IMAGES
Parihaka dataset (SEGY)

https://wiki.seg.org/wiki/Parihaka-3D

Page 54

METHOD
Conditional Probabilistic Deep Auto-Encoder

Based on state-of-the-art image compression work (CVPR 2018): “Conditional Probability Models for Deep Image Compression”, Mentzer et al.

The original work operates on images with 8-bit depth. Changes for 32-bit data, along with specific training protocols, were made for 3D post-stack seismic data.

Page 55

EXPERIMENTS
Visual Comparison

Page 56

NEW SUPERCOMPUTERS IN BRAZIL

LNCC (Rio de Janeiro): 376x V100 = 96 nodes with 4 GPUs/node
SENAI CIMATEC (Salvador): 312x V100 = 78 nodes with 4 GPUs/node

Page 57

LEARN & SHARE MORE

Page 58

CONNECT

Connect with hundreds of experts from top industry, academic, startup, and government organizations

LEARN

Gain insight and valuable hands-on training through 500+ sessions

DISCOVER

See how GPU technology is creating breakthroughs in deep learning, cybersecurity, data science, healthcare and more

INNOVATE

Explore disruptive innovations that can transform your work

JOIN US AT GTC 2020 | USE VIP CODE XXXXX FOR 25% OFF

March 22-26, 2020 | Silicon Valley

Don’t miss the premier AI conference.

www.nvidia.com/gtc

Page 59

THE LATEST DEEP LEARNING DEVELOPER TOOLS

March 22 | Full-Day Workshops
March 23-26 | Conference & Training

Get the hands-on experience you need to transform the future of AI, high-performance computing, and more with NVIDIA's Deep Learning Institute (DLI).

Register for GTC 2020 to earn certification in full-day workshops, join instructor-led sessions, and start self-paced training.

www.nvidia.com/en-us/gtc/sessions/training/

Page 60

developer.nvidia.com

Page 61

NVIDIA DEEP LEARNING INSTITUTE

Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers.

Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli
Take self-paced labs online: www.nvidia.com/dlilabs
Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli

Course areas: Deep Learning Fundamentals | Accelerated Computing Fundamentals | Intelligent Video Analytics | Medical Image Analysis | Autonomous Vehicles | Game Development & Digital Content | Finance | Genomics

More industry-specific training coming soon…

Page 62

NVIDIA HW GRANT PROGRAM

Titan V (Volta): Scientific Computing | HPC | Deep Learning
Quadro P6000: Scientific Visualization | Virtual Reality
Jetson TX2 (Dev Kit): Robotics | Autonomous Machines

https://developer.nvidia.com/academic_gpu_seeding

Page 63

Obrigado | Gracias | Thank you

[email protected]