Running TensorFlow at scale on GPUs… · NGC model containers (PyTorch, TensorFlow from 19.09)...

Maggie Zhang (张雪萌)

Accelerate Deep Learning Training at Scale on GPUs


AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling


DL Training: from single GPU to multi-node

ResNet50 v1.5 training

● 2015: 36,000 mins (25 days) on 1x K80 (CUDA)
● 2016: 1,200 mins (20 hours) on DGX-1P (NVLink)
● 2017: 480 mins (8 hours) on DGX-1V (Tensor Core)
● 2018: 70 minutes on MLPerf on DGX-2H (NVSwitch)
● 2018: 6.3 minutes on MLPerf at scale (DGX Cluster)
● 2019: 52.7 minutes on MLPerf on DGX-2H (NVSwitch)
● 2019: 1.33 minutes on MLPerf at scale (DGX SuperPOD)


The whole stack must be considered

● Compute

● Network

● Storage

● Frameworks & Libraries

● Numerical methods

● Training recipes


MLPerf: NVIDIA advancing AI training

Time to Train From 8 Hours to 80 Seconds

2019 MLPerf ID (in order from top to bottom of chart): ResNet-50: 0.6-30 | Transformer: 0.6-28 | GNMT: 0.6-14 | SSD: 0.6-27 | Mini-Go: 0.6-11 | Mask R-CNN: 0.6-23


Largest TensorFlow model at scale

Oak Ridge National Lab scales TensorFlow climate analytics model up to 27,360 V100 GPUs

Source: https://arxiv.org/pdf/1810.01993.pdf

2018 Gordon Bell Prize Winner


AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling


Datasets getting larger

● Unlabeled data:
○ Language model: BooksCorpus (800M words), English Wikipedia (2.5B words), WebText (8M documents, 40 GB), C4 (Common Crawl, 745 GB)
○ GAN: unlabeled images and videos
○ Reinforcement learning: unsupervised self-play generates unlimited data
● Labeled data:
○ ImageNet (2012): 1.3M images, 1000 categories
○ Open Images (2019): 9M images, 6000 categories
○ Semi-autonomous vehicles: 0.5-1.1 TB of data for every 8 h of driving


DL models increasing in complexity

Next-level use-cases require gigantic models

● Image Recognition (26M parameters): Autonomous Vehicles, Social Tagging, Visual Search
● NLP (340M parameters): Q&A, Sentiment, Translation
● NLP – Generative Tasks (1.5Bn parameters): Chatbots, E-mail auto-completion, Document Summarization
● Other tasks shown: Speech Recognition, Translation, Object Detection

Project Megatron (https://github.com/NVIDIA/Megatron-LM)

● 8.3B parameters
● 8-way Model Parallel
● 64-way Data Parallel
● 24x larger than BERT


AGENDA

● Introduction

● Why do we need to scale training

● How to achieve scaling


Scaling == whack-a-mole ?

Solving one bottleneck and another one pops up


Multi-node infrastructure requirements

● System Design
● Data Center Management
● SW Stack

Multi-Node Success


Challenges of multi-node DL training

● Hardware GPU cluster design:
○ Compute: significant CPU to GPU ratio, interconnect with GPU
○ Storage: high speed NFS, multi-tier caching
○ Networking: topology and bandwidth, NVLink, GPUDirect RDMA
● GPU cluster management:
○ Scheduler: Slurm vs. Kubernetes
○ Container technologies: Docker, Enroot, Singularity, etc.
● Integrated software stack:
○ NVIDIA libraries: CUDA, cuDNN, NCCL
○ DL Framework scale-out optimization
○ Model scale-out implementation & optimization


A basic recipe for deep learning scaling

Step 1: Optimize your single GPU model

Step 2: Scale to multiple GPUs on one node

Step 3: Scale to multiple nodes


Case study

BERT: Bidirectional Encoder Representations from Transformers

Super Human Question & Answering

NVIDIA Deep Learning Examples have many model scripts with best practices for accuracy and performance

• BERT model scripts:
  https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/
  https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT
  Configurations for convergence, from 8 to 1500 GPUs, multi-node ready
• Clone and train your own BERT model on multi-node, or download a pre-trained BERT model from NGC and fine-tune for your NLP task


Why multi-node BERT training

• Pre-training on unlabeled data opens up opportunities to use massive amounts of data:
  • BooksCorpus (800 million words)
  • English Wikipedia (2.5 billion words), multi-language Wikipedia
  • WebText (OpenAI, 8M documents, 40 GB of text)
• More data tends to lead to better accuracy
• BERT pre-training is computationally intensive and takes days to train even on the most powerful single node: BERT-Large (330M parameters) takes ~2.5 days to train on a single DGX-2 server with 16 V100 GPUs.


BERT multi-node pre-training performance

Metric: Time to train

DGX-1 (16 GB) nodes    GPUs    Time to train (Hrs)
1                      8       153.6 (6.3 days)
4                      32      39.3
16                     128     10.4

DGX-2H (32 GB) nodes   GPUs    Time to train (Hrs)
1                      16      58.4 (2.4 days)
4                      64      15.4
16                     256     3.9
64                     1024    1.2

* Above time to train is measured for Mixed precision, training loss 1.3 in PyTorch; with LAMB optimizer

** Gradient accumulation is applied to DGX-2H 1, 4, 16 node

Source: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT#pre-training-loss-results
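One way to read the DGX-2H figures above is as speedup and scaling efficiency relative to the single-node run. The sketch below is only illustrative arithmetic: the rows are copied from the table, the helper name `scaling_report` is made up for this example, and because gradient accumulation differs between configurations the efficiency numbers should be treated as rough.

```python
# (nodes, gpus, hours) rows copied from the DGX-2H table above.
DGX2H_ROWS = [(1, 16, 58.4), (4, 64, 15.4), (16, 256, 3.9), (64, 1024, 1.2)]


def scaling_report(rows):
    """Return (gpus, speedup, efficiency) per row, relative to the first row.

    speedup    = baseline hours / hours
    efficiency = speedup / (gpus / baseline gpus), where 1.0 is perfect linear scaling
    """
    _, base_gpus, base_hours = rows[0]
    report = []
    for _, gpus, hours in rows:
        speedup = base_hours / hours
        ideal = gpus / base_gpus
        report.append((gpus, round(speedup, 2), round(speedup / ideal, 2)))
    return report


if __name__ == "__main__":
    for gpus, speedup, efficiency in scaling_report(DGX2H_ROWS):
        print(f"{gpus:5d} GPUs: {speedup:6.2f}x speedup, {efficiency:.0%} efficiency")
```

Efficiency well below 1.0 at the largest node count is the usual cue to look at communication, input pipeline, or batch-size effects next.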


Step 1: Optimize model

• Create efficient data pipeline
• Enable mixed precision training
• Enable XLA
• Ensure latest GPU libraries
• Develop model in container to facilitate scaling out


Step 1: Optimize model
Data pipeline

• Use tf.data to create performant input pipelines
• Test I/O bottlenecks with a trivial model
• NVIDIA DALI accelerates image-based input pipelines
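A rough way to check for an I/O bottleneck is to drain the input pipeline with no model attached (the limiting case of a trivial model) and compare the resulting batches per second with what the GPUs consume during real training. The sketch below is framework-agnostic and makes assumptions: `measure_throughput` and `fake_batches` are hypothetical names, and `fake_batches` merely stands in for a real pipeline such as an iterator over a tf.data dataset.

```python
import time


def measure_throughput(batches, max_batches=100):
    """Pull up to max_batches items from an iterable; return (count, batches/sec)."""
    iterator = iter(batches)
    consumed = 0
    start = time.perf_counter()
    for _ in range(max_batches):
        try:
            next(iterator)
        except StopIteration:
            break
        consumed += 1
    elapsed = time.perf_counter() - start
    rate = consumed / elapsed if elapsed > 0 else float("inf")
    return consumed, rate


def fake_batches(count, delay_s=0.001):
    """Hypothetical stand-in for an input pipeline: waits, then yields a batch."""
    for step in range(count):
        time.sleep(delay_s)
        yield [step] * 8


if __name__ == "__main__":
    consumed, rate = measure_throughput(fake_batches(50), max_batches=50)
    print(f"{consumed} batches at {rate:.0f} batches/s with no model attached")
```

If this rate is not comfortably above the training step rate, the pipeline rather than the model is the likely bottleneck.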


Data pipeline: BERT

# Excerpt (TensorFlow 1.x API; tf.contrib was removed in TensorFlow 2.x).
d = tf.data.Dataset.from_tensor_slices(tf.constant(input_files))
d = d.repeat()
d = d.shuffle(buffer_size=len(input_files))

# `cycle_length` is the number of parallel files that get read.
cycle_length = min(num_cpu_threads, len(input_files))
d = d.apply(
    tf.contrib.data.parallel_interleave(
        tf.data.TFRecordDataset,
        cycle_length=cycle_length))
d = d.shuffle(buffer_size=100)

d = d.apply(
    tf.contrib.data.map_and_batch(
        lambda record: _decode_record(record, name_to_features),
        batch_size=batch_size,
        num_parallel_batches=num_cpu_threads,
        drop_remainder=True if is_training else False))

• TFRecord - fast binary format
• Parallel read, map, & batch
• Fused map & batch op

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py


Step 1: Optimize model
Automatic Mixed Precision (AMP)

• 1-line optimizer wrapper:
  opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
• Up to 3x speed up in training on Tensor Cores with
  • Same accuracy
  • No change in hyperparameters
  • ½ memory bandwidth & footprint
• Optimal on Volta and Turing GPUs


Step 1: Optimize model
Automatic Mixed Precision (AMP)

• Robust speedup across different TensorFlow workloads

• https://arxiv.org/abs/1710.03740
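The paper linked above describes loss scaling, one of the techniques that lets FP16 training keep the same accuracy: small gradients that would underflow to zero in half precision are multiplied by a scale factor before being stored and divided again in higher precision. The sketch below illustrates only that idea using Python's standard `struct` module; the helper names and the scale of 1024 are assumptions for the example, not what the AMP graph rewrite does internally.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def fp16_grad(true_grad, loss_scale=1.0):
    """Store a gradient in FP16 after scaling, then unscale in higher precision."""
    scaled = to_fp16(true_grad * loss_scale)
    return scaled / loss_scale


if __name__ == "__main__":
    tiny = 1e-8  # smaller than the smallest positive FP16 value (about 6e-8)
    print(fp16_grad(tiny))          # 0.0: the gradient underflows and is lost
    print(fp16_grad(tiny, 1024.0))  # nonzero and close to 1e-8
```

In practice the scale factor is usually adjusted dynamically so that gradients neither underflow nor overflow.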


Step 1: Optimize model
XLA (Accelerated Linear Algebra)

• TensorFlow XLA can accelerate models with minimal code changes
• XLA optimizes graph, mostly by fusing compatible kernels
• Set XLA optimization level:
  config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/run_pretraining.py#L531

System config: Xeon E5-2698 v4 CPU with 256GB system RAM, single V100 Tensor Core GPU 32GB. Tests run using NVIDIA 18.11 TensorFlow container.


Step 1: Optimize model
Latest GPU optimizations

• Latest compatible features and tuning from CUDA toolkit and Deep Learning Libraries (cuDNN, cuBLAS, NCCL)


Step 1: Optimize model
Latest GPU optimizations

• NGC containers: fully featured DL containers
• DL frameworks compiled with latest GPU libraries
• Portability of application libraries facilitates multi-node scale-out


Step 2: Scale to multiple GPUs

• Understand Data Parallel training concepts
• Ensure optimal inter-GPU communication
• Apply high level API for multi-GPU training


Step 2: Scale to multiple GPUs
Under the hood

• Single GPU


Step 2: Scale to multiple GPUs
Under the hood

• Multiple GPU
• Data parallel training
• Allreduce algorithm
• NCCL: NVIDIA Collective Communication Library
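In data parallel training every GPU computes gradients on its own shard of the batch, and an allreduce leaves each GPU holding the same summed (or averaged) gradients. The toy simulation below shows one well-known way to do this, a ring allreduce, in plain Python. It is a sketch for intuition only: NCCL selects and implements its own algorithms on real interconnects, and the function name and list-based "workers" here are assumptions for the example.

```python
def ring_allreduce(worker_grads):
    """Simulate a ring allreduce (sum) over equally sized gradient lists.

    Each of the n workers splits its gradient into n chunks. A reduce-scatter
    phase (n - 1 steps) leaves every worker with one fully summed chunk, and
    an allgather phase (n - 1 steps) passes the summed chunks around the ring.
    """
    n = len(worker_grads)
    size = len(worker_grads[0])
    assert size % n == 0, "sketch assumes the gradient length divides evenly"
    chunk = size // n
    data = [list(g) for g in worker_grads]  # copy so the inputs stay untouched

    def span(c):
        return range(c * chunk, (c + 1) * chunk)

    # Reduce-scatter: at step s, worker r sends chunk (r - s) % n to worker r + 1,
    # which adds it to its own copy. All sends in a step are snapshotted first.
    for s in range(n - 1):
        sends = [(r, (r - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] += value

    # Allgather: worker r now owns the complete chunk (r + 1) % n and forwards
    # it; in later steps it forwards the complete chunk it just received.
    for s in range(n - 1):
        sends = [(r, (r + 1 - s) % n) for r in range(n)]
        payloads = [[data[r][i] for i in span(c)] for r, c in sends]
        for (r, c), payload in zip(sends, payloads):
            dst = (r + 1) % n
            for i, value in zip(span(c), payload):
                data[dst][i] = value
    return data


if __name__ == "__main__":
    grads = [[1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60], [100, 200, 300, 400, 500, 600]]
    for rank, reduced in enumerate(ring_allreduce(grads)):
        print(rank, reduced)
```

Each worker sends and receives only one chunk per step, which is why the per-GPU traffic of a ring stays roughly constant as more GPUs are added.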


Step 2: Scale to multiple GPUs
Under the hood

• Inter-GPU communication:

[Chart: effective bandwidth in GB/s]


Step 2: Scale to multiple GPUs
Under the hood

• Full non-blocking bandwidth


Step 2: Scale to multiple GPUs
Approach 1: Horovod

• Popular approach to enable multi-GPU/multi-node in TensorFlow/Keras
• Strong NCCL integration
• Sample commands:
  • Single-node (4 GPUs):
    horovodrun -np 4 -H localhost:4 python train.py
  • Multi-node (4 nodes with 4 GPUs each):
    horovodrun -np 16 -H server1:4,server2:4,server3:4,server4:4 python train.py


Step 2: Scale to multiple GPUs
Approach 1: Horovod

import tensorflow as tf
import horovod.tensorflow as hvd

# Initialize Horovod
hvd.init()

# Pin GPU to be used (one process per GPU, selected by local rank)
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Build model...
loss = ...
# Scale the learning rate by the number of workers
opt = tf.train.AdamOptimizer(learning_rate=0.01 * hvd.size())

# Add Horovod Distributed Optimizer
opt = hvd.DistributedOptimizer(opt)

# Add hook to synchronize initial state
hooks = [hvd.BroadcastGlobalVariablesHook(0)]

# Make training operation
train_op = opt.minimize(loss)

# Only checkpoint on rank 0
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None

# Session
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir,
                                       config=config,
                                       hooks=hooks) as mon_sess:
    while not mon_sess.should_stop():
        # Perform synchronous training.
        mon_sess.run(train_op)


Step 2: Scale to multiple GPUs
Approach 2: tf.distribute.Strategy

• Recently released native API that also supports Allreduce with NCCL
• Multi-GPU: tf.distribute.MirroredStrategy
• Multi-node: tf.distribute.experimental.MultiWorkerMirroredStrategy

Source: https://www.tensorflow.org/guide/distributed_training


Step 3: Scale to multiple nodes

• Adopt optimizer designed for large batch size
• Ensure effective inter-node communication
• Move data close to compute
• Consider full application & system software stack


Step 3: Scale to multiple nodes
LAMB optimizer

• Optimizer inspired by LARS
  • Layerwise Adaptive learning rate (You et al.)
• Allows training at huge global batch size
  • Originally, BERT+Adam (Devlin et al.) – global batch 256
  • BERT+LAMB (You et al.) – global batch 64k
• Massive data parallelism
• Lower interconnect pressure with gradient accumulation
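To make the relationship between global batch size, GPU count, and gradient accumulation concrete: when gradients are exchanged only once per optimizer step, each GPU accumulates several micro-batches locally and a single allreduce covers all of them. The arithmetic can be sketched as below. The numbers are assumptions for illustration only: "64k" is read as 65,536, and the per-GPU micro-batch of 64 is hypothetical rather than a setting taken from the BERT scripts.

```python
def accumulation_steps(global_batch, num_gpus, micro_batch):
    """Micro-batches each GPU accumulates before one optimizer step.

    global_batch = num_gpus * micro_batch * accumulation_steps
    """
    per_pass = num_gpus * micro_batch
    if global_batch % per_pass != 0:
        raise ValueError("global batch must be a multiple of num_gpus * micro_batch")
    return global_batch // per_pass


if __name__ == "__main__":
    GLOBAL_BATCH = 65536  # "64k", read as 65,536 (assumption)
    MICRO_BATCH = 64      # hypothetical per-GPU micro-batch
    for gpus in (16, 64, 256, 1024):
        steps = accumulation_steps(GLOBAL_BATCH, gpus, MICRO_BATCH)
        print(f"{gpus:5d} GPUs: accumulate {steps} micro-batches per allreduce")
```

With fewer GPUs the same global batch is reached through more accumulation steps, so the number of gradient exchanges per sample processed drops accordingly.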


Step 3: Scale to multiple nodes
LAMB optimizer

BERT+LAMB: robustly scale to large batch size

https://github.com/NVIDIA/DeepLearningExamples/blob/master/TensorFlow/LanguageModeling/BERT/optimization.py

class LAMBOptimizer(tf.train.Optimizer):
    """A LAMB optimizer that includes "correct" L2 weight decay."""

    def __init__(self,
                 learning_rate,
                 weight_decay_rate=0.0,
                 beta_1=0.9,
                 beta_2=0.999,
                 epsilon=1e-6,
                 exclude_from_weight_decay=None,
                 name="LAMBOptimizer"):
        """Constructs a LAMBOptimizer."""
        super(LAMBOptimizer, self).__init__(False, name)
        ...
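The core idea of the layerwise adaptive update can be sketched for a single layer in plain Python. This is a simplified illustration of the rule described by You et al., not the code in optimization.py: it omits bias correction, treats one flat list as "the layer", and the function name and default values are assumptions for the example.

```python
import math


def lamb_step(w, g, m, v, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-6, weight_decay=0.01):
    """One simplified LAMB update for a single layer (no bias correction).

    An Adam-style direction plus weight decay is rescaled by the layer's
    trust ratio ||w|| / ||update||, so the step length is tied to the
    magnitude of the layer's weights.
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, g)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    update = [mi / (math.sqrt(vi) + eps) + weight_decay * wi
              for mi, vi, wi in zip(m, v, w)]
    w_norm = math.sqrt(sum(x * x for x in w))
    u_norm = math.sqrt(sum(x * x for x in update))
    trust_ratio = w_norm / u_norm if w_norm > 0 and u_norm > 0 else 1.0
    new_w = [wi - lr * trust_ratio * ui for wi, ui in zip(w, update)]
    return new_w, m, v


if __name__ == "__main__":
    weights, grads = [3.0, 4.0], [0.1, -0.2]
    new_weights, m, v = lamb_step(weights, grads, m=[0.0, 0.0], v=[0.0, 0.0])
    print(new_weights)
```

Because of the trust ratio, the length of each layer's step equals the learning rate times the norm of that layer's weights, which is part of what keeps very large batch training stable.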


Step 3: Scale to multiple nodes
Under the hood

• Inter-GPU communication (bigger picture):

[Chart: effective bandwidth in GB/s]


Step 3: Scale to multiple nodes
Further Horovod optimizations

• Tensor Fusion
  • Batch tensors together during allreduce
  • HOROVOD_FUSION_THRESHOLD=<bytes> HOROVOD_CYCLE_TIME=<ms> horovodrun ...
• Gradient Compression (FP16 Allreduce):
  • hvd.DistributedOptimizer(..., compression=hvd.Compression.fp16)
  • Reduces network utilization
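The intuition behind tensor fusion is that many small allreduce calls cost more in latency than a few large ones, so ready tensors are packed into a buffer up to a size threshold and reduced together. The greedy packing below is a sketch of that idea only: Horovod's actual fusion logic also depends on the cycle time and other details, and the function name and tensor sizes are made up for the example.

```python
def fuse_tensors(tensor_sizes, threshold_bytes):
    """Greedily pack (name, size_in_bytes) pairs into buffers of limited size.

    A tensor larger than the threshold still gets a buffer of its own.
    Returns a list of buffers, each a list of tensor names reduced together.
    """
    buffers, current, current_bytes = [], [], 0
    for name, size in tensor_sizes:
        if current and current_bytes + size > threshold_bytes:
            buffers.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        buffers.append(current)
    return buffers


if __name__ == "__main__":
    # Hypothetical gradient tensors and sizes in bytes.
    ready = [("bias_1", 4), ("bias_2", 4), ("layer_norm", 8), ("dense", 20), ("bias_3", 2)]
    for buffer in fuse_tensors(ready, threshold_bytes=10):
        print(buffer)
```

Raising the threshold trades fewer, larger allreduce calls against waiting longer for enough tensors to become ready.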


Step 3: Scale to multiple nodes
Storage

• DNN datasets are large
• Read-dominated at beginning of each epoch
• Keep data close to compute as much as possible:
  • RAM disk, SSDs in RAID 0, fast network attached storage


Step 3: Scale to multiple nodes
Reference architecture: DGX SuperPOD

• Integrated software and hardware system for multi-node scaling
• State-of-the-art compute, GPU interconnect, node interconnect, and storage


NVIDIA DGX SuperPOD

Mellanox EDR 100G InfiniBand Network

• Mellanox Smart Director Switches
• In-Network Computing Acceleration Engines
• Fast and Efficient Storage Access with RDMA
• Up to 130Tb/s Switching Capacity per Switch
• Ultra-Low Latency of 300ns
• Integrated Network Manager
• Terabit-Speed InfiniBand Networking per Node

[Diagram: 64 DGX-2 in Rack 1 to Rack 16, connected to a compute backplane switch at 800 Gb/s per node and to a storage backplane switch (GPFS) at 200 Gb/s per node]

White paper: https://www.nvidia.com/en-us/data-center/resources/nvidia-dgx-superpod-reference-architecture/


Step 3: Scale to multiple nodes
Software stack - Application

• Deep Learning Model:
  • Hyperparameters tuned for multi-node scaling
  • Multi-node launcher scripts
• Deep Learning Container:
  • Optimized DL frameworks, GPU libraries, and multi-node software
• Host:
  • Host OS, GPU driver, IB driver, container runtime engine (docker, enroot)


Step 3: Scale to multiple nodes
Software stack - System

• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management w/Ansible playbooks

[Diagram: login nodes and a DGX Pod of DGX servers with DGX base OS, running Slurm controller, Enroot | Docker, Pyxis, NGC model containers (PyTorch, TensorFlow from 19.09), and DCGM]


Step 3: Scale to multiple nodes
Deployment with DeepOps

DeepOps leverages Ansible for automated large scale cluster deployment (see the DeepOps deployment doc).

• Bootstrap all nodes
• Prepare provisioning node
• Provision all node(s)
• Deploy Slurm on Slurm nodes
• Deploy DL/ML development tools
• Deploy Production AI applications
• Deploy management services

- Build your own GPU cluster following the DGX Pod and DGX SuperPOD reference architectures.
- Clone the DeepOps repo and follow the cluster setup guide. Open a GitHub issue if you run into any problem.


Summary
Scaling is important and we are here to help

• Scaling requires careful consideration of algorithms and infrastructure at each step
  • Optimized single-GPU model
  • Efficient & scalable Allreduce library
  • GPU interconnect, networking, storage
  ...
• NVIDIA platform makes scaling DL training easier and more efficient
  • Deep Learning Examples with SOTA accuracy and performance
  • NVIDIA NGC Container with optimized multi-GPU/multi-node software stack
  • Accelerated compute platform designed for performance and scaling
