E8270 – INSIDE NVIDIA GPU CLOUD CONTAINERS (on-demand.gputechconf.com/gtc-eu/2018/pdf/e8270...)

Post on 12-Jun-2020


Chris Kawalek, NVIDIA GPU Cloud Product Team, NVIDIA

Michael O’Connor, Optimized Deep Learning Frameworks, NVIDIA

E8270 – INSIDE NVIDIA GPU CLOUD CONTAINERS

2

AGENDA

The Difficulty With Complex Software

Running In Different Environments

Why Containers

Diving Into NGC Deep Learning Containers

Q&A

3

CHALLENGES WITH COMPLEX SOFTWARE

Current DIY GPU-accelerated AI and HPC deployments are complex and time-consuming to build, test, and maintain

Community development of software frameworks is moving very fast

Managing driver, library, and framework dependencies requires a high level of expertise

Software stack, top to bottom:

Applications or Frameworks

NVIDIA Libraries

NVIDIA Container Runtime for Docker

NVIDIA Driver

NVIDIA GPU

4

NVIDIA GPU CLOUD (NGC)

Over 35 GPU-Accelerated Containers: Deep learning, HPC applications, HPC visualization tools, and partner applications

Innovate in Minutes, Not Weeks: Pre-configured, ready-to-run

Run Anywhere: NVIDIA GPUs on the top cloud providers, NVIDIA DGX Systems, and PCs and workstations

Simple Access to GPU-Accelerated Software

5

A CONSISTENT, HYBRID CLOUD EXPERIENCE ACROSS COMPUTE PLATFORMS

7

WORK AT SCALE ON AI SUPERCOMPUTERS

NGC Containers Run on NVIDIA DGX Systems

8

DEVELOP ON NVIDIA TITAN & NVIDIA QUADRO

NGC Containers Run on PCs and Workstations with Select NVIDIA GPUs

9

WHY CONTAINERS?

Benefits of Containers:

Simplify deployment of GPU-accelerated software, eliminating time-consuming software integration work

Isolate individual deep learning frameworks and HPC applications

Share, collaborate, and test applications across different environments

10

CONTINUAL EXPANSION

October 2017: 10 containers

October 2018: 36 containers

Deep Learning: caffe, caffe2, cntk, cuda, digits, mxnet, pytorch, tensorflow, tensorrt, tensorrtserver, theano, torch

HPC: bigdft, candle, chroma, gamess, gromacs, lammps, lattice-microbes, MILC, namd, pgi, picongpu, relion, vmd

HPC Visualization: index, paraview-holodeck, paraview-index, paraview-optix

Partners: chainer, h2oai-driverless, kinetica, mapd, matlab, paddlepaddle

NVIDIA/K8s: Kubernetes on NVIDIA GPUs

11

USING NGC CONTAINERS

Data Scientists and Researchers: Eliminate setup time, focus on science and research

Developers: Work with the latest software with a known good starting point

Sysadmins: Deploy to production immediately

12

VIRTUAL MACHINES VS. CONTAINERS

Motivation

Packaging and deployment mechanism for applications

▶ Consistent and reproducible deployment

▶ Lightweight and lower overhead than VMs

▶ Logical isolation from other applications

13

EXAMPLE NGC CONTAINER WORKFLOW

NVIDIA builds application image composed of layers of files

Image(s) tested and released to NGC repository hosted at URLs like nvcr.io/nvidia/tensorflow

User pulls image to a machine and runs it

The image is cached locally, and an OS-isolated set of resources (a container) is allocated in which to execute it

Data & results accessed as a filesystem volume

NGC

$ docker run nvcr.io/…
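
The pull-and-run steps above can be sketched as a small dry-run helper. This is only an illustration under stated assumptions, not NVIDIA tooling: the function names `ngc_image_ref` and `ngc_pull_and_run` and the `DOCKER` override variable are hypothetical, and the `tensorflow:18.09` example tag is borrowed from the login slide later in the deck.

```shell
# Hypothetical helpers; the names are illustrative, not part of any NVIDIA tool.
# Set DOCKER=echo to print the docker commands instead of running them.

# Compose an NGC image reference such as nvcr.io/nvidia/tensorflow:18.09.
ngc_image_ref() {
    name="$1"
    tag="$2"
    printf 'nvcr.io/nvidia/%s:%s\n' "$name" "$tag"
}

# Pull the image (it stays in the local cache afterwards), then start a
# container from it. Stops early if the pull fails.
ngc_pull_and_run() {
    image="$(ngc_image_ref "$1" "$2")"
    ${DOCKER:-docker} pull "$image" || return 1
    ${DOCKER:-docker} run --rm -it "$image"
}
```

With `DOCKER=echo`, calling `ngc_pull_and_run tensorflow 18.09` prints the two commands that would otherwise be executed, which is a convenient way to review them first.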

14

ANATOMY OF AN NGC CONTAINER IMAGE

Image Layers (R/O), base image first:

ubuntu:16.04

NVIDIA CUDA SDK

NVIDIA Deep Learning SDK

DL Framework & Source

Examples & Scripts

Example layer IDs shown on the slide: f2233041f557, 145c1bf7947a, 0c395732af81, fb91e851e672

R/W Layer
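
The layer structure described on this slide can be inspected on any pulled image. The sketch below wraps two standard Docker CLI calls, `docker history` and `docker image inspect`. The function name `show_image_layers` and the `DOCKER` override are hypothetical, and the layer IDs reported for a real image will differ from the ones on the slide.

```shell
# Hypothetical helper: list the read-only layers that make up an image.
# Set DOCKER=echo to print the commands instead of running them.
show_image_layers() {
    image="$1"
    # One row per layer, newest first, with the instruction that created it.
    ${DOCKER:-docker} history "$image" || return 1
    # Content digests of the layers, printed as a JSON array.
    ${DOCKER:-docker} image inspect --format '{{json .RootFS.Layers}}' "$image"
}
```

A possible call is `show_image_layers nvcr.io/nvidia/tensorflow:18.09`, assuming that image has already been pulled.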

15

ALWAYS UP-TO-DATE

Monthly Releases from NVIDIA

Component | 18.09 | 18.08
Supported Platform (DGX OS) | 4.0.1 and 3.1.2+ | 3.1.2+ and 2.1.1+
NVIDIA Driver | 410 and 384 | 384
Base Image (Ubuntu) | 16.04 | 16.04
CUDA | 10.0.130 | 9.0.176
cuBLAS | 10.0.130 | 9.0.425 (aka Patch 4)
cuDNN | 7.3.0 | 7.2.1
NCCL | 2.3.4 | 2.2.1

NVIDIA Optimized Frameworks | 18.09 | 18.08
NVCaffe | 0.17.1 for Python 3.5 | 0.17.1 for Python 2.7
DIGITS | 6.1.1 | 6.1.1
MXNet | 1.3 for Python 3.5 | 1.2.0+ for Python 2.7 and Python 3.5
PyTorch | 0.4.1++ for Python 3.6 | 0.4.1+ for Python 3.6
TensorFlow | 1.10 for Python 2.7 and Python 3.5 (TensorRT 5.0.0) | 1.9.0 for Python 2.7 and Python 3.5 (TensorRT 4.0.1)
TensorRT | 5.0.0 | 4.0.1
TensorRT Server | 0.6 | 0.5
TensorFlow for Jetson | 1.10 on JetPack 4.1 for Xavier | 1.9.0 on JetPack 3.2 for TX2
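
As a rough illustration of how the driver row of this table might be used, the sketch below checks whether an installed driver version belongs to one of the branches listed for a release. It encodes only the two releases shown here, the function names are hypothetical, and it treats the listed branches as an exact list rather than a minimum, which is an assumption about how to read the table.

```shell
# Driver branches listed for each release in the table above.
# Only the two releases shown on the slide are encoded.
listed_driver_branches() {
    case "$1" in
        18.09) echo "410 384" ;;
        18.08) echo "384" ;;
        *) return 1 ;;
    esac
}

# Succeeds if the driver version (for example 410.48) belongs to a branch
# listed for the release. Returns 2 for a release that is not in the table.
driver_listed_for_release() {
    release="$1"
    branch="${2%%.*}"   # keep the part before the first dot: 410.48 -> 410
    branches="$(listed_driver_branches "$release")" || return 2
    for b in $branches; do
        [ "$b" = "$branch" ] && return 0
    done
    return 1
}
```

On a machine with the NVIDIA driver installed, the version string could be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader` and passed as the second argument.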

16

CUDA COMPATIBILITY – UPGRADE PATHS

Starting with CUDA 10.0

NEW Forward Compatibility Option: Upgrade only user-mode CUDA components*

[Figure: two software stacks side by side, CUDA 9.0 on the R384 Driver and CUDA 10.0 on the R410 Driver. Each stack consists of the CUDA Toolkit and Runtime, the CUDA User Mode Driver (libcuda.so), and the GPU Kernel Mode Driver (nvidia.ko), with "Upgrade" arrows between the stacks.]

New compatibility platform upgrade path available

▶ Use newer CUDA toolkits on older driver installs

▶ Compatibility only with specific older driver versions

System requirements

▶ Tesla GPU support only, no Quadro or GeForce

▶ Only available on Linux

*requires new ‘cuda-compat-10-0’ package
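
For the forward-compatibility option to take effect, the newer user-mode `libcuda.so` has to be found before the one installed with the older driver. The sketch below shows one way to arrange that through `LD_LIBRARY_PATH`. The install location `/usr/local/cuda-10.0/compat` is an assumption about where the `cuda-compat-10-0` package places its libraries and should be verified on the target system; the function name `use_cuda_compat` is hypothetical.

```shell
# Assumed location of the forward-compatibility libraries; verify on your system.
CUDA_COMPAT_DIR="${CUDA_COMPAT_DIR:-/usr/local/cuda-10.0/compat}"

# Put the compat directory first on the loader path so that its libcuda.so is
# found before the one shipped with the older driver.
# Fails without changing anything if the directory does not exist.
use_cuda_compat() {
    dir="${1:-$CUDA_COMPAT_DIR}"
    [ -d "$dir" ] || return 1
    LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
    export LD_LIBRARY_PATH
}
```

Calling `use_cuda_compat` before launching a CUDA 10.0 application is the intended use; the system requirements listed on the slide still apply.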

17

BEST NVIDIA PERFORMANCE

Over 12 months, up to 1.8X improvement with mixed-precision on ResNet-50

18

BEST NVIDIA PERFORMANCE

2.0X improvement with mixed-precision on ResNet-50 from DGX-1 to DGX-2

19

TARGET SYSTEM SETUP

NGC Virtual Machine Images: NVIDIA Deep Learning for Volta (AWS EC2 AMI)

Pre-installed: up-to-date Ubuntu Server OS, CUDA drivers, NVIDIA Container Runtime

NGC Container Ready BaseOS: On all DGX Systems

Self-Install Setup Guide

NGC Examples and Management Scripts: https://github.com/nvidia/ngc-examples
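
For a self-installed system, a quick check of the host components named above might look like the following sketch. The command names it looks for (`docker`, `nvidia-smi`, `nvidia-container-runtime`) are assumptions about a typical install, and the function names are hypothetical; this is not one of the scripts from the ngc-examples repository.

```shell
# Succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the expected host components appear to be installed.
# Returns non-zero if any of them is missing.
check_ngc_host() {
    missing=0
    for cmd in docker nvidia-smi nvidia-container-runtime; do
        if have_cmd "$cmd"; then
            echo "found:   $cmd"
        else
            echo "missing: $cmd"
            missing=1
        fi
    done
    return "$missing"
}
```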

20

LOG INTO NGC, PULL AND RUN

1. Create Account / Log In

2. Get API Key

3. Browse For Image

4. Log in on Machine & Run

$ docker login nvcr.io

Username: $oauthtoken

Password: *******

$ docker run -it nvcr.io/nvidia/tensorflow:18.09
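
The interactive login above can also be scripted. The sketch below is one possible approach, assuming the API key is held in an environment variable named `NGC_API_KEY` (a hypothetical name) and using the Docker CLI's `--password-stdin` flag so the key does not appear on the command line. The function name `ngc_login` and the `DOCKER` override are likewise hypothetical.

```shell
# Hypothetical helper for scripted logins. NGC_API_KEY is an assumed variable
# name holding the API key. Set DOCKER=echo for a dry run.
ngc_login() {
    [ -n "${NGC_API_KEY:-}" ] || { echo "NGC_API_KEY is not set" >&2; return 1; }
    # The username is the literal string $oauthtoken, so it is single-quoted
    # to stop the shell from expanding it as a variable.
    printf '%s' "$NGC_API_KEY" |
        ${DOCKER:-docker} login nvcr.io --username '$oauthtoken' --password-stdin
}
```

The single quotes matter: with double quotes the shell would try to expand `$oauthtoken` and send an empty username.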

21

RUNNING CONTAINERS WITH DATA

nvcr.io/nvidia/tensorflow:18.02

Host path /mnt/ssd/large_dataset is mounted in the container at /workspace/large_dataset

$ docker run --rm -it --volume /mnt/ssd/large_dataset:/workspace/large_dataset nvcr.io/…
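
A minimal sketch of the same volume mount as a reusable helper follows. It places the options before the image name, because anything after the image name is passed to the container as its command, and it checks the host directory first, because `--volume` creates a missing host directory as an empty one. The function name `run_with_data` and the `DOCKER` override are hypothetical.

```shell
# Hypothetical helper: run an image with a host directory mounted into the
# container. Set DOCKER=echo to print the command instead of running it.
run_with_data() {
    image="$1"
    host_dir="$2"
    container_dir="$3"
    # --volume would create a missing host directory as an empty one,
    # which hides typos in the path, so check first.
    [ -d "$host_dir" ] || { echo "no such directory: $host_dir" >&2; return 1; }
    # Options go before the image name.
    ${DOCKER:-docker} run --rm -it --volume "$host_dir:$container_dir" "$image"
}
```

A possible call is `run_with_data nvcr.io/nvidia/tensorflow:18.02 /mnt/ssd/large_dataset /workspace/large_dataset`.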

22

NVIDIA GPU CLOUD

GPU-Accelerated Containers for Deep Learning, HPC, and HPC Visualization

Innovate In Minutes, Not Weeks

Run Anywhere

Comprehensive Library of GPU-Accelerated Containers

23

Q & A