Transcript of "Best Practices for Designing, Deploying, and Managing GPU Clusters" (Tesla Accelerated Computing Platform), posted 21-Jun-2020.

Dale Southard

BEST PRACTICES FOR DESIGNING,

DEPLOYING, AND MANAGING GPU CLUSTERS

2

Cluster Design

3

Tesla Accelerated Computing Platform

[Platform diagram, two pillars. Development: libraries (e.g. cuBLAS), programming languages, development tools, LLVM-based compiler solutions, and profiling/debugging via the CUDA Debugging API. Data Center Infrastructure: GPU accelerators (GPU Boost), interconnect (GPUDirect, NVLink), system management (NVML), plus communication, software, and system solutions.]

4

TESLA: PLATFORM WITH OPEN ECOSYSTEM

Pick the CPU that's Correct for You

[Diagram: the same open ecosystem of libraries (e.g. AmgX, cuBLAS), programming languages, and compiler directives is available across CPU choices, including x86.]

5

DESIGNING FOR WORKLOAD

Multiple Ways to Balance Parallel and Serial Performance

[Topology diagram: three example configurations — 2 GPUs per CPU, 3 GPUs per CPU, and 4 GPUs per CPU behind a PCIe switch. Link bandwidths shown: NVLink at 80 GB/s CPU-to-GPU and 20–40 GB/s GPU-to-GPU, versus PCIe Gen3 x16 at 16 GB/s.]
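The bandwidth figures on this slide drive the trade-off. A back-of-the-envelope sketch (illustrative Python; only the 16 GB/s and 80 GB/s numbers come from the slide) of the per-GPU share of the host uplink when several GPUs sit behind one switch:

```python
# Rough per-GPU share of the host uplink when GPUs share it through a switch.
# Slide figures: PCIe Gen3 x16 ~ 16 GB/s; NVLink CPU link ~ 80 GB/s.
def host_bandwidth_per_gpu(link_gb_s, gpus_sharing):
    """Even split of one uplink's bandwidth across the GPUs behind it."""
    return link_gb_s / gpus_sharing

print(host_bandwidth_per_gpu(16, 4))  # 4 GPUs behind one PCIe Gen3 x16 uplink -> 4.0
print(host_bandwidth_per_gpu(80, 4))  # same fan-out on an 80 GB/s NVLink -> 20.0
```

The even split is the worst case with all GPUs streaming to the host at once; workloads that mostly communicate GPU-to-GPU behind the switch are far less sensitive to it.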

6

IN SITU VIS – FASTER SCIENCE, LOWER COST

Traditional workflow: simulate on a CPU supercomputer (~1 week), then move data to a separate viz cluster for visualization (~1 day); over multiple iterations, time to discovery = months.

Tesla platform: visualize while you simulate on a GPU-accelerated supercomputer and restart the simulation instantly, reducing pressure on the filestore; over multiple iterations, time to discovery = weeks.

7

Cluster Deployment

8

NODE TOPOLOGY CIRCA 2008

Life Was Good

[Node diagram: two CPUs, each with local memory, connected through a single bridge to the GPUs (each with its own memory) — every CPU-to-GPU path looked the same.]

9

NODE TOPOLOGY IN 2014 AND BEYOND

Better, but More Choices

[Node diagram: two CPUs, each with local memory and its own PCIe-attached GPUs (each GPU with its own memory). With GPUs split across sockets, CPU-GPU affinity and placement now matter.]

10

TOPOLOGY-AWARE RESOURCE MANAGERS

$CUDA_VISIBLE_DEVICES today, cgroups coming soon

[Diagram: a node with two CPUs and four GPUs; the resource manager partitions the GPUs among Jobs 1, 2, and 3 with the topology in mind.]
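As a sketch of what a launcher does under the hood (illustrative helpers, not a resource-manager API): each job is confined to its assigned GPUs by setting $CUDA_VISIBLE_DEVICES in the job's environment before it starts.

```python
import os
import subprocess

def visible_devices_value(gpu_ids):
    """Build the CUDA_VISIBLE_DEVICES string for a job's GPU assignment."""
    return ",".join(str(i) for i in gpu_ids)

def launch_on_gpus(cmd, gpu_ids):
    """Run `cmd` so the CUDA runtime only sees the listed GPUs."""
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=visible_devices_value(gpu_ids))
    return subprocess.Popen(cmd, env=env)

print(visible_devices_value([2, 3]))  # -> 2,3
```

Inside the job, the CUDA runtime renumbers the visible GPUs from 0, so application code needs no changes.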

11

Cluster Management

12

TESLA GPU DEPLOYMENT KIT

nvidia-smi provides a command-line interface

NVML provides API access

https://developer.nvidia.com/gpu-deployment-kit

Command-line and Library
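A minimal sketch of driving the command-line side from a script (the helper names are mine; the nvidia-smi query flags are standard): query GPUs in CSV form and parse the result.

```python
import csv
import io
import subprocess

def parse_gpu_csv(text):
    """Parse `nvidia-smi --format=csv` output into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), skipinitialspace=True))

def query_gpus(fields=("index", "name")):
    """Query every GPU on the node (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_csv(out)

# The parser alone can be exercised on canned output:
sample = "index, name\n0, Tesla K40m\n1, Tesla K40m\n"
print(parse_gpu_csv(sample)[0]["name"])  # -> Tesla K40m
```

For tighter integration (no subprocess per poll), the NVML library exposes the same data through a C API with Python bindings.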

13

COMPUTE MODE

The Compute Mode setting controls simultaneous use:

DEFAULT — allows multiple simultaneous processes

EXCLUSIVE_THREAD — allows only one context, used by a single thread

EXCLUSIVE_PROCESS — allows only one process, but that process may use multiple threads

PROHIBITED — allows no compute contexts at all

Can be set by command-line (nvidia-smi) & API (NVML)

Insurance Against Scheduling Issues
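A small helper for the command-line route (the mode-to-number mapping matches nvidia-smi's -c flag; the function itself is illustrative):

```python
# nvidia-smi -c accepts these numeric codes (or the mode names themselves).
COMPUTE_MODES = {"DEFAULT": 0, "EXCLUSIVE_THREAD": 1,
                 "PROHIBITED": 2, "EXCLUSIVE_PROCESS": 3}

def compute_mode_argv(mode, gpu_index=None):
    """Build the nvidia-smi command that sets a GPU's compute mode (run as root)."""
    argv = ["nvidia-smi"]
    if gpu_index is not None:
        argv += ["-i", str(gpu_index)]  # target one GPU; omit to set all
    argv += ["-c", str(COMPUTE_MODES[mode])]
    return argv

print(compute_mode_argv("EXCLUSIVE_PROCESS", 0))
```

A prologue script in the batch system can run this per node so jobs never accidentally share a GPU.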

14

RESOURCE LIMITS

UVA depends on allocating virtual address space

Virtual address space != physical ram consumption

Several batch systems and resource mangers support cgroups,

either directly or via plugins.

Use cgroups, Not ulimit
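As one concrete example of the cgroups route (an illustrative Slurm cgroup.conf fragment; the parameter names are Slurm's, the values a site choice), limits are enforced on physical memory and device access rather than virtual address space:

```
# /etc/slurm/cgroup.conf (illustrative)
CgroupAutomount=yes
ConstrainCores=yes       # pin jobs to their allocated CPU cores
ConstrainRAMSpace=yes    # enforce physical-RAM limits via the memory cgroup
ConstrainDevices=yes     # expose only the GPUs the job was allocated
```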

15

MONITORING

Environmental and utilization metrics are available

Tesla cards may provide out-of-band (OoB) access to telemetry via the BMC

NVML support has been integrated into many monitoring systems

Open Source, Commercial, DIY
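Where no integration exists, DIY is straightforward. A sketch of a health-check pass over per-GPU metrics (the field name and threshold are hypothetical; the values would come from NVML or nvidia-smi in practice):

```python
def flag_unhealthy(gpus, max_temp_c=85):
    """Return indices of GPUs whose reported temperature is at or above the limit."""
    return [i for i, g in enumerate(gpus) if g["temperature_c"] >= max_temp_c]

metrics = [{"temperature_c": 61}, {"temperature_c": 92}]  # canned sample
print(flag_unhealthy(metrics))  # -> [1]
```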

16

GPU ACCOUNTING

Per-process accounting of GPU usage by PID

Accessible by library (NVML) or command-line (nvidia-smi)

Enable accounting mode:

sudo nvidia-smi -am 1

Human-readable accounting:

nvidia-smi -q -d ACCOUNTING

CSV accounting data:

nvidia-smi --query-accounted-apps=gpu_name,gpu_util --format=csv
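The CSV form is convenient for rollups. A hedged sketch (the helper and the sample rows are mine; the header/unit shape assumed here follows nvidia-smi's --format=csv style) that averages reported GPU utilization across accounted processes:

```python
import csv
import io

def mean_gpu_util(csv_text):
    """Average the 'gpu_utilization [%]' column of accounting CSV output."""
    rows = list(csv.DictReader(io.StringIO(csv_text), skipinitialspace=True))
    vals = [int(r["gpu_utilization [%]"].split()[0]) for r in rows]  # strip unit
    return sum(vals) / len(vals)

# Canned sample shaped like `nvidia-smi --query-accounted-apps=... --format=csv`:
sample = ("gpu_name, gpu_utilization [%]\n"
          "Tesla K40m, 80 %\n"
          "Tesla K40m, 40 %\n")
print(mean_gpu_util(sample))  # -> 60.0
```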

17

Summary

18

TESLA ACCELERATED COMPUTING PLATFORM

Pick the HW mix that’s right for your site

Topology Matters in design and in resource management

Rich ecosystem of management hooks and tools

Libraries, Languages, Tools

19

Thanks