
Transcript of "NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems" (ICS 2020) · Network-Based Computing Laboratory

Page 1: Title

NV-Group: Link-Efficient Reductions for Distributed Deep Learning on Modern Dense GPU Systems

Ching-Hsiang Chu, Pouya Kousha, Ammar A. Awan, Kawthar Shafie Khorassani, Hari Subramoni, and Dhabaleswar K. (DK) Panda
{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu
{subramon, panda}@cse.ohio-state.edu
Network-Based Computing Laboratory, The Ohio State University

Page 2: Outline

• Introduction
  – Trends in modern HPC systems
  – Allreduce for distributed deep learning on dense-GPU systems
• Research Challenge
• Proposed Designs: NV-Group Allreduce
• Performance Evaluation
• Concluding Remarks

Page 3: Trends in Modern HPC Architecture: Heterogeneous

• Accelerators: high compute density, high performance/watt
• High-performance interconnects (InfiniBand, Omni-Path, EFA): <1 µs latency, 200 Gbps+ bandwidth
• Multi-/many-core processors
• Node-local storage: SSD, NVMe-SSD, NVRAM

Representative TOP500 systems (https://www.top500.org/):
• #1 Fugaku: 158,976 nodes with the A64FX ARM CPU, a GPU-like processor
• #2 Summit (27,648 GPUs), #3 Sierra (17,280 GPUs), #14 Lassen (2,664 GPUs)
• #6 HPC5 (7,280 GPUs)
• #7 Selene, NVIDIA DGX SuperPOD (2,240 GPUs)

Page 4: Trends in Modern Large-scale Dense-GPU Systems

• Scale-up interconnects (up to 150 GB/s): PCIe, NVLink/NVSwitch, Infinity Fabric, Xe Link
• Scale-out interconnects (up to 25 GB/s): InfiniBand, Omni-Path, Ethernet, Cray Slingshot
• Examples: NVIDIA DGX machines; IBM Power System AC922 nodes on the #1 Summit system

Page 5: GPU-enabled Distributed Deep Learning

• Easy-to-use and high-performance frameworks
• Wide range of applications: image classification, speech recognition, self-driving cars, healthcare, climate analytics
• Example: 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)

Kurth T., Treichler S., Romero J., Mudigonda M., Luehr N., Phillips E., Mahesh A., Matheson M., Deslippe J., Fatica M., Houston M. "Exascale Deep Learning for Climate Analytics." SC 2018 (Gordon Bell Prize).

Page 6: Reduction Operations for Distributed Deep Learning

• Distributed deep learning training with data parallelism uses Allreduce operations to exchange and update gradients, weights, etc.
• State-of-the-art ring-based Allreduce for GPUs*:
  – Pros: contention-free
  – Cons: not scalable; the ring requires 2(P-1) communication steps for P GPUs, so the latency term grows linearly with system size

References: https://www.oreilly.com/ideas/distributed-tensorflow; Ben-Nun T., Hoefler T. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis." arXiv:1802.09941, 2018.
*Please refer to the paper for the analysis of more algorithms.
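
To make the scalability argument concrete, below is a minimal host-side simulation (ours, not NCCL's or MVAPICH2's code) of ring-based Allreduce: a reduce-scatter phase followed by an allgather phase, each taking P-1 steps over P simulated GPUs. The buffer layout and all names are illustrative assumptions.

    // Host-side simulation of ring-based Allreduce (reduce-scatter + allgather).
    // Illustrative only: real GPU implementations pipeline these steps over
    // NVLink/IB; the 2*(P-1) step count is what limits scalability.
    #include <cstdio>
    #include <vector>

    int main() {
        const int P = 4;                  // number of (simulated) GPUs
        const int N = 8;                  // elements per GPU; assume N % P == 0
        const int C = N / P;              // chunk size owned by each rank

        // Each "GPU" starts with its own gradient buffer (rank id + 1 everywhere).
        std::vector<std::vector<float>> buf(P, std::vector<float>(N));
        for (int r = 0; r < P; ++r)
            for (int i = 0; i < N; ++i) buf[r][i] = float(r + 1);

        // Phase 1: reduce-scatter, P-1 steps. In step s, rank r receives chunk
        // (r-1-s) mod P from rank (r-1) mod P and accumulates it.
        for (int s = 0; s < P - 1; ++s) {
            std::vector<std::vector<float>> next = buf;  // snapshot = simultaneous exchange
            for (int r = 0; r < P; ++r) {
                int src = (r - 1 + P) % P;
                int c   = (src - s + P) % P;             // chunk being forwarded
                for (int i = 0; i < C; ++i)
                    next[r][c * C + i] += buf[src][c * C + i];
            }
            buf.swap(next);
        }

        // Phase 2: allgather, P-1 more steps; each rank forwards its fully
        // reduced chunk around the ring, overwriting stale copies.
        for (int s = 0; s < P - 1; ++s) {
            std::vector<std::vector<float>> next = buf;
            for (int r = 0; r < P; ++r) {
                int src = (r - 1 + P) % P;
                int c   = (src + 1 - s + P) % P;         // fully reduced chunk at src
                for (int i = 0; i < C; ++i)
                    next[r][c * C + i] = buf[src][c * C + i];
            }
            buf.swap(next);
        }

        // Every rank now holds the full sum: 1+2+...+P = 10 for P = 4.
        printf("rank 0, element 0: %.1f (expected %.1f)\n", buf[0][0], P * (P + 1) / 2.0f);
        return 0;
    }

Because every chunk must travel around the whole ring, the 2(P-1) step count, and hence the latency, grows with the number of GPUs, even though each link carries only 1/P of the data per step.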

Page 7: Motivation

• Ring-based Allreduce cannot efficiently utilize NVLinks

[Chart: Allreduce throughput (GB/s, 0-15) per NVLink pair (G1<->CPU ... G5<->G6) on an IBM AC922 node for SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, and NCCL-2.6.]

* Profiling tool: P. Kousha et al. "Designing a Profiling and Visualization Tool for Scalable and In-Depth Analysis of High-Performance GPU Clusters." HiPC 2019.
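
The per-link numbers above were collected with the profiling tool cited in the footnote. As a minimal, hedged illustration of what such a measurement does, the sketch below times a single peer-to-peer copy between two GPUs with CUDA events; the device IDs and transfer size are arbitrary, and a real harness would repeat the copy in both directions for steady-state numbers.

    // Rough single-pair GPU-GPU throughput measurement (illustrative only).
    #include <cuda_runtime.h>
    #include <cstdio>

    int main() {
        const size_t bytes = 256 << 20;          // 256 MB transfer
        float *buf0, *buf1;
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);        // direct path to GPU 1 if supported
        cudaMalloc((void **)&buf0, bytes);
        cudaSetDevice(1);
        cudaMalloc((void **)&buf1, bytes);

        cudaSetDevice(0);
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        cudaEventRecord(start);
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes); // rides NVLink when available
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("GPU0 -> GPU1: %.1f GB/s\n", (bytes / 1e9) / (ms / 1e3));
        return 0;
    }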

Page 8: Outline

• Introduction
• Research Challenge
• Proposed Designs: NV-Group
• Performance Evaluation
• Concluding Remarks

Page 9: Broad Research Challenge

How do we design a link-efficient Allreduce algorithm that maximizes the utilization of the available hardware communication channels to boost the performance of distributed DL training on emerging dense GPU systems?

Page 10: Outline

• Introduction
• Research Challenge
• Proposed Designs: NV-Group Allreduce
• Performance Evaluation
• Concluding Remarks

Page 11: Overview of the Proposed NVGroup Allreduce

1. Forming NV-Groups: treat multiple GPUs as one
2. Cooperative reduction kernel within an NV-Group:
   – Persistent GPU kernels
   – Exploit load-store primitives over NVLink
   – High-occupancy kernel
3. Communication across NV-Groups: contention-free over the slowest channel, the IB network

Page 12: Forming NV-Groups

• Topology detection and GPU grouping:
  – Discover which GPUs are fully connected by NVLink, using tools such as hwloc [1] and NVML [2]
  – Create logical GPU groups, e.g., an MPI group or communicator

[1] https://www.open-mpi.org/projects/hwloc/
[2] NVIDIA Management Library (NVML), https://developer.nvidia.com/nvidia-management-library-nvml
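
As a concrete, hypothetical illustration of the grouping step, the MPI sketch below splits ranks first by node and then by an NVLink clique id. The helper nvlink_clique_id() is a placeholder of our own for real hwloc/NVML topology detection (e.g., querying nvmlDeviceGetNvLinkState per link); it simply assumes a Summit-like layout of three NVLink-connected GPUs per socket and one MPI rank per GPU.

    #include <mpi.h>
    #include <stdio.h>

    // Hypothetical helper: identify the NVLink clique this rank's GPU belongs
    // to. On a Summit node this maps GPUs {0,1,2} -> 0 and {3,4,5} -> 1.
    static int nvlink_clique_id(int local_rank) { return local_rank / 3; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int world_rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

        // Step 1: group ranks that share a node (shared-memory domain).
        MPI_Comm node_comm;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, world_rank,
                            MPI_INFO_NULL, &node_comm);
        int local_rank;
        MPI_Comm_rank(node_comm, &local_rank);

        // Step 2: split the node communicator by NVLink clique, yielding one
        // NV-Group communicator per set of fully connected GPUs.
        MPI_Comm nvgroup_comm;
        MPI_Comm_split(node_comm, nvlink_clique_id(local_rank), local_rank,
                       &nvgroup_comm);

        int group_rank, group_size;
        MPI_Comm_rank(nvgroup_comm, &group_rank);
        MPI_Comm_size(nvgroup_comm, &group_size);
        printf("world %d -> NV-Group rank %d of %d\n",
               world_rank, group_rank, group_size);

        MPI_Comm_free(&nvgroup_comm);
        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }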

Page 13: Cooperative Reduction Kernel within an NV-Group

• The CPU creates a work queue for each Cooperative Thread Array (CTA, or block)
• Persistent GPU kernel:
  1) Polls its individual work queue
  2) Reduces the data chunks: reduce-scatter among GPUs using direct load-store over NVLink
  3) Signals the CPU upon completion (implicit synchronization [1])

[Diagram: GPU 0 and GPU 1 each run CTA 0 and CTA 1; a work queue in shared system memory holds chunk descriptors such as "Chunk 0-0: CPU -> GPU0", "Chunk 1-0: CPU -> GPU1", "Chunk 0-0: GPU1 -> GPU0", and "Chunk 1-1: GPU0 -> GPU1", which the persistent kernels poll and process.]

[1] Ching-Hsiang Chu et al. "Designing High-Performance In-Memory Key-Value Operations with Persistent GPU Kernels and OpenSHMEM." OpenSHMEM 2018.
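
Below is a minimal sketch of the persistent-kernel idea, with names of our own (WorkItem, persistent_reduce) rather than the MVAPICH2-GDR internals: the CPU posts a chunk descriptor into host-mapped memory, one CTA per queue slot polls for work, reduces src into dst with plain load-store operations (src would be an NVLink peer/IPC pointer in the real design; here it is a second local buffer), and raises a completion flag the CPU can poll. Production code would need stronger memory-ordering guarantees than this sketch shows.

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct WorkItem {
        float *dst;            // chunk owned by this GPU
        const float *src;      // peer chunk (NVLink pointer in the real design)
        int n;                 // elements in the chunk
        volatile int ready;    // 0 = empty, 1 = posted by the CPU
        volatile int done;     // set by the GPU when the reduction finished
    };

    __global__ void persistent_reduce(WorkItem *queue, volatile int *shutdown) {
        WorkItem *item = &queue[blockIdx.x];   // one CTA services one queue slot
        __shared__ int state;                  // uniform broadcast of the flags
        for (;;) {
            if (threadIdx.x == 0) state = (*shutdown << 1) | item->ready;
            __syncthreads();
            if (state & 2) break;              // 3') CPU asked the kernel to retire
            if (state & 1) {                   // 1) work item posted
                for (int i = threadIdx.x; i < item->n; i += blockDim.x)
                    item->dst[i] += item->src[i];  // 2) direct load-store reduce
                __threadfence_system();        // make stores visible system-wide
                __syncthreads();
                if (threadIdx.x == 0) { item->ready = 0; item->done = 1; }  // 3) signal
            }
            __syncthreads();
        }
    }

    int main() {
        const int n = 1 << 20;
        std::vector<float> h(n, 1.0f);
        float *dst, *src;
        cudaMalloc((void **)&dst, n * sizeof(float));
        cudaMalloc((void **)&src, n * sizeof(float));  // stand-in for a peer buffer
        cudaMemcpy(dst, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);
        std::fill(h.begin(), h.end(), 2.0f);
        cudaMemcpy(src, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        WorkItem *queue; int *shutdown;        // host-mapped (zero-copy) memory
        cudaHostAlloc((void **)&queue, sizeof(WorkItem), cudaHostAllocMapped);
        cudaHostAlloc((void **)&shutdown, sizeof(int), cudaHostAllocMapped);
        queue->dst = dst; queue->src = src; queue->n = n;
        queue->ready = 0; queue->done = 0; *shutdown = 0;

        persistent_reduce<<<1, 1024>>>(queue, shutdown);  // kernel stays resident
        queue->ready = 1;                      // CPU posts one chunk descriptor
        while (!queue->done) { }               // CPU polls the completion flag
        *shutdown = 1;
        cudaDeviceSynchronize();

        cudaMemcpy(h.data(), dst, sizeof(float), cudaMemcpyDeviceToHost);
        printf("dst[0] = %.1f (expected 3.0)\n", h[0]);
        return 0;
    }

Launching the kernel once and feeding it descriptors avoids per-chunk kernel-launch overhead, which is the point of the persistent-kernel design.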

Page 14: Cooperative Reduction Kernel - Efficiency

• High-occupancy kernel with low register pressure*:
  – The CPU coordinates the topology and communication paths
  – All threads participate in the reduction operations
• Frees resources for applications:
  – Low SM consumption, hence low scheduling overhead
  – Enables overlap opportunities

[Chart: Efficiency of the reduction kernel (higher is better); x-axis: number of CTAs (1-16); y-axis: throughput (GB/s, 0-80); NVGroup (1024 threads) vs. NCCL-2.6 (256 threads). NVGroup saturates the links with only a few CTAs.]

* https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html#execution-configuration-optimizations
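
The occupancy argument can be checked with the CUDA occupancy API: capping register pressure with __launch_bounds__ lets a full 1024-thread CTA stay resident, and cudaOccupancyMaxActiveBlocksPerMultiprocessor reports how many such CTAs fit per SM. The kernel below is a stand-in of our own, not the paper's reduction kernel.

    #include <cuda_runtime.h>
    #include <cstdio>

    // 1024 threads per CTA; ask the compiler to keep register use low enough
    // that at least one such CTA is resident per SM.
    __global__ void __launch_bounds__(1024, 1)
    reduce_stub(float *dst, const float *src, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            dst[i] += src[i];
    }

    int main() {
        int device = 0, num_blocks = 0, num_sms = 0;
        cudaGetDevice(&device);
        cudaDeviceGetAttribute(&num_sms, cudaDevAttrMultiProcessorCount, device);

        // Resident CTAs per SM for a 1024-thread block, no dynamic shared memory.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&num_blocks, reduce_stub,
                                                      1024, 0);
        printf("resident CTAs/SM: %d (device has %d SMs)\n", num_blocks, num_sms);
        return 0;
    }

Since a few wide CTAs already saturate the links (per the chart above), the remaining SMs stay free for the application, which is where the overlap opportunity comes from.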

Page 15: Group-Wise Communication - CPU-GPU Cooperation

• CPU processes:
  – Handle inter-group communication: ring-based reduce-scatter + allgather over IB or X-Bus
  – Offload the reduction itself to the NV-Group
• GPUs (NV-Group):
  – Process the operations requested by the CPU: direct reduce-scatter or allgather over NVLink

[Diagram: four NV-Groups (Group-1 through Group-4), each containing GPU 0-2; the CPU processes form a ring across the groups while the GPUs within each group perform the reductions.]

* Please check out the paper for more optimization techniques.
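
A hedged sketch of the inter-group phase, assuming one leader CPU process per NV-Group on an inter-group communicator (approximated here by MPI_COMM_WORLD; a real leader_comm would come from MPI_Comm_split). The ring across groups decomposes into a reduce-scatter plus an allgather, while the intra-group reduction is left to the GPU kernels shown earlier.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    // Allreduce across G group leaders, expressed as the two collectives the
    // ring decomposes into. count must be divisible by G.
    static void intergroup_allreduce(float *data, int count, MPI_Comm leader_comm) {
        int G; MPI_Comm_size(leader_comm, &G);
        int chunk = count / G;
        float *partial = (float *)malloc(chunk * sizeof(float));

        // 1) Reduce-scatter: each leader ends up with one fully reduced chunk.
        MPI_Reduce_scatter_block(data, partial, chunk, MPI_FLOAT, MPI_SUM,
                                 leader_comm);
        // 2) Allgather: circulate the reduced chunks so every leader has all.
        MPI_Allgather(partial, chunk, MPI_FLOAT, data, chunk, MPI_FLOAT,
                      leader_comm);
        free(partial);
        // 3) Each leader then distributes `data` inside its NV-Group over
        //    NVLink (handled by the cooperative GPU kernel, omitted here).
    }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int G; MPI_Comm_size(MPI_COMM_WORLD, &G);
        int count = 4 * G;                      // keep count divisible by G
        float *data = (float *)malloc(count * sizeof(float));
        for (int i = 0; i < count; ++i) data[i] = 1.0f;

        // MPI_COMM_WORLD stands in for the per-leader communicator.
        intergroup_allreduce(data, count, MPI_COMM_WORLD);

        printf("data[0] = %.1f (expected %d.0)\n", data[0], G);
        free(data);
        MPI_Finalize();
        return 0;
    }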

Page 16: Outline

• Introduction
• Research Challenge
• Proposed Designs: NV-Group
• Performance Evaluation
• Concluding Remarks

Page 17: Experimental Environments

                             #1 Summit                  #10 Lassen*                NVIDIA DGX-2
CPU model                    IBM POWER9 (AC922)         IBM POWER9 (AC922)         Intel Skylake
System memory                512 GB                     256 GB                     1.5 TB
GPU model                    NVIDIA Volta V100 x 6      NVIDIA Volta V100 x 4      NVIDIA Volta V100 x 16
CPU-GPU interconnect         2-lane NVLink              3-lane NVLink              PCIe Gen3
GPU-GPU interconnect         2-lane NVLink              3-lane NVLink              6-lane NVLink & NVSwitch
Inter-node interconnect      Dual-rail Mellanox IB EDR  Dual-rail Mellanox IB EDR  Mellanox IB EDR x 8 (unused)
NVIDIA driver & CUDA         418.116 & 10.1.243         418.116 & 10.1.243         410.48 & 10.1.243

• Libraries: SpectrumMPI v10.3.1, OpenMPI v4.0.3 + UCX v1.8.0, MVAPICH2-GDR v2.3, NCCL v2.6
• Benchmarks: OSU Micro-Benchmarks (OMB) & modified nccl-tests
• Applications: Horovod v0.19 with TensorFlow v1.14 & PyTorch v1.5

*Please refer to the paper for the thorough performance comparison.

Page 18: Overview of the MVAPICH2 Project

• High-performance open-source MPI library
• Support for multiple interconnects: InfiniBand, Omni-Path, Ethernet/iWARP, RDMA over Converged Ethernet (RoCE), and AWS EFA
• Support for multiple platforms: x86, OpenPOWER, ARM, Xeon Phi, GPGPUs
• Started in 2001; first open-source version demonstrated at SC '02
• Supports the latest MPI-3.1 standard
• http://mvapich.cse.ohio-state.edu
• Additional optimized versions for different systems/environments:
  – MVAPICH2-X (Advanced MPI + PGAS), since 2011
  – MVAPICH2-GDR with support for NVIDIA GPGPUs, since 2014
  – MVAPICH2-MIC with support for Intel Xeon Phi, since 2014
  – MVAPICH2-Virt with virtualization support, since 2015
  – MVAPICH2-EA with support for energy awareness, since 2015
  – MVAPICH2-Azure for Azure HPC IB instances, since 2019
  – MVAPICH2-X-AWS for AWS HPC+EFA instances, since 2019
• Tools:
  – OSU MPI Micro-Benchmarks (OMB), since 2003
  – OSU InfiniBand Network Analysis and Monitoring (INAM), since 2015
• Used by more than 3,100 organizations in 89 countries
• More than 772,000 (>0.7 million) downloads directly from the OSU site
• Empowering many TOP500 clusters (June 2020 ranking):
  – 4th: 10,649,600-core Sunway TaihuLight at NSC, Wuxi, China
  – 8th: 448,448-core Frontera at TACC
  – 12th: 391,680-core ABCI in Japan
  – 18th: 570,020-core Nurion in South Korea, and many others
• Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, OpenHPC, and Spack)
• Partner in the 8th-ranked TACC Frontera system
• Empowering TOP500 systems for more than 15 years

Page 19: Evaluation of Link Utilization on Summit

• The NVGroup design utilizes all NVLinks

[Chart: Throughput of the Allreduce operation (GB/s, 0-30) per NVLink pair (G1<->CPU ... G5<->G6) on an IBM AC922 machine for SpectrumMPI-10.3, OpenMPI-4.0.3, MV2-GDR-2.3, NCCL-2.6, and the proposed design.]

Page 20: Allreduce Benchmarking on Summit

[Charts: (1) Latency (µs, 0-500) on 1,536 GPUs for 4 B-16 KB messages: NVGroup is up to 1.6X better than NCCL 2.6. (2) Bandwidth (GB/s, 0-6) on 1,536 GPUs for 32-256 MB messages: NVGroup is up to 1.7X better than NCCL 2.6. (3) Bandwidth (GB/s, 0-10) for a 128 MB message on 24-1,536 GPUs: NVGroup is up to 1.7X better than SpectrumMPI 10.3.1, OpenMPI 4.0.3, and NCCL 2.6.]

• NVGroup yields 40% better latency and 30% better bandwidth than NCCL
• NVGroup scales significantly better than production libraries on the Summit system

*Please refer to the paper for the thorough comparison.

Page 21: Distributed Deep Learning Training on a DGX-2 Machine

• ResNet-50 training using Horovod with TensorFlow: synthetic ImageNet dataset, batch size 64, 100 batches
• NVGroup achieves higher throughput due to its high-occupancy reduction kernel

[Charts: (1) Images per second on 1-16 GPUs: NCCL-2.6 vs. the proposed design vs. ideal; the proposed design is up to 9% higher. (2) Scaling efficiency (%) on 1-16 GPUs: the proposed design reaches 93%.]

Page 22: Distributed Deep Learning Training on Summit

• ResNet-50 training using Horovod with TensorFlow and PyTorch: synthetic ImageNet dataset, batch size 64, 100 batches

[Charts: Images per second on 6-384 GPUs for SpectrumMPI, MV2-GDR-2.3, NCCL-2.6, and the proposed design. TensorFlow v1.14: the proposed design is up to 5% higher than NCCL-2.6. PyTorch v1.5: up to 1.48X higher.]

Page 23: Outline

• Introduction
• Research Challenge
• Proposed Designs
• Performance Evaluation
• Concluding Remarks

Page 24: Concluding Remarks

• Allreduce operations dominate the performance of distributed deep learning training with data parallelism
• State-of-the-art ring-based Allreduce fails to efficiently utilize the interconnects on modern dense-GPU systems
• The proposed NVGroup Allreduce can:
  – Maximize NVLink utilization on dense-GPU systems
  – Perform cooperative reductions on GPUs
  – Achieve faster distributed DL training on dense-GPU systems such as the #1 Summit
• Publicly available since the MVAPICH2-GDR 2.3.2 release: http://mvapich.cse.ohio-state.edu/

Page 25: Thank You!

{chu.368, kousha.2, awan.10, shafiekhorassani.1}@osu.edu
{subramon, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance MPI/PGAS Project: http://mvapich.cse.ohio-state.edu/
The High-Performance Deep Learning Project: http://hidl.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/