Efficient Communication Library for Large-Scale Deep Learning


Transcript of Efficient Communication Library for Large-Scale Deep Learning

Page 1:

IBM Research AI

Efficient Communication Library for Large-Scale Deep Learning

Mar 26, 2018

Minsik Cho ([email protected])

Page 2:

Deep Learning Changing Our Life

• Automotive/transportation
• Security/public safety
• Medicine and Biology
• Media and Entertainment
• Consumer Web, Mobile, Retail

Page 3:

Deep Learning Workflow

[Figure: the training and inference pipeline]

Training: training data (grouped in large minibatches) flows through forward and backward passes, minibatch after minibatch and epoch after epoch, producing a trained model.
– Latency to model: typically days to train complex models; limited by training compute throughput.

Inference: individual input data (e.g., from microservices) is batched (smaller, varied, application-dependent batch sizes) and run through a forward pass whose output drives an application-dependent action. Conversion/retraining is needed if training and inference precisions differ.
– Latency to action: typically milliseconds to complete the full inference workflow; limited by the latency of batching (to enable efficient inference) + inference compute + the resultant action.

This is my focus (the training side).

Page 4:

Advances in Computation for Deep Learning

• 10-100 TFLOPS
• Very good scaling for the last 15 years

[Charts from P. Goldsborough and MichaelGalloy.com: compute throughput of GPU/FPGA accelerators over time]

Page 5:

Motivation: OK, ever-faster computation. Is this enough?

• ImageNet1K: 1.2M images, 1K classes, ResNet-101
– Batch size = 32 (limited by GPU memory)
– Iteration time = 300 ms
– #iterations per epoch = 38,000
– Total training time for 100 epochs = 13.2 days (see the back-of-envelope sketch after this list)
• ImageNet22K: 7.5M images, 22K classes, ResNet-101
– Total training time for 100 epochs = 35.2 days
• No, it is NOT
– 1.2M samples are still toy scale
– Computation scaling is not fast enough
• for the data explosion and model complexity
• innovation will take too long, or even stop at some point
– I cannot wait for days to get my model trained!
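These totals follow from simple arithmetic; below is a back-of-envelope sketch in Python, using the figures quoted above (the helper name is illustrative, not part of any library):

```python
def training_days(num_images, batch_size, iter_time_s, epochs):
    """Single-node wall-clock estimate: iterations per epoch x time per iteration x epochs."""
    iters_per_epoch = num_images / batch_size          # ~38,000 for ImageNet1K at batch size 32
    return iters_per_epoch * iter_time_s * epochs / 86400   # seconds -> days

# ImageNet1K, ResNet-101, batch size 32, 300 ms per iteration, 100 epochs -> roughly 13 days
print(round(training_days(1.2e6, 32, 0.3, 100), 1))
```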

Page 6:

Faster Training Time with Distributed Deep Learning

[Figure: a single 9-day image-recognition training run vs. 4-hour recognition learning runs with distributed deep learning on Power8 systems, a 54x speedup (9 days ≈ 216 hours ≈ 54 runs of 4 hours).]

What will you do?
• Iterate more and create more accurate models?
• Create more models?
• Both?

Page 7:

Distributed Deep Learning

[Figure, after P. Goldsborough: approaches to distributed deep learning]

• Model parallelism: the model itself is partitioned across learners (complex partitioning); what must be communicated can be anything.
• Data parallelism with a parameter server: learners exchange gradients/weights (10 MB-1 GB) through a central server.
• Data parallelism with allreduce: learners exchange the same gradients/weights directly via a collective allreduce (sketched below).
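To make the allreduce flavor of data parallelism concrete, here is a minimal sketch of one training step, using mpi4py and NumPy as generic stand-ins (this is not DDL's API; compute_gradients is a placeholder for the framework's backward pass):

```python
# Minimal data-parallel training step with allreduce (illustrative only).
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
world_size = comm.Get_size()

def data_parallel_step(weights, local_minibatch, compute_gradients, lr=0.01):
    # 1) Each learner computes gradients on its own minibatch (e.g., 32 images).
    local_grad = compute_gradients(weights, local_minibatch)
    # 2) Sum gradients across all learners; this is the 10 MB-1 GB transfer per step.
    global_grad = np.empty_like(local_grad)
    comm.Allreduce(local_grad, global_grad, op=MPI.SUM)
    # 3) Average and apply the same update on every learner, keeping weights in sync.
    return weights - lr * (global_grad / world_size)
```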

Page 8:

Communication: Overhead

[Figure: each learner/GPU processes its own 32-image minibatch, then all learners synchronize (sync) gradients; with more learners, per-GPU compute stays at 32 images while the synchronization involves more participants.]

• In weak scaling
– Computation cost remains constant
– Communication cost increases with more learners/GPUs
• The computation/communication balance is the key to large-scale deep learning (see the cost-model sketch after this list)
– Increase computation
– Faster communication
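A standard bandwidth-only cost model for ring allreduce illustrates the weak-scaling point: per-step computation stays constant, while each learner must move roughly 2(N-1)/N times the gradient buffer, so the communication share of each iteration grows as learners are added. All numbers below are illustrative assumptions, not measurements:

```python
def ring_allreduce_seconds(grad_bytes, num_learners, link_bytes_per_s):
    """Bandwidth term of the classic ring allreduce: each learner moves 2*(N-1)/N of the buffer."""
    n = num_learners
    return 2 * (n - 1) / n * grad_bytes / link_bytes_per_s

compute_s = 0.3          # per-iteration compute time; constant under weak scaling
grad_bytes = 100e6       # assume ~100 MB of gradients per step
link = 10e9 / 8          # assume a 10 Gb/s link, in bytes per second
for n in (2, 4, 16, 64):
    comm_s = ring_allreduce_seconds(grad_bytes, n, link)
    print(n, f"{comm_s / (comm_s + compute_s):.0%} of each iteration spent communicating")
```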

Page 9:

Advances in Communication for Deep Learning

• Still scaling, but not fast enough

– Computation is still ahead

– Data perhaps grows much faster

Page 10:

Designing Large-Scale Deep Learning

• Model/Application
– Deeper/wider model to increase compute time
– Smaller gradient count to reduce communication time
• System
– Balance network and computing resources
– Select the mini-batch size to adjust the communication/computation ratio (see the sketch after this list)
• a larger mini-batch size lowers the ratio
• too big a mini-batch size can hurt convergence and accuracy
– Network-topology-aware communication

[Figure: balancing computation (model depth, GPU throughput, faster algorithms, mini-batch size) against communication (gradient count, network bandwidth, faster algorithms) to reach a good balance.]
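The mini-batch lever in the list above can be seen with the same kind of back-of-envelope model: compute time per iteration grows with the mini-batch size, while the gradient volume, and hence the communication time, does not. The per-image time and communication time below are illustrative assumptions:

```python
def comm_to_comp_ratio(minibatch_size, per_image_s=0.30 / 32, comm_s=0.10):
    """Communication-to-computation ratio per iteration; comm time does not depend on mini-batch size."""
    return comm_s / (minibatch_size * per_image_s)

for mbs in (16, 32, 64, 128):
    print(mbs, round(comm_to_comp_ratio(mbs), 2))
# Larger mini-batches lower the ratio, but (as noted above) too large a
# mini-batch can hurt convergence and final accuracy.
```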

Page 11:

IBM PowerAI DDL (Distributed Deep Learning Library)

• Collaborative communication library for Distributed Deep Learning
– MPI-like interface for easy integration
– Enables deep learning software to scale to 100s of servers with CPUs/GPUs
– Works across a variety of system sizes
– Works with a variety of network types and switch topologies
• DDL orchestrates the data communication
– Plans efficient communication patterns for a hierarchical network environment (a generic two-level sketch follows this list)
– Actual point-to-point data transfer via NCCL or MPI
• Currently integrated into
– Supported: Caffe, TensorFlow, Chainer, Torch
– Ongoing: Caffe2, PyTorch, Keras (TF backend)
• Currently US patent-pending
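The slides do not expose DDL's internals, but the idea of planning on a hierarchical network and delegating byte movement to NCCL or MPI can be sketched with a generic two-level allreduce: reduce inside each host first (over the fastest links), allreduce only one contribution per host across the network, then broadcast the result back inside each host. The sub-communicator setup below uses mpi4py and is an assumption for illustration, not DDL's implementation; in a real stack the intra-host step would typically be handled by NCCL.

```python
# Sketch of a hierarchical (two-level) allreduce with MPI sub-communicators.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
gpus_per_host = 4                                   # illustrative assumption
host_id = world.Get_rank() // gpus_per_host
local = world.Split(color=host_id, key=world.Get_rank())                        # ranks on the same host
leaders = world.Split(color=int(local.Get_rank() == 0), key=world.Get_rank())   # one leader per host

def hierarchical_allreduce(grad):
    partial = np.empty_like(grad)
    local.Reduce(grad, partial, op=MPI.SUM, root=0)    # 1) reduce inside the host (fast links)
    total = np.empty_like(grad)
    if local.Get_rank() == 0:
        leaders.Allreduce(partial, total, op=MPI.SUM)  # 2) allreduce across hosts (slower links)
    local.Bcast(total, root=0)                         # 3) share the cross-host sum inside the host
    return total
```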

Page 12:

DDL: Topology-aware communication

• Example
– A, B, C, D broadcast to all others
– Suffers from congestion
– Suffers from lower BW

[Figure: A, B, C, D attached under two switches (switch0, switch1); the links differ in bandwidth (labels: max bandwidth 10 GB/s, max sustained bandwidth 100 GB/s). A naive ring-style schedule (transfers A->B, B->C, C->D, D->A across successive steps) repeatedly crosses between the switches.]

Page 13:

DDL: Topology-aware communication

• It's a mapping problem
– System-specific network
– Application-specific traffic
• DDL does it differently
– To minimize bus contention
– To minimize crossing the lower-BW link

[Figure: A, B, C, D in box0-box3 under two switches (switch0, switch1). Two schedules for the same traffic are compared: a suboptimal one that repeatedly sends across the inter-switch link (e.g., A->D, C->B), and DDL's optimal one, which pairs learners under the same switch (A<->B, C<->D) wherever possible.]

Page 14:

DDL: Problem Definition and Solution

• Assumption
– A network topology with various bandwidths
• Problem definition
– A min-cost multi-commodity flow problem
– NP-hard in general, but easily solved when the graph is small (e.g., 4 vertices; see the toy sketch after this list)
• DDL solves a typical case/topology offline
– If the cluster/cloud provides such a topology, DDL performs very well
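As a toy illustration of why a 4-vertex instance is easy: with so few learners one can simply enumerate every ring ordering and keep the one whose neighboring pairs cross the slow inter-switch link least often. This is a hedged sketch of the general idea behind topology-aware ordering, not DDL's actual min-cost multi-commodity flow solver; the placement of A, B, C, D below is assumed for illustration.

```python
# Brute-force search over ring orderings for a tiny two-switch topology.
from itertools import permutations

switch_of = {"A": 0, "B": 0, "C": 1, "D": 1}   # assumed placement: A,B under switch0; C,D under switch1

def cross_switch_edges(ring):
    """Count ring edges whose endpoints sit under different switches (the slow links)."""
    return sum(switch_of[ring[i]] != switch_of[ring[(i + 1) % len(ring)]]
               for i in range(len(ring)))

best = min(permutations("ABCD"), key=cross_switch_edges)
print(best, cross_switch_edges(best))   # e.g., ('A', 'B', 'C', 'D') with only 2 slow-link crossings
```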

Page 15:

DDL: How well it performs on Caffe2

• 48 IBM S822LC servers running RHEL on ppc64le
– 3 racks with 16 hosts each, connected through 10 GB/s InfiniBand
– Each host has 4 P100-SXM2 GPUs with CUDA 8 and cuDNN 5
• Comparing algorithms on ResNet-50 + ImageNet1K (preloaded to a RAM disk), mini-batch size = 32
– MPI_Allreduce
– Ring (allreduce from Baidu, Feb 2017)
– GLOO (from Facebook): NCCL + ib_verbs

[Chart: DDL scaling results compared against the algorithms above.]

Page 16:

Comparison with NCCL 2.1.x Allreduce (POWER)

• IBM P9 Newell systems (NVLink) with V100 GPUs
• 100 Gb/s InfiniBand

[Charts: "Exploiting in-system topology" vs. "Exploiting in/cross-system topology".]

Page 17:

Comparison with NCCL 2.1.x Allreduce (x86)

• x86 systems (PCIe) with P100 GPUs
• 10 Gb/s Ethernet

[Charts: "No in-system topology" vs. "Exploiting cross-system topology".]

Page 18:

Conclusion

• DDL is a topology-aware communication library in PowerAI
• DDL delivers industry-best performance on systems with
– A network hierarchy
– Multi-tier bandwidth
• DDL is well suited to common distributed training in cloud environments

Page 19:


BACKUP