
© 2019 Mellanox Technologies | Confidential 1

SC19 | Gil Bloch

Accelerating Deep Learning with In-Network Computing


What is it?

Moore’s Law


Where is it going?

Moore’s Law

▪ In April 2005, Gordon Moore stated in an interview that the projection cannot be sustained indefinitely: "It can't continue forever. ..." It no longer centered its research and development plan on Moore's law.


Moore’s Law

GPU Accelerated Computing


Exponential Data Growth


Cloud Big Data

Enterprise

Business Intelligence

HPC

Storage

Security

Machine Learning

Internet of Things

Exponential Data Growth Everywhere


It is not a wave, it is a Tsunami

Riding The Data Wave

Did you know that 90% of the world's data has been created in the last two years alone?

It has been predicted that by 2020, 40 zettabytes of data will be generated, an increase of 300 times from 2005!


Big Data? No… REALLY BIG DATA

The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this mostly applies to full-service fleet vehicles).

The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second. With an average 12-hour flight time, this can produce up to 844 TB of data.
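As a rough sanity check on the figures quoted above, the arithmetic can be sketched in a few lines of Python. This assumes decimal units (1 TB = 1000 GB), and the function names are purely illustrative. At the quoted 10 GB per second, a 12-hour flight works out to 432 TB, about half of the 844 TB figure. The slide does not say how 844 TB is derived; it may count more than one engine, but that is an assumption.

```python
# Back-of-the-envelope check of the data volumes quoted on the slide.
# Assumption: decimal units (1 TB = 1000 GB). Function names are illustrative.

GB_PER_TB = 1000
SECONDS_PER_HOUR = 3600


def vehicle_rate_gb_per_s(tb_per_shift=40, shift_hours=8):
    """Sustained data rate implied by 40 TB per eight hours of driving."""
    return tb_per_shift * GB_PER_TB / (shift_hours * SECONDS_PER_HOUR)


def engine_volume_tb(gb_per_second=10, flight_hours=12):
    """Total data volume at a constant rate over one flight."""
    return gb_per_second * flight_hours * SECONDS_PER_HOUR / GB_PER_TB


print(f"self-driving vehicle: {vehicle_rate_gb_per_s():.2f} GB/s")
print(f"engine sensors: {engine_volume_tb():.0f} TB per 12-hour flight")
```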

Mellanox is the de-facto interconnect for deep learning deployments


Neural Networks Complexity Growth

[Chart: Image Recognition, 2012 to 2016 (AlexNet, GoogleNet, ResNet, Inception-V2, Inception-V4): 350X]

[Chart: Speech Recognition, 2014 to 2017 (DeepSpeech, DeepSpeech-2, DeepSpeech-3): 30X]

Complexity = GOPS × Bandwidth


Mellanox Unleashes the Power of Artificial Intelligence

Enabling World-Leading Artificial Intelligence Solutions

More Data, Better Models, Faster Interconnect

GPUs, CPUs, FPGAs, ASIC, Storage


The Need for Intelligent and Faster Interconnect

CPU-Centric (Onload): Must Wait for the Data, Creates Performance Bottlenecks

Data-Centric (Offload): Analyze Data as it Moves! Higher Performance and Scale

Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale

[Diagram: GPU and CPU nodes connected by an Onload Network, compared with the same nodes connected by In-Network Computing]


An Application Example – Pizza Processing

▪ Order Pizza
  ▪ Call (or use Pizza application)
▪ PE 1 – Prepare Pizza
  ▪ Tomato sauce, Cheese, Pepperoni…
▪ PE 1 – Put in the oven
  ▪ And now we wait…
▪ PE 1 – Pack and send
▪ Network (Pizza Delivery)
▪ PE 2 – Pizza Consumption

CPU-Centric (Onload): Must Wait for the Pizza, Creates Performance Bottlenecks

PE 1 – Pizza Generation, PE 2 – Pizza Consumption

[Diagram: GPU and CPU nodes connected by an Onload Network]


What if…


Data Centric Architecture to Overcome Latency Bottlenecks

CPU-Centric (Onload): Communications Latencies of 30-40us

Data-Centric (Offload): Communications Latencies of 3-4us

Intelligent Interconnect Paves the Road to Exascale Performance

[Diagram: GPU and CPU nodes in the CPU-centric layout compared with the data-centric layout]


In-Network Computing to Enable Data-Centric Data Centers

[Diagram: GPU and CPU nodes connected through the network]

▪ GPUDirect RDMA
▪ Scalable Hierarchical Aggregation and Reduction Protocol
▪ NVMe over Fabrics


The Need for Speed


Mellanox Accelerates TensorFlow 1.5

100G is a Must For Large Scale Models

6.5X Faster Training with 100G

[Chart labels: 2.5X, 6.5X]


Remote Direct Memory Access (RDMA)


Mellanox Accelerates TensorFlow

Unmatched Linear Scalability at No Additional Cost

50% Better Performance


Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)


Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)

▪ Reliable Scalable General Purpose Primitive

▪ Applicable to Multiple Use-cases in ML/HPC

▪ Scalable High Performance Collective Offload

[Diagram: five hosts below a tree of switches, with labels Data, Aggregated Data, and Aggregated Result]
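The aggregation idea on this slide can be illustrated with a small simulation: hosts send their data upward, each switch level sums what it receives, and the root's aggregated result is what every host gets back. This is only a toy model in plain Python and not the SHARP wire protocol; the names `aggregate_tree` and `radix` are hypothetical and chosen for illustration.

```python
# Toy model of hierarchical in-network aggregation.
# Assumption: this only mimics the data flow (hosts -> switches -> root);
# it does not represent the actual SHARP protocol or any Mellanox API.


def elementwise_sum(vectors):
    """Sum a group of equal-length vectors element by element."""
    return [sum(values) for values in zip(*vectors)]


def aggregate_tree(host_data, radix=2):
    """Reduce host vectors level by level, as a tree of switches would.

    Each switch combines up to `radix` inputs from the level below.
    Returns the aggregated result and the number of switch levels used.
    """
    level = [list(vector) for vector in host_data]
    levels = 0
    while len(level) > 1:
        level = [
            elementwise_sum(level[i:i + radix])
            for i in range(0, len(level), radix)
        ]
        levels += 1
    return level[0], levels


hosts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
result, levels = aggregate_tree(hosts, radix=2)
# The root's result is sent back down, so every host sees the same vector.
delivered = [result for _ in hosts]
print(result, levels)
```

In this model the number of aggregation levels grows with the logarithm of the host count for a fixed radix, which is one way to read the "scalable" claim on the slide.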


SHARP AllReduce Performance Advantages (128 Nodes)

SHARP enables 75% Reduction in Latency, Providing Scalable Flat Latency


SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)

SHARP Enables Highest Performance


SHARP Accelerates AI Performance

The CPU in a parameter server becomes the bottleneck

▪ Performs the Gradient Averaging
▪ Replaces all physical parameter servers
▪ Accelerates AI Performance
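To make the parameter-server bottleneck concrete, the sketch below compares how many bytes cross the busiest link per training step in two simplified models. It assumes a single parameter server that receives one full gradient from every worker and sends back an update of the same size, and an in-network model where each host sends its gradient once and receives the aggregated result once. Both models and all function names are illustrative assumptions, not measurements or a Mellanox API.

```python
# Simplified traffic model: parameter server vs. aggregation in the switches.
# Assumptions: one parameter server, full-size gradients in both directions,
# no compression or sharding. All names here are hypothetical.


def parameter_server_link_bytes(num_workers, gradient_bytes):
    """Bytes crossing the parameter server's own link per step (in + out)."""
    return 2 * num_workers * gradient_bytes


def in_network_link_bytes(gradient_bytes):
    """Bytes crossing each host link per step when switches aggregate."""
    return 2 * gradient_bytes


def average_gradients(worker_gradients):
    """Gradient averaging: elementwise sum across workers divided by count."""
    count = len(worker_gradients)
    return [sum(values) / count for values in zip(*worker_gradients)]


workers = 8
gradient_bytes = 100 * 10**6  # a 100 MB gradient, chosen for illustration
print(parameter_server_link_bytes(workers, gradient_bytes))
print(in_network_link_bytes(gradient_bytes))
print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))
```

In this model the load on the parameter server's link grows linearly with the number of workers, while the per-host load with in-network aggregation stays constant.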


InfiniBand SHARP Advantage for Deep Learning

▪ Increases System Performance
▪ Better Scalability
▪ Reduces amount of data traversing the network

[Chart labels: 16%, 11%]

System Configuration: Intel E5-2650V4, 12 cores @ 2.2GHz, 30M L2 cache, 9.6GT QPI, 256GB RAM: 16 x 16 GB DDR4, NVIDIA P100 GPUs, ConnectX-6 HCA, IB Quantum Switch (EDR speed), RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0

Scalable Performance for Distributed AI


NCCL SHARP


NCCL Overview

▪ NCCL: NVIDIA Collective Communications Library
▪ Enables Multi-GPU Computing
  ▪ Data Parallel multi-GPU training
  ▪ NCCL Allreduce: Aggregate gradients across GPUs
▪ DL Frameworks (TensorFlow/Horovod, PyTorch, MXNet, Chainer, …)
▪ NCCL 1.0
  ▪ Single node Ring
▪ NCCL 2.0
  ▪ Ring across multiple nodes
  ▪ RDMA
▪ NCCL 2.4
  ▪ Hierarchical tree algorithm
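The ring algorithm mentioned above can be sketched as a pure-Python simulation of a textbook ring allreduce: a reduce-scatter phase followed by an all-gather phase. This is not NCCL code and uses no NCCL API; the function name `ring_allreduce` and the list-based "ranks" are assumptions made for illustration, and the sketch requires the vector length to divide evenly by the number of ranks.

```python
# Simulation of a textbook ring allreduce (sum) across n ranks.
# Assumption: this models the communication pattern only; real NCCL runs
# on GPUs and is far more elaborate. Names here are hypothetical.


def ring_allreduce(rank_data):
    """Return (per-rank results, number of communication steps)."""
    n = len(rank_data)
    length = len(rank_data[0])
    assert length % n == 0, "sketch assumes the vector splits into n chunks"
    size = length // n
    # chunks[r][c] is rank r's current copy of chunk c.
    chunks = [
        [list(vec[c * size:(c + 1) * size]) for c in range(n)]
        for vec in rank_data
    ]
    steps = 0

    # Reduce-scatter: in step s, rank r sends chunk (r - s) mod n to its
    # right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        sends = [(r, (r - s) % n, chunks[r][(r - s) % n][:]) for r in range(n)]
        for r, c, payload in sends:
            dst = (r + 1) % n
            chunks[dst][c] = [a + b for a, b in zip(chunks[dst][c], payload)]
        steps += 1

    # All-gather: rank r now owns the fully reduced chunk (r + 1) mod n and
    # forwards completed chunks around the ring until every rank has all.
    for s in range(n - 1):
        sends = [
            (r, (r + 1 - s) % n, chunks[r][(r + 1 - s) % n][:])
            for r in range(n)
        ]
        for r, c, payload in sends:
            chunks[(r + 1) % n][c] = payload
        steps += 1

    results = [[x for chunk in chunks[r] for x in chunk] for r in range(n)]
    return results, steps


gradients = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
reduced, steps = ring_allreduce(gradients)
print(reduced[0], steps)
```

The simulated ring needs 2(n - 1) steps, so its latency grows with the number of ranks. A tree-based algorithm shortens the path to a number of levels that grows logarithmically instead.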


NCCL SHARP

[Diagram: three NICs attached to the Network Fabric]


Thank You
