© 2019 Mellanox Technologies | Confidential 1
SC19
Gil Bloch
Accelerating Deep Learning with In-Network Computing
What is it?
Moore’s Law
Where is it going?
Moore’s Law
▪ In April 2005, Gordon Moore stated in an interview that the projection cannot be sustained indefinitely: "It can't continue forever. ..." It no longer centered its research and development plan on Moore's law.
Moore’s Law
GPU Accelerated Computing
Exponential Data Growth
Exponential Data Growth Everywhere
Cloud, Big Data, Enterprise, Business Intelligence, HPC, Storage, Security, Machine Learning, Internet of Things
It is not a wave, it is a Tsunami
Riding The Data Wave
Did you know that 90% of the world’s data has been created in the last two years alone?
It has been predicted that by 2020, 40 zettabytes of data will be generated, an increase of 300 times from 2005!
Big Data? No… REALLY BIG DATA
The average data generated by a self-driving vehicle is expected to reach 40 TB for every eight hours of driving (this mostly applies to full-service fleet vehicles)
The Pratt & Whitney PW1000G engine has 5,000 sensors installed, generating about 10 GB of data per second. Over an average 12-hour flight, this can produce up to 844 TB of data
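The 844 TB figure can be checked with a little arithmetic. The slide gives only the per-engine rate and the flight time; the sketch below assumes two engines and 1 TB = 1024 GB, neither of which is stated on the slide, because that combination reproduces the quoted number. All names are illustrative.

```python
# Figures taken from the slide.
GB_PER_SECOND_PER_ENGINE = 10
FLIGHT_HOURS = 12

# Assumptions (not stated on the slide): a twin-engine aircraft
# and binary terabytes (1 TB = 1024 GB).
ENGINES = 2
GB_PER_TB = 1024


def flight_data_tb(engines=ENGINES, hours=FLIGHT_HOURS,
                   rate_gb_s=GB_PER_SECOND_PER_ENGINE, gb_per_tb=GB_PER_TB):
    """Total sensor data for one flight, in terabytes."""
    seconds = hours * 3600
    return engines * rate_gb_s * seconds / gb_per_tb


print(flight_data_tb())                           # two engines, 1024 GB per TB
print(flight_data_tb(engines=1, gb_per_tb=1000))  # one engine, 1000 GB per TB
```

Under these assumptions the result rounds to 844 TB; a single engine with decimal terabytes gives 432 TB.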
Mellanox is the de-facto interconnect for deep learning deployments
Neural Networks Complexity Growth
Image Recognition (2012 to 2016): AlexNet, GoogleNet, ResNet, Inception-V2, Inception-V4 (350X)
Speech Recognition (2014 to 2017): DeepSpeech, DeepSpeech-2, DeepSpeech-3 (30X)
Complexity = GOPS × Bandwidth
Mellanox Unleashes the Power of Artificial Intelligence
Enabling World-Leading Artificial Intelligence Solutions
More Data, Better Models, Faster Interconnect
GPUs, CPUs, FPGAs, ASIC, Storage
The Need for Intelligent and Faster Interconnect
CPU-Centric (Onload): Must Wait for the Data, Creates Performance Bottlenecks
Data-Centric (Offload): Analyze Data as it Moves! Higher Performance and Scale
Faster Data Speeds and In-Network Computing Enable Higher Performance and Scale
[Diagram: CPU and GPU nodes connected by an Onload Network versus In-Network Computing]
An Application Example – Pizza Processing
▪ Order Pizza
  ▪ Call (or use Pizza application)
▪ PE 1 – Prepare Pizza
  ▪ Tomato sauce, Cheese, Pepperoni…
▪ PE 1 – Put in the oven
  ▪ And now we wait…
▪ PE 1 – Pack and send
▪ Network (Pizza Delivery)
▪ PE 2 – Pizza Consumption
CPU-Centric (Onload): Must Wait for the Pizza, Creates Performance Bottlenecks
PE 1 – Pizza Generation, PE 2 – Pizza Consumption
[Diagram: CPU and GPU nodes connected by an Onload Network]
What if…
Data Centric Architecture to Overcome Latency Bottlenecks
CPU-Centric (Onload): Communications Latencies of 30-40us
Data-Centric (Offload): Communications Latencies of 3-4us
Intelligent Interconnect Paves the Road to Exascale Performance
[Diagram: CPU and GPU nodes in the CPU-centric and data-centric architectures]
In-Network Computing to Enable Data-Centric Data Centers
[Diagram: CPU and GPU nodes]
GPUDirect RDMA
Scalable Hierarchical Aggregation and Reduction Protocol
NVMe over Fabrics
The Need for Speed
Mellanox Accelerates TensorFlow 1.5
100G is a Must For Large Scale Models
6.5X Faster Training with 100G
[Chart: speedup labels 2.5X and 6.5X]
Remote Direct Memory Access (RDMA)
Mellanox Accelerates TensorFlow
Unmatched Linear Scalability at No Additional Cost
50% Better Performance
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)
▪ Reliable Scalable General Purpose Primitive
▪ Applicable to Multiple Use-cases in ML/HPC
▪ Scalable High Performance Collective Offload
[Diagram: five Hosts send Data up to a tree of Switches; Aggregated Data moves up the tree and the Aggregated Result is returned to the Hosts]
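The aggregation flow on this slide can be sketched as a toy simulation in Python. This is not the SHARP protocol itself: the function names, the `radix` parameter (children per switch), and the sample vectors are assumptions made for illustration only.

```python
from typing import List


def aggregate_up(host_data: List[List[float]], radix: int = 2) -> List[float]:
    """Sum per-host vectors level by level, as a tree of switches might.

    Assumes a non-empty list of equal-length vectors.
    """
    if radix < 2:
        raise ValueError("each switch needs at least two children")
    level = host_data
    while len(level) > 1:
        # Each hypothetical switch sums the vectors of up to `radix` children
        # and forwards a single aggregated vector to the next level.
        level = [
            [sum(values) for values in zip(*level[i:i + radix])]
            for i in range(0, len(level), radix)
        ]
    return level[0]


def tree_allreduce(host_data: List[List[float]], radix: int = 2) -> List[List[float]]:
    """Aggregate up the tree, then hand the same result back to every host."""
    result = aggregate_up(host_data, radix)
    return [list(result) for _ in host_data]


# Five hosts, as in the diagram, each contributing a two-element vector.
hosts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
results = tree_allreduce(hosts, radix=2)
print(results[0])
```

Each host sends its data once and receives one aggregated result; the reduction work happens at the simulated switches rather than on the hosts.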
SHARP AllReduce Performance Advantages (128 Nodes)
SHARP enables 75% Reduction in Latency
Providing Scalable Flat Latency
SHARP AllReduce Performance Advantages (1500 Nodes, 60K MPI Ranks, Dragonfly+ Topology)
SHARP Enables Highest Performance
SHARP Accelerates AI Performance
The CPU in a parameter server becomes the bottleneck
SHARP:
▪ Performs the Gradient Averaging
▪ Replaces all physical parameter servers
▪ Accelerates AI Performance
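Gradient averaging, and why a central parameter server can become a bottleneck, can be illustrated with a small sketch. The traffic model is a deliberate simplification and the function names are hypothetical: it assumes every worker sends one full gradient per step and ignores sharding, compression, and overlap.

```python
from typing import List


def average_gradients(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise mean of the workers' gradients (sum, then divide by N)."""
    n = len(worker_grads)
    summed = [sum(values) for values in zip(*worker_grads)]
    return [s / n for s in summed]


def parameter_server_inbound_bytes(num_workers: int, grad_bytes: int) -> int:
    """Bytes one central server must receive and reduce per step (simplified)."""
    return num_workers * grad_bytes


def per_host_bytes_with_in_network_reduction(grad_bytes: int) -> int:
    """Bytes each host sends when the reduction happens in the network.

    Simplified model: independent of the number of workers.
    """
    return grad_bytes


grads = [[1.0, 2.0], [3.0, 6.0]]
print(average_gradients(grads))
print(parameter_server_inbound_bytes(num_workers=64, grad_bytes=100_000_000))
print(per_host_bytes_with_in_network_reduction(grad_bytes=100_000_000))
```

In this model the server's inbound traffic grows linearly with the number of workers, while the per-host traffic with in-network reduction stays constant.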
InfiniBand SHARP Advantage for Deep Learning
▪ Increases System Performance
▪ Better Scalability
▪ Reduces the amount of data traversing the network
[Chart: performance improvement labels 16% and 11%]
System Configuration: Intel E5-2650V4, 12 cores @ 2.2GHz, 30M L2 cache, 9.6GT QPI, 256GB RAM: 16 x 16 GB DDR4, NVIDIA P100 GPUs, ConnectX-6 HCA, IB Quantum Switch (EDR speed), RH 7.5, Mellanox OFED 4.4, HPC-X v2.3, TensorFlow v1.11, Horovod 0.15.0
Scalable Performance for Distributed AI
NCCL SHARP
NCCL Overview
▪ NCCL: NVIDIA Collective Communications Library
▪ Enables Multi GPU Computing
  ▪ Data Parallel multi GPU training
  ▪ NCCL Allreduce: Aggregate gradients across GPUs
▪ DL Frameworks (TensorFlow/Horovod, PyTorch, MXNet, Chainer, …)
▪ NCCL 1.0
  ▪ Single node Ring
▪ NCCL 2.0
  ▪ Ring across multiple nodes
  ▪ RDMA
▪ NCCL 2.4
  ▪ Hierarchical tree algorithm
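The ring algorithm named above can be sketched in plain Python. This is a textbook ring allreduce (reduce-scatter followed by allgather), not NCCL's implementation; it assumes each rank's buffer is split into exactly as many chunks as there are ranks, with one number per chunk. The step-count helpers are a simple counting model added for comparison with a tree.

```python
from typing import List


def ring_allreduce(buffers: List[List[int]]) -> List[List[int]]:
    """Sum-allreduce over a logical ring; buffers[r] has len(buffers) chunks."""
    n = len(buffers)
    data = [list(b) for b in buffers]

    # Reduce-scatter: in each step, rank r sends one chunk to rank r + 1,
    # which adds it to its own copy. Sends are snapshotted before applying.
    for step in range(n - 1):
        sends = [((r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] += value

    # Rank r now holds the fully reduced chunk (r + 1) % n.
    # Allgather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for r, (chunk, value) in enumerate(sends):
            data[(r + 1) % n][chunk] = value
    return data


def ring_steps(n: int) -> int:
    """Communication steps for the ring: grows linearly with rank count."""
    return 2 * (n - 1)


def tree_levels(n: int, radix: int = 2) -> int:
    """Levels in a tree with `radix` children per node: grows logarithmically."""
    levels, span = 0, 1
    while span < n:
        span *= radix
        levels += 1
    return levels


reduced = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
print(reduced)
print(ring_steps(128), tree_levels(128))
```

In this counting model the ring needs a number of steps proportional to the number of ranks, while a tree's depth grows only with its logarithm, which is one motivation for tree-based and in-network approaches at scale.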
NCCL SHARP
[Diagram: NICs connected to the Network Fabric]
Thank You