Multi GPU training Part 3: Engineering Challenges of Multi...
![Page 1: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/1.jpg)
RAPIDS
GPU POWERED MACHINE LEARNING
![Page 2: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/2.jpg)
RISE OF GPU COMPUTING
[Chart: relative performance on a log scale (10^2 to 10^7), 1980 to 2020. Single-threaded perf: 1.5X per year, later 1.1X per year. GPU-Computing perf: 1.5X per year, 1000X by 2025. Stack labels: APPLICATIONS, SYSTEMS, ALGORITHMS, CUDA, ARCHITECTURE.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.
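As a rough check on the growth rates quoted in the chart, the sketch below computes how many years of compound growth each rate needs to reach 1000X. It only does the arithmetic; the helper name `years_to_reach` is hypothetical, and the slide does not state the start year of the 1000X projection.

```python
import math


def years_to_reach(target_factor, yearly_factor):
    """Years of compound growth at yearly_factor needed to reach target_factor."""
    return math.log(target_factor) / math.log(yearly_factor)


gpu_years = years_to_reach(1000, 1.5)   # 1.5X per year, as quoted for GPU computing
cpu_years = years_to_reach(1000, 1.1)   # 1.1X per year, as quoted for single-threaded perf

print(f"1.5X per year: {gpu_years:.1f} years to 1000X")
print(f"1.1X per year: {cpu_years:.1f} years to 1000X")
```

At 1.5X per year the 1000X mark arrives in roughly 17 years; at 1.1X per year it would take more than 70.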
![Page 3: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/3.jpg)
EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science
[Diagram labels: ARTIFICIAL INTELLIGENCE; DATA SCIENCE; Deep Learning; Traditional Machine Learning (regressions, decision trees, graph); Analytics; DENSE DATA TYPES (images, video, voice); TABULAR/SPARSE DATA]
![Page 4: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/4.jpg)
USE CASES IN EVERY INDUSTRY
CONSUMER INTERNET
— Personalized recommendations to drive viewership
— Optimized ad targeting
— Preventing churn by identifying factors that influence loyalty
RETAIL
— Inventory forecasting
— Personalized recommendations
— Optimized pricing and promotions
— Preventing credit card fraud and cyber attacks
FINANCIAL SERVICES
— Personalized guidance on financial products
— Return optimization based on market signals
— Fraud detection
HEALTHCARE
— Better disease prediction with genomic medicine
— Improved health outcomes with analysis of EMRs
— Predictive care/treatment
![Page 5: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/5.jpg)
TODAY’S DATA SCIENCE STIFLES INNOVATION
[Diagram: data science pipeline. Manage Data: All Data, ETL, Structured Data Store. Training: Data Preparation, Model Training. Evaluate: Visualization. Deploy: Inference.]
Slow Training Times for Data Scientists
I.e.: HURRY UP AND WAIT
![Page 6: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/6.jpg)
![Page 7: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/7.jpg)
DATA SCIENCE CHALLENGES
SLOW TRAINING
30+ hours to build GBDT
SLOW DATA PROCESSING
Days: Data Transformation
Weeks: Feature Engineering
Months: Scoring Pipelines
ESCALATING TCO
More servers and infrastructure yielding diminishing performance returns
![Page 8: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/8.jpg)
XGBOOST
Definition
XGBoost is an implementation of gradient boosted decision trees designed for speed and performance.
It is a powerful tool for solving classification and regression problems in a supervised learning setting.
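To make "gradient boosted decision trees" concrete, here is a minimal pure-Python sketch of boosting for squared error, where each new one-split tree (a stump) is fit to the residuals left by the trees before it. This is plain gradient boosting on a single feature, not XGBoost's actual algorithm (which adds second-order gradients, regularization, and many optimizations). All function names and the toy data are made up for illustration.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree (a stump) on a single numeric feature."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]  # (threshold, left_value, right_value)


def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x < threshold else right_value


def fit_boosted(xs, ys, n_rounds=10, learning_rate=0.5):
    """Boosting for squared error: each stump fits the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def boosted_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data, made up for illustration: one feature, one numeric target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

baseline = fit_boosted(xs, ys, n_rounds=0)   # predicts the mean only
boosted = fit_boosted(xs, ys, n_rounds=10)
print("MSE with mean only:", round(mse(baseline, xs, ys), 3))
print("MSE after 10 rounds:", round(mse(boosted, xs, ys), 3))
```

The training error drops as rounds are added because every stump corrects part of what the previous ones got wrong; the learning rate controls how much of each correction is applied.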
![Page 9: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/9.jpg)
Source: https://goo.gl/eTxVtA
Example of Decision Tree
PREDICT: WHO ENJOYS COMPUTER GAMES
![Page 10: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/10.jpg)
Source: https://goo.gl/eTxVtA
Example of Using Ensembled Decision Trees
COMBINE TREES FOR STRONGER PREDICTIONS
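The idea of combining trees can be sketched as summing the score each tree assigns to the same input. The features, splits, and leaf scores below are illustrative assumptions in the spirit of the "who enjoys computer games" example; they are not transcribed from the slide image.

```python
def tree_age_gender(person):
    """First tree: splits on age, then gender. Leaf scores are illustrative."""
    if person["age"] < 15:
        return 2.0 if person["is_male"] else 0.1
    return -1.0


def tree_computer_use(person):
    """Second tree: splits on daily computer use. Leaf scores are illustrative."""
    return 0.9 if person["uses_computer_daily"] else -0.9


def ensemble_score(person, trees):
    """An additive ensemble sums the leaf score each tree gives the input."""
    return sum(tree(person) for tree in trees)


trees = [tree_age_gender, tree_computer_use]
boy = {"age": 10, "is_male": True, "uses_computer_daily": True}
grandpa = {"age": 70, "is_male": True, "uses_computer_daily": False}

print("boy:", ensemble_score(boy, trees))
print("grandpa:", ensemble_score(grandpa, trees))
```

A higher total score means a stronger prediction that the person enjoys computer games; neither tree alone separates the two people as clearly as the sum does.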
![Page 11: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/11.jpg)
RAPIDS OVERVIEW
![Page 12: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/12.jpg)
RAPIDS
GPU Accelerated Data Science
RAPIDS is a set of open source libraries
for GPU accelerating data preparation
and machine learning.
OSS website: http://www.rapids.ai/
![Page 13: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/13.jpg)
RE-IMAGINING DATA SCIENCE WORKFLOW
Open Source, End-to-end GPU-accelerated Workflow Built On CUDA
cuDF: Data preparation / wrangling
cuML: Optimized ML model training
Visualization: Data visualization libraries
Result: data insights
![Page 14: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/14.jpg)
RAPIDS LIBRARIES
cuDF
GPU accelerated software for data manipulation and data preparation.
Accelerates loading, filtering, and manipulation of data for model training data preparation.
Python drop-in Pandas replacement built on CUDA C++
cuML
GPU accelerated traditional machine learning libraries.
XGBoost, Kalman, K-means, KNN, DBScan, PCA, TSVD and more.
cuGRAPH
Collection of graph analytics libraries. Coming soon.
![Page 15: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/15.jpg)
RAPIDS: OPEN GPU DATA SCIENCE
Software Stack Python
Workflow stages: Data Preparation, Model Training, Visualization
[Stack diagram labels: PYTHON; DASK; RAPIDS (cuDF, cuML, cuGRAPH); DEEP LEARNING FRAMEWORKS; cuDNN; APACHE ARROW on GPU Memory; CUDA]
![Page 16: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/16.jpg)
THE RAPIDS VALUE PROPOSITION
High Performance, Easy-to-use
Data Scientist
Reduced Training Time: Drastically improve your productivity with near-interactive data science
Hassle-Free Integration: Accelerate your Python data science toolchain with minimal code changes and no new tools to learn
Open Source: Customizable, extensible, interoperable; the open-source software is supported by NVIDIA and built on Apache Arrow
Data Science Leader
Top Model Accuracy: Increase machine learning model accuracy by iterating on models faster and deploying them more frequently
TCO Reduction: Decrease the server costs, footprint, and power consumption of your ML workloads, reducing the TCO
Increased Data Scientist Productivity: Reduce training time, allowing data scientists to be more productive
![Page 17: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/17.jpg)
RAPIDS DEPLOYMENT STACK
TARGET INDUSTRIES
Retail, Finance, CICN, Healthcare
TARGET AUDIENCE AND RECOMMENDED SYSTEMS
Individual Data Scientist
Quadro GV100 WS: 2 GV100, NVLink
DGX Station: 4 V100, NVLink
Cloud: V100 Cloud Instances
Shared Infrastructure For Data Scientists
V100 Servers: 4-8 V100, NVLink, HGX-1, HGX-2
DGX-1: 8 V100, NVLink
DGX-2: 16 V100, NVLink
Cloud: V100 Cloud Instances
![Page 18: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/18.jpg)
PILLARS OF RAPIDS PERFORMANCE
CUDA Architecture: Massively parallel processing
NVLink/NVSwitch: High-speed connections between GPUs for distributed algorithms (NVSWITCH, 6x NVLINK)
Memory Architecture: Large virtual GPU memory, high-speed memory
[Diagram: stacked-memory cross-section with four DRAM core dies on a base die, labeled with TSV and µ-bump connections]
![Page 19: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/19.jpg)
DESIGNED TO DO THE PREVIOUSLY IMPOSSIBLE
NVIDIA Tesla V100 32 GB Tensor Core GPUs
Two GPU Boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB Total HBM2 Memory, interconnected by Plane Card
Twelve NVSwitches: 2.4 TB/sec bi-section bandwidth
Eight EDR Infiniband/100 GigE: 1600 Gb/sec Total Bi-directional Bandwidth
PCIe Switch Complex
Two Intel Xeon Platinum CPUs
1.5 TB System Memory
30 TB NVME SSDs Internal Storage
Dual 10/25/100 Gb/sec Ethernet
![Page 20: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/20.jpg)
NVSWITCH: THE REVOLUTIONARY AI NETWORK FABRIC
• Inspired by leading edge research that demands unrestricted model parallelism
• Like the evolution from dial-up to broadband, NVSwitch delivers a networking fabric for the future, today
• Delivering 2.4 TB/s bisection bandwidth, equivalent to a PCIe bus with 1,200 lanes
• NVSwitches on DGX-2 = all of Netflix HD <45s
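The PCIe comparison in the bullets above can be unpacked with a line of arithmetic: dividing the quoted bisection bandwidth by the quoted lane count gives the bandwidth each lane would need to carry. The sketch assumes decimal units (1 TB = 1000 GB); the slide does not say which PCIe generation or direction counting it has in mind, and the function name is hypothetical.

```python
BISECTION_TB_PER_S = 2.4       # quoted on the slide
EQUIVALENT_PCIE_LANES = 1200   # quoted on the slide


def per_lane_gb_per_s(total_tb_per_s, lanes):
    """Bandwidth each lane must carry, in GB/s, assuming 1 TB = 1000 GB."""
    return total_tb_per_s * 1000 / lanes


per_lane = per_lane_gb_per_s(BISECTION_TB_PER_S, EQUIVALENT_PCIE_LANES)
print(f"Implied bandwidth per lane: {per_lane:.2f} GB/s")
```

The result, about 2 GB/s per lane, is in the range of a PCIe 3.0 lane when both directions are counted (roughly 1 GB/s each way), which is one plausible reading of the comparison.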
![Page 21: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/21.jpg)
TRADITIONAL HPC CLUSTER
300 Servers
$3M
180 kW
![Page 22: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/22.jpg)
GPU-ACCELERATED HPC + AI CLUSTER
1 DGX-2
10 kW
1/8 the Cost
1/15 the Space
1/18 the Power
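The ratios on this slide can be cross-checked against the cluster figures on the previous one. The sketch below uses only numbers quoted in the slides (300 servers, $3M, 180 kW, 10 kW, "1/8 the Cost"); the implied DGX-2 cost is derived from the stated ratio and is not a price given in the source. The space ratio cannot be checked because no rack footprint is given.

```python
# Figures quoted on the slides.
cluster = {"servers": 300, "cost_usd": 3_000_000, "power_kw": 180}
dgx2_power_kw = 10
claimed_cost_fraction = 1 / 8   # "1/8 the Cost"

# Power: compare the two quoted power draws directly.
power_ratio = cluster["power_kw"] / dgx2_power_kw

# Cost: the slide gives only the ratio, so this is the cost it implies.
implied_dgx2_cost_usd = cluster["cost_usd"] * claimed_cost_fraction

print(f"Cluster draws {power_ratio:.0f}x the power of one DGX-2")
print(f"Implied DGX-2 cost at 1/8 of $3M: ${implied_dgx2_cost_usd:,.0f}")
```

The power figures agree with the "1/18 the Power" claim, since 180 kW divided by 10 kW is 18.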
![Page 23: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/23.jpg)
FASTER INSIGHTS FOR MACHINE LEARNING
DGX-2 544X Speedup Compared to CPU-Only Server Nodes
[Chart: process time in minutes (axis 0 to 3,500) for 1, 20, 30, 50, and 100 CPU instances and HGX-2, with bars split into cuIO/cuDF (Load and Data prep), Data Conversion, and XGBoost. Callout: 544X speedup.]
GPU measurements completed on DGX-2 running RAPIDS
CPU: 20 CPU cluster; comparison is prorated to 1 CPU (61 GB of memory, 8 vCPUs, 64-bit platform), Apache Spark
US Mortgage Data, Fannie Mae and Freddie Mac, 2006-2017 | 146M mortgages
Benchmark: 200GB CSV dataset | Data preparation includes joins, variable transformations
![Page 24: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/24.jpg)
FASTER SPEEDS, REAL WORLD BENEFITS
cuIO/cuDF: Load and Data Preparation (time in seconds, shorter is better)

| Configuration | Time (s) |
|---|---|
| 20 CPU Nodes | 2,290 |
| 30 CPU Nodes | 1,956 |
| 50 CPU Nodes | 1,999 |
| 100 CPU Nodes | 1,948 |
| DGX-2 | 169 |
| 5x DGX-1 | 157 |

cuML: XGBoost (time in seconds, shorter is better)

| Configuration | Time (s) |
|---|---|
| 20 CPU Nodes | 2,741 |
| 30 CPU Nodes | 1,675 |
| 50 CPU Nodes | 715 |
| 100 CPU Nodes | 379 |
| DGX-2 | 42 |
| 5x DGX-1 | 19 |

End-to-End
[Chart: stacked bars of cuIO/cuDF (Load and Data Preparation), Data Conversion, and XGBoost time for the same six configurations, axis 0 to 10,000 seconds; bar values are not present in the extracted text.]

Benchmark
200GB CSV dataset; Data preparation includes joins, variable transformations.
CPU Cluster Configuration
CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark
DGX Cluster Configuration
5x DGX-1 on InfiniBand network
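The per-stage speedups implied by these charts can be computed directly from the bar values. The sketch below takes the two groups of six values in the order of the configuration labels (20, 30, 50, and 100 CPU nodes, DGX-2, 5x DGX-1) and assumes the first group belongs to the load-and-data-preparation chart and the second to the XGBoost chart; that assignment is inferred from the extracted slide text. The `speedup` helper is hypothetical.

```python
CONFIGS = ["20 CPU Nodes", "30 CPU Nodes", "50 CPU Nodes",
           "100 CPU Nodes", "DGX-2", "5x DGX-1"]

# Times in seconds, as listed on the slide (chart assignment is an assumption).
data_prep_s = dict(zip(CONFIGS, [2290, 1956, 1999, 1948, 169, 157]))
xgboost_s = dict(zip(CONFIGS, [2741, 1675, 715, 379, 42, 19]))


def speedup(times, baseline, target):
    """How many times faster `target` is than `baseline` for one stage."""
    return times[baseline] / times[target]


for stage, times in [("Load and data prep", data_prep_s), ("XGBoost", xgboost_s)]:
    for baseline in ("20 CPU Nodes", "100 CPU Nodes"):
        ratio = speedup(times, baseline, "DGX-2")
        print(f"{stage}: DGX-2 is {ratio:.1f}x faster than {baseline}")
```

Under that reading, data preparation barely improves as CPU nodes are added (2,290 s to 1,948 s), while XGBoost training scales with node count but still trails a single DGX-2.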
![Page 25: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/25.jpg)
DGX POD FOR RAPIDS
RAPIDS.AI - Open GPU Data Science
![Page 26: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/26.jpg)
![Page 27: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/27.jpg)
PORTING EXISTING CODE
CPU vs GPU
Principal Component Analysis (PCA)
Training and query results:
• CPU: ~5 minutes
• GPU: ~7 seconds
[Code screenshots: Before… and …Now!]
![Page 28: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/28.jpg)
HOW? DOWNLOAD AND DEPLOY
Source code, libraries, packages
On-premises | Cloud
Source available on GitHub | Container available on NGC and Docker Hub | Conda and PIP
GitHub: https://github.com/rapidsai
NGC: https://ngc.nvidia.com
Docker Hub: https://hub.docker.com/u/rapidsai
Conda: https://anaconda.org/rapidsai
PIP available at a later date
![Page 29: Multi GPU training Part 3: Engineering Challenges of Multi ...on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8501.pdf · XGBOOST XGBoost is an implementation of gradient boosted decision](https://reader033.fdocuments.net/reader033/viewer/2022042310/5ed7a7bc48b98015c20212c8/html5/thumbnails/29.jpg)
ACCELERATING MACHINE LEARNING
The RAPIDS Ecosystem
Open Source Community
Enterprise Data Science Platforms
Startups
Deep Learning Integration
GPU Servers
Storage Partners