RAPIDS: GPU-POWERED MACHINE LEARNING
RISE OF GPU COMPUTING
[Chart: compute performance (log scale, 10^2 to 10^7) versus year, 1980-2020. GPU-computing performance grows 1.5X per year, reaching 1000X by 2025; single-threaded performance grew 1.5X per year before slowing to 1.1X per year.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data for 2010-2015 collected by K. Rupp.
[Diagram: the GPU computing stack: Applications, Systems, Algorithms, CUDA, Architecture]
EXTENDING DL → BIG DATA ANALYTICS
From Business Intelligence to Data Science
[Diagram: within data science and artificial intelligence, deep learning covers dense data types (images, video, voice), while traditional machine learning (regressions, decision trees, graph analytics) covers tabular/sparse data]
USE CASES IN EVERY INDUSTRY
CONSUMER INTERNET
— Personalized recommendations to drive viewership
— Optimized ad targeting
— Preventing churn by identifying factors that influence loyalty
RETAIL
— Inventory forecasting
— Personalized recommendations
— Optimized pricing and promotions
— Preventing credit card fraud and cyber attacks
FINANCIAL SERVICES
— Personalized guidance on financial products
— Return optimization based on market signals
— Fraud detection
HEALTHCARE
— Better disease prediction with genomic medicine
— Improved health outcomes with analysis of EMRs
— Predictive care/treatment
TODAY’S DATA SCIENCE STIFLES INNOVATION
[Workflow diagram: All Data → ETL → Manage Data (structured data store) → Data Preparation → Model Training → Visualization → Evaluate → Inference → Deploy]
Slow training times for data scientists, i.e., hurry up and wait.
DATA SCIENCE CHALLENGES
SLOW TRAINING
30+ hours to build a GBDT
SLOW DATA PROCESSING
Days for data transformation, weeks for feature engineering, months for scoring pipelines
ESCALATING TCO
More servers and infrastructure yielding diminishing performance returns
XGBOOST
Definition: XGBoost is an implementation of gradient-boosted decision trees designed for speed and performance. It is a powerful tool for solving classification and regression problems in a supervised learning setting.
Source: https://goo.gl/eTxVtA
Example of Decision Tree
PREDICT: WHO ENJOYS COMPUTER GAMES
Source: https://goo.gl/eTxVtA
Example of Using Ensembled Decision Trees
COMBINE TREES FOR STRONGER PREDICTIONS
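The "combine trees" idea above can be sketched in plain Python: gradient boosting fits each new weak tree (here, a one-split decision stump) to the residuals of the ensemble so far, so the summed predictions grow steadily stronger. This is a toy illustration of the principle only, not XGBoost's actual algorithm (no regularization, second-order gradients, or deep trees); the function names are made up for this sketch.

```python
# Toy gradient boosting with decision stumps (one-split trees),
# illustrating why combining many weak trees gives strong predictions.
# A sketch of the principle only -- not XGBoost's actual algorithm.

def fit_stump(xs, ys):
    """Fit a one-split regression tree minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((y - (lm if x < t else rm)) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x < t else rm

def boost(xs, ys, rounds=100, lr=0.5):
    """Additively combine stumps, each fit to the current residuals."""
    stumps, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + lr * stump(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * s(x) for s in stumps)

# A two-step "staircase" target that no single stump can represent:
xs = [i / 10 for i in range(-20, 21)]
ys = [0.0 if x < -1 else 1.0 if x < 1 else 2.0 for x in xs]
model = boost(xs, ys)
```

A single stump can only model one threshold, but the boosted sum of stumps recovers both steps; XGBoost applies the same additive idea with full trees, gradient information, and regularization.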
RAPIDS OVERVIEW
RAPIDS
GPU Accelerated Data Science
RAPIDS is a suite of open-source libraries for GPU-accelerated data preparation and machine learning.
OSS website: http://www.rapids.ai/
RE-IMAGINING THE DATA SCIENCE WORKFLOW
Open Source, End-to-End GPU-Accelerated Workflow Built on CUDA
[Workflow diagram: data preparation/wrangling (cuDF) → optimized ML model training (cuML) → data visualization libraries → data insights]
RAPIDS LIBRARIES
cuDF: GPU-accelerated software for data manipulation and data preparation. Accelerates loading, filtering, and transforming data for model training. A Python drop-in replacement for pandas, built on CUDA C++.
cuML: GPU-accelerated traditional machine learning libraries: XGBoost, Kalman filtering, k-means, KNN, DBSCAN, PCA, TSVD, and more.
cuGraph: a collection of graph analytics libraries. Coming soon.
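The cuDF "drop-in replacement" claim above means code written against the pandas API can typically move to the GPU by swapping the import. The sketch below runs on plain pandas; with RAPIDS installed, changing the first line to `import cudf as pd` would run the same join and aggregation on the GPU (assuming the operations used fall within cuDF's pandas API coverage).

```python
import pandas as pd  # with RAPIDS installed: `import cudf as pd` for the GPU

# A small transactions table and a typical data-prep step:
# join against a lookup table, then aggregate per group.
sales = pd.DataFrame({
    "store": ["a", "a", "b", "b", "c"],
    "amount": [10.0, 20.0, 5.0, 15.0, 30.0],
})
regions = pd.DataFrame({
    "store": ["a", "b", "c"],
    "region": ["east", "east", "west"],
})

merged = sales.merge(regions, on="store", how="left")
totals = merged.groupby("region")["amount"].sum().sort_index()
print(totals.to_dict())  # {'east': 50.0, 'west': 30.0}
```

Because cuDF mirrors the pandas API, pipelines built from joins, filters, and group-bys like this are the "minimal code changes" case RAPIDS targets.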
RAPIDS — OPEN GPU DATA SCIENCE
Software Stack (Python)
[Stack diagram: Python APIs for data preparation (cuDF), model training (cuML), and graph analytics/visualization (cuGraph); Dask for scale-out; deep learning frameworks via cuDNN; Apache Arrow on GPU memory; all built on CUDA]
THE RAPIDS VALUE PROPOSITION
High Performance, Easy to Use

For the data scientist:
— Reduced training time: drastically improve your productivity with near-interactive data science
— Hassle-free integration: accelerate your Python data science toolchain with minimal code changes and no new tools to learn
— Open source: customizable, extensible, and interoperable; the open-source software is supported by NVIDIA and built on Apache Arrow

For the data science leader:
— Top model accuracy: increase machine learning model accuracy by iterating on models faster and deploying them more frequently
— TCO reduction: decrease the server costs, footprint, and power consumption of your ML workloads
— Increased data scientist productivity: reduce training time so data scientists can be more productive
RAPIDS DEPLOYMENT STACK
TARGET INDUSTRIES
Retail | Finance | Consumer Internet | Healthcare
TARGET AUDIENCE AND RECOMMENDED SYSTEMS
Individual data scientist:
— Quadro GV100 workstation: 2x GV100, NVLink
— DGX Station: 4x V100, NVLink
— Cloud: V100 cloud instances
Shared infrastructure for data scientists:
— V100 servers: 4-8x V100, NVLink, HGX-1, HGX-2
— DGX-1: 8x V100, NVLink
— DGX-2: 16x V100, NVLink
— Cloud: V100 cloud instances
PILLARS OF RAPIDS PERFORMANCE
— CUDA architecture: massively parallel processing
— NVLink/NVSwitch: high-speed interconnect between GPUs for distributed algorithms (NVSwitch, 6x NVLink)
— Memory architecture: large virtual GPU memory, high-speed memory
[Diagram: stacked HBM memory: DRAM core dies on a base die, connected by TSVs and micro-bumps]
DESIGNED TO DO THE PREVIOUSLY IMPOSSIBLE
[Annotated diagram: DGX-2 components]
— NVIDIA Tesla V100 32 GB Tensor Core GPUs
— Two GPU boards: 8x V100 32 GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by a plane card
— Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
— Eight EDR InfiniBand/100 GigE: 1600 Gb/sec total bi-directional bandwidth
— PCIe switch complex
— Two Intel Xeon Platinum CPUs
— 1.5 TB system memory
— 30 TB NVMe SSD internal storage
— Dual 10/25/100 Gb/sec Ethernet
NVSWITCH: THE REVOLUTIONARY AI NETWORK FABRIC
• Inspired by leading-edge research that demands unrestricted model parallelism
• Like the evolution from dial-up to broadband, NVSwitch delivers a networking fabric for the future, today
• Delivers 2.4 TB/s bisection bandwidth, equivalent to a PCIe bus with 1,200 lanes
• The NVSwitches on a DGX-2 could move all of Netflix HD in under 45 seconds
TRADITIONAL HPC CLUSTER
300 servers | $3M | 180 kW

GPU-ACCELERATED HPC + AI CLUSTER
1 DGX-2 | 10 kW
1/8 the cost | 1/15 the space | 1/18 the power
FASTER INSIGHTS FOR MACHINE LEARNING
DGX-2: 544X Speedup Compared to CPU-Only Server Nodes
[Bar chart: process time in minutes for 1, 20, 30, 50, and 100 CPU instances versus HGX-2, broken down into cuIO/cuDF (load and data prep), data conversion, and XGBoost; 544X speedup overall]
GPU measurements completed on DGX-2 running RAPIDS. CPU baseline: 20-CPU cluster, prorated to 1 CPU (61 GB of memory, 8 vCPUs, 64-bit platform), Apache Spark.
US mortgage data from Fannie Mae and Freddie Mac, 2006-2017, 146M mortgages. Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations.
FASTER SPEEDS, REAL WORLD BENEFITS
[Bar charts: time in seconds, shorter is better; stages are cuIO/cuDF (load and data preparation), data conversion, and XGBoost]

cuML — XGBoost: 20 CPU nodes 2,290; 30 CPU nodes 1,956; 50 CPU nodes 1,999; 100 CPU nodes 1,948; DGX-2 169; 5x DGX-1 157
cuIO/cuDF — load and data preparation: 20 CPU nodes 2,741; 30 CPU nodes 1,675; 50 CPU nodes 715; 100 CPU nodes 379; DGX-2 42; 5x DGX-1 19
End-to-end: same systems (chart values not recoverable; axis 0-10,000 seconds)

Benchmark: 200 GB CSV dataset; data preparation includes joins and variable transformations.
CPU cluster configuration: CPU nodes (61 GiB of memory, 8 vCPUs, 64-bit platform), Apache Spark.
DGX cluster configuration: 5x DGX-1 on an InfiniBand network.
DGX POD FOR RAPIDS
RAPIDS.AI - Open GPU Data Science
CPU VS. GPU: PORTING EXISTING CODE
Principal Component Analysis (PCA)
[Side-by-side code: before (CPU) and now (GPU)]
Training and query results:
• CPU: ~5 minutes
• GPU: ~7 seconds
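The port shown on this slide is, in practice, mostly an import swap: both sklearn.decomposition.PCA and cuml.PCA expose the same fit/transform interface, so the GPU version reuses the CPU code body. As a self-contained illustration of what that shared interface computes, here is a minimal NumPy PCA with the same fit_transform shape (an illustrative sketch, not cuML's implementation; TinyPCA is a made-up name):

```python
import numpy as np

class TinyPCA:
    """Minimal PCA exposing the fit_transform interface that both
    sklearn.decomposition.PCA and cuml.PCA share; porting CPU code to
    RAPIDS is then typically just swapping the import to cuml."""

    def __init__(self, n_components):
        self.n_components = n_components

    def fit_transform(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        Xc = X - self.mean_
        # SVD of the centered data: rows of Vt are the principal axes
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        self.components_ = Vt[: self.n_components]
        return Xc @ self.components_.T

# Project 3-D points that actually lie in a 2-D subspace
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 1], base[:, 0] + base[:, 1]])
pca = TinyPCA(n_components=2)
Z = pca.fit_transform(X)
```

Because the third column is the sum of the first two, two components reconstruct the data exactly; the CPU-to-GPU speedup on the slide comes from running this same linear algebra on thousands of CUDA cores.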
HOW? DOWNLOAD AND DEPLOY
Cloud or on-premises: source code, libraries, packages.
Source available on GitHub | Containers available on NGC and Docker Hub | Conda packages available; pip at a later date
https://github.com/rapidsai
https://ngc.nvidia.com
https://hub.docker.com/u/rapidsai
https://anaconda.org/rapidsai
ACCELERATING MACHINE LEARNING
The RAPIDS Ecosystem
Open source community | Enterprise data science platforms | Startups | Deep learning integration | GPU servers | Storage partners
https://rapids.ai