The Platform Inside and Out Release 0 - RAPIDS Docs 0.9 Release Deck.pdf · Scikit-Learn -> cuML...

Joshua Patterson – GM, Data Science

The Platform Inside and OutRelease 0.9

2

RAPIDSEnd-to-End Accelerated GPU Data Science

cuDF cuIOAnalytics

GPU Memory

Data Preparation VisualizationModel Training

cuMLMachine Learning

cuGraphGraph Analytics

PyTorch Chainer MxNetDeep Learning

cuXfilter <> pyVizVisualization

Dask

3

Data Processing EvolutionFaster data access, less data movement

25-100x ImprovementLess code

Language flexiblePrimarily In-Memory

HDFS Read

HDFS Write

HDFS Read

HDFS Write

HDFS ReadQuery ETL ML Train

HDFS Read Query ETL ML Train

HDFS Read

GPU ReadQuery CPU

WriteGPU Read ETL CPU

WriteGPU Read

MLTrain

5-10x ImprovementMore code

Language rigidSubstantially on GPU

Traditional GPU Processing

Hadoop Processing, Reading from disk

Spark In-Memory Processing

4

APP A

Data Movement and TransformationData Movement and TransformationThe bane of productivity and performance

CPU

APP B

Copy & Convert

Copy & Convert

Copy & Convert

APP A GPU Data

APP BGPU Data

Read Data

Load Data

APP B

APP A

GPU

5

Data Movement and TransformationData Movement and TransformationWhat if we could keep data on the GPU?

APP A

APP B

Copy & Convert

Copy & Convert

Copy & Convert

Read Data

Load Data

CPU

APP A GPU Data

APP BGPU Data

APP B

APP A

GPUCopy & Convert

6

Learning from Apache Arrow

From Apache Arrow Home Page - https://arrow.apache.org/

● Each system has its own internal memory format● 70-80% computation wasted on serialization and deserialization● Similar functionality implemented in multiple projects

● All systems utilize the same memory format● No overhead for cross-system communication● Projects can share functionality (eg, Parquet-to-Arrow

reader)

7

Data Processing EvolutionFaster data access, less data movement

25-100x ImprovementLess code

Language flexiblePrimarily In-Memory

HDFS Read

HDFS Write

HDFS Read

HDFS Write

HDFS ReadQuery ETL ML Train

HDFS Read Query ETL ML Train

HDFS Read

GPU ReadQuery CPU

WriteGPU Read ETL CPU

WriteGPU Read

MLTrain

ArrowRead ETL ML

Train

5-10x ImprovementMore code

Language rigidSubstantially on GPU

50-100x ImprovementSame code

Language flexiblePrimarily on GPU

RAPIDS

Traditional GPU Processing

Hadoop Processing, Reading from disk

Spark In-Memory Processing

Query

8

Faster Speeds, Real-World BenefitscuIO/cuDF – Load and Data Preparation cuML - XGBoost

Time in seconds (shorter is better)

cuIO/cuDF (Load and Data Prep) Data Conversion XGBoost

Benchmark

200GB CSV dataset; Data prep includes joins, variable transformations

CPU Cluster ConfigurationCPU nodes (61 GiB memory, 8 vCPUs, 64-bit platform), Apache Spark

DGX Cluster Configuration5x DGX-1 on InfiniBand network

8762

6148

3925

3221

322

213

End-to-End

9

Speed, UX, and IterationThe Way to Win at Data Science

10

RAPIDS Core

11

PandasAnalytics

CPU Memory


Scikit-LearnMachine Learning

NetworkXGraph Analytics


Matplotlib/SeabornVisualization

Open Source Data Science EcosystemFamiliar Python APIs

Dask

12

cuDF cuIOAnalytics

GPU Memory






RAPIDSEnd-to-End Accelerated GPU Data Science

Dask

13

Dask

14

cuDF cuIOAnalytics

GPU Memory






RAPIDSScaling RAPIDS with Dask

Dask

15

Why Dask?

• Easy Migration: Built on top of NumPy, Pandas Scikit-Learn, etc.

• Easy Training: With the same APIs• Trusted: With the same developer community

PyData Native

• Easy to install and use on a laptop• Scales out to thousand-node clusters

Easy Scalability

• Most common parallelism framework today in the PyData and SciPy community

Popular

• HPC: SLURM, PBS, LSF, SGE• Cloud: Kubernetes• Hadoop/Spark: Yarn

Deployable

16

Why OpenUCX?

• TCP sockets are slow!

• UCX provides uniform access to transports (TCP, InfiniBand, shared memory, NVLink)

• Python bindings for UCX (ucx-py) in the works https://github.com/rapidsai/ucx-py

• Will provide best communication performance, to Dask based on available hardware on nodes/cluster

Bringing hardware accelerated communications to Dask

https://github.com/rapidsai/ucx-py

17

Scale up with RAPIDS

Accelerated on single GPU

NumPy -> CuPy/PyTorch/..Pandas -> cuDFScikit-Learn -> cuMLNumba -> Numba

RAPIDS and Others

NumPy, Pandas, Scikit-Learn, Numba and many more

Single CPU coreIn-memory data

PyData

Scal

e U

p /

Acce

lera

te

18

Scale out with RAPIDS + Dask with OpenUCX

Accelerated on single GPU

NumPy -> CuPy/PyTorch/..Pandas -> cuDFScikit-Learn -> cuMLNumba -> Numba

RAPIDS and Others

Multi-GPUOn single Node (DGX)Or across a cluster

RAPIDS + Dask with OpenUCX

Scal

e U

p /

Acce

lera

te

Scale out / Parallelize

NumPy, Pandas, Scikit-Learn, Numba and many more

Single CPU coreIn-memory data

PyDataMulti-core and Distributed PyData

NumPy -> Dask ArrayPandas -> Dask DataFrameScikit-Learn -> Dask-ML… -> Dask Futures

Dask

19

cuDF

20

cuDF cuIOAnalytics

GPU Memory






RAPIDSGPU Accelerated data wrangling and feature engineering

Dask

21

GPU-Accelerated ETLThe average data scientist spends 90+% of their time in ETL as opposed to training

models

22

ETL - the Backbone of Data SciencelibcuDF is…

CUDA C++ Library

● Low level library containing function implementations and C/C++ API

● Importing/exporting Apache Arrow in GPU memory using CUDA IPC

● CUDA kernels to perform element-wise math operations on GPU DataFrame columns

● CUDA sort, join, groupby, reduction, etc. operations on GPU DataFrames

23

ETL - the Backbone of Data SciencecuDF is…

Python Library

● A Python library for manipulating GPU DataFrames following the Pandas API

● Python interface to CUDA C++ library with additional functionality

● Creating GPU DataFrames from Numpy arrays, Pandas DataFrames, and PyArrow Tables

● JIT compilation of User-Defined Functions (UDFs) using Numba

24

cuDF v0.9, Pandas 0.24.2

Running on NVIDIA DGX-1:

GPU: NVIDIA Tesla V100 32GBCPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz

Benchmark Setup:

DataFrames: 2x int32 columns key columns, 3x int32 value columns

Merge: inner

GroupBy: count, sum, min, max calculated for each value column

Benchmarks: single-GPU Speedup vs. Pandas

25

cuDF cuIOAnalytics

GPU Memory






Dask

ETL - the Backbone of Data SciencecuDF is not the end of the story

26

ETL - the Backbone of Data ScienceString Support

•Regular Expressions•Element-wise operations

• Split, Find, Extract, Cat, Typecasting, etc…•String GroupBys, Joins•Categorical columns fully on GPU

Current v0.9 String Support

• Combining cuStrings into libcudf• Extensive performance optimization• More Pandas String API compatibility• JIT-compiled String UDFs

Future v0.10+ String Support

27

• Follow Pandas APIs and provide >10x speedup

• CSV Reader - v0.2, CSV Writer v0.8

• Parquet Reader – v0.7, Parquet Writer v0.10

• ORC Reader – v0.7, ORC Writer v0.10

• JSON Reader - v0.8

• Avro Reader - v0.9

• GPU Direct Storage integration in progress for bypassing PCIe bottlenecks!

• Key is GPU-accelerating both parsing and decompression wherever possible Source: Apache Crail blog: SQL Performance: Part 1 - Input File Formats

Extraction is the CornerstonecuIO for Faster Data Loading

http://crail.incubator.apache.org/blog/2018/08/sql-p1.html

28

ETL is not just DataFrames!

29

GPU Memory


RAPIDSBuilding bridges into the array ecosystem

Dask

cuDF cuIOAnalytics





30

Interoperability for the WinDLPack and __cuda_array_interface__

mpi4py

31

Interoperability for the WinDLPack and __cuda_array_interface__

mpi4py

32

ETL – Arrays and DataFramesDask and CUDA Python arrays

• Scales NumPy to distributed clusters• Used in climate science, imaging, HPC analysis

up to 100TB size• Now seamlessly accelerated with GPUs

33

Benchmark: single-GPU CuPy vs NumPy

More details: https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks

https://blog.dask.org/2019/06/27/single-gpu-cupy-benchmarks

34

SVD BenchmarkDask and CuPy Doing Complex Workflows

35

Architecture Time

Single CPU Core 2hr 39min

Forty CPU Cores 11min 30s

One GPU 1min 37s

Eight GPUs 19s

Also…Achievement Unlocked: Petabyte Scale Data Analytics with Dask and CuPy

Cluster configuration: 20x GCP instances, each instance has:CPU: 1 VM socket (Intel Xeon CPU @ 2.30GHz), 2-core, 2 threads/core, 132GB mem, GbE ethernet, 950 GB diskGPU: 4x NVIDIA Tesla P100-16GB-PCIe (total GPU DRAM across nodes 1.22 TB)Software: Ubuntu 18.04, RAPIDS 0.5.1, Dask=1.1.1, Dask-Distributed=1.1.1, CuPY=5.2.0, CUDA 10.0.130

https://blog.dask.org/2019/01/03/dask-array-gpus-first-steps

36

ETL – Arrays and DataFramesMore Dask Awesomeness from RAPIDS

https://youtu.be/gV0cykgsTPM https://youtu.be/R5CiXti_MWo

37

cuML

38

GPU Memory


Dask

Machine LearningMore models more problems

cuDF cuIOAnalytics





39

ProblemData sizes continue to grow

Histograms / Distributions

Dimension ReductionFeature Selection

Remove Outliers

Sampling

Massive Dataset

Better to start with as much data aspossible and explore / preprocess to scaleto performance needs.

Iterate. Cross Validate & Grid Search.

Iterate some more.

Meet reasonable speed vs accuracy tradeoff

Hours? Days?

TimeIncreases

40

ML Technology Stack

Python

Cython

cuML Algorithms

cuML Prims

CUDA Libraries

CUDA

Dask cuMLDask cuDF

cuDFNumpy

ThrustCub

cuSolvernvGraphCUTLASScuSparsecuRandcuBlas

41

AlgorithmsGPU-accelerated Scikit-Learn

Classification / Regression

Inference

Clustering

Decomposition & Dimensionality Reduction

Time Series

Decision Trees / Random ForestsLinear RegressionLogistic RegressionK-Nearest Neighbors

Random forest / GBDT inference

K-MeansDBSCANSpectral Clustering

Principal ComponentsSingular Value DecompositionUMAPSpectral Embedding

Holt-WintersKalman Filtering

Cross Validation

More to come!

Hyper-parameter TuningKey:

● Preexisting● NEW for 0.9

42

RAPIDS matches common Python APIs

from sklearn.cluster import DBSCANdbscan = DBSCAN(eps = 0.3, min_samples = 5)

dbscan.fit(X)

y_hat = dbscan.predict(X)

Find Clusters

from sklearn.datasets import make_moonsimport pandas

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)

X = pandas.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})

CPU-Based Clustering

43

RAPIDS matches common Python APIs

from cuml import DBSCANdbscan = DBSCAN(eps = 0.3, min_samples = 5)

dbscan.fit(X)

y_hat = dbscan.predict(X)

Find Clusters

from sklearn.datasets import make_moonsimport cudf

X, y = make_moons(n_samples=int(1e2), noise=0.05, random_state=0)

X = cudf.DataFrame({'fea%d'%i: X[:, i] for i in range(X.shape[1])})

GPU-Accelerated Clustering

44

Benchmarks: single-GPU cuML vs scikit-learn

1x V100 vs 2x 20 core CPU

45

cuML’s Forest Inference Library accelerates prediction (inference) for random forests and boosted decision trees:

● Works with existing saved models (XGBoost and LightGBM today, scikit-learn RF and cuML RF soon)

● Lightweight Python API● Single V100 GPU can infer up to 34x

faster than XGBoost dual-CPU node● Over 100 million forest inferences

per sec (with 1000 trees) on a DGX-1

Forest InferenceTaking models from training to production

23x 36x 34x 23x

46

Road to 1.0 August 2019 - RAPIDS 0.9

cuML Single-GPU Multi-GPU Multi-Node-Multi-GPU

Gradient Boosted Decision Trees (GBDT)

GLM

Logistic Regression

Random Forest

K-Means

K-NN

DBSCAN

UMAP

Holt-Winters

Kalman Filter

t-SNE

Principal Components

Singular Value Decomposition

47

Road to 1.0 March 2020 - RAPIDS 0.14

cuML Single-GPU Multi-GPU Multi-Node-Multi-GPU

Gradient Boosted Decision Trees (GBDT)

GLM

Logistic Regression

Random Forest

K-Means

K-NN

DBSCAN

UMAP

ARIMA & Holt-Winters

Kalman Filter

t-SNE

Principal Components

Singular Value Decomposition

48

cuGraph

49

GPU Memory


Dask

Graph AnalyticsMore connections more insights

cuDF cuIOAnalytics





50

GOALS AND BENEFITS OF CUGRAPHFocus on Features and User Experience

• Property Graph support via DataFrames

Seamless Integration with cuDF and cuML

• Up to 500 million edges on a single 32GB GPU• Multi-GPU support for scaling into the billions

of edges

Breakthrough Performance

• Python: Familiar NetworkX-like API• C/C++: lower-level granular control for

application developers

Multiple APIs

• Extensive collection of algorithm, primitive, and utility functions

Growing Functionality

51

Graph Technology Stack

Python

Cython

cuGraph Algorithms

Prims

CUDA Libraries

CUDA

Dask cuGraphDask cuDF

cuDFNumpy

thrustcub

cuSolvercuSparsecuRand

Gunrock*

cuGraphBLAS cuHornet

nvGRAPH has been Opened Sourced and integrated into cuGraph. A legacy version is available in a RAPIDS GitHub repo * Gunrock is from UC Davis

52

AlgorithmsGPU-accelerated NetworkX

Community

Components

Link Analysis

Link Prediction

Traversal

Structure

Spectral ClusteringBalanced-CutModularity Maximization

LouvainSubgraph ExtractionTriangle Counting

JaccardWeighted JaccardOverlap Coefficient

Single Source Shortest Path (SSSP)Breadth First Search (BFS)

COO-to-CSR (Multi-GPU)TransposeRenumbering

Multi-GPU

More to come!

Utilities

Weakly Connected ComponentsStrongly Connected Components

Page Rank (Multi-GPU)Personal Page Rank

Query Language

53

Louvain Single Run

Dataset Nodes Edges

preferentialAttachment 100,000 999,970

caidaRouterLevel 192,244 1,218,132

coAuthorsDBLP 299,067 299,067

dblp-2010 326,186 1,615,400

citationCiteseer 268,495 2,313,294

coPapersDBLP 540,486 30,491,458

coPapersCiteseer 434,102 32,073,440

as-Skitter 1,696,415 22,190,596

Louvain returns:cudf.DataFrame with two names columns: louvain["vertex"]: The vertex id. louvain["partition"]: The assigned partition.

G = cugraph.Graph()G.add_edge_list(gdf["src_0"], gdf["dst_0"], gdf["data"])df, mod = cugraph.nvLouvain(G)

54

Multi-GPU PageRank Performance

PageRank portion of the HiBench benchmark suite

HiBench Scale Vertices Edges CSV File (GB)

# of GPUs PageRank for3 Iterations (secs)

Huge 5,000,000 198,000,000 3 1 1.1

BigData 50,000,000 1,980,000,000 34 3 5.1

BigData x2 100,000,000 4,000,000,000 69 6 9.0

BigData x4 200,000,000 8,000,000,000 146 12 18.2

BigData x8 400,000,000 16,000,000,000 300 16 31.8

55

Road to 1.0 August 2019 - RAPIDS 0.9

cuGraph Single-GPU Multi-GPU Multi-Node-Multi-GPU

Jaccard and Weighted Jaccard

Page Rank

Personal Page Rank

SSSP

BFS

Triangle Counting

Subgraph Extraction

Katz Centrality

Betweenness Centrality

Connected Components (Weak and Strong)

Louvain

Spectral Clustering

InfoMap

K-Cores

56

Road to 1.0 March 2020 - RAPIDS 0.14

cuGraph Single-GPU Multi-GPU Multi-Node-Multi-GPU

Jaccard and Weighted Jaccard

Page Rank

Personal Page Rank

SSSP

BFS

Triangle Counting

Subgraph Extraction

Katz Centrality

Betweenness Centrality

Connected Components (Weak and Strong)

Louvain

Spectral Clustering

InfoMap

K-Cores

57

Community

58

Ecosystem Partners

59

Building on top of RAPIDSA bigger, better, stronger ecosystem for all

Streamz

High-Performance Serverless event and data processing that utilizes RAPIDS for GPU Acceleration

Distributed stream processing using RAPIDS and Dask

GPU accelerated SQL engine built on top of RAPIDS

60

Deploy RAPIDS EverywhereFocused on robust functionality, deployment, and user experience

Integration with major cloud providersBoth containers and cloud specific machine instancesSupport for Enterprise and HPC Orchestration Layers

Cloud Dataproc Azure Machine Learning

61

5 Steps to getting started with RAPIDS

1. Install RAPIDS on using Docker, Conda, or Colab

2. Join our community conversations on Slack, Google, and Twitter

3. Explore our walk through videos, blog content, our github, the tutorial

notebooks, and our examples workflows,

4. Build your own data science workflows.

5. Contribute back. Don't forget to ask and answer questions on Stack Overflow

https://rapids.ai/start.html

https://colab.research.google.com/drive/1XTKHiIcvyL5nuldx0HSL_dUa8yopzy_Y#forceEdit=true&offline=true&sandboxMode=true

https://join.slack.com/t/rapids-goai/shared_invite/enQtMjE0Njg5NDQ1MDQxLTViZWFiYTY5MDA4NWY3OWViODg0YWM1MGQ1NzgzNTQwOWI1YjE3NGFlOTVhYjQzYWQ4YjI4NzljYzhiOGZmMGM

https://github.com/rapidsai/rapidsai-staging/issues/[email protected]

https://twitter.com/rapidsai

https://www.youtube.com/channel/UCsoi4wfweA3I5FsPgyQnnqw?view_as=subscriber

https://medium.com/rapids-ai

https://github.com/rapidsai

https://github.com/rapidsai/notebooks-extended#getting-started-notebooks

https://github.com/rapidsai/notebooks-extended#getting-started-notebooks

https://github.com/rapidsai/notebooks-extended#intermediate-notebooks

https://stackoverflow.com/tags/rapids

62

Easy InstallationInteractive Installation Guide

63

Join the Conversation

64

Explore: RAPIDS Githubhttps://github.com/rapidsai

65

Explore: RAPIDS DocsImproved and easier to use!

https://docs.rapids.ai

66

Explore: RAPIDS Code and BlogsCheck out our code and how we use it

https://github.com/rapidsai https://medium.com/rapids-ai


https://medium.com/rapids-ai

67

Explore: Notebooks ContribNotebooks Contrib Repo has tutorials and examples, and various E2E demos. RAPIDS

Youtube channel has explanations, code walkthroughs and use cases.

68

Contribute BackIssues, feature requests, PRs, Blogs, Tutorials, Videos, QA...bring your best!

69

Getting Started

70

RAPIDS DocsNew, improved, and easier to use

https://docs.rapids.ai

71

RAPIDS DocsEasier than ever to get started with cuDF

72

• https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai

• https://hub.docker.com/r/rapidsai/rapidsai/

• https://github.com/rapidsai

• https://anaconda.org/rapidsai/

RAPIDSHow do I get the software?

https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai

https://ngc.nvidia.com/registry/nvidia-rapidsai-rapidsai

https://hub.docker.com/r/rapidsai/rapidsai/


https://anaconda.org/rapidsai/

73

Join the MovementEveryone can help!

Integrations, feedback, documentation support, pull requests, new issues, or code donations welcomed!

APACHE ARROW GPU Open Analytics Initiative

https://arrow.apache.org/

@ApacheArrow

http://gpuopenanalytics.com/

@GPUOAI

RAPIDS

https://rapids.ai

@RAPIDSAI

Dask

https://dask.org

@Dask_dev

THANK YOU

Joshua Patterson @datametrician

[email protected]

The Platform Inside and Out Release 0 - RAPIDS Docs 0.9 Release Deck.pdf · Scikit-Learn -> cuML...

Documents

Transcript of The Platform Inside and Out Release 0 - RAPIDS Docs 0.9 Release Deck.pdf · Scikit-Learn -> cuML...