
GPU Computing with Apache Spark and Python

Stan Seibert, Siu Kwan Lam

April 5, 2016

My Background

• Trained in particle physics

• Using Python for data analysis for 10 years

• Using GPUs for data analysis for 7 years

• Currently lead the High Performance Python team at Continuum


About Continuum Analytics

• We give superpowers to people who change the world!

• We want to help everyone analyze their data with Python (and other tools), so we offer:

• Enterprise Products
• Consulting
• Training
• Open Source


• I’m going to use Anaconda throughout this presentation.

• Anaconda is a free Mac/Win/Linux Python distribution:
  • Based on conda, an open source package manager
  • Installs both Python and non-Python dependencies
  • Easiest way to get the software I will talk about today

• https://www.continuum.io/downloads


Overview

1. Why Python?

2. Numba: A Python JIT compiler for the CPU and GPU

3. PySpark: Distributed computing for Python

4. Example: Image Registration

5. Tips and Tricks

6. Conclusion

WHY PYTHON?


Why is Python so popular?

• Straightforward, productive language for system administrators, programmers, scientists, analysts and hobbyists

• Great community:

• Lots of tutorial and reference materials

• Easy to interface with other languages

• Vast ecosystem of useful libraries


… But, isn't Python slow?

• Pure, interpreted Python is slow.

• Python excels at interfacing with other languages used in HPC:

  C: ctypes, CFFI, Cython
  C++: Cython, Boost.Python
  FORTRAN: f2py

• Secret: Most scientific Python packages put the speed-critical sections of their algorithms in a compiled language.

Is there another way?

• Switching languages for speed in your projects can be a little clunky:

• Sometimes tedious boilerplate for translating data types across the language barrier

• Generating compiled functions for the wide range of data types can be difficult

• How can we use cutting edge hardware, like GPUs?

NUMBA: A PYTHON JIT COMPILER


Compiling Python


• Numba is an open-source, type-specializing compiler for Python functions

• Can translate Python syntax into machine code if all type information can be deduced when the function is called.

• Implemented as a module. Does not replace the Python interpreter!

• Code generation done with:

• LLVM (for CPU)

• NVVM (for CUDA GPUs).

Supported Platforms

OS: Windows (7 and later), OS X (10.9 and later), Linux (~RHEL 5 and later); experimental support for ARMv7 (Raspberry Pi 2)

HW: 32- and 64-bit x86 CPUs, CUDA-capable NVIDIA GPUs, HSA-capable AMD GPUs

SW: Python 2 and 3, NumPy 1.7 through 1.10

How Does Numba Work?

Pipeline (from the diagram): Python function (bytecode) → bytecode analysis → Numba IR → type inference (using the argument types) → rewrite IR → lowering → LLVM IR → LLVM/NVVM JIT → machine code → execute. Compiled machine code is cached so later calls skip compilation.

@jit
def do_math(a, b):
    …

>>> do_math(x, y)

Numba on the CPU

The slide stepped through a jitted NumPy function, annotated: array allocation; looping over ndarray x as an iterator; using numpy math functions; returning a slice of the array; the Numba decorator (nopython=True not required). Result: 2.7x speedup!
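The slide showed a screenshot of the code; here is a minimal sketch in the same spirit (the function body is illustrative, reconstructed from the annotations above, not the original code):

import numpy as np
from numba import jit

@jit(nopython=True)            # plain @jit works too; nopython=True is not required
def smooth_sqrt(x):
    out = np.empty_like(x)     # array allocation
    for i, v in enumerate(x):  # looping over ndarray x as an iterator
        out[i] = np.sqrt(v)    # using numpy math functions
    return out[1:-1]           # returning a slice of the array

result = smooth_sqrt(np.arange(1e6))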

CUDA Kernels in Python

• Decorator will infer the type signature when you call it
• NumPy arrays have the expected attributes and indexing
• Helper function to compute blockIdx.x * blockDim.x + threadIdx.x
• Helper function to compute blockDim.x * gridDim.x
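A minimal sketch of such a kernel, assuming a simple element-wise operation (the kernel on the slide may have differed):

from numba import cuda

@cuda.jit                      # decorator infers the type signature at first launch
def gpu_scale(out, x, factor):
    start = cuda.grid(1)       # helper: blockIdx.x * blockDim.x + threadIdx.x
    stride = cuda.gridsize(1)  # helper: blockDim.x * gridDim.x
    for i in range(start, x.shape[0], stride):  # NumPy-style attributes and indexing
        out[i] = x[i] * factor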


Calling the Kernel from Python

Works just like CUDA C, except Numba handles allocating and copying data to/from the host if needed.
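Continuing the sketch above, a launch might look like this; Numba copies the NumPy arrays to the device and back automatically:

import numpy as np

x = np.arange(1000000, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 128
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block

# Launch syntax mirrors CUDA C's <<<grid, block>>>; host<->device copies are implicit here
gpu_scale[blocks_per_grid, threads_per_block](out, x, 2.0)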


Handling Device Memory Directly

Memory allocation matters in small tasks.
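A sketch of explicit device memory management, which pays off when many small kernel launches reuse the same data:

from numba import cuda

d_x = cuda.to_device(x)            # one explicit host -> device copy
d_out = cuda.device_array_like(x)  # allocate output on the device; no copy at all

for _ in range(100):               # repeated launches reuse the same device buffers
    gpu_scale[blocks_per_grid, threads_per_block](d_out, d_x, 2.0)

out = d_out.copy_to_host()         # one explicit device -> host copy at the end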


Higher Level Tools: GPU Ufuncs

• Decorator for creating a ufunc
• List of supported type signatures
• Code generation target
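A minimal sketch matching those annotations (the actual ufunc on the slide may have differed):

import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)'],  # list of supported type signatures
           target='cuda')                  # code generation target
def rel_diff(x, y):                        # decorator turns this into a ufunc
    return 2.0 * (x - y) / (x + y)

a = np.random.random(1000000).astype(np.float32)
b = np.random.random(1000000).astype(np.float32)
c = rel_diff(a, b)   # called like any NumPy ufunc; runs on the GPU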


GPU Ufuncs Performance

4x speedup incl. host<->device round trip on GeForce GT 650M


Accelerate Library Bindings: cuFFT

>2x speedup incl. host<->device round trip on GeForce GT 650M

MKL-accelerated FFT

PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON


What is Apache Spark?

• An API and an execution engine for distributed computing on a cluster

• Based on the concept of Resilient Distributed Datasets (RDDs)

• Dataset: Collection of independent elements (files, objects, etc) in memory from previous calculations, or originating from some data store

• Distributed: Elements in RDDs are grouped into partitions and may be stored on different nodes

• Resilient: RDDs remember how they were created, so if a node goes down, Spark can recompute the lost elements on another node
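A minimal PySpark sketch of those ideas (run in the pyspark shell, where the SparkContext `sc` already exists):

# Dataset: a collection of independent elements, distributed into 8 partitions
rdd = sc.parallelize(range(1000), 8)

# Transformations are recorded lazily as lineage; if a partition is lost,
# Spark can recompute it from this recipe on another node
squares = rdd.map(lambda v: v * v)

print(squares.take(5))   # an action triggers the actual computation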


Computation DAGs

Fig from: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html


How does Spark scale?

• All cluster scaling is about minimizing I/O. Spark does this in several ways:

• Keep intermediate results in memory with rdd.cache() (see the sketch after this list)

• Move computation to the data whenever possible (functions are small and data is big!)

• Provide computation primitives that expose parallelism and minimize communication between workers: map, filter, sample, reduce, …
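For instance, caching an intermediate RDD that several jobs reuse (a hypothetical pipeline; parse_line and is_valid are illustrative helpers):

clean = sc.textFile("data/*.csv").map(parse_line).filter(is_valid).cache()

count_a = clean.filter(lambda r: r[0] == "a").count()   # first action computes and caches
count_b = clean.filter(lambda r: r[0] == "b").count()   # second action reads from the cache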


Python and Spark

• Spark is implemented in Java & Scala on the JVM

• Full API support for Scala, Java, and Python (+ limited support for R)

• How does Python work, since it doesn't run on the JVM? (not counting IronPython)


Architecture (from the diagram): on the client, the Python SparkContext drives a Java SparkContext across a Py4J bridge. On the cluster, each Java Spark worker hands work to a Python interpreter process; each Python-Java boundary in the diagram is bridged by Py4J.

Getting Started

• conda can create a local environment for Spark for you:

conda create -n spark -c anaconda-cluster python=3.5 spark numba \
    cudatoolkit ipython-notebook

source activate spark

# Uncomment below if on Mac OS X
# export JRE_HOME=$(/usr/libexec/java_home)
# export JAVA_HOME=$(/usr/libexec/java_home)

IPYTHON_OPTS="notebook" pyspark   # starts jupyter notebook


Using Numba (CPU) with Spark

• Load arrays onto the Spark workers
• Apply my function to every element in the RDD and return the first element
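A sketch of that pattern (the jitted function body is illustrative):

import numpy as np
from numba import jit

@jit(nopython=True)
def sum_sqrt(arr):
    total = 0.0
    for v in arr:           # compiled loop over the NumPy array
        total += v ** 0.5
    return total

# Load arrays onto the Spark workers
rdd = sc.parallelize([np.arange(100, dtype=np.float64) for _ in range(8)])

# Apply my function to every element in the RDD and return the first element
print(rdd.map(sum_sqrt).first())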

Using CUDA Python with Spark

• Define the CUDA kernel (compilation happens here)
• Wrap the CUDA kernel launching logic
• Create a Spark RDD (8 partitions)
• Apply gpu_work on each partition
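A sketch of that structure (kernel and sizes are illustrative):

import numpy as np
from numba import cuda

@cuda.jit                       # define CUDA kernel; compilation happens here
def gpu_double(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = x[i] * 2.0

def gpu_work(iterator):         # wrap the CUDA kernel launching logic
    arr = np.asarray(list(iterator), dtype=np.float64)
    out = np.empty_like(arr)
    threads = 128
    blocks = (arr.size + threads - 1) // threads
    gpu_double[blocks, threads](arr, out)
    return out.tolist()

rdd = sc.parallelize(range(1000), 8)             # create Spark RDD (8 partitions)
result = rdd.mapPartitions(gpu_work).collect()   # apply gpu_work on each partition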

Kernel shipping (from the diagram): on the client, Numba compiles the CUDA Python kernel to LLVM IR and then PTX, and serializes both to the cluster. On each worker, the kernel is deserialized and finalized into a CUDA binary: the serialized PTX is used as-is if the worker's CUDA architecture matches the client's, and the kernel is recompiled from the LLVM IR if the architecture is different.

(The same diagram, with one more annotation: this deserialize/finalize step happens on every worker process.)

EXAMPLE: IMAGE REGISTRATION


Basic Algorithm

Flow (from the diagram): start from the image set; group similar images (unsupervised kNN clustering); attempt image registration on every pair in each group (phase correlation based; FFT heavy); check progress, then feed unused images back in together with new images.

Basic Algorithm (annotated)

• Grouping similar images (unsupervised kNN clustering) runs on the CPU and reduces the number of pairwise image registration attempts.

• The pairwise registration step (phase correlation based) is 2D-FFT heavy, expensive, and mostly runs on the GPU.

Cross Power Spectrum

Core of the phase correlation based image registration algorithm.


Cross Power Spectrum (annotated): the 2D FFTs run through cuFFT, with explicit host<->device memory transfer.
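The slide showed this computation as code; a minimal NumPy sketch of the standard cross power spectrum (the slide's version ran the 2D FFTs through cuFFT instead):

import numpy as np

def cross_power_spectrum(a, b):
    Fa = np.fft.fft2(a)           # 2D FFT of each image
    Fb = np.fft.fft2(b)
    cross = Fa * np.conj(Fb)      # cross-correlation in the frequency domain
    return cross / np.abs(cross)  # keep only the phase; its inverse FFT peaks at the shift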


Scaling on Spark

• Machines have multiple GPUs
• Each worker computes a partition at a time

Flow (from the diagram): the image set is randomly partitioned with rdd.repartition(num_parts); image registration is applied on each partition with rdd.mapPartitions(imgreg_func); progress is checked, and the loop repeats.

CUDA Multi-Process Service (MPS)

• Sharing one GPU between multiple workers can be beneficial

• nvidia-cuda-mps-control

• Better GPU utilization from multiple processes

• For our example app: 5-10% improvement with MPS

• Effect can be bigger for apps with higher GPU utilization


Scaling with Multiple GPUs and MPS

• 1 CPU worker: 1 Tesla K20

• 2 CPU worker: 2 Tesla K20

• 4 CPU worker: 2 Tesla K20(2 workers per GPU using MPS)

• Spark + Dask = External process performing GPU calculation

(The partitioning diagram repeats: random partitioning via rdd.repartition(num_parts), image registration on each partition via rdd.mapPartitions(imgreg_func), then a progress check.)

Spark vs CUDA

• Spark logic is similar to CUDA host logic
  • mapPartitions is like a kernel launch
  • Spark network transfer overhead vs PCI-Express transfer overhead
• Partitions are like CUDA blocks
  • Work cooperatively within a partition

TIPS AND TRICKS


Parallelism is about communication

• Be aware of the communication overhead when deciding how to chunk your work in Spark.

• Bigger chunks: Risk of underutilization of resources

• Smaller chunks: Risk of computation swamped by overhead

➡ Start with big chunks, then move to smaller ones if you fail to reach full utilization.


Start Small

• Easier to debug your logic on your local system!

• Run a Spark cluster on your workstation where it is easier to debug and profile.

• Move to the cluster only after you’ve seen things work locally. (Spark makes this fairly painless.)


Be interactive with Big Data!

• Run Jupyter notebook on Spark + MultiGPU cluster

• Live experimentation helps you develop and understand your progress

• Spark keeps track of intermediate data and recomputes them if they are lost


Amortize Setup Costs

• Make sure that one-time costs are actually done once:

• GPU state (like FFT plans) can be slow to initialize. Do this at the top of your mapPartitions call and reuse it for each element in the RDD partition (see the sketch after this list).

• Coarser chunking of tasks allows GPU memory allocations to be reused between steps inside a single task.
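A sketch of that pattern; expensive_gpu_setup and register_pair are hypothetical helpers:

def imgreg_func(iterator):
    # One-time setup per partition: create FFT plans, allocate GPU buffers, ...
    state = expensive_gpu_setup()
    for pair in iterator:
        # Reuse the same plans and buffers for every element in the partition
        yield register_pair(pair, state)

results = rdd.mapPartitions(imgreg_func).collect()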


Be(a)ware of Global State

• Beware that PySpark may (quite often!) spawn new worker processes

• New processes may start with the same initial random seed

  • The Python random module must be seeded properly in each newly spawned process

• Use memoization

  • Example: use a global dictionary to track shared state and remember which initializations have already been performed (see the sketch below)
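A sketch of that global-dictionary pattern (names are illustrative):

import os
import random

_initialized = {}   # global dict: remembers which worker processes were set up

def ensure_initialized():
    pid = os.getpid()
    if pid not in _initialized:
        random.seed()             # reseed from OS entropy in each newly spawned process
        _initialized[pid] = True  # memoize so the setup runs once per process

def work(item):
    ensure_initialized()
    return item * random.random()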


Spark and Multi-GPU

• Spark is not aware of GPU resources

• Efficient usage of GPU resources requires some workarounds, e.g.:

  • Option 1: random GPU assignment: numba.cuda.select_device(random.randrange(num_of_gpus))

  • Option 2: use mapPartitionsWithIndex: selected_gpu = partition_index % num_of_gpus (see the sketch below)

  • Option 3: manage GPUs externally, as CaffeOnSpark does
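A sketch of option 2 (num_of_gpus and process_on_gpu are assumptions for illustration):

from numba import cuda

num_of_gpus = 2   # assumed here; in practice, query with len(cuda.gpus)

def gpu_work_on(partition_index, iterator):
    # Pin this task to a GPU chosen by partition index
    cuda.select_device(partition_index % num_of_gpus)
    for item in iterator:
        yield process_on_gpu(item)   # hypothetical GPU-backed function

result = rdd.mapPartitionsWithIndex(gpu_work_on).collect()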


• GPU assignment by partition index
• Delegate GPU work to an external GPU-aware process: >2x speedup

CONCLUSION


PySpark and Numba for GPU clusters

• Numba lets you create compiled CPU and CUDA functions right inside your Python applications.

• Numba can be used with Spark to easily distribute and run your code on Spark workers with GPUs

• There is room for improvement in how Spark interacts with the GPU, but things do work.

• Beware of accidentally multiplying fixed initialization and compilation costs.