GPU Computing with Apache Spark and Python


Transcript of GPU Computing with Apache Spark and Python

Page 1: GPU Computing with Apache Spark and Python

© 2015 Continuum Analytics- Confidential & Proprietary

GPU Computing with Apache Spark and Python

Stan Seibert Siu Kwan Lam

April 5, 2016

Page 2: GPU Computing with Apache Spark and Python

My Background

• Trained in particle physics

• Using Python for data analysis for 10 years

• Using GPUs for data analysis for 7 years

• Currently lead the High Performance Python team at Continuum


Page 3: GPU Computing with Apache Spark and Python

About Continuum Analytics

• We give superpowers to people who change the world!

• We want to help everyone analyze their data with Python (and other tools), so we offer:

• Enterprise Products • Consulting • Training • Open Source


Page 4: GPU Computing with Apache Spark and Python

• I’m going to use Anaconda throughout this presentation.

• Anaconda is a free Mac/Win/Linux Python distribution:

• Based on conda, an open source package manager

• Installs both Python and non-Python dependencies

• Easiest way to get the software I will talk about today

• https://www.continuum.io/downloads


Page 5: GPU Computing with Apache Spark and Python

Overview

1. Why Python?

2. Numba: A Python JIT compiler for the CPU and GPU

3. PySpark: Distributed computing for Python

4. Example: Image Registration

5. Tips and Tricks

6. Conclusion

Page 6: GPU Computing with Apache Spark and Python

WHY PYTHON?


Page 7: GPU Computing with Apache Spark and Python

Why is Python so popular?

• Straightforward, productive language for system administrators, programmers, scientists, analysts and hobbyists

• Great community:

• Lots of tutorial and reference materials

• Easy to interface with other languages

• Vast ecosystem of useful libraries


Page 8: GPU Computing with Apache Spark and Python


Page 9: GPU Computing with Apache Spark and Python

… But, isn’t Python slow?

• Pure, interpreted Python is slow.

• Python excels at interfacing with other languages used in HPC:

C: ctypes, CFFI, Cython

C++: Cython, Boost.Python

FORTRAN: f2py

• Secret: Most scientific Python packages put the speed-critical sections of their algorithms in a compiled language.

Page 10: GPU Computing with Apache Spark and Python

Is there another way?


• Switching languages for speed in your projects can be a little clunky:

• Sometimes tedious boilerplate for translating data types across the language barrier

• Generating compiled functions for the wide range of data types can be difficult

• How can we use cutting edge hardware, like GPUs?

Page 11: GPU Computing with Apache Spark and Python

NUMBA: A PYTHON JIT COMPILER


Page 12: GPU Computing with Apache Spark and Python

Compiling Python


• Numba is an open-source, type-specializing compiler for Python functions

• Can translate Python syntax into machine code if all type information can be deduced when the function is called.

• Implemented as a module. Does not replace the Python interpreter!

• Code generation done with:

• LLVM (for CPU)

• NVVM (for CUDA GPUs).

Page 13: GPU Computing with Apache Spark and Python

Supported Platforms

OS:
• Windows (7 and later)
• OS X (10.9 and later)
• Linux (~RHEL 5 and later)

HW:
• 32- and 64-bit x86 CPUs
• CUDA-capable NVIDIA GPUs
• HSA-capable AMD GPUs
• Experimental support for ARMv7 (Raspberry Pi 2)

SW:
• Python 2 and 3
• NumPy 1.7 through 1.10

Page 14: GPU Computing with Apache Spark and Python

How Does Numba Work?

[Compilation pipeline diagram: the bytecode of the decorated Python function, together with the argument types at the call site, goes through Bytecode Analysis → Numba IR → Type Inference → Rewrite IR → Lowering → LLVM IR → LLVM/NVVM JIT → machine code, which is cached and then executed.]

@jit
def do_math(a, b):
    ...

>>> do_math(x, y)

Page 15: GPU Computing with Apache Spark and Python


Numba on the CPU

Page 16: GPU Computing with Apache Spark and Python

Numba on the CPU

• Numba decorator (nopython=True not required)

• Array allocation

• Looping over ndarray x as an iterator

• Using numpy math functions

• Returning a slice of the array

• 2.7x speedup!
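The actual code appears as a screenshot in the slides; below is a minimal sketch of a function with the same ingredients (the function name, threshold logic, and data are hypothetical, not the talk's code):

import numpy as np
from numba import jit

@jit(nopython=True)          # nopython=True is optional; plain @jit also works
def clamp_sqrt(x, threshold):
    out = np.empty_like(x)                            # array allocation
    for i, v in enumerate(x):                         # loop over ndarray x as an iterator
        out[i] = np.sqrt(v) if v > threshold else 0.0 # numpy math functions
    return out[:-1]                                   # return a slice of the array

x = np.random.random(100000)
result = clamp_sqrt(x, 0.5)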

Page 17: GPU Computing with Apache Spark and Python


CUDA Kernels in Python

Page 18: GPU Computing with Apache Spark and Python

CUDA Kernels in Python

• The decorator will infer the type signature when you call it

• NumPy arrays have the expected attributes and indexing

• Helper function to compute blockIdx.x * blockDim.x + threadIdx.x

• Helper function to compute blockDim.x * gridDim.x
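A minimal sketch of such a kernel (the kernel name and the scaling operation are placeholders, not the slide's exact code):

from numba import cuda

@cuda.jit                       # type signature is inferred at the first call
def gpu_scale(x, out, factor):
    start = cuda.grid(1)        # blockIdx.x * blockDim.x + threadIdx.x
    stride = cuda.gridsize(1)   # blockDim.x * gridDim.x
    for i in range(start, x.shape[0], stride):   # NumPy-style attributes and indexing
        out[i] = x[i] * factor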

Page 19: GPU Computing with Apache Spark and Python


Calling the Kernel from Python

Works just like CUDA C, except Numba handles allocating and copying data to/from the host if needed.
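For example, launching the hypothetical gpu_scale kernel from the previous sketch with an explicit grid configuration, letting Numba manage the transfers:

import numpy as np

x = np.arange(1000000, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block

# The NumPy arrays are copied to the device, the kernel runs, results are copied back
gpu_scale[blocks, threads_per_block](x, out, 2.0)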

Page 20: GPU Computing with Apache Spark and Python


Handling Device Memory Directly

Memory allocation matters in small tasks.
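A sketch of managing device memory explicitly (reusing the hypothetical gpu_scale kernel) so allocations and transfers are not repeated implicitly on every launch:

from numba import cuda
import numpy as np

x = np.arange(1000000, dtype=np.float32)

d_x = cuda.to_device(x)                  # explicit host -> device copy
d_out = cuda.device_array_like(d_x)      # device-only allocation, no copy

gpu_scale[(x.size + 127) // 128, 128](d_x, d_out, 2.0)

result = d_out.copy_to_host()            # copy back only when needed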

Page 21: GPU Computing with Apache Spark and Python


Higher Level Tools: GPU Ufuncs

Page 22: GPU Computing with Apache Spark and Python

Higher Level Tools: GPU Ufuncs

• Decorator for creating a ufunc

• List of supported type signatures

• Code generation target
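A minimal sketch of a GPU ufunc (the hypotenuse function is an illustrative choice, not the slide's code):

import math
import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)',   # list of supported type signatures
            'float64(float64, float64)'],
           target='cuda')                  # code generation target
def gpu_hypot(a, b):
    return math.sqrt(a * a + b * b)

a = np.random.random(1000000).astype(np.float32)
b = np.random.random(1000000).astype(np.float32)
c = gpu_hypot(a, b)    # host<->device transfers are handled automatically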

Page 23: GPU Computing with Apache Spark and Python

GPU Ufuncs Performance

4x speedup, including the host<->device round trip, on a GeForce GT 650M

Page 24: GPU Computing with Apache Spark and Python

Accelerate Library Bindings: cuFFT

>2x speedup, including the host<->device round trip, on a GeForce GT 650M (baseline: MKL-accelerated FFT)

Page 25: GPU Computing with Apache Spark and Python

PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON


Page 26: GPU Computing with Apache Spark and Python

What is Apache Spark?

• An API and an execution engine for distributed computing on a cluster

• Based on the concept of Resilient Distributed Datasets (RDDs)

• Dataset: Collection of independent elements (files, objects, etc) in memory from previous calculations, or originating from some data store

• Distributed: Elements in RDDs are grouped into partitions and may be stored on different nodes

• Resilient: RDDs remember how they were created, so if a node goes down, Spark can recompute the lost elements on another node

Page 27: GPU Computing with Apache Spark and Python


Computation DAGs

Fig. from: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Page 28: GPU Computing with Apache Spark and Python

How does Spark Scale?

• All cluster scaling is about minimizing I/O.

Spark does this in several ways:

• Keep intermediate results in memory with rdd.cache()

• Move computation to the data whenever possible (functions are small and data is big!)

• Provide computation primitives that expose parallelism and minimize communication between workers: map, filter, sample, reduce, … (a small sketch follows)
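For instance, a minimal PySpark sketch of these primitives together with rdd.cache(), assuming an existing SparkContext named sc:

squares = sc.parallelize(range(1000000), 8).map(lambda x: x * x)
squares.cache()     # keep intermediate results in memory for reuse
even_sum = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)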

Page 29: GPU Computing with Apache Spark and Python


Python and Spark

• Spark is implemented in Java & Scala on the JVM

• Full API support for Scala, Java, and Python (+ limited support for R)

• How does Python work, since it doesn’t run on the JVM? (not counting Jython)

Page 30: GPU Computing with Apache Spark and Python

[Architecture diagram: on the client, the Python process drives a Java SparkContext through Py4J; on the cluster, each Java Spark worker is paired with a Python interpreter that executes the Python parts of the job.]

Page 31: GPU Computing with Apache Spark and Python


Getting Started

• conda can create a local environment for Spark for you:

conda create -n spark -c anaconda-cluster python=3.5 spark numba \
    cudatoolkit ipython-notebook

source activate spark

# Uncomment below if on Mac OS X
# export JRE_HOME=$(/usr/libexec/java_home)
# export JAVA_HOME=$(/usr/libexec/java_home)

IPYTHON_OPTS="notebook" pyspark   # starts jupyter notebook

Page 32: GPU Computing with Apache Spark and Python


Using Numba (CPU) with Spark

Page 33: GPU Computing with Apache Spark and Python

Using Numba (CPU) with Spark

• Load arrays onto the Spark workers

• Apply my function to every element in the RDD and return the first element
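A sketch of that pattern, assuming an existing SparkContext sc (the work function is a stand-in, not the slide's code):

import numpy as np
from numba import jit

@jit(nopython=True)
def cpu_work(x):
    total = 0.0
    for v in x:            # some compiled numeric work per array
        total += v * v
    return total

arrays = [np.random.random(10000) for _ in range(8)]
rdd = sc.parallelize(arrays)        # load arrays onto the Spark workers

print(rdd.map(cpu_work).first())    # apply the function to every element, return the first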

Page 34: GPU Computing with Apache Spark and Python

Using CUDA Python with Spark

• Define the CUDA kernel (compilation happens here)

• Wrap the CUDA kernel launching logic

• Create a Spark RDD (8 partitions)

• Apply gpu_work on each partition
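A sketch of those four steps, assuming an existing SparkContext sc (the doubling kernel is a placeholder for the real per-element work):

import numpy as np
from numba import cuda

@cuda.jit                            # define the CUDA kernel; compilation happens here
def gpu_double(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = x[i] * 2.0

def gpu_work(arrays):                # wrap the CUDA kernel launching logic
    for x in arrays:
        out = np.zeros_like(x)
        gpu_double[(x.size + 127) // 128, 128](x, out)
        yield out

arrays = [np.random.random(100000).astype(np.float32) for _ in range(64)]
rdd = sc.parallelize(arrays, 8)                  # create a Spark RDD (8 partitions)
results = rdd.mapPartitions(gpu_work).collect()  # apply gpu_work on each partition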

Page 35: GPU Computing with Apache Spark and Python

[Diagram: CUDA kernel compilation flow. Client: the CUDA Python kernel is compiled to LLVM IR and then PTX, which is serialized and shipped to the cluster. Cluster: the PTX is deserialized and finalized into a CUDA binary.]

• Use the serialized PTX if the worker's CUDA architecture matches the client's

• Recompile if the CUDA architecture is different

Page 36: GPU Computing with Apache Spark and Python

[Diagram: the same CUDA kernel compilation flow as the previous slide.]

• This happens on every worker process

Page 37: GPU Computing with Apache Spark and Python

EXAMPLE: IMAGE REGISTRATION


Page 38: GPU Computing with Apache Spark and Python

Basic Algorithm

[Flowchart: image set → group similar images (unsupervised kNN clustering) → attempt image registration on every pair in each group (phase-correlation based; FFT heavy) → Progress?; unused images and new images are fed back into the image set.]

Page 39: GPU Computing with Apache Spark and Python

Basic Algorithm

• Group similar images (unsupervised kNN clustering): reduces the number of pairwise image registration attempts; runs on the CPU

• Attempt image registration on every pair in each group (phase-correlation based): FFT 2D heavy, expensive, mostly running on the GPU

[Same flowchart as the previous slide: image set → grouping → pairwise registration → Progress?, with unused images and new images fed back in.]

Page 40: GPU Computing with Apache Spark and Python

Cross Power Spectrum: the core of the phase-correlation-based image registration algorithm
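The slide shows the actual code; below is a minimal NumPy sketch of the computation (the GPU version on the next slide replaces the FFTs with cuFFT calls and adds explicit memory transfers):

import numpy as np

def cross_power_spectrum(img_a, img_b):
    fa = np.fft.fft2(img_a)        # forward 2D FFTs of both images
    fb = np.fft.fft2(img_b)
    prod = fa * np.conj(fb)        # cross power spectrum
    return prod / np.abs(prod)     # normalized; its inverse FFT peaks at the image shift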

Page 41: GPU Computing with Apache Spark and Python

Cross Power Spectrum: the core of the phase-correlation-based image registration algorithm (GPU version)

• cuFFT

• Explicit memory transfer

Page 42: GPU Computing with Apache Spark and Python

Scaling on Spark

• Machines have multiple GPUs

• Each worker computes a partition at a time

[Diagram: image set → random partitioning with rdd.repartition(num_parts) → apply image registration on each partition with rdd.mapPartitions(imgreg_func) → Progress?]

Page 43: GPU Computing with Apache Spark and Python


CUDA Multi-Process Service (MPS)

• Sharing one GPU between multiple workers can be beneficial

• nvidia-cuda-mps-control

• Better GPU utilization from multiple processes

• For our example app: 5-10% improvement with MPS

• The effect can be bigger for apps with higher GPU utilization

Page 44: GPU Computing with Apache Spark and Python

Scaling with Multiple GPUs and MPS

• 1 CPU worker: 1 Tesla K20

• 2 CPU workers: 2 Tesla K20

• 4 CPU workers: 2 Tesla K20 (2 workers per GPU using MPS)

• Spark + Dask: an external process performs the GPU calculation

Page 45: GPU Computing with Apache Spark and Python

Spark vs CUDA

• Spark logic is similar to CUDA host logic

• mapPartitions is like a kernel launch

• Spark network transfer overhead vs PCI-Express transfer overhead

• Partitions are like CUDA blocks

• Work cooperatively within a partition

[Diagram: image set → random partitioning with rdd.repartition(num_parts) → apply image registration on each partition with rdd.mapPartitions(imgreg_func) → Progress? (same as the earlier scaling slide)]

Page 46: GPU Computing with Apache Spark and Python

TIPS AND TRICKS


Page 47: GPU Computing with Apache Spark and Python


Parallelism is about communication

• Be aware of the communication overhead when deciding how to chunk your work in Spark.

• Bigger chunks: Risk of underutilization of resources

• Smaller chunks: Risk of computation swamped by overhead

➡ Start with big chunks, then move to smaller ones if you fail to reach full utilization.

Page 48: GPU Computing with Apache Spark and Python


Start Small

• Easier to debug your logic on your local system!

• Run a Spark cluster on your workstation where it is easier to debug and profile.

• Move to the cluster only after you’ve seen things work locally. (Spark makes this fairly painless.)

Page 49: GPU Computing with Apache Spark and Python


Be interactive with Big Data!

• Run Jupyter notebook on Spark + MultiGPU cluster

• Live experimentation helps you develop and understand your progress

• Spark keeps track of intermediate data and recomputes it if it is lost

Page 50: GPU Computing with Apache Spark and Python


Amortize Setup Costs

• Make sure that one-time costs are actually done once:

• GPU state (like FFT plans) can be slow to initialize. Do this at the top of your mapPartitions call and reuse it for each element in the RDD partition (see the sketch after this list).

• Coarser chunking of tasks allows GPU memory allocations to be reused between steps inside a single task.
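A sketch of the per-partition setup pattern, assuming an existing RDD named rdd; make_fft_plan and register_image are hypothetical helpers standing in for the real setup and per-image work:

def imgreg_partition(images):
    plan = make_fft_plan()                # one-time, per-partition GPU setup (e.g. FFT plans)
    for img in images:
        yield register_image(img, plan)   # reuse the plan for every element in the partition

results = rdd.mapPartitions(imgreg_partition).collect()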

Page 51: GPU Computing with Apache Spark and Python

Be(a)ware of Global State

• Beware that PySpark may (quite often!) spawn new processes

• New processes may have the same initial random seed

• The Python random module must be seeded properly in each newly spawned process

• Use memoization

• Example: use a global dictionary to track shared state and remember which initializations have already been performed (see the sketch below)
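A minimal sketch of that idea (the function names are illustrative, not the talk's code):

import os
import random

_initialized = {}    # global dict memoizing per-process initialization

def ensure_seeded():
    pid = os.getpid()
    if pid not in _initialized:
        random.seed(os.urandom(16))   # reseed in each newly spawned process
        _initialized[pid] = True

def work(x):
    ensure_seeded()
    return x + random.random()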

Page 52: GPU Computing with Apache Spark and Python

Spark and Multi-GPU

• Spark is not aware of GPU resources

• Efficient usage of GPU resources requires some workarounds, e.g.:

• Option 1: random GPU assignment: numba.cuda.select_device(random.randrange(num_of_gpus))

• Option 2: use mapPartitionsWithIndex: selected_gpu = partition_index % num_of_gpus

• Option 3: manage GPUs externally, like CaffeOnSpark
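A sketch of option 2, assuming an existing RDD named rdd, a hypothetical per-element routine do_gpu_work, and two GPUs per node:

from numba import cuda

num_of_gpus = 2   # assumption: two GPUs per node

def gpu_partition(partition_index, elements):
    cuda.select_device(partition_index % num_of_gpus)   # pick a GPU from the partition index
    for x in elements:
        yield do_gpu_work(x)

results = rdd.mapPartitionsWithIndex(gpu_partition).collect()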

Page 53: GPU Computing with Apache Spark and Python


• GPU assignment by partition index

• Delegating GPU work to an external GPU-aware process: >2x speedup

Page 54: GPU Computing with Apache Spark and Python

CONCLUSION


Page 55: GPU Computing with Apache Spark and Python


PySpark and Numba for GPU clusters

• Numba lets you create compiled CPU and CUDA functions right inside your Python applications.

• Numba can be used with Spark to easily distribute and run your code on Spark workers with GPUs

• There is room for improvement in how Spark interacts with the GPU, but things do work.

• Beware of accidentally multiplying fixed initialization and compilation costs.