GPU Computing with Apache Spark and Python


Transcript of GPU Computing with Apache Spark and Python

Page 1: GPU Computing with Apache Spark and Python

© 2015 Continuum Analytics- Confidential & Proprietary

GPU Computing with Apache Spark and Python

Stan Seibert Siu Kwan Lam

April 5, 2016

Page 2: GPU Computing with Apache Spark and Python

My Background

• Trained in particle physics

• Using Python for data analysis for 10 years

• Using GPUs for data analysis for 7 years

• Currently lead the High Performance Python team at Continuum


Page 3: GPU Computing with Apache Spark and Python

About Continuum Analytics

• We give superpowers to people who change the world!

• We want to help everyone analyze their data with Python (and other tools), so we offer:

• Enterprise Products • Consulting • Training • Open Source


Page 4: GPU Computing with Apache Spark and Python

• I’m going to use Anaconda throughout this presentation.

• Anaconda is a free Mac/Win/Linux Python distribution:

• Based on conda, an open source package manager

• Installs both Python and non-Python dependencies

• Easiest way to get the software I will talk about today

• https://www.continuum.io/downloads


Page 5: GPU Computing with Apache Spark and Python

Overview

1. Why Python?

2. Numba: A Python JIT compiler for the CPU and GPU

3. PySpark: Distributed computing for Python

4. Example: Image Registration

5. Tips and Tricks

6. Conclusion

Page 6: GPU Computing with Apache Spark and Python

WHY PYTHON?


Page 7: GPU Computing with Apache Spark and Python

Why is Python so popular?

• Straightforward, productive language for system administrators, programmers, scientists, analysts and hobbyists

• Great community:

• Lots of tutorial and reference materials

• Easy to interface with other languages

• Vast ecosystem of useful libraries


Page 8: GPU Computing with Apache Spark and Python


Page 9: GPU Computing with Apache Spark and Python

… But, isn’t Python slow?

• Pure, interpreted Python is slow.

• Python excels at interfacing with other languages used in HPC:

C: ctypes, CFFI, Cython

C++: Cython, Boost.Python

FORTRAN: f2py

• Secret: Most scientific Python packages put the speed-critical sections of their algorithms in a compiled language.

Page 10: GPU Computing with Apache Spark and Python

Is there another way?


• Switching languages for speed in your projects can be a little clunky:

• Sometimes tedious boilerplate for translating data types across the language barrier

• Generating compiled functions for the wide range of data types can be difficult

• How can we use cutting edge hardware, like GPUs?

Page 11: GPU Computing with Apache Spark and Python

NUMBA: A PYTHON JIT COMPILER


Page 12: GPU Computing with Apache Spark and Python

Compiling Python


• Numba is an open-source, type-specializing compiler for Python functions

• Can translate Python syntax into machine code if all type information can be deduced when the function is called.

• Implemented as a module. Does not replace the Python interpreter!

• Code generation done with:

• LLVM (for CPU)

• NVVM (for CUDA GPUs).

Page 13: GPU Computing with Apache Spark and Python

Supported Platforms

OS:
• Windows (7 and later)
• OS X (10.9 and later)
• Linux (~RHEL 5 and later)

HW:
• 32- and 64-bit x86 CPUs
• CUDA-capable NVIDIA GPUs
• HSA-capable AMD GPUs
• Experimental support for ARMv7 (Raspberry Pi 2)

SW:
• Python 2 and 3
• NumPy 1.7 through 1.10

Page 14: GPU Computing with Apache Spark and Python

How Does Numba Work?

[Compilation pipeline diagram: the bytecode of the decorated Python function, together with the argument types at the call site, goes through Bytecode Analysis → Numba IR → Type Inference → Rewrite IR → Lowering → LLVM IR → LLVM/NVVM JIT → machine code, which is cached and then executed.]

@jit
def do_math(a, b):
    ...

>>> do_math(x, y)

Page 15: GPU Computing with Apache Spark and Python


Numba on the CPU

Page 16: GPU Computing with Apache Spark and Python

Numba on the CPU

• Numba decorator (nopython=True not required)

• Array allocation

• Looping over ndarray x as an iterator

• Using numpy math functions

• Returning a slice of the array

• 2.7x speedup!
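The actual code appears as a screenshot in the slides; below is a minimal sketch of a function with the same ingredients (the function name, threshold logic, and data are hypothetical, not the talk's code):

import numpy as np
from numba import jit

@jit(nopython=True)          # nopython=True is optional; plain @jit also works
def clamp_sqrt(x, threshold):
    out = np.empty_like(x)                            # array allocation
    for i, v in enumerate(x):                         # loop over ndarray x as an iterator
        out[i] = np.sqrt(v) if v > threshold else 0.0 # numpy math functions
    return out[:-1]                                   # return a slice of the array

x = np.random.random(100000)
result = clamp_sqrt(x, 0.5)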

Page 17: GPU Computing with Apache Spark and Python


CUDA Kernels in Python

Page 18: GPU Computing with Apache Spark and Python

CUDA Kernels in Python

• The decorator will infer the type signature when you call it

• NumPy arrays have the expected attributes and indexing

• Helper function to compute blockIdx.x * blockDim.x + threadIdx.x

• Helper function to compute blockDim.x * gridDim.x
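A minimal sketch of such a kernel (the kernel name and the scaling operation are placeholders, not the slide's exact code):

from numba import cuda

@cuda.jit                       # type signature is inferred at the first call
def gpu_scale(x, out, factor):
    start = cuda.grid(1)        # blockIdx.x * blockDim.x + threadIdx.x
    stride = cuda.gridsize(1)   # blockDim.x * gridDim.x
    for i in range(start, x.shape[0], stride):   # NumPy-style attributes and indexing
        out[i] = x[i] * factor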

Page 19: GPU Computing with Apache Spark and Python


Calling the Kernel from Python

Works just like CUDA C, except Numba handles allocating and copying data to/from the host if needed.
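For example, launching the hypothetical gpu_scale kernel from the previous sketch with an explicit grid configuration, letting Numba manage the transfers:

import numpy as np

x = np.arange(1000000, dtype=np.float32)
out = np.zeros_like(x)

threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block

# The NumPy arrays are copied to the device, the kernel runs, results are copied back
gpu_scale[blocks, threads_per_block](x, out, 2.0)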

Page 20: GPU Computing with Apache Spark and Python


Handling Device Memory Directly

Memory allocation matters in small tasks.
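A sketch of managing device memory explicitly (reusing the hypothetical gpu_scale kernel) so allocations and transfers are not repeated implicitly on every launch:

from numba import cuda
import numpy as np

x = np.arange(1000000, dtype=np.float32)

d_x = cuda.to_device(x)                  # explicit host -> device copy
d_out = cuda.device_array_like(d_x)      # device-only allocation, no copy

gpu_scale[(x.size + 127) // 128, 128](d_x, d_out, 2.0)

result = d_out.copy_to_host()            # copy back only when needed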

Page 21: GPU Computing with Apache Spark and Python


Higher Level Tools: GPU Ufuncs

Page 22: GPU Computing with Apache Spark and Python

Higher Level Tools: GPU Ufuncs

• Decorator for creating a ufunc

• List of supported type signatures

• Code generation target
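A minimal sketch of a GPU ufunc (the hypotenuse function is an illustrative choice, not the slide's code):

import math
import numpy as np
from numba import vectorize

@vectorize(['float32(float32, float32)',   # list of supported type signatures
            'float64(float64, float64)'],
           target='cuda')                  # code generation target
def gpu_hypot(a, b):
    return math.sqrt(a * a + b * b)

a = np.random.random(1000000).astype(np.float32)
b = np.random.random(1000000).astype(np.float32)
c = gpu_hypot(a, b)    # host<->device transfers are handled automatically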

Page 23: GPU Computing with Apache Spark and Python

GPU Ufuncs Performance

4x speedup, including the host<->device round trip, on a GeForce GT 650M

Page 24: GPU Computing with Apache Spark and Python

Accelerate Library Bindings: cuFFT

>2x speedup, including the host<->device round trip, on a GeForce GT 650M (baseline: MKL-accelerated FFT)

Page 25: GPU Computing with Apache Spark and Python

PYSPARK: DISTRIBUTED COMPUTING FOR PYTHON


Page 26: GPU Computing with Apache Spark and Python

What is Apache Spark?

• An API and an execution engine for distributed computing on a cluster

• Based on the concept of Resilient Distributed Datasets (RDDs)

• Dataset: Collection of independent elements (files, objects, etc) in memory from previous calculations, or originating from some data store

• Distributed: Elements in RDDs are grouped into partitions and may be stored on different nodes

• Resilient: RDDs remember how they were created, so if a node goes down, Spark can recompute the lost elements on another node

Page 27: GPU Computing with Apache Spark and Python


Computation DAGs

Fig. from: https://databricks.com/blog/2015/06/22/understanding-your-spark-application-through-visualization.html

Page 28: GPU Computing with Apache Spark and Python

How does Spark Scale?

• All cluster scaling is about minimizing I/O.

Spark does this in several ways:

• Keep intermediate results in memory with rdd.cache()

• Move computation to the data whenever possible (functions are small and data is big!)

• Provide computation primitives that expose parallelism and minimize communication between workers: map, filter, sample, reduce, … (a small sketch follows)
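For instance, a minimal PySpark sketch of these primitives together with rdd.cache(), assuming an existing SparkContext named sc:

squares = sc.parallelize(range(1000000), 8).map(lambda x: x * x)
squares.cache()     # keep intermediate results in memory for reuse
even_sum = squares.filter(lambda x: x % 2 == 0).reduce(lambda a, b: a + b)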

Page 29: GPU Computing with Apache Spark and Python


Python and Spark

• Spark is implemented in Java & Scala on the JVM

• Full API support for Scala, Java, and Python (+ limited support for R)

• How does Python work, since it doesn’t run on the JVM? (not counting Jython)

Page 30: GPU Computing with Apache Spark and Python

[Architecture diagram: on the client, the Python process drives a Java SparkContext through Py4J; on the cluster, each Java Spark worker is paired with a Python interpreter that executes the Python parts of the job.]

Page 31: GPU Computing with Apache Spark and Python


Getting Started

• conda can create a local environment for Spark for you:

conda create -n spark -c anaconda-cluster python=3.5 spark numba \
    cudatoolkit ipython-notebook

source activate spark

# Uncomment below if on Mac OS X
# export JRE_HOME=$(/usr/libexec/java_home)
# export JAVA_HOME=$(/usr/libexec/java_home)

IPYTHON_OPTS="notebook" pyspark   # starts jupyter notebook

Page 32: GPU Computing with Apache Spark and Python


Using Numba (CPU) with Spark

Page 33: GPU Computing with Apache Spark and Python

Using Numba (CPU) with Spark

• Load arrays onto the Spark workers

• Apply my function to every element in the RDD and return the first element
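A sketch of that pattern, assuming an existing SparkContext sc (the work function is a stand-in, not the slide's code):

import numpy as np
from numba import jit

@jit(nopython=True)
def cpu_work(x):
    total = 0.0
    for v in x:            # some compiled numeric work per array
        total += v * v
    return total

arrays = [np.random.random(10000) for _ in range(8)]
rdd = sc.parallelize(arrays)        # load arrays onto the Spark workers

print(rdd.map(cpu_work).first())    # apply the function to every element, return the first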

Page 34: GPU Computing with Apache Spark and Python

Using CUDA Python with Spark

• Define the CUDA kernel (compilation happens here)

• Wrap the CUDA kernel launching logic

• Create a Spark RDD (8 partitions)

• Apply gpu_work on each partition
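A sketch of those four steps, assuming an existing SparkContext sc (the doubling kernel is a placeholder for the real per-element work):

import numpy as np
from numba import cuda

@cuda.jit                            # define the CUDA kernel; compilation happens here
def gpu_double(x, out):
    i = cuda.grid(1)
    if i < x.shape[0]:
        out[i] = x[i] * 2.0

def gpu_work(arrays):                # wrap the CUDA kernel launching logic
    for x in arrays:
        out = np.zeros_like(x)
        gpu_double[(x.size + 127) // 128, 128](x, out)
        yield out

arrays = [np.random.random(100000).astype(np.float32) for _ in range(64)]
rdd = sc.parallelize(arrays, 8)                  # create a Spark RDD (8 partitions)
results = rdd.mapPartitions(gpu_work).collect()  # apply gpu_work on each partition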

Page 35: GPU Computing with Apache Spark and Python

[Diagram: CUDA kernel compilation flow. Client: the CUDA Python kernel is compiled to LLVM IR and then PTX, which is serialized and shipped to the cluster. Cluster: the PTX is deserialized and finalized into a CUDA binary.]

• Use the serialized PTX if the worker's CUDA architecture matches the client's

• Recompile if the CUDA architecture is different

Page 36: GPU Computing with Apache Spark and Python

[Diagram: the same CUDA kernel compilation flow as the previous slide.]

• This happens on every worker process

Page 37: GPU Computing with Apache Spark and Python

EXAMPLE: IMAGE REGISTRATION


Page 38: GPU Computing with Apache Spark and Python

Basic Algorithm

[Flowchart: image set → group similar images (unsupervised kNN clustering) → attempt image registration on every pair in each group (phase-correlation based; FFT heavy) → Progress?; unused images and new images are fed back into the image set.]

Page 39: GPU Computing with Apache Spark and Python

Basic Algorithm

• Group similar images (unsupervised kNN clustering): reduces the number of pairwise image registration attempts; runs on the CPU

• Attempt image registration on every pair in each group (phase-correlation based): FFT 2D heavy, expensive, mostly running on the GPU

[Same flowchart as the previous slide: image set → grouping → pairwise registration → Progress?, with unused images and new images fed back in.]

Page 40: GPU Computing with Apache Spark and Python

Cross Power Spectrum: the core of the phase-correlation-based image registration algorithm
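The slide shows the actual code; below is a minimal NumPy sketch of the computation (the GPU version on the next slide replaces the FFTs with cuFFT calls and adds explicit memory transfers):

import numpy as np

def cross_power_spectrum(img_a, img_b):
    fa = np.fft.fft2(img_a)        # forward 2D FFTs of both images
    fb = np.fft.fft2(img_b)
    prod = fa * np.conj(fb)        # cross power spectrum
    return prod / np.abs(prod)     # normalized; its inverse FFT peaks at the image shift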

Page 41: GPU Computing with Apache Spark and Python

Cross Power Spectrum: the core of the phase-correlation-based image registration algorithm (GPU version)

• cuFFT

• Explicit memory transfer

Page 42: GPU Computing with Apache Spark and Python

Scaling on Spark

• Machines have multiple GPUs

• Each worker computes a partition at a time

[Diagram: image set → random partitioning with rdd.repartition(num_parts) → apply image registration on each partition with rdd.mapPartitions(imgreg_func) → Progress?]

Page 43: GPU Computing with Apache Spark and Python


CUDA Multi-Process Service (MPS)

• Sharing one GPU between multiple workers can be beneficial

• nvidia-cuda-mps-control

• Better GPU utilization from multiple processes

• For our example app: 5-10% improvement with MPS

• The effect can be bigger for apps with higher GPU utilization

Page 44: GPU Computing with Apache Spark and Python

Scaling with Multiple GPUs and MPS

• 1 CPU worker: 1 Tesla K20

• 2 CPU workers: 2 Tesla K20

• 4 CPU workers: 2 Tesla K20 (2 workers per GPU using MPS)

• Spark + Dask: an external process performs the GPU calculation

Page 45: GPU Computing with Apache Spark and Python

Spark vs CUDA

• Spark logic is similar to CUDA host logic

• mapPartitions is like a kernel launch

• Spark network transfer overhead vs PCI-Express transfer overhead

• Partitions are like CUDA blocks

• Work cooperatively within a partition

[Diagram: image set → random partitioning with rdd.repartition(num_parts) → apply image registration on each partition with rdd.mapPartitions(imgreg_func) → Progress? (same as the earlier scaling slide)]

Page 46: GPU Computing with Apache Spark and Python

TIPS AND TRICKS


Page 47: GPU Computing with Apache Spark and Python


Parallelism is about communication

• Be aware of the communication overhead when deciding how to chunk your work in Spark.

• Bigger chunks: Risk of underutilization of resources

• Smaller chunks: Risk of computation swamped by overhead

➡ Start with big chunks, then move to smaller ones if you fail to reach full utilization.

Page 48: GPU Computing with Apache Spark and Python


Start Small

• Easier to debug your logic on your local system!

• Run a Spark cluster on your workstation where it is easier to debug and profile.

• Move to the cluster only after you’ve seen things work locally. (Spark makes this fairly painless.)

Page 49: GPU Computing with Apache Spark and Python


Be interactive with Big Data!

• Run Jupyter notebook on Spark + MultiGPU cluster

• Live experimentation helps you develop and understand your progress

• Spark keeps track of intermediate data and recomputes it if it is lost

Page 50: GPU Computing with Apache Spark and Python


Amortize Setup Costs

• Make sure that one-time costs are actually done once:

• GPU state (like FFT plans) can be slow to initialize. Do this at the top of your mapPartitions call and reuse it for each element in the RDD partition (see the sketch after this list).

• Coarser chunking of tasks allows GPU memory allocations to be reused between steps inside a single task.
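A sketch of the per-partition setup pattern, assuming an existing RDD named rdd; make_fft_plan and register_image are hypothetical helpers standing in for the real setup and per-image work:

def imgreg_partition(images):
    plan = make_fft_plan()                # one-time, per-partition GPU setup (e.g. FFT plans)
    for img in images:
        yield register_image(img, plan)   # reuse the plan for every element in the partition

results = rdd.mapPartitions(imgreg_partition).collect()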

Page 51: GPU Computing with Apache Spark and Python

Be(a)ware of Global State

• Beware that PySpark may (quite often!) spawn new processes

• New processes may have the same initial random seed

• The Python random module must be seeded properly in each newly spawned process

• Use memoization

• Example: use a global dictionary to track shared state and remember which initializations have already been performed (see the sketch below)
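A minimal sketch of that idea (the function names are illustrative, not the talk's code):

import os
import random

_initialized = {}    # global dict memoizing per-process initialization

def ensure_seeded():
    pid = os.getpid()
    if pid not in _initialized:
        random.seed(os.urandom(16))   # reseed in each newly spawned process
        _initialized[pid] = True

def work(x):
    ensure_seeded()
    return x + random.random()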

Page 52: GPU Computing with Apache Spark and Python

Spark and Multi-GPU

• Spark is not aware of GPU resources

• Efficient usage of GPU resources requires some workarounds, e.g.:

• Option 1: random GPU assignment: numba.cuda.select_device(random.randrange(num_of_gpus))

• Option 2: use mapPartitionsWithIndex: selected_gpu = partition_index % num_of_gpus

• Option 3: manage GPUs externally, like CaffeOnSpark
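A sketch of option 2, assuming an existing RDD named rdd, a hypothetical per-element routine do_gpu_work, and two GPUs per node:

from numba import cuda

num_of_gpus = 2   # assumption: two GPUs per node

def gpu_partition(partition_index, elements):
    cuda.select_device(partition_index % num_of_gpus)   # pick a GPU from the partition index
    for x in elements:
        yield do_gpu_work(x)

results = rdd.mapPartitionsWithIndex(gpu_partition).collect()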

Page 53: GPU Computing with Apache Spark and Python


• GPU assignment by partition index

• Delegating GPU work to an external GPU-aware process: >2x speedup

Page 54: GPU Computing with Apache Spark and Python

CONCLUSION


Page 55: GPU Computing with Apache Spark and Python


PySpark and Numba for GPU clusters

• Numba lets you create compiled CPU and CUDA functions right inside your Python applications.

• Numba can be used with Spark to easily distribute and run your code on Spark workers with GPUs

• There is room for improvement in how Spark interacts with the GPU, but things do work.

• Beware of accidentally multiplying fixed initialization and compilation costs.