Scaling PyData Up and Out


Travis E. Oliphant, PhD @teoliphant Python Enthusiast since 1997

Making NumPy- and Pandas-style code faster and run in parallel.

PyData London 2016

Scale Up vs Scale Out

• Scale Up (Bigger Nodes): Big Memory & Many Cores / GPU Box
• Scale Out (More Nodes): Many commodity nodes in a cluster
• Best of Both (e.g. GPU Cluster)

@jakevdp

"Python's Scientific Ecosystem"


I spent the first 15 years of my “Python life” working to build the foundation of the Python Scientific Ecosystem as a practitioner/developer.

PEP 357, PEP 3118, SciPy


I plan to spend the next 15 years ensuring this ecosystem can scale both up and out, as an entrepreneur, evangelist, and director of innovation.

Conda, Numba, Dask, DyND, Blaze, Data Fabric


In 2012 we “took off” in this effort to form the foundation of scalable Python — Numba (scale-up) and Blaze (scale-out) were the major efforts.

In early 2013, I formally “retired” from NumPy maintenance to pursue “next-generation” NumPy.

Not so much a single new library, but a new emphasis on scale


Rally a community

An "Apache" for Open Data Science: a community-led and directed conference series, with sponsorship proceeds going directly to NumFOCUS.


Organize a company

Purpose: Empower people to solve the world’s greatest challenges.

Mission: We help people discover, analyze, and collaborate by connecting their curiosity and experience with any data.


Start working on key technology

• Blaze — blaze, odo, datashape, dynd, dask
• Bokeh — interactive visualization in the browser, including Jupyter
• Numba — array-oriented Python-subset compiler
• Conda — cross-platform package manager with environments

Bootstrapped and seed-funded through 2014; VC funded in 2015.


Create a Platform! Free and Open Source!

Anaconda is the Open Data Science Platform powered by Python, the fastest-growing open data science language.

• Accelerate Time-to-Value
• Connect Data, Analytics & Compute
• Empower Data Science Teams


[Anaconda platform diagram]

• APPLICATIONS: Interactive presentations, notebooks, reports & apps; solution templates; visual data exploration; advanced spreadsheets; APIs
• ANALYTICS: Data exploration, querying, data prep, data connectors, visual programming, notebooks, analytics development; stats, data mining, deep learning, machine learning, simulation & optimization, geospatial, text & NLP, graph & network, image analysis, advanced analytics
• DATA SCIENCE LANGUAGES: Python, R, Spark | Scala, JavaScript, Java, Fortran, C/C++
• SOFTWARE DEVELOPMENT: IDEs, CI/CD, package / dependency / environment management, web & desktop app dev
• HIGH PERFORMANCE: Distributed computing, parallelism & multi-threading, compiled assets, GPUs & multi-core, compiler
• DATA: Flat files (CSV, XLS…), SQL DB, NoSQL, Hadoop, streaming
• HARDWARE: Workstation, server, cluster (cloud or on-premises)
• OPERATIONS: Governance, provenance, security
• DATA SCIENCE TEAM: Business analyst, data scientist, developer, data engineer, DevOps

Free Anaconda now with MKL as default

• Intel MKL (Math Kernel Library) provides enhanced algorithms for basic math functions.
• Using MKL provides optimal performance for basic BLAS, LAPACK, FFT, and math functions.
• Since version 2.5, the free download of Anaconda has MKL provided as the default (you can also distribute binaries linked against these MKL-enhanced libraries).

Community-maintained conda packages (conda-forge)

Automatic building and testing for your recipes. Awesome!

conda install -c conda-forge <package-name>

conda config --add channels conda-forge

Promises Made

• In 2012 we talked about the need for parallel Pandas and parallel NumPy (both scale-up and scale-out).
• At the first PyData NYC in 2012 we described Blaze as an effort to deliver scale-out, and Numba as well as DyND as efforts to deliver scale-up.
• We introduced them early to rally a developer community.
• We can now rally a user community around working code!

Milestone success — 2016

Numba is delivering on scaling up:
• NumPy’s ufuncs and gufuncs on multiple CPU and GPU threads
• General GPU code
• General CPU code

Dask is delivering on scaling out:
• dask.array (NumPy scaled out)
• dask.dataframe (Pandas scaled out)
• dask.bag (lists scaled out)
• dask.delayed (general scale-out construction)

Numba + Dask

Look at all of the data with Bokeh’s datashader. Decouple the data processing from the visualization. Visualize arbitrarily large data.

• E.g. OpenStreetMap data: about 3 billion GPS coordinates
  https://blog.openstreetmap.org/2012/04/01/bulk-gps-point-data/
• The image was rendered in one minute on a standard MacBook with 16 GB RAM.
• It renders in milliseconds on several 128 GB Amazon EC2 instances.

Categorical data: 2010 US Census

• One point per person, 300 million total
• Categorized by race
• Interactive rendering with Numba + Dask
• No pre-tiling

[Diagram: Anaconda alongside the JVM/Hadoop stack — YARN, HDFS, Impala, Ibis, PySpark & SparkR, MPI; batch, interactive, and high-performance processing; native read & write; the Python & R ecosystem with NumPy, Pandas, and 720+ packages]

Bottom Line: 10-100X faster performance
• Interact with data in HDFS and Amazon S3 natively from Python
• Distributed computations without the JVM & Python/Java serialization
• Framework for easy, flexible parallelism using directed acyclic graphs (DAGs)
• Interactive, distributed computing with in-memory persistence/caching

Bottom Line: Leverage Python & R with Spark

What happened to Blaze?

Still going strong — just at another place in the stack (and sometimes a description of an ecosystem).


[Diagram: the Blaze abstraction]

• Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk
• Metadata: datashape, dtype, shape, stride; hdf5, json, csv, xls, protobuf, avro, ...
• Runtime: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...

[Diagram: how the projects map onto the stack]

• Expressions (APIs, syntax, language): blaze
• Metadata: datashape
• Data (storage/containers): odo
• Runtime (compute): dask
• Parallelize, optimize, JIT: numba

Blaze

Interface to query data on different storage systems: http://blaze.pydata.org/en/latest/

from blaze import Data

iris = Data('iris.csv')                        # CSV
iris = Data('sqlite:///flowers.db::iris')      # SQL
iris = Data('mongodb://localhost/mydb::iris')  # MongoDB
iris = Data('iris.json')                       # JSON
iris = Data('s3://blaze-data/iris.csv')        # S3
…

Current focus is SQL and the PyData stack for the run-time (dask, dynd, numpy, pandas, xray, etc.) plus customer-driven needs (e.g. kdb, mongo, PostgreSQL).

Blaze

Select columns:
    iris[['sepal_length', 'species']]

Operate:
    log(iris.sepal_length * 10)

Reduce:
    iris.sepal_length.mean()

Split-apply-combine:
    by(iris.species,
       shortest=iris.petal_length.min(),
       longest=iris.petal_length.max(),
       average=iris.petal_length.mean())

Add new columns:
    transform(iris,
              sepal_ratio=iris.sepal_length / iris.sepal_width,
              petal_ratio=iris.petal_length / iris.petal_width)

Text matching:
    iris.like(species='*versicolor')

Relabel columns:
    iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')

Filter:
    iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
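Putting a few of these together — a minimal sketch of a Blaze session against the local CSV used on these slides (column names follow the slides; compute() evaluates the lazy expression on the pandas backend):

from blaze import Data, by, compute

# Lazy handle to the CSV used throughout these slides
iris = Data('iris.csv')

# Build a lazy split-apply-combine expression; nothing runs yet
summary = by(iris.species,
             shortest=iris.petal_length.min(),
             longest=iris.petal_length.max())

# Evaluate against the backend (pandas, for a local CSV)
print(compute(summary))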

Blaze (like DyND) uses datashape as its type system:

>>> iris = Data('iris.json')
>>> iris.dshape
dshape("""var * {
    petal_length: float64,
    petal_width: float64,
    sepal_length: float64,
    sepal_width: float64,
    species: string
    }""")

Datashape

A structured data description language for all kinds of data: http://datashape.pydata.org/

A datashape is a list of dimensions followed by a dtype, for example:

    var * { x : int32, y : string, z : float64 }

• Dimensions: fixed sizes such as 3 or 4, or var for variable length
• Unit dtypes: int32, float64, string, ...
• Record dtype: { x : int32, y : string, z : float64 }, an ordered struct dtype (a collection of types keyed by labels)
• A tabular datashape is dimensions * a record dtype

Datashape

{ flowersdb: { iris: var * { petal_length: float64, petal_width: float64,
                             sepal_length: float64, sepal_width: float64, species: string } },
  iriscsv:   var * { sepal_length: ?float64, sepal_width: ?float64,
                     petal_length: ?float64, petal_width: ?float64, species: ?string },
  irisjson:  var * { petal_length: float64, petal_width: float64,
                     sepal_length: float64, sepal_width: float64, species: string },
  irismongo: 150 * { petal_length: float64, petal_width: float64,
                     sepal_length: float64, sepal_width: float64, species: string } }

# Arrays
3 * 4 * int32
10 * var * float64
3 * complex[float64]

# Arrays of Structures
100 * { name: string,
        birthday: date,
        address: { street: string, city: string, postalcode: string, country: string } }

# Structure of Arrays
{ x: 100 * 100 * float32,
  y: 100 * 100 * float32,
  u: 100 * 100 * float32,
  v: 100 * 100 * float32 }

# Function prototype
(3 * int32, float64) -> 3 * float64

# Function prototype with broadcasting dimensions
(A... * int32, A... * int32) -> A... * int32
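As a minimal sketch (assuming the standalone datashape package is installed), these descriptions can be parsed and inspected directly from Python:

from datashape import dshape

# Parse a tabular datashape: a variable-length dimension * a record dtype
ds = dshape("var * { x: int32, y: string, z: float64 }")

print(ds)          # var * {x: int32, y: string, z: float64}
print(ds.shape)    # the dimensions: (var,)
print(ds.measure)  # the record dtype: {x: int32, y: string, z: float64}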

Blaze Server — Lights up your Dark Data

Builds on the Blaze uniform interface to host data remotely through a JSON web API.

server.yaml:

    iriscsv:
        source: iris.csv
    irisdb:
        source: sqlite:///flowers.db::iris
    irisjson:
        source: iris.json
        dshape: "var * {name: string, amount: float64}"
    irismongo:
        source: mongodb://localhost/mydb::iris

$ blaze-server server.yaml -e

localhost:6363/compute.json

Blaze Client

>>> from blaze import Data
>>> t = Data('blaze://localhost:6363')
>>> t.fields
[u'iriscsv', u'irisdb', u'irisjson', u'irismongo']

>>> t.iriscsv
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa

>>> t.irisdb
   petal_length  petal_width  sepal_length  sepal_width      species
0           1.4          0.2           5.1          3.5  Iris-setosa
1           1.4          0.2           4.9          3.0  Iris-setosa
2           1.3          0.2           4.7          3.2  Iris-setosa

Compute recipes work with existing libraries and have multiple backends — write once and run anywhere. Layer over existing data!

• Python lists
• NumPy arrays
• DyND
• Pandas DataFrames
• Spark, Impala, Ibis
• MongoDB
• Dask

• Hint: use odo to copy from one format and/or engine to another!
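A minimal sketch of that hint with odo (the file and database names reuse the ones from the earlier Blaze slides):

import pandas as pd
from odo import odo

# CSV -> in-memory Pandas DataFrame
df = odo('iris.csv', pd.DataFrame)

# CSV -> a table named 'iris' inside a SQLite database
odo('iris.csv', 'sqlite:///flowers.db::iris')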

Data Fabric (pre-alpha, demo-ware)

• New initiative just starting (if you can fund it, please tell me).
• Generalization of the Python buffer protocol to all languages
• A library to build a catalog of shared-memory chunks across the cluster, holding data that can be described by datashape
• A helper type library in DyND that any language can use to select appropriate analytics
• Foundation for a cross-language "super-set" of the PyData stack

https://github.com/blaze/datafabric

Brief Overview of Numba

Go to Graham Markall’s Talk at 2:15 in LG6 to hear more.

Classic Example: Mandelbrot

from numba import jit

@jit
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255
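A minimal, hypothetical driver for the kernel above (grid size and bounds are chosen arbitrarily; the official Numba example wraps this loop in its own jit-compiled create_fractal function):

import numpy as np
from numba import jit

@jit
def create_fractal(xmin, xmax, ymin, ymax, image, iters):
    # uses the mandel() kernel defined above
    height, width = image.shape
    for i in range(height):
        y = ymin + i * (ymax - ymin) / height
        for j in range(width):
            x = xmin + j * (xmax - xmin) / width
            image[i, j] = mandel(x, y, iters)
    return image

image = np.zeros((500, 750), dtype=np.uint8)
create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)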

The Basics (Mandelbrot benchmark)

CPython                        1x
NumPy array-wide operations   13x
Numba (CPU)                  120x
Numba (NVidia Tesla K20c)   2100x

How Numba works

@jit
def do_math(a, b):
    ...

>>> do_math(x, y)

Python Function + Function Arguments
  -> Bytecode Analysis
  -> Numba IR
  -> Type Inference
  -> Rewrite IR
  -> Lowering
  -> LLVM IR
  -> LLVM JIT
  -> Machine Code (cached)
  -> Execute!

Numba Features

• Numba supports:
  - Windows, OS X, and Linux
  - 32- and 64-bit x86 CPUs and NVIDIA GPUs
  - Python 2 and 3
  - NumPy versions 1.6 through 1.9
• Does not require a C/C++ compiler on the user’s system.
• < 70 MB to install.
• Does not replace the standard Python interpreter (all of your existing Python libraries are still available).

How to Use Numba

1. Create a realistic benchmark test case. (Do not use your unit tests as a benchmark!)
2. Run a profiler on your benchmark. (cProfile is a good choice.)
3. Identify hotspots that could potentially be compiled by Numba with a little refactoring. (See the online documentation.)
4. Apply @numba.jit and @numba.vectorize as needed to critical functions. (Small rewrites may be needed to work around Numba limitations.)
5. Re-run the benchmark to check whether there was a performance improvement (a short sketch of this workflow follows below).
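A minimal sketch of steps 2-5 (the slow_hypot function and array sizes are made up for illustration; they are not from the talk):

import cProfile
import numpy as np
from numba import jit

def slow_hypot(x, y):
    # hypothetical hotspot found by profiling
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = (x[i] ** 2 + y[i] ** 2) ** 0.5
    return out

# step 4: compile the hotspot without touching the original source
fast_hypot = jit(nopython=True)(slow_hypot)

x = np.random.rand(1000000)
y = np.random.rand(1000000)

cProfile.run('slow_hypot(x, y)')   # step 2: profile the original
cProfile.run('fast_hypot(x, y)')   # step 5: re-run and compare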

Example: Filter an array

(The code on this slide illustrates:)
• Array allocation
• Looping over ndarray x as an iterator
• Using NumPy math functions
• Returning a slice of the array
• The Numba decorator (nopython=True not required)

2.7x speedup over NumPy!
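The slide's code did not survive the transcript; below is a minimal sketch in the same spirit (the function name, threshold, and data are illustrative, not the original): a jitted loop that keeps the values passing a test and returns a slice of a preallocated result array.

import numpy as np
from numba import jit

@jit                                   # the slide notes nopython=True is not required
def filter_larger(x, threshold):
    result = np.empty_like(x)          # array allocation
    count = 0
    for value in x:                    # looping over ndarray x as an iterator
        if np.abs(value) > threshold:  # using NumPy math functions
            result[count] = value
            count += 1
    return result[:count]              # returning a slice of the array

data = np.random.randn(1000000)
out = filter_larger(data, 0.5)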

NumPy UFuncs and GUFuncs

NumPy ufuncs (and gufuncs) are functions that operate "element-wise" (or "sub-dimension-wise") across an array without an explicit loop.

This implicit loop (which is in machine code) is at the core of why NumPy is fast. Dispatch is done internally to a particular code segment based on the type of the array. It is a very powerful abstraction in the PyData stack.

Making new fast ufuncs used to be possible only in C — painful!

With numba.vectorize and numba.guvectorize it is now easy!

The inner secrets of NumPy are now at your fingertips for you to make your own magic!
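A minimal sketch of such a ufunc built with numba.vectorize (the kernel and signature are illustrative; with target='parallel' or target='cuda' the same kernel can use many cores or a GPU):

import math
import numpy as np
from numba import vectorize

# Scalar kernel; @vectorize turns it into a NumPy ufunc that broadcasts
@vectorize(['float64(float64, float64)'])
def rel_diff(x, y):
    return 2.0 * math.fabs(x - y) / (math.fabs(x) + math.fabs(y))

a = np.linspace(1.0, 10.0, 1000000)
b = a + np.random.rand(1000000)
out = rel_diff(a, b)   # element-wise; the loop runs in compiled code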

Example: Making a windowed compute filter

Perform a computation on a finite window of the input.

For a linear system, this is an FIR filter: what np.convolve or scipy.signal.lfilter can do.

But what if you want some arbitrary computation?

(The slide shows a hand-coded implementation, then builds a gufunc for the kernel, which is faster for large arrays. It is array-oriented, and can now run easily on the GPU with 'target=gpu' and on many cores with 'target=parallel'.)
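The slide's code image is not in the transcript; here is a minimal sketch of the idea with numba.guvectorize (window size, kernel, and names are illustrative): a generalized ufunc that maps a small window of the input onto each output element.

import numpy as np
from numba import guvectorize

# For each interior output element, compute an arbitrary function of a
# 3-sample window. The layout '(n),()->(n)' means: array and scalar in, array out.
@guvectorize(['void(float64[:], float64, float64[:])'], '(n),()->(n)')
def window_filter(x, scale, out):
    n = x.shape[0]
    out[0] = x[0]
    out[n - 1] = x[n - 1]
    for i in range(1, n - 1):
        # arbitrary "windowed" computation, not just a dot product
        out[i] = scale * (x[i - 1] + 2.0 * x[i] + x[i + 1]) / 4.0

data = np.random.randn(10000)
smoothed = window_filter(data, 1.0)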


Other interesting things

• CUDA simulator to debug your code in the Python interpreter
• Generalized ufuncs (@guvectorize), including GPU support and multi-core (threaded) support
• Call ctypes and cffi functions directly and pass them as arguments
• Support for types that understand the buffer protocol
• Pickle Numba functions to run on remote execution engines
• "numba annotate" to dump an HTML-annotated version of compiled code
• See: http://numba.pydata.org/numba-doc/0.23.0/

Recently Added Numba Features

• Support for named tuples in nopython mode
• Support for sets in nopython mode
• Limited support for lists in nopython mode
• On-disk caching of compiled functions (opt-in)
• JIT classes (zero-cost abstraction)
• Support for np.dot (and the '@' operator on Python 3.5)
• Support for some of np.linalg
• generated_jit (jit the functions that are the return values of the decorated function — meta-programming)
• SmartArrays which can exist on host and GPU (transparent data access)
• Ahead-of-time compilation
• Disk caching of pre-compiled code
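One small sketch of the np.dot support (the function and shapes are illustrative; nopython-mode np.dot needs the BLAS that ships with Anaconda/SciPy):

import numpy as np
from numba import jit

@jit(nopython=True)
def project(a, b):
    # np.dot on 2-D float64 arrays now works inside nopython-mode functions
    return np.dot(a, b)

a = np.random.rand(4, 3)
b = np.random.rand(3, 2)
print(project(a, b))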

Overview of Dask as a Parallel Processing Framework with Distributed

Precursors to Parallelism

• Consider the following approaches first:
  1. Use better algorithms
  2. Try Numba or C/Cython
  3. Store data in efficient formats
  4. Subsample your data
• If you have to parallelize:
  1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk)
  2. Then a large workstation (24 cores, 1 TB RAM)
  3. Finally, scale out to a cluster

Overview of Dask

Dask is a Python parallel computing library that is:

• Familiar: implements parallel NumPy and Pandas objects
• Fast: optimized for demanding numerical applications
• Flexible: handles sophisticated and messy algorithms
• Scales out: runs resiliently on clusters of hundreds of machines
• Scales down: pragmatic in a single process on a laptop
• Interactive: responsive and fast for interactive data science

Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.

Spectrum of Parallelization

Explicit control (fast but hard): Threads, Processes, MPI, ZeroMQ
In between: Dask
Implicit control (restrictive but easy): Hadoop, Spark, SQL (Hive, Pig, Impala)

Dask: From User Interaction to Execution

Dask Collections: Familiar Expressions and API

Dask array (mimics NumPy):
    x.T - x.mean(axis=0)

Dask dataframe (mimics Pandas):
    df.groupby(df.index).value.mean()

Dask imperative (wraps custom code):
    def load(filename): ...
    def clean(data): ...
    def analyze(result): ...

Dask bag (collection of data):
    b.map(json.loads).foldby(...)
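A minimal sketch of how these collections are built and evaluated (sizes and chunk shapes are illustrative; every expression is lazy until .compute() is called):

import dask.array as da
import dask.bag as db

# Dask array: chunked, NumPy-like, lazy
x = da.random.random((4000, 4000), chunks=(1000, 1000))
centered = x.T - x.mean(axis=0)     # builds a task graph, no work yet
result = centered.sum().compute()   # triggers parallel execution

# Dask bag: a parallel collection of plain Python objects
b = db.from_sequence(range(10))
print(b.map(lambda i: i ** 2).sum().compute())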

Dask Graphs: Example Machine Learning Pipeline

Dask Graphs: Example Machine Learning Pipeline + Grid Search

• Simple: easy-to-use API
• Flexible: perform a lot of actions with a minimal amount of code
• Fast: dispatching to run-time engines & Cython
• Database-like: familiar ops
• Interop: integration with the PyData stack

[Task graph for (((A + 1) * 2) ** 3)]

[Task graph for (B - B.mean(axis=0)) + (B.T / B.std())]

Dask Schedulers: Example - Distributed Scheduler

[Diagram: a Client on the user machine (laptop) talks to a Scheduler, which farms work out to Workers on the same network; a Worker can also run alongside the Client.]
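A minimal sketch of connecting to such a scheduler with the distributed library (the address is illustrative; Client() with no argument starts a local scheduler and workers instead):

from distributed import Client
import dask.array as da

# Connect to a running dask-scheduler (hypothetical host:port)
client = Client('scheduler-host:8786')

x = da.random.random((100000, 1000), chunks=(1000, 1000))
future = client.compute(x.mean())   # runs on the workers, returns a Future
print(future.result())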

Cluster Architecture Diagram

[Diagram: a Client Machine connects to a Head Node, which coordinates several Compute Nodes.]

Using Anaconda and Dask on your Cluster

• Single machine with multiple threads or processes
• On a cluster with SSH (dcluster)
• Resource management: YARN (knit), SGE, Slurm
• On the cloud with Amazon EC2 (dec2)
• On a cluster with Anaconda for cluster management
  - Manage multiple conda environments and packages on bare-metal or cloud-based clusters

Scheduler Visualization with Bokeh

Examples

1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames
   • Demonstrate Pandas at scale
   • Observe a responsive user interface

2. Distributed language processing with text data using Dask Bags
   • Explore data using a distributed memory cluster
   • Interactively query data using libraries from Anaconda

3. Analyzing global temperature data using Dask Arrays
   • Visualize complex algorithms
   • Learn about dask collections and tasks

4. Handle custom code and workflows using Dask Imperative
   • Deal with messy situations
   • Learn about scheduling

Example 1: Using Dask DataFrames on a cluster with CSV data

• Built from Pandas DataFrames
• Match the Pandas interface
• Access data from HDFS, S3, local, etc.
• Fast, low latency
• Responsive user interface

[Diagram: one Pandas DataFrame per month (January 2016 through May 2016) combined into a single Dask DataFrame]
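A minimal sketch of this pattern (paths and column names are illustrative, loosely following the public NYC taxi CSVs):

import dask.dataframe as dd

# One logical dataframe built from many monthly CSV files;
# each file becomes one or more Pandas partitions.
df = dd.read_csv('nyc-taxi/yellow_tripdata_2016-*.csv',
                 parse_dates=['tpep_pickup_datetime'])

# Pandas-style operations build a task graph; .compute() runs it in parallel
print(df.passenger_count.mean().compute())
print(df.groupby(df.passenger_count).trip_distance.mean().compute())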

Example 2: Using Dask Bags on a cluster with text data

• Distributed natural language processing with text data stored in HDFS
• Handles standard computations
• Looks like other parallel frameworks (Spark, Hive, etc.)
• Access data from HDFS, S3, local, etc.
• Handles the common case...

[Diagram: partitions of data flow through map/function steps and are merged into a result]
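A minimal sketch of a bag workflow (the file path and JSON fields are illustrative):

import json
import dask.bag as db

# Each text file (or block of a large file) becomes one partition
records = db.read_text('data/tweets-*.json.gz').map(json.loads)

# Word-count style processing: Spark/Hive-like, but in plain Python
counts = (records.pluck('text')
                 .map(str.split)
                 .flatten()
                 .frequencies()
                 .topk(10, key=lambda pair: pair[1]))

print(counts.compute())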

Example 3: Using Dask Arrays with global temperature data

• Built from NumPy n-dimensional arrays
• Matches the NumPy interface (subset)
• Solve medium-large problems
• Complex algorithms

[Diagram: many NumPy arrays tiled into a single Dask Array]
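A minimal sketch of building such an array from on-disk data (the HDF5 file, dataset name, and chunk shape are illustrative):

import h5py
import dask.array as da

# A large (time, lat, lon) temperature dataset stored in HDF5
f = h5py.File('global_temperature.h5', mode='r')
temperature = da.from_array(f['/t2m'], chunks=(100, 180, 360))

# NumPy-style reductions over the chunked array
climatology = temperature.mean(axis=0)   # mean over time
anomaly = temperature - climatology      # broadcasting, still lazy
print(anomaly.std().compute())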

Example 4: Using Dask Delayed to handle custom workflows

• Manually handle functions to support messy situations
• Life saver when collections aren't flexible enough
• Combine futures with collections for the best of both worlds
• Scheduler provides resilient and elastic execution
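A minimal sketch of the delayed pattern, using the load/clean/analyze functions named on the earlier collections slide (the function bodies here are placeholders):

from dask import delayed

def load(filename):
    return open(filename).read()                  # placeholder: read raw data

def clean(data):
    return data.strip().splitlines()              # placeholder: tidy it up

def analyze(results):
    return sum(len(lines) for lines in results)   # placeholder: summarize

filenames = ['a.txt', 'b.txt', 'c.txt']
cleaned = [delayed(clean)(delayed(load)(fn)) for fn in filenames]   # lazy graph
total = delayed(analyze)(cleaned)
print(total.compute())   # the scheduler runs the whole graph in parallel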

Scale Up vs Scale Out

• Scale Up (Bigger Nodes): Big Memory & Many Cores / GPU Box
• Scale Out (More Nodes): Many commodity nodes in a cluster
• Best of Both (e.g. GPU Cluster)

Numba (scale up), Dask (scale out), and Blaze together cover this space.

Contribute to the next stage of the journey

Numba, Blaze, and the Blaze ecosystem (dask, odo, dynd, datashape, data-fabric, and beyond…)