Dask - Going larger-than-memory and parallel with graphs

Post on 17-Jan-2017

dask: Going larger-than-memory and parallel with graphs

Blake Griffith @cwlcks

Examples:

Graphs?

Directed Acyclic Graphs (DAGs)

“Graph”: nodes connected by edges.

“Directed”: every edge points from one node to another.

“Acyclic”: following the edges never leads back to the starting node.

dask.array.arange(30, chunks=(10,)).sum()

Task graph for this expression (outputs at the top, inputs at the bottom):

('x_54',)
    sum(...)
('x_53', 0)       ('x_53', 1)       ('x_53', 2)
    sum               sum               sum
('arange-3', 0)   ('arange-3', 1)   ('arange-3', 2)
    arange            arange            arange
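The graph in the figure can be written down as a plain Python dict in which each key maps to a task tuple of the form (function, arguments...). The sketch below is only an illustration of that idea, not dask's own code: `chunk`, `evaluate` and `get` are hypothetical helpers written for this example, and the key names are copied from the figure.

```python
def chunk(start, stop):
    """Stand-in for one arange block: the integers in [start, stop)."""
    return list(range(start, stop))


# Keys name intermediate results; values are (function, *arguments) tasks.
graph = {
    ("arange-3", 0): (chunk, 0, 10),
    ("arange-3", 1): (chunk, 10, 20),
    ("arange-3", 2): (chunk, 20, 30),
    ("x_53", 0): (sum, ("arange-3", 0)),
    ("x_53", 1): (sum, ("arange-3", 1)),
    ("x_53", 2): (sum, ("arange-3", 2)),
    ("x_54",): (sum, [("x_53", 0), ("x_53", 1), ("x_53", 2)]),
}


def evaluate(graph, arg):
    """Replace keys (or lists of keys) by their computed values."""
    if isinstance(arg, list):
        return [evaluate(graph, a) for a in arg]
    if isinstance(arg, tuple) and arg in graph:
        return get(graph, arg)
    return arg  # a literal such as 0 or 10


def get(graph, key):
    """Run the task stored under `key` after computing its inputs."""
    func, *args = graph[key]
    return func(*(evaluate(graph, a) for a in args))


total = get(graph, ("x_54",))
```

Because the three ('x_53', i) tasks do not depend on each other, a scheduler is free to run them in parallel; this single-threaded evaluator simply walks them one at a time.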

Collections create graphs

● dask.array
● dask.dataframe
● dask.bag
● dask.imperative

dask.array

● Scalar math: +, *, exp, log, …
● Reductions: sum(axis=0), mean(), std(), …
● Slicing, indexing: x[:100, 500:100:-2]
● Load from HDF5 and other formats
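A short sketch of the operations listed above, assuming a recent dask release with NumPy installed. The array sizes and chunk shapes are arbitrary choices for illustration.

```python
import dask.array as da

# A 1-D array of 30 integers split into three chunks of 10.
x = da.arange(30, chunks=(10,))

# Scalar math is lazy: this only extends the task graph.
scaled = da.log(x + 1) * 2

# Reductions return lazy results; compute() runs the graph.
total = x.sum().compute()

# A 2-D example for axis reductions and slicing.
m = da.ones((200, 600), chunks=(100, 100))
col_sums = m.sum(axis=0).compute()

# Slicing with a negative step, as on the slide.
sliced = m[:100, 500:100:-2]
```

Nothing is evaluated until `compute()` is called, so `scaled` and `sliced` are still graph-backed arrays at the end of the block.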

dask.array limitations

● Implements only part of the NumPy API
● We always need to know the shape and dtype
● No argwhere(), nonzero(), etc., because their output shape depends on the data

dask.dataframe

● Element-wise and row-wise operations
● Shuffle operations
● Ingest data from CSV files, pandas, NumPy, …

dask.dataframe limitations

● The pandas API is huge.
● GIL: pandas operations that hold the GIL do not speed up with threads.
● Some things are hard to do in parallel, like sorting.

dask.bag

● Parallelize functions across generic Python objects.
● filter, fold, distinct, groupby
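A minimal sketch of the bag operations named above, assuming a recent dask release. The synchronous scheduler is selected explicitly so the example runs in a single process; the input data is arbitrary.

```python
import dask.bag as db

# Ten integers split across two partitions.
numbers = db.from_sequence(range(10), npartitions=2)

# filter and map are lazy and applied per partition.
even_squares = numbers.filter(lambda n: n % 2 == 0).map(lambda n: n * n)
result = even_squares.compute(scheduler="synchronous")

# fold reduces each partition, then combines the partial results.
total = numbers.fold(lambda a, b: a + b).compute(scheduler="synchronous")

# distinct removes duplicates; output order is not guaranteed, so sort it.
unique = sorted(
    db.from_sequence([1, 1, 2, 3, 3], npartitions=2)
    .distinct()
    .compute(scheduler="synchronous")
)
```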

dask.bag limitations

● As slow as Python
● Avoid groupby in favor of foldby
● Uses the multiprocessing scheduler by default, so data moves between processes
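The advice to prefer foldby can be illustrated with a per-key count. In this sketch (assuming a recent dask release; the word list is invented) each partition is reduced per key first, and only the small partial results are combined, so no full regrouping of the data is needed.

```python
import dask.bag as db

words = db.from_sequence(
    ["apple", "avocado", "banana", "blueberry", "cherry"], npartitions=2
)

# Count words by first letter without materializing the groups.
counts = words.foldby(
    key=lambda w: w[0],                  # group by first letter
    binop=lambda total, w: total + 1,    # fold one word into a running count
    initial=0,
    combine=lambda a, b: a + b,          # merge counts from different partitions
    combine_initial=0,
)

# foldby yields (key, value) pairs; run single-process for the example.
letter_counts = dict(counts.compute(scheduler="synchronous"))
```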

dask.imperative

● Do, Value
● Supports most operators
● Slicing
● Attribute access
● Method calls

dask.imperative
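A sketch of the operator, slicing, attribute and method-call support listed for this interface. It is written against `dask.delayed`, the name this module carries in later dask releases, so it is an assumption about how the same ideas look today rather than the code from the talk. The functions `inc` and `add` are hypothetical.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


# Calling the wrapped functions records tasks instead of running them.
a = inc(1)
b = inc(2)

# Operators on lazy values extend the graph.
c = add(a, b) * 10
result = c.compute()

# Wrapping a plain object makes its items, methods and attributes lazy too.
items = delayed([3, 1, 2])
head = items[:2].compute()               # slicing
ones = items.count(1).compute()          # method call
real_part = delayed(3 + 4j).real.compute()  # attribute access
```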

dask.imperative limitations

● Shared resources are bad
● Non-idempotent or impure code
● Iteration
● In-place operations, mutations (setitem, +=).
● Use as a predicate (if a: ...)

Schedulers

● Synchronous
● Threaded
● Multiprocessing
● Distributed

Shared Memory

● Threaded
● Multiprocessing
● Synchronous
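The same graph can be handed to different shared-memory schedulers. The sketch below assumes a recent dask release, where the scheduler is selected by name through the `scheduler=` keyword or through `dask.config.set`; only the synchronous and threaded schedulers are run here, and `"processes"` would select the multiprocessing one.

```python
import dask
import dask.array as da

x = da.arange(30, chunks=(10,))
total = x.sum()  # lazy; no scheduler involved yet

# Run the identical graph under two schedulers.
results = {
    name: int(total.compute(scheduler=name))
    for name in ("synchronous", "threads")
}

# The scheduler can also be set for a whole block of code.
with dask.config.set(scheduler="synchronous"):
    default_result = int(total.compute())
```

The synchronous scheduler is handy for debugging, since everything runs in the calling thread and ordinary tracebacks and debuggers work.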

Distributed Memory

● Beta
● Easy to set up with Anaconda Cluster
● Not very smart

Dask distributed:

● Workers
● Scheduler
● Clients

[Figure: one scheduler, four workers, and three clients on the same network]