Processing biggish data on commodity hardware: simple Python patterns


Description

SciPy 2013 talk on simple Python patterns for processing large datasets efficiently. The talk focuses on the patterns and the concepts rather than the implementations; those can be found in the joblib and scikit-learn codebases.

Transcript of Processing biggish data on commodity hardware: simple Python patterns

Page 1: Processing biggish data on commodity hardware: simple Python patterns

Processing biggish data on commodity hardware

Simple Python patterns
Gaël Varoquaux, INRIA/Parietal – Neurospin

Disclaimer: I'm French, I have opinions. We're in Texas, I hope y'all have left your guns outside.

Yeah, I know, Texas is bigger than France.

Page 2: Processing biggish data on commodity hardware: simple Python patterns

“Big data”: petabytes... distributed storage... computing clusters

Mere mortals: gigabytes... Python programming... off-the-shelf computers

~ 16 CPUs, 32 GB RAM


Page 3: Processing biggish data on commodity hardware: simple Python patterns

My tools

Python, what else? + NumPy + SciPy

The ndarray is underused by the data community


Page 4: Processing biggish data on commodity hardware: simple Python patterns

My tools

Python, what else? Patterns in this presentation:
scikit-learn: machine learning in Python
joblib: using Python functions as pipeline jobs


Page 5: Processing biggish data on commodity hardware: simple Python patterns

Design philosophy

1. Fail gracefully. Easy to debug, robust to errors.

2. Don't solve hard problems. The original problem can be bent.

3. Dependencies suck. Distribution is an age-old problem.

4. Performance matters. Waiting kills productivity.


Page 6: Processing biggish data on commodity hardware: simple Python patterns

Processing big data
Speed-ups in Hadoop, CPUs...

Execution pipelines: dataflow programming, parallel computing

Data access: storing, caching

Pipelines can get messy. Databases are tedious.



Page 8: Processing biggish data on commodity hardware: simple Python patterns

5 simple Python patterns for efficient data crunching

1 On the fly data reduction

2 On-line algorithms

3 Parallel processing patterns

4 Caching

5 Fast I/O


Page 9: Processing biggish data on commodity hardware: simple Python patterns

Big how? 2 scenarios:

Many observations – samples (e.g. Twitter)
Many descriptors per observation – features (e.g. brain scans)


Page 10: Processing biggish data on commodity hardware: simple Python patterns

1 On the fly data reduction

Big data is often I/O bound

Layer memory access: CPU caches, RAM, local disks, distant storage

Less data also means less work



Page 12: Processing biggish data on commodity hardware: simple Python patterns

1 Dropping data
The number one technique used to handle large datasets:

1. loop: take a random fraction of the data

2. run the algorithm on that fraction

3. aggregate results across sub-samplings

Looks like bagging: bootstrap aggregation

Performance tip: run the loop in parallel (see the sketch below)

Exploits redundancy across observations

Great when the number of samples is large
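A minimal sketch of this pattern, assuming a generic estimate function and a 10% fraction (both illustrative, not from the talk), with the loop parallelized via joblib:

    import numpy as np
    from joblib import Parallel, delayed

    def estimate(X):
        # stand-in for any expensive algorithm run on a data fraction
        return X.mean(axis=0)

    X = np.random.normal(size=(100000, 50))

    def one_subsample(seed):
        # 1. take a random fraction (here 10%) of the observations
        rng = np.random.RandomState(seed)
        idx = rng.choice(len(X), size=len(X) // 10, replace=False)
        # 2. run the algorithm on that fraction
        return estimate(X[idx])

    # 3. aggregate across sub-samplings; the loop itself runs in parallel
    results = Parallel(n_jobs=4)(delayed(one_subsample)(s) for s in range(20))
    aggregated = np.mean(results, axis=0)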

Page 13: Processing biggish data on commodity hardware: simple Python patterns

1 Dimension reduction
Often individual features are low-SNR

Random projections (will average features): sklearn.random_projection
random linear combinations of the features

Fast – sub-optimal – clustering of features: sklearn.cluster.WardAgglomeration
on images: a super-pixel strategy

Hashing, when observations have varying size (e.g. words):
sklearn.feature_extraction.text.HashingVectorizer
stateless: can be used in parallel
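Two of these reducers sketched with scikit-learn (note that in recent releases WardAgglomeration has become sklearn.cluster.FeatureAgglomeration; the sizes below are illustrative):

    import numpy as np
    from sklearn.random_projection import GaussianRandomProjection
    from sklearn.feature_extraction.text import HashingVectorizer

    # Random projections: random linear combinations of the features
    X = np.random.normal(size=(1000, 10000))
    X_small = GaussianRandomProjection(n_components=100).fit_transform(X)

    # Hashing: stateless, so separate workers can vectorize
    # chunks of documents independently
    docs = ["processing biggish data", "simple python patterns"]
    X_text = HashingVectorizer(n_features=2 ** 10).transform(docs)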


Page 14: Processing biggish data on commodity hardware: simple Python patterns

1 An example: randomized SVD – sklearn.utils.extmath.randomized_svd

One random projection + power iterations:

    import numpy as np
    from scipy import linalg
    import scipy.sparse.linalg as splinalg
    from sklearn.utils.extmath import randomized_svd

    X = np.random.normal(size=(50000, 200))

    %timeit lapack = linalg.svd(X, full_matrices=False)
    1 loops, best of 3: 6.09 s per loop

    %timeit arpack = splinalg.svds(X, 10)
    1 loops, best of 3: 2.49 s per loop

    %timeit randomized = randomized_svd(X, 10)
    1 loops, best of 3: 303 ms per loop

    linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
    0.0022360679774997738

    linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
    0.0022121161221386925


Page 15: Processing biggish data on commodity hardware: simple Python patterns

2 On-line algorithms
Process the data one sample at a time



Page 17: Processing biggish data on commodity hardware: simple Python patterns

2 On-line algorithms

Compute the mean of a gazillion numbers

Hard? No: just do a running mean
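A minimal runnable sketch; the random stream stands in for data arriving one sample at a time:

    import numpy as np

    stream = iter(np.random.normal(size=1000000))
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        # fold each new sample into the estimate: one pass, O(1) memory
        mean += (x - mean) / n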


Page 18: Processing biggish data on commodity hardware: simple Python patterns

2 Convergence: statistics and speed
If the data are i.i.d., the estimate converges to the expectation

Mini-batch = a bunch of observations
Trade-off between memory usage and vectorization

Example: K-Means clustering

    X = np.random.normal(size=(10000, 200))

    scipy.cluster.vq.kmeans(X, 10, iter=2)                            11.33 s

    sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)    0.62 s
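MiniBatchKMeans also exposes partial_fit, so the same estimator can consume a stream chunk by chunk; a sketch (the chunking is illustrative):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    km = MiniBatchKMeans(n_clusters=10)
    for _ in range(100):
        chunk = np.random.normal(size=(1000, 200))  # e.g. read from disk
        km.partial_fit(chunk)  # update the centroids with this mini-batch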


Page 19: Processing biggish data on commodity hardware: simple Python patterns

3 Parallel processing patterns

Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks

Workers compete for data access
Memory bus is a bottleneck
On grids: distributed storage

The right grain of parallelism
Too fine ⇒ overhead
Too coarse ⇒ memory shortage

Scale by the relevant cache pool

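One practical way to pick the grain is to hand each worker a chunk rather than a single item; a joblib sketch (the 64-way split and the toy process function are illustrative):

    import numpy as np
    from joblib import Parallel, delayed

    def process(chunk):
        # one task per chunk: coarse enough to amortize dispatch
        # overhead, fine enough to fit each worker's share in memory
        return chunk.sum()

    data = np.random.normal(size=1000000)
    chunks = np.array_split(data, 64)  # 64 tasks: the grain to tune
    total = sum(Parallel(n_jobs=4)(delayed(process)(c) for c in chunks))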

Page 23: Processing biggish data on commodity hardware: simple Python patterns

3 Queues – the magic behind joblib.Parallel

Queues: high-performance, concurrent-friendly

Difficulty: callback on result arrival
⇒ multiple threads in the caller + risk of deadlocks

The dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib

⇒ back-and-forth communication
Door open to race conditions
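pre_dispatch is an actual Parallel parameter; for instance, queuing only a couple of batches per worker keeps a generator of large inputs from being consumed eagerly:

    from joblib import Parallel, delayed

    # only 2 * n_jobs tasks are materialized at any time, so the
    # dispatch queue fills up slowly and memory stays bounded
    out = Parallel(n_jobs=2, pre_dispatch='2*n_jobs')(
        delayed(abs)(i) for i in range(1000))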

Page 24: Processing biggish data on commodity hardware: simple Python patterns

3 What happens where: grand-central dispatch?

joblib design: caller, dispatch queue, and collect queue in the same process
Benefit: robustness

Grand-central-dispatch design: the dispatch queue has a process of its own
Benefit: resource management in nested for loops


Page 25: Processing biggish data on commodity hardware: simple Python patterns

4 Caching

For reproducible science: avoid manually chained scripts (make-like usage)

For performance: avoiding re-computing is the crux of optimization


Page 26: Processing biggish data on commodity hardware: simple Python patterns

4 The joblib approach
The memoize pattern:

    mem = joblib.Memory(cachedir='.')
    g = mem.cache(f)
    b = g(a)  # computes b using f
    c = g(a)  # retrieves the result from the store

Challenges in the context of big data: a & b are big

Design goals: a & b arbitrary Python objects; no dependencies

Drop-in, framework-less code for caching
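A runnable version of the pattern (note: current joblib spells the argument location rather than cachedir):

    import numpy as np
    from joblib import Memory

    mem = Memory(location='.')  # cachedir='.' in the joblib of 2013

    @mem.cache  # decorator form of g = mem.cache(f)
    def f(a):
        print('computing...')
        return a ** 2

    a = np.random.normal(size=(1000, 1000))
    b = f(a)  # runs f, prints, and persists the result to disk
    c = f(a)  # silent: the result is retrieved from the store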

Page 27: Processing biggish data on commodity hardware: simple Python patterns

4 Efficient input-argument hashing – joblib.hash

Compute the md5* of the input arguments

Implementation:
1. Create an md5 hash object
2. Subclass the standard-library pickler
   = a state machine that walks the object graph
3. Walk the object graph:
   - ndarrays: pass the data pointer to the md5 algorithm (its “update” method)
   - the rest: pickle
4. Update the md5 with the pickle

* md5 is in the Python standard library
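joblib's actual Hasher subclasses the pickler's save machinery; the documented persistent_id hook gives the same effect in a self-contained sketch (all names here are illustrative, not joblib's):

    import hashlib
    import io
    import pickle
    import numpy as np

    class NdarrayHasher(pickle.Pickler):
        # Walks the object graph via pickle, but streams ndarray
        # buffers straight into md5 instead of pickling their bytes.
        def __init__(self, stream, md5):
            super().__init__(stream, protocol=2)
            self._md5 = md5

        def persistent_id(self, obj):
            if isinstance(obj, np.ndarray):
                # pass the data buffer to the md5 object ("update" method)
                self._md5.update(np.ascontiguousarray(obj))
                # dtype and shape go through the pickle stream instead
                return ('ndarray', obj.dtype.str, obj.shape)
            return None  # everything else: pickle as usual

    def fast_hash(*args):
        md5 = hashlib.md5()
        buf = io.BytesIO()
        NdarrayHasher(buf, md5).dump(args)
        md5.update(buf.getvalue())  # update the md5 with the pickle
        return md5.hexdigest()

    a = np.random.rand(500, 500)
    assert fast_hash(a, 'option') == fast_hash(a.copy(), 'option')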

Page 28: Processing biggish data on commodity hardware: simple Python patterns

4 Fast, disk-based, concurrent store – joblib.dump

Persisting arbitrary objects: once again, subclass the pickler
Use .npy for large numpy arrays (np.save), pickle for the rest
⇒ multiple files

Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations

Good performance, usable on shared disks (clusters)
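A sketch of the atomic-rename strategy around joblib.dump (atomic_publish and the file name are hypothetical helpers, not joblib's API):

    import os
    import shutil
    import tempfile
    import joblib

    def atomic_publish(obj, target_dir):
        # Dump into a private temporary directory, then rename it
        # into place: the rename is atomic, so concurrent readers
        # never observe a half-written store.
        parent = os.path.dirname(os.path.abspath(target_dir))
        tmp = tempfile.mkdtemp(dir=parent)
        joblib.dump(obj, os.path.join(tmp, 'output.pkl'))
        try:
            os.rename(tmp, target_dir)
        except OSError:
            shutil.rmtree(tmp)  # another worker won the race; keep theirs

    atomic_publish({'weights': [1, 2, 3]}, 'result_store')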


Page 29: Processing biggish data on commodity hardware: simple Python patterns

5 Fast I/O
Fast read-outs, for out-of-core computing


Page 30: Processing biggish data on commodity hardware: simple Python patterns

5 Making I/O fast

Fast compression: the CPU may be faster than disk access
Chunk data to match access patterns (pytables)

Standard library: zlib.compress with buffers
(bypass the gzip module to work online + in memory)

Avoiding copies: zlib.compress needs C-contiguous buffers
Store the raw buffer + meta-information (strides, class...)
- use __reduce__
- rebuild: np.core.multiarray._reconstruct
(not in pytables)
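A minimal sketch of the raw-buffer idea (compress_array / decompress_array are illustrative helpers, not the talk's code):

    import zlib
    import numpy as np

    def compress_array(a, level=1):
        # zlib.compress reads the array's C-contiguous buffer directly:
        # no gzip file object, works online and in memory
        a = np.ascontiguousarray(a)
        return zlib.compress(a, level), a.dtype, a.shape

    def decompress_array(blob, dtype, shape):
        # rebuild from the raw bytes + the stored meta-information
        return np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)

    a = np.random.rand(500, 500)
    blob, dtype, shape = compress_array(a)
    assert np.array_equal(a, decompress_array(blob, dtype, shape))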


Page 31: Processing biggish data on commodity hardware: simple Python patterns

5 Benchmarking against np.save and pytables

[Benchmark figure: compression and decompression speed, y-axis scale: 1 is np.save. NeuroImaging data (MNI atlas).]

Page 32: Processing biggish data on commodity hardware: simple Python patterns

@GaelVaroquaux

Summing up

5 simple Python patterns for efficient data crunching

1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O

Page 33: Processing biggish data on commodity hardware: simple Python patterns

@GaelVaroquaux

The cost of complexity is underestimated

Know your problem & solve it with simple primitives

Python modules

scikit-learn: machine learning

joblib: pipeline-ish patterns

Come work with me! Positions available