Processing biggish data on commodity hardware: simple Python patterns
![Page 1: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/1.jpg)
Processing biggish data on commodity hardware
Simple Python patterns
Gael Varoquaux INRIA/Parietal – Neurospin
Disclaimer: I’m French, I have opinions
We’re in Texas, I hope y’all have left your guns outside
Yeah, I know, Texas is bigger than France
![Page 2: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/2.jpg)
“Big data”:
- Petabytes...
- Distributed storage
- Computing cluster
Mere mortals:
- Gigabytes...
- Python programming
- Off-the-shelf computers
∼ 16 CPUs, 32 GB RAM
G Varoquaux 2
![Page 3: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/3.jpg)
My tools
Python, what else?
+ Numpy
+ Scipy
The ndarray is underused by the data community
![Page 4: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/4.jpg)
My tools
Python, what else?
Patterns in this presentation:
- scikit-learn: machine learning in Python
- joblib: using Python functions as pipeline jobs
![Page 5: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/5.jpg)
Design philosophy
1. Fail gracefully
   Easy to debug. Robust to errors.
2. Don’t solve hard problems
   The original problem can be bent.
3. Dependencies suck
   Distribution is an age-old problem.
4. Performance matters
   Waiting kills productivity.
![Page 6: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/6.jpg)
Processing big data
Speed ups in Hadoop, CPUs...
Execution pipelines: dataflow programming, parallel computing
Data access: storing, caching
Pipelines can get messy
Databases are tedious
![Page 8: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/8.jpg)
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
![Page 9: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/9.jpg)
Big how? 2 scenarios:
- Many observations (samples), e.g. twitter
- Many descriptors per observation (features), e.g. brain scans
![Page 10: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/10.jpg)
1 On the fly data reduction
Big data is often I/O bound
Layer memory access: CPU caches, RAM, local disks, distant storage
Less data also means less work
![Page 12: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/12.jpg)
1 Dropping data
Number one technique used to handle large datasets
1. loop: take a random fraction of the data
2. run algorithm on that fraction
3. aggregate results across sub-samplings
Looks like bagging: bootstrap aggregation
Performance tip: run the loop in parallel
Exploits redundancy across observations
Great when the number of samples is large
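The three steps above can be sketched with the standard library only. This is a minimal illustration, not code from the talk: the helper name `subsampled_estimate`, the 10% fraction, the number of rounds and the choice of the median as the "algorithm" are all assumptions made for the example.

```python
import random
import statistics


def subsampled_estimate(data, estimator, fraction=0.1, n_rounds=20, seed=0):
    """Run `estimator` on random fractions of `data` and average the results.

    Hypothetical helper: name and default values are illustrative only.
    """
    rng = random.Random(seed)
    n_keep = max(1, int(len(data) * fraction))
    estimates = []
    for _ in range(n_rounds):
        subset = rng.sample(data, n_keep)    # 1. take a random fraction
        estimates.append(estimator(subset))  # 2. run the algorithm on it
    return statistics.fmean(estimates)       # 3. aggregate across sub-samplings


# Synthetic data standing in for a large dataset
data_rng = random.Random(42)
data = [data_rng.gauss(5.0, 2.0) for _ in range(100_000)]
approx_median = subsampled_estimate(data, statistics.median)
```

Each round is independent of the others, which is why the loop lends itself to being run in parallel.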
![Page 13: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/13.jpg)
1 Dimension reduction
Often individual features are low SNR
Random projections (will average features)
- sklearn.random_projection
- random linear combinations of the features
Fast (sub-optimal) clustering of features
- sklearn.cluster.WardAgglomeration (named FeatureAgglomeration in later scikit-learn releases)
- on images: super-pixel strategy
Hashing, when observations have varying size (e.g. words)
- sklearn.feature_extraction.text.HashingVectorizer
- stateless: can be used in parallel
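To make the "stateless" point concrete, here is a small sketch of the hashing idea for variable-size observations. It is not how HashingVectorizer is implemented: the function name `hash_features`, the use of `zlib.crc32` as the hash and the 16-column width are assumptions chosen to keep the example dependency-free.

```python
import zlib


def hash_features(tokens, n_features=16):
    """Map a variable-length list of tokens to a fixed-size count vector.

    Hypothetical helper: no vocabulary is stored, so nothing has to be
    shared between workers.
    """
    vector = [0] * n_features
    for token in tokens:
        # crc32 gives the same value in every process,
        # unlike the built-in hash() on str
        index = zlib.crc32(token.encode("utf-8")) % n_features
        vector[index] += 1
    return vector


short_doc = hash_features("big data".split())
long_doc = hash_features("processing biggish data on commodity hardware".split())
```

Both documents end up with the same number of columns whatever their length, at the price of occasional collisions between tokens.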
![Page 14: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/14.jpg)
1 An example: randomized SVD
sklearn.utils.extmath.randomized_svd
One random projection + power iterations
    import numpy as np
    from scipy import linalg
    from scipy.sparse import linalg as splinalg
    from sklearn.utils.extmath import randomized_svd

    X = np.random.normal(size=(50000, 200))
    %timeit lapack = linalg.svd(X, full_matrices=False)
    1 loops, best of 3: 6.09 s per loop
    %timeit arpack = splinalg.svds(X, 10)
    1 loops, best of 3: 2.49 s per loop
    %timeit randomized = randomized_svd(X, 10)
    1 loops, best of 3: 303 ms per loop
    linalg.norm(lapack[0][:, :10] - arpack[0]) / 2000
    0.0022360679774997738
    linalg.norm(lapack[0][:, :10] - randomized[0]) / 2000
    0.0022121161221386925
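The "one random projection + power iterations" recipe can be written in a few lines of NumPy. The sketch below is an assumption-laden illustration, not scikit-learn's implementation: the function name `randomized_svd_sketch`, the oversampling of 10 and the 4 iterations are choices made for the example.

```python
import numpy as np


def randomized_svd_sketch(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate truncated SVD (illustrative sketch, hypothetical name)."""
    rng = np.random.default_rng(seed)
    n_random = n_components + n_oversamples
    # One random projection: sample the range of X with a Gaussian matrix
    omega = rng.normal(size=(X.shape[1], n_random))
    Q = np.linalg.qr(X @ omega)[0]
    # Power iterations sharpen the basis; QR keeps it orthonormal
    for _ in range(n_iter):
        Q = np.linalg.qr(X @ (X.T @ Q))[0]
    # Exact SVD, but only of the small projected matrix
    B = Q.T @ X
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :n_components], s[:n_components], Vt[:n_components]


# A matrix of exact rank 5, so that 5 components capture everything
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(500, 5)) @ data_rng.normal(size=(5, 60))
U, s, Vt = randomized_svd_sketch(X, 5)
```

The expensive decomposition only ever touches a matrix with `n_components + n_oversamples` rows, which is where the speed-up comes from.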
![Page 15: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/15.jpg)
2 On-line algorithms
Process the data one sample at a time
![Page 17: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/17.jpg)
2 On-line algorithms
Compute the mean of a gazillion numbers
Hard?
No: just do a running mean
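A running mean needs one pass and constant memory. The function name and the generator used as input below are illustrative assumptions:

```python
def running_mean(stream):
    """Mean of an iterable, seeing each value once and storing none of them."""
    mean = 0.0
    for n, x in enumerate(stream, start=1):
        mean += (x - mean) / n  # incremental update of the current estimate
    return mean


# The generator never materializes the million values in memory
result = running_mean(x * 0.5 for x in range(1_000_001))
```

The update only needs the current estimate and a counter, so the same code works on data streamed from disk or over a network.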
![Page 18: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/18.jpg)
2 Convergence: statistics and speed
If the data are i.i.d., converges to expectations
Mini-batch = bunch of observations
Trade-off between memory usage and vectorization
Example: K-Means clustering
    import numpy as np
    import scipy.cluster.vq
    import sklearn.cluster

    X = np.random.normal(size=(10000, 200))
    scipy.cluster.vq.kmeans(X, 10, iter=2)    # 11.33 s
    sklearn.cluster.MiniBatchKMeans(n_clusters=10, n_init=2).fit(X)    # 0.62 s
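To show how a mini-batch trades memory for vectorization, here is a NumPy sketch of a mini-batch K-Means update. It is written for illustration and is not scikit-learn's MiniBatchKMeans: the function name, the batch size, the number of batches and the random initialisation are all assumptions.

```python
import numpy as np


def minibatch_kmeans_sketch(X, n_clusters, batch_size=100, n_batches=200, seed=0):
    """Illustrative mini-batch K-Means (hypothetical helper, not sklearn's)."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with randomly chosen observations
    centers = X[rng.choice(len(X), size=n_clusters, replace=False)]
    counts = np.zeros(n_clusters)
    for _ in range(n_batches):
        batch = X[rng.integers(0, len(X), size=batch_size)]
        # Vectorized assignment of the whole mini-batch to its nearest center
        distances = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = distances.argmin(axis=1)
        for k in np.unique(labels):
            members = batch[labels == k]
            counts[k] += len(members)
            # Running-mean update: the step shrinks as a center sees more data
            centers[k] += (members.sum(axis=0) - len(members) * centers[k]) / counts[k]
    return centers


# Two well-separated blobs standing in for real data
data_rng = np.random.default_rng(0)
X = np.concatenate([
    data_rng.normal(0.0, 0.5, size=(500, 2)),
    data_rng.normal(10.0, 0.5, size=(500, 2)),
])
centers = minibatch_kmeans_sketch(X, n_clusters=2)
```

Only one batch is in memory at a time, while the distance computation inside a batch stays vectorized; the batch size sets the balance between the two.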
![Page 19: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/19.jpg)
3 Parallel processing patterns
Focus on embarrassingly parallel for loops
Life is too short to worry about deadlocks
Workers compete for data access
- Memory bus is a bottleneck
- On grids: distributed storage
The right grain of parallelism
- Too fine ⇒ overhead
- Too coarse ⇒ memory shortage
Scale by the relevant cache pool
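The grain of parallelism can be made an explicit parameter by dispatching chunks of the loop rather than single items. The sketch below uses the standard-library `ThreadPoolExecutor` purely as a stand-in so that it runs anywhere; for CPU-bound pure-Python work a process pool or joblib.Parallel would be the realistic choice. Function names and the chunk size are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Stand-in for real per-item work: here, a sum of squares."""
    return sum(x * x for x in chunk)


def chunked(items, chunk_size):
    """Yield successive slices of `items`; `chunk_size` sets the grain."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def parallel_sum_of_squares(items, chunk_size=10_000, n_workers=4):
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # One task per chunk: the dispatch overhead is paid once per chunk,
        # and each worker only holds one chunk at a time
        partial_sums = pool.map(process_chunk, chunked(items, chunk_size))
        return sum(partial_sums)


items = list(range(100_000))
total = parallel_sum_of_squares(items)
```

A chunk size of 1 would pay the dispatch overhead for every item, while a single huge chunk would leave the other workers idle and hold all the data in one task.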
![Page 23: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/23.jpg)
3 Queues – the magic behind joblib.Parallel
Queues: high-performance, concurrent-friendly
Difficulty: callback on result arrival
⇒ multiple threads in caller + risk of deadlocks
Dispatch queue should fill up “slowly”
⇒ pre_dispatch in joblib
⇒ Back and forth communication
Door open to race conditions
![Page 24: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/24.jpg)
3 What happens where: grand-central dispatch?
joblib design: caller, dispatch queue, and collect queue in same process
Benefit: robustness
Grand-central dispatch design: dispatch queue has a process of its own
Benefit: resource management in nested for loops
![Page 25: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/25.jpg)
4 Caching
For reproducible science: avoid manually chained scripts (make-like usage)
For performance: avoiding re-computing is the crux of optimization
![Page 26: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/26.jpg)
4 The joblib approach
The memoize pattern
    import joblib

    mem = joblib.Memory(cachedir='.')   # argument is named `location` in recent joblib releases
    g = mem.cache(f)
    b = g(a)   # computes b using f
    c = g(a)   # retrieves results from store
Challenges in the context of big data
- a & b are big
Design goals
- a & b arbitrary Python objects
- No dependencies
Drop-in, framework-less code for caching
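The memoize pattern itself fits in a few lines of standard-library Python. The sketch below only illustrates the pattern and is not joblib's implementation: `disk_memoize`, the file naming scheme and the use of a plain pickle of the arguments as cache key are assumptions, and it ignores the concurrency and large-array concerns that joblib addresses.

```python
import hashlib
import os
import pickle
import tempfile


def disk_memoize(func, cache_dir):
    """Return a version of `func` whose results are stored on disk.

    Hypothetical helper for illustration only.
    """
    def wrapper(*args):
        # The cache key is derived from the pickled input arguments
        key = hashlib.md5(pickle.dumps(args)).hexdigest()
        path = os.path.join(cache_dir, f"{func.__name__}-{key}.pkl")
        if os.path.exists(path):
            with open(path, "rb") as fh:
                return pickle.load(fh)  # cache hit: skip the computation
        result = func(*args)
        with open(path, "wb") as fh:
            pickle.dump(result, fh)
        return result
    return wrapper


calls = []


def f(a):
    calls.append(a)  # records every real computation
    return [x ** 2 for x in a]


cache_dir = tempfile.mkdtemp()
g = disk_memoize(f, cache_dir)
b = g([1, 2, 3])  # computes b using f
c = g([1, 2, 3])  # retrieves the result from the store
```

Because the store lives on disk, the second call would also be a cache hit in a fresh Python session pointed at the same directory.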
![Page 27: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/27.jpg)
4 Efficient input argument hashing – joblib.hash
Compute md5* of input arguments
Implementation
1. Create an md5 hash object
2. Subclass the standard-library pickler
   = state machine that walks the object graph
3. Walk the object graph:
   - ndarrays: pass data pointer to md5 algorithm (“update” method)
   - the rest: pickle
4. Update the md5 with the pickle
* md5 is in the Python standard library
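The four steps can be sketched as follows. This is a simplified illustration, not joblib.hash itself: the class name `HashingPickler`, the use of the pickler's `persistent_id` hook to intercept arrays, and the placeholder tuple written in place of each array are assumptions made for the example.

```python
import hashlib
import io
import pickle

import numpy as np


class HashingPickler(pickle.Pickler):
    """Pickler that routes ndarray buffers straight into an md5 object.

    Hypothetical class for illustration only.
    """

    def __init__(self):
        buffer = io.BytesIO()
        super().__init__(buffer, protocol=4)
        self._buffer = buffer
        self._md5 = hashlib.md5()  # 1. create an md5 hash object

    def persistent_id(self, obj):
        # 3. called for each object met while walking the object graph
        if isinstance(obj, np.ndarray):
            arr = np.ascontiguousarray(obj)
            self._md5.update(arr)  # hash the raw buffer, no pickling of the data
            return ("ndarray", arr.shape, arr.dtype.str)
        return None  # the rest: regular pickling

    def hash(self, obj):
        self.dump(obj)
        # 4. update the md5 with the pickle of everything else
        self._md5.update(self._buffer.getvalue())
        return self._md5.hexdigest()


def sketch_hash(obj):
    return HashingPickler().hash(obj)


a = {"x": np.arange(10.0), "n": 3}
same_as_a = {"x": np.arange(10.0), "n": 3}
different = {"x": np.arange(10.0) + 1, "n": 3}
```

Only the small placeholder tuple goes through the pickle stream; the array data itself is handed to `update` as a buffer, which is what keeps hashing cheap for large inputs.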
![Page 28: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/28.jpg)
4 Fast, disk-based, concurrent, store – joblib.dump
Persisting arbitrary objects
Once again sub-class the pickler
Use .npy for large numpy arrays (np.save), pickle for the rest
⇒ Multiple files
Store concurrency issues
Strategy: atomic operations + try/except
Renaming a directory is atomic
Directory layout consistent with remove operations
Good performance, usable on shared disks (cluster)
![Page 29: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/29.jpg)
5 Fast I/O
Fast read-outs, for out-of-core computing
![Page 30: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/30.jpg)
5 Making I/O fast
Fast compression
- CPU may be faster than disk access
- Chunk data for access patterns (pytables)
- Standard library: zlib.compress with buffers (bypass gzip module to work online + in-memory)
Avoiding copies (not in pytables)
- zlib.compress needs C-contiguous buffers
- Store raw buffer + meta-information (strides, class...)
  - use __reduce__
  - rebuild: np.core.multiarray._reconstruct
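A reduced version of "raw buffer + meta-information" looks like this. It is a sketch under simplifying assumptions, not joblib's code: only dtype and shape are kept as meta-information (not strides or class), the function names are made up for the example, and the compression level of 3 is an arbitrary choice.

```python
import zlib

import numpy as np


def compress_array(arr, level=3):
    """Return (meta-information, compressed raw buffer) for an ndarray."""
    # zlib needs a C-contiguous buffer; this copies only if arr is not one
    arr = np.ascontiguousarray(arr)
    meta = {"dtype": arr.dtype.str, "shape": arr.shape}
    return meta, zlib.compress(arr, level)


def decompress_array(meta, payload):
    raw = zlib.decompress(payload)
    # frombuffer wraps the bytes without copying; the result is read-only
    return np.frombuffer(raw, dtype=meta["dtype"]).reshape(meta["shape"])


original = np.arange(12, dtype=np.float32).reshape(3, 4).T  # not C-contiguous
meta, payload = compress_array(original)
restored = decompress_array(meta, payload)
```

Passing the array itself to `zlib.compress` relies on the buffer protocol, so no intermediate bytes object is built for arrays that are already contiguous.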
![Page 31: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/31.jpg)
5 Benchmarking to np.save and pytables
[Figure: benchmark; y axis scale: 1 is np.save]
NeuroImaging data (MNI atlas)
![Page 32: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/32.jpg)
@GaelVaroquaux
Summing up
5 simple Python patterns for efficient data crunching
1 On the fly data reduction
2 On-line algorithms
3 Parallel processing patterns
4 Caching
5 Fast I/O
![Page 33: Processing biggish data on commodity hardware: simple Python patterns](https://reader038.fdocuments.net/reader038/viewer/2022110308/557cf4bbd8b42a89158b4802/html5/thumbnails/33.jpg)
Cost of complexity underestimated
Know your problem & solve it with simple primitives
Python modules
scikit-learn: machine learning
joblib: pipeline-ish patterns
Come work with me!
Positions available