PyData Paris 2015 - Closing keynote Francesc Alted


Page 1: PyData Paris 2015 - Closing keynote Francesc Alted

New Trends In Storing And Analyzing Large Data Silos With Python

Francesc Alted
Freelancer (ÜberResearch, University of Oslo)
April 3rd, 2015

Page 2: PyData Paris 2015 - Closing keynote Francesc Alted

About ÜberResearch

• Team’s 10+ years of experience delivering solutions and services for funding and research institutions

• Over 20 development partners and clients globally, from the smallest non-profits to large government agencies

• Portfolio company of Digital Science (Macmillan Publishers), the younger sibling of the Nature Publishing Group

http://www.uberresearch.com/

Page 3: PyData Paris 2015 - Closing keynote Francesc Alted

[Architecture diagram with four layers: Data layer, Data Enrichment, Functional layer, Application layer. Components: global grant database; publications, trials, patents; internal data, databases; data model mapping / cleansing; institution disambiguation; person disambiguation; search; clustering / topic modelling; research classification support; visualisation support; APIs; NLP / noun phrase extraction; customer services and APIs; customer functionalities; thesaurus support for indexing; integration in customer application and scenarios]

Page 4: PyData Paris 2015 - Closing keynote Francesc Alted

About Me

• Physicist by training

• Computer scientist by passion

• Open Source enthusiast by philosophy

• PyTables (2002 - 2011)

• Blosc (2009 - now)

• bcolz (2010 - now)

Page 5: PyData Paris 2015 - Closing keynote Francesc Alted

Why Free/Libre Projects?

“The art is in the execution of an idea. Not in the idea. There is not much left just from an idea.”
–Manuel Oltra, music composer

“Real artists ship”
–Seth Godin, writer

• Nice way to realize yourself while helping others

Page 6: PyData Paris 2015 - Closing keynote Francesc Alted

OPSI

Out-of-core Expressions

Indexed Queries

+ a Twist

Page 7: PyData Paris 2015 - Closing keynote Francesc Alted

Overview

• The need for speed: fitting and analyzing as much data as possible with your existing resources

• Recent trends in computer hardware

• bcolz: an example of a data container for large datasets that follows the principles of newer computer architectures

Page 8: PyData Paris 2015 - Closing keynote Francesc Alted

The Need For Speed

Page 9: PyData Paris 2015 - Closing keynote Francesc Alted

Don’t Forget Python’s Real Strengths

• Interactivity

• Data-oriented libraries (NumPy, Pandas, Scikit-Learn…)

• Interactivity

• Performance (thanks to Cython, SWIG, f2py…)

• Interactivity (did I mention that already?)

Page 10: PyData Paris 2015 - Closing keynote Francesc Alted

The Need For Speed

• But interactivity without performance in Big Data is a no-go

• Designing code for data storage performance depends very much on computer architecture

• IMO, existing Python libraries need to put more effort into getting the most out of existing and future computer architectures

Page 11: PyData Paris 2015 - Closing keynote Francesc Alted

The Daily Python Working Scenario

Quiz: which computer is best for interactivity?

Page 12: PyData Paris 2015 - Closing keynote Francesc Alted

Although Modern Servers/Laptops Can Be Very Complex Beasts

We need to know them better so as to get the most out of them

Page 13: PyData Paris 2015 - Closing keynote Francesc Alted

Recent Trends In Computer Hardware

“There's Plenty of Room at the Bottom”

An Invitation to Enter a New Field of Physics

—Talk by Richard Feynman at Caltech, 1959

Page 14: PyData Paris 2015 - Closing keynote Francesc Alted

Memory Access Time vs CPU Cycle Time

The gap is wide and still opening!

Page 15: PyData Paris 2015 - Closing keynote Francesc Alted

Computer Architecture Evolution

(a) up to the end of the 80’s; (b) the 90’s and 2000’s; (c) the 2010’s


implemented several memory layers with different capabilities: lower-level caches (that is, those closer to the CPU) are faster but have reduced capacities and are best suited for performing computations; higher-level caches are slower but have higher capacity and are best suited for storage purposes.

Figure 1 shows the evolution of this hierarchical memory model over time. The forthcoming (or should I say the present?) hierarchical model includes a minimum of six memory levels. Taking advantage of such a deep hierarchy isn’t trivial at all, and programmers must grasp this fact if they want their code to run at an acceptable speed.

Techniques to Fight Data Starvation

Unlike the good old days when the processor was the main bottleneck, memory organization has now become the key factor in optimization. Although learning assembly language to get direct processor access is (relatively) easy, understanding how the hierarchical memory model works—and adapting your data structures accordingly—requires considerable knowledge and experience. Until we have languages that facilitate the development of programs that are aware of memory hierarchy (for an example in progress, see the Sequoia project at www.stanford.edu/group/sequoia), programmers must learn how to deal with this problem at a fairly low level.

There are some common techniques to deal with the CPU data-starvation problem in current hierarchical memory models. Most of them exploit the principles of temporal and spatial locality. In temporal locality, the target dataset is reused several times over a short period. The first time the dataset is accessed, the system must bring it to cache from slow memory; the next time, however, the processor will fetch it directly (and much more quickly) from the cache.

In spatial locality, the dataset is accessed sequentially from memory. In this case, circuits are designed to fetch memory elements that are clumped together much faster than if they’re dispersed. In addition, specialized circuitry (even in current commodity hardware) offers prefetching—that is, it can look at memory-access patterns and predict when a certain chunk of data will be used and start to transfer it to cache before the CPU has actually asked for it. The net result is that the CPU can retrieve data much faster when spatial locality is properly used.

Programmers should exploit the optimizations inherent in temporal and spatial locality as much as possible. One generally useful technique that leverages these principles is the blocking technique (see Figure 2). When properly applied, the blocking technique guarantees that both spatial and temporal localities are exploited for maximum benefit.

Although the blocking technique is relatively simple in principle, it’s less straightforward to implement in practice. For example, should the basic block fit in cache level one, two, or three? Or would it be better to fit it in main memory—which can be useful when computing large, disk-based datasets? Choosing from among these different possibilities is difficult, and there’s no substitute for experimentation and empirical analysis.

In general, it’s always wise to use libraries that already leverage the blocking technique (and others) for achieving high performance; examples include Lapack (www.netlib.org/lapack) and Numexpr (http://code.google.com/p/numexpr). Numexpr is a virtual machine written in Python and C that lets you evaluate potentially complex arithmetic expressions over arbitrarily large arrays. Using the blocking technique in combination…
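To make that concrete, here is a minimal sketch of Numexpr against plain NumPy (the array names and sizes are illustrative):

    import numpy as np
    import numexpr as ne

    a = np.linspace(0, 1, 10**7)
    b = np.linspace(1, 2, 10**7)

    # NumPy evaluates the expression in several passes, materializing
    # full-size temporaries (2*a**2, 3*b, ...) in main memory.
    r1 = 2 * a**2 + 3 * b + 1

    # Numexpr compiles the expression and evaluates it block by block,
    # so the working set stays small enough to live in the CPU caches.
    r2 = ne.evaluate("2 * a**2 + 3 * b + 1")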

Figure 1. Evolution of the hierarchical memory model. (a) The primordial (and simplest) model; (b) the most common current implementation, which includes additional cache levels; and (c) a sensible guess at what’s coming over the next decade: three levels of cache in the CPU and solid state disks lying between main memory and classical mechanical disks.

[Figure 1 panels (a), (b), (c): CPU, Level 1/2/3 caches, main memory, solid state disk and mechanical disk, arranged along speed vs. capacity axes]

Page 16: PyData Paris 2015 - Closing keynote Francesc Alted

Latency Numbers Every Programmer Should Know

Latency Comparison Numbers
--------------------------
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns   14x L1 cache
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns   20x L2 cache, 200x L1 cache
Read 4K randomly from memory              1,000 ns   0.001 ms
Compress 1K bytes with Zippy              3,000 ns
Send 1K bytes over 1 Gbps network        10,000 ns   0.01 ms
Read 4K randomly from SSD*              150,000 ns   0.15 ms
Read 1 MB sequentially from memory      250,000 ns   0.25 ms
Round trip within same datacenter       500,000 ns   0.5 ms
Read 1 MB sequentially from SSD*      1,000,000 ns   1 ms, 4x memory
Disk seek                            10,000,000 ns   10 ms, 20x datacenter roundtrip
Read 1 MB sequentially from disk     20,000,000 ns   20 ms, 80x memory, 20x SSD
Send packet CA->Netherlands->CA     150,000,000 ns   150 ms

Source: Jeff Dean and Peter Norvig (Google), with some additions

https://gist.github.com/hellerbarde/2843375

Page 17: PyData Paris 2015 - Closing keynote Francesc Alted

Reference Time vs Transmission Time

[Diagram: a block in storage is transmitted to the CPU cache; tref is the time to reference the block, ttrans the time to transmit it]

tref ~= ttrans => optimized memory access

Page 18: PyData Paris 2015 - Closing keynote Francesc Alted

Not All Storage Layers Are Created Equal

Memory: tref: 100 ns / ttrans (1 KB): ~100 ns

Solid State Disk: tref: 10 us / ttrans (4 KB): ~10 us

Mechanical Disk: tref: 10 ms / ttrans (1 MB): ~10 ms

This has profound implications for how you access storage!

The slower the media, the larger the block that is worth transmitting (see the sketch below).
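A back-of-the-envelope sketch of that rule, deriving effective bandwidths from the figures above (illustrative numbers, not measurements):

    # For each layer, the block size worth transmitting is roughly the
    # one whose transmission time matches the reference time:
    #     block_size ~= bandwidth * tref
    layers = {
        # name: (tref in seconds, effective bandwidth in bytes/s)
        "memory": (100e-9, 10e9),   # 1 KB in ~100 ns -> ~10 GB/s
        "SSD":    (10e-6, 400e6),   # 4 KB in ~10 us  -> ~400 MB/s
        "disk":   (10e-3, 100e6),   # 1 MB in ~10 ms  -> ~100 MB/s
    }
    for name, (tref, bw) in layers.items():
        print("%s: blocks of ~%.0f bytes are worth transmitting" % (name, tref * bw))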

Page 19: PyData Paris 2015 - Closing keynote Francesc Alted

We Are In A Multicore Age

• This requires special programming measures to leverage its full potential: threads, multiprocessing (see the sketch below)
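A minimal multiprocessing sketch of that idea, splitting a reduction across cores (the worker count and chunking are illustrative; it relies on fork() on POSIX so that workers inherit the array):

    import numpy as np
    from multiprocessing import Pool

    a = np.random.rand(10**7)

    def partial_sum(bounds):
        lo, hi = bounds
        return a[lo:hi].sum()   # each worker reduces its own slice

    if __name__ == "__main__":
        nworkers = 4
        step = len(a) // nworkers
        chunks = [(i * step, (i + 1) * step) for i in range(nworkers)]
        pool = Pool(nworkers)
        total = sum(pool.map(partial_sum, chunks))
        pool.close()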

Page 20: PyData Paris 2015 - Closing keynote Francesc Alted

SIMD: Single Instruction, Multiple Data

More operations in the same CPU clock
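In the Python world you mostly get SIMD indirectly: a vectorized NumPy expression runs a compiled C loop that the compiler can map onto SIMD instructions, while an interpreted loop cannot be vectorized at all. A small illustration:

    import numpy as np

    a = np.random.rand(10**6)
    b = np.random.rand(10**6)

    # Interpreted loop: one Python-level iteration per element, no SIMD.
    c1 = [x + y for x, y in zip(a, b)]

    # Vectorized form: a single compiled loop over contiguous memory
    # that the C compiler can auto-vectorize with SIMD instructions.
    c2 = a + b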

Page 21: PyData Paris 2015 - Closing keynote Francesc Alted

Forthcoming Trends (I)

The growing gap between DRAM and HDD is facilitating the introduction of new SSD devices

Page 22: PyData Paris 2015 - Closing keynote Francesc Alted

Forthcoming Trends (II)

CPU+GPU Integration

Page 23: PyData Paris 2015 - Closing keynote Francesc Alted

Bcolz: An Example Of Data Containers Applying The Principles Of New Hardware

Page 24: PyData Paris 2015 - Closing keynote Francesc Alted

What is bcolz?

• bcolz provides data containers that can be used in a similar way to the ones in NumPy and Pandas

• The main difference is that data storage is chunked, not contiguous!

• Two flavors (sketched below):

• carray: homogeneous, n-dimensional data types

• ctable: heterogeneous types, columnar
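A minimal sketch of both flavors (the rootdir path is made up; the API shown is bcolz’s as of the time of the talk):

    import numpy as np
    import bcolz

    # carray: chunked, compressed, homogeneous container (in memory)
    ca = bcolz.carray(np.arange(10**7))

    # the same container, but persisted on disk
    cd = bcolz.carray(np.arange(10**7), rootdir='mydata.bcolz', mode='w')

    # ctable: heterogeneous types, stored column by column
    ct = bcolz.ctable(columns=[np.arange(10), np.linspace(0, 1, 10)],
                      names=['i', 'x'])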

Page 25: PyData Paris 2015 - Closing keynote Francesc Alted

Contiguous vs Chunked

[Diagram: a NumPy container occupies one contiguous block of memory; a carray container is a sequence of chunks (chunk 1, chunk 2, …, chunk N) laid out in discontiguous memory]

Page 26: PyData Paris 2015 - Closing keynote Francesc Alted

Why Chunking?

• Chunking means more difficulty handling data, so why bother?

• Efficient enlarging and shrinking

• Compression is possible

• Chunk size can be adapted to the storage layer (memory, SSD, mechanical disk); see the sketch below
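For instance, both knobs are exposed at construction time (a sketch; the parameter values are arbitrary):

    import numpy as np
    import bcolz

    c = bcolz.carray(np.arange(10**7),
                     chunklen=2**16,                  # elements per chunk
                     cparams=bcolz.cparams(clevel=5,      # compression level
                                           shuffle=True,  # byte-shuffle filter
                                           cname='lz4'))  # Blosc codec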

Page 27: PyData Paris 2015 - Closing keynote Francesc Alted

Why Columnar?

• Because it adapts better to newer computer architectures

Page 28: PyData Paris 2015 - Closing keynote Francesc Alted

In-Memory Row-Wise Table (Structured NumPy array)

[Diagram: N rows, each holding String, String, Int32, Float64, Int16, … fields side by side; only the Int32 column is of interest]

Interesting Data: N * 4 bytes (Int32)
Actual Data Read: N * 64 bytes (cache line)

Page 29: PyData Paris 2015 - Closing keynote Francesc Alted

In-Memory Column-Wise Table (bcolz ctable)

[Diagram: the same N rows, but stored column by column; only the Int32 column is touched]

Interesting Data: N * 4 bytes (Int32)
Actual Data Read: N * 4 bytes (Int32)

Less memory travels to the CPU!
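The difference shows up directly in code: selecting a field of a structured NumPy array yields a strided view over the full rows, while a ctable keeps each column as its own carray (a sketch; the dtype and field names are made up):

    import numpy as np
    import bcolz

    dt = np.dtype([('name', 'S40'), ('val', np.int32),
                   ('price', np.float64), ('flag', np.int16)])
    rows = np.zeros(10**6, dtype=dt)

    s1 = rows['val'].sum()     # strided view: whole rows stream through cache

    ct = bcolz.ctable(rows)    # one compressed carray per field
    s2 = ct['val'][:].sum()    # only the Int32 column is read and decompressed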

Page 30: PyData Paris 2015 - Closing keynote Francesc Alted

Appending Data in NumPy

[Diagram: the array to be enlarged and the data to append are both copied into a new memory allocation, which becomes the final array object]

• Both memory areas have to exist simultaneously
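A minimal illustration:

    import numpy as np

    a = np.arange(10**6)
    b = np.arange(10**3)

    # np.append allocates a brand-new array and copies both operands,
    # so the old array and the result coexist in memory for a while.
    a = np.append(a, b)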

Page 31: PyData Paris 2015 - Closing keynote Francesc Alted

Appending Data in bcolz

[Diagram: the carray to be enlarged keeps its existing chunks (chunk 1, chunk 2, …); the data to append is compressed with Blosc into new chunk(s) attached to the final carray object]

Only compression of the new data is required! Less memory travels to the CPU!
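The same operation in bcolz mutates the container in place, compressing only the new data (a minimal sketch):

    import numpy as np
    import bcolz

    c = bcolz.carray(np.arange(10**6))

    # existing chunks are untouched; only the appended data gets
    # chunked and compressed (with Blosc) into new chunk(s)
    c.append(np.arange(10**3))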

Page 32: PyData Paris 2015 - Closing keynote Francesc Alted

Why Compression (I)?

[Diagram: the compressed dataset occupies a fraction of the original dataset’s space]

More data can be packed using the same storage

Page 33: PyData Paris 2015 - Closing keynote Francesc Alted

Why Compression (II)?

Less data needs to be transmitted to the CPU

[Diagram: the original dataset travels in full over the disk or memory bus from disk or memory (RAM) to the CPU cache; the compressed dataset travels in less space and is decompressed on arrival]

Transmission + decompression faster than direct transfer?

Page 34: PyData Paris 2015 - Closing keynote Francesc Alted

Blosc: Compressing Faster Than memcpy()
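With the python-blosc bindings the round trip looks like this (a sketch; the compression parameters are illustrative):

    import numpy as np
    import blosc

    a = np.linspace(0, 100, 10**7)

    # typesize tells Blosc the element width so the shuffle filter
    # can rearrange bytes for better (and faster) compression
    packed = blosc.compress(a.tobytes(), typesize=a.itemsize, clevel=9)

    restored = np.frombuffer(blosc.decompress(packed), dtype=a.dtype)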

Page 35: PyData Paris 2015 - Closing keynote Francesc Alted

How Blosc Works

Multithreading & SIMD at work!

(Figure attribution: Valentin Haenel)

Page 36: PyData Paris 2015 - Closing keynote Francesc Alted

Accelerating I/O With Blosc

[Diagram (labels garbled in extraction): Blosc accelerating the I/O path; a brace groups other compressors]

Page 37: PyData Paris 2015 - Closing keynote Francesc Alted

Blosc In OpenVDB And Houdini

“Blosc compresses almost as well as ZLIB, but it is much faster”
–Release Notes for OpenVDB 3.0, maintained by DreamWorks Animation

Page 38: PyData Paris 2015 - Closing keynote Francesc Alted

Some Projects Using bcolz

• Visualfabriq’s bquery (out-of-core groupbys): https://github.com/visualfabriq/bquery

• Continuum’s Blaze: http://blaze.pydata.org/

• Quantopian: http://quantopian.github.io/talks/NeedForSpeed/slides#/

Page 39: PyData Paris 2015 - Closing keynote Francesc Alted

bquery - On-Disk GroupBy

In-memory (pandas) vs on-disk (bquery+bcolz) groupby

“Switching to bcolz enabled us to have a much better scalable architecture yet with near in-memory performance”

— Carst Vaartjes, co-founder visualfabriq

Page 40: PyData Paris 2015 - Closing keynote Francesc Alted

Quantopian’s Use Case

“We set up a project to convert Quantopian’s production and development infrastructure to use bcolz” — Eddie Herbert

Page 41: PyData Paris 2015 - Closing keynote Francesc Alted

Closing Notes

• If you need a data container that fits your needs, look at the nice libraries already out there (NumPy, DyND, Pandas, PyTables, bcolz…)

• Pay attention to hardware and software trends and make informed decisions in your current developments (which, btw, will be deployed in the future :)

• Performance is needed for improving interactivity, so do not hesitate to optimize the hot spots in C if needed (via Cython or other means)

Page 42: PyData Paris 2015 - Closing keynote Francesc Alted

“It is change, continuing change, inevitable change, that is the dominant factor in Computer Science today. No sensible decision can be made any longer without taking into account not only the computer as it is, but the computer as it will be.”

— Based on a quote by Isaac Asimov

Page 43: PyData Paris 2015 - Closing keynote Francesc Alted

Thank You!