Data Formats for Data Science - PyData@EP2016
Data Formats for Data Science
Data Scientist and Researcher
Fondazione Bruno Kessler (FBK)
Trento, Italy
Valerio Maggio
@leriomaggio
About me
• Post Doc Researcher @ FBK
• Complex Data Analytics Unit (MPBA)
• Interested in Machine Learning, Text and Data Processing
• with “Deep” divergences recently
• Fellow Pythonista since 2006
• scientific Python ecosystem
• PyData Italy Chair
• http://pydata.it
• @pydatait
kidding, that’s me!-)
worthwhile mentioning…
End of early-bird: Jul 21, 2016
(that’s today! 😱)
The Program is online: https://www.euroscipy.org/2016/program/
Data Formats 4 Data Science
• Data Processing
• Q: What’s the best way to process data?
• Q+: What’s the most Pythonic way to do that?
• Data Sharing
• Q: What’s the best way to share (and to present) data?
• A: [Interactive] Charts - Data Visualisation
• OMG, Bokeh is better than ever! by Fabio Pliger (after this session!)
Jupyter Notebook for Data and Documentation Sharing
1. Textual Data format
More Pythonic
Numpy to the rescue
csv files
csv Module (in standard library)
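The code shown on these slides did not survive in the transcript. As a minimal sketch of the idea (the file name and the sample rows are made up for illustration), the standard-library csv module can write and read a file inside `with` blocks:

```python
import csv
import os
import tempfile

# Hypothetical sample data; csv always reads values back as strings.
rows = [
    {"name": "alice", "age": "34"},
    {"name": "bob", "age": "27"},
]

path = os.path.join(tempfile.mkdtemp(), "people.csv")

# The context manager closes the file even if an exception is raised.
with open(path, "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["name", "age"])
    writer.writeheader()
    writer.writerows(rows)

with open(path, newline="") as fp:
    loaded = list(csv.DictReader(fp))

print(loaded)
```

Passing `newline=""` to `open` is what the csv documentation recommends, so that the module can handle line endings itself.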
Textual Data format
• Be Pythonic: use context managers (with)
• numpy (mostly numerical) and pandas (csv) to the rescue
• np.loadtxt and pd.read_csv
• (+) Very easy to (re)create and share
• very easy to process
• (-) Not storage friendly but highly compressible!
• (-) No structured information
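A short sketch of the two loaders named above, assuming numpy and pandas are installed. The CSV text is invented for illustration and is read from memory rather than from a file:

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV content with a header row.
text = "x,y\n1.0,2.0\n3.0,4.5\n"

# np.loadtxt suits purely numerical data; the header line is skipped explicitly.
arr = np.loadtxt(io.StringIO(text), delimiter=",", skiprows=1)

# pd.read_csv uses the header as column names and infers column types.
df = pd.read_csv(io.StringIO(text))

print(arr.shape)
print(df.dtypes)
```

Both functions also accept a file path instead of a file-like object.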
2. Binary Data format
Binary format
• Space is not the only concern (for text). Speed matters!
• Python conversions to int() and float() are slow
• costly atoi()/atof() C functions
* A. Scopatz, K.D. Huff - Effective Computation in Physics: Field Guide to Research with Python, O’Reilly 2015
Integers and floats in native and string representations
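To make the difference between the two representations concrete, here is a small sketch using only the standard library. The values are arbitrary, and `struct` is used here as one way to get the native 8-byte layout; it is not taken from the slides:

```python
import struct

values = [3.141592653589793, 2.718281828459045, 1.4142135623730951]

# String representation: one character per digit, plus separators.
as_text = ",".join(repr(v) for v in values)

# Native representation: three little-endian doubles, 8 bytes each.
as_binary = struct.pack("<3d", *values)

# Reading text back needs a str -> float conversion per value;
# the binary form is unpacked directly.
from_text = [float(token) for token in as_text.split(",")]
from_binary = list(struct.unpack("<3d", as_binary))

print(len(as_text), len(as_binary))
```

The binary form is smaller and avoids the per-value parsing step that the slide describes as costly.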
import pickle
Still, it is often desirable to have something more than a binary chunk of data in a file.
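A minimal pickle round trip, with a made-up record, shows what "a binary chunk" means in practice:

```python
import pickle

# Hypothetical record used only for illustration.
record = {"run": 42, "samples": [0.1, 0.2, 0.3], "label": "calibration"}

# Serialise to an opaque bytes object and restore it.
blob = pickle.dumps(record, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)

print(type(blob), restored == record)
```

The bytes carry no queryable structure and are Python-specific, and unpickling data from an untrusted source is unsafe.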
Hierarchical Data Format 5 (a.k.a. hdf5)
• Free and open source file format specification
• The HDF Group - Univ. of Illinois at Urbana-Champaign
• (+) Works great with both big and tiny datasets
• (+) Storage friendly
• Allows for Compression
• (+) Dev. Friendly
• Query DSL + Multiple-language support
• Python: PyTables, hdf5, h5py
with PyTables
Numpy Arrays tight integration
Accessing the table
Hierarchy and Groups
Data Chunking
* A. Scopatz, K.D. Huff - Effective Computation in Physics: Field Guide to Research with Python, O’Reilly 2015
• Small chunks are good for accessing only some of the data at a time.
• Large chunks are good for accessing lots of data at a time.
• Reading and writing chunks may happen in parallel
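The original slides use PyTables. The sketch below uses h5py instead, as an assumption for brevity, to show a group, a chunked and compressed dataset, and a partial read. The group name, dataset name and chunk shape are illustrative choices:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "demo.h5")
data = np.arange(1000 * 100, dtype="f8").reshape(1000, 100)

with h5py.File(path, "w") as f:
    # Groups behave like directories inside the file.
    grp = f.create_group("experiment")
    # Each 100x100 chunk is stored and compressed independently.
    grp.create_dataset(
        "measurements", data=data, chunks=(100, 100), compression="gzip"
    )

with h5py.File(path, "r") as f:
    dset = f["experiment/measurements"]
    chunk_shape = dset.chunks
    # Slicing reads only the chunks that overlap the requested region.
    first_rows = dset[:100, :]

print(chunk_shape, first_rows.shape)
```

The chunk shape is a trade-off: it should roughly match the access pattern, as the bullets above suggest.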
Parallel HDF5
MPI (mpi4py) integration
Learn More
• How to migrate from PostgreSQL to HDF5 and live happily ever after by Michele Simionato @PyData Track on Friday
ROOT Data Format
• Data Analysis Framework (and tool) dev. @CERN
• written in C++;
• native extension in Python (aka PyROOT)
• ROOT6 also ships a Jupyter Kernel
• Definition of a new Binary Data Format (.root)
• based on the serialisation of C++ Objects
rootpy: rootpy.github.io/
root_numpy: rootpy.github.io/root_numpy/
C++ style
root_numpy examples
Tight integration with PyROOT objects
root2hdf5 (included in rootpy)
http://www.rootpy.org/commands/root2hdf5.html
3. JSON Data format
Jupyter Notebook Data Format
JSON is the format of choice for Document Oriented DBs
(a.k.a. NOSQL DBs)
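Since a Jupyter notebook file is itself a JSON document, a simplified notebook-like structure makes a convenient example. The dictionary below is a hand-written illustration and has not been validated against the official notebook schema:

```python
import json

# Simplified, hypothetical notebook-like document.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": ["print('hello')"],
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 0,
}

# Nested dicts and lists map directly onto JSON objects and arrays.
text = json.dumps(notebook, indent=1)
parsed = json.loads(text)

print(parsed["cells"][0]["cell_type"])
```

The same nested shape is what a document-oriented database stores, which is why JSON fits that model naturally.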
HDF5 vs MongoDB
• Total Number of Documents: 100.000
• Total Number of Entries: 8.755.882
• Total Number of Calls: 319.970
[Chart: Average time per Single Call (sec.), comparing HDF5 (blosc filter), MongoDB (flat storage) and MongoDB (compact storage)]
HDF5 vs MongoDB: Storage (MB)
• HDF5 (blosc filter): 922.528
• MongoDB (flat storage): 3.952.148
• MongoDB (compact storage): 1.953.125
4. HDFS Data format
matthewrocklin.com/blog/work/2016/02/22/dask-distributed-part-2
HDFS
• HDFS: Hadoop Filesystem
• Distributed Filesystem on top of Hadoop
• Data can be organised in shards and distributed among several machines (cluster config)
• (de facto) Big Data Data Format
• Python: hdfs3
• Native implementation of HDFS in C++
• No Java along the way!
Opening a Single File on the HDFS
HDFS + CSV
Wildcard opening of CSVs on the HDFS
HDFS + CSV
Big Data and Columnar DBs
• Big Data World is shifting towards columnar DBs
• better oriented to OLAP (analytics) rather than OLTP
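A toy sketch in plain Python of the row versus column layout. The records are invented, and no real columnar database is involved:

```python
# Row-oriented layout: one record per entry (typical for OLTP).
rows = [
    {"id": 1, "city": "Trento", "amount": 10.0},
    {"id": 2, "city": "Bilbao", "amount": 4.5},
    {"id": 3, "city": "Trento", "amount": 7.5},
]

# Column-oriented layout: one contiguous sequence per field.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An analytical query over one field touches every record in the row layout...
total_from_rows = sum(row["amount"] for row in rows)

# ...but only a single sequence in the column layout.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Aggregations that scan a few fields over many records are the typical OLAP workload, which is where storing each column contiguously pays off.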
• In-Database analytics with python and MonetDB by G. Emireni @PyData Italy 2016
A format has no name
http://xarray.pydata.org/en/stable/index.html
http://blaze.pydata.org
Out-of-Core Processing
Complicated data require complicated formats
Complicated formats require good tools
OPeNDAP: http://goo.gl/fMehjh
Thanks a lot for your kind attention
+ValerioMaggio
it.linkedin.com/in/valeriomaggio
@leriomaggio