Analyzing data with docker v4
![Page 1: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/1.jpg)
Analyzing Data With Docker
Andreas Dewes (@japh44)
EuroPython 2016 - Bilbao
![Page 2: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/2.jpg)
Outline
Data Analysis: Small & Large-Scale, Easy & Difficult
Introduction To Docker
Containerizing our Data Analysis
Possible Approaches
Relevant Technologies & Outlook
![Page 3: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/3.jpg)
Data Analysis: Use Cases
Small-scale, interactive: interactive, UI-based analysis (e.g. IPython notebook)
Small-scale, automated: analysis scripts using local data sources (e.g. databases)
Large-scale, automated: non-interactive analysis pipelines (e.g. Apache Hadoop)
Large-scale, interactive: interactive "Big Data" tools, e.g. Apache Spark or Google BigQuery
![Page 4: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/4.jpg)
So what's so difficult about data analysis?
![Page 5: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/5.jpg)
Sharing Data & Tools
![Page 6: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/6.jpg)
Reproducibility
![Page 7: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/7.jpg)
Scaling
![Page 8: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/8.jpg)
Enter Docker....
![Page 9: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/9.jpg)
What is Docker?
A tool that allows us to deploy applications inside "software containers".
Containers work at the process level and isolate the application's view of the operating system (i.e. the processes, resources and files it sees).
Provides a high-level API to manage, version-control, deploy and network containers.
![Page 10: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/10.jpg)
Docker Core-Concepts
Diagram components: CLI, Docker API, Docker Engine, Registry, Images, Containers, Docker Swarm
![Page 11: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/11.jpg)
Images Are Space-Efficient (or at least more efficient than VMs)
![Page 12: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/12.jpg)
Containers Have Little Overhead
https://domino.research.ibm.com/library/cyberdig.nsf/papers/0929052195DD819C85257D2300681E7B/$File/rc25482.pdf
![Page 13: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/13.jpg)
Containers Are Self-Sufficient
![Page 14: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/14.jpg)
Containers Are "Lego" For Data Analytics!
Diagram labels: container, inputs, configuration, data, networked containers, output
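One way to picture the "Lego" idea is a helper that wires inputs, configuration and output into a single `docker run` call. The sketch below only builds the argument list and does not start a container. The function name, the mount points `/data/in` and `/data/out`, and the use of environment variables for configuration are assumptions made for illustration; they are not taken from the talk.

```python
import shlex


def docker_run_command(image, input_dir, output_dir, config=None, command=()):
    """Build the argument list for a `docker run` call (nothing is executed)."""
    argv = ["docker", "run", "--rm"]
    # Input data is mounted read-only, the output directory read-write.
    # The container-side paths are hypothetical conventions.
    argv += ["-v", f"{input_dir}:/data/in:ro"]
    argv += ["-v", f"{output_dir}:/data/out"]
    # Configuration is passed as environment variables (an assumption).
    for key, value in sorted((config or {}).items()):
        argv += ["-e", f"{key}={value}"]
    argv.append(image)
    argv.extend(command)
    return argv


if __name__ == "__main__":
    example = docker_run_command(
        "analysis:1.0",            # hypothetical image name
        "/srv/logs",
        "/srv/results",
        config={"MIN_COUNT": 10},
        command=["python", "analyze.py"],
    )
    # Print a shell-quoted version that could be pasted into a terminal.
    print(shlex.join(example))
```

Because the function returns a plain list, the same description of a step can be printed, logged, or handed to `subprocess.run` on any machine that has Docker installed.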
![Page 15: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/15.jpg)
We Can Build Reproducible Data-Analysis Workflows With Them
Diagram (workflow steps): Map Apache logs, Map Nginx logs, Aggregate results, Filtering, Archiving, Monitoring, BI
![Page 16: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/16.jpg)
Example: Analyzing GitHub Data
Diagram labels: log files from GitHub, analysis script, analysis process(es), output
Repository with code: https://github.com/adewes/docker-map-reduce-example
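The pattern in this example (one analysis script, several processes, one merged output) can be sketched as a small map-reduce job. This is not the code from the repository above. It assumes that each log file holds one JSON event per line with a `type` field, as in GitHub Archive dumps, and all function names are hypothetical.

```python
import json
from collections import Counter
from multiprocessing import Pool


def count_events(lines):
    """Map step: count the "type" field over an iterable of JSON lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts


def count_file(path):
    """Apply the map step to one log file."""
    with open(path, encoding="utf-8") as handle:
        return count_events(handle)


def merge_counts(partial_counts):
    """Reduce step: add up the per-file counters."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def analyze(paths, processes=2):
    """Run the map step in worker processes, then reduce in the parent."""
    with Pool(processes) as pool:
        return merge_counts(pool.map(count_file, paths))
```

Calling `analyze` with a list of log file paths returns a single `Counter`. The map step shares no state between files, which is what makes it straightforward to move each worker into its own container later.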
![Page 17: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/17.jpg)
Live Demo (fingers crossed)
![Page 18: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/18.jpg)
Containerizing Our Analysis
Diagram labels: log files from GitHub, analysis script, image, analysis container (x3), supervisor, output
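A supervisor of this kind could split the log files into chunks and prepare one container per chunk. The sketch below only builds the `docker run` argument lists. The image name, the container names `analysis-N`, the host directory `/srv/logs` and the round-robin split are all assumptions for illustration, not the setup shown in the demo.

```python
def partition(items, n):
    """Split items into n round-robin chunks (some may be empty)."""
    return [items[i::n] for i in range(n)]


def worker_commands(image, log_files, n_workers, host_dir="/srv/logs"):
    """Build one `docker run` argument list per non-empty chunk of files."""
    commands = []
    for index, chunk in enumerate(partition(sorted(log_files), n_workers)):
        if not chunk:
            continue  # more workers than files: skip the idle ones
        argv = [
            "docker", "run", "--rm",
            "--name", f"analysis-{index}",   # hypothetical naming scheme
            "-v", f"{host_dir}:/data:ro",    # share the log directory read-only
            image,
        ]
        # Each container receives its share of files as arguments.
        argv += [f"/data/{name}" for name in chunk]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    for argv in worker_commands("analysis:1.0", ["a.json", "b.json", "c.json"], 2):
        print(" ".join(argv))
```

A real supervisor would additionally start these commands (for instance with `subprocess.Popen`), wait for them to finish, and merge their outputs.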
![Page 19: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/19.jpg)
Live demo (what could go wrong?)
![Page 20: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/20.jpg)
Advantages:
Easy to share
Each analysis step is self-sufficient
Analysis components are "plug & play"
Easy to parallelize (for the right problems)
Versioning included
Disadvantages:
Requires preparing containers
Requires Docker on each machine
Slightly decreases interactivity & flexibility
![Page 21: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/21.jpg)
Which Parts Are Missing?
![Page 22: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/22.jpg)
Orchestration
![Page 23: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/23.jpg)
Dependency Management
![Page 24: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/24.jpg)
Resource Management
![Page 25: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/25.jpg)
Rouster: A Python Tool for Containerized Data Analysis
Built on top of the Docker API
"Make for Docker"
Resource Management
Container Orchestration
Dependency Management
![Page 26: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/26.jpg)
Rouster Uses Recipes to Describe Data Analysis Workflows
Resources (including dependencies): versioning, dependency calculation, backup / copying, distribution, ...
Services: startup (including dependencies), resource provisioning, networking, ...
Actions: scheduling, monitoring, exception handling, logging, ...
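The "Make for Docker" idea rests on dependency calculation: each step runs only after everything it depends on. The sketch below shows that calculation with the standard-library `graphlib` module (Python 3.9 or later). The dictionary is not Rouster's recipe format, and the step names are hypothetical; they loosely echo a CSV-to-Postgres workflow.

```python
from graphlib import TopologicalSorter

# Hypothetical recipe: each key lists the steps it depends on.
RECIPE = {
    "csv_file": [],                        # resource
    "postgres": [],                        # service
    "load_csv": ["csv_file", "postgres"],  # action
    "report": ["load_csv"],                # action
}


def execution_order(dependencies):
    """Return the steps in an order that respects all dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    for step in execution_order(RECIPE):
        print(step)
```

Steps without a dependency between them, such as the CSV resource and the Postgres service here, may appear in either order, and a scheduler is free to start them in parallel. A cyclic recipe makes `static_order` raise `graphlib.CycleError`.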
![Page 27: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/27.jpg)
Live Demo: CSV -> Postgres
![Page 28: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/28.jpg)
Open Questions
How to handle communication between containers (through files, network, ...)?
How to provide resources/data to containers in a distributed environment?
![Page 29: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/29.jpg)
Other relevant technologies
Pachyderm
Pachyderm is a data lake that offers complete version control for data and leverages the container ecosystem to provide reproducible data processing. Built on top of Kubernetes.
http://www.pachyderm.io
Luigi
Luigi is a Python module that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization etc. It also comes with Hadoop support built in.
https://github.com/spotify/luigi
![Page 30: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/30.jpg)
Summary & Outlook
Containers are here to stay!
They are useful in various data analysis contexts.
They don't solve all our problems though.
We need additional tools to use them effectively.
![Page 31: Analyzing data with docker v4](https://reader036.fdocuments.net/reader036/viewer/2022062822/5883fe361a28ab884b8b5319/html5/thumbnails/31.jpg)
Thanks!
Want to contribute?
https://github.com/7scientists/rouster
Andreas Dewes (@japh44)
Image Licenses:
https://commons.wikimedia.org/wiki/File:Matryoshka_dolls_(3671820040)_(2).jpg
https://pixabay.com/de/nordlichter-lager-zelt-abenteuer-1203289/
https://en.wikipedia.org/wiki/Orchestra
https://de.wikipedia.org/wiki/Graph_(Graphentheorie)
http://www.library.illinois.edu/prescons/disaster_response/high_density_storage_disaster_plan/
https://brookeborel.com/2011/06/02/363/
https://en.wikipedia.org/wiki/Data_sharing