DataPrep-The easiest way to prepare data in Python
Jiannan Wang
DataPrep - The easiest way to prepare data in Python
Simon Fraser University
Apr 21, 2021, Thomson Reuters
From Model-Centric to Data-Centric
https://analyticsindiamag.com/big-data-to-good-data-andrew-ng-urges-ml-community-to-be-more-data-centric-and-less-model-centric/
Data Preparation Is Still the Bottleneck!!!
2014: https://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
2020: https://www.anaconda.com/state-of-data-science-2020
Why Is Data Preparation Hard?
Collection → Cleaning → Integration → Analysis
How much time is spent on preparation?
1. Too many small problems (e.g., standardizing dates, deduplicating addresses, etc.)
2. Humans have different levels of expertise (in data science and programming)
3. Domain specific (finance, social science, healthcare, economics, etc.)
Human-in-the-loop Data Preparation
Three Directions
• Spreadsheet GUI
• Workflow GUI
• Notebook GUI
Spreadsheet GUI
Workflow GUI
Notebook GUI
Which Direction To Go?
Source: https://www.verifiedmarketresearch.com/product/data-prep-market/
“Data Prep Market was valued at USD 3.29 Billion in 2019 and is projected to reach USD 18.11 Billion by 2027, growing at a CAGR of 25.64% from 2020 to 2027”
Three Directions
• Spreadsheet GUI
• Workflow GUI
• Notebook GUI
Targeted at non-programmers
Targeted at data scientists
Our Vision
Machine Learning Made Easy
Data Preparation Made Easy
Deep Learning Made Easy
Big Data Made Easy
Visualization Made Easy
DataPrep Components
DataPrep.Connector: Simplify Web Data Collection
DataPrep.EDA: Simplify Exploratory Data Analysis
DataPrep.Clean: Simplify Data Cleaning
DataPrep.Feature: Simplify Feature Engineering
DataPrep.Integrate: Simplify Data Integration
May 2019 - Now
Nov 2019 - Now
Sept 2020 - Now
Planning
Planning
User Feedback
https://www.reddit.com/r/Python/comments/hlqnim/understand_your_data_with_a_few_lines_of_code_in/
Talk Outline
1. DataPrep Overview
2. Dive into DataPrep
• DataPrep.EDA
• DataPrep.Connector
3. Future Direction
DataPrep.EDA
Task-Centric Exploratory Data Analysis
Jinglin Peng*, Weiyuan Wu*, et al. DataPrep.EDA: Task-Centric Exploratory Data Analysis for Statistical Modeling in Python. SIGMOD 2021
Exploratory Data Analysis (EDA)
Understand data and discover insights via data visualization, data summarization, etc.
Understand “Age” column
Current EDA Solutions in Python
Solution 1: Pandas + Matplotlib
☹ Hard to Use
• Beginner: Need to know how to write plotting code
• Expert: Need to write lengthy and repetitive code
Understand “Age” column: Write Code, Write Code, Write Code
Current EDA Solutions in Python
Solution 2: Pandas-profiling
☹ Slow
☹ Hard to Customize
DataPrep.EDA Design Goals
| EDA Solutions | Easy to Use | Interactive Speed | Easy to Customize |
|---|---|---|---|
| 1. Pandas + Matplotlib | ☹ | ☺ | ☺ |
| 2. Pandas-profiling | ☺ | ☹ | ☹ |
| 3. DataPrep.EDA | ☺ | ☺ | ☺ |
Key Idea
Task-Centric API Design
• Declarative
• Support both coarse-grained and fine-grained EDA tasks
Example
• plot(df): “I want to see an overview of the dataset”
• plot_missing(df): “I want to understand the missing values of the dataset”
• plot(df, x): “I want to understand the column x”
• plot(df, x, y): “I want to understand the relationship between x and y”
• …
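The task-centric idea above can be sketched in a few lines: one entry point, where the number of column arguments selects the EDA task. The function `describe_task` below is a hypothetical stand-in written for illustration only. It returns a task description instead of drawing anything, and it is not DataPrep's implementation.

```python
from typing import Optional, Sequence


def describe_task(columns: Sequence[str],
                  x: Optional[str] = None,
                  y: Optional[str] = None) -> str:
    """Hypothetical stand-in for plot(df, x, y): map the call shape to an EDA task."""
    if y is not None and x is None:
        raise ValueError("y requires x")
    for name in (x, y):
        if name is not None and name not in columns:
            raise KeyError(f"unknown column: {name}")
    if x is None:
        return "overview of the dataset"
    if y is None:
        return f"understand the column {x}"
    return f"understand the relationship between {x} and {y}"


columns = ["Age", "Sex", "Fare"]  # illustrative column names
print(describe_task(columns))                 # coarse-grained task
print(describe_task(columns, "Age"))          # fine-grained: one column
print(describe_task(columns, "Age", "Fare"))  # fine-grained: two columns
```

The caller states what they want to understand and never names a chart type, which is what makes the interface declarative.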
DataPrep.EDA (Demo)
Under the Hood
Mapping Rules
Data Processing Pipeline
Mapping Rules
N = Numerical, C = Categorical
[1] https://www.data-to-viz.com/
[2] Exploratory data analysis with R
[3] Missingno: a missing data visualization suite…
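The rule table on this slide is an image and is not reproduced in the transcript. As a rough sketch of how such rules might be encoded, the lookup below keys on the N/C type of each column. The chart names and the type-detection rule are assumptions chosen for illustration, not the slide's actual table.

```python
from numbers import Number
from typing import Dict, List, Sequence, Tuple

# Illustrative rule table; the chart names are assumptions, not the slide's exact table.
MAPPING_RULES: Dict[Tuple[str, ...], List[str]] = {
    ("N",): ["histogram", "box plot"],
    ("C",): ["bar chart", "pie chart"],
    ("N", "N"): ["scatter plot", "hexbin plot"],
    ("C", "N"): ["box plot per category", "line chart per category"],
    ("C", "C"): ["nested bar chart", "heat map"],
}


def semantic_type(values: Sequence[object]) -> str:
    """Return 'N' if every non-missing value is a number, else 'C'."""
    present = [v for v in values if v is not None]
    is_numeric = bool(present) and all(
        isinstance(v, Number) and not isinstance(v, bool) for v in present
    )
    return "N" if is_numeric else "C"


def charts_for(*columns: Sequence[object]) -> List[str]:
    """Look up charts for one or two columns, ignoring argument order."""
    key = tuple(sorted(semantic_type(col) for col in columns))
    return MAPPING_RULES[key]


age = [22, 38, None, 35]
sex = ["male", "female", "female", "male"]
print(charts_for(age))
print(charts_for(age, sex))
```

Keeping the rules in a plain table means a new chart can be added without touching the dispatch logic.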
Data Processing Pipeline
1. Config Manager: produces the Config
2. Compute Module: turns Data and Config into Intermediates
3. Render Module: draws from Intermediates and Config
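A minimal sketch of this three-stage pipeline, assuming a histogram as the only chart and text output in place of real plotting. All names here (`Config`, `build_config`, `compute_histogram`, `render_text`) are hypothetical and stand in for the Config Manager, Compute Module, and Render Module.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Config:
    """Hypothetical plot settings resolved from user overrides plus defaults."""
    bins: int = 4
    bar_char: str = "#"


def build_config(overrides: Optional[Dict[str, object]] = None) -> Config:
    """Step 1: merge user overrides into the defaults."""
    return Config(**(overrides or {}))


def compute_histogram(data: List[float], config: Config) -> Dict[str, list]:
    """Step 2: reduce raw data to small intermediates (bin edges and counts)."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / config.bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * config.bins
    for value in data:
        index = min(int((value - lo) / width), config.bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(config.bins + 1)]
    return {"edges": edges, "counts": counts}


def render_text(intermediates: Dict[str, list], config: Config) -> str:
    """Step 3: draw from intermediates only; the raw data is no longer needed."""
    lines = []
    for left, count in zip(intermediates["edges"], intermediates["counts"]):
        lines.append(f"{left:6.1f} | {config.bar_char * count}")
    return "\n".join(lines)


ages = [2, 22, 24, 35, 38, 54, 80]  # illustrative values
config = build_config({"bins": 4})
print(render_text(compute_histogram(ages, config), config))
```

Separating compute from render means the expensive pass over the data happens once, and a customization that only affects appearance can reuse the same intermediates.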
Interactive Speed
Ubuntu 16.04 Linux server with 64 GB memory and 8 Intel E7-4830 cores
Efficiency Comparison
DataPrep.EDA vs Pandas-Profiling
DataPrep.EDA Takeaways
Innovation
The first task-centric EDA system in Python
Achieve three design goals
• Easy to use
• Interactive speed
• Easy to customize
DataPrep.Connector
A Unified API Wrapper to Simplify Web Data Collection
Data Collection Through RESTful APIs
• Business Data
• Social Data
• Event Data
• Publication Data
RESTful API Example
RESTful API Wrapper
Wrap API calls into Easy-to-Use Python Functions
Building a New API Wrapper Is Tedious!
• HTTP Connection: Connect to the website server
• Authorization: Handle authorization schemes
• Pagination: Request data from multiple pages
• Concurrency: Retrieve data in parallel to save time
• Result Parsing: Convert JSON string to Pandas DataFrame
• …
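To give a feel for the work each wrapper repeats, here is a small sketch of three of the chores listed above: pagination, concurrency, and result parsing. It assumes an offset/limit style of paging and uses an in-memory fake endpoint (`fetch_page`) instead of a real HTTP call, so authorization and networking are left out. It returns a list of dicts rather than a DataFrame to stay within the standard library.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Stand-in for a remote endpoint: 25 fake records held in memory (no network I/O).
FAKE_RECORDS = [{"id": i, "name": f"business-{i}"} for i in range(25)]


def fetch_page(offset: int, limit: int) -> str:
    """Pretend HTTP call: return one page of results as a JSON string."""
    page = FAKE_RECORDS[offset:offset + limit]
    return json.dumps({"total": len(FAKE_RECORDS), "items": page})


def fetch_all(count: int, page_size: int = 10, workers: int = 4) -> List[Dict[str, object]]:
    """Request `count` records by fetching pages concurrently and flattening them."""
    offsets = range(0, count, page_size)
    limits = [min(page_size, count - offset) for offset in offsets]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so pages come back in offset order.
        pages = list(pool.map(fetch_page, offsets, limits))
    rows: List[Dict[str, object]] = []
    for raw in pages:
        rows.extend(json.loads(raw)["items"])  # result parsing: JSON string to records
    return rows


rows = fetch_all(23)
print(len(rows), rows[0], rows[-1])
```

Every site-specific wrapper ends up re-implementing some variant of this loop, which is the repetition the slide is pointing at.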
Yelp, Spotify, Twitter, Youtube, Wiki, Facebook, IMDb, Pinterest, Walmart, NY Times, Reddit, ...
If we don’t unify API wrappers, then ...
● Bad for developers (repetitive building efforts)
● Bad for users (burden to learn many API wrappers)
DataPrep.Connector: A Unified API Wrapper
Reusable Components
Configuration Files
• Yelp Config File
• Spotify Config File
• Youtube Config File
• Twitter Config File
• DBLP Config File
• Facebook Config File
• Reddit Config File
• …
Good for developers (No repetitive building efforts)
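One way to picture the split between reusable components and per-site configuration files is sketched below. The configuration is shown as Python dicts for brevity. The site names, URLs, and field names are made up for illustration and do not reflect the contents of real DataPrep.Connector config files.

```python
from typing import Dict
from urllib.parse import urlencode

# Hypothetical per-site configuration files, shown here as dicts.
# URLs and field names are illustrative, not real DataPrep config contents.
CONFIGS: Dict[str, Dict[str, object]] = {
    "site_a": {
        "url": "https://api.site-a.example/v1/search",
        "params": {"term": "q", "count": "limit", "start": "offset"},
        "auth_header": "Authorization",
    },
    "site_b": {
        "url": "https://api.site-b.example/items",
        "params": {"term": "query", "count": "per_page", "start": "page_start"},
        "auth_header": "X-Api-Key",
    },
}


def build_request(site: str, token: str, **query: object) -> Dict[str, object]:
    """Reusable component: turn unified arguments into a site-specific request."""
    config = CONFIGS[site]
    names = config["params"]
    unknown = sorted(set(query) - set(names))
    if unknown:
        raise ValueError(f"unsupported parameters for {site}: {unknown}")
    params = {names[key]: value for key, value in query.items()}
    return {
        "url": f"{config['url']}?{urlencode(params)}",
        "headers": {config["auth_header"]: token},
    }


# The same call shape works for both sites; only the config entry differs.
print(build_request("site_a", "secret", term="ramen", count=5))
print(build_request("site_b", "secret", term="ramen", count=5))
```

Under this design, supporting a new site means writing a config entry rather than new request-building code.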
The Unified API
Good for users (No burden to learn many API wrappers)
DataPrep.Connector (Demo)
DataPrep.Connector Takeaways
Innovation
The first unified API Wrapper in Python
Good For Developers
Speed up wrapper development process
Good For Users
Speed up data collection from Web APIs
Talk Outline
1. DataPrep Overview
2. Dive into DataPrep
• DataPrep.EDA
• DataPrep.Connector
3. Future Direction
Future Direction
DataPrep.EDA
• Make plots look attractive
• Understand multiple dataframes (plot_diff, plot_db, …)
DataPrep.Connector
• Speed up read_sql() with arrow and parallel connection (http://cx.dataprep.ai)
Future Direction
DataPrep.Clean
• Goal: Implement 100+ clean_{type}(df, x) functions
• Example: clean_email, clean_date, clean_phone, clean_country, etc.
• Application: Data Validation, Data Standardization, Semantic Type Detection
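As a rough illustration of what one clean_{type} function does, the sketch below validates and standardizes email values. It works on a list of dicts instead of a DataFrame so that it needs only the standard library. The function names and the simple regular expression are assumptions for this example, not DataPrep.Clean's implementation.

```python
import re
from typing import Dict, List, Optional

# Deliberately simple pattern; real-world email validation is more involved.
EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def standardize_email(value: Optional[str]) -> Optional[str]:
    """Return a trimmed, lower-cased email, or None if the value is not valid."""
    if value is None:
        return None
    candidate = value.strip().lower()
    return candidate if EMAIL_PATTERN.match(candidate) else None


def clean_email_rows(rows: List[Dict[str, object]], column: str) -> List[Dict[str, object]]:
    """Hypothetical analogue of clean_email(df, x): add a '<column>_clean' field."""
    cleaned = []
    for row in rows:
        value = row.get(column)
        result = standardize_email(value if isinstance(value, str) else None)
        cleaned.append({**row, f"{column}_clean": result})  # input rows stay unchanged
    return cleaned


records = [{"email": "  Alice@Example.COM "}, {"email": "not-an-email"}, {"email": None}]
for record in clean_email_rows(records, "email"):
    print(record)
```

The same function covers two of the applications listed above: a `None` result flags an invalid value (validation), and a non-`None` result is the canonical form (standardization).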
The easiest way to prepare data in Python
http://dataprep.ai
pip install -U dataprep