Data Science Project Lifecycle

Data Science Project Lifecycle, by Jason Geng @Data Application Lab and Miya Du @Data Science Association

Transcript of Data Science Project Lifecycle

Page 1: Data Science Project Lifecycle

Data Science Project Lifecycle

Jason Geng @Data Application Lab

Miya Du @Data Science Association

Page 2: Data Science Project Lifecycle

Business Requirement

Data Acquisition

Data Preparation

Hypothesis & Modeling

Evaluation & Interpretation

Deployment

Operations

Optimization

Page 3: Data Science Project Lifecycle

Business Requirements

- Data scientists need to work with business stakeholders and domain experts to understand both the data and the business

- Specify the business requirements

- Example: healthcare data

Page 4: Data Science Project Lifecycle

e.g. ‘DISCWT’: ‘This is the discharge-level weight on the HCUP nationwide data to produce national estimates’

Understand the data:

Understand the Business:

Goal: Predict Readmission Rate

Database:

Healthcare: Readmissions Database

Modeling
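The DISCWT description above says the weight is used to produce national estimates. A minimal sketch of what that could look like for a readmission rate, assuming the weight is applied as a simple per-record multiplier. The records, the `readmitted` field, and both function names are hypothetical and not taken from the HCUP data; survey design details such as strata are ignored.

```python
# Hypothetical discharge records. The values are made up for illustration
# and are NOT HCUP data; "readmitted" is an assumed 0/1 outcome flag.
records = [
    {"DISCWT": 3.0, "readmitted": 1},
    {"DISCWT": 1.0, "readmitted": 0},
    {"DISCWT": 1.0, "readmitted": 0},
    {"DISCWT": 1.0, "readmitted": 1},
]


def weighted_readmission_rate(rows):
    """Share of weighted discharges that were readmitted."""
    total_weight = sum(r["DISCWT"] for r in rows)
    if total_weight == 0:
        raise ValueError("total weight is zero")
    readmitted_weight = sum(r["DISCWT"] * r["readmitted"] for r in rows)
    return readmitted_weight / total_weight


def unweighted_readmission_rate(rows):
    """Plain sample share, shown only for comparison."""
    return sum(r["readmitted"] for r in rows) / len(rows)


weighted = weighted_readmission_rate(records)
unweighted = unweighted_readmission_rate(records)
```

In this toy sample the two rates differ because the first record stands in for more discharges than the others, which is the reason a weight column needs to be understood before modeling.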

Page 5: Data Science Project Lifecycle

Data Collection

- Data from product line

- Purchase third party data

- Social media (Facebook, LinkedIn)

- Web crawling

- Open source data (Opendata, U.S. Census Data)

Challenge

Data Storage

Data Management

Page 6: Data Science Project Lifecycle

Legacy data

OLTP

Web Log

Web Crawler

Open Source

Third Party Data

Social Media Data

XML

CSV

LOG

SQL

Product Line

Business Intelligence

Data Science App
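The sources above arrive in different formats (XML, CSV, LOG, SQL) before they reach a BI tool or a data science application. A minimal sketch of reading three of those formats into one common record shape, using only the Python standard library. The field names, the XML layout, and the `key=value` log format are assumptions made for illustration; real sources will differ.

```python
import csv
import io
import xml.etree.ElementTree as ET


def from_csv(text):
    """Parse CSV text with a header row into a list of dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def from_xml(text):
    """Parse <record .../> elements, taking fields from their attributes."""
    root = ET.fromstring(text)
    return [dict(el.attrib) for el in root.findall("record")]


def from_log(text):
    """Parse a hypothetical log format: space-separated key=value pairs."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        rows.append(dict(pair.split("=", 1) for pair in line.split()))
    return rows


# Made-up sample inputs, one per format.
csv_text = "user_id,event\n1,purchase\n"
xml_text = '<records><record user_id="2" event="view"/></records>'
log_text = "user_id=3 event=click\n"

unified = from_csv(csv_text) + from_xml(xml_text) + from_log(log_text)
```

Every value stays a string here; type conversion and cleaning belong to the data preparation stage.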

Page 7: Data Science Project Lifecycle

Data Preparation (Data Wrangling)

- Cleaning data (semantic errors, missing entries, or inconsistent formatting)

- Challenge: data integration

- Accounts for roughly 80% of the time in a project workflow

Data Source A

Data Source B

Data Source B

ETL

Data Warehouse
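The cleaning and integration steps above can be sketched as a small ETL pass: fix inconsistent formatting, mark missing entries, then join two sources on a shared key. This is only an illustration under stated assumptions. The two sources, their fields, the accepted date formats, and the function names are all hypothetical.

```python
from datetime import datetime

# Hypothetical records from two sources with inconsistent formatting.
source_a = [
    {"id": "001", "signup": "2017-02-16", "age": "34"},
    {"id": "002", "signup": "02/17/2017", "age": ""},  # missing age
]
source_b = [
    {"id": "1", "city": " los angeles "},
    {"id": "3", "city": "Boston"},
]

# Assumed set of date formats seen in the raw data.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(value):
    """Return an ISO date string, or None if no known format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_a(row):
    """Normalize the id and date, and map an empty age to None."""
    return {
        "id": int(row["id"]),
        "signup": parse_date(row["signup"]),
        "age": int(row["age"]) if row["age"].strip() else None,
    }


def clean_b(row):
    """Normalize the id and tidy the city name."""
    return {"id": int(row["id"]), "city": row["city"].strip().title()}


def integrate(rows_a, rows_b):
    """Left join: keep every cleaned row of A, add the city from B if present."""
    cities = {r["id"]: r["city"] for r in map(clean_b, rows_b)}
    merged = []
    for row in map(clean_a, rows_a):
        row["city"] = cities.get(row["id"])
        merged.append(row)
    return merged


warehouse = integrate(source_a, source_b)
```

The ids "001" and "1" only match after both are converted to integers, which is a small instance of why integration is listed as the challenge.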

Page 8: Data Science Project Lifecycle

Feature Engineering

Select or create features

Research feature relevance

Experiment and validation

Change the feature set

Go back to feature selection step
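The loop above (select, experiment and validate, change the feature set, go back) can be sketched as a greedy forward selection. This is one possible reading of the slide, not the presenters' method. The `evaluate` callback stands in for a real train-and-validate step, and the feature names and gain values in `toy_evaluate` are invented for illustration.

```python
def forward_select(candidates, evaluate):
    """Greedy loop: keep a feature only if it improves the validation score."""
    selected = []
    best_score = evaluate(selected)
    improved = True
    while improved:  # go back to the selection step while something helps
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            score = evaluate(selected + [feature])  # experiment and validation
            if score > best_score:
                selected.append(feature)  # change the feature set
                best_score = score
                improved = True
    return selected, best_score


# Hypothetical stand-in for model validation: each feature adds a fixed,
# made-up gain, and every extra feature costs a small complexity penalty.
TOY_GAIN = {"age": 0.10, "prior_visits": 0.15, "zip_code": 0.0}


def toy_evaluate(features):
    return 0.5 + sum(TOY_GAIN[f] for f in features) - 0.01 * len(features)


chosen, score = forward_select(list(TOY_GAIN), toy_evaluate)
```

In practice `evaluate` would fit a model on training data and score it on held-out data, so the loop is as expensive as the number of model fits it triggers.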

Page 9: Data Science Project Lifecycle

Modeling

Reference Source: http://scikit-learn.org/stable/tutorial/machine_learning_map/
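The referenced scikit-learn map is a flowchart for picking a family of estimators. A rough sketch of its top-level questions only, written as a plain function. This is a simplified paraphrase, not scikit-learn code; the function name, the `goal` values, and the returned strings are assumptions, and the map itself continues with further questions inside each family.

```python
def suggest_task(n_samples, goal, has_labels=False):
    """Return a coarse task family from a few top-level questions.

    goal is one of the hypothetical values "category", "quantity", "explore".
    """
    if n_samples <= 50:
        return "get more data"
    if goal == "category":
        # Labeled categories suggest classification, unlabeled ones clustering.
        return "classification" if has_labels else "clustering"
    if goal == "quantity":
        return "regression"
    if goal == "explore":
        return "dimensionality reduction"
    raise ValueError(f"unknown goal: {goal!r}")


# Example matching the readmission goal from earlier in the deck,
# with a made-up sample count.
readmission_task = suggest_task(n_samples=10_000, goal="category", has_labels=True)
```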

Page 10: Data Science Project Lifecycle

Deploy to Product Line

Page 11: Data Science Project Lifecycle

Thank you!

https://www.DataAppLab.com

Feb 2017

PPT: Xiaolu Zhao @ Feb 16, 2017