From Lab to Factory: Creating value with data

From Lab to Factory: Lessons learned developing Data Products

Keynote PyCon Colombia

Feb 2017

All opinions my own

Who am I?

Data Scientist at Elevate (Recruitment Startup) and Author

@springcoil

About 5 years in Machine Learning/Data Science

OSS contributor (mostly PyMC3)

Built ad-hoc analysis, Data Products and Data Pipelines at

Why talk about this?

Data Moats

Roadmap

What is Data Science?

Types of Data Scientists

Deploying Data Science (tech/culture)

Success stories and recommendations

Bigger idea (Change)

What IS a Data Scientist?

There's a Data Science Spectrum

HT: Sean J. Taylor (Facebook Research Scientist)

Data Science to Value

Data doesn't do anything!

People need to make decisions (strategic/ tactical)

Or the product needs to alter someone's decision making (Spotify Discover Weekly, Amazon Recommendations, etc.)

(Taken from Josh Wills' talk - Lab to Factory 2014) or https://hbr.org/2013/04/two-departments-for-data-succe

Expectations and reality

Data is the new oil...

But. Data isn't a product. It needs to be turned into one!

Data Science is hard

Data and technical problems

Skill set (emotional and tech)

Machine Learning Ops

Data Science projects are risky!

Many stakeholders think that data science is just an engineering problem, but research is high risk and high reward

De-risking the project - how? Send me examples :)

http://www.martingoodson.com/ten-ways-your-data-project-is-going-to-fail/

https://github.com/ianozsvald/data_science_delivered

Success with Data Science

1. People

2. Ideas

3. Things

Source: USAF

R and D needs cultural support

Your culture needs to avoid the HiPPO (highest paid person's opinion) to be data-informed or data-driven.

Good cultures

Data democratization

Clear objectives and metrics against objectives

Financial metrics don't always win - sometimes brand wins

See http://bit.ly/cultureanddata by Martin Goodson

Some data scientists do experiments and build prototypes

Production

(HT: The Yhat people - www.yhathq.com)

Deploying Data Products is hard

Monitoring and Alerting

Deployment

Real world data is tricky (training/ testing)

Interpretability (explaining fraud detection/ad tech)

Feature engineering

Models decay over time
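The monitoring and model-decay points above can be illustrated with a minimal sketch. It assumes true outcomes eventually arrive for each prediction; the class name, window size and threshold are hypothetical and not taken from the talk.

```python
from collections import deque


class AccuracyMonitor:
    """Track rolling accuracy of a deployed model and flag possible decay."""

    def __init__(self, window=100, threshold=0.8):
        self.window = window        # hypothetical: number of recent predictions kept
        self.threshold = threshold  # hypothetical: alert below this accuracy
        self.outcomes = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store whether one production prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Rolling accuracy over the window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        """Alert only once the window is full, to avoid noisy early alarms."""
        return len(self.outcomes) == self.window and self.accuracy() < self.threshold
```

In practice the alert would feed whatever alerting system the team already uses; the point is only that accuracy is measured continuously in production, not once at training time.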

What projects work?

Explain existing data (visualization!)

Automate repetitive/ slow processes

Augment data to make new data (Search engines, ML models)

Predict the future (do something more accurately than gut feel)

Simulate using statistics :) (Sports analytics models)

"You need data first" - Peadar Coyle

Copying and pasting PDF/PNG data

Getting data in some areas is hard!!

Some tools for web data extraction

Messy APIs without documentation :(

Visualise ALL THE THINGS!!

(Relay foods dataset - HT Greg Reda)

Consumer behaviour at a Fast Food Restaurant per year in the USA

Augmenting data and using APIs

Machine Learning

Production data/ Real world

HT: Andrew Ng

Everyone ETLs

Only 1% of your time will be spent modelling

Data pipelines and your infrastructure matter - tools such as Spark, Luigi, Airflow or Dask are important

Monitoring of pipelines. Scalability. Machine provision
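As a rough illustration of what "everyone ETLs" means, here is a minimal extract/transform/load sketch in plain Python. It does not use the Spark, Luigi, Airflow or Dask APIs; the column names and sample rows are hypothetical.

```python
import csv
import io


def extract(raw_csv):
    """Extract: parse raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_csv)))


def transform(rows):
    """Transform: drop rows with a missing amount and cast types."""
    clean = []
    for row in rows:
        amount = (row.get("amount") or "").strip()
        if not amount:
            continue  # dirty row: no amount recorded
        clean.append({"customer": row["customer"].strip(), "amount": float(amount)})
    return clean


def load(rows):
    """Load: aggregate spend per customer (stand-in for a database write)."""
    totals = {}
    for row in rows:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals


def run_pipeline(raw_csv):
    """Chain the three stages; a workflow tool would schedule and monitor them."""
    return load(transform(extract(raw_csv)))


# Hypothetical sample input with one dirty row (missing amount).
SAMPLE = "customer,amount\nana,10.5\nben,\nana,4.5\n"
```

Calling `run_pipeline(SAMPLE)` aggregates the two valid rows and skips the dirty one. Workflow tools add what this sketch lacks: dependency tracking, retries, scheduling and monitoring.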

Eoin Brazil Talk

https://github.com/braz/pycon2015_talk

Success Story (1)

1. Ad Tech project (Media company) - original process took 10 hours per week of analyst time.

2. Broke down process and produced simple predictive model

3. Data type challenges (Unicode in Python 2.7) and business expectations

4. Now takes less than 30 minutes each week.

Success Story (2)

Elevate is a Total Talent Management platform (tools for streamlining recruitment)

Several models in production - microservices (recommenders, job type prediction)

Data pipeline (Luigi) necessary for producing features

Lots of challenges about Docker, time, updated data.

Works: MyPy (Python 3.5), code review, testing - http://hypothesis.works/

Next: Moving to PySpark for ETL, more pair programming
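A small sketch of the two practices named above: type hints that a checker such as MyPy can verify, and a property-style test. The property check is hand-rolled with the standard `random` module as a stand-in for what a property-based testing library automates; the function and its domain (skill names) are hypothetical examples, not code from Elevate.

```python
import random
from typing import List


def normalise_skills(skills: List[str]) -> List[str]:
    """Lower-case, strip and de-duplicate skill names, preserving order."""
    seen = set()
    result = []
    for skill in skills:
        cleaned = skill.strip().lower()
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


def check_idempotent(trials: int = 200, seed: int = 0) -> bool:
    """Property: normalising an already normalised list changes nothing."""
    rng = random.Random(seed)
    alphabet = "aAbB c"
    for _ in range(trials):
        # Generate a random list of short, messy strings.
        skills = [
            "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 5)))
            for _ in range(rng.randint(0, 6))
        ]
        once = normalise_skills(skills)
        if normalise_skills(once) != once:
            return False
    return True
```

The value of stating a property ("normalising twice equals normalising once") rather than a handful of fixed cases is that random inputs tend to find the dirty-data edge cases nobody thought to write down.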

Failure

1. Lack of support from tech teams

2. Lack of clearly defined customer

3. No culture of DevOps

4. Management had no clear vision

Lessons learned from Lab to Factory

1. Monitoring - data products need evaluation in production

2. Lack of a shared language between software engineers and data scientists. Pair programming and code review are some solutions.

3. To help data scientists and analysts succeed, your business needs to be prepared to invest in tooling (pipelines, DevOps, reproducibility).

4. Often you're working with other teams who use different languages - so microservices can be a good idea (for example, a C# .NET app and a Python microservice)

How to deploy a model?

Palladium (Otto Group)

Azure

Flask Microservice (DIY) and Docker
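To make the DIY microservice option concrete, here is a minimal sketch of a JSON prediction endpoint. To stay dependency-free it uses the standard library's `http.server` rather than Flask, so it shows the shape of the idea and not the Flask API. The `predict` function, its weights and its feature names are hypothetical stand-ins for a trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    weights = {"years_experience": 0.5, "skills_matched": 1.0}
    score = sum(weights.get(name, 0.0) * float(value)
                for name, value in features.items())
    return {"score": score}


def handle_request(body):
    """Turn a JSON request body into a (status code, JSON response) pair."""
    try:
        features = json.loads(body)
        if not isinstance(features, dict):
            raise ValueError("expected a JSON object of features")
        return 200, json.dumps(predict(features))
    except (ValueError, TypeError) as exc:
        # Malformed JSON or non-numeric feature values end up here.
        return 400, json.dumps({"error": str(exc)})


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length).decode("utf-8"))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))


def serve(port=8000):
    """Blocking call; would be the entry point of the container."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Keeping the request logic in `handle_request`, separate from the HTTP handler, means the prediction path can be unit tested without starting a server. Any client that speaks HTTP and JSON, whatever language it is written in, can then call the model.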

Dev Ops

We need each other!

Use small data where possible!!

Small problems with clean data are more important - (Ian Ozsvald)

Amazon machine with many Xeons and 244GB of RAM is less than 3 euros per hour. - (Ian Ozsvald)

Blaze, Xray, Dask, Ibis, etc. - "The mean size of a cluster will remain 1" - Matt Rocklin

PyData Bikeshed

https://www.youtube.com/watch?v=RTiAMB2tQjo

Focus on the real problem

Closing remarks

Dirty data stops projects

Change happens, we need to help businesses adapt or be agile

There are some good projects like Icy, Luigi, etc. for transforming data and improving data extraction

Google paper on rules for building ML systems: http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf

Closing Remarks (2)

Stakeholder management is a challenge too

Send me your dirty data and data deployment stories :)

www.peadarcoyle.com

https://leanpub.com/interviewswithdatascientists

A New World

Simulate: Rugby games with MCMC(PyMC3)
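The simulation idea can be sketched without PyMC3. The example below is plain Monte Carlo, not MCMC: it does not fit anything to data, it only simulates match outcomes from scoring rates that are assumed to be known. Modelling each side's points as a Poisson count is a simplifying assumption, and the function names and parameters are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate using Knuth's multiplication method."""
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


def simulate_match(home_rate, away_rate, n_sims=10000, seed=42):
    """Estimate home win / draw / away win probabilities by simulation."""
    rng = random.Random(seed)
    outcomes = {"home": 0, "draw": 0, "away": 0}
    for _ in range(n_sims):
        home = sample_poisson(home_rate, rng)
        away = sample_poisson(away_rate, rng)
        if home > away:
            outcomes["home"] += 1
        elif home < away:
            outcomes["away"] += 1
        else:
            outcomes["draw"] += 1
    return {name: wins / n_sims for name, wins in outcomes.items()}
```

In a Bayesian version the rates would not be fixed numbers: MCMC would produce posterior samples of each team's attacking and defensive strength, and the simulation would run over those samples.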

What is the Data Science process?

Obtain, Scrub, Explore, Model, Interpret, Communicate (or Deploy)

Some NLP on the Interviews!

What do Data Scientists talk about?
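A simple way to ask this question of a set of interview transcripts is a term-frequency count. The sketch below assumes the interviews are available as plain strings; the stop-word list is a tiny hypothetical one, and the analysis shown in the talk may have used a different method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in", "i", "it"}


def top_terms(texts, n=3):
    """Return the n most frequent non-stop-word tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in STOPWORDS)
    return counts.most_common(n)
```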

Based on my Interview series! Dataconomy: www.dataconomy.com/author/pcoyle

A famous 'data product' - Recommendation engines

Credits

http://cdn.yourarticlelibrary.com/wp-content/uploads/2013/12/080.jpg

https://www.flickr.com/photos/82066314@N06/7624218468/sizes/z/