Dataiku predictive application to production - PAPIs, May 2015


Our Goal Today

Imagine how, 5 years from now, predictive applications will be put in production.

How are we doing today? What is difficult? What should be simpler?

What is a predictive application?

Churn Prevention

Fraud Detection

Demand Forecast

Targeting

Maintenance

Match Making

Ad Bidding

Drug Studies

Pricing

Ranking

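This discussion is not relevant to all

Use cases compared: Churn, Maintenance, Drug Studies, Bidding

Columns: Data Span, Retrain every…, Score every…

Data spans: Multi-Years (Churn, Maintenance, Drug Studies), Two Weeks (Bidding)

Retrain / score frequencies shown on the slide: Weekly, Weekly, Yearly, Yearly, Day, Monthly, Monthly, and Sub-Second scoring for Bidding

Annotations: Production = Dev, Online Learning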

Not just a “model”

Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

Let’s call this a Predictive Service Specification
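One way to picture such a specification is as an ordered list of named stages applied to the data one after another. The sketch below is only an illustration under that assumption: the class names `Stage` and `PredictiveServiceSpec`, the stage functions, and the toy churn rule are all hypothetical and do not come from the talk.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Stage:
    """One named step of the pipeline (hypothetical structure)."""
    name: str
    run: Callable[[Any], Any]


@dataclass
class PredictiveServiceSpec:
    """Ordered list of stages, from data prep to decision."""
    stages: List[Stage] = field(default_factory=list)

    def add(self, name: str, run: Callable[[Any], Any]) -> "PredictiveServiceSpec":
        self.stages.append(Stage(name, run))
        return self

    def execute(self, data: Any) -> Any:
        # Each stage consumes the output of the previous one.
        for stage in self.stages:
            data = stage.run(data)
        return data


# Toy churn-like example: every rule below is made up for illustration.
spec = (
    PredictiveServiceSpec()
    .add("data_prep", lambda rows: [r for r in rows if r.get("visits") is not None])
    .add("feature_eng", lambda rows: [{**r, "active": r["visits"] > 3} for r in rows])
    .add("model", lambda rows: [{**r, "proba": 0.1 if r["active"] else 0.9} for r in rows])
    .add("decision", lambda rows: [{**r, "action": "call" if r["proba"] > 0.5 else "none"} for r in rows])
)

result = spec.execute([
    {"id": 1, "visits": 1},
    {"id": 2, "visits": 10},
    {"id": 3, "visits": None},
])
```

Keeping the whole chain in one object is what lets build time and run time share the same definition instead of drifting apart.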

How much effort?

Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

20% 30% 25% 5% 5% 15%

Who Does What?

Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

Data Domain Engineers

Data Analysts, Data Scientists, Business Intelligence Engineers

Huge Variety of Tech

Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

ETL? Ad-hoc?

ETL? Ad-hoc?

ETL? SQL? R? Python? Matlab?

R? Python? R? Python? SAS? Java / Python

Business Rules Management System

From Build to Run

Build Time: Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

Run Time: Input Data → ? → Decision

How Do People Do That Today?

Data Collection → Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

PMML, ETL, Web Service, Script/SQL

A Predictive Service =

Up to 4 different “Applications” that can run out-of-sync

Some Integrated Per-Platform Approaches

in Database

in SAS

in Hadoop/Spark

SQL Commercial Warehouse + Scoring UDF

End-to-end integration script

Ad-hoc development

Top Companies Have Invested a Lot

Each has probably put more than 5M$ into its ML production platform

Reason 1: Prohibitive Costs Kill Projects

Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

R, SQL, Python, R

Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

SQL, ETL, Web Service, SQL, PMML

300K$ 50K$ 200K$ 100K$

50K$

650K$

Reason 2: Distribution Drift

New behaviour

New product

New competitor

Model stops working as planned

You need to be able to do a same-week update
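Acting within the same week presupposes that drift is noticed in the first place. The talk does not name a detection method; the sketch below swaps in the Population Stability Index (PSI) as one common choice, comparing a feature's training-time distribution with its recent distribution. The function name `psi`, the bin count, and the sample data are assumptions made for illustration.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width over the range of `expected`; values of `actual`
    that fall outside that range are clamped to the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant data

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps the logarithm defined for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up data: the "recent" sample only covers the upper half of the range.
training = [i / 100 for i in range(100)]
recent = [0.5 + i / 200 for i in range(100)]

stable_score = psi(training, training)
drift_score = psi(training, recent)
```

A PSI above roughly 0.25 is often quoted as a sign of a significant shift, but the right alert threshold depends on the feature and the use case.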

Reason 3: Mitigating Data Hazards

You need to be able to do a same-week update

Most interesting “Big Data” sources are fragile

Reason 4: Deciding Goes Beyond Predicting

Most interesting problems require combining models + heuristics + non-local optimization

Reason 5: “Suits ready” for scalability

Data Prep → Domain Specific Feature Eng. → Feature Eng. → Model(s) → Scoring / Decision

Your CTO could certainly keep it up and running all by himself

Imagine the Dream Platform That Would Solve All This

Let’s call it Blue Box

New Data → Blue Box → Decision

Feature: Cleanse, Enrich and Merge

Blue Box must be the perfect Data Blending runtime

Feature: Aggregating Data

Raw Events Stream (1TB-100TB+) → Aggregate State (100MB-10GB)

Consolidating History Must Be Part of Blue Box
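The step from a raw event stream to a much smaller aggregate state can be sketched as a fold: each event updates one compact record per entity, so scoring only needs the state and not the full history. The field names (`entity_id`, `ts`, `amount`) and the chosen aggregates are hypothetical; the talk does not specify them.

```python
from collections import defaultdict


def new_state():
    # One small aggregate record per entity (hypothetical fields).
    return defaultdict(lambda: {"n_events": 0, "total_amount": 0.0, "last_seen": None})


def apply_event(state, event):
    """Fold a single raw event into the per-entity aggregate state."""
    agg = state[event["entity_id"]]
    agg["n_events"] += 1
    agg["total_amount"] += event.get("amount", 0.0)
    if agg["last_seen"] is None or event["ts"] > agg["last_seen"]:
        agg["last_seen"] = event["ts"]
    return state


# Made-up events for two entities.
events = [
    {"entity_id": "u1", "ts": 1, "amount": 10.0},
    {"entity_id": "u2", "ts": 2},
    {"entity_id": "u1", "ts": 5, "amount": 2.5},
]

state = new_state()
for e in events:
    apply_event(state, e)
```

Because the update is incremental, the same function can serve a batch replay of history and a live stream.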

Feature: External Data Compliant

main data + additional data (e.g. census, maps, etc.) → enriched main data

Third-Party Data Must Be “In” the Blue Box

Feature: Update Data Service

Smart Lazy Human

A/B Test Support in Blue Box

Decision Ver. A

Decision Ver. B

P D F M S

New Model
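A/B support needs a stable way to route each entity to decision version A or B. A minimal sketch, assuming a hash-based split (the talk does not say how assignment is done): the function name `assign_variant`, the `salt` parameter, and the default 50/50 share are all hypothetical.

```python
import hashlib


def assign_variant(entity_id: str, share_b: float = 0.5, salt: str = "exp-1") -> str:
    """Deterministically map an entity to decision version "A" or "B".

    The same entity always gets the same version for a given salt, so
    repeated scoring calls stay consistent without storing assignments.
    """
    digest = hashlib.sha256(f"{salt}:{entity_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "B" if bucket < share_b else "A"


# Made-up entity ids.
assignments = {f"user-{i}": assign_variant(f"user-{i}") for i in range(1000)}
```

Changing the salt starts a fresh experiment with an independent split, and lowering `share_b` lets a new model receive only a small fraction of the traffic at first.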

Feature: Programmatic Decision

Need for Business Compliant “Real-Time” Rules in Blue Box

model 1, model 2, model 3

if

combine with

if proba > 0.63 decision A else decision B

if proba > 0.79 decision A else decision B
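The two rules above differ only in their threshold, which suggests keeping thresholds as data and the rule as code. The sketch below assumes the model outputs are combined by a plain average; the slide only says "combine with", so that choice, the names `RULES`, `combine`, and `decide`, and the sample probabilities are illustrative assumptions.

```python
from statistics import mean

# Thresholds taken from the two rules on the slide; the version labels are made up.
RULES = {"ver_a": 0.63, "ver_b": 0.79}


def combine(probas):
    """Combine several model outputs into one probability (plain average here)."""
    return mean(probas)


def decide(proba: float, threshold: float) -> str:
    """Apply the slide's rule: decision A above the threshold, otherwise B."""
    return "decision A" if proba > threshold else "decision B"


# Made-up outputs for model 1, model 2, and model 3.
proba = combine([0.60, 0.80, 0.70])
outcome_a = decide(proba, RULES["ver_a"])
outcome_b = decide(proba, RULES["ver_b"])
```

With the same combined probability of about 0.70, the first rule yields decision A and the stricter one yields decision B, which is why the threshold itself is something a business user may want to change without redeploying the model.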

Feature: Audit and Logs

Smart Lazy Human

?

Blue Box needs to keep track of its decisions and why

Decision Cause Log
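A decision log that also records the cause could be as simple as one JSON line per decision, carrying the inputs, the model version, and a readable explanation. The record layout, the function name `log_decision`, and the sample values below are assumptions for illustration, not a format proposed in the talk.

```python
import json


def log_decision(log, entity_id, proba, threshold, model_version, features):
    """Append one decision record, including its cause, to an in-memory log."""
    above = proba > threshold
    record = {
        "entity_id": entity_id,
        "decision": "decision A" if above else "decision B",
        # Human-readable cause: which side of the threshold the score fell on.
        "cause": f"proba {proba:.2f} {'>' if above else '<='} threshold {threshold:.2f}",
        "model_version": model_version,
        "features": features,
    }
    log.append(json.dumps(record, sort_keys=True))
    return record


# Made-up example call.
audit_log = []
rec = log_decision(audit_log, "u1", 0.70, 0.63, "model-v2", {"visits": 1})
```

Storing the model version next to each decision is also what makes rollback and later audits possible, since every past decision can be traced to the exact model and inputs that produced it.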

External Data

Advanced Join / Matching

Ad-Hoc Transformation

Python / R / Spark DataFrame transformations

SQL-Like Transformations

Scoring Causes / Audit

A/B Test Support

Model Rollback / Versioning

Prediction Log, Stats / Audit

Ad-hoc scoring/decision code/scoring

Open Source

What does Blue Box look like?

?

Interesting / Potential Open Source Projects

Real-Time Entity Update, Management, Scoring

Open Source PMML Scoring in Java

Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning

How will we create the “blue box”?

?

Specification? PMML Extension?

Open Source Framework?

Hadoop / Spark Specific?

Thank you!

is blue

Convince decision makers to make data their competitive advantage

[email protected]@dataiku.com

Wanna work on this topic?

Wanna share your dream features?