Dataiku - for Data Geek Paris@Criteo - Close the Data Circle
Dataiku predictive application to production - PAPIs May 2015
Imagine how, 5 years from now, predictive applications will be put in production
Our Goal Today
How are we doing today ? What is difficult ?
What should be simpler?
What is a predictive application ?
Churn Prevention
Fraud Detection
Demand Forecast
Targeting
Maintenance
Match Making
Ad Bidding
Drug Studies
Pricing
Ranking
This discussion is not relevant to all
Churn
Maintenance
Drug Studies Multi-Years
Multi-Years
Multi-Years Weekly
Weekly
Yearly
Bidding Two Weeks Sub-Second
Data Span
Retrain every …
Score every …
Yearly
Day
Monthly
Monthly
Production = Dev
Online Learning
Not just a “model”
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
Data Collection
Let’s call this a Predictive Service Specification
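One way to picture such a specification is as an ordered chain of named stages, each feeding the next. The sketch below is only an illustration under stated assumptions: the class name `PredictiveServiceSpec`, the stage functions, the field names and the toy "model" are all hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Stage = Tuple[str, Callable[[Any], Any]]


@dataclass
class PredictiveServiceSpec:
    """Hypothetical container: an ordered list of named stages."""
    stages: List[Stage]

    def run(self, record: Any) -> Any:
        # Feed the output of each stage into the next one.
        for _name, step in self.stages:
            record = step(record)
        return record


# Toy stages for a churn-style example; all logic here is illustrative.
def data_prep(r: Record) -> Record:
    return {k: v for k, v in r.items() if v is not None}  # drop missing fields


def domain_features(r: Record) -> Record:
    return {**r, "days_inactive": r["today"] - r["last_seen"]}


def generic_features(r: Record) -> Record:
    return {**r, "inactive_flag": int(r["days_inactive"] > 30)}


def model(r: Record) -> Record:
    # Stand-in for a trained model (assumption, not a real scorer).
    return {**r, "proba": 0.9 if r["inactive_flag"] else 0.1}


def scoring_decision(r: Record) -> str:
    return "contact" if r["proba"] > 0.5 else "ignore"


spec = PredictiveServiceSpec([
    ("Data Prep", data_prep),
    ("Domain Specific Feature Eng.", domain_features),
    ("Feature Eng.", generic_features),
    ("Model(s)", model),
    ("Scoring / Decision", scoring_decision),
])
```

Calling `spec.run({"today": 100, "last_seen": 50})` walks one record through all five stages and returns the final decision.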
How much effort ?
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
20% 30% 25% 5% 5% 15%
Data Collection
Who Does What ?
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
Data Domain Engineers
Data Analysts
Data Scientists
Business Intelligence Engineers
Huge Variety of Tech
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
Data Collection
ETL ? Ad-Hoc?
ETL ? Ad-Hoc?
ETL ? SQL ? R ? Python ?
Matlab ?
R ? Python ? R ? Python ? SAS? Java / Python
Business Rules Management System
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
From Build to Run
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
Input Data
?
Decision
Build Time
Run Time
How Do People Do That Today ?
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
PMML
ETL
WebService
Script/SQL
Data Collection
A Predictive Service =
Up to 4 different “Applications” that can run out-of-sync
Some Integrated Per-Platform Approach
in Database
in SAS
in Hadoop/Spark
SQL Commercial Warehouse + Scoring UDF
End-to-end integration script
Ad-hoc development
Reason 1 : Prohibitive Costs kill projects
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
R
SQL
Python
R
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
SQL
ETL
WebService
SQL
PMML
300K$ 50K$ 200K$ 100K$
50K$
650K$
Reason 2: Distribution Drift
New behaviour
New product
New competitor
Model stops working as planned
You need to be able to do a same-week update
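Reacting within a week presupposes that drift gets noticed. The slides do not name a detection method; as one common option, the sketch below computes a Population Stability Index (PSI) between a feature's training-time values and its recent production values. The function name, the bin edges and the sample values are illustrative assumptions, and the 0.25 alert level is only a frequently quoted rule of thumb, not something stated in the talk.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two non-empty samples.

    `edges` are the interior bin boundaries; `eps` avoids log(0) for empty bins.
    """
    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of boundaries at or below the value.
            counts[sum(v >= e for e in edges)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Made-up numbers: scores seen at training time vs. scores seen this week.
train_scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
live_scores = [0.6, 0.7, 0.8, 0.9, 0.95, 0.99, 0.55, 0.1]

drift = psi(train_scores, live_scores, edges=[0.5])
needs_retrain = drift > 0.25  # rule-of-thumb threshold (assumption)
```

A scheduled job could compute this per feature and trigger retraining when the flag is set, which is one way to make a same-week update routine rather than an emergency.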
Reason 3: Mitigating Data Hazards
You need to be able to do a same-week update
Most interesting “Big Data” Sources are fragile
Reason 4: Deciding Goes Beyond Predicting
Most Interesting Problems Require Combining Models + Heuristics + Non-local Optimization
Reason 5: “Suits ready” for scalability
Data Prep
Domain Specific Feature Eng.
Feature Eng.
Model(s)
Scoring / Decision
Your CTO could certainly keep it up and running all by himself
Feature: Aggregating Data
Raw Events Stream: 1TB-100TB+
Aggregate State: 100MB-10GB
Consolidating History Must be part of Blue Box
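The consolidation step can be pictured as folding a large stream of raw events into a small per-entity state that is updated incrementally. This is a minimal in-memory sketch; the event layout `(entity_id, timestamp, amount)`, the `EntityState` fields and the function name are assumptions made for illustration, and a real system at the volumes quoted above would run this on a distributed engine.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, Iterable, Optional, Tuple

Event = Tuple[str, int, float]  # (entity_id, timestamp, amount): assumed layout


@dataclass
class EntityState:
    """Compact per-entity summary kept instead of the raw history."""
    count: int = 0
    total: float = 0.0
    last_ts: int = 0


def consolidate(
    events: Iterable[Event],
    state: Optional[DefaultDict[str, EntityState]] = None,
) -> DefaultDict[str, EntityState]:
    """Fold new events into the aggregate state, creating it if needed."""
    if state is None:
        state = defaultdict(EntityState)
    for entity_id, ts, amount in events:
        s = state[entity_id]
        s.count += 1
        s.total += amount
        s.last_ts = max(s.last_ts, ts)
    return state


# First batch builds the state, a later batch updates it in place.
state = consolidate([("u1", 1, 10.0), ("u2", 2, 5.0), ("u1", 3, 2.5)])
state = consolidate([("u2", 9, 1.0)], state)
```

Because the state is updated batch by batch, features such as "events so far" or "time since last event" stay available at scoring time without rescanning the raw stream.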
Feature : External Data Compliant
main data
enriched main data
additional data
e.g. Census, Map, etc.
Third-Party Data Must Be “In” the Blue Box
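Enrichment of this kind amounts to a left join of the main data against a keyed external table. The sketch below assumes a join on a postal code; the function name, the field names and the census figures are made up for illustration only.

```python
from typing import Any, Dict, Hashable, List

Row = Dict[str, Any]


def enrich(main_rows: List[Row], external_by_key: Dict[Hashable, Row], key: str) -> List[Row]:
    """Left join: every main row is kept, external columns are added when a match exists."""
    enriched = []
    for row in main_rows:
        extra = external_by_key.get(row.get(key), {})
        # Main data wins if both sides define the same column.
        enriched.append({**extra, **row})
    return enriched


# Hypothetical additional data keyed by postal code (made-up values).
census = {"75001": {"median_income": 30000, "population": 16000}}

main_data = [
    {"customer_id": 1, "zip": "75001"},
    {"customer_id": 2, "zip": "99999"},  # no match: row is kept, not enriched
]

enriched_main_data = enrich(main_data, census, key="zip")
```

Keeping the join inside the service, rather than in a separate ad-hoc script, means the same enrichment runs at build time and at run time.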
Feature : Update Data Service
Smart Lazy Human
A/B Test Support in Blue Box
Decision Ver. A
Decision Ver. B
P D F M S
New Model
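A/B support needs a stable way to send each entity to Decision Ver. A or Ver. B. One common approach, sketched here as an assumption rather than as the method from the talk, is to hash the entity id together with an experiment name so the assignment is deterministic and needs no stored state. The function and parameter names are hypothetical.

```python
import hashlib


def assign_variant(entity_id: str, experiment: str, share_b: float = 0.5) -> str:
    """Deterministically route an entity to variant "A" or "B".

    `share_b` is the fraction of traffic sent to the new version.
    """
    digest = hashlib.sha256(f"{experiment}:{entity_id}".encode("utf-8")).digest()
    # Map the first 4 bytes to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "B" if bucket < share_b else "A"


# The same entity always lands in the same variant for a given experiment.
variant = assign_variant("customer-42", "new-model-rollout", share_b=0.1)
```

Starting with a small `share_b` and raising it gradually gives a simple rollout path for a new model, and setting it back to 0 acts as a rollback.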
Feature : Programmatic Decision
Need for Business Compliant “Real-Time” Rules in Blue Box
model 1
model 2
model 3
if
combine with
if proba > 0.63 decision A else decision B
if proba > 0.79 decision A else decision B
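The rules on this slide can be read as a small piece of decision code sitting on top of several models. The sketch below reuses the two thresholds from the slide, but how the models are combined (a plain average) and which business condition selects which threshold (a `strict` flag) are assumptions made for illustration.

```python
from statistics import mean
from typing import Sequence


def decide(probas: Sequence[float], strict: bool) -> str:
    """Combine model outputs, then apply a business-rule threshold.

    `probas` holds one probability per model (model 1, model 2, model 3).
    `strict` stands for any business condition that calls for the higher bar.
    """
    proba = mean(probas)                  # combination rule: assumption
    threshold = 0.79 if strict else 0.63  # thresholds taken from the slide
    return "A" if proba > threshold else "B"


# Same model outputs, different business context, different decision.
default_decision = decide([0.7, 0.7, 0.7], strict=False)
strict_decision = decide([0.7, 0.7, 0.7], strict=True)
```

Keeping the thresholds as data rather than burying them in model code is what lets business users adjust them without retraining anything.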
Feature : Audit and Logs
Smart Lazy Human
?
Blue Box needs to keep track of its decisions and Why
Decision Cause Log
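A decision cause log can be as simple as one structured record per decision, stating what was decided, by which model version, from which inputs, and why. The record layout, field names and function name below are hypothetical; the slides only state that decisions and their causes must be tracked.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def log_decision(log: List[str], entity_id: str, model_version: str,
                 features: Dict[str, Any], proba: float, threshold: float) -> str:
    """Take a threshold decision and append a JSON audit record explaining it."""
    passed = proba > threshold
    decision = "A" if passed else "B"
    entry = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "entity_id": entity_id,
        "model_version": model_version,
        "features": features,
        "proba": proba,
        "decision": decision,
        # Human-readable cause, so the "why" survives alongside the "what".
        "cause": f"proba {proba:.2f} {'>' if passed else '<='} threshold {threshold:.2f}",
    }
    log.append(json.dumps(entry, sort_keys=True))
    return decision


audit_log: List[str] = []
decision = log_decision(audit_log, "customer-42", "v3", {"days_inactive": 50}, 0.71, 0.63)
```

In practice the list would be replaced by an append-only store; recording the model version in each entry is also what makes rollback and A/B comparisons auditable.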
External Data
Advanced Join / Matching
Ad-Hoc Transformation
Python / R / Spark DataFrame transformations
SQL Like Transformations
Scoring Causes / Audit
A/B Test Support
Model Rollback / Versioning
Prediction Log. Stats / Audit
Ad-hoc scoring/decision code/scoring
Open Source
What does Blue Box look like?
?
Interesting / Potential Open Source Project
Real-Time Entity Update, Management, Scoring
Open Source PMML Scoring in Java
Oryx: Lambda Architecture built on Spark and Kafka, with specialisation on real-time machine learning
How will we create the “blue box” ?
?
Specification ? PMML Extension ?
Open Source Framework ?
Hadoop / Spark Specific ?
Thank you !
Convince decision makers to make data their competitive advantage
[email protected]@dataiku.com
Wanna work on this topic ?
Wanna share your dream features?