Production-Ready BIG ML Workflows - from zero to hero
By Daniel Marcous (Google, Waze, Data Wizard) dmarcous@gmail/google.com
Big Data Analytics: Production-Ready Flows & Waze Use Cases
Rules
1. Interactive is interesting.
2. If you've got something to say, say it!
3. Be open-minded. I'm sure I have something to learn from you; I hope you have something to learn from me.
What’s a Data Wizard you ask?
Gain Actionable Insights!
What’s here?
● Methodology: deploying big models to production, step by step
● Pitfalls: what to look out for, in both methodology and code
● Use Cases: showing off what we actually do in Waze Analytics
Based on tough lessons learned, and on recommendations and input from Google experts.
Why Big Data?
Google in just 1 minute:
● 1,000 new devices
● 3M searches
● 100 hours of content
● 1B activated devices
● 100M GB search index
10+ Years of Tackling Big Data Problems
● 8 Google papers (2002-2015): GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, Millwheel
● Open source (from 2005 onwards): Apache Beam, TensorFlow
● Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable
“Google is living a few years in the future and sending the rest of us messages”
- Doug Cutting, Hadoop co-creator
Why Big ML?
Bigger is better
● More processing power
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Keep training until you hit 0
○ Some models cannot overfit when optimising until training error is 0.
■ RF: more trees
■ ANN: more iterations
● Handle BIG data
○ Tons of training data (if you have it): no need for sampling on the wrong populations!
○ Millions of features? Easy… (text processing with TF-IDF)
○ Some models (ANN) can't do well without training on a lot of data.
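The TF-IDF trick mentioned above fits in a few lines. A minimal pure-Python sketch (the function name and toy documents are mine, not from the deck; real pipelines would do this with Spark's TF-IDF transformers):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights per document (illustrative helper, whitespace tokenizer)."""
    n = len(docs)
    # Document frequency: how many docs contain each term
    df = Counter(term for doc in docs for term in set(doc.split()))
    weights = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        # Classic tf * log(N / df); a term present in every doc gets weight 0
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

w = tf_idf(["waze traffic jam", "waze report", "waze traffic jam jam"])
```

Here "waze" appears in every document, so its weight is 0 everywhere; rarer terms score higher.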
Challenges
Bigger is harder
● Skill gap - Big data engineer (Scala/Java) VS Researcher/PHD (Python/R)● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow○ Harder and more important to tell what’s gold and what’s noise○ Unbalance data goes a long way with more records
● Big model != Small model○ Different parameter settings○ Different metric readings
■ Different implementations (distributed VS central memory)■ Different programming language (heuristics)
○ Different populations trained on (sampling)
Solution = Workflow
Measure first, optimize second.
Before you start
● Create example input
○ Raw input
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
■ Amount of subjects affected
■ Sometimes measured as average precision per K random subjects
Remember: desired short-term behaviour does not imply long-term behaviour.
Measure
Preprocess (parse, clean, join, etc.)
● Create example output
○ Featured input
○ Prediction rows
Preprocess
● Naive feature matrix
○ Parse (text -> RDD[Object] -> DataFrame)
○ Clean (remove outliers / bad records)
○ Join
○ Remove non-features
● Get real data
● Create a baseline dataset for training
○ Add some basic features
■ Day of week / hour / etc.
○ Write a READABLE CSV that you can start working with.
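A baseline dataset with basic time features, sketched in plain Python (the deck's real pipeline is Spark/Scala; the column names here are hypothetical):

```python
import csv
import io
from datetime import datetime

def add_basic_features(rows):
    """Add day-of-week / hour features to raw (timestamp, value) rows."""
    out = []
    for ts, value in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        out.append({"ts": ts, "value": value,
                    "day_of_week": t.strftime("%A"), "hour": t.hour})
    return out

def write_readable_csv(rows, fh):
    """The READABLE CSV you can start working with."""
    writer = csv.DictWriter(fh, fieldnames=["ts", "value", "day_of_week", "hour"])
    writer.writeheader()
    writer.writerows(rows)

rows = add_basic_features([("2016-03-14 08:30:00", "42")])
buf = io.StringIO()
write_readable_csv(rows, buf)
```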
Preprocess
Case Class RDD to DataFrame
RDD[String] to Case Class RDD
String row to object
Preprocess
Parse string to Object with java.sql types
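The parsing step these captions describe (raw string row -> typed object -> DataFrame-ready records) is Scala in the deck; a rough pure-Python analogue, where the `Ride` record and its fields are made up for illustration:

```python
from collections import namedtuple
from datetime import datetime

# Stand-in for the Scala case class the raw rows are parsed into
Ride = namedtuple("Ride", ["user_id", "started_at", "meters"])

def parse_row(line, sep="\t"):
    """Parse one raw text row into a typed record; bad rows become None."""
    parts = line.split(sep)
    if len(parts) != 3:
        return None  # dropped in the cleaning step
    try:
        return Ride(int(parts[0]),
                    datetime.strptime(parts[1], "%Y-%m-%d %H:%M:%S"),
                    float(parts[2]))
    except ValueError:
        return None

rows = ["7\t2016-01-02 10:00:00\t1234.5", "garbage line"]
parsed = [r for r in map(parse_row, rows) if r is not None]
```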
Metric Generation
Craft useful metrics.
Per-class metrics
Confusion matrix by hand
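A confusion matrix "by hand", with per-class precision / recall derived from it. The deck does this in Scala/Spark; this is a plain-Python sketch with helper names of my own:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def per_class_metrics(actual, predicted):
    """Per-class precision / recall computed from the confusion matrix."""
    cm = confusion_matrix(actual, predicted)
    metrics = {}
    for c in set(actual) | set(predicted):
        tp = cm[(c, c)]
        fp = sum(v for (a, p), v in cm.items() if p == c and a != c)
        fn = sum(v for (a, p), v in cm.items() if a == c and p != c)
        metrics[c] = {"precision": tp / (tp + fp) if tp + fp else 0.0,
                      "recall": tp / (tp + fn) if tp + fn else 0.0}
    return metrics
```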
Monitor.
Visualise - the easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before / after joining
○ Amounts of output data
○ Metrics (see "Measure first, optimize second")
● Different model comparison: what's best, when, and where
● Timeseries analysis
○ Anomaly detection: does a metric suddenly, drastically change?
○ Impact analysis: did deploying a model have a significant effect on a metric?
● Web application framework for R
○ Introduces user interaction to analysis
○ Combines ad-hoc testing with R's statistical / modeling power
● Turns R function wrappers into interactive dashboard elements
○ Generates the HTML, CSS, and JS behind the scenes, so you only write R
● Get started ● Get inspired ● Shiny @Waze
Shiny
Dashboard monitoring
Dashboard should support: picking different models, comparing metrics.
Pick models to compare
Statistical tests on distributions: t.test / AUC
Dashboard monitoring
Dashboard should support: timeseries anomaly detection, and impact analysis (deploying a new model).
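A minimal sketch of the anomaly-detection idea, assuming a simple trailing-window z-score rule (the deck does not specify which detection method the dashboard uses):

```python
import statistics

def anomalies(series, window=7, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard
    deviations away from the trailing-window mean."""
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.mean(past)
        sd = statistics.pstdev(past)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            flagged.append(i)
    return flagged
```

A flat metric never alarms; a sudden jump after a model deploy gets flagged for impact analysis.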
Start small and grow.
Reduce the problem
● Tradeoff: time to market VS loss of accuracy
● Sample data
○ Is random actually what you want?
■ Keep label distributions
■ Keep important features' distributions
● Test everything you believe worthy
○ Choose model
○ Choose features (important when you go big)
■ Leave the "border" significant ones in
○ Test different parameter configurations (you'll need to validate your choice later)
Remember: this isn't your production model. You're only getting a sense of the data for now.
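Sampling while keeping the label distribution is stratified sampling. A plain-Python sketch (helper names are mine; in practice this would be Spark's `sampleBy` or similar):

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=42):
    """Sample `fraction` of the rows per label, so the label
    distribution of the sample matches the full dataset."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```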
Getting a feel
Exploring a dataset with R.
Dividing data into training and testing sets.
Random partitioning
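Random partitioning into training and testing sets, as a minimal Python sketch (the deck does this in R; the 70/30 split is an assumption):

```python
import random

def random_partition(rows, train_fraction=0.7, seed=1):
    """Randomly split rows into (train, test) without losing any row."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```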
Getting a feel
Logistic regression and basic variable selection with R.
Logistic regression
Variable significance test
Getting a feel
Advanced variable selection with regularisation techniques in R.
Intercepts - by significance
No intercept = not entered into the model
Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted trees model
Getting a feel
Modeling bigger data with R, using parallelism.
Fit and combine 6 random forest models (10k trees each) in parallel
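The "fit several forests in parallel, then combine" trick can be illustrated with a toy stand-in model. This sketch uses threads and a trivial bootstrap-mean "model" purely to show the fit-in-parallel / combine shape; the deck's real code fits 10k-tree R random forests:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fit_one(data, seed):
    """Toy 'model': the mean of a bootstrap sample (stand-in for one forest)."""
    rng = random.Random(seed)
    boot = [rng.choice(data) for _ in data]
    return sum(boot) / len(boot)

def fit_ensemble(data, n_models=6):
    """Fit n_models in parallel, then combine them by averaging."""
    with ThreadPoolExecutor(max_workers=n_models) as pool:
        models = list(pool.map(lambda s: fit_one(data, s), range(n_models)))
    return sum(models) / len(models)
```

For real forests the combine step would merge the trees rather than average predictions, but the parallel shape is the same.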
Start with a flow.
Basic moving parts
Data sources 1..N -> Preprocess -> Feature matrix
Feature matrix -> Training -> Models 1..N
Feature matrix + Models 1..N -> Scoring -> Predictions 1..N
Predictions 1..N -> Serving DB -> Feedback loop
Alongside: Dashboard, Conf., User/Model assignments
Flow motives
● Only 1 job for preprocessing○ Used in both training and serving - reduces risk of training on wrong population○ Should also be used before sampling when experimenting on a smaller scale.
○ When data sources are different for training and serving (RT VS Batch for example) use interfaces!
● Saving training & scoring feature matrices aside○ Try new algorithms / parameters on the same data○ Measure changes on same data as used in production.
Reusable flow code
Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes.
SparkSQL UDFs implement feature generation; this decouples training and serving
Data cleaning work
Reusable flow code
Generate feature matrix (a black box from the app's view)
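The decoupling idea: one feature-generation function shared by training and scoring, so the two paths cannot drift apart. A plain-Python sketch (the deck does this with SparkSQL UDFs; the feature names and record fields here are hypothetical):

```python
def generate_features(record):
    """Single shared feature-generation step (illustrative features)."""
    return {"speed_kmh": record["meters"] / 1000 / (record["minutes"] / 60),
            "rush_hour": 1 if 7 <= record["hour"] <= 9 else 0}

def build_training_matrix(records, label_of):
    """Training path: same features, plus the label."""
    return [dict(generate_features(r), label=label_of(r)) for r in records]

def build_scoring_matrix(records):
    """Serving path: identical features, no label."""
    return [generate_features(r) for r in records]
```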
Good ML code trumps performance.
Why so many parts, you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing / training doesn't affect the serving model
○ Rerun only the failed parts
● Different logical parts, different processes (see "Clean Code" by Uncle Bob)
○ Easier to read
○ Easier to change code: targeted changes only affect their specific process
○ One input, one output (almost…)
● Easier to tweak and deploy changes
Test your infrastructure.
@Test
● Supposed to happen throughout development; if not, now is the time to make sure you have it!
○ Data read correctly
■ Null rates?
○ Features calculated correctly
■ Does my complex join / function / logic return what it should?
○ Access
■ Can I access all the data sources from my "production" account?
○ Formats
■ Adapt for variance in non-structured formats such as JSON
○ Required latency
Set up a baseline. Start with a neutral launch.
● Take a snapshot of your metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building the feature matrix on the last day's data takes X minutes
○ Training takes X hours
○ Serving predictions on Y records takes X seconds
Remember: you are running with a naive model. Anything better than the old model / random is OK.
Go to work. Coffee recommended at this point.
Optimize: what? How?
● Grid search over parameters
● Evaluate metrics
○ Using a predefined Spark Evaluator
○ Using user-defined metrics
● Cross-validate everything
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / normalise
○ Feature selectors
○ In Apache Spark 1.6
● Tweak training
○ Different models
○ Different model parameters
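Grid search plus k-fold cross-validation, as a minimal framework-free sketch (the deck uses spark.ml's CrossValidator; `train_eval` here is a hypothetical callback that trains on one split, evaluates on the other, and returns a metric to maximise):

```python
import statistics

def cross_validate(train_eval, data, params, k=3):
    """k-fold CV: average the metric returned by train_eval over the folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_eval(train, test, params))
    return statistics.mean(scores)

def grid_search(train_eval, data, grid, k=3):
    """Try every parameter combination; return the best by CV score."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        score = cross_validate(train_eval, data, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```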
Spark ML
Building an easy to use wrapper around training and serving.
Build model pipeline, train, evaluate
Not necessarily a random split
Spark ML
Building a training pipeline with spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
Spark ML
Cross-validate, grid search params and evaluate metrics.
Grid search with reference to the ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend and add your own metrics.
Spark ML
Score a feature matrix and parse output.
Get probability for predicted class(default is a probability vector for all classes)
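Keeping only the predicted class and its probability, given the full probability vector Spark returns, can be sketched as a plain-Python analogue:

```python
def predicted_class_probability(prob_vector, labels):
    """From a probability vector over all classes, return the predicted
    label and its probability only (argmax of the vector)."""
    best = max(range(len(prob_vector)), key=prob_vector.__getitem__)
    return labels[best], prob_vector[best]
```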
A/B test your changes
● Same data, different results
○ Use the preprocessed feature matrix (the same one used for the current model)
● Best testing: a production A/B test
○ Use the current production model and the new model in parallel
● Metrics improvements (remember your dashboard?)
○ Time series analysis of metrics
○ Compare metrics over different code versions (improved preprocessing / modeling)
● Deploy / revert = update user assignments
○ Based on new metrics / feedback loop if possible
Compare to baseline
A/B Infrastructures
Setting up a very basic A/B testing infrastructure built upon our earlier presented modeling wrapper.
Conf holds a mapping of: model -> user_id/subject list
Score in parallel (inside a map). Distributed = awesome.
Fancy Scala union of all score files
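The conf-driven A/B routing can be sketched like this; `CONF` and the model names are made-up examples, and the deck's real scoring runs inside a distributed Spark map rather than a dict comprehension:

```python
# Conf: model name -> set of assigned user ids (hypothetical example)
CONF = {"model_current": {1, 2, 3}, "model_new": {4, 5}}

def model_for_user(user_id, conf, default="model_current"):
    """Route a user to its assigned model, falling back to the default."""
    for model, users in conf.items():
        if user_id in users:
            return model
    return default

def score_all(users, conf, models):
    """Score every user with its assigned model's scoring function."""
    return {u: models[model_for_user(u, conf)](u) for u in users}
```

Deploying or reverting is then just an update to the conf mapping, as the slide says.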
Watch. Iterate.
● Respond to anomalies (alerts) on metric reads
● Try out new stuff
○ Tech versions (e.g. a new Spark version)
○ New data sources
○ New features
● When you find something interesting: "Go to work."
Constant improvement
Remember: trends and industries change; re-training on new data is not a bad thing.
Ad-Hoc statistics
● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to compile anything!
Enter Apache Zeppelin
Playing with it
Setting up Zeppelin to use our jars.
Playing with it
Read a parquet file, show statistics, register it as a table, and run SparkSQL on it.
Parquet - already has a schema inside
For usage in SparkSQL
Playing with it
Using spark-csv by Databricks.
CSV to DataFrame by Databricks
Using user compiled code inside a notebook.
Playing with it
Bring your own code
Technological pitfalls
Keep in mind
● Code produced with Apache Spark 1.6 / Scala 2.11.4
● RDD VS DataFrame
○ Enter the "Dataset API" (v2.0+)
● mllib VS spark.ml
○ Always use spark.ml if the functionality exists
○ Algorithmic richness
● Using Parquet
○ Intermediate outputs
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Output size
○ Coalesce to the desired size
● DataFrame Windows: buggy
○ Write your own over RDD
● Parameter tuning
○ spark.sql.partitions
○ Executors
○ Driver VS executor memory
Putting it all together
Work Process
Step by step: deploying your big ML workflows to production, ready for operations and optimisations.
1. Measure first, optimize second.
   a. Define metrics.
   b. Preprocess data (using examples).
   c. Monitor (dashboard setup).
2. Start small and grow.
3. Start with a flow.
   a. Good ML code trumps performance.
   b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
   a. Optimize.
   b. A/B.
      i. Test the new flow in parallel to the existing flow.
      ii. Update user assignments.
6. Watch. Iterate. (see 5)
Code: https://github.com/dmarcous/BigMLFlow/
Slides: http://www.slideshare.net/DanielMarcous/productionready-big-ml-workflows-from-zero-to-hero
Use Cases: what does Waze do with all its data?
Trending Locations / Day of Week Breakdown
Opening Hours Inference
Optimising - Ad clicks / Time from drive start
Time to Content (US) - Day of week / Category
Irregular Events / Anomaly Detection
Major events causing out-of-the-ordinary traffic, road blocks, etc., affecting large numbers of users.
Dangerous Places - Clustering
Find most dangerous areas / streets, using custom developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)
Parking Places Detection
Parking entrance
Parking lot
Street parking
Server Distribution Optimisation
Calculate the optimal routing-server distribution according to geographical load.
● Better experience: faster response time
● Saves money: no need for redundant elastic scaling of servers
Text Mining - Topic Analysis
Topic 1 - ETA: wazers, eta, con, zona, usando, real, tiempo, carretera
Topic 2 - Unusual: usual, traffic, stay, today, times, clear, slower, accident
Topic 3 - Share info: road, driving, info, using, area, realtime, sharing, soci
Topic 4 - Reports: social, drivers, reporting, helped, nearby, traffic, jam, drive
Topic 5 - Jams: still, will, update, drive, delay, add, jammed, near
Topic 6 - Voice: morgan, ang, freeman, kanan, voice, meter, kan, masuk
Text Mining - New Version Impressions
● Text analysis: stemming / stopword detection, etc.
● Topic modeling
● Sentiment analysis
Waze V4 update:
● Good: "redesign", "smarter", "cleaner", "improved"
● Bad: "stuck"
Overall very positive score!
Text Mining - Store Sentiments
Text Mining - Sentiment by Time & Place
Daniel [email protected]@gmail.com