Production-Ready BIG ML Workflows - from zero to hero

By Daniel Marcous - Google, Waze, Data Wizard - dmarcous@gmail/google.com
Big Data Analytics: Production-Ready Flows & Waze Use Cases

Transcript of Production-Ready BIG ML Workflows - from zero to hero

Page 1: Production-Ready BIG ML Workflows - from zero to hero

By Daniel Marcous - Google, Waze, Data Wizard - dmarcous@gmail/google.com

Big Data Analytics: Production-Ready Flows & Waze Use Cases

Page 2

Rules
1. Interactive is interesting.
2. If you've got something to say, say it!
3. Be open minded - I'm sure I've got something to learn from you, and I hope you've got something to learn from me.

Page 3

What’s a Data Wizard you ask?

Gain Actionable Insights!

Page 4

What’s here?

Page 5

What's here?

Methodology - deploying big models to production, step by step.

Pitfalls - what to look out for, in both methodology and code.

Use Cases - showing off what we actually do in Waze Analytics.

Based on tough lessons learned & recommendations and input from Google experts.

Page 6

Why Big Data?

Page 7

Google in just 1 minute:

● 1,000 new devices
● 3M searches
● 100 hours of content
● 1B activated devices
● 100M GB search

Page 8

10+ Years of Tackling Big Data Problems

[Timeline, 2002-2015]
Google papers: GFS, MapReduce, BigTable, Dremel, FlumeJava, Millwheel, PubSub
Open source: Apache Beam, TensorFlow
Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable

Page 9

"Google is living a few years in the future and sending the rest of us messages."
- Doug Cutting, Hadoop Co-Creator

Page 10

Why Big ML?

Page 11

Bigger is better

● More processing power
  ○ Grid search all the parameters you ever wanted.
  ○ Cross-validate in parallel with no extra effort.
● Keep training until you hit 0
  ○ Some models cannot overfit when optimising until the training error is 0.
    ■ RF - more trees
    ■ ANN - more iterations
● Handle BIG data
  ○ Tons of training data (if you have it) - no need for sampling on the wrong populations!
  ○ Millions of features? Easy… (text processing with TF-IDF)
  ○ Some models (ANN) can't do well without training on a lot of data.

Page 12

Challenges

Page 13

Bigger is harder

● Skill gap - big data engineer (Scala/Java) VS researcher/PhD (Python/R)
● Curse of dimensionality
  ○ Some algorithms require exponential time/memory as dimensions grow
  ○ Harder and more important to tell what's gold and what's noise
  ○ Unbalanced data goes a long way with more records
● Big model != small model
  ○ Different parameter settings
  ○ Different metric readings
    ■ Different implementations (distributed VS central memory)
    ■ Different programming languages (heuristics)
  ○ Different populations trained on (sampling)

Page 14

Solution = Workflow

Page 15

Measure first, optimize second.

Page 16

Before you start

● Create example input
  ○ Raw input
● Create example output
  ○ Featured input
  ○ Prediction rows
● Set up your metrics
  ○ Derived from business needs
  ○ Confusion matrix
  ○ Precision / recall
    ■ Per-class metrics
  ○ AUC
  ○ Coverage
    ■ Amount of subjects affected
    ■ Sometimes measured as average precision per K random subjects.

Remember: desired short-term behaviour does not imply long-term behaviour.

[Diagram: Measure → Preprocess (parse, clean, join, etc.) → naive matrix]

Page 17

Preprocess

● Naive feature matrix
  ○ Parse (Text -> RDD[Object] -> DataFrame)
  ○ Clean (remove outliers / bad records)
  ○ Join
  ○ Remove non-features
● Get real data
● Create a baseline dataset for training
  ○ Add some basic features
    ■ Day of week / hour / etc.
  ○ Write a READABLE CSV that you can start to work with.

Page 18

Preprocess

[Code: building the feature DataFrame step by step]
● String row to object
● RDD[String] to case class RDD
● Case class RDD to DataFrame
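The three steps in these captions (string row → object → DataFrame) can be sketched outside Spark. A minimal, illustrative Python version - the deck's actual code is Scala with a case class, and `DriveRecord` plus the comma-separated layout here are made-up stand-ins:

```python
from dataclasses import dataclass, asdict

# Hypothetical record layout - the real schema isn't shown in the transcript.
@dataclass
class DriveRecord:
    user_id: int
    city: str
    speed: float

def parse_row(line: str) -> DriveRecord:
    """String row -> object (the Scala version parses into a case class)."""
    user_id, city, speed = line.split(",")
    return DriveRecord(int(user_id), city, float(speed))

raw = ["1,TLV,42.5", "2,NYC,17.0"]     # ~ RDD[String]
records = [parse_row(r) for r in raw]  # ~ case class RDD
table = [asdict(r) for r in records]   # ~ DataFrame rows
```

The point of the intermediate typed object is that malformed rows fail loudly at parse time, not deep inside a join.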

Page 19

Preprocess

[Code: parse string to object with java.sql types]

Page 20

Metric Generation

Craft useful metrics.

[Code: per-class metrics; confusion matrix by hand]
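The "confusion matrix by hand" idea is simple enough to sketch directly. An illustrative Python version of the technique (the slide's own Scala/Spark code is not reproduced in the transcript):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred):
    """Count (actual, predicted) pairs - the confusion matrix as a dict."""
    return Counter(zip(y_true, y_pred))

def per_class_metrics(y_true, y_pred):
    """Precision / recall per class, straight from the confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    metrics = {}
    for c in set(y_true) | set(y_pred):
        tp = cm[(c, c)]
        fp = sum(n for (t, p), n in cm.items() if p == c and t != c)
        fn = sum(n for (t, p), n in cm.items() if t == c and p != c)
        metrics[c] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return metrics
```

Building these by hand, once, also makes it obvious what a library's averaged numbers are hiding per class.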

Page 21

Monitor.

Page 22

Visualise - the easiest way to measure quickly

● Set up your dashboard
  ○ Amounts of input data
    ■ Before / after joining
  ○ Amounts of output data
  ○ Metrics (see "Measure first, optimize second")
● Different model comparison - what's best, when and where
● Timeseries analysis
  ○ Anomaly detection - does a metric suddenly, drastically change?
  ○ Impact analysis - did deploying a model have a significant effect on a metric?
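The anomaly-detection bullet can be made concrete with the simplest possible rule - flag a metric read that sits far outside its recent history. A hedged Python sketch (a plain z-score test; a real dashboard would use something more robust to trends and seasonality):

```python
import statistics

def is_anomalous(history, latest, z_threshold=3.0):
    """Flag a metric read that deviates more than z_threshold standard
    deviations from its historical mean."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold
```

Even this crude check catches the common failure of a broken upstream join silently halving the input row count.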

Page 23

Shiny

● A web application framework for R.
  ○ Introduces user interaction to analysis
  ○ Combines ad-hoc testing with R's statistical / modeling power
● Turns R function wrappers into interactive dashboard elements.
  ○ Generates the HTML, CSS and JS behind the scenes, so you only write R.
● Get started ● Get inspired ● Shiny @Waze

Page 24

Dashboard monitoring

The dashboard should support picking different models and comparing metrics.

[Screenshot: pick models to compare; statistical tests on distributions - t.test / AUC]

Page 25

Dashboard monitoring

The dashboard should support timeseries anomaly detection and impact analysis (when deploying a new model).

Page 26

Start small and grow.

Page 27

Reduce the problem

● Tradeoff: time to market VS loss of accuracy
● Sample data
  ○ Is random actually what you want?
    ■ Keep label distributions
    ■ Keep important feature distributions
● Test everything you believe worthy
  ○ Choose a model
  ○ Choose features (important when you go big)
    ■ Leave the "borderline" significant ones in
  ○ Test different parameter configurations (you'll need to validate your choice later)

Remember: this isn't your production model. You're only getting a sense of the data for now.
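"Is random actually what you want?" - keeping the label distribution usually means stratified sampling. An illustrative Python sketch (the `label_of` accessor and the fixed seed are assumptions for the example):

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=42):
    """Sample `fraction` of rows per label, preserving the label distribution
    (a plain random sample would not guarantee this for rare labels)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per label
        sample.extend(rng.sample(group, k))
    return sample
```

The same idea extends to stratifying on an important feature instead of (or in addition to) the label.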

Page 28

Getting a feel

Exploring a dataset with R.

Dividing data into training and testing.

[Code: random partitioning]
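Random partitioning as in this slide, sketched in Python for illustration (the deck does this in R; the 70/30 split and the seed are assumed defaults):

```python
import random

def random_partition(rows, train_fraction=0.7, seed=1):
    """Shuffle once, then cut into train / test partitions."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```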

Page 29

Getting a feel

Logistic regression and basic variable selection with R.

[Code: logistic regression; variable significance test]

Page 30

Getting a feel

Advanced variable selection with regularisation techniques in R.

[Code: intercepts by significance; no intercept = not entered into the model]

Page 31

Getting a feel

Trying modeling techniques in R.

[Code: fit a gradient boosted trees model; root mean square error - lower = better (~ kinda)]
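The RMSE the slide reads off the R output is one line. For reference, an illustrative Python version of the metric:

```python
import math

def rmse(actual, predicted):
    """Root mean square error - lower is better (with the usual caveats)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))
```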

Page 32

Getting a feel

Modeling bigger data with R, using parallelism.

[Code: fit and combine 6 random forest models (10k trees each) in parallel]

Page 33

Start with a flow.

Page 34

Basic moving parts

[Flow diagram]
Data sources 1..N → Preprocess → Feature matrix → Training → Models 1..N → Scoring → Predictions 1..N → Serving DB
Also in the flow: Dashboard, feedback loop, conf., user/model assignments

Page 35

Flow motives

● Only one job for preprocessing
  ○ Used in both training and serving - reduces the risk of training on the wrong population
  ○ Should also be used before sampling when experimenting on a smaller scale.
  ○ When the data sources differ between training and serving (RT VS batch, for example), use interfaces!
● Save the training & scoring feature matrices aside
  ○ Try new algorithms / parameters on the same data
  ○ Measure changes on the same data as used in production.

Page 36

Reusable flow code

Create a feature generation interface and some UDFs with Spark. Use them later for both training and scoring purposes, with minor changes.

[Code: data cleaning work; SparkSQL UDFs implement feature generation - decouples training and serving]

Page 37

Reusable flow code

Create a feature generation interface and some UDFs with Spark. Use them later for both training and scoring purposes, with minor changes.

[Code: generate the feature matrix - a black box from the app's view]
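The key idea - one feature-generation entry point shared by training and scoring - can be sketched without Spark. An illustrative Python version (the feature functions here are invented examples standing in for the deck's SparkSQL UDFs):

```python
from datetime import datetime

# Invented example features, standing in for real SparkSQL UDFs.
def day_of_week(ts: datetime) -> int:
    return ts.weekday()  # Monday == 0

def hour_of_day(ts: datetime) -> int:
    return ts.hour

FEATURES = {"day_of_week": day_of_week, "hour_of_day": hour_of_day}

def generate_features(rows, ts_of):
    """Single feature-generation entry point, called by BOTH the training
    job and the scoring job, so the two populations can never drift apart."""
    return [{name: fn(ts_of(row)) for name, fn in FEATURES.items()}
            for row in rows]
```

The `ts_of` accessor is the "interface" part: training and serving pass different row shapes, but the feature logic stays in one place.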

Page 38

Good ML code trumps performance.

Page 39

Why so many parts, you ask?

● Scaling
● Fault tolerance
  ○ Failed preprocessing / training doesn't affect the serving model
  ○ Rerun only the failed parts
● Different logical parts - different processes (see "Clean Code" by Uncle Bob)
  ○ Easier to read
  ○ Easier to change code - targeted changes only affect their specific process
  ○ One input, one output (almost…)
● Easier to tweak and deploy changes

Page 40

Test your infrastructure.

Page 41

@Test

● Supposed to happen throughout development; if not - now is the time to make sure you have it!
  ○ Data read correctly
    ■ Null rates?
  ○ Features calculated correctly
    ■ Does my complex join / function / logic return what it should?
  ○ Access
    ■ Can I access all the data sources from my "production" account?
  ○ Formats
    ■ Adapt for variance in non-structured formats such as JSON
  ○ Required latency
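These checks can live in ordinary unit tests. An illustrative Python sketch of the first two bullets (null rates, join logic); the data and the 0.5 threshold are made up for the example:

```python
import unittest

def null_rate(rows, field):
    """Share of rows where `field` is missing."""
    return sum(1 for r in rows if r.get(field) is None) / len(rows)

class PreprocessTests(unittest.TestCase):
    def test_null_rate_is_sane(self):
        rows = [{"speed": 10.0}, {"speed": None}, {"speed": 30.0}, {"speed": 12.0}]
        self.assertLess(null_rate(rows, "speed"), 0.5)

    def test_join_returns_expected_rows(self):
        users = {1: "TLV", 2: "NYC"}
        drives = [{"user_id": 1}, {"user_id": 3}]  # user 3 has no match
        joined = [dict(d, city=users[d["user_id"]])
                  for d in drives if d["user_id"] in users]
        self.assertEqual(len(joined), 1)
```

The same pattern applies to the access and format checks: a test that reads one real record from each source from the production account catches most deployment surprises early.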

Page 42

Set up a baseline. Start with a neutral launch.

Page 43

● Take a snapshot of your metric reads:
  ○ The ones you chose earlier in the process as important to you
    ■ Confusion matrix
    ■ Weighted average % classified correctly
    ■ % subject coverage
● Latency
  ○ Building the feature matrix on the last day's data takes X minutes
  ○ Training takes X hours
  ○ Serving predictions on Y records takes X seconds

You are here:

Remember: you are running with a naive model. Anything better than the old model / random is OK.

Page 44

Go to work. Coffee recommended at this point.

Page 45

Optimize - What? How?

● Tweak preprocessing (mainly features)
  ○ Feature engineering
  ○ Feature transformers
    ■ Discretize / normalise
  ○ Feature selectors
  ○ In Apache Spark 1.6
● Tweak training
  ○ Different models
  ○ Different model parameters
● Grid search over parameters
● Cross-validate everything
● Evaluate metrics
  ○ Using a Spark predefined Evaluator
  ○ Using user-defined metrics
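Grid search plus cross-validation is mechanical enough to sketch end to end. An illustrative, framework-free Python version (spark.ml's ParamGridBuilder / CrossValidator do this for you; the toy threshold "model" in the test below is invented):

```python
from itertools import product
from statistics import fmean

def k_fold(rows, k):
    """Split rows into k folds; yield (train, test) pairs."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        yield train, test

def grid_search(rows, param_grid, train_fn, score_fn, k=3):
    """Try every parameter combination; score each by k-fold cross-validation
    and keep the best (higher score = better)."""
    best = None
    for combo in product(*param_grid.values()):
        params = dict(zip(param_grid, combo))
        scores = [score_fn(train_fn(train, **params), test)
                  for train, test in k_fold(rows, k)]
        if best is None or fmean(scores) > best[0]:
            best = (fmean(scores), params)
    return best
```

Note the cost: the combinations multiply, which is exactly why the earlier "more processing power" point matters - each combination's folds are independent and trivially parallel.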

Page 46

Spark ML

Building an easy-to-use wrapper around training and serving.

[Code: build the model pipeline, train, evaluate; not necessarily a random split]

Page 47

Spark ML

Building a training pipeline with spark.ml.

[Code captions:]
● Create dummy variables
● Required response label format
● The ML model itself
● Labels back to readable format
● Assembled training pipeline
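The pipeline shape - stages with fit/transform, plus mapping labels back to a readable format - can be sketched in a few lines. An illustrative Python analogue of the spark.ml stages named in the captions (not the deck's actual code):

```python
class LabelIndexer:
    """String labels <-> numeric indices, loosely mimicking spark.ml's
    StringIndexer / IndexToString pair ("labels back to readable format")."""
    def fit(self, labels):
        self.index = {l: i for i, l in enumerate(sorted(set(labels)))}
        self.inverse = {i: l for l, i in self.index.items()}
        return self

    def transform(self, labels):
        return [self.index[l] for l in labels]

    def inverse_transform(self, indices):
        return [self.inverse[i] for i in indices]

class Pipeline:
    """Stages chained in order, each exposing fit/transform - the same shape
    as an assembled spark.ml training pipeline."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, data):
        for stage in self.stages:
            data = stage.fit(data).transform(data)
        return self

    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data
```

Keeping all the stages in one assembled pipeline is what lets you save, grid-search, and re-serve the whole thing as a unit.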

Page 48

Spark ML

Cross-validate, grid search params and evaluate metrics.

[Code: grid search with a reference to the ML model stage (RF); metrics to evaluate]

Yes, you can definitely extend and add your own metrics.

Page 49

Spark ML

Score a feature matrix and parse the output.

[Code: get the probability for the predicted class (the default is a probability vector over all classes)]

Page 50

A/B Test your changes

Page 51

Compare to baseline

● Same data, different results
  ○ Use the preprocessed feature matrix (the same one used for the current model)
● Best testing - a production A/B test
  ○ Run the current production model and the new model in parallel
● Metrics improvements (remember your dashboard?)
  ○ Time series analysis of metrics
  ○ Compare metrics over different code versions (improved preprocessing / modeling)
● Deploy / revert = update user assignments
  ○ Based on new metrics / the feedback loop, if possible

Page 52

A/B Infrastructures

Setting up a very basic A/B testing infrastructure, built on the modeling wrapper presented earlier.

[Code: conf holds a mapping of model -> user_id/subject list; score in parallel (inside a map) - distributed = awesome; a fancy Scala union of all the score files]
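The user/model assignment behind the conf mapping can be done statelessly by hashing the subject id into weighted buckets. An illustrative Python sketch (the deck's infrastructure is Scala/Spark; the 100-bucket scheme and equal default weights are assumptions):

```python
import hashlib

def assign_model(user_id, models, weights=None):
    """Deterministically hash a user into one of 100 buckets, then map the
    bucket to a model by weight - stable across runs, no stored state."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    weights = weights or [100 // len(models)] * len(models)
    cumulative = 0
    for model, weight in zip(models, weights):
        cumulative += weight
        if bucket < cumulative:
            return model
    return models[-1]  # any rounding remainder falls to the last model
```

Deterministic hashing means deploy / revert is just a weight change: no assignment table to migrate, and a user never flips models mid-experiment.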

Page 53

Watch. Iterate.

Page 54

Constant improvement

● Respond to anomalies (alerts) on metric reads
● Try out new stuff
  ○ Tech versions (e.g. a new Spark version)
  ○ New data sources
  ○ New features
● When you find something interesting - "Go to work."

Remember: trends and industries change; re-training on new data is not a bad thing.

Page 55

Ad-Hoc statistics

Page 56

Enter Apache Zeppelin

● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
  ○ How many predictions did you output last month?
  ○ How many new users had a prediction with probability > 0.7?
  ○ How accurate were we on last month's predictions? (join with real data)
● No need to compile anything!

Page 57

Playing with it

Setting up Zeppelin to use our jars.

Page 58

Playing with it

Read a parquet file, show statistics, register it as a table and run SparkSQL on it.

[Code: Parquet - already has a schema inside; register for usage in SparkSQL]

Page 59

Playing with it

Using spark-csv by Databricks.

[Code: CSV to DataFrame by Databricks]

Page 60

Playing with it

Using user-compiled code inside a notebook.

[Code: bring your own code]

Page 61

Technological pitfalls

Page 62

Keep in mind

● Code produced with
  ○ Apache Spark 1.6 / Scala 2.11.4
● RDD VS DataFrame
  ○ Enter the "Dataset API" (v2.0+)
● mllib VS spark.ml
  ○ Always use spark.ml if the functionality exists
● Algorithmic richness
● Using Parquet
  ○ Intermediate outputs
● Unbalanced partitions
  ○ Stuck on reduce
  ○ Stragglers
● Output size
  ○ Coalesce to the desired size
● DataFrame windows - buggy
  ○ Write your own over RDD
● Parameter tuning
  ○ spark.sql.partitions
  ○ Executors
  ○ Driver VS executor memory

Page 63

Putting it all together

Page 64

Work Process

Step by step for deploying your big ML workflows to production, ready for operations and optimisations.

1. Measure first, optimize second.
   a. Define metrics.
   b. Preprocess data (using examples).
   c. Monitor (dashboard setup).
2. Start small and grow.
3. Start with a flow.
   a. Good ML code trumps performance.
   b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
   a. Optimize.
   b. A/B.
      i. Test the new flow in parallel to the existing flow.
      ii. Update user assignments.
6. Watch. Iterate. (see 5.)

Page 66

Use Cases - what does Waze do with all its data?

Page 67

Trending Locations / Day of Week Breakdown

Page 68

Opening Hours Inference

Page 69

Optimising - Ad clicks / Time from drive start

Page 70

Time to Content (US) - Day of week / Category

Page 71

Irregular Events / Anomaly Detection

Major events causing out-of-the-ordinary traffic, road blocks, etc., affecting large numbers of users.

Page 72

Dangerous Places - Clustering

Find the most dangerous areas / streets, using custom-developed clustering algorithms.

● Alert authorities / users
● Compare & share with 3rd parties (NYPD)

Page 73

Parking Places Detection

Parking entrance

Parking lot

Street parking

Page 74

Server Distribution Optimisation

Calculate the optimal routing-server distribution according to geographical load.

● Better experience - faster response times
● Saves money - no need for redundant elastic scaling of servers

Page 75

Text Mining - Topic Analysis

Topic 1 - ETA: wazers, eta, con, zona, usando, real, tiempo, carretera
Topic 2 - Unusual: usual, traffic, stay, today, times, clear, slower, accident
Topic 3 - Share info: road, driving, info, using, area, realtime, sharing, soci
Topic 4 - Reports: social, drivers, reporting, helped, nearby, traffic, jam, drive
Topic 5 - Jams: still, will, update, drive, delay, add, jammed, near
Topic 6 - Voice: morgan, ang, freeman, kanan, voice, meter, kan, masuk

Page 76

Text Mining - New Version Impressions

● Text analysis - stemming / stopword detection, etc.
● Topic modeling
● Sentiment analysis

Waze V4 update:
● Good - "redesign", "smarter", "cleaner", "improved"
● Bad - "stuck"

Overall, a very positive score!

Page 77

Text Mining - Store Sentiments

Page 78

Text Mining - Sentiment by Time & Place

Page 79
Page 80

Daniel Marcous
dmarcous@google.com / dmarcous@gmail.com