Production-Ready BIG ML Workflows - from zero to hero
By Daniel Marcous (Google, Waze, Data Wizard) dmarcous@gmail/google.com
Big Data Analytics: Production-Ready Flows & Waze Use Cases
Rules
1. Interactive is interesting.
2. If you've got something to say, say it!
3. Be open-minded. I'm sure I have something to learn from you; I hope you have something to learn from me.
What’s a Data Wizard you ask?
Gain Actionable Insights!
What’s here?
● Methodology: deploying big models to production, step by step
● Pitfalls: what to look out for, in both methodology and code
● Use Cases: showing off what we actually do in Waze Analytics
Based on tough lessons learned, and on recommendations and input from Google experts.
Why Big Data?
Google in just 1 minute:
● 1,000 new devices
● 3M searches
● 100 hours of content
● 1B activated devices
● 100M GB search index
10+ Years of Tackling Big Data Problems
● 8 Google papers (2002-2015): GFS, MapReduce, BigTable, Dremel, PubSub, FlumeJava, Millwheel
● Open source (from 2005 onwards): Apache Beam, TensorFlow
● Google Cloud products: BigQuery, Pub/Sub, Dataflow, Bigtable
“Google is living a few years in the future and sending the rest of us messages”
- Doug Cutting, Hadoop co-creator
Why Big ML?
Bigger is better
● More processing power
○ Grid search all the parameters you ever wanted.
○ Cross-validate in parallel with no extra effort.
● Keep training until you hit 0
○ Some models cannot overfit when optimising until training error is 0.
■ RF: more trees
■ ANN: more iterations
● Handle BIG data
○ Tons of training data (if you have it): no need for sampling on the wrong populations!
○ Millions of features? Easy… (text processing with TF-IDF)
○ Some models (ANN) can't do well without training on a lot of data.
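The TF-IDF trick mentioned above fits in a few lines. A minimal pure-Python sketch (the function name and toy documents are mine, not from the deck; real pipelines would do this with Spark's TF-IDF transformers):

```python
import math
from collections import Counter

def tf_idf(docs):
    """TF-IDF weights per document (illustrative helper, whitespace tokenizer)."""
    n = len(docs)
    # Document frequency: how many docs contain each term
    df = Counter(term for doc in docs for term in set(doc.split()))
    weights = []
    for doc in docs:
        tf = Counter(doc.split())
        total = sum(tf.values())
        # Classic tf * log(N / df); a term present in every doc gets weight 0
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

w = tf_idf(["waze traffic jam", "waze report", "waze traffic jam jam"])
```

Here "waze" appears in every document, so its weight is 0 everywhere; rarer terms score higher.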
Challenges
Bigger is harder
● Skill gap - Big data engineer (Scala/Java) VS Researcher/PHD (Python/R)● Curse of dimensionality
○ Some algorithms require exponential time/memory as dimensions grow○ Harder and more important to tell what’s gold and what’s noise○ Unbalance data goes a long way with more records
● Big model != Small model○ Different parameter settings○ Different metric readings
■ Different implementations (distributed VS central memory)■ Different programming language (heuristics)
○ Different populations trained on (sampling)
Solution = Workflow
Measure first, optimize second.
Before you start
● Create example input
○ Raw input
● Set up your metrics
○ Derived from business needs
○ Confusion matrix
○ Precision / recall
■ Per-class metrics
○ AUC
○ Coverage
■ Amount of subjects affected
■ Sometimes measured as average precision per K random subjects
Remember: desired short-term behaviour does not imply long-term behaviour.
Measure
Preprocess (parse, clean, join, etc.)
● Create example output
○ Featured input
○ Prediction rows
Preprocess
● Naive feature matrix
○ Parse (text -> RDD[Object] -> DataFrame)
○ Clean (remove outliers / bad records)
○ Join
○ Remove non-features
● Get real data
● Create a baseline dataset for training
○ Add some basic features
■ Day of week / hour / etc.
○ Write a READABLE CSV that you can start working with.
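A baseline dataset with basic time features, sketched in plain Python (the deck's real pipeline is Spark/Scala; the column names here are hypothetical):

```python
import csv
import io
from datetime import datetime

def add_basic_features(rows):
    """Add day-of-week / hour features to raw (timestamp, value) rows."""
    out = []
    for ts, value in rows:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        out.append({"ts": ts, "value": value,
                    "day_of_week": t.strftime("%A"), "hour": t.hour})
    return out

def write_readable_csv(rows, fh):
    """The READABLE CSV you can start working with."""
    writer = csv.DictWriter(fh, fieldnames=["ts", "value", "day_of_week", "hour"])
    writer.writeheader()
    writer.writerows(rows)

rows = add_basic_features([("2016-03-14 08:30:00", "42")])
buf = io.StringIO()
write_readable_csv(rows, buf)
```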
Preprocess
Case Class RDD to DataFrame
RDD[String] to Case Class RDD
String row to object
Preprocess
Parse string to Object with java.sql types
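The parsing step these captions describe (raw string row -> typed object -> DataFrame-ready records) is Scala in the deck; a rough pure-Python analogue, where the `Ride` record and its fields are made up for illustration:

```python
from collections import namedtuple
from datetime import datetime

# Stand-in for the Scala case class the raw rows are parsed into
Ride = namedtuple("Ride", ["user_id", "started_at", "meters"])

def parse_row(line, sep="\t"):
    """Parse one raw text row into a typed record; bad rows become None."""
    parts = line.split(sep)
    if len(parts) != 3:
        return None  # dropped in the cleaning step
    try:
        return Ride(int(parts[0]),
                    datetime.strptime(parts[1], "%Y-%m-%d %H:%M:%S"),
                    float(parts[2]))
    except ValueError:
        return None

rows = ["7\t2016-01-02 10:00:00\t1234.5", "garbage line"]
parsed = [r for r in map(parse_row, rows) if r is not None]
```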
Metric Generation
Craft useful metrics.
Per-class metrics
Confusion matrix by hand
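A confusion matrix "by hand", with per-class precision / recall derived from it. The deck does this in Scala/Spark; this is a plain-Python sketch with helper names of my own:

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

def per_class_metrics(actual, predicted):
    """Per-class precision / recall computed from the confusion matrix."""
    cm = confusion_matrix(actual, predicted)
    metrics = {}
    for c in set(actual) | set(predicted):
        tp = cm[(c, c)]
        fp = sum(v for (a, p), v in cm.items() if p == c and a != c)
        fn = sum(v for (a, p), v in cm.items() if a == c and p != c)
        metrics[c] = {"precision": tp / (tp + fp) if tp + fp else 0.0,
                      "recall": tp / (tp + fn) if tp + fn else 0.0}
    return metrics
```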
Monitor.
Visualise - the easiest way to measure quickly
● Set up your dashboard
○ Amounts of input data
■ Before / after joining
○ Amounts of output data
○ Metrics (see "Measure first, optimize second")
● Different model comparison: what's best, when, and where
● Timeseries analysis
○ Anomaly detection: does a metric suddenly, drastically change?
○ Impact analysis: did deploying a model have a significant effect on a metric?
● Web application framework for R
○ Introduces user interaction to analysis
○ Combines ad-hoc testing with R's statistical / modeling power
● Turns R function wrappers into interactive dashboard elements
○ Generates the HTML, CSS, and JS behind the scenes, so you only write R
● Get started ● Get inspired ● Shiny @Waze
Shiny
Dashboard monitoring
Dashboard should support: picking different models, comparing metrics.
Pick models to compare
Statistical tests on distributions: t.test / AUC
Dashboard monitoring
Dashboard should support: timeseries anomaly detection, and impact analysis (deploying a new model).
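A minimal sketch of the anomaly-detection idea, assuming a simple trailing-window z-score rule (the deck does not specify which detection method the dashboard uses):

```python
import statistics

def anomalies(series, window=7, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard
    deviations away from the trailing-window mean."""
    flagged = []
    for i in range(window, len(series)):
        past = series[i - window:i]
        mu = statistics.mean(past)
        sd = statistics.pstdev(past)
        if sd > 0 and abs(series[i] - mu) / sd > threshold:
            flagged.append(i)
    return flagged
```

A flat metric never alarms; a sudden jump after a model deploy gets flagged for impact analysis.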
Start small and grow.
Reduce the problem
● Tradeoff: time to market VS loss of accuracy
● Sample data
○ Is random actually what you want?
■ Keep label distributions
■ Keep important features' distributions
● Test everything you believe worthy
○ Choose model
○ Choose features (important when you go big)
■ Leave the "border" significant ones in
○ Test different parameter configurations (you'll need to validate your choice later)
Remember: this isn't your production model. You're only getting a sense of the data for now.
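Sampling while keeping the label distribution is stratified sampling. A plain-Python sketch (helper names are mine; in practice this would be Spark's `sampleBy` or similar):

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=42):
    """Sample `fraction` of the rows per label, so the label
    distribution of the sample matches the full dataset."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for label, group in by_label.items():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample
```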
Getting a feel
Exploring a dataset with R.
Dividing data into training and testing sets.
Random partitioning
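Random partitioning into training and testing sets, as a minimal Python sketch (the deck does this in R; the 70/30 split is an assumption):

```python
import random

def random_partition(rows, train_fraction=0.7, seed=1):
    """Randomly split rows into (train, test) without losing any row."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```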
Getting a feel
Logistic regression and basic variable selection with R.
Logistic regression
Variable significance test
Getting a feel
Advanced variable selection with regularisation techniques in R.
Intercepts - by significance
No intercept = not entered into the model
Getting a feel
Trying modeling techniques in R.
Root mean square error
Lower = better (~ kinda)
Fit a gradient boosted trees model
Getting a feel
Modeling bigger data with R, using parallelism.
Fit and combine 6 random forest models (10k trees each) in parallel
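The "fit several forests in parallel, then combine" trick can be illustrated with a toy stand-in model. This sketch uses threads and a trivial bootstrap-mean "model" purely to show the fit-in-parallel / combine shape; the deck's real code fits 10k-tree R random forests:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def fit_one(data, seed):
    """Toy 'model': the mean of a bootstrap sample (stand-in for one forest)."""
    rng = random.Random(seed)
    boot = [rng.choice(data) for _ in data]
    return sum(boot) / len(boot)

def fit_ensemble(data, n_models=6):
    """Fit n_models in parallel, then combine them by averaging."""
    with ThreadPoolExecutor(max_workers=n_models) as pool:
        models = list(pool.map(lambda s: fit_one(data, s), range(n_models)))
    return sum(models) / len(models)
```

For real forests the combine step would merge the trees rather than average predictions, but the parallel shape is the same.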
Start with a flow.
Basic moving parts
Data sources 1..N -> Preprocess -> Feature matrix
Feature matrix -> Training -> Models 1..N
Feature matrix + Models 1..N -> Scoring -> Predictions 1..N
Predictions 1..N -> Serving DB -> Feedback loop
Alongside: Dashboard, Conf., User/Model assignments
Flow motives
● Only 1 job for preprocessing○ Used in both training and serving - reduces risk of training on wrong population○ Should also be used before sampling when experimenting on a smaller scale.
○ When data sources are different for training and serving (RT VS Batch for example) use interfaces!
● Saving training & scoring feature matrices aside○ Try new algorithms / parameters on the same data○ Measure changes on same data as used in production.
Reusable flow code
Create a feature generation interface and some UDFs with Spark. Use later for both training and scoring purposes with minor changes.
SparkSQL UDFs implement feature generation; this decouples training and serving
Data cleaning work
Reusable flow code
Generate feature matrix (a black box from the app's view)
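The decoupling idea: one feature-generation function shared by training and scoring, so the two paths cannot drift apart. A plain-Python sketch (the deck does this with SparkSQL UDFs; the feature names and record fields here are hypothetical):

```python
def generate_features(record):
    """Single shared feature-generation step (illustrative features)."""
    return {"speed_kmh": record["meters"] / 1000 / (record["minutes"] / 60),
            "rush_hour": 1 if 7 <= record["hour"] <= 9 else 0}

def build_training_matrix(records, label_of):
    """Training path: same features, plus the label."""
    return [dict(generate_features(r), label=label_of(r)) for r in records]

def build_scoring_matrix(records):
    """Serving path: identical features, no label."""
    return [generate_features(r) for r in records]
```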
Good ML code trumps performance.
Why so many parts, you ask?
● Scaling
● Fault tolerance
○ Failed preprocessing / training doesn't affect the serving model
○ Rerun only the failed parts
● Different logical parts, different processes (see "Clean Code" by Uncle Bob)
○ Easier to read
○ Easier to change code: targeted changes only affect their specific process
○ One input, one output (almost…)
● Easier to tweak and deploy changes
Test your infrastructure.
@Test
● Supposed to happen throughout development; if not, now is the time to make sure you have it!
○ Data read correctly
■ Null rates?
○ Features calculated correctly
■ Does my complex join / function / logic return what it should?
○ Access
■ Can I access all the data sources from my "production" account?
○ Formats
■ Adapt for variance in non-structured formats such as JSON
○ Required latency
Set up a baseline. Start with a neutral launch.
● Take a snapshot of your metric reads:
○ The ones you chose earlier in the process as important to you
■ Confusion matrix
■ Weighted average % classified correctly
■ % subject coverage
● Latency
○ Building the feature matrix on the last day's data takes X minutes
○ Training takes X hours
○ Serving predictions on Y records takes X seconds
Remember: you are running with a naive model. Anything better than the old model / random is OK.
Go to work. Coffee recommended at this point.
Optimize: what? How?
● Grid search over parameters
● Evaluate metrics
○ Using a predefined Spark Evaluator
○ Using user-defined metrics
● Cross-validate everything
● Tweak preprocessing (mainly features)
○ Feature engineering
○ Feature transformers
■ Discretize / normalise
○ Feature selectors
○ In Apache Spark 1.6
● Tweak training
○ Different models
○ Different model parameters
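Grid search plus k-fold cross-validation, as a minimal framework-free sketch (the deck uses spark.ml's CrossValidator; `train_eval` here is a hypothetical callback that trains on one split, evaluates on the other, and returns a metric to maximise):

```python
import statistics

def cross_validate(train_eval, data, params, k=3):
    """k-fold CV: average the metric returned by train_eval over the folds."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, f in enumerate(folds) if j != i for x in f]
        scores.append(train_eval(train, test, params))
    return statistics.mean(scores)

def grid_search(train_eval, data, grid, k=3):
    """Try every parameter combination; return the best by CV score."""
    best_params, best_score = None, float("-inf")
    for params in grid:
        score = cross_validate(train_eval, data, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```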
Spark ML
Building an easy to use wrapper around training and serving.
Build model pipeline, train, evaluate
Not necessarily a random split
Spark ML
Building a training pipeline with spark.ml.
Create dummy variables
Required response label format
The ML model itself
Labels back to readable format
Assembled training pipeline
Spark ML
Cross-validate, grid search params and evaluate metrics.
Grid search with reference to the ML model stage (RF)
Metrics to evaluate
Yes, you can definitely extend and add your own metrics.
Spark ML
Score a feature matrix and parse output.
Get probability for predicted class(default is a probability vector for all classes)
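Keeping only the predicted class and its probability, given the full probability vector Spark returns, can be sketched as a plain-Python analogue:

```python
def predicted_class_probability(prob_vector, labels):
    """From a probability vector over all classes, return the predicted
    label and its probability only (argmax of the vector)."""
    best = max(range(len(prob_vector)), key=prob_vector.__getitem__)
    return labels[best], prob_vector[best]
```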
A/B test your changes
● Same data, different results
○ Use the preprocessed feature matrix (the same one used for the current model)
● Best testing: a production A/B test
○ Use the current production model and the new model in parallel
● Metrics improvements (remember your dashboard?)
○ Time series analysis of metrics
○ Compare metrics over different code versions (improved preprocessing / modeling)
● Deploy / revert = update user assignments
○ Based on new metrics / feedback loop if possible
Compare to baseline
A/B Infrastructures
Setting up a very basic A/B testing infrastructure built upon our earlier presented modeling wrapper.
Conf holds a mapping of: model -> user_id/subject list
Score in parallel (inside a map). Distributed = awesome.
Fancy Scala union of all score files
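The conf-driven A/B routing can be sketched like this; `CONF` and the model names are made-up examples, and the deck's real scoring runs inside a distributed Spark map rather than a dict comprehension:

```python
# Conf: model name -> set of assigned user ids (hypothetical example)
CONF = {"model_current": {1, 2, 3}, "model_new": {4, 5}}

def model_for_user(user_id, conf, default="model_current"):
    """Route a user to its assigned model, falling back to the default."""
    for model, users in conf.items():
        if user_id in users:
            return model
    return default

def score_all(users, conf, models):
    """Score every user with its assigned model's scoring function."""
    return {u: models[model_for_user(u, conf)](u) for u in users}
```

Deploying or reverting is then just an update to the conf mapping, as the slide says.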
Watch. Iterate.
● Respond to anomalies (alerts) on metric reads
● Try out new stuff
○ Tech versions (e.g. a new Spark version)
○ New data sources
○ New features
● When you find something interesting: "Go to work."
Constant improvement
Remember: trends and industries change; re-training on new data is not a bad thing.
Ad-Hoc statistics
● If you wrote your code right, you can easily reuse it in a notebook!
● Answer ad-hoc questions
○ How many predictions did you output last month?
○ How many new users had a prediction with probability > 0.7?
○ How accurate were we on last month's predictions? (join with real data)
● No need to compile anything!
Enter Apache Zeppelin
Playing with it
Setting up Zeppelin to use our jars.
Playing with it
Read a parquet file, show statistics, register it as a table, and run SparkSQL on it.
Parquet - already has a schema inside
For usage in SparkSQL
Playing with it
Using spark-csv by Databricks.
CSV to DataFrame by Databricks
Using user compiled code inside a notebook.
Playing with it
Bring your own code
Technological pitfalls
Keep in mind
● Code produced with Apache Spark 1.6 / Scala 2.11.4
● RDD VS DataFrame
○ Enter the "Dataset API" (v2.0+)
● mllib VS spark.ml
○ Always use spark.ml if the functionality exists
○ Algorithmic richness
● Using Parquet
○ Intermediate outputs
● Unbalanced partitions
○ Stuck on reduce
○ Stragglers
● Output size
○ Coalesce to the desired size
● DataFrame Windows: buggy
○ Write your own over RDD
● Parameter tuning
○ spark.sql.partitions
○ Executors
○ Driver VS executor memory
Putting it all together
Work Process
Step by step: deploying your big ML workflows to production, ready for operations and optimisations.
1. Measure first, optimize second.
   a. Define metrics.
   b. Preprocess data (using examples).
   c. Monitor (dashboard setup).
2. Start small and grow.
3. Start with a flow.
   a. Good ML code trumps performance.
   b. Test your infrastructure.
4. Set up a baseline.
5. Go to work.
   a. Optimize.
   b. A/B.
      i. Test the new flow in parallel to the existing flow.
      ii. Update user assignments.
6. Watch. Iterate. (see 5)
Code: https://github.com/dmarcous/BigMLFlow/
Slides: http://www.slideshare.net/DanielMarcous/productionready-big-ml-workflows-from-zero-to-hero
Use Cases: what does Waze do with all its data?
Trending Locations / Day of Week Breakdown
Opening Hours Inference
Optimising - Ad clicks / Time from drive start
Time to Content (US) - Day of week / Category
Irregular Events / Anomaly Detection
Major events causing out-of-the-ordinary traffic, road blocks, etc., affecting large numbers of users.
Dangerous Places - Clustering
Find most dangerous areas / streets, using custom developed clustering algorithms
● Alert authorities / users
● Compare & share with 3rd parties (NYPD)
Parking Places Detection
Parking entrance
Parking lot
Street parking
Server Distribution Optimisation
Calculate the optimal routing-server distribution according to geographical load.
● Better experience: faster response time
● Saves money: no need for redundant elastic scaling of servers
Text Mining - Topic Analysis
Topic 1 - ETA: wazers, eta, con, zona, usando, real, tiempo, carretera
Topic 2 - Unusual: usual, traffic, stay, today, times, clear, slower, accident
Topic 3 - Share info: road, driving, info, using, area, realtime, sharing, soci
Topic 4 - Reports: social, drivers, reporting, helped, nearby, traffic, jam, drive
Topic 5 - Jams: still, will, update, drive, delay, add, jammed, near
Topic 6 - Voice: morgan, ang, freeman, kanan, voice, meter, kan, masuk
Text Mining - New Version Impressions
● Text analysis: stemming / stopword detection, etc.
● Topic modeling
● Sentiment analysis
Waze V4 update:
● Good: "redesign", "smarter", "cleaner", "improved"
● Bad: "stuck"
Overall very positive score!
Text Mining - Store Sentiments
Text Mining - Sentiment by Time & Place
Daniel [email protected]@gmail.com