Scalable predictive pipelines with Spark and Scala
-
Upload
tom-van-der-weide -
Category
Data & Analytics
-
view
172 -
download
5
Transcript of Scalable predictive pipelines with Spark and Scala
THE FUTURE IS NOW
Scalable Predictive Pipelines with Spark and ScalaBased on a presentation from Dimitris PapadopoulosTom van der Weide
3
About Schibsted
4
About Schibsted
5
About Schibsted
6
Event Tracking Data
7
Event Tracking Data
8
Event Tracking Data
9
Event Tracking Data
10
Data Science Tasks
DataModel
Results
Preprocessing
1. Using Spark ML Pipelines
2. Scalable Pipelines
11
Outline
12
Pipeline
13
Pipeline
14
Pipeline
15
Pipeline
Check out:Machine Learning: The High-Interest Credit Card of Technical Debt
by Google employees
16
Not a pipe
17
Pipeline Stage
One or more inputs Strictly one output
18
Pipeline Stage
● Closed under concatenation
One or more inputs Strictly one output
19
Pipeline Stage
● Closed under concatenation● Standalone and runnable● Spark™ ML inside
One or more inputs Strictly one output
20
Spark ML Pipelines
21
Spark ML Pipelines
Using a Pipeline to train a model
22
Spark ML Pipelines
Using a PipelineModel to get predictions
23
Example: Gender Prediction PipelineGender PredictorFeaturizer
Generates features from events
24
Example: Gender Prediction PipelinePreprocessorCleans up events
Aggregationaggregates featuresper user
Gender PredictorFeaturizerGenerates features from events
25
Example: Gender Prediction PipelinePreprocessorCleans up events
User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.
Aggregationaggregates featuresper user
Gender PredictorJoins features with users,trains classifier & computes predictions
FeaturizerGenerates features from events
26
Example: Gender Prediction Pipeline
User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.
Gender PredictorJoins features with users,trains classifier & computes predictions
Performance Evaluatorcomputes performance metricse.g. accuracy and area under ROC
PreprocessorCleans up events
Aggregationaggregates featuresper user
FeaturizerGenerates features from events
27
Scalability pain points
User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.
Gender PredictorJoins features with users,trains classifier & computes predictions
Performance Evaluatorcomputes performance metricse.g. accuracy and area under ROC
30 days of input dataFor better accuracy
PreprocessorCleans up events
Aggregationaggregates featuresper user
FeaturizerGenerates features from events
28
More data for better performance
Redundant processing
29
25/05 26/05 27/05 28/05 29/05 30/05 31/05 01/06
Model for 28/05
Model for 29/05
Model for 30/05
Model for 31/05
Input data per day
………...
30
Scalable Pipelines: pain points
Memory and processing heavy● In one use-case, for 7 days lookback (~7 x 200M events)
we used to need 20 Spark executors with 22G of memory each.
Not easily scalable● As the lookback increases ● As more and more sites are incorporated into our pipelines
Redundant processing● For K days of lookback, we are repeating processing of K-1 days worth of data,
when we run the pipeline every day, in a rolling window fashion.
“What will happen if we try to process30 days worth of data (e.g. 6B events) ???”
31
Saved by Algebra
● The operations (op) along with the corresponding data structures (S) that we are interested in are monoids.○ Associative:
■ for all A,B,C in S, (A op B) op C = A op (B op C) ○ Identity element:
■ there exists E in S such that for each A in S, E op A = A op E = A
● Examples:○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4) = 1 + ((2 + 3) + 4)○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
32
Scalable Pipelines: in monoids fashion
Split work into smaller chunks, cache intermediate results
● Split the aggregations into smaller chunks○ Persist aggregations per day
● Combine partial aggregations into final feature vectors
User Defined Functions (UDF)compute a value looking at a single row
e.g. SELECT concat(‘user’, user_id) FROM events
User Defined Aggregate Functions (UDAF) since Spark 1.5computation over groups of rows
e.g. SELECT user_id, AVG(time_spent) FROM events GROUP BY user_id
33
Scalable Pipelines: building blocks
For some user, we already aggregated events per day:
34
Example: map with term frequencies
Monoid operation: MapAggregator ⊕i.e. m1 ⊕ m2 ⊕ m3 = (m1 ⊕ m2) ⊕ m3 = m1 ⊕ (m2 ⊕ m3) m0 = Map()
Result:
35
Buffer for intermediate results
Get content from buffer
Merge buffers of two executors
36
Aggregation pipeline stage
User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.
Gender PredictorJoins features with users,trains classifier & computes predictions
Performance Evaluatorcomputes performance metricse.g. accuracy and area under ROC
PreprocessorCleans up events
Aggregationaggregates featuresper user
FeaturizerGenerates features from events
37
Scalable Pipelines: Aggregating Events
38
Scalable Pipelines: Aggregating EventsIt’s a Transformer
39
Scalable Pipelines: Aggregating EventsIt’s a Transformer
DataFrame in , DataFrame out
40
Scalable Pipelines: Aggregating EventsIt’s a Transformer
DataFrame in , DataFrame out
Aggregating maps of feature frequency counts
41
Scalable Pipelines: closing remarks
● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!
42
Scalable Pipelines: closing remarks
● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!
● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data
43
Scalable Pipelines: closing remarks
● Needless to say, more factors contribute towards a scalable pipeline:○ Performance tuning of the Spark cluster○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration
44
Scalable Pipelines: closing remarks
● Needless to say, more factors contribute towards a scalable pipeline:○ Performance tuning of the Spark cluster○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration
● But each one of these is a topic for a separate talk
45
Shameless plug
We are hiring!
Across all our hubs
in London, Oslo, Stockholm, Barcelona
for Data Science, Engineering, UX and Product roles
https://jobs.lever.co/[email protected]
46
Q/A
Thank you!