Scalable predictive pipelines with Spark and Scala

46
THE FUTURE IS NOW

Transcript of Scalable predictive pipelines with Spark and Scala

Page 1: Scalable predictive pipelines with Spark and Scala

THE FUTURE IS NOW

Page 2: Scalable predictive pipelines with Spark and Scala

Scalable Predictive Pipelines with Spark and ScalaBased on a presentation from Dimitris PapadopoulosTom van der Weide

Page 3: Scalable predictive pipelines with Spark and Scala

3

About Schibsted

Page 4: Scalable predictive pipelines with Spark and Scala

4

About Schibsted

Page 5: Scalable predictive pipelines with Spark and Scala

5

About Schibsted

Page 6: Scalable predictive pipelines with Spark and Scala

6

Event Tracking Data

Page 7: Scalable predictive pipelines with Spark and Scala

7

Event Tracking Data

Page 8: Scalable predictive pipelines with Spark and Scala

8

Event Tracking Data

Page 9: Scalable predictive pipelines with Spark and Scala

9

Event Tracking Data

Page 10: Scalable predictive pipelines with Spark and Scala

10

Data Science Tasks

DataModel

Results

Preprocessing

Page 11: Scalable predictive pipelines with Spark and Scala

1. Using Spark ML Pipelines

2. Scalable Pipelines

11

Outline

Page 12: Scalable predictive pipelines with Spark and Scala

12

Pipeline

Page 13: Scalable predictive pipelines with Spark and Scala

13

Pipeline

Page 14: Scalable predictive pipelines with Spark and Scala

14

Pipeline

Page 15: Scalable predictive pipelines with Spark and Scala

15

Pipeline

Check out:Machine Learning: The High-Interest Credit Card of Technical Debt

by Google employees

Page 16: Scalable predictive pipelines with Spark and Scala

16

Not a pipe

Page 17: Scalable predictive pipelines with Spark and Scala

17

Pipeline Stage

One or more inputs Strictly one output

Page 18: Scalable predictive pipelines with Spark and Scala

18

Pipeline Stage

● Closed under concatenation

One or more inputs Strictly one output

Page 19: Scalable predictive pipelines with Spark and Scala

19

Pipeline Stage

● Closed under concatenation● Standalone and runnable● Spark™ ML inside

One or more inputs Strictly one output

Page 20: Scalable predictive pipelines with Spark and Scala

20

Spark ML Pipelines

Page 21: Scalable predictive pipelines with Spark and Scala

21

Spark ML Pipelines

Using a Pipeline to train a model

Page 22: Scalable predictive pipelines with Spark and Scala

22

Spark ML Pipelines

Using a PipelineModel to get predictions

Page 23: Scalable predictive pipelines with Spark and Scala

23

Example: Gender Prediction PipelineGender PredictorFeaturizer

Generates features from events

Page 24: Scalable predictive pipelines with Spark and Scala

24

Example: Gender Prediction PipelinePreprocessorCleans up events

Aggregationaggregates featuresper user

Gender PredictorFeaturizerGenerates features from events

Page 25: Scalable predictive pipelines with Spark and Scala

25

Example: Gender Prediction PipelinePreprocessorCleans up events

User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.

Aggregationaggregates featuresper user

Gender PredictorJoins features with users,trains classifier & computes predictions

FeaturizerGenerates features from events

Page 26: Scalable predictive pipelines with Spark and Scala

26

Example: Gender Prediction Pipeline

User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.

Gender PredictorJoins features with users,trains classifier & computes predictions

Performance Evaluatorcomputes performance metricse.g. accuracy and area under ROC

PreprocessorCleans up events

Aggregationaggregates featuresper user

FeaturizerGenerates features from events

Page 27: Scalable predictive pipelines with Spark and Scala

27

Scalability pain points

User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.

Gender PredictorJoins features with users,trains classifier & computes predictions

Performance Evaluatorcomputes performance metricse.g. accuracy and area under ROC

30 days of input dataFor better accuracy

PreprocessorCleans up events

Aggregationaggregates featuresper user

FeaturizerGenerates features from events

Page 28: Scalable predictive pipelines with Spark and Scala

28

More data for better performance

Page 29: Scalable predictive pipelines with Spark and Scala

Redundant processing

29

25/05 26/05 27/05 28/05 29/05 30/05 31/05 01/06

Model for 28/05

Model for 29/05

Model for 30/05

Model for 31/05

Input data per day

………...

Page 30: Scalable predictive pipelines with Spark and Scala

30

Scalable Pipelines: pain points

Memory and processing heavy● In one use-case, for 7 days lookback (~7 x 200M events)

we used to need 20 Spark executors with 22G of memory each.

Not easily scalable● As the lookback increases ● As more and more sites are incorporated into our pipelines

Redundant processing● For K days of lookback, we are repeating processing of K-1 days worth of data,

when we run the pipeline every day, in a rolling window fashion.

“What will happen if we try to process30 days worth of data (e.g. 6B events) ???”

Page 31: Scalable predictive pipelines with Spark and Scala

31

Saved by Algebra

● The operations (op) along with the corresponding data structures (S) that we are interested in are monoids.○ Associative:

■ for all A,B,C in S, (A op B) op C = A op (B op C) ○ Identity element:

■ there exists E in S such that for each A in S, E op A = A op E = A

● Examples:○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4) = 1 + ((2 + 3) + 4)○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]

Page 32: Scalable predictive pipelines with Spark and Scala

32

Scalable Pipelines: in monoids fashion

Split work into smaller chunks, cache intermediate results

● Split the aggregations into smaller chunks○ Persist aggregations per day

● Combine partial aggregations into final feature vectors

Page 33: Scalable predictive pipelines with Spark and Scala

User Defined Functions (UDF)compute a value looking at a single row

e.g. SELECT concat(‘user’, user_id) FROM events

User Defined Aggregate Functions (UDAF) since Spark 1.5computation over groups of rows

e.g. SELECT user_id, AVG(time_spent) FROM events GROUP BY user_id

33

Scalable Pipelines: building blocks

Page 34: Scalable predictive pipelines with Spark and Scala

For some user, we already aggregated events per day:

34

Example: map with term frequencies

Monoid operation: MapAggregator ⊕i.e. m1 ⊕ m2 ⊕ m3 = (m1 ⊕ m2) ⊕ m3 = m1 ⊕ (m2 ⊕ m3) m0 = Map()

Result:

Page 35: Scalable predictive pipelines with Spark and Scala

35

Buffer for intermediate results

Get content from buffer

Merge buffers of two executors

Page 36: Scalable predictive pipelines with Spark and Scala

36

Aggregation pipeline stage

User Account Preprocessorharmonises schemas of user accounts across sites. Provides ground truth data for training.

Gender PredictorJoins features with users,trains classifier & computes predictions

Performance Evaluatorcomputes performance metricse.g. accuracy and area under ROC

PreprocessorCleans up events

Aggregationaggregates featuresper user

FeaturizerGenerates features from events

Page 37: Scalable predictive pipelines with Spark and Scala

37

Scalable Pipelines: Aggregating Events

Page 38: Scalable predictive pipelines with Spark and Scala

38

Scalable Pipelines: Aggregating EventsIt’s a Transformer

Page 39: Scalable predictive pipelines with Spark and Scala

39

Scalable Pipelines: Aggregating EventsIt’s a Transformer

DataFrame in , DataFrame out

Page 40: Scalable predictive pipelines with Spark and Scala

40

Scalable Pipelines: Aggregating EventsIt’s a Transformer

DataFrame in , DataFrame out

Aggregating maps of feature frequency counts

Page 41: Scalable predictive pipelines with Spark and Scala

41

Scalable Pipelines: closing remarks

● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!

Page 42: Scalable predictive pipelines with Spark and Scala

42

Scalable Pipelines: closing remarks

● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!

● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data

Page 43: Scalable predictive pipelines with Spark and Scala

43

Scalable Pipelines: closing remarks

● Needless to say, more factors contribute towards a scalable pipeline:○ Performance tuning of the Spark cluster○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration

Page 44: Scalable predictive pipelines with Spark and Scala

44

Scalable Pipelines: closing remarks

● Needless to say, more factors contribute towards a scalable pipeline:○ Performance tuning of the Spark cluster○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration

● But each one of these is a topic for a separate talk

Page 45: Scalable predictive pipelines with Spark and Scala

45

Shameless plug

We are hiring!

Across all our hubs

in London, Oslo, Stockholm, Barcelona

for Data Science, Engineering, UX and Product roles

https://jobs.lever.co/[email protected]

Page 46: Scalable predictive pipelines with Spark and Scala

46

Q/A

Thank you!