Scalable predictive pipelines with Spark and Scala

THE FUTURE IS NOW

Scalable Predictive Pipelines with Spark and Scala

Dimitris Papadopoulos

About Schibsted

Event Tracking Data

Data Science Tasks

[Diagram labels: Data, Model, Results, Preprocessing]

Outline

1. Using Spark ML Pipelines

2. Scalable Pipelines

Pipeline

Not a pipe

Pipeline Stage

● One or more inputs
● Strictly one output
● Closed under concatenation
● Standalone and runnable
● Spark™ ML inside
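
The stage contract above can be sketched in plain Scala. The `Stage` trait, its `andThen` method, and the two toy stages are hypothetical names chosen for illustration and are not taken from the talk. The sketch only shows what "closed under concatenation" means: chaining two stages yields something that is itself a stage. A stage with several inputs could take a tuple as its input type.

```scala
object StageSketch {
  // Hypothetical abstraction: a stage maps its input to exactly one output.
  trait Stage[A, B] { self =>
    def run(input: A): B

    // Concatenating two stages gives another Stage (closure under concatenation).
    def andThen[C](next: Stage[B, C]): Stage[A, C] = new Stage[A, C] {
      def run(input: A): C = next.run(self.run(input))
    }
  }

  // Toy stage: counts events per user id.
  object CountPerUser extends Stage[Seq[String], Map[String, Int]] {
    def run(userIds: Seq[String]): Map[String, Int] =
      userIds.groupBy(identity).map { case (user, hits) => user -> hits.size }
  }

  // Toy stage: keeps the users that have at least two events.
  object ActiveUsers extends Stage[Map[String, Int], Set[String]] {
    def run(counts: Map[String, Int]): Set[String] =
      counts.collect { case (user, n) if n >= 2 => user }.toSet
  }

  // The concatenation is again a Stage and can be run on its own.
  val pipeline: Stage[Seq[String], Set[String]] = CountPerUser.andThen(ActiveUsers)
}
```

Each stage stays runnable on its own (`StageSketch.CountPerUser.run(...)`), which matches the "standalone and runnable" bullet.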

Spark ML Pipelines

Using a Pipeline to train a model

Using a PipelineModel to get predictions
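
The code shown on these two slides is not part of the transcript. As a minimal sketch of the same idea, assuming Spark 2.x or later on the classpath, the example below fits a `Pipeline` to get a `PipelineModel` and then uses that model to get predictions. The toy data, column names, and choice of `Tokenizer`, `HashingTF` and `LogisticRegression` are illustrative assumptions.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Using a Pipeline to train a model: fit() returns a PipelineModel.
  def train(spark: SparkSession): PipelineModel = {
    import spark.implicits._
    val training = Seq(
      (0L, "spark scala pipeline", 1.0),
      (1L, "cooking recipes", 0.0),
      (2L, "spark ml model", 1.0),
      (3L, "football results", 0.0)
    ).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  // Using a PipelineModel to get predictions: transform() appends a "prediction" column.
  def predict(spark: SparkSession, model: PipelineModel): DataFrame = {
    import spark.implicits._
    val unseen = Seq((4L, "spark pipeline"), (5L, "recipes")).toDF("id", "text")
    model.transform(unseen).select("id", "text", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
    predict(spark, train(spark)).show()
    spark.stop()
  }
}
```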

Peek inside a Spark pipeline

It’s a Pipeline

plain Spark API

From DataFrame to a Model

Peek inside a Spark pipeline

Instantiating a Pipeline

Running it!

Example Pipeline

EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

EventPreprocessor: aggregates events per user

GenderPredictor: creates labels and features, trains classifier & computes predictions

GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

Scalable Pipelines: pain points

Input: 1 day’s / 7 days’ worth of events data. Larger lookbacks needed for better accuracy.

More data for better performance

Performance of three different pipelines vs. lookback length (1, 7, 30, 45 days)

What will happen if we try to process 30 days' worth of data (e.g. 3B events)?

Scalable Pipelines: pain points

Memory and processing heavy:
● In one use case, for a 7-day lookback (~7 x 100M events), we used to need 20 Spark executors with 22G of memory each.

Not easily scalable:
● As the lookback increases
● As more and more sites are incorporated into our pipelines

Redundant processing:
● For K days of lookback, we are repeating the processing of K - 1 days' worth of data each time we run the pipeline daily in a rolling-window fashion, because consecutive windows overlap in all but one day.

“What will happen if we try to process 30 days' worth of data (e.g. 3B events)?”

Saved by Algebra

● The operations (op) along with the corresponding data structures (S) that we are interested in are monoids.
  ○ Associative:
    ■ for all A, B, C in S, (A op B) op C = A op (B op C)
  ○ Identity element:
    ■ there exists E in S such that for each A in S, E op A = A op E = A

● Examples:
  ○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)
  ○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
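
The two monoid laws above can be written down directly in Scala. This is a minimal sketch: the `Monoid` trait and the helper names are assumptions made for illustration, not code from the talk. It shows why associativity matters for pipelines: combining everything in one pass and combining chunk by chunk give the same result.

```scala
object MonoidSketch {
  // A monoid: an associative binary operation plus an identity element.
  trait Monoid[S] {
    def empty: S
    def combine(a: S, b: S): S
  }

  // Summation: identity is 0.
  val intSum: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(a: Int, b: Int): Int = a + b
  }

  // String array concatenation: identity is the empty collection.
  val stringArrayConcat: Monoid[Vector[String]] = new Monoid[Vector[String]] {
    def empty: Vector[String] = Vector.empty
    def combine(a: Vector[String], b: Vector[String]): Vector[String] = a ++ b
  }

  // Combine all values in a single pass, starting from the identity element.
  def combineAll[S](xs: Seq[S], m: Monoid[S]): S =
    xs.foldLeft(m.empty)(m.combine)

  // Combine each chunk first, then combine the partial results.
  // Associativity guarantees the same answer as combineAll. chunkSize must be positive.
  def combineChunked[S](xs: Seq[S], chunkSize: Int, m: Monoid[S]): S =
    combineAll(xs.grouped(chunkSize).map(chunk => combineAll(chunk, m)).toSeq, m)
}
```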

Scalable Pipelines: in monoid fashion

● Split the aggregations into smaller chunks
  ○ i.e. pre-process events per user and single day (not over the entire lookback)

● Make one (or multiple) day aggregates and combine
  ○ i.e. aggregate over the pre-processed events per user and day

● It’s like trying to eat an elephant: one piece at a time!
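
The two steps above can be sketched without Spark, using plain Scala collections. The type aliases, function names, and the `(userId, feature)` event shape are assumptions for illustration. Step 1 produces one small aggregate per day; step 2 only ever touches those aggregates, so a day that was pre-processed yesterday does not have to be re-read from raw events today.

```scala
object DailyAggregatesSketch {
  type Counts = Map[String, Double]          // feature -> frequency count
  type DailyAggregate = Map[String, Counts]  // userId -> counts for one day

  // Monoid operation on Counts: sum values key by key (the empty map is the identity).
  def mergeCounts(a: Counts, b: Counts): Counts =
    b.foldLeft(a) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0.0) + value)
    }

  // Step 1: pre-process a single day's raw events, given as (userId, feature) pairs.
  def preprocessDay(events: Seq[(String, String)]): DailyAggregate =
    events.groupBy(_._1).map { case (user, userEvents) =>
      user -> userEvents.groupBy(_._2).map { case (feature, hits) =>
        feature -> hits.size.toDouble
      }
    }

  // Step 2: combine already pre-processed days over the lookback window.
  def aggregateLookback(days: Seq[DailyAggregate]): DailyAggregate =
    days.foldLeft(Map.empty[String, Counts]) { (acc, day) =>
      day.foldLeft(acc) { case (merged, (user, counts)) =>
        merged.updated(user, mergeCounts(merged.getOrElse(user, Map.empty[String, Double]), counts))
      }
    }
}
```

Because the merge is associative, aggregating the daily results gives the same counts as pre-processing all raw events of the window in one go.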

Scalable Pipelines: building blocks

● Imagine we had a MapAggregator, for aggregating maps of [String->Double].

● The spec for such an aggregator implemented in Scala on Spark could look like this. :-)

Scalable Pipelines: building blocks

● In Spark we can define our own functions, also known as User Defined Functions (UDFs).
● A UDF takes one or more columns as arguments and returns some output.
● It gets executed for each row of the DataFrame.
● It can also be parameterized.
● e.g. val myUDF = udf((myArg: myType) => ...)
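
A concrete version of the `udf(...)` one-liner above might look as follows, assuming Spark 2.x or later. The column names, the `isFrequent` helper, and the threshold are hypothetical. Wrapping `udf` in a method is one way to get a parameterized UDF: the parameter is captured when the UDF is created, and the function body then runs once per row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  // Parameterized UDF: the threshold is fixed at creation time.
  def isFrequent(threshold: Double): UserDefinedFunction =
    udf((count: Double) => count >= threshold)

  // Applies the UDF to one column; it is evaluated for each row of the DataFrame.
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val df = Seq(("u1", 12.0), ("u2", 3.0)).toDF("userId", "pageViews")
    df.withColumn("frequent", isFrequent(10.0)(col("pageViews")))
  }
}
```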

Scalable Pipelines: UDAF

● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAFs).
● UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value looking at a single input row).
● Examples: calculating the geometric mean or the product of values for every group.
● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data.
● It updates this buffer for every input row.
● Once it has processed all input rows, it generates a result value based on the values of the aggregation buffer.
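
The buffer lifecycle described above can be mimicked in plain Scala for the geometric mean example. This is a sketch of the idea and not Spark's API: the `Buffer` case class and function names are assumptions, and inputs are assumed to be strictly positive. The `merge` step is included because Spark also combines partial buffers computed on different partitions.

```scala
object GeometricMeanSketch {
  // Aggregation buffer: running sum of logarithms and number of rows seen.
  final case class Buffer(logSum: Double, count: Long)

  // Initial (empty) buffer for a group.
  val initial: Buffer = Buffer(0.0, 0L)

  // Called once per input row of the group.
  def update(buffer: Buffer, value: Double): Buffer =
    Buffer(buffer.logSum + math.log(value), buffer.count + 1)

  // Combines two partial buffers of the same group.
  def merge(a: Buffer, b: Buffer): Buffer =
    Buffer(a.logSum + b.logSum, a.count + b.count)

  // Produces the final value once all rows have been processed.
  def evaluate(buffer: Buffer): Double =
    if (buffer.count == 0) Double.NaN else math.exp(buffer.logSum / buffer.count)

  // Groups (group, value) rows and runs the lifecycle for every group.
  def geometricMeanPerGroup(rows: Seq[(String, Double)]): Map[String, Double] =
    rows.groupBy(_._1).map { case (group, groupRows) =>
      group -> evaluate(groupRows.map(_._2).foldLeft(initial)(update))
    }
}
```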

Scalable Pipelines: UDAF

A User Defined Aggregate Function

Implementation of abstract methods
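
The slide's code is not in the transcript, so the following is only a guess at its general shape: a UDAF that merges maps of String -> Double by summing values per key. It uses `UserDefinedAggregateFunction`, the API introduced in Spark 1.5; in Spark 3.x that class is deprecated in favour of `Aggregator`. The class name `MapAggregatorSketch` and the field names are assumptions.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types.{DataType, DoubleType, MapType, StringType, StructField, StructType}

class MapAggregatorSketch extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, DoubleType)

  // One input column and one buffer slot, both maps of String -> Double.
  override def inputSchema: StructType = StructType(StructField("features", mapType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("acc", mapType) :: Nil)
  override def dataType: DataType = mapType
  override def deterministic: Boolean = true

  // The identity element of the monoid: an empty map.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Double]

  // Called for every input row of a group; null maps are skipped.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = combine(buffer.getMap[String, Double](0), input.getMap[String, Double](0))

  // Called to combine partial buffers computed on different partitions.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = combine(buffer1.getMap[String, Double](0), buffer2.getMap[String, Double](0))

  // The final result is the accumulated map itself.
  override def evaluate(buffer: Row): Any = buffer.getMap[String, Double](0)

  // Sums the values of both maps key by key.
  private def combine(a: scala.collection.Map[String, Double],
                      b: scala.collection.Map[String, Double]): Map[String, Double] =
    b.foldLeft(a.toMap) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0.0) + value)
    }
}
```

Under these assumptions it could be applied as `df.groupBy("userId").agg(new MapAggregatorSketch()(col("features")))`.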

Scalable Pipelines: adding a new stage

EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

EventPreprocessor: aggregates events per user and day

EventAggregator (new stage): aggregates pre-processed events per user over multiple days (lookback)

GenderPredictor: creates labels and features, trains classifier & computes predictions

GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

Scalable Pipelines: Aggregating Events

It’s a Transformer

DataFrame in, DataFrame out

Aggregating maps of feature frequency counts
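
As a rough illustration of "DataFrame in, DataFrame out", here is a custom Spark ML `Transformer` that aggregates per-day maps of feature counts into one map per user. It is a sketch under several assumptions: Spark 2.4 or later, hypothetical column names `userId` and `features`, and built-in functions (`explode`, `sum`, `map_from_entries`) swapped in for the UDAF the talk uses, so that the example stays self-contained.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, collect_list, explode, map_from_entries, struct, sum}
import org.apache.spark.sql.types.{DoubleType, MapType, StringType, StructField, StructType}

class EventAggregatorSketch(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("eventAggregatorSketch"))

  // DataFrame in (one row per user and day), DataFrame out (one row per user).
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    dataset
      // Exploding a map column yields the default columns "key" and "value".
      .select(col("userId"), explode(col("features")))
      // Sum the counts of each feature per user across all days.
      .groupBy("userId", "key").agg(sum("value").as("value"))
      // Rebuild a single map of feature -> total count per user.
      .groupBy("userId")
      .agg(map_from_entries(collect_list(struct(col("key"), col("value")))).as("features"))
  }

  // Checks the expected input columns and declares the output schema.
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains("userId"), "missing column: userId")
    require(schema.fieldNames.contains("features"), "missing column: features")
    StructType(Seq(
      StructField("userId", StringType),
      StructField("features", MapType(StringType, DoubleType))
    ))
  }

  override def copy(extra: ParamMap): EventAggregatorSketch = defaultCopy(extra)
}
```

Because it extends `Transformer`, such a class can be placed in the `setStages` array of a Spark ML `Pipeline` alongside the built-in stages.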

Scalable Pipelines: closing remarks

● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!

● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data

● Needless to say, more factors contribute towards a scalable pipeline:
  ○ Performance tuning of the Spark cluster
  ○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration

● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)

Q/A

Thank you!

Shameless plug

We are hiring!

Across all our hubs

in London, Oslo, Stockholm, Barcelona

for Data Science, Engineering, UX and Product roles

https://jobs.lever.co/[email protected]