THE FUTURE IS NOW - files.meetup.comfiles.meetup.com/3183732/Scalable Predictive Pipelines with...

Scalable Predictive Pipelines with Spark and Scala
Dimitris Papadopoulos

About Schibsted

Event Tracking Data

Data Science Tasks

● Data
● Model
● Results
● Preprocessing

Outline

1. Using Spark ML Pipelines
2. Scalable Pipelines

Pipeline

Not a pipe

Pipeline Stage

● One or more inputs
● Strictly one output
● Closed under concatenation
● Standalone and runnable
● Spark™ ML inside

Spark ML Pipelines

Using a Pipeline to train a model

Using a PipelineModel to get predictions
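
The code shown on these two slides is not part of this transcript. As a stand-in, here is a minimal sketch of both steps, modelled on the well-known text-classification example from the Spark documentation: `Pipeline.fit` trains a `PipelineModel`, and `PipelineModel.transform` produces predictions. It assumes Spark 2.x or later (for `SparkSession`); the object name, column names and toy data are made up for illustration and are not the speaker's code.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Toy labelled data (hypothetical).
    val training = Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    ).toDF("id", "text", "label")

    // Two feature Transformers followed by one Estimator.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // Using a Pipeline to train a model.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model: PipelineModel = pipeline.fit(training)

    // Using the PipelineModel to get predictions on unlabelled data.
    val test = Seq((4L, "spark i j k"), (5L, "l m n")).toDF("id", "text")
    model.transform(test).select("id", "text", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```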

Peek inside a Spark pipeline

● It’s a Pipeline
● plain Spark API
● From DataFrame to a Model

Peek inside a Spark pipeline

● Instantiating a Pipeline
● Running it!

Example Pipeline

EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

EventPreprocessor: aggregates events per user

GenderPredictor: creates labels and features, trains classifier & computes predictions

GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

Scalable Pipelines: pain points

Input: 1 day’s / 7 days’ worth of events data. Larger lookbacks are needed for better accuracy.

More data for better performance

Performance of three different pipelines vs. lookback length (1, 7, 30 and 45 days)

Scalable Pipelines: pain points

Memory and processing heavy:
● In one use case, for a 7-day lookback (~7 x 100M events), we used to need 20 Spark executors with 22 GB of memory each.

Not easily scalable:
● As the lookback increases
● As more and more sites are incorporated into our pipelines

Redundant processing:
● For K days of lookback, we are repeating the processing of K - 2 days’ worth of data when we run the pipeline every day, in a rolling-window fashion.

“What will happen if we try to process 30 days’ worth of data (e.g. 3B events)?”

Saved by Algebra

● The operations (op), along with the corresponding data structures (S), that we are interested in are monoids.
  ○ Associative:
    ■ for all A, B, C in S, (A op B) op C = A op (B op C)
  ○ Identity element:
    ■ there exists E in S such that for each A in S, E op A = A op E = A

● Examples:
  ○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)
  ○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
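
The two laws above can be written down as a small Scala type class. This is only an illustrative sketch in plain Scala: the names `Monoid`, `intSum`, `stringsConcat` and `combineAll` are hypothetical and do not come from the talk or from any particular library.

```scala
object MonoidSketch {
  // A monoid: an associative `combine` together with an identity element `empty`.
  trait Monoid[A] {
    def empty: A
    def combine(x: A, y: A): A
  }

  // Summation over integers; the identity element is 0.
  val intSum: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // Concatenation of string sequences; the identity element is the empty sequence.
  val stringsConcat: Monoid[Vector[String]] = new Monoid[Vector[String]] {
    def empty: Vector[String] = Vector.empty
    def combine(x: Vector[String], y: Vector[String]): Vector[String] = x ++ y
  }

  // Because `combine` is associative, the grouping does not matter: chunks can
  // be reduced independently and their partial results merged afterwards.
  def combineAll[A](xs: Seq[A], m: Monoid[A]): A = xs.foldLeft(m.empty)(m.combine)
}
```

For instance, `combineAll(Seq(1, 2, 3, 4), intSum)` and `intSum.combine(combineAll(Seq(1, 2), intSum), combineAll(Seq(3, 4), intSum))` both evaluate to 10, mirroring the summation example.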

Scalable Pipelines: in monoid fashion

● Split the aggregations into smaller chunks
  ○ i.e. pre-process events per user and single day (not over the entire lookback)

● Make one (or multiple) day aggregates and combine
  ○ i.e. aggregate over the pre-processed events per user and day

● It’s like trying to eat an elephant: one piece at a time!
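
To make the idea concrete, here is a toy sketch in plain Scala (no Spark) of the two steps: build one small aggregate per day, then combine the daily aggregates for whatever lookback is needed. The event data, the `UserCounts` type and all function names are hypothetical; a real pipeline would persist the daily aggregates rather than recompute them.

```scala
object RollingWindowSketch {
  type UserCounts = Map[String, Int] // user id -> number of events

  // Hypothetical raw events as (day, user id) pairs.
  val events: Seq[(Int, String)] =
    Seq((1, "u1"), (1, "u2"), (1, "u1"), (2, "u1"), (3, "u2"), (3, "u2"))

  // Step 1: pre-process events per user for a single day.
  def dailyAggregate(day: Int): UserCounts =
    events.collect { case (d, user) if d == day => user }
      .groupBy(identity)
      .map { case (user, hits) => user -> hits.size }

  // Step 2: combine two aggregates (associative, with the empty map as identity).
  def combine(a: UserCounts, b: UserCounts): UserCounts =
    b.foldLeft(a) { case (acc, (user, n)) => acc.updated(user, acc.getOrElse(user, 0) + n) }

  // A lookback over several days is just the combination of its daily aggregates.
  def lookback(days: Seq[Int]): UserCounts =
    days.map(dailyAggregate).foldLeft(Map.empty[String, Int])(combine)
}
```

With this layout, moving the window forward by one day only requires one new daily aggregate; the older ones can be reused as they are.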

Scalable Pipelines: building blocks

● Imagine we had a MapAggregator, for aggregating maps of [String->Double].

● The spec for such an aggregator implemented in Scala on Spark could look like this. :-)
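
The spec from the slide is not reproduced in this transcript. As a rough idea of the behaviour such a spec might pin down, here is a hypothetical plain-Scala stand-in (no Spark, no test framework): values of keys that appear in several maps are summed, all other entries are carried over. The object and method names are assumptions made for this sketch.

```scala
object MapAggregatorSpecSketch {
  // Merge two maps, summing the values of keys that occur in both.
  def merge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Aggregate any number of maps, starting from the empty map.
  def aggregate(maps: Seq[Map[String, Double]]): Map[String, Double] =
    maps.foldLeft(Map.empty[String, Double])(merge)

  // Spec-like checks, written as plain assertions.
  def spec(): Unit = {
    val maps = Seq(Map("a" -> 1.0, "b" -> 2.0), Map("a" -> 3.0), Map("c" -> 0.5))
    assert(aggregate(maps) == Map("a" -> 4.0, "b" -> 2.0, "c" -> 0.5))
    assert(aggregate(Seq.empty) == Map.empty[String, Double])
  }
}
```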

Scalable Pipelines: building blocks

● In Spark we can define our own functions, also known as User Defined Functions (UDFs).
● A UDF takes one or more columns as arguments and returns some output.
● It gets executed for each row of the DataFrame.
● It can also be parameterized.
● e.g. val myUDF = udf((myArg: myType) => ...)
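
A small, runnable illustration of these points, assuming Spark 2.x or later: `atLeast` is a parameterized UDF (the threshold is fixed when the UDF is created), and it is applied to one column and evaluated once per row. The object, function and column names are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  // Parameterized UDF: returns true when a row's value reaches the threshold.
  def atLeast(threshold: Int): UserDefinedFunction =
    udf((pageViews: Int) => pageViews >= threshold)

  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val users = Seq(("u1", 3), ("u2", 42)).toDF("user", "pageViews")
    // The UDF takes a column as its argument and adds one output value per row.
    users.withColumn("isActive", atLeast(10)(col("pageViews")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
    run(spark).show()
    spark.stop()
  }
}
```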

Scalable Pipelines: UDAF

● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAFs).
● UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value by looking at a single input row).
● Examples: calculating the geometric mean, or calculating the product of values, for every group.
● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data.
● It updates this buffer for every input row.
● Once it has processed all input rows, it generates a result value based on the values of the aggregation buffer.

Scalable Pipelines: UDAF

● A User Defined Aggregate Function
● Implementation of abstract methods
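
The slide's code is not included in this transcript. The sketch below shows what implementing the abstract methods of Spark's `UserDefinedAggregateFunction` can look like for a map-summing aggregator. The class name `MapAggregator` echoes the talk, but the body, the `combine` helper, the demo object and its data are assumptions. Note that `UserDefinedAggregateFunction` has been deprecated since Spark 3.0 in favour of `Aggregator`; the demo assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType, MapType, StringType, StructField, StructType}

// Hypothetical UDAF that sums maps of [String -> Double] key by key.
class MapAggregator extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, DoubleType, valueContainsNull = false)

  override def inputSchema: StructType = StructType(StructField("features", mapType) :: Nil)
  override def bufferSchema: StructType = StructType(StructField("acc", mapType) :: Nil)
  override def dataType: DataType = mapType
  override def deterministic: Boolean = true

  // The buffer starts as the identity element: the empty map.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Double]

  // Called once per input row of a group.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = combine(buffer.getMap[String, Double](0), input.getMap[String, Double](0))

  // Called to merge partial buffers computed on different partitions.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = combine(buffer1.getMap[String, Double](0), buffer2.getMap[String, Double](0))

  override def evaluate(buffer: Row): Any = buffer.getMap[String, Double](0)

  private def combine(a: scala.collection.Map[String, Double],
                      b: scala.collection.Map[String, Double]): Map[String, Double] =
    b.foldLeft(a.toMap) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }
}

object MapAggregatorDemo {
  def run(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Hypothetical per-day feature counts, one row per user and day.
    val daily = Seq(
      ("u1", Map("sports" -> 2.0)),
      ("u1", Map("sports" -> 1.0, "cars" -> 4.0)),
      ("u2", Map("news" -> 5.0))
    ).toDF("user", "features")
    val aggregate = new MapAggregator
    daily.groupBy("user").agg(aggregate(col("features")).as("features"))
  }
}
```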

Scalable Pipelines: adding a new stage

EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

EventPreprocessor: aggregates events per user and day

GenderPredictor: creates labels and features, trains classifier & computes predictions

GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

EventAggregator: aggregates pre-processed events per user over multiple days (lookback)

Scalable Pipelines: Aggregating Events

● It’s a Transformer
● DataFrame in, DataFrame out
● Aggregating maps of feature frequency counts
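
The EventAggregator code itself is not in this transcript. To show the "DataFrame in, DataFrame out" shape of a custom Spark ML `Transformer`, here is a deliberately simplified sketch for Spark 2.x or later (where `transform` takes a `Dataset[_]`). Instead of aggregating map columns, it sums a flat `count` column per user and feature with the built-in `sum`; the class name, the column names and the assumption that `count` is a `Double` are all hypothetical.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical stage: collapses per-day rows into one row per user and feature.
class EventAggregatorSketch(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("eventAggregatorSketch"))

  // In: (user, day, feature, count). Out: (user, feature, count).
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema) // fail early if a required column is missing
    dataset.groupBy("user", "feature").agg(sum("count").as("count"))
  }

  // Describes the output schema; assumes the input `count` column is a Double.
  override def transformSchema(schema: StructType): StructType = {
    val required = Seq("user", "feature", "count")
    require(required.forall(c => schema.fieldNames.contains(c)),
      s"Input must contain the columns ${required.mkString(", ")}")
    StructType(Seq(
      StructField("user", StringType),
      StructField("feature", StringType),
      StructField("count", DoubleType)))
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```

Because it extends `Transformer`, such a stage can be placed in the `setStages` array of a `Pipeline` alongside built-in feature transformers and estimators.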

Scalable Pipelines: closing remarks

● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!

● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data

● Needless to say, more factors contribute towards a scalable pipeline:
  ○ Performance tuning of the Spark cluster
  ○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration

● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)

Q/A

Thank you!

Shameless plug

We are hiring!

Across all our hubs

in London, Oslo, Stockholm, Barcelona

for Data Science, Engineering, UX and Product roles

https://jobs.lever.co/[email protected]