Scalable predictive pipelines with Spark and Scala

  • THE FUTURE IS NOW

  • Scalable Predictive Pipelines with Spark and Scala

    Dimitris Papadopoulos

  • 3

    About Schibsted

  • 6

    Event Tracking Data

  • 10

    Data Science Tasks

    (Diagram labels: Data, Model, Results, Preprocessing)

  • 11

    Outline

    1. Using Spark ML Pipelines

    2. Scalable Pipelines

  • 12

    Pipeline

  • 15

    Not a pipe

  • 18

    Pipeline Stage

    One or more inputs

    Strictly one output

    Closed under concatenation

    Standalone and runnable

    Spark ML inside
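
The "closed under concatenation" property can be sketched in plain Scala. This is only an illustrative model, not the speaker's code: the `Stage` class, the `andThen` method and the toy stage bodies are all assumptions, and a stage with several inputs would simply take a tuple as its input type.

```scala
// Hypothetical model of a pipeline stage: one input value in, exactly one output out.
final case class Stage[A, B](name: String, run: A => B) {
  // Concatenating two stages yields another Stage, so stages are closed under concatenation.
  def andThen[C](next: Stage[B, C]): Stage[A, C] =
    Stage[A, C](s"$name -> ${next.name}", (a: A) => next.run(run(a)))
}

object StageSketch {
  // Toy stages: the names echo the talk, the logic is made up for illustration.
  val coalesce: Stage[Seq[String], Seq[String]] =
    Stage("EventCoalescer", (events: Seq[String]) => events.filter(_.nonEmpty))

  val countPerUser: Stage[Seq[String], Map[String, Int]] =
    Stage("EventPreprocessor", (users: Seq[String]) =>
      users.groupBy(identity).map { case (user, hits) => user -> hits.size })

  // The concatenation is itself a standalone, runnable stage.
  val pipeline: Stage[Seq[String], Map[String, Int]] = coalesce.andThen(countPerUser)
}
```

Because the result of `andThen` is again a `Stage`, a whole pipeline can be run, tested or scheduled exactly like a single stage.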

  • 20

    Spark ML Pipelines

    Using a Pipeline to train a model

  • 21

    Spark ML Pipelines

    Using a PipelineModel to get predictions
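
The two uses above (fitting a Pipeline, then predicting with the resulting PipelineModel) can be sketched with the standard Spark ML API. The slides' own code did not survive in this transcript, so the stages, column names and toy data below are assumptions, and the sketch assumes Spark 2.x or later (`SparkSession`) with `spark-mllib` on the classpath.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  def run(): DataFrame = {
    val spark = SparkSession.builder().master("local[1]").appName("pipeline-sketch").getOrCreate()
    import spark.implicits._

    // Toy labelled training data, made up for illustration.
    val training = Seq(
      (0L, "spark scala pipeline", 1.0),
      (1L, "cooking recipes", 0.0),
      (2L, "spark ml", 1.0),
      (3L, "gardening tips", 0.0)
    ).toDF("id", "text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    // A Pipeline is an Estimator: fit() trains the stages in order and returns a PipelineModel.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model: PipelineModel = pipeline.fit(training)

    // A PipelineModel is a Transformer: transform() appends the prediction columns.
    val test = Seq((4L, "spark pipeline"), (5L, "tips")).toDF("id", "text")
    model.transform(test)
  }
}
```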

  • 24

    Peek inside a Spark pipeline

    It's a Pipeline

    plain Spark API

    From DataFrame to a Model

  • 25

    Peek inside a Spark pipeline

    Instantiating a Pipeline

    Running it!

  • 27

    Example Pipeline

    EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

    UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

    EventPreprocessor: aggregates events per user

    GenderPredictor: creates labels and features, trains classifier & computes predictions

    GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

  • 29

    Scalable Pipelines: pain points

    Input: 1 day / 7 days' worth of events data. Larger lookbacks are needed for better accuracy.

  • 30

    More data for better performance

    Performance of three different pipelines vs. lookback length (1, 7, 30, 45 days)

  • 31

    Scalable Pipelines: pain points

    What will happen if we try to process 30 days' worth of data (e.g. 3B events)?

  • 32

    Scalable Pipelines: pain points

    Memory and processing heavy: in one use case, for a 7-day lookback (~7 x 100M events), we used to need 20 Spark executors with 22 GB of memory each.

    Not easily scalable: as the lookback increases, and as more and more sites are incorporated into our pipelines.

    Redundant processing: for K days of lookback, we repeat the processing of K - 2 days' worth of data when we run the pipeline every day in a rolling-window fashion.

  • 33

    Saved by Algebra

    The operations (op), along with the corresponding data structures (S) that we are interested in, are monoids.

    Associative: for all A, B, C in S, (A op B) op C = A op (B op C)

    Identity element: there exists E in S such that for each A in S, E op A = A op E = A

    Examples:

    Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)

    String array concatenation: [foo] + [bar] + [baz] = [foo, bar] + [baz]
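
The two laws can be written down as a small Scala abstraction. This is a minimal sketch, not taken from the talk: the `Monoid` trait and the `intSum`, `listConcat` and `combineAll` names are assumptions (libraries such as Cats or Algebird ship their own versions).

```scala
// A minimal monoid: an associative `combine` plus an identity element `empty`.
trait Monoid[A] {
  def empty: A
  def combine(x: A, y: A): A
}

object MonoidSketch {
  // Summation: identity 0, operation +.
  val intSum: Monoid[Int] = new Monoid[Int] {
    def empty: Int = 0
    def combine(x: Int, y: Int): Int = x + y
  }

  // List concatenation: identity Nil, operation ++.
  def listConcat[T]: Monoid[List[T]] = new Monoid[List[T]] {
    def empty: List[T] = Nil
    def combine(x: List[T], y: List[T]): List[T] = x ++ y
  }

  // Folds a sequence with the monoid; associativity means the same result
  // is obtained if the sequence is first split into chunks and the chunk
  // results are combined afterwards.
  def combineAll[A](xs: Seq[A])(m: Monoid[A]): A = xs.foldLeft(m.empty)(m.combine)
}
```

Associativity is what lets the chunks be computed separately (for example, one per day) and combined later, and the identity element gives a safe starting value for an empty chunk.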

  • 36

    Scalable Pipelines: in monoids fashion

    Split the aggregations into smaller chunks, i.e. pre-process events per user and single day (not over the entire lookback)

    Make one (or multiple) day aggregates and combine, i.e. aggregate over the pre-processed events per user and day

    It's like trying to... eat an elephant: one piece at a time!
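
The idea can be checked on toy data in plain Scala: aggregating each day separately and then combining the daily aggregates gives the same result as aggregating the whole lookback at once. The object, function names and events below are made up for illustration, and per-user event counts stand in for the real features.

```scala
object DailyAggregates {
  type Counts = Map[String, Int] // user -> number of events

  // Pre-process a single day's events (here just user ids) into a small aggregate.
  def aggregateDay(userIds: Seq[String]): Counts =
    userIds.groupBy(identity).map { case (user, hits) => user -> hits.size }

  // Combine any number of daily aggregates into one lookback aggregate.
  def combine(days: Seq[Counts]): Counts =
    days.flatten.groupBy(_._1).map { case (user, kvs) => user -> kvs.map(_._2).sum }

  // Made-up events for three days.
  val day1: Seq[String] = Seq("u1", "u2", "u1")
  val day2: Seq[String] = Seq("u2")
  val day3: Seq[String] = Seq("u1", "u3")
}
```

With the daily aggregates stored, a rolling window only has to pre-process the newest day and then re-combine small per-day results, instead of re-reading all raw events in the lookback.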

  • 39

    Scalable Pipelines: building blocks

    Imagine we had a MapAggregator, for aggregating maps of [String->Double].

    The spec for such an aggregator implemented in Scala on Spark could look like this. :-)
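
The spec shown on the slide is not part of this transcript. As a stand-in, here is a hypothetical sketch of what such a spec might check, written as plain assertions against a pure-Scala `merge` function (both names are assumptions, and no test framework or Spark is involved).

```scala
object MapAggregatorSketch {
  // Hypothetical core of a MapAggregator: sum the values key by key.
  def merge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Spec-style checks on made-up feature counts.
  def spec(): Unit = {
    val x = Map("sports" -> 1.0, "news" -> 2.0)
    val y = Map("news" -> 3.0, "cars" -> 4.0)
    val z = Map("cars" -> 0.5)
    val none = Map.empty[String, Double]

    // Values of keys present in both maps are summed; other keys are kept.
    assert(merge(x, y) == Map("sports" -> 1.0, "news" -> 5.0, "cars" -> 4.0))
    // The empty map is the identity element.
    assert(merge(x, none) == x && merge(none, x) == x)
    // The grouping of merges does not matter for these inputs.
    assert(merge(merge(x, y), z) == merge(x, merge(y, z)))
  }
}
```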

  • 40

    Scalable Pipelines: building blocks

    In Spark we can define our own functions, also known as User Defined Functions (UDF)

    A UDF takes as arguments one or more columns, and returns some output.

    It gets executed for each row of the DataFrame. It can also be parameterized.

    e.g. val myUDF = udf((myArg: myType) => ...)
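
A small runnable sketch of a parameterized UDF, assuming Spark 2.x or later (`SparkSession`). The `longerThan` helper, the column names and the data are hypothetical; only `udf`, `col` and `withColumn` are Spark's own API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object UdfSketch {
  // A parameterized UDF: the threshold is captured when the UDF is built.
  def longerThan(minLength: Int): UserDefinedFunction =
    udf((s: String) => s != null && s.length > minLength)

  def run(): Seq[(String, Boolean)] = {
    val spark = SparkSession.builder().master("local[1]").appName("udf-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq("foo", "foobar").toDF("word")

    // The UDF is evaluated once per row, on the `word` column.
    df.withColumn("is_long", longerThan(3)(col("word")))
      .as[(String, Boolean)]
      .collect()
      .toSeq
  }
}
```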

  • 41

    Scalable Pipelines: UDAF

    Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAF).

    UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value looking at a single input row)

    Examples: calculating geometric mean or calculating the product of values for every group.

    A UDAF maintains an aggregation buffer to store intermediate results for every group of input data.

    It updates this buffer for every input row. Once it has processed all input rows, it generates a result value based on values of the aggregation buffer.

  • 42

    Scalable Pipelines: UDAF

    A User Defined Aggregate Function

    Implementation of abstract methods
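
The code on this slide is not in the transcript. The following is a hedged sketch of what a map-summing UDAF could look like with the `UserDefinedAggregateFunction` API (introduced in Spark 1.5 and deprecated since Spark 3.0 in favour of `Aggregator`). The class name `MapSumUdaf`, the column names and the data are assumptions, and `SparkSession` assumes Spark 2.x or later.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DoubleType, MapType, StringType, StructField, StructType}

// Hypothetical UDAF that sums maps of String -> Double key by key.
class MapSumUdaf extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, DoubleType)

  def inputSchema: StructType = StructType(StructField("counts", mapType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("acc", mapType) :: Nil)
  def dataType: DataType = mapType
  def deterministic: Boolean = true

  // Every group starts from the identity element: the empty map.
  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Double]

  // Fold one input row into the aggregation buffer.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = combine(buffer.getMap[String, Double](0), input.getMap[String, Double](0))

  // Merge two partial buffers, e.g. from different partitions.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = combine(buffer1.getMap[String, Double](0), buffer2.getMap[String, Double](0))

  // The final result is the accumulated map itself.
  def evaluate(buffer: Row): Any = buffer.getMap[String, Double](0)

  private def combine(a: scala.collection.Map[String, Double],
                      b: scala.collection.Map[String, Double]): Map[String, Double] =
    b.foldLeft(a.toMap) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }
}

object MapSumDemo {
  def run(): Map[String, Map[String, Double]] = {
    val spark = SparkSession.builder().master("local[1]").appName("udaf-sketch").getOrCreate()
    import spark.implicits._

    // Made-up per-user, per-day feature counts.
    val daily = Seq(
      ("u1", Map("sports" -> 1.0)),
      ("u1", Map("sports" -> 2.0, "news" -> 1.0)),
      ("u2", Map("cars" -> 2.0))
    ).toDF("user", "counts")

    val mapSum = new MapSumUdaf
    daily.groupBy("user").agg(mapSum(col("counts")).as("total"))
      .collect()
      .map(r => r.getString(0) -> r.getMap[String, Double](1).toMap)
      .toMap
  }
}
```

The `initialize`, `update`, `merge` and `evaluate` methods correspond to the identity element, the per-row fold, the combination of partial results and the final read-out of the aggregation buffer.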

  • 44

    Scalable Pipelines: adding a new stage

    EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

    UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

    EventPreprocessor: aggregates events per user and day

    GenderPredictor: creates labels and features, trains classifier & computes predictions

    GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

    EventAggregator: aggregates pre-processed events per user over multiple days (lookback)

  • 48

    Scalable Pipelines: Aggregating Events

    It's a Transformer

    DataFrame in, DataFrame out

    Aggregating maps of feature frequency counts

  • 49

    Scalable Pipelines: closing remarks

    With User Defined Aggregate Functions, we have reduced the