Scalable predictive pipelines with Spark and Scala

  • THE FUTURE IS NOW

  • Scalable Predictive Pipelines with Spark and Scala

    Based on a presentation from Dimitris Papadopoulos

    Tom van der Weide

  • 3

    About Schibsted

  • 4

    About Schibsted

  • 5

    About Schibsted

  • 6

    Event Tracking Data

  • 7

    Event Tracking Data

  • 8

    Event Tracking Data

  • 9

    Event Tracking Data

  • 10

    Data Science Tasks

    Data

    Model

    Results

    Preprocessing

  • 11

    Outline

    1. Using Spark ML Pipelines

    2. Scalable Pipelines

  • 12

    Pipeline

  • 13

    Pipeline

  • 14

    Pipeline

  • 15

    Pipeline

    Check out: "Machine Learning: The High-Interest Credit Card of Technical Debt", by Google employees

  • 16

    Not a pipe

  • 17

    Pipeline Stage

    One or more inputs

    Strictly one output

  • 18

    Pipeline Stage

    Closed under concatenation

    One or more inputs

    Strictly one output

  • 19

    Pipeline Stage

    Closed under concatenation

    Standalone and runnable

    Spark ML inside

    One or more inputs

    Strictly one output
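
The "closed under concatenation" property can be sketched in plain Scala. This is only an illustration under assumptions: the `Stage` type, its `andThen` method and the toy stages are hypothetical names, not taken from the talk, and several inputs are modelled as a single input type (which may be a tuple).

```scala
object StageSketch {
  // Hypothetical stage type: a named function from an input to exactly one output.
  final case class Stage[A, B](name: String, run: A => B) {
    // Concatenating two stages yields another Stage, which is what
    // "closed under concatenation" means here.
    def andThen[C](next: Stage[B, C]): Stage[A, C] =
      Stage(s"$name > ${next.name}", run.andThen(next.run))
  }

  // Toy stages over plain collections; real stages would read and write DataFrames.
  val preprocess: Stage[Seq[String], Seq[String]] =
    Stage("preprocess", _.map(_.trim.toLowerCase).filter(_.nonEmpty))

  val featurize: Stage[Seq[String], Map[String, Int]] =
    Stage("featurize", _.groupBy(identity).map { case (k, v) => k -> v.size })

  // The concatenation is itself a Stage and can be run on its own.
  val pipeline: Stage[Seq[String], Map[String, Int]] = preprocess.andThen(featurize)
}
```

Because the result of `andThen` is again a `Stage`, a whole pipeline can be scheduled, tested and reused exactly like a single stage.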

  • 20

    Spark ML Pipelines

  • 21

    Spark ML Pipelines

    Using a Pipeline to train a model

  • 22

    Spark ML Pipelines

    Using a PipelineModel to get predictions
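
The code shown on these two slides is not part of the transcript. As a rough sketch of the usual Spark ML pattern (fit a `Pipeline` to get a `PipelineModel`, then call `transform` for predictions), assuming Spark 2.0 or later for `SparkSession`; the column names, the toy rows and the choice of stages are made up for illustration:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.{DataFrame, SparkSession}

object MlPipelineSketch {
  // Using a Pipeline to train a model: fitting returns a PipelineModel.
  def train(training: DataFrame): PipelineModel = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)
  }

  // Using a PipelineModel to get predictions: the same stages are applied in order.
  def predict(model: PipelineModel, unlabelled: DataFrame): DataFrame =
    model.transform(unlabelled).select("id", "prediction")

  // Made-up rows, only to exercise the two functions above.
  def demo(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val training = Seq(
      (0L, "a b c d e spark", 1.0),
      (1L, "b d", 0.0),
      (2L, "spark f g h", 1.0),
      (3L, "hadoop mapreduce", 0.0)
    ).toDF("id", "text", "label")
    val unlabelled = Seq((4L, "spark i j k"), (5L, "l m n")).toDF("id", "text")
    predict(train(training), unlabelled)
  }
}
```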

  • 23

    Example: Gender Prediction Pipeline

    Featurizer: generates features from events

    Gender Predictor

  • 24

    Example: Gender Prediction Pipeline

    Preprocessor: cleans up events

    Featurizer: generates features from events

    Aggregation: aggregates features per user

    Gender Predictor

  • 25

    Example: Gender Prediction Pipeline

    Preprocessor: cleans up events

    Featurizer: generates features from events

    Aggregation: aggregates features per user

    User Account Preprocessor: harmonises schemas of user accounts across sites. Provides ground truth data for training.

    Gender Predictor: joins features with users, trains classifier & computes predictions

  • 26

    Example: Gender Prediction Pipeline

    Preprocessor: cleans up events

    Featurizer: generates features from events

    Aggregation: aggregates features per user

    User Account Preprocessor: harmonises schemas of user accounts across sites. Provides ground truth data for training.

    Gender Predictor: joins features with users, trains classifier & computes predictions

    Performance Evaluator: computes performance metrics, e.g. accuracy and area under ROC

  • 27

    Scalability pain points

    30 days of input data, for better accuracy

    Preprocessor: cleans up events

    Featurizer: generates features from events

    Aggregation: aggregates features per user

    User Account Preprocessor: harmonises schemas of user accounts across sites. Provides ground truth data for training.

    Gender Predictor: joins features with users, trains classifier & computes predictions

    Performance Evaluator: computes performance metrics, e.g. accuracy and area under ROC

  • 28

    More data for better performance

  • 29

    Redundant processing

    [Figure: input data per day, 25/05 to 01/06, and the daily models built from it (model for 28/05, 29/05, 30/05, 31/05, ...).]

  • 30

    Scalable Pipelines: pain points

    Memory and processing heavy: in one use case, a 7-day lookback (~7 x 200M events) used to need 20 Spark executors with 22G of memory each.

    Not easily scalable: as the lookback increases, and as more and more sites are incorporated into our pipelines.

    Redundant processing: for K days of lookback, we repeat the processing of K-1 days' worth of data when we run the pipeline every day in a rolling-window fashion.

    What will happen if we try to process 30 days' worth of data (e.g. 6B events)?

  • 31

    Saved by Algebra

    The operations (op), along with the corresponding data structures (S), that we are interested in are monoids.

    Associative: for all A, B, C in S, (A op B) op C = A op (B op C)

    Identity element: there exists E in S such that for each A in S, E op A = A op E = A

    Examples:

    Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4) = 1 + ((2 + 3) + 4)

    String array concatenation: [foo] + [bar] + [baz] = [foo, bar] + [baz]
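
The two laws and both examples can be written down directly in Scala. A minimal sketch: the `Monoid` trait and the names `empty`, `combine` and `combineAll` are illustrative choices rather than code from the talk, and a `Vector[String]` stands in for the string array.

```scala
object MonoidSketch {
  // A monoid on S: an associative binary operation plus an identity element.
  trait Monoid[S] {
    def empty: S               // E, with E op A = A op E = A
    def combine(a: S, b: S): S // op, with (A op B) op C = A op (B op C)
  }

  // Summation: identity 0, operation +.
  val intSum: Monoid[Int] = new Monoid[Int] {
    val empty: Int = 0
    def combine(a: Int, b: Int): Int = a + b
  }

  // String array concatenation: identity is the empty sequence, operation is ++.
  val stringArrayConcat: Monoid[Vector[String]] = new Monoid[Vector[String]] {
    val empty: Vector[String] = Vector.empty
    def combine(a: Vector[String], b: Vector[String]): Vector[String] = a ++ b
  }

  // Associativity means the grouping of the partial results does not matter,
  // so a left fold gives the same answer as any other bracketing.
  def combineAll[S](xs: Seq[S])(m: Monoid[S]): S = xs.foldLeft(m.empty)(m.combine)
}
```

Associativity is what allows partial results to be computed separately, for example per day or per executor, and combined later in any grouping.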

  • 32

    Scalable Pipelines: in monoid fashion

    Split work into smaller chunks, cache intermediate results

    Split the aggregations into smaller chunks

    Persist aggregations per day

    Combine partial aggregations into final feature vectors
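
The idea above can be sketched without Spark: each day's aggregate is computed once and persisted, and a rolling window is just the monoid combination of the persisted days. In this hypothetical sketch an in-memory `Map` keyed by date stands in for the persisted per-day aggregations, and `window`, `empty` and `combine` are illustrative names.

```scala
import java.time.LocalDate

object RollingWindowSketch {
  // Combine the persisted per-day aggregates of the `lookback` days ending at `end`.
  // `empty` and `combine` are the monoid's identity element and operation;
  // a day with no persisted aggregate contributes `empty`.
  def window[S](daily: Map[LocalDate, S], end: LocalDate, lookback: Int)(
      empty: S, combine: (S, S) => S): S =
    (lookback - 1 to 0 by -1) // oldest day first, in case the operation is not commutative
      .map(daysBack => daily.getOrElse(end.minusDays(daysBack.toLong), empty))
      .foldLeft(empty)(combine)
}
```

With this layout, moving the window forward by one day only requires computing and persisting one new daily aggregate; the other K-1 days are read back rather than reprocessed.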

  • 33

    Scalable Pipelines: building blocks

    User Defined Functions (UDF): compute a value looking at a single row

    e.g. SELECT concat(user, user_id) FROM events

    User Defined Aggregate Functions (UDAF), since Spark 1.5: computation over groups of rows

    e.g. SELECT user_id, AVG(time_spent) FROM events GROUP BY user_id

  • 34

    Example: map with term frequencies

    For some user, we already aggregated events per day:

    Monoid operation: MapAggregator, i.e. m1 op m2 op m3 = (m1 op m2) op m3 = m1 op (m2 op m3), with identity m0 = Map()

    Result:
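
The slide's own maps are not reproduced here. As an illustration with made-up per-day term frequencies, merging maps by summing the counts of shared terms could look like the sketch below; `merge` is a hypothetical plain-Scala stand-in for the MapAggregator operation, not the talk's implementation.

```scala
object TermFrequencySketch {
  type Freqs = Map[String, Long]

  // Identity element, m0 = Map().
  val empty: Freqs = Map.empty

  // Monoid operation: sum the counts of terms present in both maps
  // and keep terms that appear in only one of them.
  def merge(a: Freqs, b: Freqs): Freqs =
    b.foldLeft(a) { case (acc, (term, n)) => acc.updated(term, acc.getOrElse(term, 0L) + n) }

  // Made-up per-day aggregates for one user.
  val day1: Freqs = Map("cars" -> 2L, "boats" -> 1L)
  val day2: Freqs = Map("cars" -> 1L)
  val day3: Freqs = Map("boats" -> 3L, "bikes" -> 1L)

  // total == Map("cars" -> 3, "boats" -> 4, "bikes" -> 1)
  val total: Freqs = Seq(day1, day2, day3).foldLeft(empty)(merge)
}
```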

  • 35

    Buffer for intermediate results

    Get content from buffer

    Merge buffers of two executors
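
The three annotations above presumably label parts of a UDAF implementation whose code is not in the transcript. The following is a sketch only, assuming the `UserDefinedAggregateFunction` API that Spark introduced in 1.5 (deprecated since Spark 3.0 in favour of the typed `Aggregator`); the class body, field names and the `combine` helper are hypothetical.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Hypothetical UDAF that merges maps of term -> count by summing counts per term.
class MapAggregator extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, LongType)

  override def inputSchema: StructType = StructType(StructField("freqs", mapType) :: Nil)

  // Buffer for intermediate results.
  override def bufferSchema: StructType = StructType(StructField("acc", mapType) :: Nil)

  override def dataType: DataType = mapType
  override def deterministic: Boolean = true

  // Start from the monoid's identity element, an empty map.
  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Long]

  // Fold one input row into the buffer; null maps are skipped.
  override def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = combine(buffer.getMap[String, Long](0), input.getMap[String, Long](0))

  // Merge buffers of two executors.
  override def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = combine(buffer1.getMap[String, Long](0), buffer2.getMap[String, Long](0))

  // Get content from buffer.
  override def evaluate(buffer: Row): Any = buffer.getMap[String, Long](0)

  // The monoid operation itself: sum counts key by key.
  private def combine(
      a: scala.collection.Map[String, Long],
      b: scala.collection.Map[String, Long]): Map[String, Long] =
    b.foldLeft(a.toMap) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0L) + v) }
}
```

An instance can then be applied like any aggregate function, for example `df.groupBy("user_id").agg(new MapAggregator()(col("freqs")))`, where `user_id` and `freqs` are assumed column names.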

  • 36

    Aggregation pipeline stage

    Preprocessor: cleans up events

    Featurizer: generates features from events

    Aggregation: aggregates features per user

    User Account Preprocessor: harmonises schemas of user accounts across sites. Provides ground truth data for training.

    Gender Predictor: joins features with users, trains classifier & computes predictions

    Performance Evaluator: computes performance metrics, e.g. accuracy and area under ROC

  • 37

    Scalable Pipelines: Aggregating Events

  • 38

    Scalable Pipelines: Aggregating Events

    It's a Transformer

  • 39

    Scalable Pipelines: Aggregating Events

    It's a Transformer

    DataFrame in, DataFrame out

  • 40

    Scalable Pipelines: Aggregating Events

    It's a Transformer

    DataFrame in, DataFrame out

    Aggregating maps of feature frequency counts
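
The Transformer code on these slides is not in the transcript. To show the general shape of a custom Spark ML `Transformer`, here is a simplified sketch that sums a numeric per-day column per user instead of merging maps. It assumes Spark 2.0 or later, where `transform` takes a `Dataset[_]`; the class name and the `user_id` and `count` columns are hypothetical.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.sum
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical stage: collapses per-day rows (user_id, count) into one row per user.
class PerUserAggregator(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("perUserAggregator"))

  // DataFrame in, DataFrame out.
  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema) // fail early if the expected columns are missing
    dataset.groupBy("user_id").agg(sum("count").as("count"))
  }

  // Declares the output columns so a Pipeline can validate its stages up front.
  override def transformSchema(schema: StructType): StructType = {
    require(schema.fieldNames.contains("user_id"), "missing column: user_id")
    require(schema.fieldNames.contains("count"), "missing column: count")
    StructType(Seq(schema("user_id"), StructField("count", LongType)))
  }

  override def copy(extra: ParamMap): PerUserAggregator = defaultCopy(extra)
}
```

Since it extends `Transformer`, such a stage can be placed in a `Pipeline` next to built-in Spark ML stages.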

  • 41

    Scalable Pipelines: closing remarks

    With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!

  • 42

    Scalable Pipelines: closing remarks

    With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!

    Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data

  • 43

    Scalable Pipelines: closing remarks

    Needless to say, more factors contribute towards a scalable pipeline:

    Performance tuning of the Spark cluster

    Use of a workflow manager (e.g. Luigi) for pipeline orchestration

  • 44

    Scalable Pipelines: closing remarks

    Needless to say, more factors contribute towards a scalable pipeline:

    Performance tuning of the Spark cluster

    Use of a workflow manager (e.g. Luigi) for pipeline orchestration

    But each one of these is a topic for a separate talk

  • 45

    Shameless plug

    We are hiring!

    Across all our hubs

    in London, Oslo, Stockholm, Barcelona

    for Data Science, Engineering, UX and Product roles

    https://jobs.lever.co/schibsted

    spt-recruiters@schibsted.com

  • 46

    Q/A

    Thank you!