Transcript of “THE FUTURE IS NOW” (files.meetup.com/3183732, posted 16-Oct-2020)

THE FUTURE IS NOW

Scalable Predictive Pipelines with Spark and Scala

Dimitris Papadopoulos

About Schibsted

Event Tracking Data

Data Science Tasks

Data → Preprocessing → Model → Results

Outline

1. Using Spark ML Pipelines
2. Scalable Pipelines

Pipeline

Not a pipe

Pipeline Stage

● One or more inputs
● Strictly one output
● Closed under concatenation
● Standalone and runnable
● Spark™ ML inside

Spark ML Pipelines

Using a Pipeline to train a model

Using a PipelineModel to get predictions
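A minimal sketch of these two slides, in Scala on Spark ML. The stage choices and column names (“text”, “label”) are illustrative, not from the talk; `trainingDF` and `testDF` are assumed DataFrames:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// A Pipeline chains stages: each one's output column feeds the next.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Using a Pipeline to train a model: fit() returns a PipelineModel.
val model = pipeline.fit(trainingDF)

// Using a PipelineModel to get predictions: transform() appends prediction columns.
val predictions = model.transform(testDF)
```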

Peek inside a Spark pipeline

It’s a Pipeline: plain Spark API, from a DataFrame to a Model.

Instantiating a Pipeline, and running it!

Example Pipeline

EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

EventPreprocessor: aggregates events per user

GenderPredictor: creates labels and features, trains a classifier & computes predictions

GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

Scalable Pipelines: pain points

Input: 1 day’s / 7 days’ worth of events data. Larger lookbacks needed for better accuracy.

More data for better performance

[Figure: performance of three different pipelines vs lookback length (1, 7, 30, 45 days)]

Scalable Pipelines: pain points

What will happen if we try to process 30 days’ worth of data (e.g. 3B events)???

Scalable Pipelines: pain points

Memory and processing heavy:
● In one use-case, for a 7-day lookback (~7 x 100M events), we used to need 20 Spark executors with 22G of memory each.

Not easily scalable:
● As the lookback increases
● As more and more sites are incorporated into our pipelines

Redundant processing:
● For K days of lookback, running the pipeline every day in a rolling-window fashion, we repeat the processing of K - 2 days’ worth of data.

“What will happen if we try to process 30 days’ worth of data (e.g. 3B events)???”

Saved by Algebra

● The operations (op), along with the corresponding data structures (S) that we are interested in, are monoids.
  ○ Associative: for all A, B, C in S, (A op B) op C = A op (B op C)
  ○ Identity element: there exists E in S such that for all A in S, E op A = A op E = A
● Examples:
  ○ Summation: 1 + 2 + 3 + 4 = (1 + 2) + (3 + 4)
  ○ String array concatenation: [“foo”] + [“bar”] + [“baz”] = [“foo”, “bar”] + [“baz”]
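These laws can be checked directly in plain Scala, no Spark needed. Here S is Map[String, Double] with key-wise summation — the same structure the event aggregates use later in the talk:

```scala
// Monoid check: S = Map[String, Double], op = key-wise sum, identity = empty map.
def op(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
  b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

val e = Map.empty[String, Double]
val a = Map("news" -> 1.0)
val b = Map("news" -> 2.0, "sports" -> 1.0)
val c = Map("sports" -> 3.0)

op(op(a, b), c) == op(a, op(b, c))   // associativity holds: true
op(e, a) == a && op(a, e) == a       // identity holds:      true
```

Because the operation is associative, Spark is free to combine per-day partial aggregates in any grouping — which is exactly what makes the split-and-combine approach below safe.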

Scalable Pipelines: in monoids fashion

● Split the aggregations into smaller chunks
  ○ i.e. pre-process events per user and single day (not over the entire lookback)
● Make one (or multiple) day aggregates and combine
  ○ i.e. aggregate over the pre-processed events per user and day
● It’s like trying to eat an elephant: one piece at a time!

Scalable Pipelines: building blocks

● Imagine we had a MapAggregator, for aggregating maps of [String -> Double].
● The spec for such an aggregator, implemented in Scala on Spark, could look like this. :-)
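The spec itself was shown as an image on the slide; a plausible shape for it, in ScalaTest style, might be the following. The `aggregate` helper and the test names are hypothetical, for illustration only:

```scala
import org.scalatest.{FlatSpec, Matchers}

// Hypothetical spec for the MapAggregator the slide imagines:
// merging maps of [String -> Double] by summing values key-wise.
class MapAggregatorSpec extends FlatSpec with Matchers {

  "MapAggregator" should "sum values key-wise across maps" in {
    val input = Seq(Map("a" -> 1.0), Map("a" -> 2.0, "b" -> 3.0))
    aggregate(input) shouldEqual Map("a" -> 3.0, "b" -> 3.0)
  }

  it should "treat the empty map as the identity element" in {
    aggregate(Seq(Map.empty[String, Double], Map("a" -> 1.0))) shouldEqual Map("a" -> 1.0)
  }
}
```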

Scalable Pipelines: building blocks

● In Spark we can define our own functions, known as User Defined Functions (UDFs).
● A UDF takes one or more columns as arguments and returns some output.
● It is executed for each row of the DataFrame.
● It can also be parameterized, e.g. val myUDF = udf((myArg: myType) => ...)

● Since Spark 1.5, we can also define our own User Defined Aggregate Functions (UDAFs).
● UDAFs can be used to compute custom calculations over groups of input data (in contrast, UDFs compute a value from a single input row).
● Examples: calculating the geometric mean, or the product of values for every group.
● A UDAF maintains an aggregation buffer to store intermediate results for every group of input data.
● It updates this buffer for every input row.
● Once it has processed all input rows, it generates a result value based on the values of the aggregation buffer.
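A small concrete example of the UDF points above — a parameterized UDF applied per row. The DataFrame `users` and the column names are assumptions for illustration:

```scala
import org.apache.spark.sql.functions.{col, udf}

// A UDF closes over a parameter (threshold) and runs once per input row.
val threshold = 10.0
val isHeavyUser = udf((eventCount: Double) => eventCount > threshold)

// Appends a boolean column computed row by row.
val flagged = users.withColumn("heavy", isHeavyUser(col("eventCount")))
```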

Scalable Pipelines: UDAF

A User Defined Aggregate Function: implementation of the abstract methods
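One way to fill in those abstract methods for the MapAggregator idea, using the `UserDefinedAggregateFunction` API that the talk’s Spark 1.5+ setting implies (this API was later deprecated in Spark 3.0). The method bodies are a sketch, not the talk’s exact code:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

class MapAggregator extends UserDefinedAggregateFunction {
  private val mapType = MapType(StringType, DoubleType)

  // Input rows carry one Map[String, Double] column.
  def inputSchema: StructType  = new StructType().add("value", mapType)
  // The aggregation buffer holds the running key-wise sum for each group.
  def bufferSchema: StructType = new StructType().add("acc", mapType)
  def dataType: DataType       = mapType
  def deterministic: Boolean   = true

  // The monoid operation: key-wise sum of two maps.
  private def plus(a: Map[String, Double], b: Map[String, Double]) =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  // Each group starts from the monoid's identity element.
  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Map.empty[String, Double]

  // Called once per input row of a group.
  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = plus(buffer.getMap[String, Double](0).toMap,
                       input.getMap[String, Double](0).toMap)

  // Combines partial aggregates from different partitions — safe because op is associative.
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = plus(buffer1.getMap[String, Double](0).toMap,
                      buffer2.getMap[String, Double](0).toMap)

  def evaluate(buffer: Row): Any = buffer.getMap[String, Double](0)
}
```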

Scalable Pipelines: adding a new stage

EventCoalescer: collects raw pulse events (JSON) into substantially fewer files (Parquet)

UserAccountPreprocessor: harmonises schemas of user accounts across sites (if necessary). Provides ground truth data for training.

EventPreprocessor: aggregates events per user and day

EventAggregator (new): aggregates pre-processed events per user over multiple days (lookback)

GenderPredictor: creates labels and features, trains a classifier & computes predictions

GenderPerformanceEvaluator: computes performance metrics, e.g. accuracy and area under ROC

Scalable Pipelines: Aggregating Events

It’s a Transformer: DataFrame in, DataFrame out, aggregating maps of feature frequency counts.
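The core of such a Transformer’s `transform` could be a grouped aggregation using a MapAggregator UDAF like the one the earlier slides imagined. Column names (`userId`, `featureCounts`) are assumptions, not from the slides:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val mapAgg = new MapAggregator  // hypothetical UDAF merging Map[String, Double] key-wise

// DataFrame in, DataFrame out: merge each user's per-day feature-count
// maps over the lookback into a single map per user.
def aggregateEvents(dailyEvents: DataFrame): DataFrame =
  dailyEvents
    .groupBy(col("userId"))
    .agg(mapAgg(col("featureCounts")).as("featureCounts"))
```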

Scalable Pipelines: closing remarks

● With User Defined Aggregate Functions, we have reduced the workload of our pipelines by a factor of 20!
● Obvious gains: freeing up resources that can be used for running even more pipelines, faster, over even more input data

Scalable Pipelines: closing remarks

● Needless to say, more factors contribute towards a scalable pipeline:
  ○ Performance tuning of the Spark cluster
  ○ Use of a workflow manager (e.g. Luigi) for pipeline orchestration
● But each one of these is a topic for a separate talk (Carlos? Hint, hint!) :-)

Q/A

Thank you!


Shameless plug

We are hiring!

Across all our hubs

in London, Oslo, Stockholm, Barcelona

for Data Science, Engineering, UX and Product roles

https://jobs.lever.co/schibsted
spt-recruiters@schibsted.com