Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I...

66
Scalable Stream Processing MillWheel and Cloud Dataflow Amir H. Payberah [email protected] KTH Royal Institute of Technology Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 1 / 54

Transcript of Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I...

Page 1: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Scalable Stream ProcessingMillWheel and Cloud Dataflow

Amir H. [email protected]

KTH Royal Institute of Technology

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 1 / 54

Page 2: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

MillWheel

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 2 / 54

Page 3: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Motivation

I Google’s Zeitgeist pipeline: tracking trends in web queries

I Ingests a continuous input of search queries and performs anomalydetection.

I Builds a historical model of each query, so that expected changes intraffic.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 3 / 54

Page 4: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Requirements

I Persistent storage: shortterm and longterm

I Low watermarks: distinguish late records

I Duplicate prevention

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 4 / 54

Page 5: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

MillWheel Dataflow

I A graph of user-defined transformations (computations) on inputdata that produces output data.

I Computation actions include:• Contacting external systems• Manipulating other MillWheel primitives• Outputting data

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 5 / 54

Page 6: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Data Model (1/3)

I Stream: the delivery mechanism between computations.

I Inputs and outputs are represented by (key, value, timestamp)triples.

I Key: a metadata field with semantic meaning in the system.

I Value: an arbitrary byte string, corresponding to the entire record.

I Timestamp: typically wall clock time when the event occurred.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 6 / 54

Page 7: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Data Model (1/3)

I Stream: the delivery mechanism between computations.

I Inputs and outputs are represented by (key, value, timestamp)triples.

I Key: a metadata field with semantic meaning in the system.

I Value: an arbitrary byte string, corresponding to the entire record.

I Timestamp: typically wall clock time when the event occurred.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 6 / 54

Page 8: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Data Model (2/3)

I Keys are abstraction for record aggregation and comparison.

I Key extraction function: specified by the stream consumer to assignkeys to records.

I Computation can only access state for the specific key.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 7 / 54

Page 9: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Data Model (2/3)

I Keys are abstraction for record aggregation and comparison.

I Key extraction function: specified by the stream consumer to assignkeys to records.

I Computation can only access state for the specific key.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 7 / 54

Page 10: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Data Model (3/3)

I A computation subscribes to zero or more input streams and pub-lishes one or more output streams.

I Multiple computations can extract different keys from the samestream.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 8 / 54

Page 11: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Data Model (3/3)

I A computation subscribes to zero or more input streams and pub-lishes one or more output streams.

I Multiple computations can extract different keys from the samestream.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 8 / 54

Page 12: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Computation (1/3)

I Application logic lives in computations.

I Users can add and remove computations from a topology dynami-cally.

I Runs in the context of a single key.

I Parallel per-key processing

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 9 / 54

Page 13: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Computation (2/3)

class Computation {

// Hooks called by the system.

void ProcessRecord(Record data);

void ProcessTimer(Timer timer);

...

};

I ProcessRecord• Triggered when receiving a record

I ProcessTimer• Triggered at a specific value or low

watermark value• Optional

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 10 / 54

Page 14: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Computation (3/3)

// Upon receipt of a record, update the running total for its timestamp bucket,

// and set a timer to fire when we have received all of the data for that bucket.

void Windower::ProcessRecord(Record input) {

WindowState state(MutablePersistentState());

state.UpdateBucketCount(input.timestamp());

string id = WindowID(input.timestamp())

SetTimer(id, WindowBoundary(input.timestamp()));

}

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 11 / 54

Page 15: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Persistent State

I Managed on per-key basis

I Stored in Bigtable or Spanner

I Common use: aggregation, buffered data for joins, ...

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 12 / 54

Page 16: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Low Watermarks (1/3)

I Low watermark: provides a bound on the timestamps of futurerecords arriving at that computation.

I Late records: records behind the low watermark.• Process them according to application, e.g., discard or correct the

result.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 13 / 54

Page 17: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Low Watermarks (1/3)

I Low watermark: provides a bound on the timestamps of futurerecords arriving at that computation.

I Late records: records behind the low watermark.• Process them according to application, e.g., discard or correct the

result.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 13 / 54

Page 18: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Low Watermarks (3/3)

I min(oldest work of A, low watermark of C: C outputs to A)

I Low watermark values are seeded by injectors that send data intoMillWheel from external systems.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 14 / 54

Page 19: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Low Watermarks (3/3)

I min(oldest work of A, low watermark of C: C outputs to A)

I Low watermark values are seeded by injectors that send data intoMillWheel from external systems.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 14 / 54

Page 20: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Low Watermarks (3/3)

I Example: a file injector reports a low watermark value that corre-sponds to the oldest unfinished file.

// Upon finishing a file or receiving a new one, we update the low watermark

// to be the minimum creation time.

void OnFileEvent() {

int64 watermark = kint64max;

for (file : files) {

if (!file.AtEOF())

watermark = min(watermark, file.GetCreationTime());

}

if (watermark != kint64max)

UpdateInjectorWatermark(watermark);

}

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 15 / 54

Page 21: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Fault Tolerance

I Delivery guarantees

I State manipulation

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 16 / 54

Page 22: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Exactly-One Delivery

I Upon receipt of an input record in a computation:

• The duplicated records are discarded.

• User code is run for the input record.

• Pending changes are committed to the backing store.

• Senders are ACKed.

• Pending downstream productions are sent.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 17 / 54

Page 23: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Strong Productions

I Inputs are not necessarily ordered or deterministic: emitted recordsare checkpointed before delivery.

• If an ACK is not received, the record can be re-sent.• Duplicates are discarded by MillWheel at the recipient.

I The checkpoints allow fault-tolerance.• If a processor crashes and is restarted somewhere else any intermedi-

ate computations can be recovered.

I When a delivery is ACKed the checkpoints can be garbage collected.

I The Checkpoint→Delivery→ACK→GC sequence is called a strongproduction.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 18 / 54

Page 24: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Strong Productions

I Inputs are not necessarily ordered or deterministic: emitted recordsare checkpointed before delivery.

• If an ACK is not received, the record can be re-sent.• Duplicates are discarded by MillWheel at the recipient.

I The checkpoints allow fault-tolerance.• If a processor crashes and is restarted somewhere else any intermedi-

ate computations can be recovered.

I When a delivery is ACKed the checkpoints can be garbage collected.

I The Checkpoint→Delivery→ACK→GC sequence is called a strongproduction.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 18 / 54

Page 25: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Strong Productions

I Inputs are not necessarily ordered or deterministic: emitted recordsare checkpointed before delivery.

• If an ACK is not received, the record can be re-sent.• Duplicates are discarded by MillWheel at the recipient.

I The checkpoints allow fault-tolerance.• If a processor crashes and is restarted somewhere else any intermedi-

ate computations can be recovered.

I When a delivery is ACKed the checkpoints can be garbage collected.

I The Checkpoint→Delivery→ACK→GC sequence is called a strongproduction.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 18 / 54

Page 26: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Strong Productions

I Inputs are not necessarily ordered or deterministic: emitted recordsare checkpointed before delivery.

• If an ACK is not received, the record can be re-sent.• Duplicates are discarded by MillWheel at the recipient.

I The checkpoints allow fault-tolerance.• If a processor crashes and is restarted somewhere else any intermedi-

ate computations can be recovered.

I When a delivery is ACKed the checkpoints can be garbage collected.

I The Checkpoint→Delivery→ACK→GC sequence is called a strongproduction.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 18 / 54

Page 27: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Weak Productions (1/2)

I Some computation may be idempotent, regardless of the presenceof strong production and exactly-once delivery.

I Disable the exactly-once and/or strong production guarantee forapplications that do not need it.

I Weak production is when Millwheel users can allow events to besent before the checkpoint is committed to persistent storage.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 19 / 54

Page 28: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Delivery Guarantees - Weak Productions (2/2)

I Weak production checkpointing prevents straggler productions fromoccupying undue resources in the sender (Computation A) by savinga checkpoint for receiver (Computation B).

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 20 / 54

Page 29: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

State Manipulation (1/2)

I Hard state: persisted to the backing store.

I Soft state: in-memory caches or aggregates.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 21 / 54

Page 30: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

State Manipulation (2/2)

I To ensure consistency in hard state, only one bulk write is permittedper event.

• Wrap all per-key updates in a single atomic operation.

I To avoid zombie writers (where work has been moved elsewherethrough failure detection or through load balancing), every writerhas a lease or sequencer that ensures only they may write.

I The single-writer for a key at a particular point in time is critical tothe maintenance of soft state.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 22 / 54

Page 31: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

State Manipulation (2/2)

I To ensure consistency in hard state, only one bulk write is permittedper event.

• Wrap all per-key updates in a single atomic operation.

I To avoid zombie writers (where work has been moved elsewherethrough failure detection or through load balancing), every writerhas a lease or sequencer that ensures only they may write.

I The single-writer for a key at a particular point in time is critical tothe maintenance of soft state.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 22 / 54

Page 32: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

State Manipulation (2/2)

I To ensure consistency in hard state, only one bulk write is permittedper event.

• Wrap all per-key updates in a single atomic operation.

I To avoid zombie writers (where work has been moved elsewherethrough failure detection or through load balancing), every writerhas a lease or sequencer that ensures only they may write.

I The single-writer for a key at a particular point in time is critical tothe maintenance of soft state.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 22 / 54

Page 33: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 23 / 54

Page 34: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow (1/4)

I Google managed service for batch and stream data processing.

I A programming model and execution framework.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 24 / 54

Page 35: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow (2/4)

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 25 / 54

Page 36: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow (3/4)

I MapReduce: batch processing

I FlumeJava: dataflow programming model

I MillWheel: handling streaming data

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 26 / 54

Page 37: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow (4/4)

I Open source Cloud Dataflow SDK

I Express your data processing pipeline using FlumeJava.

I If you run your Cloud Dataflow program in batch mode, it is con-verted to MapReduce operations and run on Google’s MapReduceframework.

I If you run the same program in streaming mode, it is executed onthe MillWheel stream processing engine.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 27 / 54

Page 38: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow (4/4)

I Open source Cloud Dataflow SDK

I Express your data processing pipeline using FlumeJava.

I If you run your Cloud Dataflow program in batch mode, it is con-verted to MapReduce operations and run on Google’s MapReduceframework.

I If you run the same program in streaming mode, it is executed onthe MillWheel stream processing engine.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 27 / 54

Page 39: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Google Cloud Dataflow (4/4)

I Open source Cloud Dataflow SDK

I Express your data processing pipeline using FlumeJava.

I If you run your Cloud Dataflow program in batch mode, it is con-verted to MapReduce operations and run on Google’s MapReduceframework.

I If you run the same program in streaming mode, it is executed onthe MillWheel stream processing engine.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 27 / 54

Page 40: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Programming Model

I Pipeline, a directed graph of data processing transformations

I Optimized and executed as a unit

I May include multiple inputs andmultiple outputs

I May encompass many logicalMapReduce or Millwheel operations

I PCollections conceptually flowthrough the pipeline

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 28 / 54

Page 41: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Dataflow Main Components

I Pipelines

I PCollections

I Transforms

I I/O sources and sinks

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 29 / 54

Page 42: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Pipelines (1/2)

I A pipeline represents a data processing job

I Directed graph of steps operating on data

I A pipeline consists of two parts:• Data (PCollection)• Transforms applied to that data

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 30 / 54

Page 43: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Pipelines (2/2)

public static void main(String[] args) {

// Create a pipeline parameterized by commandline flags.

Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(arg));

p.apply(TextIO.Read.from("gs://...")) // Read input.

.apply(new CountWords()) // Do some processing.

.apply(TextIO.Write.to("gs://...")); // Write output.

// Run the pipeline.

p.run();

}

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 31 / 54

Page 44: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Overview(1/2)

I A specialized class to represent data in a pipeline.

I A parallel collection of records

I Immutable

I No random access

I Must specify bounded or unbounded

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 32 / 54

Page 45: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Overview (2/2)

// Create a Java Collection, in this case a List of Strings.

static final List<String> LINES = Arrays.asList(

"To be, or not to be: that is the question: ",

"Whether ’tis nobler in the mind to suffer ",

"The slings and arrows of outrageous fortune, ",

"Or to take arms against a sea of troubles, ");

PipelineOptions options = PipelineOptionsFactory.create();

Pipeline p = Pipeline.create(options);

// Create the PCollection

p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of())

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 33 / 54

Page 46: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Windowing (1/6)

I Logically divide up or groups the elements of a PCollection intofinite windows.

I Each element in a PCollection is assigned to one or more windows.

I Windowing functions:• Fixed time windows• Sliding time windows• Per-session windows

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 34 / 54

Page 47: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Windowing (2/6)

I Fixed time windows

I Represents the time interval in the data stream to define bundles ofdata, e.g., hourly

PCollection<String> items = ...;

PCollection<String> fixed_windowed_items = items.apply(

Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES)));

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 35 / 54

Page 48: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Windowing (3/6)

I Sliding time windows

I Uses time intervals in the data stream to define bundles of data,however the windows overlap.

PCollection<String> items = ...;

PCollection<String> sliding_windowed_items = items.apply(

Window.<String>into(SlidingWindows

.of(Duration.standardMinutes(30))

.every(Duration.standardSeconds(5))));

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 36 / 54

Page 49: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Windowing (4/6)

I Session windows

I Defines windows around areas ofconcentration in the data.

I Useful for data that is irregularlydistributed with respect to time,e.g., user mouse activity

I Applies on a per-key basis

PCollection<String> items = ...;

PCollection<String> session_windowed_items = items.apply(

Window.<String>into(Sessions

.withGapDuration(Duration.standardMinutes(10))));

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 37 / 54

Page 50: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Windowing (5/6)

I Time skew and late data

I Dataflow tracks a watermark: the system’s notion of when all datain a certain window can be expected to have arrived in the pipeline.

I Data that arrives with a timestamp after the watermark is consideredlate data.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 38 / 54

Page 51: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Windowing (6/6)

I Allow late data by invoking the withAllowedLateness operation.

PCollection<String> items = ...;

PCollection<String> fixed_windowed_items = items.apply(

Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))

.withAllowedLateness(Duration.standardDays(2)));

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 39 / 54

Page 52: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Triggers (1/3)

I Determine when to emit elements into an aggregated window.

I Provide flexibility for dealing with time skew and data lag.• Example: deal with late-arriving data.• Example: get early results, before all the data in a given window has

arrived.

I Three main types of triggers:• Time-based triggers• Data-driven triggers• Composit triggers

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 40 / 54

Page 53: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Triggers (2/3)

I Time-base triggers

I Operate on a time reference• Event time: as indicated by the timestamp on each data element• Processing time: the time when the data element is processed at

any given stage in the pipeline

PCollection<String> pc = ...;

pc.apply(Window<String>.into(FixedWindows.of(1, TimeUnit.MINUTES))

.triggering(AfterProcessingTime

.pastFirstElementInPane()

.plusDelayOf(Duration.standardMinutes(1))));

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 41 / 54

Page 54: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

PCollections - Triggers (3/3)

I Data-driven triggers• Operate by examining the data as it arrives in each window and firing

when a data condition that you specify is met.• Example: emit results from a window when that window has received

a certain number of data elements.

I Composit triggers• Combine multiple time-based or data-driven triggers in some logical

way.• You can set a composite trigger to fire when all triggers are met

(logical AND), when any trigger is met (logical OR), etc.

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 42 / 54

Page 55: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Transformations - Overview

I A processing operation that transforms data

I Each transform accepts one (or multiple) PCollections as input, per-forms an operation on the elements in the input PCollection(s), andproduces one (or multiple) new PCollections as output.

I Core transforms: ParDo, GroupByKey, Combine, Flatten

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 43 / 54

Page 56: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Transformations - ParDo

I Processes each element of a PCollection independently using a user-provided DoFn.

// The input PCollection of Strings.

PCollection<String> words = ...;

// The DoFn to perform on each element in the input PCollection.

static class ComputeWordLengthFn extends DoFn<String, Integer> { ... }

// Apply a ParDo to the PCollection "words" to compute lengths for each word.

PCollection<Integer> wordLengths = words.apply(

ParDo.of(new ComputeWordLengthFn()));

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 44 / 54

Page 57: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Transformations - GroupByKey

I Takes a PCollection of key-value pairs and gathers up all values withthe same key.

// A PCollection of key/value pairs: words and line numbers.

PCollection<KV<String, Integer>> wordsAndLines = ...;

// Apply a GroupByKey transform to the PCollection "wordsAndLines".

PCollection<KV<String, Iterable<Integer>>> groupedWords = wordsAndLines.apply(

GroupByKey.<String, Integer>create());

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 45 / 54

Page 58: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Transformations - Join and CoGroubByKey

I Groups together the values from multiple PCollections of key-valuepairs, where each PCollection in the input has the same key type.

// Each data set is represented by key-value pairs in separate PCollections.

// Both data sets share a common key type ("K").

PCollection<KV<K, V1>> pc1 = ...;

PCollection<KV<K, V2>> pc2 = ...;

// Create tuple tags for the value types in each collection.

final TupleTag<V1> tag1 = new TupleTag<V1>();

final TupleTag<V2> tag2 = new TupleTag<V2>();

// Merge collection values into a CoGbkResult collection.

PCollection<KV<K, CoGbkResult>> coGbkResultCollection =

KeyedPCollectionTuple.of(tag1, pc1)

.and(tag2, pc2)

.apply(CoGroupByKey.<K>create());

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 46 / 54

Page 59: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Example: HashTag Autocompletion (1/3)

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 47 / 54

Page 60: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Example: HashTag Autocompletion (2/3)

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 48 / 54

Page 61: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Example: HashTag Autocompletion (3/3)

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 49 / 54

Page 62: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Example: Word Count (1/2)

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 50 / 54

Page 63: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Example: Word Count (2/2)

Pipeline p = Pipeline.create(...);

p.apply(TextIO.Read.from("gs://..."))

// Apply a ParDo transform to our PCollection of text lines.

.apply(ParDo.of(new DoFn<String, String>() {

public void processElement(ProcessContext c) { ... }}))

// Apply the Count transform to our PCollection of individual words.

.apply(Count.<String>perElement())

// Formats our PCollection of word counts into a printable string

.apply("FormatResults", MapElements...))

// Apply a write transform

.apply(TextIO.Write.to("gs://..."));

// Run the pipeline.

p.run();

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 51 / 54

Page 64: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Summary

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 52 / 54

Page 65: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Summary

I MillWheel• DAG of computations• Persistent state: per-key• Low watermark• Exactly-one delivery

I Google cloud dataflow• Pipeline• PCollection: windows and triggers• Transforms

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 53 / 54

Page 66: Scalable Stream Processing MillWheel and Cloud Dataflow · 2020-06-16 · Motivation I Google’sZeitgeistpipeline:tracking trendsin web queries I Ingests acontinuous inputof search

Questions?

Amir H. Payberah (KTH) MillWheel and Cloud Dataflow 2016/09/27 54 / 54