Google Dataflow Intro

Sep 2015

What is Google Dataflow

❖ Data processing system: batch and streaming

❖ Set of SDKs

❖ Google Cloud Platform managed services:

❖ Google Compute Engine (VMs)

❖ Google Cloud Storage (r/w data)

❖ BigQuery (r/w data)

Programming Model

❖ Pipeline - entire series of computations

❖ PCollection - set of data in a pipeline

❖ Transform - any data processing operation

❖ Pipeline I/O - data source and data sink APIs

Pipeline

❖ Data + Transforms

❖ Branching + merging

❖ Multiple sources

❖ Unit testing + Integration testing

❖ Pipeline Execution Parameters (local/prod)

❖ Defines where data comes from, what it looks like, what to do with it, and where to store the results

PCollection

❖ Represent data in a pipeline from any source

❖ Potentially unlimited (stream)

❖ Serializable, immutable, no random access to elements

❖ Deferred data (may have yet to be computed)

❖ Windowing, triggers
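
The PCollection properties above (immutable, deferred, possibly unbounded, no random access) can be illustrated with a minimal plain-Python sketch. This is not the Dataflow SDK: `LazyCollection`, `apply` and `run` are hypothetical names chosen only for this illustration.

```python
import itertools
from typing import Callable, Iterable, Iterator, Tuple


class LazyCollection:
    """Toy stand-in for a PCollection (hypothetical, not an SDK class)."""

    def __init__(self, source: Callable[[], Iterable], steps: Tuple[Callable, ...] = ()):
        self._source = source  # called only when the collection is run
        self._steps = steps    # tuple of per-element functions, never mutated

    def apply(self, fn: Callable) -> "LazyCollection":
        # Immutable: applying a step returns a NEW collection.
        return LazyCollection(self._source, self._steps + (fn,))

    def run(self) -> Iterator:
        # Deferred: nothing is computed until run() is iterated.
        # Elements are streamed one by one; there is no indexing.
        for element in self._source():
            for fn in self._steps:
                element = fn(element)
            yield element


# Bounded source.
words = LazyCollection(lambda: ["a", "bb", "ccc"])
lengths = words.apply(len)
doubled = lengths.apply(lambda n: n * 2)

# Unbounded source: an endless counter, consumed only as far as needed.
naturals = LazyCollection(itertools.count)
first_squares = list(itertools.islice(naturals.apply(lambda n: n * n).run(), 3))
```

Because every `apply` returns a new object, `words` and `lengths` stay usable after `doubled` is built, which is what makes branching a pipeline safe.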

Windowing

❖ Window - a logical subdivision of a PCollection

❖ Each element is assigned to one or more windows

❖ Fixed time windows

❖ Sliding time windows

❖ Per-session windows

❖ Single global window
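
The window types above can be sketched as plain-Python assignment functions. This is a hedged illustration, not the SDK's windowing API: the function names are hypothetical, timestamps are assumed to be integer seconds, and windows are modelled as half-open intervals `[start, end)`.

```python
from typing import List, Tuple

Window = Tuple[int, int]  # half-open interval [start, end), in seconds


def fixed_window(ts: int, size: int) -> Window:
    # Each element falls into exactly one fixed window.
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts: int, size: int, period: int) -> List[Window]:
    # Windows start at multiples of `period`; an element belongs to every
    # window whose interval contains its timestamp (newest window first).
    last_start = ts - ts % period
    return [(s, s + size) for s in range(last_start, ts - size, -period)]


def session_windows(timestamps: List[int], gap: int) -> List[Window]:
    # Each element opens a window [ts, ts + gap); overlapping windows merge,
    # so a session ends after `gap` seconds without new elements.
    windows: List[Window] = []
    for ts in sorted(timestamps):
        if windows and ts < windows[-1][1]:
            windows[-1] = (windows[-1][0], ts + gap)
        else:
            windows.append((ts, ts + gap))
    return windows


one_window = fixed_window(65, size=60)
two_windows = sliding_windows(65, size=60, period=30)
sessions = session_windows([1, 3, 20, 22, 50], gap=10)
```

With a sliding window of 60 seconds and a period of 30 seconds, every element lands in two windows, which is the "one or more windows" case from the list above.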

Late Data

❖ Event time / Processing time

❖ No order guarantee

❖ No consistent delta between event time and processing time

❖ Watermark

❖ Late data

❖ Triggers to refine windowing, data reporting time
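
A rough sketch of how a watermark separates out-of-order data from late data. The heuristic used here (watermark = largest event time seen so far minus a fixed skew) is an assumption made for illustration only; the real service estimates its watermark differently. `classify_late` and `max_skew` are hypothetical names.

```python
from typing import Iterable, List, Tuple


def classify_late(
    events: Iterable[Tuple[str, int]], max_skew: int
) -> List[Tuple[str, int, bool]]:
    """Mark each event as late or not.

    `events` is given in processing-time (arrival) order as
    (name, event_time) pairs.
    """
    watermark = float("-inf")  # "all data up to here should have arrived"
    classified = []
    for name, event_time in events:
        # Late means: the watermark already passed this event's time.
        is_late = event_time < watermark
        classified.append((name, event_time, is_late))
        # Assumed heuristic: trail the newest event time by max_skew.
        watermark = max(watermark, event_time - max_skew)
    return classified


# "c" arrives out of order but within the skew; "e" arrives after the
# watermark has already moved past its event time.
arrivals = [("a", 10), ("b", 12), ("c", 8), ("d", 30), ("e", 11)]
result = classify_late(arrivals, max_skew=5)
```

Out-of-order arrival alone does not make an element late; it is late only once the watermark has passed its event time.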

Triggers

❖ Enough data for the window -> aggregate result: “pane”

❖ Help handle late data

❖ Time-based triggers

❖ Data-driven triggers (e.g. certain amount is enough)

❖ Composite triggers: OR, AND - operations on triggers

❖ Window Accumulation modes: accumulate/discard the previous “panes”
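
The difference between the two accumulation modes can be sketched with a data-driven trigger that fires after every N elements of a single window. This is a plain-Python illustration with hypothetical names (`fire_panes`, `every_n`), not the SDK's trigger API; the aggregate is assumed to be a sum.

```python
from typing import Iterable, List


def fire_panes(values: Iterable[int], every_n: int, accumulate: bool) -> List[int]:
    """Return the sum emitted in each pane of one window."""
    panes: List[int] = []
    buffered: List[int] = []
    running_total = 0
    for value in values:
        buffered.append(value)
        if len(buffered) == every_n:  # data-driven trigger fires
            pane_sum = sum(buffered)
            running_total += pane_sum
            # Accumulating: each pane also covers all earlier panes.
            # Discarding: each pane covers only elements since the last firing.
            panes.append(running_total if accumulate else pane_sum)
            buffered = []
    # Elements left in `buffered` wait for a later firing, which this
    # sketch does not model.
    return panes


window_values = [1, 2, 3, 4, 5, 6]
discarding = fire_panes(window_values, every_n=2, accumulate=False)
accumulating = fire_panes(window_values, every_n=2, accumulate=True)
```

In discarding mode a consumer has to add the panes up itself; in accumulating mode the latest pane replaces the earlier ones.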

Transforms

❖ Math, convert format, grouping, filtering, combining

❖ [PCollection] -> [PCollection]

❖ Core Transforms: ParDo, GroupByKey, Combine, …

❖ Functions with business logic to apply: Serializable, Thread-compatible, Idempotent

❖ Composite Transforms
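
The core transforms can be imitated in plain Python to show how they compose. The functions below only mirror the roles of ParDo, GroupByKey and Combine; they are hypothetical helpers, not SDK calls, and they run on in-memory lists rather than distributed data.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def par_do(elements: Iterable, fn: Callable[[object], Iterable]) -> Iterator:
    # ParDo role: apply fn to each element; fn may emit zero or more outputs.
    for element in elements:
        yield from fn(element)


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    # GroupByKey role: collect all values that share a key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def combine_per_key(
    groups: Dict[str, List[int]], combiner: Callable[[List[int]], int]
) -> Dict[str, int]:
    # Combine role: reduce each key's values to a single result.
    return {key: combiner(values) for key, values in groups.items()}


def split_words(line: str) -> Iterator[Tuple[str, int]]:
    # Business-logic function: no shared state, same output for same input.
    for word in line.lower().split():
        yield (word, 1)


def count_words(lines: Iterable[str]) -> Dict[str, int]:
    # Composite transform: several core transforms packaged as one step.
    return combine_per_key(group_by_key(par_do(lines, split_words)), sum)


counts = count_words(["the cat", "the dog the end"])
```

`split_words` keeps no state between calls, which is the property that lets a runner retry or parallelize such a function safely.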

Pipeline I/O

❖ Read/Write from/to external sources

❖ Text Files in Google Cloud Storage or local FS

❖ BigQuery tables

❖ Google Cloud PubSub

❖ Custom Sources and Sinks

Extra

❖ Parallelization, distribution, optimization, scaling

❖ Dataflow monitoring UI and CLI

❖ Logging

❖ Local unit testing of any Fn, plus end-to-end pipeline tests

❖ Introspection toolchain

❖ Update toolchain: for code, windowing configs

Questions?
