Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now....

62
Google for Work Sales Engineer Google Todd Reedy Processing Data Like Google Using the Dataflow/Beam Model

Transcript of Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now....

Page 1: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Google for Work Sales EngineerGoogle

Todd Reedy

Processing Data Like GoogleUsing the Dataflow/Beam Model

Page 2: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Goals:

Write interesting

computations

Run in both batch &

streaming

Use custom timestamps

Handle late data

https://commons.wikimedia.org/wiki/File:Globe_centered_in_the_Atlantic_Ocean_(green_and_grey_globe_scheme).svg

Page 3: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Weave provides the fastest way

to connect smart devices to

the network, the cloud, and an

ecosystem of (business) apps &

smart devices.

Weave setup works on Android,

iOS, and desktop.

Want more details? Visit: https://developers.google.com/weave/

CRM

Page 4: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Telemetry

Brillo is a free OS, tightly integrated with

Android, tailored for IoT.

Brillo makes it cost-effective to build a

secure connected device, with verified

boot to ensure it runs only trusted code.

Brillo comes with services to help

understand how customers interact

with it, and keep it updated over time.

Have Brillo running on your device itself

or use it as a hub to aggregate data from

nearby (micro) processors.

OTA Updates Weave Metrics

Android HAL

Kernel

Hardware

Page 5: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Data Shapes

Google’s Data Processing Story

The Beam Model

Agenda

Demo: Google Cloud Dataflow

1

2

3

4

Page 6: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Data Shapes1

Page 7: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Data...

Page 8: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

...can be big...

Page 9: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

...really, really big...

Tuesday

Wednesday

Thursday

Page 10: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

...maybe even infinitely big...

9:008:00 14:0013:0012:0011:0010:002:001:00 7:006:005:004:003:00

Page 11: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

… with unknown delays.

9:008:00 14:0013:0012:0011:0010:00

8:00

8:008:00

8:00

Page 12: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

1 + 1 = 2

Completeness Latency Cost

$$$

Data Processing Tradeoffs

Page 13: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Requirements: Billing Pipeline

Completeness Latency Cost

Important

Not Important

Page 14: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Requirements: Live Cost Estimate Pipeline

Completeness Latency Cost

Important

Not Important

Page 15: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Requirements: Abuse Detection Pipeline

Completeness Latency Cost

Important

Not Important

Page 16: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Requirements: Abuse Detection Backfill Pipeline

Completeness Latency Cost

Important

Not Important

Page 17: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Google’s Data Processing Story2

Page 18: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

20122002 2004 2006 2008 2010

MapReduce

GFS Big Table

Dremel

Pregel

FlumeJava

Colossus

Spanner

2014

MillWheel

Dataflow

2016

Google’s Data-Related Papers

Page 19: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

MapReduce: Batch Processing

R R

M M M

Page 20: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

FlumeJava: Easy and Efficient MapReduce Pipelines

● Higher-level API with simple data

processing abstractions.

○ Focus on what you want to do to

your data, not what the

underlying system supports.

● A graph of transformations is

automatically transformed into an

optimized series of MapReduces.

Page 21: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

MapReduce

Batch Patterns: Creating Structured Data

Page 22: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

MapReduce

Batch Patterns: Repetitive Runs

TuesdayWednesday

Thursday

Page 23: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

MapReduce

Tuesday [11:00 - 12:00)

[12:00 - 13:00)

[13:00 - 14:00)

[14:00 - 15:00)

[15:00 - 16:00)

[16:00 - 17:00)

[18:00 - 19:00)

[19:00 - 20:00)

[21:00 - 22:00)

[22:00 - 23:00)

[23:00 - 0:00)

Batch Patterns: Time Based Windows

Page 24: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

MapReduce

TuesdayWednesday

Batch Patterns: Sessions

Jose

Lisa

Ingo

Asha

Cheryl

Ari

WednesdayTuesday

Page 25: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

MillWheel: Streaming Computations

● Framework for building low-latency

data-processing applications

● User provides a DAG of

computations to be performed

● System manages state and

persistent flow of elements

Page 26: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Streaming Patterns: Element-wise transformations

13:00 14:008:00 9:00 10:00 11:00 12:00Processing

Time

Page 27: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Streaming Patterns: Aggregating Time Based Windows

13:00 14:008:00 9:00 10:00 11:00 12:00Processing

Time

Page 28: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

11:0010:00 15:0014:0013:0012:00Event Time

11:0010:00 15:0014:0013:0012:00Processing

Time

Input

Output

Streaming Patterns: Event Time Based Windows

Page 29: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Streaming Patterns: Session Windows

Event Time

Processing Time 11:0010:00 15:0014:0013:0012:00

11:0010:00 15:0014:0013:0012:00

Input

Output

Page 30: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Pro

ce

ssin

g T

ime

Event Time

Skew

Event-Time Skew

Page 31: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Pro

ce

ssin

g T

ime

Event Time

Skew

Event-Time Skew

Watermark Watermarks describe event

time progress.

"No timestamp earlier than the

watermark will be seen"

Often heuristic-based.

Too Slow? Results are delayed.

Too Fast? Some data is late.

Page 32: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Streaming or Batch?

1 + 1 = 2 $$$Completeness Latency Cost

Why not both?

Page 33: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

The Dataflow Model3

Page 34: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

The Apache Beam Model3

Page 35: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

What are you computing?

Where in event time?

When in processing time?

How do refinements relate?

Page 36: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

What are you computing?

• A Pipeline represents a graph

of data processing

transformations

• PCollections flow through the

pipeline

• Optimized and executed as a

unit for efficiency

Page 37: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

What are you computing?

• A PCollection<T> is a collection of

data of type T

• Maybe be bounded or unbounded

in size

• Each element has an implicit

timestamp

• Initially created from backing data

stores

Page 38: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

What are you computing?

PTransforms transform PCollections into other

PCollections.

What Where When How

Element-Wise Aggregating Composite

Page 39: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Computing Integer Sums

// Collection of raw log linesPCollection<String> raw = ...;

// Element-wise transformation into team/score pairsPCollection<KV<String, Integer>> input =

raw.apply(ParDo.of(new ParseFn()))

// Composite transformation containing an aggregationPCollection<KV<String, Integer>> output = input.apply(Sum.integersPerKey());

What Where When How

Page 40: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Computing Integer Sums

What Where When How

Page 41: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Computing Integer Sums

What Where When How

Page 42: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Computing Integer Sums

What Where When How

Page 43: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Key 2

Key 1

Key 3

1

Fixed

2

3

4

Key 2

Key 1

Key 3

Sliding

123

54

Key 2

Key 1

Key 3

Sessions

2

4

3

1

Where in Event Time?

Windowing divides

data into event-

time-based finite

chunks.

Required when doing

aggregations over

unbounded data.

What Where When How

Page 44: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

PCollection<KV<String, Integer>> output = input.apply(Window.into(FixedWindows.of(Minutes(2)))).apply(Sum.integersPerKey());

Example: Fixed 2-minute Windows

What Where When How

Page 45: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Fixed 2-minute Windows

What Where When How

Page 46: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

When in Processing Time?

Triggers control when

results are emitted.

Triggers are often

relative to the

watermark.Pro

ce

ssin

g T

ime

Event Time

Watermark

What Where When How

Page 47: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

PCollection<KV<String, Integer>> output = input.apply(Window.into(FixedWindows.of(Minutes(2)))

.trigger(AtWatermark())).apply(Sum.integersPerKey());

Example: Triggering at the Watermark

What Where When How

Page 48: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Triggering at the Watermark

What Where When How

Page 49: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Triggering for Speculative & Late Data

PCollection<KV<String, Integer>> output = input.apply(Window.into(FixedWindows.of(Minutes(2)))

.trigger(AtWatermark().withEarlyFirings(AtPeriod(Minutes(1))).withLateFirings(AtCount(1))))

.apply(Sum.integersPerKey());

What Where When How

Page 50: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Triggering for Speculative & Late Data

What Where When How

Page 51: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

How do Refinements Relate?

How should multiple outputs per window accumulate?Appropriate choice depends on consumer.

Firing Elements

Speculative 3

Watermark 5, 1

Late 2

Total Observ 11

Discarding

3

6

2

11

What Where When How

Page 52: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

How do Refinements Relate?

How should multiple outputs per window accumulate?Appropriate choice depends on consumer.

Firing Elements

Speculative 3

Watermark 5, 1

Late 2

Total Observ 11

Discarding

3

6

2

11

Accumulating

3

9

11

23

Acc. & Retracting

3

9, -3

11, -9

11

What Where When How

Page 53: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

PCollection<KV<String, Integer>> output = input.apply(Window.into(Sessions.withGapDuration(Minutes(1)))

.trigger(AtWatermark().withEarlyFirings(AtPeriod(Minutes(1))).withLateFirings(AtCount(1)))

.accumulatingAndRetracting()).apply(new Sum());

Example: Add Newest, Remove Previous

What Where When How

Page 54: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Example: Add Newest, Remove Previous

What Where When How

Page 55: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

1. Classic Batch 2. Batch with Fixed

Windows

3. Streaming 5. Streaming with

Retractions

4. Streaming with

Speculative + Late Data

Customizing What Where When How

What Where When How

Page 56: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Demo: Google Cloud Dataflow4

Page 57: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Cloud DataflowA fully-managed cloud service and

programming model for batch and

streaming big data processing.

Google Cloud Dataflow

Page 58: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Open Source SDKs

● Used to construct a Dataflow pipeline.

● Java available now. Python in the works.

● Pipelines can run…

○ On your development machine

○ On the Dataflow Service on Google Cloud Platform

○ On third party environments like Spark or Flink.

Page 59: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Fully Managed Dataflow Service

Runs the pipeline on Google Cloud Platform. Includes:

● Graph optimization: Modular code, efficient execution

● Smart Workers: Lifecycle management, Autoscaling, and

Smart task rebalancing

● Easy Monitoring: Dataflow UI, Restful API and CLI,

Integration with Cloud Logging, etc.

Page 60: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Demo Time!

Page 61: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Wrote Reusable code

(PTransforms)

Built upon the same code

(PTransform) in batch and

streaming

Used event time, not

processing time

Adjusted windowing and

late data settings

Page 62: Processing Data Like Google€¦ · Used to construct a Dataflow pipeline. Java available now. Python in the works. Pipelines can run… On your development machine On the Dataflow

Learn More!

tec2016.toddreedy.com