Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at...

56
Google Cloud Dataflow Cosmin Arad, Senior Software Engineer [email protected] August 7, 2015

Transcript of Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at...

Page 1: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Google Cloud Dataflow

Cosmin Arad, Senior Software Engineer

[email protected]

August 7, 2015

Page 2: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Dataflow Overview

Dataflow SDK Concepts (Programming Model)

Cloud Dataflow Service

Demo: Counting Words!

Questions and Discussion

1

2

3

4

5

Agenda

Page 3: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

History of Big Data at Google

2012 20132002 2004 2006 2008 2010

Cloud Dataflow

MapReduce

GFS Big Table

Dremel

Pregel

Flume

Colossus

SpannerMillWheel

Page 4: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

StoreCapture Analyze

BigQuery LargerHadoop

Ecosystem

HadoopSpark (on

GCE)

Pub/SubLogs

App EngineBigQuery streaming

Process

Dataflow(stream

and batch)

CloudStorage(objects)

CloudDatastore(NoSQL)

Cloud SQL(mySQL)

BigQuery StorageBigTable

(structured)

HadoopSpark (on

GCE)

Big Data on Google Cloud Platform

Page 5: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Cloud Dataflow is a collection of

SDKs for building parallelized data

processing pipelines

Cloud Dataflow is a managed service

for executing parallelized data

processing pipelines

What is Cloud Dataflow?

Page 6: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

ETL

Where might you use Cloud Dataflow?

Analysis Orchestration

Page 7: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Movement

Filtering

Enrichment

Shaping

Where might you use Cloud Dataflow?

ReductionBatch computationContinuous computation

Composition

External orchestration

Simulation

Page 8: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Dataflow SDK Concepts(Programming Model)

Page 9: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table
Page 10: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Dataflow SDK(s)

● Easily construct parallelized data processing pipelines using an intuitive set of programming abstractions○ Do what the user expects.○ No knobs whenever possible. ○ Build for extensibility.○ Unified batch & streaming semantics.

● Google supported and open sourced○ Java 7 (public) @ github.com/GoogleCloudPlatform/DataflowJavaSDK○ Python 2 (in progress)

● Community sourced○ Scala @ github.com/darkjh/scalaflow○ Scala @ github.com/jhlch/scala-dataflow-dsl

Page 11: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Dataflow Java SDK Release Process

weekly monthly

Page 12: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• A directed graph of data processing transformations

• Optimized and executed as a unit

• May include multiple inputs and multiple outputs

• May encompass many logical MapReduce or Millwheel operations

• PCollections conceptually flow through the pipeline

Pipeline

Page 13: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Runners

● Specify how a pipeline should run

● Direct Runner○ For local, in-memory execution. Great for developing and unit tests

● Cloud Dataflow Service○ batch mode: GCE instances poll for work items to execute.○ streaming mode: GCE instances are set up in a semi-permanent topology

● Community sourced○ Spark from Cloudera @ github.com/cloudera/spark-dataflow ○ Flink from dataArtisans @ github.com/dataArtisans/flink-dataflow

Page 14: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Example: #HashTag Autocompletion

Page 15: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

{d->[deflategate, desafiodatransa, djokovic], ... de->[deflategate, desafiodatransa, dead50],...}

Count

ExpandPrefixes

Top(3)

Write

Read

{d->(deflategate, 10M), d->(denver, 2M), …, sea->(seahawks, 5M), sea->(seaside, 2M), ...}

ExtractTags

{Go Hawks #Seahawks!, #Seattle works museum pass. Free! Go #PatriotsNation! Having fun at #seaside, … }

{seahawks, seattle, patriotsnation, lovemypats, ...}

{seahawks->5M, seattle->2M, patriots->9M, ...}

Tweets

Predictions

Page 16: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Count

ExpandPrefixes

Top(3)

Write

Read

ExtractTags

Tweets

Predictions

Pipeline p = Pipeline.create();

p.begin()

p.run();

.apply(ParDo.of(new ExtractTags()))

.apply(Top.largestPerKey(3))

.apply(Count.perElement())

.apply(ParDo.of(new ExpandPrefixes())

.apply(TextIO.Write.to(“gs://…”));

.apply(TextIO.Read.from(“gs://…”))

Page 17: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Pipeline• Directed graph of steps operating on data

Pipeline p = Pipeline.create();

p.run();

Dataflow Basics

Page 18: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Pipeline• Directed graph of steps operating on data

Data• PCollection

• Immutable collection of same-typed elements that can be encoded

• PCollectionTuple, PCollectionList

Pipeline p = Pipeline.create();

p.begin()

.apply(TextIO.Read.from(“gs://…”))

.apply(TextIO.Write.to(“gs://…”));

p.run();

Dataflow Basics

Page 19: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Pipeline• Directed graph of steps operating on data

Data• PCollection

• Immutable collection of same-typed elements that can be encoded

• PCollectionTuple, PCollectionList

Transformation• Step that operates on data• Core transforms

• ParDo, GroupByKey, Combine, Flatten• Composite and custom transforms

Pipeline p = Pipeline.create();

p.begin()

.apply(TextIO.Read.from(“gs://…”))

.apply(ParDo.of(new ExtractTags()))

.apply(Count.perElement())

.apply(ParDo.of(new ExpandPrefixes())

.apply(Top.largestPerKey(3))

.apply(TextIO.Write.to(“gs://…”));

p.run();

Dataflow Basics

Page 20: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Pipeline p = Pipeline.create();

p.begin()

.apply(TextIO.Read.from(“gs://…”))

.apply(ParDo.of(new ExtractTags()))

.apply(Count.perElement())

.apply(ParDo.of(new ExpandPrefixes())

.apply(Top.largestPerKey(3))

.apply(TextIO.Write.to(“gs://…”));

p.run();

class ExpandPrefixes … { ... public void processElement(ProcessContext c) { String word = c.element().getKey(); for (int i = 1; i <= word.length(); i++) { String prefix = word.substring(0, i); c.output(KV.of(prefix, c.element())); } }}

Dataflow Basics

Page 21: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• A collection of data of type T in a pipeline

• Maybe be either bounded or unbounded in size

• Created by using a PTransform to:• Build from a java.util.Collection• Read from a backing data store• Transform an existing PCollection

• Often contain key-value pairs using KV<K, V>

{Seahawks, NFC, Champions, Seattle, ...}

{..., “NFC Champions #GreenBay”, “Green Bay #superbowl!”, ... “#GoHawks”, ...}

PCollections

Page 22: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Read from standard Google Cloud Platform data sources

• GCS, Pub/Sub, BigQuery, Datastore, ...

• Write your own custom source by teaching Dataflow how to read it in parallel

• Write to standard Google Cloud Platform data sinks

• GCS, BigQuery, Pub/Sub, Datastore, …

• Can use a combination of text, JSON, XML, Avro formatted data

YourSource/Sink

Here

Inputs & Outputs

Page 23: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• A Coder<T> explains how an element of type T can be written to disk or communicated between machines

• Every PCollection<T> needs a valid coder in case the service decides to communicate those values between machines.

• Encoded values are used to compare keys -- need to be deterministic.

• Avro Coder inference can infer a coder for many basic Java objects.

Coders

Page 24: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Processes each element of a PCollection independently using a user-provided DoFn

LowerCase

ParDo (“Parallel Do”)

{Seahawks, NFC, Champions, Seattle, ...}

{seahawks, nfc, champions, seattle, ...}

Page 25: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Processes each element of a PCollection independently using a user-provided DoFn

LowerCase

{seahawks, nfc, champions, seattle, ...}

{Seahawks, NFC, Champions, Seattle, ...}

ParDo (“Parallel Do”)

PCollection<String> tweets = …;tweets.apply(ParDo.of( new DoFn<String, String>() { @Override public void processElement( ProcessContext c) { c.output(c.element().toLowerCase()); }));

Page 26: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Processes each element of a PCollection independently using a user-provided DoFn

FilterOutSWords

{NFC, Champions, ...}

ParDo (“Parallel Do”)

{Seahawks, NFC, Champions, Seattle, ...}

Page 27: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Processes each element of a PCollection independently using a user-provided DoFn

ExpandPrefixes

{s, se, sea, seah, seaha, seahaw, seahawk, seahawks, n, nf, nfc, c, ch, cha, cham, champ, champi, champio, champion, champions, s, se, sea, seat, seatt, seattl, seattle, ...}

ParDo (“Parallel Do”)

{Seahawks, NFC, Champions, Seattle, ...}

Page 28: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Processes each element of a PCollection independently using a user-provided DoFn

KeyByFirstLetter

{KV<S, Seahawks>, KV<C,Champions>, <KV<S, Seattle>, KV<N, NFC>, ...}

ParDo (“Parallel Do”)

{Seahawks, NFC, Champions, Seattle, ...}

Page 29: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Processes each element of a PCollection independently using a user-provided DoFn

• Elements are processed in arbitrary ‘bundles’ e.g. “shards”

• startBundle(), processElement()*, finishBundle()

• supports arbitrary amounts of parallelization

• Corresponds to both the Map and Reduce phases in Hadoop i.e. ParDo->GBK->ParDo

KeyByFirstLetter

{KV<S, Seahawks>, KV<C,Champions>, <KV<S, Seattle>, KV<N, NFC>, ...}

ParDo (“Parallel Do”)

{Seahawks, NFC, Champions, Seattle, ...}

Page 30: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Takes a PCollection of key-value pairs and gathers up all values with the same key

• Corresponds to the shuffle phase in Hadoop

How do you do a GroupByKey on an unbounded PCollection?

GroupByKey

{KV<S, Seahawks>, KV<C,Champions>, <KV<S, Seattle>, KV<N, NFC>, ...}

{KV<S, {Seahawks, Seattle, …}, KV<N, {NFC, …} KV<C, {Champion, …}}

GroupByKey

Page 31: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Logically divide up or groups the elements of a PCollection into finite windows

• Fixed Windows: hourly, daily, …• Sliding Windows• Sessions

• Required for GroupByKey-based transforms on an unbounded PCollection, but can also be used for bounded PCollections

• Window.into() can be called at any point in the pipeline and will be applied when needed

• Can be tied to arrival time or custom event time

Nighttime Mid-Day Nighttime

Windows

Page 32: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Event Time Skew

Proc

essi

ng T

ime

Event Time

Watermark Skew

Page 33: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Determine when to emit elements into an aggregated Window.

• Provide flexibility for dealing with time skew and data lag.

• Example use: Deal with late-arriving data. (Someone was in the woods playing Candy Crush offline.)

• Example use: Get early results, before all the data in a given window has

arrived. (Want to know # users per hour, with updates every 5 minutes.)

Triggers

Page 34: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Late & Speculative ResultsPCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(FixedWindows.of(Duration.standardMinutes(2)) .triggering( AfterEach.inOrder( Repeatedly.forever( AfterProcessingTime.pastFirstElementInPane() .alignedTo(Duration.standardMinutes(1))) .orFinally(AfterWatermark.pastEndOfWindow()), Repeatedly.forever( AfterPane.elementCountAtLeast(1))) .orFinally(AfterWatermark.pastEndOfWindow() .plusDelayOf(Duration.standardDays(7)))) .accumulatingFiredPanes()) .apply(new Sum());

Page 35: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Define new PTransforms by building up subgraphs of existing transforms

• Some utilities are included in the SDK• Count, RemoveDuplicates, Join,

Min, Max, Sum, ...

• You can define your own:• modularity, code reuse• better monitoring experience

GroupByKey

Pair With Ones

Sum Values

Count

Composite PTransforms

Page 36: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Composite PTransforms

Page 37: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Cloud Dataflow Service

Page 38: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Google Cloud Platform

Managed Service

User Code & SDK

Work Manager

Monitoring UI

Job Manager

Life of a Pipeline

Deploy & Schedule

Progress & Logs

Page 39: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• Pipeline optimization: Modular code, efficient execution

• Smart Workers: Lifecycle management, Auto-Scaling, and Dynamic Work Rebalancing

• Easy Monitoring: Dataflow UI, Restful API and CLI, Integration with Cloud Logging, Cloud Debugger, etc.

Cloud Dataflow Service Fundamentals

Page 40: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Graph Optimization

ParDo fusionProducer ConsumerSiblingIntelligent fusion boundaries

Combiner lifting e.g. partial aggregations before reductionFlatten unzippingReshard placement...

Page 41: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Optimizer: ParallelDo Fusion= ParallelDo

GBK = GroupByKey

+ = CombineValues

C D

C+D

consumer-producer sibling

C D

C+D

Page 42: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Optimizer: Combiner Lifting= ParallelDo

GBK = GroupByKey

+ = CombineValues

A GBK + B

A+ GBK + B

Page 43: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table
Page 44: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table
Page 45: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Deploy Schedule & Monitor Tear Down

Worker Lifecycle Management

Page 46: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

800 RPS 1,200 RPS 5,000 RPS 50 RPS

Worker Pool Auto-Scaling

Page 47: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

100 mins. 65 mins.

Dynamic Work Rebalancing

vs.

Page 48: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Monitoring

Page 49: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table
Page 50: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Pipeline management• Validation• Pipeline execution graph optimizations• Dynamic and adaptive sharding of computation stages• Monitoring UI

Cloud resource management• Spin worker VMs• Set up logging• Manages exports• Teardown

Fully-managed Service

Page 51: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Ease of use

• No performance tuning required• Highly scalable, performant out of the box• Novel techniques to lower e2e execution time

• Intuitive programming model + Java SDK• No dichotomy between batch and streaming processing

• Integrated with GCP (VMs, GCS, BigQuery, Datastore, …)

Total Cost of Ownership

• Benefits from GCE’s cost model

Benefits of Dataflow on Google Cloud Platform

Page 52: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

More time to dig into your data

Programming

Resource provisioning

Performance tuning

Monitoring

ReliabilityDeployment & configuration

Handling Growing ScaleUtilization

improvements

Data Processing with Google Cloud Dataflow

Typical Data Processing

Programming

Optimizing your time: no-ops, no-knobs, zero-config

Page 53: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Demo: Counting Words!

Page 54: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

• WordCount Code: See the SDK Concepts in action

• Running on the Dataflow Service• Monitoring job progress in the Dataflow

Monitoring UI• Looking at worker logs in Cloud Logging• Using the CLI

Highlights from the live demo...

Page 55: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Questions and Discussion

Page 56: Google Cloud Dataflowlinc.ucy.ac.cy/isocial/images/stories/Events/EIT... · History of Big Data at Google 2002 2004 2006 2008 2010 2012 2013 Cloud Dataflow MapReduce GFS Big Table

Getting Started

❯ cloud.google.com/dataflow/getting-started

❯ github.com/GoogleCloudPlatform/DataflowJavaSDK

❯ stackoverflow.com/questions/tagged/google-cloud-dataflow