Big Data on Google Cloud

William Vambenepe, Lead Product Manager for Big Data, Google Cloud Platform

@vambenepe / [email protected]

Using Cloud Dataflow, BigQuery, and friends to process data “the Cloud way”

Agenda

1. Big Data at Google

2. Managing data through its lifecycle

3. Google Cloud Dataflow

4. Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...

5. Optimizing your time

6. References and follow-up

Big Data at Google

Building on Google’s infrastructure

1.5 million devices activated every day (over a billion devices)

6 billion hours watched every month (10h uploaded every minute)

20 billion pages crawled every day

Hardware and data center innovation

Software innovation

[Timeline, axis 2002, 2004, 2006, 2008, 2010, 2012, 2013: GFS, MapReduce, Bigtable, Colossus, Dremel, Pregel, Flume, MillWheel, Spanner. Products shown: Cloud Dataflow, BigQuery]

Managing data through its lifecycle

Data lifecycle

[Diagram: Stream and Batch inputs (Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine) -> Cloud Dataflow -> BigQuery Storage (tables) and Cloud Storage (files) -> Cloud Dataflow and BigQuery Analytics (SQL) -> Real time analytics & alerts]

Data usage organization maturity lifecycle

1. Descriptive

2. Exploratory + Descriptive

3. Predictive + Exploratory + Descriptive

4. Prescriptive + Predictive + Exploratory + Descriptive

Supporting organizations with operational ease of use

● no administration

● most powerful tools in the easiest way

● constant experimentation with low risks & cost

● easy collaboration across teams and organizations

● low costs without requiring usage commitments

● best performance & virtually unlimited scale

● always on

Google Cloud Dataflow

What is Cloud Dataflow?

Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines

↳ Download from GitHub: https://github.com/GoogleCloudPlatform/DataflowJavaSDK

Cloud Dataflow is a managed service for executing parallelized data processing pipelines

↳ Use on Google Cloud: https://cloud.google.com/dataflow/

Cloud Dataflow SDK - Logical Model

Pipeline {
    Who   => Inputs
    What  => Transforms
    Where => Windows
    When  => Watermarks + Triggers
    To    => Outputs
}

Unified programming model for both batch & stream processing.
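
To make the logical model concrete, here is a minimal sketch in plain Python. It does not use the Dataflow SDK. The event tuples, the `normalize` and `window_start` helpers, and the 60-second window size are all hypothetical, and watermarks and triggers are left out entirely. It only illustrates how inputs, a transform, fixed windows, and outputs fit together, and that the same function accepts a finite list (batch) or any other iterable such as a generator (a stand-in for a stream).

```python
from collections import defaultdict

# Hypothetical input records: (timestamp_seconds, key, value).
events = [
    (1, "A", 10),
    (42, "b", 5),
    (61, "a", 7),
    (75, "a", 3),
    (130, "B", 1),
]

def normalize(event):
    """Transform step: lowercase the key, leave the rest untouched."""
    ts, key, value = event
    return ts, key.lower(), value

def window_start(ts, size):
    """Window step: map a timestamp to the start of its fixed window."""
    return ts - (ts % size)

def run_pipeline(inputs, window_size=60):
    """Inputs -> transform -> fixed windows -> per-window, per-key sums."""
    sums = defaultdict(int)
    for ts, key, value in map(normalize, inputs):
        sums[(window_start(ts, window_size), key)] += value
    return dict(sums)

# Batch: a finite list. A generator would work the same way.
result = run_pipeline(events)
print(result)
```

Each output key is a `(window_start, key)` pair, so the two "a" events at 61 and 75 seconds are summed together in the window that starts at 60.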

Cloud Dataflow Pipeline

• A Directed Acyclic Graph of data processing transformations

• Can be submitted to the Dataflow Service for optimization and execution or executed on an alternate runner e.g. Spark

• May include multiple inputs and multiple outputs

• May encompass many logical MapReduce operations

• PCollections flow through the pipeline
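
The idea of a pipeline as a directed acyclic graph with several inputs and several outputs can be sketched in a few lines of standard-library Python (3.9 or later, for `graphlib`). This is an illustration only, not the Dataflow SDK. The step names, the lambdas, and the `run_dag` helper are hypothetical, and plain lists stand in for the collections that flow between steps.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: name -> (function, names of upstream steps).
steps = {
    "read_clicks": (lambda: [3, 1, 2], []),
    "read_views": (lambda: [10, 20], []),
    "merge": (lambda clicks, views: clicks + views,
              ["read_clicks", "read_views"]),
    "total": (lambda merged: sum(merged), ["merge"]),
    "largest": (lambda merged: max(merged), ["merge"]),
}

def run_dag(steps):
    """Run every step once, after all of its upstream steps have run."""
    graph = {name: deps for name, (_, deps) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        results[name] = func(*(results[dep] for dep in deps))
    return results

outputs = run_dag(steps)
print(outputs["total"], outputs["largest"])
```

Two sources feed one merge step, and two independent outputs are derived from it. A cycle in `steps` would make `TopologicalSorter` raise `CycleError`, which is why the graph has to be acyclic.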

Life of a Dataflow Pipeline

[Diagram components: User Code & SDK; Google Cloud Platform Managed Service with Job Manager and Work Manager; Deploy & Schedule; Progress & Logs; Monitoring UI]

Stages: Graph optimization -> Deploy -> Schedule & Monitor -> Tear Down

Worker Optimization

Worker Lifecycle Management throughout batch execution: 100 mins. vs. 65 mins.

Continuous worker scaling for long-lived streaming pipelines, over time: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS
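
The slides do not say how the service picks a worker count, so the following is only a rough sketch of what scaling with load could look like. The per-worker capacity of 100 requests per second, the minimum and maximum bounds, and the `workers_needed` function are all assumptions made for illustration. Only the four RPS figures come from the slide.

```python
import math

def workers_needed(rps, per_worker_rps=100, min_workers=1, max_workers=100):
    """Size the worker pool to the observed load, within fixed bounds.

    per_worker_rps, min_workers and max_workers are assumed values.
    """
    needed = math.ceil(rps / per_worker_rps)
    return max(min_workers, min(max_workers, needed))

# Load figures quoted on the slide, in requests per second.
observed_rps = [800, 1200, 5000, 50]
plan = [workers_needed(rps) for rps in observed_rps]
print(plan)
```

Under these assumptions the pool grows as the load climbs to 5,000 RPS and shrinks back to the minimum when it drops to 50 RPS, instead of staying provisioned for the peak.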

Portability: Cloud Dataflow Runners

• Run the same code in multiple modes using different runners

• Direct Runner
  • For local, in-memory execution
  • Great for developing and unit tests

• Cloud Dataflow Service Runner
  • Runs on the fully managed Dataflow Service
  • Your code runs distributed across GCE instances

• Community sourced
  • Spark runner @ github.com/cloudera/spark-dataflow
  • Flink runner coming soon from dataArtisans

The most productive and portable Data pipeline SDK.
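
The runner idea, one piece of user code executed in different modes, can be sketched as follows. These classes are not the SDK's runner classes. `DirectRunner` here only borrows the name from the slide, `ThreadPoolRunner` is a hypothetical stand-in for a distributed runner, and `pipeline` is a placeholder for user code.

```python
from concurrent.futures import ThreadPoolExecutor

def pipeline(element):
    """User code: the same per-element logic whichever runner executes it."""
    return element * element

class DirectRunner:
    """Runs everything in the calling thread, in memory."""

    def run(self, fn, data):
        return [fn(x) for x in data]

class ThreadPoolRunner:
    """Hypothetical stand-in for a distributed runner: fans out to threads."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, fn, data):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            # Executor.map returns results in input order.
            return list(pool.map(fn, data))

data = [1, 2, 3, 4]
local = DirectRunner().run(pipeline, data)
parallel = ThreadPoolRunner().run(pipeline, data)
print(local == parallel)
```

Because the user code never refers to a runner, a unit test can exercise it with the in-memory runner while production uses a different one.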

Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...

Cloud Pub/Sub

Many-to-many asynchronous messaging

Fast and reliable
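
As an illustration of many-to-many asynchronous messaging, here is a toy in-memory broker. It is not the Cloud Pub/Sub API. The `Broker` class, its method names, and the topic and subscriber names are hypothetical. The point is that publishers and subscribers never reference each other, and that each subscriber receives its own copy of every message and consumes it at its own pace.

```python
from collections import defaultdict, deque

class Broker:
    """Toy in-memory broker: each subscriber gets its own queue per topic."""

    def __init__(self):
        # topic -> {subscriber name: queue of undelivered messages}
        self._subscriptions = defaultdict(dict)

    def subscribe(self, topic, subscriber):
        self._subscriptions[topic][subscriber] = deque()

    def publish(self, topic, message):
        # Publishing returns immediately; delivery happens on pull.
        for inbox in self._subscriptions[topic].values():
            inbox.append(message)

    def pull(self, topic, subscriber):
        """Return the next message for this subscriber, or None if empty."""
        inbox = self._subscriptions[topic][subscriber]
        return inbox.popleft() if inbox else None

broker = Broker()
broker.subscribe("events", "dashboard")
broker.subscribe("events", "archiver")
broker.publish("events", "from-publisher-1")
broker.publish("events", "from-publisher-2")
```

After the two `publish` calls, both "dashboard" and "archiver" hold both messages, and pulling from one does not affect the other.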

BigQuery

● Ingest data via streaming (100K rows/second/table) or file loader

● Process interactive SQL queries on TB or PB of data

● Zero administration; just upload data and send queries

● Pay for storage and query separately, based on actual usage

● Non-technical analysts can drive queries on massive datasets using BI tools (e.g. Tableau)

● Highly Available: Data replication in multiple geographies.

● Secure and easy collaboration: access to data is controlled using customer-owned ACLs
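
To give the streaming figure above some scale, the arithmetic can be written out. The rate of 100K rows per second per table is the one quoted on the slide. The one-billion-row table and the `streaming_seconds` helper are hypothetical.

```python
# Streaming ingest rate per table, as quoted on the slide.
STREAMING_ROWS_PER_SECOND = 100_000

def streaming_seconds(total_rows, rate=STREAMING_ROWS_PER_SECOND):
    """Time to stream total_rows into one table at a sustained rate."""
    return total_rows / rate

# Hypothetical table size: one billion rows.
seconds = streaming_seconds(1_000_000_000)
hours = seconds / 3600
print(f"{seconds:.0f} s, about {hours:.1f} h")
```

At the quoted rate, one billion rows take 10,000 seconds, a little under three hours, if the rate is sustained into a single table.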

Hadoop and Spark

[Diagram: bdutil orchestration of a cluster with a Master Node, Name Node (optional), and Work Nodes with HDFS (optional); storage on Local SSD, PD SSD, or PD standard; Connectors: GCS Connector, BigQuery Connector]

Optimizing your time

References and follow-up

Getting Started

Cloud Dataflow
● Service: https://cloud.google.com/dataflow
● Questions: https://stackoverflow.com/questions/tagged/google-cloud-dataflow
● SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK

BigQuery
● https://cloud.google.com/bigquery/

Cloud Pub/Sub
● https://cloud.google.com/pubsub/

Hadoop and Spark
● https://cloud.google.com/hadoop/

Contact me
● Twitter: @vambenepe
● email: [email protected]

Thank You!

cloud.google.com