Source: files.meetup.com/18404940/Big Data Reference Architecture - Reza Rokni.pdf

Post on 21-May-2020


Google Cloud Platform Reference Architecture (Streaming)

Reza Rokni

Introduction

Data... (GB's)

...can be big (TB's)

...really, really big! (PB's) ...but at least always batch?

[Figure: batches labeled Tuesday, Wednesday, Thursday]

...well... but at least it's on time.

[Figure: timeline with hourly ticks from 1:00 to 14:00]

...it doesn't even have the courtesy to be on time!

[Figure: timeline with hourly ticks from 8:00 to 14:00 and three events labeled 8:00]

Google confidential │ Do not distribute

Processes: Reference Architecture

Let's process some data

1,000,000's / sec, 10 / sec

• Cloud Pub/Sub: Async Messaging
• Massive Scale NoSQL: NoSQL Database Service
• Cloud Dataflow: Parallel data processing
• BigQuery: Analytics Engine
• CloudML: Machine Learning
• Cloud Storage: Object Store (File, Exports)
• Cloud Dataproc: Managed Spark / Hadoop


Processes: Reference Architecture (Capture)

Let's process some data

1,000,000's / sec, 100 / sec

• Cloud Pub/Sub: Async Messaging
• Cloud Storage: Object Store

• Globally redundant
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
• 10 MB message size
• 7 days storage for unacknowledged messages

[Figure: Cloud Pub/Sub API. Publishers A, B and C publish Messages 1, 2 and 3 to Topics A, B and C. Subscriptions XA and XB belong to Subscriber X, Subscription YC to Subscriber Y and Subscription ZC to Subscriber Z. Subscriber X receives Messages 1 and 2; Subscribers Y and Z each receive Message 3.]
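The fan-out in the Pub/Sub diagram (one topic, several subscriptions, each subscription getting its own copy of a message) can be sketched in a few lines of plain Java. This is a toy in-memory model written for illustration only: the class `FanOutSketch` and its methods are hypothetical names, not the Cloud Pub/Sub client API, and it ignores acknowledgement, expiration and push delivery.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Toy in-memory model of topic/subscription fan-out; NOT the Cloud Pub/Sub API. */
class FanOutSketch {
    // topic name -> names of the subscriptions attached to it
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    // subscription name -> messages waiting to be pulled
    private final Map<String, Queue<String>> backlog = new HashMap<>();

    void createSubscription(String subscription, String topic) {
        subscriptionsByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        backlog.put(subscription, new ArrayDeque<>());
    }

    /** Publishing copies the message into every subscription on the topic. */
    void publish(String topic, String message) {
        List<String> subscriptions =
                subscriptionsByTopic.getOrDefault(topic, Collections.emptyList());
        for (String subscription : subscriptions) {
            backlog.get(subscription).add(message);
        }
    }

    /** Returns the next message for one subscription, or null if none is waiting. */
    String pull(String subscription) {
        Queue<String> queue = backlog.get(subscription);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        FanOutSketch bus = new FanOutSketch();
        bus.createSubscription("XA", "Topic A");
        bus.createSubscription("XB", "Topic B");
        bus.createSubscription("YC", "Topic C");
        bus.createSubscription("ZC", "Topic C");

        bus.publish("Topic A", "Message 1");
        bus.publish("Topic B", "Message 2");
        bus.publish("Topic C", "Message 3");

        // Subscriber X reads XA and XB; Subscribers Y and Z both see Message 3.
        System.out.println("XA: " + bus.pull("XA"));
        System.out.println("XB: " + bus.pull("XB"));
        System.out.println("YC: " + bus.pull("YC"));
        System.out.println("ZC: " + bus.pull("ZC"));
    }
}
```

The point of the sketch is that delivery is tracked per subscription rather than per topic, which is why Subscribers Y and Z can consume Message 3 independently of each other.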


Processes: Reference Architecture

Let's process some data

1,000,000's / sec, 10 / sec

• Cloud Pub/Sub: Async Messaging
• Cloud Dataflow: Parallel data processing
• Cloud Storage: Object Store (File)

Introduction: Google Cloud Dataflow (Apache Beam)

Apache Beam (incubating), Google Cloud Dataflow

Dataflow explained: FlumeJava combined with MillWheel (extra reading)

Dataflow explained: FlumeJava - the what, not the how

FlumeJava

TextIO.Read(MarketData)
ParDo(enrichData(bidsize, ask, bid, trade))
ParDo(filterData(bidsize > x))
BigQueryIO.Write

Code shown is pseudocode only

Dataflow explained: MillWheel - framework for low-latency data processing


Processes: Optimizer fusion

[Figure: two fusion cases. Consumer-producer fusion: steps C and D become a single step C+D. Sibling fusion: steps C and D become a single step C+D.]
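Consumer-producer fusion is essentially function composition: instead of running step C over the whole input, storing the result and then running step D, the optimizer runs C+D on each element in one pass. The sketch below illustrates that idea in plain Java; `FusionSketch`, `runUnfused` and `runFused` are hypothetical names, and the two steps (trim, then length) are stand-ins chosen only for the example. It is not how the Dataflow optimizer is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Toy illustration of consumer-producer fusion as function composition. */
class FusionSketch {
    /** Unfused: step C materializes an intermediate list that step D then reads. */
    static <A, B, R> List<R> runUnfused(List<A> input, Function<A, B> c, Function<B, R> d) {
        List<B> intermediate = new ArrayList<>();
        for (A element : input) {
            intermediate.add(c.apply(element));
        }
        List<R> output = new ArrayList<>();
        for (B element : intermediate) {
            output.add(d.apply(element));
        }
        return output;
    }

    /** Fused: C+D runs as one step in one pass, with no intermediate collection. */
    static <A, B, R> List<R> runFused(List<A> input, Function<A, B> c, Function<B, R> d) {
        Function<A, R> cThenD = c.andThen(d);
        List<R> output = new ArrayList<>();
        for (A element : input) {
            output.add(cThenD.apply(element));
        }
        return output;
    }

    public static void main(String[] args) {
        Function<String, String> stepC = String::trim;     // hypothetical step C
        Function<String, Integer> stepD = String::length;  // hypothetical step D
        List<String> input = Arrays.asList(" a ", "bcd ", "  ");

        // Both plans produce the same result; only the execution shape differs.
        System.out.println(runUnfused(input, stepC, stepD));
        System.out.println(runFused(input, stepC, stepD));
    }
}
```

Sibling fusion follows the same logic for two steps that read the same input: both functions are applied during a single scan of that input instead of scanning it twice.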


Processes: Dynamic Worker Optimization

100 mins. vs. 65 mins.


Building a clickstream processing pipeline

● In this example we will
  ○ Read data from Pub/Sub
  ○ Window and aggregate the data
  ○ Do something programmatically with the data

[Figure: pipeline diagram with stages Stream, Parse Message, Window, Count, Detect Anomaly and two BigQuery sinks]


STEP 1 - Transport

[Figure: pipeline diagram with stages Clickstream, Batch Read, Parse Message, BigQuery]

Pipeline p = Pipeline.create();

// Batch read of the clickstream files
PCollection<String> dataCollection = p.apply(TextIO.Read.from("gs://…"));

dataCollection
    // ParseMessage appears to be a composite transform made of:
    //   ParDo.of(new TokenizesMessage())
    //   ParDo.of(new CreateRecords())
    .apply(new ParseMessage())
    .apply(BigQueryIO.Write.to(...));

Code shown is pseudocode only


STEP 2 - Detect

[Figure: pipeline diagram with stages Clickstream, Batch Read, Parse Message, Window, Count, Detect Anomaly and two BigQuery sinks]

Pipeline p = Pipeline.create();

p.begin()
    // Group records into fixed 60-second windows
    .apply(Window.<Record>into(FixedWindows.of(Duration.standardSeconds(60))))
    .apply(ParDo.of(new CreateEventKey()))
    // Count the records per event key within each window
    .apply(Count.perKey())
    .apply(ParDo.of(new DetectAnomaly()))

Code shown is pseudocode only
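To make the window, key, count and detect steps concrete without the SDK, here is a minimal plain-Java sketch of what a 60-second fixed window followed by a per-key count computes. Everything in it is an assumption made for illustration: the class `WindowCountSketch`, the "windowStart/key" string used as a map key, and the threshold rule in `isAnomaly`, which stands in for whatever `DetectAnomaly` does on the slide.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy fixed-window counter in plain Java; not the Dataflow/Beam SDK. */
class WindowCountSketch {
    static final long WINDOW_MILLIS = 60_000L; // 60-second fixed windows

    /** Start of the fixed window that contains the given event timestamp. */
    static long windowStart(long eventTimeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, WINDOW_MILLIS);
    }

    /** Counts events per "windowStart/key"; timestamps[i] belongs to keys[i]. */
    static Map<String, Long> countPerWindowAndKey(long[] timestamps, String[] keys) {
        Map<String, Long> counts = new TreeMap<>();
        for (int i = 0; i < timestamps.length; i++) {
            String windowAndKey = windowStart(timestamps[i]) + "/" + keys[i];
            counts.merge(windowAndKey, 1L, Long::sum);
        }
        return counts;
    }

    /** Hypothetical anomaly rule: a per-window count above a fixed threshold. */
    static boolean isAnomaly(long count, long threshold) {
        return count > threshold;
    }

    public static void main(String[] args) {
        long[] timestamps = {1_000L, 59_999L, 60_000L, 61_000L, 62_000L};
        String[] keys = {"home", "home", "home", "home", "cart"};

        Map<String, Long> counts = countPerWindowAndKey(timestamps, keys);
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue()
                    + (isAnomaly(entry.getValue(), 1) ? " (anomaly)" : ""));
        }
    }
}
```

Note that the event at 59,999 ms and the event at 60,000 ms land in different windows even though they are one millisecond apart; window boundaries are fixed, not relative to the data.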


STEP 3 - Stream

[Figure: pipeline diagram with stages Stream, Parse Message, Window, Count, Detect Anomaly and two BigQuery sinks]

Pipeline p = Pipeline.create();

p.begin()
    // .apply(TextIO.Read.from("gs://…"))   batch source, swapped for the streaming source:
    .apply(PubsubIO.Read.topic(...))
    .apply(PubsubIO.Write.topic(...))

Code shown is pseudocode only

Data Processing Tradeoffs

• Completeness (1 + 1 = 2)
• Latency
• Cost ($$$)

Requirements: Billing Pipeline

Requirements: Live Cost Estimate Pipeline

Requirements: Abuse Detection Pipeline

Requirements: Abuse Detection Backfill Pipeline

[Figure, one per pipeline: Completeness, Low Latency and Low Cost each placed on a scale from Important to Not Important]

Dataflow explained

Inherent issues when dealing with streams

Watermarks

Watermark triggers

PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());

Code shown is pseudocode only

Approximate Triggers

PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());

Code shown is pseudocode only
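What the trigger above emits is easier to see with a small simulation. The sketch below follows a single window and produces a running sum as panes: an early pane on each one-minute processing-time tick while new data has arrived, one on-time pane when the watermark passes the end of the window, and one late pane per element that arrives afterwards. It is plain Java under simplifying assumptions (one key, one window, accumulating panes, early ticks aligned to whole minutes, and the watermark passage supplied as a processing time); `TriggerSketch` and `firePanes` are hypothetical names, not SDK calls.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy simulation of early / on-time / late firings for ONE window. */
class TriggerSketch {
    static final long EARLY_PERIOD_MILLIS = 60_000L; // early firing every minute

    /**
     * arrivals: {processingTimeMillis, value} pairs, sorted by processing time.
     * watermarkPassMillis: processing time at which the watermark passes the
     * end of the window. Returns the emitted panes, e.g. "EARLY=5".
     */
    static List<String> firePanes(long[][] arrivals, long watermarkPassMillis) {
        List<String> panes = new ArrayList<>();
        long sum = 0;
        long nextEarlyFiring = EARLY_PERIOD_MILLIS;
        boolean dirty = false;       // new data since the last pane?
        boolean onTimeFired = false;

        for (long[] arrival : arrivals) {
            long processingTime = arrival[0];
            // Early panes: periodic ticks that elapsed before the watermark passed.
            while (!onTimeFired && nextEarlyFiring <= processingTime
                    && nextEarlyFiring < watermarkPassMillis) {
                if (dirty) {
                    panes.add("EARLY=" + sum);
                    dirty = false;
                }
                nextEarlyFiring += EARLY_PERIOD_MILLIS;
            }
            // On-time pane: the watermark passed before this element arrived.
            if (!onTimeFired && watermarkPassMillis <= processingTime) {
                panes.add("ON_TIME=" + sum);
                onTimeFired = true;
                dirty = false;
            }
            sum += arrival[1];
            dirty = true;
            // Late panes: after the on-time pane, every element fires (count of 1).
            if (onTimeFired) {
                panes.add("LATE=" + sum);
                dirty = false;
            }
        }
        if (!onTimeFired) {
            // No late data: flush a pending early tick, then the on-time pane.
            if (dirty && nextEarlyFiring < watermarkPassMillis) {
                panes.add("EARLY=" + sum);
            }
            panes.add("ON_TIME=" + sum);
        }
        return panes;
    }

    public static void main(String[] args) {
        long[][] arrivals = {{10_000L, 5L}, {70_000L, 7L}, {130_000L, 3L}, {200_000L, 4L}};
        // Prints [EARLY=5, EARLY=12, ON_TIME=15, LATE=19]
        System.out.println(firePanes(arrivals, 150_000L));
    }
}
```

The early panes trade completeness for latency, the on-time pane is the watermark's best estimate of a complete result, and the late pane corrects it when the watermark turned out to be too optimistic.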

Requirements: Live Cost Estimate Pipeline

[Figure: Completeness, Low Latency and Low Cost each placed on a scale from Important to Not Important]


Managed Service

[Figure: managed service on GCP. Components: User Code & SDK, Job Manager, Work Manager, Monitoring UI. Flows: Deploy & Schedule, Progress & Logs.]


Processes: Reference Architecture

Let's process some data

1,000,000's / sec, 10 / sec

• Cloud Pub/Sub: Async Messaging
• Massive Scale NoSQL: NoSQL Database Service
• Cloud Dataflow: Parallel data processing
• BigQuery: Analytics Engine
• Cloud Storage: Object Store (File, Exports)


Pipeline Consumers: BigQuery or Bigtable... or both??

• Massive Scale NoSQL: NoSQL Database Service
• BigQuery: Analytics Engine


Processes: Reference Architecture

[Figure: the full reference architecture: Cloud Pub/Sub, Massive Scale NoSQL, Cloud Dataflow, BigQuery, CloudML, Cloud Storage (File, Exports), Cloud Dataproc; 1,000,000's / sec, 10 / sec]


Machine Learning: CloudML - Data pre-processing stages

If Machine learning is the new rocket ship...

Data is the fuel!


Processes: CloudML - APIs

• Speech API
• Vision API


It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done

Terry Pratchett