Google Cloud Platformfiles.meetup.com/18404940/Big Data Reference Architecture - Reza Rokni.pdfData...

Google Cloud Platform Reference Architecture (Streaming)

Reza Rokni

Introduction

Data ...

... can be big

... really, really big! ... but at least always batch?

[Figure: data arriving by day: Tuesday, Wednesday, Thursday]

... well ... but at least it's on time.

[Figure: timeline with hourly ticks from 1:00 to 14:00]

... it doesn't even have the courtesy to be on time!

[Figure: timeline with hourly ticks from 8:00 to 14:00]


Reference Architecture: Let’s process some data

• Cloud Pub/Sub: Async Messaging
• Massive Scale NoSQL: NoSQL Database Service
• Cloud Dataflow: Parallel data processing
• BigQuery: Analytics Engine
• CloudML: Machine Learning
• Cloud Storage: Object Store (Exports)
• Cloud Dataproc: Managed Spark and Hadoop
• Cloud Storage: Object Store

Throughput labels on the diagram: 1,000,000s/sec and 100s/sec

Capture

• Globally redundant
• Batched read/write
• Custom labels
• Push & Pull
• Auto expiration
• 10 MB message size
• 7 days of storage for unacknowledged messages

Cloud Pub/Sub

[Diagram: Publishers A, B and C send Messages 1, 2 and 3 through the Cloud Pub/Sub API to Topics A, B and C. Subscriptions XA, XB, YC and ZC connect the topics to Subscribers X, Y and Z.]
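The topic and subscription layout in the diagram can be sketched in plain Java. This is a toy in-memory model, not the Cloud Pub/Sub client library: the class `ToyPubSub` and its methods are hypothetical names, and which message goes to which topic is an assumption made for the example. It shows one property of the model: every subscription attached to a topic receives its own copy of each published message.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Toy in-memory model of topics and subscriptions (hypothetical, not the real client API).
final class ToyPubSub {
    // topic name -> names of the subscriptions attached to it
    private final Map<String, List<String>> subscriptionsByTopic = new HashMap<>();
    // subscription name -> messages waiting to be pulled
    private final Map<String, Queue<String>> backlog = new HashMap<>();

    void createSubscription(String subscription, String topic) {
        subscriptionsByTopic.computeIfAbsent(topic, t -> new ArrayList<>()).add(subscription);
        backlog.put(subscription, new ArrayDeque<>());
    }

    // Publishing copies the message into the backlog of every subscription on the topic.
    void publish(String topic, String message) {
        List<String> subs = subscriptionsByTopic.getOrDefault(topic, Collections.<String>emptyList());
        for (String subscription : subs) {
            backlog.get(subscription).add(message);
        }
    }

    // Pull returns the oldest waiting message, or null when nothing is waiting.
    String pull(String subscription) {
        Queue<String> queue = backlog.get(subscription);
        return queue == null ? null : queue.poll();
    }

    public static void main(String[] args) {
        ToyPubSub pubsub = new ToyPubSub();
        pubsub.createSubscription("YC", "Topic C");
        pubsub.createSubscription("ZC", "Topic C");
        pubsub.publish("Topic C", "Message 3");
        System.out.println(pubsub.pull("YC"));
        System.out.println(pubsub.pull("ZC"));
    }
}
```

Subscriptions YC and ZC both sit on Topic C, so Subscribers Y and Z each see the message independently of one another.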


Google Cloud Dataflow (Apache Beam)

Apache Beam (incubating) and Google Cloud Dataflow

Dataflow explained: FlumeJava combined with MillWheel (extra reading)

Dataflow explained: FlumeJava - the what, not the how

TextIO.Read(MarketData)

ParDo(enrichData(bidsize, ask, bid, trade))

ParDo(filterData(bidsize > x))

BigQueryIO.Write

Code shown is pseudocode only.
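The four boxes above (read, enrich, filter, write) can be mimicked in plain Java to show the shape of the computation. This is only a sketch with no Dataflow SDK involved: the class `MarketDataSketch`, the CSV layout `symbol,bidsize,bid,ask` and the idea that "enrich" means parsing a line into typed fields are all assumptions, since the slide does not say what the enrichment does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MarketDataSketch {
    // Hypothetical record shape; the slide only names the fields bidsize, ask, bid and trade.
    static final class Quote {
        final String symbol;
        final long bidSize;
        final double bid;
        final double ask;

        Quote(String symbol, long bidSize, double bid, double ask) {
            this.symbol = symbol;
            this.bidSize = bidSize;
            this.bid = bid;
            this.ask = ask;
        }
    }

    // "Enrich" step: turn a raw line into a typed record (assumed layout: symbol,bidsize,bid,ask).
    static Quote enrich(String line) {
        String[] f = line.split(",");
        return new Quote(f[0].trim(), Long.parseLong(f[1].trim()),
                Double.parseDouble(f[2].trim()), Double.parseDouble(f[3].trim()));
    }

    // Read -> enrich -> filter (bidsize > x) -> collect, mirroring the four boxes.
    static List<Quote> run(List<String> lines, long minBidSize) {
        List<Quote> out = new ArrayList<>();
        for (String line : lines) {
            Quote quote = enrich(line);
            if (quote.bidSize > minBidSize) {
                out.add(quote);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Quote> kept = run(Arrays.asList("ABC,100,9.5,9.7", "XYZ,5,1.0,1.1"), 10);
        for (Quote quote : kept) {
            System.out.println(quote.symbol + " bidsize=" + quote.bidSize);
        }
    }
}
```

The point of the slide is that the pipeline author states only these steps; how they are distributed across workers is left to the service.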

Dataflow explained: MillWheel - framework for low-latency data processing

Optimizer fusion: consumer-producer and sibling
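Consumer-producer fusion can be illustrated in plain Java as function composition. This is a sketch of the idea only, not how the Dataflow optimizer is implemented: `FusionSketch`, `runUnfused` and `runFused` are hypothetical names. The unfused version materializes an intermediate collection between two steps, while the fused version composes the steps and makes a single pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

final class FusionSketch {
    // Unfused: each step is its own pass and an intermediate list is built in between.
    static <A, B, C> List<C> runUnfused(List<A> input, Function<A, B> first, Function<B, C> second) {
        List<B> intermediate = new ArrayList<>();
        for (A a : input) {
            intermediate.add(first.apply(a));
        }
        List<C> output = new ArrayList<>();
        for (B b : intermediate) {
            output.add(second.apply(b));
        }
        return output;
    }

    // Consumer-producer fusion: compose the two steps and run them in one pass.
    static <A, B, C> List<C> runFused(List<A> input, Function<A, B> first, Function<B, C> second) {
        Function<A, C> fused = first.andThen(second);
        List<C> output = new ArrayList<>();
        for (A a : input) {
            output.add(fused.apply(a));
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(" a ", "bb");
        System.out.println(FusionSketch.<String, String, Integer>runFused(input, String::trim, String::length));
    }
}
```

Both versions return the same result; the fused one avoids writing and re-reading the intermediate data, which is where the saving comes from when each step would otherwise be a separate distributed stage.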

100 mins. vs. 65 mins.

Dynamic Worker Optimization

[Pipeline diagram: Stream, Parse Message, Window, Detect Anomaly, BigQuery (shown twice)]

Building a clickstream processing pipeline

● In this example we will
  ○ Read data from Pub/Sub
  ○ Window and aggregate the data
  ○ Do something programmatically with the data

[Pipeline diagram: Batch Read (Clickstream), Parse Message, BigQuery]

STEP 1 - Transport

Pipeline p = Pipeline.create();

// Batch read of the clickstream from Cloud Storage
PCollection<String> dataCollection = p.apply(TextIO.Read.from("gs://…"));

dataCollection.apply(new ParseMessage())

    // Steps shown alongside ParseMessage on the slide
    ParDo.of(new TokenizesMessage())

    ParDo.of(new CreateRecords())

    .apply(BigQueryIO.Write.to(...));

[Pipeline diagram: Batch Read (Clickstream), Parse Message, Window, Detect Anomaly, BigQuery (shown twice)]

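STEP 2 - Detect

p.begin()

    // Assign each record to a fixed 60-second window
    .apply(Window.<Record>into(FixedWindows.of(Duration.standardSeconds(60))))

    .apply(ParDo.of(new CreateEventKey()))

    // Slide shorthand; the SDK offers Count.perElement() and Count.perKey()
    .apply(Count)

    .apply(ParDo.of(new DetectAnomaly()))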
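What the window, count and detect steps compute can be sketched without the SDK. The following plain-Java example is an illustration under assumptions: `WindowedCountSketch` is a hypothetical class, event keys are left out for brevity, and "anomaly" is taken to mean a window whose count exceeds a fixed threshold, which the slides do not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WindowedCountSketch {
    // Start of the fixed window that contains the timestamp (non-negative millis assumed).
    static long windowStart(long timestampMillis, long windowSizeMillis) {
        return timestampMillis - (timestampMillis % windowSizeMillis);
    }

    // Count events per fixed window, keyed by window start and sorted by time.
    static Map<Long, Integer> countPerWindow(long[] eventTimesMillis, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long t : eventTimesMillis) {
            counts.merge(windowStart(t, windowSizeMillis), 1, Integer::sum);
        }
        return counts;
    }

    // Hypothetical anomaly rule: flag windows whose count is above the threshold.
    static List<Long> anomalousWindows(Map<Long, Integer> counts, int threshold) {
        List<Long> flagged = new ArrayList<>();
        for (Map.Entry<Long, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > threshold) {
                flagged.add(entry.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        long[] times = {1_000L, 2_000L, 59_999L, 60_000L, 125_000L};
        Map<Long, Integer> counts = countPerWindow(times, 60_000L);
        System.out.println(counts);
        System.out.println(anomalousWindows(counts, 2));
    }
}
```

In the real pipeline the same grouping is done per key and in parallel; the sketch only shows how a timestamp maps to a 60-second window and how the counts feed the detection step.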

[Pipeline diagram: Stream, Parse Message, Window, Detect Anomaly, BigQuery (shown twice)]

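STEP 3 - Stream

p.begin()

    // Batch source from STEP 1 ...
    .apply(TextIO.Read.from("gs://…"))

    // ... swapped for a streaming source
    .apply(PubsubIO.Read.topic(...))

    // Results published back to Pub/Sub
    .apply(PubsubIO.Write.topic(...))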

Data Processing Tradeoffs

• Completeness (1 + 1 = 2)
• Latency
• Cost

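Requirements: Billing Pipeline

[Chart: Completeness, Low Latency and Low Cost, each rated from Important to Not Important]

Requirements: Live Cost Estimate Pipeline

[Chart: Completeness, Low Latency and Low Cost, each rated from Important to Not Important]

Requirements: Abuse Detection Pipeline

[Chart: Completeness, Low Latency and Low Cost, each rated from Important to Not Important]

Requirements: Abuse Detection Backfill Pipeline

[Chart: Completeness, Low Latency and Low Cost, each rated from Important to Not Important]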

Dataflow explained

Inherent issues when dealing with streams

Watermarks

Watermark triggers

PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()))
    .apply(Sum.integersPerKey());
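The behaviour of a watermark trigger can be imitated in plain Java. The sketch below is not SDK code: `WatermarkTriggerSketch` is a hypothetical class, keys are omitted, event times are assumed non-negative, and data arriving after its window has fired is simply dropped. It sums values into two-minute event-time windows and emits a window only once the watermark reaches the window's end.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class WatermarkTriggerSketch {
    static final long WINDOW_MILLIS = 2 * 60 * 1000L;

    private final Map<Long, Integer> openSums = new TreeMap<>(); // window start -> running sum
    private final List<String> emitted = new ArrayList<>();
    private long watermark = Long.MIN_VALUE;

    // Add a value to the window that contains its event time.
    void onElement(long eventTimeMillis, int value) {
        long start = eventTimeMillis - (eventTimeMillis % WINDOW_MILLIS);
        if (start + WINDOW_MILLIS <= watermark) {
            return; // the window has already fired: late data is dropped in this sketch
        }
        openSums.merge(start, value, Integer::sum);
    }

    // Advance the watermark and fire every open window whose end it has reached.
    void advanceWatermark(long newWatermarkMillis) {
        watermark = Math.max(watermark, newWatermarkMillis);
        Iterator<Map.Entry<Long, Integer>> it = openSums.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Long, Integer> entry = it.next();
            long end = entry.getKey() + WINDOW_MILLIS;
            if (end <= watermark) {
                emitted.add("[" + entry.getKey() + "," + end + ") sum=" + entry.getValue());
                it.remove();
            }
        }
    }

    List<String> emitted() {
        return emitted;
    }

    public static void main(String[] args) {
        WatermarkTriggerSketch sketch = new WatermarkTriggerSketch();
        sketch.onElement(10_000L, 5);
        sketch.onElement(70_000L, 7);
        sketch.advanceWatermark(120_000L);
        System.out.println(sketch.emitted());
    }
}
```

The trade-off is visible in the sketch: nothing is emitted before the watermark arrives, which adds latency, and anything that shows up afterwards is lost, which costs completeness.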

Approximate Triggers

PCollection<KV<String, Integer>> scores = input
    .apply(Window
        .into(FixedWindows.of(Duration.standardMinutes(2)))
        .triggering(AtWatermark()
            .withEarlyFirings(AtPeriod(Duration.standardMinutes(1)))
            .withLateFirings(AtCount(1))))
    .apply(Sum.integersPerKey());
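The early and late firings can be imitated for a single window in plain Java. This is a simplified sketch, not SDK behaviour: `EarlyLateFiringSketch` is a hypothetical class, early firings happen when new data exists and at least one period of processing time has passed since the previous early firing, every late element fires immediately (the `AtCount(1)` case), and each pane reports the running total.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one event-time window with early, on-time and late firings.
final class EarlyLateFiringSketch {
    private final long windowEndMillis;   // event-time end of the window
    private final long earlyPeriodMillis; // processing-time gap between early firings
    private long lastFiringMillis;        // processing time of the previous early firing
    private int sum = 0;                  // running total reported in every pane
    private boolean newData = false;      // data seen since the previous firing
    private boolean watermarkPassed = false;
    private final List<String> panes = new ArrayList<>();

    EarlyLateFiringSketch(long windowEndMillis, long earlyPeriodMillis, long startMillis) {
        this.windowEndMillis = windowEndMillis;
        this.earlyPeriodMillis = earlyPeriodMillis;
        this.lastFiringMillis = startMillis;
    }

    void onElement(int value) {
        sum += value;
        if (watermarkPassed) {
            panes.add("LATE:" + sum); // fire once per late element
        } else {
            newData = true;
        }
    }

    // Called as processing time advances; emits a speculative early pane when due.
    void onProcessingTime(long nowMillis) {
        if (!watermarkPassed && newData && nowMillis - lastFiringMillis >= earlyPeriodMillis) {
            panes.add("EARLY:" + sum);
            newData = false;
            lastFiringMillis = nowMillis;
        }
    }

    // Called as the watermark advances; emits the on-time pane exactly once.
    void onWatermark(long watermarkMillis) {
        if (!watermarkPassed && watermarkMillis >= windowEndMillis) {
            watermarkPassed = true;
            panes.add("ON_TIME:" + sum);
        }
    }

    List<String> panes() {
        return panes;
    }

    public static void main(String[] args) {
        EarlyLateFiringSketch window = new EarlyLateFiringSketch(120_000L, 60_000L, 0L);
        window.onElement(5);
        window.onProcessingTime(60_000L);
        window.onElement(7);
        window.onWatermark(120_000L);
        window.onElement(3);
        System.out.println(window.panes());
    }
}
```

Early panes give a low-latency estimate, the on-time pane marks the point where the watermark says the input is probably complete, and late panes correct the result afterwards.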

Requirements: Live Cost Estimate Pipeline

[Chart: Completeness, Low Latency and Low Cost, each rated from Important to Not Important]

Managed Service

[Diagram: User Code & SDK, Job Manager, Work Manager, Monitoring UI]

Pipeline Consumers: BigQuery or Bigtable... or both?

• CloudML: Machine Learning
• Cloud Dataproc: Managed Spark and Hadoop

Machine Learning: CloudML - data pre-processing stages

If Machine learning is the new rocket ship...

Data is the fuel!

CloudML - APIs

• Speech API
• Vision API

"It is well known that a vital ingredient of success is not knowing that what you're attempting can't be done."

Terry Pratchett
