Data Architectures for Robust Decision Making


Transcript of Data Architectures for Robust Decision Making

Page 1: Data Architectures for Robust Decision Making

Designing Data Architectures for Robust Decision Making

Gwen Shapira / Software Engineer

Page 2: Data Architectures for Robust Decision Making

©2014 Cloudera, Inc. All rights reserved.

About Me

• 15 years of moving data around

• Formerly consultant

• Now Cloudera Engineer:
  – Sqoop Committer
  – Kafka
  – Flume

• @gwenshap

Page 3: Data Architectures for Robust Decision Making


There’s a book on that!

Page 4: Data Architectures for Robust Decision Making


About you:

You know Hadoop

Page 5: Data Architectures for Robust Decision Making

“Big Data” is stuck at The Lab.

Page 6: Data Architectures for Robust Decision Making


We want to move to The Factory

Page 7: Data Architectures for Robust Decision Making


Page 8: Data Architectures for Robust Decision Making


What does it mean to “Systemize”?

• Ability to easily add new data sources

• Easily improve and expand analytics

• Ease data access by standardizing metadata and storage

• Ability to discover mistakes and to recover from them

• Ability to safely experiment with new approaches


Page 9: Data Architectures for Robust Decision Making


We will not discuss:

• Actual decision making

• Data Science

• Machine learning

• Algorithms

We will discuss:

• Architectures

• Patterns

• Ingest

• Storage

• Schemas

• Metadata

• Streaming

• Experimenting

• Recovery

Page 10: Data Architectures for Robust Decision Making


So how do we build real data architectures?


Page 11: Data Architectures for Robust Decision Making


The Data Bus

Page 12: Data Architectures for Robust Decision Making

Data pipelines start like this:

[diagram: one Client writing to one Source]

Page 13: Data Architectures for Robust Decision Making

Then we reuse them:

[diagram: several Clients all writing to the same Source]

Page 14: Data Architectures for Robust Decision Making

Then we add consumers to the existing sources:

[diagram: several Clients writing to a Backend, with Another Backend also consuming from it]

Page 15: Data Architectures for Robust Decision Making

Then it starts to look like this:

[diagram: many Clients and many Backends wired together point-to-point]

Page 16: Data Architectures for Robust Decision Making

With maybe some of this:

[diagram: the same point-to-point tangle, with even more cross-connections between Clients and Backends]

Page 17: Data Architectures for Robust Decision Making


Adding applications should be easier

We need:

• Shared infrastructure for sending records

• Infrastructure must scale

• Set of agreed-upon record schemas

Page 18: Data Architectures for Robust Decision Making

Kafka-Based Ingest Architecture

Kafka decouples data pipelines:

[diagram: Source Systems → Producers → Kafka Brokers → Consumers → Hadoop, Security Systems, Real-time Monitoring, Data Warehouse]

Page 19: Data Architectures for Robust Decision Making


Retain All Data


Page 20: Data Architectures for Robust Decision Making


Data Pipeline – Traditional View

[diagram: Input → Raw data → Clean data → Enriched data → Aggregated data → Output; everything between input and output is treated as a waste of disk space]

Page 21: Data Architectures for Robust Decision Making

It is all valuable data

[diagram: the same pipeline, but raw, clean, enriched, aggregated, and filtered data each feed dashboards, reports, data scientists, and alerts ("OMG")]

Page 22: Data Architectures for Robust Decision Making


Hadoop-Based ETL – The FileSystem is the DB

/user/…

/user/gshapira/testdata/orders

/data/<database>/<table>/<partition>

/data/<biz unit>/<app>/<dataset>/partition

/data/pharmacy/fraud/orders/date=20131101

/etl/<biz unit>/<app>/<dataset>/<stage>

/etl/pharmacy/fraud/orders/validated
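To make the convention concrete, here is a small sketch (hypothetical helper names, not from the talk) that builds paths in these two layouts:

```python
def data_path(biz_unit, app, dataset, partition):
    """Published datasets: /data/<biz unit>/<app>/<dataset>/<partition>."""
    return "/".join(["/data", biz_unit, app, dataset, partition])

def etl_path(biz_unit, app, dataset, stage, partition=None):
    """Intermediate data: /etl/<biz unit>/<app>/<dataset>/<stage>[/<partition>]."""
    parts = ["/etl", biz_unit, app, dataset, stage]
    if partition is not None:
        parts.append(partition)
    return "/".join(parts)
```

Encoding business unit, application, dataset, and stage in the path is what lets new jobs find their inputs without asking anyone.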

Page 23: Data Architectures for Robust Decision Making


Store intermediate data

/etl/<biz unit>/<app>/<dataset>/<stage>/<dataset_id>

/etl/pharmacy/fraud/orders/raw/date=20131101

/etl/pharmacy/fraud/orders/deduped/date=20131101

/etl/pharmacy/fraud/orders/validated/date=20131101

/etl/pharmacy/fraud/orders_labs/merged/date=20131101

/etl/pharmacy/fraud/orders_labs/aggregated/date=20131101

/etl/pharmacy/fraud/orders_labs/ranked/date=20131101


Page 24: Data Architectures for Robust Decision Making


Batch ETL is old news


Page 25: Data Architectures for Robust Decision Making


Small Problem!

• HDFS is optimized for large chunks of data

• Don't write individual events or micro-batches

• Think 100MB-2GB batches

• What do we do with all the small events?
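One common answer is to buffer the small events and only write large chunks. A toy sketch of the idea (the class name and threshold are made up; in practice a collection tool like Flume handles the buffering and file rolling):

```python
class BatchWriter:
    """Accumulate small events in memory and flush them as one large chunk.

    flush_bytes is a hypothetical threshold; HDFS-friendly values are in the
    hundreds of megabytes (the slides suggest 100MB-2GB per batch).
    """
    def __init__(self, flush_bytes):
        self.flush_bytes = flush_bytes
        self.buffer = []
        self.buffered = 0
        self.flushed = []   # stands in for the files written to HDFS

    def write(self, event):
        self.buffer.append(event)
        self.buffered += len(event)
        if self.buffered >= self.flush_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flushed.append(b"".join(self.buffer))
            self.buffer, self.buffered = [], 0
```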


Page 26: Data Architectures for Robust Decision Making


Well, we have this data bus…


[diagram: a Kafka topic with three partitions; each partition is an ordered, append-only log with offsets 0 through 13, writes always go to the new end, and old offsets remain readable]
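The log abstraction above can be modeled in a few lines. This toy sketch (not real Kafka client code) shows why ordering and replay come for free: each partition is an append-only list, and an event's position in it is its offset:

```python
class Topic:
    """Toy model of a Kafka topic: numbered partitions, each an
    append-only log where an event's position is its offset."""
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, event):
        # Keyed events hash to a fixed partition, so per-key order is preserved.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(event)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

    def read(self, partition, offset):
        # Consumers poll by offset; old events stay readable until retention expires.
        return self.partitions[partition][offset]
```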

Page 27: Data Architectures for Robust Decision Making


Kafka has topics

How about?

<biz unit>.<app>.<dataset>.<stage>

pharmacy.fraud.orders.raw

pharmacy.fraud.orders.deduped

pharmacy.fraud.orders.validated

pharmacy.fraud.orders_labs.merged
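A pair of hypothetical helpers can enforce this naming scheme and recover the components later:

```python
def topic_name(biz_unit, app, dataset, stage):
    """Build a topic name following <biz unit>.<app>.<dataset>.<stage>."""
    return ".".join([biz_unit, app, dataset, stage])

def parse_topic(name):
    """Recover the components from a conforming topic name."""
    biz_unit, app, dataset, stage = name.split(".")
    return {"biz_unit": biz_unit, "app": app,
            "dataset": dataset, "stage": stage}
```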


Page 28: Data Architectures for Robust Decision Making


It’s (almost) all topics

[diagram: the pipeline again, with raw, clean, enriched, aggregated, and filtered data flowing as topics into dashboards, reports, data scientists, and alerts ("OMG")]

Page 29: Data Architectures for Robust Decision Making


Benefits

• Recover from accidents

• Debug suspicious results

• Fix algorithm errors

• Experiment with new algorithms

• Expand pipelines

• Jump-start expanded pipelines


Page 30: Data Architectures for Robust Decision Making


Kinda Lambda

Page 31: Data Architectures for Robust Decision Making


Lambda Architecture

• Immutable events

• Store intermediate stages

• Combine Batches and Streams

• Reprocessing
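As a sketch of how the batch and speed layers combine (hand-rolled, not from any Lambda framework): the batch layer periodically recomputes from the full immutable event history, and the serving layer merges in the speed layer's recent increments:

```python
def batch_view(events):
    """Batch layer: recompute per-key counts over the full, immutable history."""
    counts = {}
    for key, n in events:
        counts[key] = counts.get(key, 0) + n
    return counts

def merged_view(batch, speed):
    """Serving layer: batch results plus the speed layer's recent increments."""
    out = dict(batch)
    for key, n in speed.items():
        out[key] = out.get(key, 0) + n
    return out
```

Because the batch layer starts from immutable events, any streaming bug is repaired on the next full recompute.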


Page 32: Data Architectures for Robust Decision Making


What we don’t like

Maintaining two applications

Often in two languages

That do the same thing


Page 33: Data Architectures for Robust Decision Making


Pain Avoidance #1 – Use Spark + Spark Streaming

• Spark is awesome for batch, so why not?
  – The New Kid that isn’t that New Anymore
  – Easily 10x less code
  – Extremely Easy and Powerful API
  – Very good for machine learning
  – Scala, Java, and Python
  – RDDs
  – DAG Engine

Page 34: Data Architectures for Robust Decision Making


Spark Streaming

• Calling Spark in a Loop

• Extends RDDs with DStream

• Very Little Code Changes from ETL to Streaming


Page 35: Data Architectures for Robust Decision Making


Spark Streaming

[diagram: before the first batch, each Source feeds a Receiver that accumulates an RDD; in each batch interval the completed RDD flows through a single pass of Filter → Count → Print while the Receiver starts accumulating the next RDD]

Page 36: Data Architectures for Robust Decision Making


Small Example

val sparkConf = new SparkConf()
  .setMaster(args(0)).setAppName(this.getClass.getCanonicalName)
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Create the DStream from data sent over the network
val dStream = ssc.socketTextStream(args(1), args(2).toInt,
  StorageLevel.MEMORY_AND_DISK_SER)

// Count the errors in each RDD in the stream
// (ErrorCount.countErrors is defined elsewhere in the example)
val errCountStream = dStream.transform(rdd => ErrorCount.countErrors(rdd))

// Keep a running error count across batches
val stateStream = errCountStream.updateStateByKey[Int](updateFunc)

errCountStream.foreachRDD(rdd => {
  System.out.println("Errors this minute:%d".format(rdd.first()._2))
})
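For readers without a Spark cluster handy, the same logic can be sketched in plain Python; the function names mirror the helpers the Scala example assumes (`countErrors`, `updateFunc`), but this is a stand-in, not Spark:

```python
def count_errors(batch):
    """Stand-in for ErrorCount.countErrors: count lines containing "ERROR"."""
    return [("ERROR", sum(1 for line in batch if "ERROR" in line))]

def update_func(new_counts, running):
    """Stand-in for updateFunc: fold this batch's counts into running state."""
    return (running or 0) + sum(new_counts)

def run(batches):
    """Process each micro-batch: per-batch counts plus a running total,
    the way transform + updateStateByKey combine in the Scala example."""
    state = {}
    per_batch = []
    for batch in batches:
        for key, n in count_errors(batch):
            state[key] = update_func([n], state.get(key))
            per_batch.append(n)
    return per_batch, state
```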


Page 37: Data Architectures for Robust Decision Making


Pain Avoidance #2 – Split the Stream

Why do we even need stream + batch?

• Batch efficiencies

• Re-process to fix errors

• Re-process after delayed arrival

What if we could re-play data?
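Replay is trivial when processing is a pure function of the retained log. A toy sketch (hypothetical, stdlib-only): run two versions of the app over the same log and compare their result sets:

```python
def replay(log, app):
    """Re-run a (pure) streaming app over the retained log to rebuild results."""
    return [app(event) for event in log]

log = [3, 1, 4, 1, 5]          # the retained, immutable event log

def app_v1(x):
    return x * 2               # original algorithm

def app_v2(x):
    return x * 2 + 1           # candidate fix
```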


Page 38: Data Architectures for Robust Decision Making


Let’s Re-Process with a New Algorithm


[diagram: the retained Kafka log is replayed through Streaming App v1 and Streaming App v2 in parallel; an App then compares Result set 1 against Result set 2]


Page 40: Data Architectures for Robust Decision Making


Oh no, we just got a bunch of data for yesterday!


[diagram: the retained log is split by time; one Streaming App processes today’s events while a second Streaming App re-processes yesterday’s]
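With the raw events retained, handling late arrivals is just recomputing the affected partition. A hand-rolled sketch (hypothetical names, daily partitions assumed):

```python
def aggregate_by_day(events):
    """Sum (day, value) events into one result per day."""
    out = {}
    for day, value in events:
        out[day] = out.get(day, 0) + value
    return out

def reprocess_day(results, retained_events, day):
    """Late data arrived for `day`: recompute just that day's partition from
    the retained log and overwrite the stale result. Other days are untouched."""
    results[day] = sum(v for d, v in retained_events if d == day)
    return results
```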

Page 41: Data Architectures for Robust Decision Making


Note:

No need to choose between the approaches.

There are good reasons to do both.


Page 42: Data Architectures for Robust Decision Making


Prediction:

Batch vs. Streaming distinction is going away.


Page 43: Data Architectures for Robust Decision Making


Yes, you really need a Schema


Page 44: Data Architectures for Robust Decision Making


Schema is a MUST HAVE for data integration


Page 45: Data Architectures for Robust Decision Making

[diagram: the point-to-point tangle of Clients and Backends from before]

Page 46: Data Architectures for Robust Decision Making

Remember that we want this?

[diagram: Source Systems → Producers → Kafka Brokers → Consumers → Hadoop, Security Systems, Real-time Monitoring, Data Warehouse]

Page 47: Data Architectures for Robust Decision Making


This means we need this:

[diagram: the same architecture, with a Schema Repository attached to Kafka so producers and consumers agree on record formats]

Page 48: Data Architectures for Robust Decision Making


We can do it in a few ways

• People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”

• There’s utility code for reading/writing messages that everyone reuses

• Schema embedded in the message

• A centralized repository for schemas
  – Each message has a Schema ID
  – Each topic has a Schema ID
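The centralized-repository option can be sketched in a few lines (a toy stand-in, not a real schema-registry client): schemas are registered once, and each message carries only the small ID:

```python
class SchemaRegistry:
    """Toy centralized schema repository: register once, reference by ID."""
    def __init__(self):
        self._schemas = {}
        self._next_id = 0

    def register(self, schema):
        self._next_id += 1
        self._schemas[self._next_id] = schema
        return self._next_id

    def get(self, schema_id):
        return self._schemas[schema_id]

def encode(schema_id, payload):
    """A message carries its schema ID alongside the payload, so any
    consumer can look up how to interpret the bytes."""
    return {"schema_id": schema_id, "payload": payload}
```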


Page 49: Data Architectures for Robust Decision Making


I ❤ Avro

• Define Schema

• Generate code for objects

• Serialize / Deserialize into Bytes or JSON

• Embed schema in files / records… or not

• Support for our favorite languages… Except Go.

• Schema Evolution
  – Add and remove fields without breaking anything
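Schema evolution is the key property. A heavily simplified sketch of Avro-style resolution (real Avro matches fields by name and fills gaps from the reader schema's defaults; this toy version models a schema as a name → default map):

```python
def resolve(reader_schema, record):
    """Read a writer's record with the reader's schema: fields the writer
    didn't know about take the reader's default; extra writer fields are
    ignored. reader_schema maps field name -> default value."""
    return {name: record.get(name, default)
            for name, default in reader_schema.items()}
```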


Page 50: Data Architectures for Robust Decision Making


Schemas are Agile

• Leave out MySQL and your favorite DBA for a second

• Schemas allow adding readers and writers easily

• Schemas allow modifying readers and writers independently

• Schemas can evolve as the system grows

• Allows validating data soon after it’s written
  – No need to throw away data that doesn’t fit!


Page 51: Data Architectures for Robust Decision Making


Page 52: Data Architectures for Robust Decision Making


Woah, that was lots of stuff!


Page 53: Data Architectures for Robust Decision Making


Recap – if you remember nothing else…

• After the POC, it’s time for production

• Goal: evolve fast without breaking things

For this you need:

• Keep all data

• Design pipelines for error recovery – batch or stream

• Integrate with a data bus

• And schemas

Page 54: Data Architectures for Robust Decision Making

Thank you