Building a Big Data DWH - GOTO...

GoDataDriven, proudly part of the Xebia Group

@fzk [email protected]

Building a Big Data DWH: Data Warehousing on Hadoop

Friso van Vollenhoven, CTO

Transcript of Building a Big Data DWH - GOTO...

Page 1: Building a Big Data DWH - GOTO Conferencegotocon.com/dl/goto-amsterdam-2013/slides/FrisovanVollenhoven_Building... · -- Wikipedia “In computing, a data warehouse or enterprise



ETL


How to:
• Add a column to the facts table?
• Change the granularity of dates from day to hour?
• Add a dimension based on some aggregation of facts?


Schemas are designed with questions in mind.

Changing a schema requires redoing the ETL.

Push things to the facts level.

Keep all source data available at all times.


And now?
• MPP databases?
• Faster / better / more SAN?
• (RAC?)


distributed storage

distributed processing

metadata + query engine


EXTRACT

TRANSFORM

LOAD


• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix Python standard library with any Java libs


• Flexible scheduling with dependencies
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control


Deployment

git push jenkins master


• Scheduling
• Simple deployment of ETL code
• Scalable
• Developer friendly


'februari-22 2013'


A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.


þ


TSV == thorn separated values?


þ == 0xFE


or -2, in Hive:

    CREATE TABLE browsers (
      browser_id STRING,
      browser STRING
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '-2';
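Where the -2 comes from: þ is the single byte 0xFE in Latin-1, and that byte read as a signed 8-bit value (as a Java byte would be) is -2. The Python sketch below only illustrates that arithmetic and how a thorn-separated line could be split; the sample record and the helper name `split_thorn_line` are made up for illustration and are not from the talk.

```python
import struct
from typing import List

THORN = "þ"

# In Latin-1 (ISO-8859-1), thorn encodes to the single byte 0xFE.
raw = THORN.encode("latin-1")

# Interpreted as a signed 8-bit integer (like a Java byte), 0xFE is -2.
signed = struct.unpack("b", raw)[0]


def split_thorn_line(line: bytes) -> List[str]:
    """Split one thorn-separated record into its text fields."""
    return [field.decode("latin-1") for field in line.split(b"\xfe")]


# Hypothetical sample record, shaped like the browsers table above.
sample = "42þFirefox".encode("latin-1")
fields = split_thorn_line(sample)
print(raw, signed, fields)
```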


• The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakenly produced (over-logging)
• Other people test in production too (and you get the data from it)
• Etc., etc.
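One way to live with these failure modes is a parser that sets bad lines aside instead of aborting the whole job. The following is a minimal sketch under stated assumptions: a two-field, thorn-separated record layout, with `parse_records` and `EXPECTED_FIELDS` as hypothetical names. It is not the parser used in the talk.

```python
from typing import Iterable, List, Tuple

EXPECTED_FIELDS = 2   # assumption: browser_id and browser
DELIMITER = "\u00fe"  # thorn


def parse_records(lines: Iterable[str]) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split input lines into parsed records and rejected raw lines."""
    good: List[Tuple[str, str]] = []
    rejected: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split(DELIMITER)
        if len(fields) == EXPECTED_FIELDS and all(fields):
            good.append((fields[0], fields[1]))
        else:
            # Keep the raw line so it can be inspected and re-processed later.
            rejected.append(line)
    return good, rejected


lines = ["1\u00feFirefox\n", "2\u00feChrome\n", "broken line\n", "3\u00fe\n"]
good, rejected = parse_records(lines)
print(len(good), "parsed,", len(rejected), "rejected")
```

Counting the rejected lines per delivery gives an early signal that the format has changed.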


• Simple deployment of ETL code
• Scheduling
• Scalable
• Independent jobs
• Fixable data store
• Incremental where possible
• Metrics


Independent jobs

source (external)
  → HDFS upload + move in place
staging (HDFS)
  → MapReduce + HDFS move
hive-staging (HDFS)
  → Hive map external table + SELECT INTO
Hive
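The "upload + move in place" step can be sketched as writing under a temporary name and then renaming, so that a downstream job never sees a half-written file. This is a minimal sketch that uses the local filesystem as a stand-in for HDFS; the function name `upload_then_move` and the file names are hypothetical.

```python
import os
import tempfile


def upload_then_move(data: bytes, final_dir: str, name: str) -> str:
    """Write data under a temporary name, then rename it to its final name."""
    os.makedirs(final_dir, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, prefix="." + name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
        final_path = os.path.join(final_dir, name)
        os.replace(tmp_path, final_path)  # atomic on POSIX filesystems
    except BaseException:
        # Clean up the partial upload so it is never mistaken for real data.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


staging = tempfile.mkdtemp()  # stand-in for a staging directory
path = upload_then_move(b"1\xfeFirefox\n", staging, "browsers.tsv")
print(os.listdir(staging))
```

Because each stage only picks up files that have reached their final name, the jobs can run independently of one another.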


Out of order jobs

• At any point, you don’t really know what ‘made it’ to Hive
• Will happen anyway, because some days the data delivery is going to be three hours late
• Or you get half in the morning and the other half later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...


Fixable data store

• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives, drop the partition and re-insert
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything transactional, repair afterwards
• Use metrics to determine whether data is fit for purpose
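The drop-and-re-insert repair could look roughly like the sketch below, which only builds the HiveQL statements and resets an in-memory metric. The table names, the `dt` partition column, the selected columns, and both function names are assumptions for illustration. Nothing here validates the interpolated names, so it is not safe for untrusted input.

```python
from typing import Dict, List, Tuple


def repair_partition_statements(table: str, staging_table: str, dt: str) -> List[str]:
    """Build HiveQL that drops one date partition and re-inserts it from staging."""
    return [
        f"ALTER TABLE {table} DROP IF EXISTS PARTITION (dt='{dt}')",
        f"INSERT INTO TABLE {table} PARTITION (dt='{dt}') "
        f"SELECT browser_id, browser FROM {staging_table} WHERE dt='{dt}'",
    ]


def reset_metrics(metrics: Dict[Tuple[str, str], int], table: str, dt: str) -> None:
    """Forget counts recorded for the partition so they can be recorded afresh."""
    metrics.pop((table, dt), None)


# Hypothetical record count previously stored for the partition being repaired.
metrics = {("browsers", "2013-02-22"): 60_000_000}
reset_metrics(metrics, "browsers", "2013-02-22")
statements = repair_partition_statements("browsers", "browsers_staging", "2013-02-22")
for statement in statements:
    print(statement)
```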


Metrics


Metrics service

• Job ran, so many units processed, took so much time
• e.g. 10GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
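A metrics record of this shape can be sketched as a small in-memory log. The class and field names (`JobMetric`, `MetricsLog`, `throughput`) are hypothetical; a real metrics service would persist the entries somewhere, which this sketch does not attempt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JobMetric:
    job: str
    units: float
    unit_name: str
    seconds: float

    @property
    def throughput(self) -> float:
        """Units processed per second."""
        return self.units / self.seconds


class MetricsLog:
    """Append-only list of job metrics, kept in memory for illustration."""

    def __init__(self) -> None:
        self.entries: List[JobMetric] = []

    def record(self, job: str, units: float, unit_name: str, seconds: float) -> JobMetric:
        metric = JobMetric(job, units, unit_name, seconds)
        self.entries.append(metric)
        return metric


log = MetricsLog()
# The two examples from the slide: 10GB in 1 hr, 60M records in 10 minutes.
imported = log.record("import", 10, "GB", 3600)
transformed = log.record("transform", 60_000_000, "records", 600)
print(f"{transformed.throughput:.0f} {transformed.unit_name}/s")
```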


GoDataDriven

We’re hiring / Questions? / Thank you!

@fzk [email protected]

Friso van Vollenhoven, CTO