Building a Big Data DWH - GOTO...


GoDataDriven, proudly part of the Xebia Group

@fzk | frisovanvollenhoven@godatadriven.com

Building a Big Data DWH

Friso van Vollenhoven, CTO

Data Warehousing on Hadoop

ETL

How to:
• Add a column to the facts table?
• Change the granularity of dates from day to hour?
• Add a dimension based on some aggregation of facts?

Schemas are designed with questions in mind.

Changing them requires redoing the ETL.


Push things to the facts level.

Keep all source data available at all times.

And now?
• MPP databases?
• Faster / better / more SAN?
• (RAC?)

distributed storage (HDFS)

distributed processing (MapReduce)

metadata + query engine (Hive)

EXTRACT
TRANSFORM
LOAD

• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix the Python standard library with any Java libs (see the sketch below)
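
The slides don't name the runtime, but Jython (Python running inside a long-lived JVM) is one way to get all three properties at once: no per-call JVM startup, Python syntax, and direct access to the Hadoop Java API. A minimal sketch under that assumption, with the Hadoop client jars on the classpath; the paths are illustrative:

    # Jython: the Hadoop Java API called from Python syntax, in one JVM.
    from org.apache.hadoop.conf import Configuration
    from org.apache.hadoop.fs import FileSystem, Path

    conf = Configuration()        # reads core-site.xml etc. from the classpath
    fs = FileSystem.get(conf)     # handle to the configured filesystem (HDFS)

    staging = Path('/data/staging')    # illustrative path
    if not fs.exists(staging):
        fs.mkdirs(staging)

    # Rename is a metadata-only operation on HDFS, so files move atomically.
    fs.rename(Path('/data/incoming/batch.tsv'),
              Path('/data/staging/batch.tsv'))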

• Flexible scheduling with dependencies (example below)
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
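
The slides don't name the scheduler. Luigi is one open-source tool with this feature set (dependency graphs, persisted output, failure e-mails, a status monitor) and serves purely as a stand-in here; task names and paths are illustrative:

    import luigi

    class ImportBrowsers(luigi.Task):
        # One day of raw browser data; DateParameter keeps the job incremental.
        date = luigi.DateParameter()

        def output(self):
            # LocalTarget keeps the sketch self-contained; in practice this
            # would point at HDFS.
            return luigi.LocalTarget(self.date.strftime('staging/%Y-%m-%d.tsv'))

        def run(self):
            with self.output().open('w') as out:
                out.write('...extracted records...\n')  # stand-in for the extract

    class TransformBrowsers(luigi.Task):
        # Runs only once ImportBrowsers for the same day has produced output.
        date = luigi.DateParameter()

        def requires(self):
            return ImportBrowsers(self.date)

        def output(self):
            return luigi.LocalTarget(self.date.strftime('hive-staging/%Y-%m-%d.tsv'))

        def run(self):
            with self.input().open() as src, self.output().open('w') as out:
                for line in src:
                    out.write(line)  # stand-in for the real transform

Running luigi --module jobs TransformBrowsers --date 2013-02-22 resolves and executes the whole chain; tasks whose output already exists are skipped, which is what makes re-runs cheap.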

Deployment

git push jenkins master

• Scheduling
• Simple deployment of ETL code
• Scalable
• Developer friendly

'februari-22 2013'

A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.

þ

TSV == thorn-separated values?

þ == 0xFE

or -2, in Hive:

CREATE TABLE browsers (
  browser_id STRING,
  browser STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '-2';
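
Why -2? Hive interprets a multi-character delimiter string as a signed byte value, and 0xFE read as a signed 8-bit integer is -2. A quick check in Python:

    import struct

    # 0xFE as an unsigned byte...
    print(0xFE)                                  # 254
    # ...reinterpreted as a signed 8-bit integer is -2,
    # which is why Hive's delimiter is written as '-2'.
    print(struct.unpack('b', bytes([0xFE]))[0])  # -2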

• The format will change
• Faulty deliveries will occur
• Your parser will break (see the sketch below)
• Records will be mistakenly produced (over-logging)
• Other people test in production too (and you get the data from it)
• Etc., etc.
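
Given that parsers will break, it pays to quarantine bad records instead of failing the whole job. A minimal sketch of that pattern; the two-field layout and the reject path are illustrative, not from the slides:

    import sys

    THORN = b'\xfe'  # the 0xFE field separator from the slides

    def parse(line):
        """Split one record; raise ValueError on an unexpected field count."""
        fields = line.rstrip(b'\n').split(THORN)
        if len(fields) != 2:  # illustrative: expecting browser_id, browser
            raise ValueError('expected 2 fields, got %d' % len(fields))
        return fields

    with open('rejects.dat', 'wb') as rejects:
        for line in sys.stdin.buffer:
            try:
                browser_id, browser = parse(line)
            except ValueError:
                rejects.write(line)  # quarantine bad records, keep going
                continue
            sys.stdout.buffer.write(browser_id + b'\t' + browser + b'\n')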

• Simple deployment of ETL code
• Scheduling
• Scalable
• Independent jobs
• Fixable data store
• Incremental where possible
• Metrics

Independent jobs

source (external)
   |  HDFS upload + move in place
   v
staging (HDFS)
   |  MapReduce + HDFS move
   v
hive-staging (HDFS)
   |  Hive map external table + SELECT INTO
   v
Hive
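
A minimal sketch of the first hop (HDFS upload + move in place), assuming a recent hadoop CLI on the PATH; paths are illustrative. Uploading to a temporary directory and then renaming means downstream jobs never see a half-written file:

    import subprocess

    def run(*cmd):
        """Run one hadoop fs command, raising on failure."""
        subprocess.check_call(list(cmd))

    local_file = 'browsers-2013-02-22.tsv'   # illustrative
    tmp_dir    = '/data/tmp/browsers'
    staging    = '/data/staging/browsers'

    # 1. Upload to a temp dir first; a partially written file never
    #    becomes visible under the staging path.
    run('hadoop', 'fs', '-mkdir', '-p', tmp_dir)
    run('hadoop', 'fs', '-put', local_file, tmp_dir)

    # 2. Move in place: HDFS rename is a cheap metadata operation,
    #    so the file appears in staging atomically.
    run('hadoop', 'fs', '-mv',
        tmp_dir + '/' + local_file,
        staging + '/' + local_file)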

Out of order jobs

• At any point, you don’t really know what ‘made it’ to Hive
• Will happen anyway, because some days the data delivery is going to be three hours late
• Or you get half in the morning and the other half later in the day
• It really depends on what you do with the data
• This is where metrics + a fixable data store help...

Fixable data store

• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives, drop the partition and re-insert (sketch below)
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything transactional, repair afterwards
• Use metrics to determine whether data is fit for purpose
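
A minimal sketch of the drop-and-reinsert repair, issued through the hive CLI; the table, column, and partition names are illustrative, not from the slides:

    import subprocess

    def hive(query):
        """Run one HiveQL statement via the hive CLI."""
        subprocess.check_call(['hive', '-e', query])

    day = '2013-02-22'  # the partition to repair (illustrative)

    # Drop the suspect partition...
    hive("ALTER TABLE facts DROP IF EXISTS PARTITION (ds='%s')" % day)

    # ...and rebuild it from the staging data, which is still on HDFS.
    hive("""
      INSERT OVERWRITE TABLE facts PARTITION (ds='%s')
      SELECT browser_id, browser FROM staging_browsers WHERE ds='%s'
    """ % (day, day))

    # Remember to reset any metrics recorded for this partition as well.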

Metrics

Metrics service

• Job ran, so many units processed, took so much time (see the sketch below)
• e.g. 10 GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
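
The slides don't describe the metrics service's API; a minimal sketch assuming a hypothetical REST endpoint that accepts JSON events:

    import json
    import urllib.request

    METRICS_URL = 'http://metrics.example.com/events'  # hypothetical endpoint

    def report(job, metric, value, unit):
        """POST one metric event; the payload shape is an assumption."""
        payload = json.dumps({
            'job': job, 'metric': metric, 'value': value, 'unit': unit,
        }).encode('utf-8')
        req = urllib.request.Request(
            METRICS_URL, data=payload,
            headers={'Content-Type': 'application/json'})
        urllib.request.urlopen(req)

    # e.g. 10 GB imported, took 1 hr
    report('import_browsers', 'bytes_imported', 10 * 1024**3, 'bytes')
    report('import_browsers', 'duration', 3600, 'seconds')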

GoDataDriven

We’re hiring / Questions? / Thank you!

@fzk | frisovanvollenhoven@godatadriven.com

Friso van Vollenhoven, CTO