Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

36
Agile Data Warehouse Modeling: Introduction to Data Vault Modeling Kent Graziano Data Warrior LLC Twitter @KentGraziano

description

This is a presentation I gave at OUGF14 in Helsinki, Finland. Data Vault Data Modeling is an agile data modeling technique for designing highly flexible, scalable, and adaptable data structures for enterprise data warehouse repositories. It is a hybrid approach using the best of 3NF and dimensional modeling. It is not a replacement for star schema data marts (and should not be used as such). This approach has been used in projects around the world (Europe, Australia, USA) for the last 10 years but is still not widely known or understood. The purpose of this presentation is to provide attendees with a detailed introduction to the components of the Data Vault Data Model, what they are for and how to build them. The examples will give attendees the basics for how to build, and design structures incrementally, without constant refactoring, when using the Data Vault modeling technique. This technique works well for: • Building the Enterprise Data Warehouse repository in a CIF architecture • Building a Persistent Staging Area (PSA) in a Kimball Bus Architecture • Building your data model incrementally, one sprint at a time using a repeatable technique • Providing a model that is easily extensible without need to re-engineer existing structure or load processes

Transcript of Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Page 1: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Agile Data Warehouse Modeling:Introduction to Data Vault Modeling

Kent Graziano

Data Warrior LLC

Twitter @KentGraziano

Page 2: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Agenda

Bio What do we mean by Agile? What is a Data Vault? Where does it fit in an DW/BI architecture How to design a Data Vault model Being “agile”

#OUGF14

Page 3: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

My Bio

Oracle ACE Director Certified Data Vault Master and DV 2.0 Architect Member: Boulder BI Brain Trust Data Architecture and Data Warehouse Specialist

● 30+ years in IT● 25+ years of Oracle-related work● 20+ years of data warehousing experience

Co-Author of ● The Business of Data Vault Modeling ● The Data Model Resource Book (1st Edition)

Past-President of ODTUG and Rocky Mountain Oracle User Group

#OUGF14

Page 4: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Manifesto for Agile Software Development

● “We are uncovering better ways of developing software by doing it and helping others do it.

● Through this work we have come to value:

● Individuals and interactions over processes and tools

● Working software over comprehensive documentation

● Customer collaboration over contract negotiation

● Responding to change over following a plan

● That is, while there is value in the items on the right, we value the items on the left more.”

● http://agilemanifesto.org/

#OUGF14

Page 5: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Applying the Agile Manifesto to DW

● User Stories instead of requirements documents

● Time-boxed iterations● Iteration has a standard length● Choose one or more user stories to fit in that

iteration

● Rework is part of the game● There are no “missed requirements”... only

those that haven’t been delivered or discovered yet.

(C) Kent Graziano#OUGF14

Page 6: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault Definition

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

Dan Linstedt: Defining the Data VaultTDAN.com Article

Architected specifically to meet the needs of today’s enterprise data warehouses

#OUGF14

Page 7: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

What is Data Vault Trying to Solve?

● What are our other Enterprise

Data Warehouse options?● Third-Normal Form (3NF): Complex

primary keys (PK’s) with cascading

snapshot dates

● Star Schema (Dimensional): Difficult to

reengineer fact tables for granularity

changes

● Difficult to get it right the first

time

● Not adaptable to rapid

business change

● NOT AGILE!

(C) Kent Graziano#OUGF14

Page 8: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault Time Line

20001960 1970 1980 1990

E.F. Codd invented relational modeling

Chris Date and Hugh Darwen Maintained and Refined Modeling

1976 Dr Peter ChenCreated E-R Diagramming

Early 70’s Bill Inmon Began Discussing Data Warehousing

Mid 60’s Dimension & Fact Modeling presented by General Mills and Dartmouth University

Mid 70’s AC Nielsen PopularizedDimension & Fact Terms

Mid – Late 80’s Dr Kimball Popularizes Star Schema

Mid 80’s Bill InmonPopularizes Data Warehousing

Late 80’s – Barry Devlin and Dr Kimball Release “Business Data Warehouse”

1990 – Dan Linstedt Begins R&D on Data Vault Modeling

2000 – Dan Linstedt releases first 5 articles on Data Vault Modeling

#OUGF14

Page 9: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault Evolution

The work on the Data Vault approach began in the early 1990s, and completed around 1999.

Throughout 1999, 2000, and 2001, the Data Vault design was tested, refined, and deployed into specific customer sites.

In 2002, the industry thought leaders were asked to review the architecture.

● This is when I attend my first DV seminar in Denver and met Dan!

In 2003, Dan began teaching the modeling techniques to the mass public.

Now in 2014, Dan introduced DV 2.0!(C) Kent Graziano

#OUGF14

Page 10: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Where does a Data Vault Fit?

Staging

EDWDATA VAULT

Data Marts

(Star Schemas)

Data Marts

(Star Schemas)

Data Marts

(Star Schemas)

#OUGF14

Page 11: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Where does Data Vault fit?

Data Vault goes here#OUGF14

Page 12: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

How to be Agile using DV

● Model iteratively● Use Data Vault data modeling technique● Create basic components, then add over time

● Virtualize the Access Layer● Don’t waste time building facts and dimensions up front● ETL and testing takes too long● “Project” objects using pattern-based DV model with

database views (or BI meta layer)

● Users see real reports with real data● Can always build out for performance in

another iteration

(C) Kent Graziano#OUGF14

Page 13: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault: 3 Simple Structures

1 Hub

2 Link

3Satellite

#OUGF14

Page 14: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault Core Architecture

Hubs = Unique List of Business Keys Links = Unique List of Relationships across

keys Satellites = Descriptive Data

Satellites have one and only one parent table Satellites cannot be “Parents” to other tables Hubs cannot be child tables

© LearnDataVault.com#OUGF14

Page 15: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Common Attributes

Required – all structures● Primary key – PK● Load date time stamp – DTS● Record source – REC_SRC

Required – Satellites only● Load end date time stamp – LEDTS● Optional in DV 2.0

Optional – Extract Dates –Extrct_DTS Optional – Hubs & Links only

● Last seen dates – LSDTs● MD5KEY

Optional – Satellites only● Load sequence ID – LDSEQ_ID● Update user – UPDT_USER● Update DTS – UPDT_DTS● MD5DIFF

© LearnDataVault.com#OUGF14

Page 16: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

1. Hub = Business Keys

Hubs = Unique Lists of Business KeysBusiness Keys are used to TRACK and IDENTIFY key informationNew: DV 2.0 includes MD5 of the BK to link to Hadoop/NoSQL

(C) Kent Graziano #OUGF14

Page 17: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

2: Links = Associations

Links = Transactions and Associations

They are used to hook together multiple sets of information

In DV 2.0 the BK attributes migrate to the Links for faster query

(C) Kent Graziano#OUGF14

Page 18: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Modeling Links - 1:1 or 1:M?

Today:

● Relationship is a 1:1 so why model a Link?

Tomorrow:

● The business rule can change to a 1:M.

● You discover new data later.

With a Link in the Data Vault:

● No need to change the EDW structure.

● Existing data is fine.

● New data is added.

(C) Kent Graziano#OUGF14

Page 19: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

3. Satellites = Descriptors

• Satellites provide context for the Hubs and the Links

• Tracks changes over time

• Like SCD 2

(C) Kent Graziano#OUGF14

Page 20: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

This model is partially compliant with Hadoop.

The Hash Keys can be used to join to Hadoop data sets.

Note: Business Keys replicated to the Link structure for “join” capabilities on the way out to Data Marts.

What’s New in DV2.0?

© LearnDataVault.com

#OUGF14

Page 21: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault Model Flexibility (Agility)

Goes beyond standard 3NF• Hyper normalized

● Hubs and Links only hold keys and meta data● Satellites split by rate of change and/or source

• Enables Agile data modeling● Easy to add to model without having to change existing structures

and load routines• Relationships (links) can be dropped and created on-demand.

● No more reloading history because of a missed requirement

Based on natural business keys• Not system surrogate keys• Allows for integrating data across functions and source

systems more easily● All data relationships are key driven.

#OUGF14

Page 22: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault Extensibility

Adding new components to the EDW has NEAR ZERO impact to:• Existing Loading

Processes• Existing Data Model• Existing Reporting & BI

Functions• Existing Source Systems• Existing Star Schemas

and Data Marts

(C) LearnDataVault.com #OUGF14

Page 23: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

● Standardized modeling rules• Highly repeatable and learnable modeling technique• Can standardize load routines

● Delta Driven process● Re-startable, consistent loading patterns.

• Can standardize extract routines● Rapid build of new or revised Data Marts

• Can be automated‣ Can use a BI-meta layer to virtualize the reporting

structures‣ Example: OBIEE Business Model and Mapping tool‣ Example: BOBJ Universe Business Layer

‣ Can put views on the DV structures as well‣ Simulate ODS/3NF or Star Schemas

Data Vault Productivity

(C) Kent Graziano#OUGF14

Page 24: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

• The Data Vault holds granular historical relationships.

• Holds all history for all time, allowing any source system feeds to be reconstructed on-demand

• Easy generation of Audit Trails for data lineage and compliance.

• Data Mining can discover new relationships between elements

• Patterns of change emerge from the historical pictures and linkages.

• The Data Vault can be accessed by power-users

Data Vault Adaptability

(C) Kent Graziano#OUGF14

Page 25: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Other Benefits of a Data Vault

Modeling it as a DV forces integration of the Business Keys upfront.• Good for organizational alignment.

An integrated data set with raw data extends it’s value beyond BI:• Source for data quality projects• Source for master data • Source for data mining • Source for Data as a Service (DaaS) in an SOA (Service Oriented Architecture).

Upfront Hub integration simplifies the data integration routines required to load data marts.

• Helps divide the work a bit.

It is much easier to implement security on these granular pieces. Granular, re-startable processes enable pin-point failure correction. It is designed and optimized for real-time loading in its core

architecture (without any tweaks or mods).

#OUGF14

Page 26: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

#OUGF14

Page 27: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Worlds Smallest Data Vault

The Data Vault doesn’t have to be “BIG”.

An Data Vault can be built incrementally.

Reverse engineering one component of the existing models is not uncommon.

Building one part of the Data Vault, then changing the marts to feed from that vault is a best practice.

The smallest Enterprise Data Warehouse consists of two tables: ● One Hub, ● One Satellite

Hub_Cust_Seq_ID

Hub_Cust_Num

Hub_Cust_Load_DTS

Hub_Cust_Rec_Src

Hub Customer

Hub_Cust_Seq_ID

Sat_Cust_Load_DTS

Sat_Cust_Load_End_DTS

Sat_Cust_Name

Sat_Cust_Rec_Src

Satellite Customer Name

#OUGF14

Page 28: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Notably…

In 2008 Bill Inmon stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)

The number of Data Vault users in the US surpassed 500 in 2010 and grows rapidly (http://danlinstedt.com/about/dv-customers/)

#OUGF14

Page 29: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Organizations using Data Vault

WebMD Health Services Anthem Blue-Cross Blue Shield MD Anderson Cancer Center Denver Public Schools Independent Purchasing Cooperative (IPC, Miami)

• Owner of Subway

Kaplan US Defense Department Colorado Springs Utilities State Court of Wyoming Federal Express US Dept. Of Agriculture

#OUGF14

Page 30: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

What’s New in DV2.0?

Modeling Structure Includes…● NoSQL, and Non-Relational DB systems, Hybrid Systems● Minor Structure Changes to support NoSQL

New ETL Implementation Standards● For true real-time support● For NoSQL support

New Architecture Standards● To include support for NoSQL data management systems

New Methodology Components● Including CMMI, Six Sigma, and TQM● Including Project Planning, Tracking, and Oversight● Agile Delivery Mechanisms● Standards, and templates for Projects

© LearnDataVault.com

#OUGF14

Page 31: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

This model is fully compliant with Hadoop, needs NO changes to work properly

RISK: Key Collision

What’s New in DV2.0?

© LearnDataVault.com

#OUGF14

Page 32: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Summary

• Data Vault provides a data modeling technique that allows:

‣ Model Agility‣ Enabling rapid changes and additions

‣ Productivity‣ Enabling low complexity systems with high

value output at a rapid pace‣ Easy projections of dimensional models

‣ So? Agile Data Warehousing?

#OUGF14

Page 33: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Super Charge Your Data Warehouse

Available on Amazon.comSoft Cover or Kindle Format

Now also available in PDF at LearnDataVault.com

Hint: Kent is the Technical Editor

#OUGF14

Page 34: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Data Vault References

www.learndatavault.comwww.danlinstedt.com

On YouTube: www.youtube.com/LearnDataVault

On Facebook: www.facebook.com/learndatavault

Page 35: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling
Page 36: Agile Data Warehouse Modeling: Introduction to Data Vault Data Modeling

Contact Information

Kent GrazianoThe Oracle Data Warrior

Data Warrior [email protected]

Visit my blog at

http://kentgraziano.com