Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Kent Graziano, Data Warrior LLC, Twitter @KentGraziano

description

This presentation was given at OakTable World 2014 (#OTW14) in San Francisco as a short, TED-style 10-minute talk. In it I introduce Data Vault 2.0 and its innovative approach to change data capture in a data warehouse using MD5 hash columns.

Transcript of Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Page 1: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Kent Graziano

Data Warrior LLC

Twitter @KentGraziano

Page 2: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Data Vault Definition

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise.

Dan Linstedt: Defining the Data Vault TDAN.com Article

Architected specifically to meet the needs of today’s enterprise data warehouses

Page 3: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Data Vault Time Line

[Timeline figure, axis 1960 to 2000]

● E.F. Codd invented relational modeling

● Chris Date and Hugh Darwen maintained and refined modeling

● Mid 60’s: Dimension & Fact Modeling presented by General Mills and Dartmouth University

● Early 70’s: Bill Inmon began discussing Data Warehousing

● Mid 70’s: AC Nielsen popularized Dimension & Fact terms

● 1976: Dr Peter Chen created E-R Diagramming

● Mid 80’s: Bill Inmon popularizes Data Warehousing

● Mid – Late 80’s: Dr Kimball popularizes Star Schema

● Late 80’s: Barry Devlin and Dr Kimball release “Business Data Warehouse”

● 1990: Dan Linstedt begins R&D on Data Vault Modeling

● 2000: Dan Linstedt releases first 5 articles on Data Vault Modeling

© LearnDataVault.com

Page 4: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

2014 - Next Evolution

Page 5: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

What’s New in DV2.0?

Modeling Structure Includes…

● NoSQL, and Non-Relational DB systems, Hybrid Systems

● Minor Structure Changes to support NoSQL

New ETL Implementation Standards

● For true real-time support

● For NoSQL support

New Architecture Standards

● To include support for NoSQL data management systems

New Methodology Components

● Including CMMI, Six Sigma, and TQM

● Including Project Planning, Tracking, and Oversight

● Agile Delivery Mechanisms

● Standards, and templates for Projects

© LearnDataVault.com

Page 6: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Use of MD5 Hash in DV2.0

This model is fully compliant with Hadoop and needs NO changes to work properly.

The Hash Keys can be used to join to Hadoop data sets.

MD5 PK – replaces surrogate keys

MD5DIFF – used for change detection
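
The two roles of the hash on this slide can be sketched in a few lines of Python. This is only an illustration, not the deck's implementation: the helper names (`md5_hex`, `hash_key`, `hash_diff`), the `^` separator, the upper/trim standardization, and the sample values are all assumptions borrowed from later slides or made up for the example.

```python
import hashlib

SEP = "^"  # assumed separator, matching the deck's later example


def md5_hex(text):
    """Return the 32-character uppercase hex MD5 digest of text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()


def hash_key(business_key_parts):
    """MD5 PK: hash of the standardized business key, used instead of a surrogate key."""
    return md5_hex(SEP.join(p.strip().upper() for p in business_key_parts))


def hash_diff(attribute_values):
    """MD5DIFF: hash of the descriptive attributes, used for change detection."""
    cleaned = ("" if v is None else str(v).strip().upper() for v in attribute_values)
    return md5_hex(SEP.join(cleaned))


# Hypothetical customer record: one business key, three descriptive attributes.
hub_key = hash_key(["CUST-001"])
sat_diff = hash_diff(["Alice", "Denver", None])
```

Because the key is computed from the business key alone, any platform that applies the same standardization and MD5 function arrives at the same value without a key lookup, which is what makes joining to data sets stored elsewhere possible.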

© LearnDataVault.com

Page 7: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

MD5-based Change Detection

Think Type 2 SCD

Old Way:

● Compare column by column

● Source value != Current value in DW table

● 20 columns, then 20 compares

New Way:

● Concatenate all columns to one string

● Convert to one char(32) string with hash function

● Compare to hashed value (MD5DIFF) in target table

● Does not matter how many columns
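
The old and new ways can be compared side by side in a minimal Python sketch. The function names, the `^` separator, and the sample rows are hypothetical; the point is only that the hash-based check is one comparison no matter how many columns the row has.

```python
import hashlib


def md5diff(values, sep="^"):
    """Concatenate all column values with a separator and hash the result."""
    joined = sep.join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def changed_column_by_column(source_row, current_row):
    """Old way: one comparison per column."""
    return any(s != c for s, c in zip(source_row, current_row))


def changed_by_hash(source_row, stored_md5diff):
    """New way: a single comparison against the stored MD5DIFF."""
    return md5diff(source_row) != stored_md5diff


# Hypothetical current row in the warehouse and its stored hash.
current = ("Aspirin", 500, "MG", "Tablet")
stored = md5diff(current)

incoming_same = ("Aspirin", 500, "MG", "Tablet")
incoming_new = ("Aspirin", 325, "MG", "Tablet")

unchanged_detected = changed_by_hash(incoming_same, stored)  # no new row needed
change_detected = changed_by_hash(incoming_new, stored)      # insert a new version
```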

© Data Warrior LLC

Page 8: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

What does it look like?

Encode using standard MD5 hash function

● rawtohex(sys.utl_raw.cast_to_raw(dbms_obfuscation_toolkit.md5(input_string => ...)))

Need to minimize chance of duplicates

● Without a separator, 12||3||45 and 1||2||345 both concatenate to 12345 and hash to the same value

● Need a separator between each column

● The separator also handles null values: a null column still leaves its separator in the string

● Example: Col1||'^'||Col2||'^'||Col3
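
The duplicate problem is easy to reproduce. The following Python sketch (the `md5_of` helper is hypothetical, and `None` stands in for a database null) hashes the two column splits from the slide with and without a separator, then checks that a null in different positions stays distinguishable.

```python
import hashlib


def md5_of(parts, sep=""):
    """Hash the concatenation of parts; None is treated as an empty string."""
    joined = sep.join("" if p is None else str(p) for p in parts)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


a = ["12", "3", "45"]
b = ["1", "2", "345"]

# Without a separator both rows become "12345", so the hashes are identical.
collide_without_sep = md5_of(a) == md5_of(b)

# With a separator the strings are "12^3^45" and "1^2^345".
collide_with_sep = md5_of(a, "^") == md5_of(b, "^")

# A null in a different position: "^A" versus "A^".
null_first = md5_of([None, "A"], "^")
null_second = md5_of(["A", None], "^")
```

A separator lowers the chance of duplicates but does not remove it entirely: if the separator character can occur in the data itself, two different rows can still concatenate to the same string, so it helps to pick a character that is unlikely to appear in the columns.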

© Data Warrior LLC

Page 9: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Other considerations

To generate most consistent string: standardize!

Convert data types

If 'NUMBER', 'NVARCHAR2', 'NVARCHAR', 'NCHAR'

● THEN 'TO_CHAR(' || column_name || ')'

If 'RAW'

● THEN 'ENC_BASE64(' || column_name || ')'

If 'DATE'

● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD'')'

If LIKE 'TIME%'

● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD HH24:MI:SS'')'

© Data Warrior LLC
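
The rules on this slide build SQL text from a column's data type. A rough Python equivalent of that mapping might look like the sketch below. The function name `standardize_expr` and the pass-through fallback for other types are assumptions; `ENC_BASE64` is copied from the slide as-is and is presumably a user-defined function.

```python
def standardize_expr(column_name, data_type):
    """Return SQL text that renders a column as a standardized string.

    Follows the slide's rules; types not listed there are passed
    through unchanged, which is an assumption of this sketch.
    """
    dt = data_type.upper()
    if dt in ("NUMBER", "NVARCHAR2", "NVARCHAR", "NCHAR"):
        return "TO_CHAR(" + column_name + ")"
    if dt == "RAW":
        # ENC_BASE64 comes from the slide; treated here as an existing function.
        return "ENC_BASE64(" + column_name + ")"
    if dt == "DATE":
        return "TO_CHAR(" + column_name + ", 'YYYY-MM-DD')"
    if dt.startswith("TIME"):
        return "TO_CHAR(" + column_name + ", 'YYYY-MM-DD HH24:MI:SS')"
    return column_name


# Hypothetical column names, one per rule.
amount_expr = standardize_expr("T1.MED_STRNG_AMT", "NUMBER")
date_expr = standardize_expr("T1.START_DT", "DATE")
ts_expr = standardize_expr("T1.LOAD_TS", "TIMESTAMP(6)")
name_expr = standardize_expr("T1.GENERICNAME", "VARCHAR2")
```

Fixing the date and timestamp formats explicitly matters because the default rendering can differ between sessions, and a different string would produce a different hash for the same data.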

Page 10: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Final Input String

(UPPER(TRIM(T1.GENERICNAME))
 ||'^'|| UPPER(TRIM(TO_CHAR(T1.MED_STRNG_AMT)))
 ||'^'|| UPPER(TRIM(T1.UOM_CD))
 ||'^'|| UPPER(TRIM(T1.MED_FORM_NM))
 ||'^')
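
To see what this expression produces for an actual row, here is a small Python imitation of it. The column names come from the slide, while the sample row, the helper name `final_input_string`, and the use of Python instead of SQL are assumptions made for illustration.

```python
import hashlib

COLUMNS = ["GENERICNAME", "MED_STRNG_AMT", "UOM_CD", "MED_FORM_NM"]


def final_input_string(row, columns, sep="^"):
    """Imitate UPPER(TRIM(...)) per column, with a separator after every column."""
    pieces = []
    for name in columns:
        value = row.get(name)
        text = "" if value is None else str(value)
        # strip(" ") removes only spaces, like TRIM's default behavior.
        pieces.append(text.strip(" ").upper() + sep)
    return "".join(pieces)


# Hypothetical source row with inconsistent case and stray spaces.
row = {"GENERICNAME": " ibuprofen ", "MED_STRNG_AMT": 200,
       "UOM_CD": "mg", "MED_FORM_NM": "Tablet"}

input_string = final_input_string(row, COLUMNS)
md5diff = hashlib.md5(input_string.encode("utf-8")).hexdigest().upper()

# The same data with different case and padding standardizes to the same hash.
row_variant = {"GENERICNAME": "IBUPROFEN", "MED_STRNG_AMT": 200,
               "UOM_CD": " MG", "MED_FORM_NM": "tablet "}
variant_md5diff = hashlib.md5(
    final_input_string(row_variant, COLUMNS).encode("utf-8")).hexdigest().upper()
```

For the sample row the input string is `IBUPROFEN^200^MG^TABLET^`, so cosmetic differences in case or padding do not register as changes, while a real change in any column does.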

© Data Warrior LLC

Page 11: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

So what?

MD5 hash is consistent cross-platform

Changes multi-column compares to a single column

All compares take the same time during load process

Can use with any DW architecture that requires change detection

Virtually no limit

● Think Big Data/Hadoop/NoSQL

Can generate the input string automatically

● But that is another talk!

© Data Warrior LLC

Page 12: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Learn more about Data Vault

www.LearnDataVault.com

www.danlinstedt.com

On YouTube:

www.youtube.com/LearnDataVault

On Facebook:

www.facebook.com/learndatavault

Page 13: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Super Charge Your Data Warehouse

Available on Amazon.com

Soft Cover or Kindle Format

Now also available in PDF at LearnDataVault.com

Page 14: Data Vault 2.0: Using MD5 Hashes for Change Data Capture

Contact Information

Kent Graziano

The Oracle Data Warrior

Data Warrior LLC

[email protected]

On Twitter @KentGraziano

Visit my blog at http://kentgraziano.com