Data Vault 2.0: Using MD5 Hashes for Change Data Capture
Kent Graziano
Data Warrior LLC
Twitter @KentGraziano
Data Vault Definition
The Data Vault is a detail oriented, historical tracking
and uniquely linked set of normalized tables that
support one or more functional areas of business.
It is a hybrid approach encompassing the best of
breed between 3rd normal form (3NF) and star
schema. The design is flexible, scalable, consistent
and adaptable to the needs of the enterprise.
Dan Linstedt: Defining the Data Vault TDAN.com Article
Architected specifically to meet the needs
of today’s enterprise data warehouses
Data Vault Time Line
● Mid 60’s: Dimension & Fact Modeling presented by General Mills and Dartmouth University
● E.F. Codd invented relational modeling
● Chris Date and Hugh Darwen Maintained and Refined Modeling
● Early 70’s: Bill Inmon Began Discussing Data Warehousing
● Mid 70’s: AC Nielsen Popularized Dimension & Fact Terms
● 1976: Dr Peter Chen Created E-R Diagramming
● Mid 80’s: Bill Inmon Popularizes Data Warehousing
● Mid – Late 80’s: Dr Kimball Popularizes Star Schema
● Late 80’s: Barry Devlin and Dr Kimball Release “Business Data Warehouse”
● 1990: Dan Linstedt Begins R&D on Data Vault Modeling
● 2000: Dan Linstedt releases first 5 articles on Data Vault Modeling
© LearnDataVault.com
2014 - Next Evolution
What’s New in DV2.0?
Modeling Structure Includes…
● NoSQL, and Non-Relational DB systems, Hybrid Systems
● Minor Structure Changes to support NoSQL
New ETL Implementation Standards
● For true real-time support
● For NoSQL support
New Architecture Standards
● To include support for NoSQL data management systems
New Methodology Components
● Including CMMI, Six Sigma, and TQM
● Including Project Planning, Tracking, and Oversight
● Agile Delivery Mechanisms
● Standards, and templates for Projects
Use of MD5 Hash in DV2.0
● MD5 PK – replaces surrogate keys
● MD5DIFF – used for change detection
This model is fully compliant with Hadoop and needs NO changes to work properly.
The Hash Keys can be used to join to Hadoop data sets.
MD5-based Change Detection
Think Type 2 SCD
Old Way:
● Compare column by column
● Source value != Current value in DW table
● 20 columns, then 20 compares
New Way:
● Concatenate all columns to one string
● Convert to one char(32) string with hash function
● Compare to hashed value (MD5DIFF) in target table
● Does not matter how many columns
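The difference between the two approaches can be sketched in a few lines. The slides use Oracle SQL; this sketch uses Python's standard hashlib instead, purely as an illustration. The function names, the sample rows, and the choice of '^' as separator here are assumptions for the example, not part of the original deck.

```python
import hashlib

SEP = "^"  # assumed separator between column values


def md5diff(values):
    """Return the 32-character hex MD5 of the delimited column values."""
    joined = SEP.join("" if v is None else str(v) for v in values)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


def changed_old_way(source_row, current_row):
    """Old way: one comparison per column (20 columns, 20 compares)."""
    return any(s != c for s, c in zip(source_row, current_row))


def changed_new_way(source_row, stored_md5diff):
    """New way: hash the incoming row once, compare to the stored MD5DIFF."""
    return md5diff(source_row) != stored_md5diff


# Hypothetical rows: the current warehouse row and two incoming source rows.
current = ("ASPIRIN", 325, "MG", "TABLET")
stored = md5diff(current)  # what the target table would hold as MD5DIFF

unchanged = ("ASPIRIN", 325, "MG", "TABLET")
updated = ("ASPIRIN", 500, "MG", "TABLET")

print(changed_new_way(unchanged, stored))  # no new Type 2 row needed
print(changed_new_way(updated, stored))    # a new Type 2 row is needed
```

In a real load the stored MD5DIFF would be read from the target table rather than recomputed, so the comparison stays a single char(32) equality test no matter how many columns feed the hash.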
© Data Warrior LLC
What does it look like?
Encode using standard MD5 hash function
● rawtohex(sys.utl_raw.cast_to_raw(
dbms_obfuscation_toolkit.md5(input_string => ...)))
Need to minimize chance of duplicates
● 12||3||45 and 1||2||345 hash to same value
● Need a separator between each
● Also handles case of null values
● Example: Col1||'^'||Col2||'^'||Col3
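The collision described above is easy to reproduce. The following Python sketch (an illustration only; the helper names and sample values are assumptions) shows that concatenating without a separator makes different rows hash identically, and that a '^' separator keeps them apart, including when a column is null.

```python
import hashlib


def md5_hex(text):
    """Return the MD5 of a string as 32 hex characters."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def concat(values, sep=""):
    """Join column values; None becomes '' as with Oracle's || operator."""
    return sep.join("" if v is None else str(v) for v in values)


row_a = ("12", "3", "45")
row_b = ("1", "2", "345")

# Without a separator both rows become "12345" and hash to the same value.
collide = md5_hex(concat(row_a)) == md5_hex(concat(row_b))

# With a separator the strings are "12^3^45" and "1^2^345".
distinct = md5_hex(concat(row_a, "^")) != md5_hex(concat(row_b, "^"))

# Null handling: the position of the empty value is preserved.
null_middle = concat(("A", None, "B"), "^")  # "A^^B"
null_last = concat(("A", "B", None), "^")    # "A^B^"

print(collide, distinct, null_middle, null_last)
```

The separator only helps if the chosen character does not appear in the data itself, so it should be picked with the source columns in mind.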
Other considerations
To generate most consistent string: standardize!
Convert data types
If 'NUMBER', 'NVARCHAR2', 'NVARCHAR', 'NCHAR'
● THEN 'TO_CHAR(' || column_name || ')'
If 'RAW'
● THEN 'ENC_BASE64(' || column_name || ')'
If 'DATE'
● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD'')'
If LIKE 'TIME%'
● THEN 'TO_CHAR(' || column_name || ', ''YYYY-MM-DD HH24:MI:SS'')'
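These rules read like fragments of a generator that emits a SQL expression per column. A minimal Python sketch of that idea follows. The function name standardize_expr is hypothetical, ENC_BASE64 is taken as-is from the slide (the deck does not say where it is defined), and passing other data types through unchanged is an assumption.

```python
def standardize_expr(column_name, data_type):
    """Return a SQL expression (as text) that renders the column as a string.

    Mirrors the four rules on the slide; any other data type is passed
    through unchanged, which is an assumption of this sketch.
    """
    if data_type in ("NUMBER", "NVARCHAR2", "NVARCHAR", "NCHAR"):
        return f"TO_CHAR({column_name})"
    if data_type == "RAW":
        # ENC_BASE64 is the name used on the slide.
        return f"ENC_BASE64({column_name})"
    if data_type == "DATE":
        return f"TO_CHAR({column_name}, 'YYYY-MM-DD')"
    if data_type.startswith("TIME"):
        return f"TO_CHAR({column_name}, 'YYYY-MM-DD HH24:MI:SS')"
    return column_name


# Hypothetical column metadata, e.g. as read from a data dictionary view.
columns = [
    ("GENERICNAME", "VARCHAR2"),
    ("MED_STRNG_AMT", "NUMBER"),
    ("LOAD_DT", "DATE"),
    ("UPDATED_TS", "TIMESTAMP(6)"),
]

for name, dtype in columns:
    print(standardize_expr(name, dtype))
```

Fixing the date and timestamp formats explicitly matters because the default text rendering can depend on session settings, which would change the hash without any change in the data.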
Final Input String
(UPPER(TRIM(T1.GENERICNAME))
||'^'|| UPPER(TRIM(TO_CHAR(T1.MED_STRNG_AMT)))
||'^'|| UPPER(TRIM(T1.UOM_CD))
||'^'|| UPPER(TRIM(T1.MED_FORM_NM))
||'^')
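To see what this expression produces, here is a rough Python equivalent, offered as a sketch rather than the author's implementation. The column names come from the slide; the sample rows and helper names are hypothetical, and str() is only a stand-in for TO_CHAR that matches it for simple integers.

```python
import hashlib

COLUMNS = ("GENERICNAME", "MED_STRNG_AMT", "UOM_CD", "MED_FORM_NM")


def norm(value):
    """Approximate UPPER(TRIM(...)): None -> '', else trim spaces and upper-case."""
    return "" if value is None else str(value).strip(" ").upper()


def final_input_string(row):
    """Build the delimited string, including the trailing '^' from the slide."""
    return "".join(norm(row[c]) + "^" for c in COLUMNS)


def md5diff(row):
    """Hash the final input string to a 32-character hex value."""
    return hashlib.md5(final_input_string(row).encode("utf-8")).hexdigest()


# Two hypothetical source rows that differ only in case and padding.
row_1 = {"GENERICNAME": "  Aspirin ", "MED_STRNG_AMT": 325,
         "UOM_CD": "mg", "MED_FORM_NM": "Tablet"}
row_2 = {"GENERICNAME": "ASPIRIN", "MED_STRNG_AMT": 325,
         "UOM_CD": "MG", "MED_FORM_NM": "TABLET"}

print(final_input_string(row_1))
print(md5diff(row_1) == md5diff(row_2))
```

Because of the UPPER and TRIM steps, the two rows yield the same input string and therefore the same hash, so cosmetic differences in case or padding do not register as changes.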
So what?
MD5 hash is consistent cross-platform
Changes multi-column compares to a single-column compare
All compares take the same time during the load process, regardless of column count
Can use with any DW architecture that requires change detection
Virtually no limit
● Think Big Data/Hadoop/NoSQL
Can generate the input string automatically
● But that is another talk!
Learn more about Data Vault
www.LearnDataVault.com
www.danlinstedt.com
On YouTube:
www.youtube.com/LearnDataVault
On Facebook:
www.facebook.com/learndatavault
Super Charge Your Data Warehouse
Available on Amazon.com
Soft Cover or Kindle Format
Now also available in PDF at
LearnDataVault.com
Contact Information
Kent Graziano
The Oracle Data Warrior
Data Warrior LLC
On Twitter @KentGraziano
Visit my blog at
http://kentgraziano.com