PDI data vault framework #pcmams 2012

Post on 11-Sep-2014


Presentation given by Edwin Weber at #pcmams 2012


Introduction

eacweber@gmail.com

Data Vault Definition

Source: Dan Linstedt, http://www.tdan.com/view-articles/5054/

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of enterprise data warehouses.

Data Vault Building Blocks

Source: Dan Linstedt, http://www.slideshare.net/dlinstedt/introduction-to-data-vault-dama-oregon-2012

different sources/rate of change

Data Vault Fundamentals: Hub

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Data Vault Fundamentals: Link

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Data Vault Fundamentals: Satellite

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Data Vault Fundamentals: Model

Source: data-vault-modeling-guide, GENESEE ACADEMY, LLC, Hans Hultgren

Data Vault ETL

Many objects to load, standardized procedures

This screams for a generic solution!

I don't want to:

throw ETL tool away and code it all myself

manage too many ETL objects

connect similar columns in mappings by hand

I do want to:

generate ETL (Kettle) objects? No

Take it one step further: there is only one parameterised hub load object, so there is no need to know the XML structure of PDI objects
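
The idea of one parameterised load object can be sketched as one job file that is called once per hub with a different set of named parameters. This is a minimal sketch, assuming the job is started through PDI's Kitchen command-line tool with its `-file` and `-param:NAME=value` options; the job file name `load_hub.kjb`, the parameter names and the hub metadata are hypothetical and not taken from the presentation.

```python
from typing import Dict, List


def kitchen_command(job_file: str, params: Dict[str, str]) -> List[str]:
    """Build an argument list that starts one PDI job with named parameters."""
    cmd = ["kitchen.sh", f"-file={job_file}"]
    # Sorted so the generated command is the same from run to run.
    for name, value in sorted(params.items()):
        cmd.append(f"-param:{name}={value}")
    return cmd


# Hypothetical metadata for two hubs; in the framework this would come
# from the design sheet, not from a literal in the code.
hubs = [
    {"HUB_NAME": "hub_customer", "SOURCE_TABLE": "stg_customer", "BUSINESS_KEY": "customer_nr"},
    {"HUB_NAME": "hub_product", "SOURCE_TABLE": "stg_product", "BUSINESS_KEY": "product_code"},
]

# One job file, many parameter sets: no job is generated per hub.
commands = [kitchen_command("load_hub.kjb", hub) for hub in hubs]
for cmd in commands:
    print(" ".join(cmd))
```

The sketch only builds the command lines; passing each list to `subprocess.run` would start the jobs.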

Tools

Version Control

Database

Virtualization

Data Integration

Operating System

'Productivity'

SQL Development

Place of framework in architecture

Diagram columns: Sources, ETL Process, Data Warehouse, EUL

Sources: CSV files, ERP, DBMS

Staging Area: MySQL, files

ETL: Kettle Data Vault framework

Central DWH & Data Marts: MySQL Data Vault

What has to be taken care of?

Data Vault designed and implemented in database

Staging tables and loading procedures in place (can also be generic; we use the PDI Metadata Injection step for loading files)

Mapping from source to Data Vault specified (now in an Excel sheet)


Framework components

PDI repository (file based), jobs and transformations

Configuration files: kettle.properties, shared.xml, repositories.xml

Excel sheet that contains the specifications

MySQL database for metadata

Virtual machine with Ubuntu 12.04 Server

Design decisions

Updateable views with generic column names

(MySQL more lenient than PostgreSQL)

Compare satellite attributes via string comparison (concatenate all columns, with | (pipe) as delimiter)

'inject' the metadata using Kettle parameters

Generate and use an error table for each Data Vault table
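
The string comparison of satellite attributes can be sketched as below. This is a minimal sketch, assuming rows are available as dictionaries; the function names, the column names and the treatment of NULL as an empty string are assumptions for illustration, not details from the presentation.

```python
from typing import Mapping, Sequence


def attribute_string(row: Mapping[str, object], columns: Sequence[str]) -> str:
    """Concatenate the attribute values in a fixed column order, pipe-delimited."""
    # Assumption: a NULL (None) is rendered as an empty string.
    return "|".join("" if row[c] is None else str(row[c]) for c in columns)


def satellite_changed(
    current: Mapping[str, object],
    incoming: Mapping[str, object],
    columns: Sequence[str],
) -> bool:
    """True when the incoming row differs from the current satellite row."""
    return attribute_string(current, columns) != attribute_string(incoming, columns)


columns = ["name", "city", "segment"]
current = {"name": "Acme", "city": "Utrecht", "segment": None}
moved = {"name": "Acme", "city": "Amsterdam", "segment": None}

print(attribute_string(current, columns))
print(satellite_changed(current, moved, columns))
print(satellite_changed(current, dict(current), columns))
```

One comparison per row replaces one comparison per column, which suits a generic load with a variable number of attributes. The trade-off is that values containing the delimiter can collide: ("a|b", "c") and ("a", "b|c") produce the same string.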

Metadata tables

All have history tables

Metadata in Excel

Data Vault

connections

source systems

source tables

Metadata in Excel (hub + sat)

x 200 (max)

Metadata in Excel (link)

link attributes

x 10

Metadata in Excel (link satellite)

x 10

x 5

x 200 (max)

Last seen date

applicable for hubs and links

existing hubs and links: update 'last_seen_dts'!
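
The last seen logic can be sketched as follows: a key that already exists only gets its 'last_seen_dts' refreshed, while a new key is inserted. This is a minimal sketch that uses an in-memory SQLite database in place of MySQL so that it runs on its own; the table and column names other than `last_seen_dts` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hub_customer ("
    " hub_customer_id INTEGER PRIMARY KEY,"
    " customer_nr TEXT UNIQUE,"
    " load_dts TEXT,"
    " last_seen_dts TEXT)"
)


def load_keys(conn: sqlite3.Connection, keys, run_dts: str) -> None:
    """Insert new business keys; for existing ones only update last_seen_dts."""
    for key in keys:
        cur = conn.execute(
            "UPDATE hub_customer SET last_seen_dts = ? WHERE customer_nr = ?",
            (run_dts, key),
        )
        if cur.rowcount == 0:
            # Key not present yet: first load, so both timestamps are the run date.
            conn.execute(
                "INSERT INTO hub_customer (customer_nr, load_dts, last_seen_dts)"
                " VALUES (?, ?, ?)",
                (key, run_dts, run_dts),
            )
    conn.commit()


load_keys(conn, ["C1", "C2"], "2012-01-01")
load_keys(conn, ["C2", "C3"], "2012-01-02")

rows = conn.execute(
    "SELECT customer_nr, load_dts, last_seen_dts FROM hub_customer ORDER BY customer_nr"
).fetchall()
for row in rows:
    print(row)
```

After the second run, C1 keeps its old last seen date, which is how a key that has disappeared from the source can be recognised.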

Link validity satellite

Link has a 'business key': not all hub IDs

Loading the metadata

'design errors'

Checks to avoid debugging (comparing the design metadata with the Data Vault DB information_schema):

hubs, links, satellites that don't exist in the DV

key columns that do not exist in the DV

missing connection data (source db)

missing attribute columns
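
The first two checks in the list above can be sketched as a comparison between the design metadata and the set of (table, column) pairs found in the database. This is a minimal sketch; the function name, the data structures and the example names are hypothetical. With MySQL the pairs could come from a query on `information_schema.columns`, which is replaced by a literal set here so the example runs without a database.

```python
from typing import Dict, List, Set, Tuple


def design_errors(
    design: Dict[str, List[str]],
    dv_columns: Set[Tuple[str, str]],
) -> List[str]:
    """Report designed tables and columns that do not exist in the Data Vault."""
    dv_tables = {table for table, _ in dv_columns}
    errors = []
    for table, columns in sorted(design.items()):
        if table not in dv_tables:
            errors.append(f"table {table} does not exist in the DV")
            continue  # no point in checking the columns of a missing table
        for column in columns:
            if (table, column) not in dv_columns:
                errors.append(f"column {table}.{column} does not exist in the DV")
    return errors


# Stand-in for: SELECT table_name, column_name FROM information_schema.columns ...
dv_columns = {
    ("hub_customer", "hub_customer_id"),
    ("hub_customer", "customer_nr"),
}

# Hypothetical design metadata with one misspelled column and one missing hub.
design = {
    "hub_customer": ["hub_customer_id", "customer_no"],
    "hub_product": ["hub_product_id"],
}

errors = design_errors(design, dv_columns)
for message in errors:
    print(message)
```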

A complete run

Metadata needed for a hub

name

key column

business key column

source table

source table business key column (can be an expression, e.g. a concatenation for a composite key)
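
The five metadata items above are enough to drive a generic hub load. The following is a minimal sketch, not the framework's actual transformation: it builds one INSERT ... SELECT statement from the metadata and runs it on an in-memory SQLite database instead of MySQL. The class, function, table and column names are hypothetical, and the composite key uses SQLite's `||` operator where MySQL would use `CONCAT`.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class HubSpec:
    name: str
    key_column: str
    business_key_column: str
    source_table: str
    source_business_key: str  # a column name or an SQL expression


def hub_insert_sql(spec: HubSpec) -> str:
    """Build the statement that adds business keys not yet present in the hub."""
    expr = f"({spec.source_business_key})"
    return (
        f"INSERT INTO {spec.name} ({spec.business_key_column}) "
        f"SELECT DISTINCT {expr} FROM {spec.source_table} "
        f"WHERE {expr} NOT IN (SELECT {spec.business_key_column} FROM {spec.name})"
    )


spec = HubSpec(
    name="hub_order",
    key_column="hub_order_id",
    business_key_column="order_bk",
    source_table="stg_order",
    source_business_key="country || '-' || order_nr",  # composite key expression
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_order (country TEXT, order_nr TEXT)")
conn.executemany("INSERT INTO stg_order VALUES (?, ?)", [("NL", "1"), ("NL", "1"), ("BE", "7")])
# The surrogate key is generated by the database in this sketch.
conn.execute(
    f"CREATE TABLE {spec.name} ({spec.key_column} INTEGER PRIMARY KEY, "
    f"{spec.business_key_column} TEXT)"
)

conn.execute(hub_insert_sql(spec))
conn.execute(hub_insert_sql(spec))  # a second run adds nothing
keys = [r[0] for r in conn.execute("SELECT order_bk FROM hub_order ORDER BY order_bk")]
print(keys)
```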

Job for hub

Transformation for hub

Metadata needed for a link

name

key column

for each hub (maximum 10, can be a ref-table)

hub name

column name for the hub key in the link (roles!)

column in the source table → business key of hub

link 'attributes' (part of key, no hub, maximum 5)

link validity satellite needed?

last seen date needed?

source table
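
The link metadata listed above can be sketched as a small data structure that enforces the stated maxima. This is a minimal sketch; the class and field names and the example link are hypothetical. The example uses the same hub twice under different column names to illustrate roles.

```python
from dataclasses import dataclass, field
from typing import List

MAX_HUBS = 10            # maximum number of hubs per link, as stated above
MAX_LINK_ATTRIBUTES = 5  # maximum number of link 'attributes', as stated above


@dataclass
class HubReference:
    hub_name: str
    link_column: str    # column name for the hub key in the link (the role)
    source_column: str  # column in the source table holding the business key


@dataclass
class LinkSpec:
    name: str
    key_column: str
    source_table: str
    hubs: List[HubReference]
    attributes: List[str] = field(default_factory=list)
    validity_satellite: bool = False
    last_seen: bool = False

    def __post_init__(self) -> None:
        if len(self.hubs) > MAX_HUBS:
            raise ValueError(f"{self.name}: more than {MAX_HUBS} hubs")
        if len(self.attributes) > MAX_LINK_ATTRIBUTES:
            raise ValueError(f"{self.name}: more than {MAX_LINK_ATTRIBUTES} attributes")
        link_columns = [h.link_column for h in self.hubs]
        if len(set(link_columns)) != len(link_columns):
            raise ValueError(f"{self.name}: duplicate hub key column in link")


# The same hub appears twice in different roles.
reports_to = LinkSpec(
    name="link_reports_to",
    key_column="link_reports_to_id",
    source_table="stg_employee",
    hubs=[
        HubReference("hub_employee", "employee_id", "emp_nr"),
        HubReference("hub_employee", "manager_id", "mgr_nr"),
    ],
    last_seen=True,
)
print([h.link_column for h in reports_to.hubs])
```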

Job for link

Transformation for link

Run table needed for validity sat?

Lookup hubs

Remove columns not in link

Last seen?
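
The two middle steps, looking up the hubs and removing the columns that are not in the link, can be sketched for a single row as below. This is a minimal sketch with in-memory dictionaries standing in for the hub tables; the function name and all table and column names are hypothetical, and the run table and last seen steps are left out.

```python
from typing import Dict, List, Mapping, Tuple

# (hub name, column name in the link, source column with the business key)
HubRef = Tuple[str, str, str]


def build_link_row(
    source_row: Mapping[str, object],
    hub_refs: List[HubRef],
    hub_lookups: Dict[str, Dict[object, int]],
    attributes: List[str],
) -> Dict[str, object]:
    """Translate business keys to hub keys and keep only the link's columns."""
    link_row: Dict[str, object] = {}
    for hub_name, link_column, source_column in hub_refs:
        business_key = source_row[source_column]
        # A business key that is missing from the hub raises KeyError here.
        link_row[link_column] = hub_lookups[hub_name][business_key]
    for attribute in attributes:
        link_row[attribute] = source_row[attribute]
    # Source columns that were not copied above are dropped implicitly.
    return link_row


hub_lookups = {"hub_employee": {"E1": 1, "E2": 2}}
hub_refs = [
    ("hub_employee", "employee_id", "emp_nr"),
    ("hub_employee", "manager_id", "mgr_nr"),
]
source_row = {"emp_nr": "E1", "mgr_nr": "E2", "start_date": "2012-01-01", "comment": "x"}

link_row = build_link_row(source_row, hub_refs, hub_lookups, ["start_date"])
print(link_row)
```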

Metadata needed for a hub satellite

name

key column

hub name

column in the source table → business key of hub

for each attribute (maximum 200)

source column → target column

source table
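
The attribute mapping for a hub satellite can be sketched as a rename from source columns to target columns, followed by an insert only when the mapped values differ from the latest version. This is a minimal sketch: the history is an in-memory dictionary instead of a satellite table, the comparison is on dictionaries rather than on a concatenated string, and every name in it is hypothetical.

```python
from typing import Dict, List, Mapping

MAX_ATTRIBUTES = 200  # maximum number of attributes, as stated above


def map_attributes(source_row: Mapping[str, object], column_map: Dict[str, str]) -> Dict[str, object]:
    """Rename source columns to satellite columns; unmapped columns are dropped."""
    if len(column_map) > MAX_ATTRIBUTES:
        raise ValueError(f"more than {MAX_ATTRIBUTES} attributes")
    return {target: source_row[source] for source, target in column_map.items()}


def load_satellite_row(
    history: Dict[int, List[dict]],
    hub_key: int,
    source_row: Mapping[str, object],
    column_map: Dict[str, str],
    load_dts: str,
) -> bool:
    """Append a new version when the attributes changed; return True if added."""
    incoming = map_attributes(source_row, column_map)
    versions = history.setdefault(hub_key, [])
    if versions and versions[-1]["attributes"] == incoming:
        return False  # nothing changed, keep the current version
    versions.append({"load_dts": load_dts, "attributes": incoming})
    return True


column_map = {"cust_name": "name", "cust_city": "city"}
history: Dict[int, List[dict]] = {}

first = load_satellite_row(history, 1, {"cust_name": "Acme", "cust_city": "Utrecht"}, column_map, "2012-01-01")
same = load_satellite_row(history, 1, {"cust_name": "Acme", "cust_city": "Utrecht"}, column_map, "2012-01-02")
moved = load_satellite_row(history, 1, {"cust_name": "Acme", "cust_city": "Amsterdam"}, column_map, "2012-01-03")
print(first, same, moved, len(history[1]))
```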

Job for hub satellite

Transformation for hub satellite

Metadata needed for a link satellite

name

key column

link name

for each hub of the link:

column in the source table → business key of hub

for each key attribute: source column

for each attribute: source column → target column

source table

Job for link satellite

Transformation for link satellite

Executing in a loop ..

.. and parallel
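
Looping over the metadata and running loads in parallel can be sketched with a thread pool. This is a minimal sketch, not how the PDI jobs are wired: the stage order (hubs, then links, then satellites) is an assumption based on links needing hub keys and satellites needing hub or link keys, and `fake_load` is a hypothetical stand-in for starting the parameterised job.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumed order: each stage depends on keys created by an earlier one.
STAGES = ["hub", "link", "hub_satellite", "link_satellite"]


def run_stage(names: List[str], load: Callable[[str], str], workers: int = 4) -> List[str]:
    """Run all loads of one stage in parallel and wait until they are done."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns the results in the order of the input names.
        return list(pool.map(load, names))


def run_all(objects: Dict[str, List[str]], load: Callable[[str], str]) -> List[str]:
    """Loop over the stages in order; parallelism happens inside a stage."""
    results: List[str] = []
    for stage in STAGES:
        results.extend(run_stage(objects.get(stage, []), load))
    return results


def fake_load(name: str) -> str:
    # Stand-in for starting the parameterised job for one object.
    return f"{name}: loaded"


objects = {
    "hub": ["hub_customer", "hub_product"],
    "link": ["link_order"],
    "hub_satellite": ["sat_customer"],
}

results = run_all(objects, fake_load)
for line in results:
    print(line)
```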

Logging

Configuring log tables for concurrent access

PDI logging

Custom logging

Version Control: PDI objects

Version Control: database objects

Some points of interest

Easy to make mistakes in the design sheet

Generic → a bit harder to maintain and debug

Application/tool to maintain metadata?

Data Vault generators (e.g. Quipu)?

Spinoff using Informatica and Oracle: Sander Robijns

Thanks to: Jos van Dongen, Kasper de Graaf

Sourceforge!