
Conceptual Modeling for ETL processes

Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos
{pvassil,asimi,spiros}@dblab.ece.ntua.gr

National Technical University of Athens
KDBS Laboratory

http://www.dbnet.ece.ntua.gr

Vassiliadis,Simitsis,Skiadopoulos - DOLAP'02 2

General Idea

- The problem: the conceptual part of the definition of the ETL process in the early stages of a DW project
- The key idea: the mapping of the attributes of the data sources to the attributes of the DW tables

Outline

- Motivation
- Conceptual Model
- Instantiation and Specialization Layers
- Methodology for the usage of the conceptual model
- Conclusions and Future Work

Extract-Transform-Load (ETL)

[Diagram: data flows from the Sources through the DSA to the DW, in three stages: Extract, Transform & Clean, Load]

Motivation

- Practical necessity
  - e.g., ETL is reported to take up to 80% of the development time in a DW project
  - In-house development, ad-hoc solutions
- Lack of related work
  - The front end of the DW has monopolized the research on the conceptual part of DW modeling
- Thus, the design, development and deployment of ETL processes need modeling, design and methodological foundations

Motivation

- Early stages of the DW design:
  - Concepts are still fuzzy and changing frequently
  - Lots of interviews with people
  - No time for a full, clean-cut definition of the DW and the ETL workflow
- Still, we can:
  - Trace the mapping of the attributes of the data sources to the attributes of the DW tables
  - Trace necessary constraints and transformations for the ETL process

[Diagram: attribute S1.A mapped to attribute DW.A, annotated with PK]


Conceptual Model

Entities of our model:
- Concepts
- Attributes
- Part-of Relationships
- Transformations
- Serial Composition of Transformations
- Provider Relationships
- Notes
- ETL Constraints
- Candidate Relationships

Conceptual Model

- Concepts
  - a name and a finite set of attributes
  - represent an entity in the source database or in the DW
- Attributes
  - same role as in ER/dimensional models
  - a granular module of information

[Diagram: graphical notation for attribute and concept]

We do not employ standard UML notation for concepts and attributes, because we need to treat attributes as first-class citizens of our model.

Conceptual Model

- Part-of Relationships
  - finite set of attributes
  - emphasize the fact that a concept is composed of a set of attributes

[Diagram: graphical notation for part of]

Conceptual Model

Example
- Source 1: S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
- Data Warehouse: DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
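The two example concepts can be written down as a minimal Python sketch, with attributes kept as first-class objects rather than plain strings. The class names `Attribute` and `Concept` and the helper `Concept.of` are assumptions made for illustration; they are not part of the authors' tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attribute:
    """A granular module of information, modeled as a first-class object."""
    concept: str  # name of the owning concept, e.g. "S1.PARTSUPP"
    name: str     # attribute name, e.g. "PKEY"

    def __str__(self) -> str:
        return f"{self.concept}.{self.name}"


@dataclass
class Concept:
    """A named entity of a source or of the DW, composed of attributes (part-of)."""
    name: str
    attributes: List[Attribute] = field(default_factory=list)

    @classmethod
    def of(cls, name: str, *attribute_names: str) -> "Concept":
        # Hypothetical convenience constructor: builds the part-of links.
        return cls(name, [Attribute(name, a) for a in attribute_names])


s1_partsupp = Concept.of("S1.PARTSUPP", "PKEY", "SUPPKEY", "QTY", "COST")
dw_partsupp = Concept.of("DW.PARTSUPP", "PKEY", "SUPPKEY", "DATE", "QTY", "COST")

# DW attributes with no same-named source attribute have to be produced
# by some transformation during ETL.
source_names = {a.name for a in s1_partsupp.attributes}
missing = [a.name for a in dw_partsupp.attributes if a.name not in source_names]
print(missing)  # ['DATE']
```

Matching attributes by name is only a shortcut for this sketch; in the model itself the correspondence is stated explicitly through provider relationships.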

Conceptual Model

[Diagram: concept S1.PARTSUPP with attributes PKey, SuppKey, Qty, Cost, and concept DW.PARTSUPP with attributes PKey, SuppKey, Date, Qty, Cost]

Conceptual Model

- Transformations
  - a finite set of input/output attributes and a symbol
  - abstractions that represent parts, or full modules of code, executing a single task
  - two categories:
    - filtering or data cleaning operations (e.g., foreign key violations)
    - transformation operations (e.g., aggregation)

[Diagram: graphical notation for transformation]
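To make the two categories concrete, here is a small sketch with one filter (a not-null check) and one transformation operation (a sum aggregation). The sample rows and the function names `not_null` and `aggregate_sum` are invented for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; the values are not taken from the talk.
rows = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 5, "COST": 2.0},
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 3, "COST": None},
    {"PKEY": 2, "SUPPKEY": 11, "QTY": 4, "COST": 1.5},
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 2, "COST": 4.0},
]


def not_null(rows, attribute):
    """Filter: keep only the rows whose value for `attribute` is not None."""
    return [r for r in rows if r[attribute] is not None]


def aggregate_sum(rows, group_by, measure):
    """Transformation: sum `measure` for each combination of `group_by` values."""
    totals = defaultdict(float)
    for r in rows:
        totals[tuple(r[g] for g in group_by)] += r[measure]
    return dict(totals)


clean = not_null(rows, "COST")
cost_per_key = aggregate_sum(clean, ["PKEY", "SUPPKEY"], "COST")
print(cost_per_key)  # {(1, 10): 6.0, (2, 11): 1.5}
```

A filter only decides which rows pass, while a transformation operation changes the shape or the values of the data.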

Conceptual Model

- Provider Relationships
  - a finite set of input/output attributes and an appropriate transformation
  - map a set of input attributes to a set of output attributes through a relevant transformation*

[Diagram: graphical notation for provider N:M and provider 1:1]

* If the attributes are semantically and physically compatible, no transformation is required
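A provider relationship can be sketched as an object that maps input attributes to output attributes, with an optional transformation in between. The `Provider` class, the pairing of attributes with transformations, and the conversion rate below are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

ASSUMED_USD_TO_EUR = 0.5  # placeholder rate for illustration only, not a real rate


@dataclass
class Provider:
    """Maps input attributes to output attributes, optionally via a transformation."""
    inputs: Sequence[str]
    outputs: Sequence[str]
    # None means the attributes are compatible and values are copied as they are.
    transformation: Optional[Callable[..., tuple]] = None

    def apply(self, row: dict) -> dict:
        values = tuple(row[a] for a in self.inputs)
        if self.transformation is not None:
            values = tuple(self.transformation(*values))
        if len(values) != len(self.outputs):
            raise ValueError("one value is required per output attribute")
        return dict(zip(self.outputs, values))


# 1:1 provider without a transformation.
qty = Provider(["S1.QTY"], ["DW.QTY"])

# 1:1 provider through a dollars-to-euros transformation.
cost = Provider(["S2.COST"], ["DW.COST"], lambda usd: (usd * ASSUMED_USD_TO_EUR,))

print(qty.apply({"S1.QTY": 3}))      # {'DW.QTY': 3}
print(cost.apply({"S2.COST": 8.0}))  # {'DW.COST': 4.0}
```

Because `inputs` and `outputs` are sequences, the same class also covers the N:M case: the transformation then receives N values and returns a tuple of M values.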

Conceptual Model

[Diagram: the attributes of S1.PARTSUPP (PKey, SuppKey, Qty, Cost) linked by provider relationships to the attributes of DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), through the transformations f, SK and NN]

Conceptual Model

- Notes
  - informal tags, exactly as in UML modeling
  - used for:
    - simple comments explaining design decisions
    - explanation of the semantics of the applied transformation
    - tracing of runtime constraints

[Diagram: graphical notation for Note]

Conceptual Model

[Diagram: the S1.PARTSUPP to DW.PARTSUPP mapping with transformations f, SK and NN, now annotated with the note "Date = SysDate()"]

Conceptual Model

- ETL Constraints
  - a finite set of attributes and a single transformation
  - express the fact that the data of a certain concept fulfill several requirements

[Diagram: graphical notation for ETL_constraint]
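As an illustration, an ETL constraint can be read as a check attached to a set of attributes of one concept. The sketch below implements a primary key check; the choice of key attributes, the sample rows and the function name `primary_key_violations` are assumptions, since the slides do not state which attributes the PK constraint covers.

```python
def primary_key_violations(rows, key_attributes):
    """Return the rows whose key value already appeared in an earlier row."""
    seen = set()
    violations = []
    for row in rows:
        key = tuple(row[a] for a in key_attributes)
        if key in seen:
            violations.append(row)
        else:
            seen.add(key)
    return violations


# Hypothetical rows of DW.PARTSUPP.
rows = [
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "2002-11-08"},
    {"PKEY": 1, "SUPPKEY": 11, "DATE": "2002-11-08"},
    {"PKEY": 1, "SUPPKEY": 10, "DATE": "2002-11-08"},  # repeats the first key
]

violations = primary_key_violations(rows, ["PKEY", "SUPPKEY", "DATE"])
constraint_holds = not violations
print(len(violations), constraint_holds)  # 1 False
```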

Conceptual Model

[Diagram: the S1.PARTSUPP to DW.PARTSUPP mapping with transformations f, SK and NN and the note "Date = SysDate()", now extended with a PK constraint]

Conceptual Model

- Candidate Relationships
  - a single candidate concept and a single target concept
  - used when a certain DW concept is populated by a finite set of more than one candidate source concepts
- Active Candidate Relationship
  - a certain candidate that has been selected for the population of the target concept
  - a specialization of candidate relationships

[Diagram: candidate1 ... candidaten linked to a target under an {XOR} constraint, with one of them marked as the active candidate]
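The XOR semantics of candidate relationships can be sketched as a small structure that holds several candidates per target but at most one active candidate. The class `CandidateSet` is hypothetical, and the choice of target and active candidate in the usage lines is an assumption made for illustration, using concept names that appear in the example diagrams of the talk.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CandidateSet:
    """Candidate source concepts for one target; at most one is active (XOR)."""
    target: str
    candidates: List[str] = field(default_factory=list)
    active: Optional[str] = None

    def activate(self, candidate: str) -> None:
        if candidate not in self.candidates:
            raise ValueError(f"{candidate!r} is not a candidate for {self.target!r}")
        # Overwriting the previous choice keeps exactly one active candidate.
        self.active = candidate


partsupp_sources = CandidateSet(
    target="S2.PartSupp",
    candidates=["AnnualPartSupp's", "RecentPartSupp's"],
)
partsupp_sources.activate("RecentPartSupp's")
print(partsupp_sources.active)  # RecentPartSupp's
```

Storing a single `active` field, instead of a flag per candidate, makes the XOR constraint hold by construction.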

Conceptual Model

[Diagram: candidates AnnualPartSupp's and RecentPartSupp's under an {XOR} constraint; concepts S1.PartSupp, S2.PartSupp and DW.PartSupp combined through a union (U). Notes: "Necessary providers: S1 and S2" and "Due to accuracy and small size (< update window)"]

Conceptual Model

[Diagram: the complete example. Concepts: S1.PARTSUPP (PKey, SuppKey, Qty, Cost), S2.PARTSUPP (PKey, SuppKey, Date, Department, Qty, Cost) and DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost). Candidates AnnualPartSupp's and RecentPartSupp's under an {XOR} constraint. Transformations and constraints: American to European Date, $2€, SK, γ, f, NN, PK and a union (U). Other labels: SUM(S2.Cost), SUM(S2.Qty), S2.Date, S2.PKey, S2.SuppKey. Notes: "Date = SysDate()", "Due to accuracy and small size (< update window)", "Necessary providers: S1 and S2" and {Duration<4h}]


Instantiation & Specialization Layers

The key issues:
- genericity: identification of a small set of generic constructs to capture all cases
- usability: construction of a 'palette' of frequently used types

Instantiation & Specialization Layers

- Metamodel layer
  - a set of generic entities, able to represent any ETL scenario
  - involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship
- Template layer
  - a set of 'built-in' specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios
- Schema layer
  - a specific ETL scenario
  - all the entities of the Schema layer are instances of the classes of the Metamodel layer
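The three layers map naturally onto classes, subclasses and objects. The sketch below is one possible reading, not the authors' implementation: metamodel entities are base classes, template entities are subclasses (IsA), and schema entities are objects (InstanceOf). Treating DW.PartSupp as a fact table is an assumption made for illustration.

```python
# Metamodel layer: generic entities.
class Concept:
    def __init__(self, name, attributes):
        self.name = name
        self.attributes = list(attributes)


class Transformation:
    symbol = None  # each template fixes its own symbol


# Template layer: built-in specializations (IsA) of the metamodel classes.
class FactTable(Concept):
    """A concept that plays the role of a fact table."""


class SurrogateKeyAssignment(Transformation):
    symbol = "SK"


class Aggregation(Transformation):
    symbol = "γ"


# Schema layer: the entities of one specific scenario (InstanceOf).
dw_partsupp = FactTable("DW.PartSupp", ["PKey", "SuppKey", "Date", "Qty", "Cost"])
sk = SurrogateKeyAssignment()

# A template instance is also an instance of the metamodel class it specializes.
print(isinstance(dw_partsupp, Concept), isinstance(sk, Transformation))  # True True
```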

Instantiation & Specialization Layers

[Diagram: the Metamodel, Template and Schema layers, linked by IsA and InstanceOf relationships. Metamodel classes: Concept, Attribute, Transformation, ETL_Constraint, Relationship. Further labels: Fact Table, Dimension, ER Entity, ER Relationship; American to European Date, $2€, Surrogate Key Assignment, Aggregation; Provider, Candidate, Part Of, Serial Composition. Schema-level labels: S2.PartSupp, DW.PartSupp, Candidate1, Candidate2, γ, SK, f]

Instantiation & Specialization Layers

Template layer
- Four groups of logical transformations
  - Filters
  - Unary transformations
  - Binary transformations
  - Composite transformations
- Two groups of physical transformations
  - Transfer operations
  - File operations

Instantiation & Specialization Layers

- Filters: Selection (σ), Not null (NN), Primary key violation (PK), Foreign key violation (FK), Unique value (UN), Domain mismatch (DM)
- Unary transformations: Push, Aggregation (γ), Projection (π), Function application (f), Surrogate key assignment (SK), Tuple normalization (N), Tuple denormalization (DN)
- Binary transformations: Union (U), Join (⋈), Diff (Δ), Update Detection (ΔUPD)
- Composite transformations: Slowly changing dimension (Type 1,2,3) (SDC-1/2/3), Format mismatch (FM), Data type conversion (DTC), Switch (σ*), Extended union (U)
- File operations: EBCDIC to ASCII conversion (EB2AS), Sort file (Sort)
- Transfer operations: Ftp (FTP), Compress/Decompress (Z/dZ), Encrypt/Decrypt (Cr/dCr)
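The palette can be kept as a lookup table, which is handy for decoding the symbols used in the diagrams. The sketch below covers only a subset of the entries listed on the slide (those with an unambiguous symbol); the names `PALETTE` and `describe` are hypothetical.

```python
# Subset of the template palette, keyed by group and then by symbol.
PALETTE = {
    "Filters": {
        "σ": "Selection",
        "NN": "Not null",
        "PK": "Primary key violation",
        "FK": "Foreign key violation",
        "UN": "Unique value",
        "DM": "Domain mismatch",
    },
    "Unary transformations": {
        "γ": "Aggregation",
        "π": "Projection",
        "f": "Function application",
        "SK": "Surrogate key assignment",
        "N": "Tuple normalization",
        "DN": "Tuple denormalization",
    },
    "Binary transformations": {
        "U": "Union",
        "Δ": "Diff",
        "ΔUPD": "Update Detection",
    },
}


def describe(symbol):
    """Return the name and group of a palette symbol, or raise KeyError."""
    for group, entries in PALETTE.items():
        if symbol in entries:
            return f"{entries[symbol]} ({group})"
    raise KeyError(symbol)


for symbol in ("SK", "NN", "γ"):
    print(symbol, "->", describe(symbol))
```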


Methodology

- Step 1: Identification of the proper data stores
- Step 2: Candidates and active candidates for the involved data stores
- Step 3: Attribute mapping between the providers and the consumers
- Step 4: Annotating the diagram with runtime constraints


Conclusions

Our contribution lies in:
- The proposal of a novel conceptual model which is customized for the tracing of inter-attribute relationships and the respective ETL activities
- A customizable and extensible construction
- The introduction of a 'palette' of frequently used ETL activities

On-going/Future Work

The Arktos II project is aimed towards the conceptual modeling, logical modeling, optimization and what-if analysis of ETL scenarios.

http://www.dblab.ece.ntua.gr/~pvassil/projects/arktos_II


Thank you


Back-up slides

Logical Model [DMDW’02]

[Diagram: logical model of the example ETL workflow across Sources, DSA and DW. Recoverable labels: sources S1_PARTSUPP and S2_PARTSUPP; transfers FTP1 and FTP2; DS.PS_NEW1, DS.PS_OLD1, DS.PS_NEW2, DS.PS_OLD2, DS.PS1, DS.PS2, DW.PARTSUPP, V1, V2 and TIME; activities DIFF1 (DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY), DIFF2 (DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY), Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1, SK2, $2€, A2EDate, NotNULL, AddDate (DATE=SYSDATE), CheckQTY (QTY>0), U, Aggregate1 (PKEY, DAY; MIN(COST)) and Aggregate2 (PKEY, MONTH; AVG(COST)); rejected rows are logged ("Log rejected")]

Conceptual Model

[Diagram: summary of the graphical notation for concept, attribute, part of, provider 1:1, provider N:M, transformation, serial composition, ETL_constraint, Note, and candidate1 ... candidaten with an active candidate and a target under {XOR}]

The lifecycle of a Data Warehouse and its ETL processes

[Diagram: lifecycle stages and their products. Labels: Reverse Engineering of Sources & Requirements Collection; Conceptual Model for DW, Sources & Activities; Logical Design; Logical Model for DW, Sources & Activities; Tuning – Full Activity Description; Physical Model for DW, Sources & Activities; Software Construction; Software & SW Metrics; Administration of DW; Metrics]

Conceptual Model

[Diagram: UML metamodel. Metaclasses: Attribute (+name), Concept (+name), Transformation (+name, +symbol), Relationship, PartOf, Provider, Candidate, Active Candidate, ETL_Constraint and Serial Composition. Association roles: schema, input, output, transformation, candidate, target, attributes, initiating, consequent. Remaining labels: +content, Tag]

Conceptual Model

General Notes
- It is not a process/workflow model
- It is orthogonal to the conceptual models which are available for the modeling of DW star schemata
  - It is specifically tailored for the back end of the DW
  - Any of the proposals for the DW front end can be combined with our approach

Conceptual Model

- Serial Composition of Transformations
  - a single initiating transformation and a single subsequent transformation
  - combine several transformations in a single provider relationship

[Diagram: graphical notation for serial composition]
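Serial composition corresponds to function chaining: the output of the initiating transformation feeds the subsequent one. The sketch below chains a not-null check with an American-to-European date conversion; the helper names and the assumed string formats ("MM/DD/YYYY" in, "DD/MM/YYYY" out) are illustrative choices, not taken from the talk.

```python
from functools import reduce


def serial_composition(*transformations):
    """Chain transformations so that each consumes the output of the previous one."""
    def composed(value):
        return reduce(lambda acc, t: t(acc), transformations, value)
    return composed


def require_not_null(value):
    """Reject missing values and pass everything else through unchanged."""
    if value is None:
        raise ValueError("NULL value rejected")
    return value


def american_to_european_date(date):
    """Convert "MM/DD/YYYY" to "DD/MM/YYYY" (assumed formats)."""
    month, day, year = date.split("/")
    return f"{day}/{month}/{year}"


# One provider relationship that carries two transformations in sequence.
date_provider = serial_composition(require_not_null, american_to_european_date)
print(date_provider("11/08/2002"))  # 08/11/2002
```

Each pairwise link in the chain matches the slide's definition of one initiating and one subsequent transformation, so longer chains are built from repeated serial compositions.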