Conceptual Modeling for ETL processes
Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos
{pvassil,asimi,spiros}@dblab.ece.ntua.gr
National Technical University of Athens, KDBS Laboratory
http://www.dbnet.ece.ntua.gr
Vassiliadis,Simitsis,Skiadopoulos - DOLAP'02 2
General Idea
  The problem: the conceptual part of the definition of ETL processes in the early stages of a DW project
  The key idea: the mapping of the attributes of the data sources to the attributes of the DW tables
Outline
  Motivation
  Conceptual Model
  Instantiation and Specialization Layers
  Methodology for the usage of the conceptual model
  Conclusions and Future Work
Extract-Transform-Load (ETL)
  [Figure: data flow Sources -> DSA -> DW, labeled Extract, Transform & Clean, Load]
Motivation
  Practical necessity
    e.g., ETL accounts for 80% of the development time in a DW project
    In-house development, ad-hoc solutions
  Lack of related work
    The front end of the DW has monopolized the research on the conceptual part of DW modeling
  Thus, the design, development and deployment of ETL processes need modeling, design and methodological foundations
Motivation
  Early stages of the DW design:
    Concepts are still fuzzy and changing frequently
    Lots of interviews with people
    No time for a full, clean-cut definition of the DW and the ETL workflow
  Still, we can:
    Trace the mapping of the attributes of the data sources to the attributes of the DW tables
    Trace necessary constraints and transformations for the ETL process
  [Figure: attribute S1.A linked to attribute DW.A, with a PK label]
Conceptual Model
  Entities of our model:
    Concepts
    Attributes
    Part-of Relationships
    Transformations
    Serial Composition of Transformations
    Provider Relationships
    Notes
    ETL Constraints
    Candidate Relationships
Conceptual Model
  Concepts
    a name, a finite set of attributes
    represent an entity in the source database or in the DW
  Attributes
    same role as in ER/dimensional models
    a granular module of information
  [Figure: notation symbols for an attribute and a concept]
  We do not employ standard UML notation for concepts and attributes, because we need to treat attributes as first-class citizens of our model
Conceptual Model
  Part-of Relationships
    a finite set of attributes
    emphasize the fact that a concept is composed of a set of attributes
  [Figure: notation symbol for a part-of relationship]
Conceptual Model
  Example
    Source 1: S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST}
    Data Warehouse: DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
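The two concepts of the example can be written down as plain data. The following is a minimal sketch, not the authors' implementation: the class names Concept and Attribute follow the slides, while the fields and the attributes() helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    """A granular module of information, owned by exactly one concept."""
    concept: str
    name: str

    def __str__(self) -> str:
        return f"{self.concept}.{self.name}"


@dataclass(frozen=True)
class Concept:
    """A named entity of a source or of the DW, with a finite set of attributes."""
    name: str
    attribute_names: tuple

    def attributes(self):
        # Part-of relationship: every attribute listed here belongs to this concept.
        return [Attribute(self.name, a) for a in self.attribute_names]


s1_partsupp = Concept("S1.PARTSUPP", ("PKEY", "SUPPKEY", "QTY", "COST"))
dw_partsupp = Concept("DW.PARTSUPP", ("PKEY", "SUPPKEY", "DATE", "QTY", "COST"))

# DW attributes with no same-named attribute in S1: they need some other provider.
unmatched = [a for a in dw_partsupp.attribute_names
             if a not in s1_partsupp.attribute_names]
print(unmatched)
```

Treating attributes as objects in their own right, rather than as fields hidden inside a concept, mirrors the slides' point that attributes are first-class citizens of the model.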
Conceptual Model
  [Figure: concepts S1.PARTSUPP (attributes PKey, SuppKey, Qty, Cost) and DW.PARTSUPP (attributes PKey, SuppKey, Date, Qty, Cost), each linked to its attributes by part-of relationships]
Conceptual Model
  Transformations
    a finite set of input/output attributes, a symbol
    abstractions that represent parts, or full modules, of code executing a single task
    two categories:
      filtering or data cleaning operations (e.g., foreign key violations)
      transformation operations (e.g., aggregation)
  [Figure: notation symbol for a transformation]
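The two categories behave differently on data: a filter keeps or rejects rows, while a transformation operation produces new values. A minimal sketch, assuming rows are Python dictionaries; the function names and the sample rows are illustrative and not part of the model:

```python
from collections import defaultdict


def not_null(rows, attribute):
    """Filter: split rows into (accepted, rejected) by a NOT NULL check."""
    accepted = [r for r in rows if r.get(attribute) is not None]
    rejected = [r for r in rows if r.get(attribute) is None]
    return accepted, rejected


def aggregate_sum(rows, group_by, measure):
    """Transformation: sum `measure` for each combination of group-by values."""
    totals = defaultdict(int)
    for r in rows:
        key = tuple(r[g] for g in group_by)
        totals[key] += r[measure]
    return dict(totals)


rows = [
    {"PKEY": 1, "SUPPKEY": 10, "COST": 5},
    {"PKEY": 1, "SUPPKEY": 10, "COST": 7},
    {"PKEY": 2, "SUPPKEY": 10, "COST": None},
]
accepted, rejected = not_null(rows, "COST")
totals = aggregate_sum(accepted, ("PKEY", "SUPPKEY"), "COST")
print(totals)
print(rejected)
```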
Conceptual Model
  Provider Relationships
    a finite set of input/output attributes, an appropriate transformation
    map a set of input attributes to a set of output attributes through a relevant transformation*
  [Figure: notation symbols for provider N:M and provider 1:1 relationships]
  * If the attributes are semantically and physically compatible, no transformation is required
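A provider relationship can be sketched as a small record holding input attributes, output attributes and an optional transformation. This is an illustrative sketch only: the Provider class, its apply() method and the attribute DW.TOTAL are assumptions introduced here, not names from the slides.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Provider:
    """Maps input attributes to output attributes, optionally via a transformation."""
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    # None models the footnote case: compatible attributes need no transformation.
    transformation: Optional[Callable[..., tuple]] = None

    def apply(self, row: dict) -> dict:
        values = tuple(row[name] for name in self.inputs)
        if self.transformation is not None:
            values = self.transformation(*values)
        return dict(zip(self.outputs, values))


# 1:1 provider between compatible attributes, no transformation.
copy_qty = Provider(("S1.QTY",), ("DW.QTY",))

# N:M provider (here 2:1). DW.TOTAL is a hypothetical attribute, for illustration only.
line_total = Provider(("S1.QTY", "S1.COST"), ("DW.TOTAL",),
                      transformation=lambda qty, cost: (qty * cost,))

row = {"S1.QTY": 3, "S1.COST": 2.5}
print(copy_qty.apply(row))
print(line_total.apply(row))
```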
Conceptual Model
  [Figure: provider relationships from the attributes of S1.PARTSUPP (PKey, SuppKey, Qty, Cost) to the attributes of DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost), with the transformations f, SK and NN]
Conceptual Model
  Notes
    informal tags, exactly as in UML modeling
    used for:
      simple comments explaining design decisions
      explanation of the semantics of the applied transformation
      tracing of runtime constraints
  [Figure: notation symbol for a note]
Conceptual Model
  [Figure: the S1.PARTSUPP to DW.PARTSUPP attribute mapping with the transformations f, SK and NN, annotated with the note "Date = SysDate()"]
Conceptual Model
  ETL Constraints
    a finite set of attributes, a single transformation
    express the fact that the data of a certain concept must fulfill several requirements
  [Figure: notation symbol for an ETL constraint]
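As one concrete case, a primary key (PK) constraint over a set of attributes can be checked by looking for repeated key values. A minimal sketch, assuming rows are dictionaries; the function name and the sample data are illustrative:

```python
def primary_key_violations(rows, key_attributes):
    """Return the rows whose key value was already seen in an earlier row."""
    seen = set()
    violations = []
    for row in rows:
        key = tuple(row[a] for a in key_attributes)
        if key in seen:
            violations.append(row)
        else:
            seen.add(key)
    return violations


rows = [
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 4},
    {"PKEY": 2, "SUPPKEY": 10, "QTY": 6},
    {"PKEY": 1, "SUPPKEY": 10, "QTY": 9},  # repeats the key (1, 10)
]
violations = primary_key_violations(rows, ("PKEY", "SUPPKEY"))
print(violations)
```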
Conceptual Model
  [Figure: the S1.PARTSUPP to DW.PARTSUPP attribute mapping with the transformations f, SK and NN, the note "Date = SysDate()", and an added PK constraint]
Conceptual Model
  Candidate Relationships
    a single candidate concept, a single target concept
    used when a certain DW concept can be populated by more than one candidate source concept
  Active Candidate Relationship
    a certain candidate that has been selected for the population of the target concept
    a specialization of candidate relationships
  [Figure: notation for candidate relationships: candidate1 ... candidaten linked to a target under an {XOR} constraint, one of them marked as the active candidate]
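The {XOR} annotation means that only one of the candidates ends up populating the target. A minimal sketch of that rule, using the generic names of the notation (target, candidate1, candidate2); the class and its activate() method are assumptions made for illustration:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidateRelationships:
    """All candidate relationships of one target concept, plus the active one."""
    target: str
    candidates: List[str]
    active: Optional[str] = None

    def activate(self, candidate: str) -> None:
        if candidate not in self.candidates:
            raise ValueError(f"{candidate!r} is not a candidate for {self.target!r}")
        # Assignment overwrites any earlier choice, so at most one candidate is active.
        self.active = candidate


rels = CandidateRelationships("target", ["candidate1", "candidate2"])
rels.activate("candidate1")
rels.activate("candidate2")  # replaces candidate1 as the active candidate
print(rels.active)
```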
Conceptual Model
  [Figure: candidate concepts AnnualPartSupp's and RecentPartSupp's under an {XOR} constraint; concepts S1.PartSupp, S2.PartSupp and DW.PartSupp, with a union (U). Notes: "Necessary providers: S1 and S2"; "Due to accuracy and small size (< update window)"]
Conceptual Model
  [Figure: the complete conceptual model of the example.
    Concepts: S1.PARTSUPP (PKey, SuppKey, Qty, Cost), S2.PARTSUPP (PKey, SuppKey, Date, Department, Qty, Cost), DW.PARTSUPP (PKey, SuppKey, Date, Qty, Cost)
    Candidates: AnnualPartSupp's and RecentPartSupp's under an {XOR} constraint
    Transformations: $2€, American to European Date, SK, γ, f, NN, U
    Aggregation (γ) attributes: S2.PKey, S2.SuppKey, S2.Date, SUM(S2.Qty), SUM(S2.Cost)
    ETL constraint: PK
    Notes: "Date = SysDate()"; "Due to accuracy and small size (< update window)"; "Necessary providers: S1 and S2"; "{Duration<4h}"]
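Two of the function-application transformations named in the figure, $2€ and American to European Date, together with the note Date = SysDate(), could look as follows in code. This is a sketch under stated assumptions: the exchange rate is a made-up constant, "American" is taken to mean MM/DD/YYYY and "European" DD/MM/YYYY, and all function names are illustrative.

```python
from datetime import date, datetime

# Hypothetical exchange rate, for illustration only; not taken from the slides.
ASSUMED_USD_PER_EUR = 1.10


def dollars_to_euros(amount_usd, usd_per_eur=ASSUMED_USD_PER_EUR):
    """$2€: convert a dollar amount to euros, rounded to cents."""
    return round(amount_usd / usd_per_eur, 2)


def american_to_european_date(text):
    """Convert a date string from MM/DD/YYYY to DD/MM/YYYY."""
    return datetime.strptime(text, "%m/%d/%Y").strftime("%d/%m/%Y")


def sys_date():
    """Stand-in for the note Date = SysDate(): the date of the load, as DD/MM/YYYY."""
    return date.today().strftime("%d/%m/%Y")


print(dollars_to_euros(110.0))
print(american_to_european_date("12/31/2002"))
print(sys_date())
```

An invalid input such as "31/12/2002" makes strptime raise ValueError, which is one natural place to hook in a rejection log.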
Instantiation & Specialization Layers
  The key issues:
    genericity
      identification of a small set of generic constructs to capture all cases
    usability
      construction of a 'palette' of frequently used types
Instantiation & Specialization Layers
  Metamodel layer
    a set of generic entities, able to represent any ETL scenario
    involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship
  Template layer
    a set of 'built-in' specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios
  Schema layer
    a specific ETL scenario
    all the entities of the Schema layer are instances of the classes of the Metamodel layer
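The three layers map naturally onto class inheritance (IsA) and object creation (InstanceOf). A minimal sketch, assuming Python classes stand in for the metamodel and template entities; treating DW.PartSupp as a Fact Table is an assumption made only for this example:

```python
# Metamodel layer: generic classes.
class Concept:
    def __init__(self, name):
        self.name = name


class Transformation:
    symbol = "?"

    def __init__(self, name):
        self.name = name


# Template layer: built-in specializations (IsA).
class FactTable(Concept):
    pass


class Aggregation(Transformation):
    symbol = "γ"


class SurrogateKeyAssignment(Transformation):
    symbol = "SK"


# Schema layer: the entities of one specific scenario (InstanceOf).
dw_partsupp = FactTable("DW.PartSupp")
sk = SurrogateKeyAssignment("surrogate key for PKey")

# An instance of a template class is also an instance of its metamodel class.
print(isinstance(dw_partsupp, Concept), isinstance(sk, Transformation))
```

Because templates are subclasses, a designer can add a new template without touching the metamodel classes, which is the extensibility the layering is after.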
Instantiation & Specialization Layers
  [Figure: the Metamodel, Template and Schema layers, connected by IsA and InstanceOf links.
    Metamodel layer: Concept, Attribute, Transformation, ETL_Constraint, Relationship
    Specializations of Relationship: Provider, Candidate, Part Of, Serial Composition
    Template layer: Fact Table, Dimension, ER Entity, ER Relationship; American to European Date, $2€, Surrogate Key Assignment, Aggregation
    Schema layer: S2.PartSupp, DW.PartSupp, Candidate1, Candidate2, and the transformations γ, SK, f, f]
Instantiation & Specialization Layers
  Template layer
    Four groups of logical transformations
      Filters
      Unary transformations
      Binary transformations
      Composite transformations
    Two groups of physical transformations
      Transfer operations
      File operations
Instantiation & Specialization Layers
  Filters
    Selection (σ)
    Not null (NN)
    Primary key violation (PK)
    Foreign key violation (FK)
    Unique value (UN)
    Domain mismatch (DM)
  Unary transformations
    Push
    Aggregation (γ)
    Projection (π)
    Function application (f)
    Surrogate key assignment (SK)
    Tuple normalization (N)
    Tuple denormalization (DN)
  Binary transformations
    Union (U)
    Join (⋈)
    Diff (Δ)
    Update Detection (ΔUPD)
  Composite transformations
    Slowly changing dimension (Type 1, 2, 3) (SDC-1/2/3)
    Format mismatch (FM)
    Data type conversion (DTC)
    Switch (σ*)
    Extended union (U)
  File operations
    EBCDIC to ASCII conversion (EB2AS)
    Sort file (Sort)
  Transfer operations
    Ftp (FTP)
    Compress/Decompress (Z/dZ)
    Encrypt/Decrypt (Cr/dCr)
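To make two palette entries concrete, here is one simple reading of Surrogate key assignment (SK) and Diff (Δ). It is a sketch under assumptions: SK is taken to replace a production key through a lookup table, handing out the next integer to unseen keys, and Diff is taken to return the rows of a new snapshot whose key is absent from the old one. The slides do not define either operation at this level of detail.

```python
def assign_surrogate_keys(rows, key_attribute, lookup):
    """SK: replace the production key by its surrogate key; unseen keys get a new one."""
    result = []
    for row in rows:
        key = row[key_attribute]
        if key not in lookup:
            lookup[key] = max(lookup.values(), default=0) + 1
        result.append({**row, key_attribute: lookup[key]})
    return result


def diff(new_rows, old_rows, key_attribute):
    """Diff: rows of the new snapshot whose key does not occur in the old snapshot."""
    old_keys = {r[key_attribute] for r in old_rows}
    return [r for r in new_rows if r[key_attribute] not in old_keys]


old_snapshot = [{"PKEY": "A"}]
new_snapshot = [{"PKEY": "A"}, {"PKEY": "B"}]
lookup = {"A": 100}

fresh = diff(new_snapshot, old_snapshot, "PKEY")
keyed = assign_surrogate_keys(new_snapshot, "PKEY", lookup)
print(fresh)
print(keyed)
```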
Methodology
  Step 1: Identification of the proper data stores
  Step 2: Candidates and active candidates for the involved data stores
  Step 3: Attribute mapping between the providers and the consumers
  Step 4: Annotating the diagram with runtime constraints
Conclusions
  Our contribution lies in:
    The proposal of a novel conceptual model which is customized for the tracing of inter-attribute relationships and the respective ETL activities
    A customizable and extensible construction
    The introduction of a 'palette' of frequently used ETL activities
On-going/Future Work
  The Arktos II project is aimed towards the
    Conceptual modeling
    Logical modeling
    Optimization
    What-if analysis
  of ETL scenarios
  http://www.dblab.ece.ntua.gr/~pvassil/projects/arktos_II
Logical Model [DMDW'02]
  [Figure: logical ETL workflow spanning the Sources, the DSA and the DW.
    Sources: S1_PARTSUPP, S2_PARTSUPP, transferred by FTP1 and FTP2
    DSA data stores: DS.PS_NEW1, DS.PS_OLD1, DS.PS_NEW2, DS.PS_OLD2, DS.PS1, DS.PS2
    DW data stores: DW.PARTSUPP, V1, V2, TIME (DW.PARTSUPP.DATE, DAY)
    Activities: DIFF1 (DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY), DIFF2 (DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY), Add_SPK1 (SUPPKEY=1), Add_SPK2 (SUPPKEY=2), SK1 (DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY), SK2 (DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY), $2€, A2EDate, AddDate (DATE=SYSDATE), NotNULL, CheckQTY (QTY>0), U, Aggregate1 (PKEY, DAY; MIN(COST)), Aggregate2 (PKEY, MONTH; AVG(COST))
    Rejected rows of several activities are written to a log ("Log rejected")]
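The "Log rejected" boxes of the logical workflow suggest that each check passes good rows on and records the rest. A minimal sketch of that pattern, using the activity names NotNULL and CheckQTY (QTY>0) from the figure; applying NotNULL to COST, the order of the checks and the run_checks helper are assumptions made for illustration:

```python
def run_checks(rows, checks):
    """Apply each (name, predicate) in order; log the rows a check rejects."""
    log = []
    for name, predicate in checks:
        kept = []
        for row in rows:
            if predicate(row):
                kept.append(row)
            else:
                log.append((name, row))
        rows = kept
    return rows, log


checks = [
    ("NotNULL", lambda r: r["COST"] is not None),  # assumed to guard COST
    ("CheckQTY", lambda r: r["QTY"] > 0),
]

rows = [
    {"PKEY": 1, "QTY": 5, "COST": 1.0},
    {"PKEY": 2, "QTY": 0, "COST": 2.0},
    {"PKEY": 3, "QTY": 3, "COST": None},
]
loaded, log = run_checks(rows, checks)
print(loaded)
print(log)
```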
Conceptual Model
  [Figure: notation legend: concept, attribute, part of, provider 1:1, provider N:M, transformation, serial composition, ETL_constraint, Note, candidate1 ... candidaten under {XOR}, active candidate, target]
The lifecycle of a Data Warehouse and its ETL processes
  [Figure: lifecycle diagram.
    Stages: Reverse Engineering of Sources & Requirements Collection; Logical Design; Tuning – Full Activity Description; Software Construction; Administration of DW
    Artifacts: Conceptual Model for DW, Sources & Activities; Logical Model for DW, Sources & Activities; Physical Model for DW, Sources & Activities; Software & SW Metrics; Metrics]
Conceptual Model
  [Figure: UML class diagram of the metamodel. Metaclasses and the member or role names that appear next to them:
    Attribute (+name)
    Concept (+name, +schema)
    Transformation (+name, +symbol, +input, +output)
    Relationship, PartOf
    Provider (+input, +output, +transformation)
    Candidate (-candidate, -target), Active Candidate
    ETL_Constraint (+attributes, +transformation)
    Serial Composition (+initiating, +consequent)
    Tag (+content)]
Conceptual Model
  General Notes
    It is not a process/workflow model
    It is orthogonal to the conceptual models which are available for the modeling of DW star schemata
    It is specifically tailored for the back end of the DW
    Any of the proposals for the DW front end can be combined with our approach