Bernard ESPINASSE - Data Warehouse Conceptual modeling and Design 1
Data Warehouse/Data Mart Conceptual Modeling and Design
(4)
Bernard ESPINASSE, Professor at Aix-Marseille Université (AMU)
Ecole Polytechnique Universitaire de Marseille
September 2013
• Methodological Framework
• Conceptual Modelling: the Dimensional Fact Model (DFM)
• Conceptual Design: from Relational schema to DFM

1. Methodological Framework
• Conceptual Design & Logical Design
• Top-Down versus Bottom-Up Approach
• Design Phases and schemata derivations
2. Conceptual Modelling: the Dimensional Fact Model (DFM)
• Fact schema
• Dimension hierarchies
• Additive, semi-additive and non-additive attributes
• Overlapping compatible fact schemata
• Representing query patterns on a fact schema
3. Conceptual Design: from Relational schema to DFM of a Data Mart
• Building the attribute tree; pruning and grafting the attribute tree
• Defining dimensions, measures and granularity of data
• Books
  - Golfarelli M., Rizzi S., “Data Warehouse Design: Modern Principles and Methodologies”, McGraw-Hill, 2009.
  - Kimball R., Ross M., “Entrepôts de données : guide pratique de modélisation dimensionnelle”, 2nd edition, Vuibert, 2003.
  - Rizzi S., “Conceptual modeling solutions for the data warehouse”, in Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications, J. Wang (Ed.), Information Science Reference, pp. 208-227, 2008.
  - Golfarelli M., Maio D., Rizzi S., “Conceptual Design of Data Warehouses from E/R Schemes”, Proceedings 31st Hawaii International Conference on System Sciences (HICSS-31), vol. VII, Kona, Hawaii, pp. 334-343, 1998.
• Courses
  - Course of M. Golfarelli and S. Rizzi, University of Bologna
  - Courses of M. Böhlen and J. Gamper, Free University of Bolzano
• Conceptual Design & Logical Design
• Life-Cycle
• Top-Down, Bottom-Up and Mixed Strategies
• Design Phases
• Schemata derivations for DMs design
• Entity-Relationship (E/R) models are not very useful for modeling DWs
• It is now universally recognized that a DW is based on a multidimensional view of data:
  - but there is still no agreement on HOW to carry out its conceptual design
• Most of the time, DW design is done at the logical level: a multidimensional model (star/snowflake schema) is designed directly:
  - but a star schema is nothing but a relational schema: it contains only the definition of a set of relations and integrity constraints
• A better approach:
  - 1) first design a conceptual model: Conceptual Design
  - 2) then translate it into a logical model: Logical Design
• Building a DW is a very complex task that requires accurate planning aimed at devising satisfactory answers to organizational and architectural questions
• A large number of organizations lack the experience and skills required to meet the challenges involved in DW projects
• A major cause of DW failures is the absence of a global view of the design process, that is, of a design methodology
• Design methodologies are necessary to minimize the risk of failure
• Three main strategies for DW design:
  - Top-Down strategy
  - Bottom-Up strategy
  - Mixed strategy
Top-Down Approach:
1. Design of DW
2. Design of DMs
Bottom-Up Approach:
1. Design of DMs
2. Integration of DMs in DW
3. Maybe no physical DW
Mixed Approach:
1. Design of DW for DM1
2. Design of DM2 and integration with DW
3. Design of DM3 and integration with DW
4. ...
[Figure: the existing databases and systems (OLTP: DB1 to DB4 with their applications) feed the global Data Warehouse (DW) through a "Trans.." step; the DW feeds the Data Marts (DM1, DM2, DM3) and their applications. Arrows mark the Top-Down, Bottom-Up and Mixed approaches.]
Top-Down strategy: analyze global business needs, plan how to develop a DW, design it, and implement it as a whole with its DMs
(+) Strengths:
  - Promising: it is based on a global picture of the goal to achieve, and in principle it ensures a consistent, well-integrated DW
(-) Weaknesses:
  - High cost estimates and long implementation times discourage company managers from embarking on this kind of project.
  - Analyzing and integrating all relevant sources at the same time is a very difficult task, not least because they are unlikely to be all available and stable at the same time.
  - It is extremely difficult to forecast the specific needs of every department involved in a project, which leads to specific DMs
  - As no working DW system is going to be delivered in the short term, users cannot check whether the project is useful, so they lose trust and interest in it.
Bottom-Up strategy:
• The DW is built incrementally and several DMs are created iteratively
• Each DM is based on a set of facts that are linked to a specific department and that can be interesting for a user group
(+) Strengths:
  - Leads to concrete results in a short time
  - Does not require huge investments
  - Enables designers to investigate one area at a time
  - Gives managers quick feedback about the actual benefits of the system being built
  - Keeps interest in the project constantly high
(-) Weakness:
  - May lead to a partial vision of the business domain
=> Mixed strategy …
Top-Down and Bottom-Up strategies should be mixed:
• When planning a DW, a bottom-up strategy should be followed
• One Data Mart (DM) at a time is identified and prototyped according to a top-down strategy, by building a conceptual schema for each fact of interest
• The first DM (DM1) to prototype:
  - is the one playing the most strategic role for the enterprise
  - should be a backbone for the whole DW
  - should lean on available and consistent data sources
Phase 1: Goal setting and planning of the DW
• set system goals, borders, and size
• select an approach for design and implementation
• estimate costs and benefits
• analyze risks and expectations
• examine the skills of the working team
Phase 2: Infrastructure design of the DW
• analyze and compare the possible architectural solutions
• assess the available technologies and tools
• create a preliminary plan of the whole DW system
Phase 3: Design and development of DMs
• every iteration causes a new DM and new applications to be created and progressively added to the DW system
[Figure: Phase 1 (Goal setting and planning), Phase 2 (Infrastructure design), Phase 3 (Design and development of Data Marts).]
Each Data Mart (DM) is designed according to these steps:
[Figure: Data Mart Design Phases. Source analysis and integration -> Requirement analysis -> Conceptual design -> Workload and data volume -> Logical design -> ETL design -> Physical design. Roles involved: business user, designer, db administrator. Source: J. Gamper, Free University of Bolzano, DWDM 2012/13.]
[Figure: Methodological framework. Steps: analysis of the operational db -> requirement specification -> conceptual design -> workload refinement -> logical design -> physical design. Roles involved: final user, designer, db administrator. Note: DWs are based on a pre-existing information system.]
[Figure: Methodological framework (2). CONCEPTUAL DESIGN takes the E/R or relational scheme, the facts and a preliminary workload, and yields the conceptual scheme; LOGICAL DESIGN takes the workload and the target logical model, and yields the logical scheme; PHYSICAL DESIGN takes the workload and the target DBMS, and yields the physical scheme. The example relational scheme shows a store table (store key, store, city, region, address, sales manager) and a sales table (time key, store key, product key, quantity sold, receipts, number of customers).]
• Fact schema
• Dimension hierarchies
• Fact schema and fact instances
• Additive attributes
• Semi-additive and non-additive attributes
• Overlapping compatible fact schemata
• Representing query patterns on a fact schema
• Conceptual Design is based on the documentation of the underlying operational information system (IS):
  - Relational schemata, or
  - E/R schemata
• Steps:
  1. Find facts
  2. For each fact:
     a) Navigate functional dependencies
     b) Drop useless attributes
     c) Define dimensions and measures
The Dimensional Fact Model (DFM) was proposed by M. Golfarelli and S. Rizzi to support the Conceptual Design of DWs
The DFM is a graphical conceptual model for Data Mart design
The aim of the DFM is to:
1. Provide efficient support for Conceptual Design
2. Create an environment in which user queries may be formulated intuitively
3. Make communication possible between designers and end users, with the goal of formalizing requirement specifications
4. Build a stable platform for logical design (independently of the target logical model)
5. Provide clear and expressive design documentation
The conceptual representation generated by the DFM consists of a set of fact schemata that basically model facts, measures, dimensions, and hierarchies.
Ex: a simple 3-dimensional fact schema « SALE » for a chain of stores:
• A fact schema is structured as a tree whose root is a fact
• A Conceptual Model of a DW consists of a set of fact schemata
[Figure: fact schema SALE. Fact: SALE; dimensions: date, store, product; measures: quantity, receipts, unitPrice, numberOfCustomer.]
A fact is a concept relevant to decision-making processes:
• It models a set of events (ex: in a company: sales, shipments, purchases, ...)
• It has dynamic properties or evolves in some way over time
• It has one or more numeric and continuously valued attributes which "measure" the fact from different points of view
• A measure is a numerical property of a fact and describes a quantitative aspect of the fact that is relevant to analysis.
  Ex: every sale is quantified by its quantity, receipts, unitPrice, numberOfCustomer
• A dimension is a fact property with a finite domain and describes an analysis coordinate of the fact.
  Ex: typical dimensions for the sales fact are product, store, and date
[Figure: fact schema SALE. Fact: SALE; dimensions: date, store, product; measures: quantity, receipts, unitPrice, numberOfCustomer.]
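As a rough illustration of these definitions, the SALE fact schema can be written down as a small data structure. This is only a sketch: the class `FactSchema`, the helper `is_valid_instance` and the sample values are assumptions made for this example, not part of the DFM, and the measure name is spelled `receipts` here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactSchema:
    """A fact with its dimensions (analysis coordinates) and measures."""
    name: str
    dimensions: tuple[str, ...]
    measures: tuple[str, ...]


# The SALE example: three dimensions, four measures.
SALE = FactSchema(
    name="SALE",
    dimensions=("date", "store", "product"),
    measures=("quantity", "receipts", "unitPrice", "numberOfCustomer"),
)

# One fact instance (event): a coordinate on every dimension plus measure values.
# The values are invented for illustration.
event = {
    "date": "2013-09-02", "store": "S1", "product": "P1",
    "quantity": 10, "receipts": 25.0, "unitPrice": 2.5, "numberOfCustomer": 4,
}


def is_valid_instance(schema: FactSchema, row: dict) -> bool:
    """True when the row has exactly the schema's dimensions and measures."""
    return set(row) == set(schema.dimensions) | set(schema.measures)
```

A row that lacks a coordinate on one dimension, or carries an unknown field, is rejected by `is_valid_instance`.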
• A hierarchy determines how fact instances may be aggregated and selected in a way that is significant for the decision-making process, and determines the granularity adopted for representing facts.
• Hierarchies are subtrees rooted in dimensions:
[Figure: fact schema SALE with its hierarchies. Attributes under date: day, week, holiday, month, quarter, year; under product: type, category, department, marketingGroup, brand, brandCity, and size (marked as a non-dimension attribute); under store: storeCity, state, country, salesDistrict, salesManager. Measures: quantity, receipts, unitPrice, numberOfCustomer.]
In dimension hierarchies:
• nodes represented by circles are dimension attributes, which may assume a discrete set of values.
  Ex: week, month, product, …
• arcs represent relationships between pairs of attributes; these relationships are functional dependencies.
  Ex: product -> type; type -> category; category -> department …
• the dimension attributes in the nodes along each sub-path of the hierarchy, starting from the dimension, define progressively coarser granularities.
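A minimal sketch of how functional dependencies drive aggregation, assuming the product hierarchy of the example (product -> type -> category -> department). The mapping tables `TYPE_OF` and `CATEGORY_OF` and the product codes are hypothetical sample data.

```python
# Functional dependencies of the product hierarchy, child -> parent,
# taken from the example: product -> type -> category -> department.
PARENT = {"product": "type", "type": "category", "category": "department"}


def path_from(attribute: str) -> list[str]:
    """Attributes reached from `attribute`, finest granularity first."""
    path = [attribute]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


# Hypothetical instance-level mappings that realise the dependencies.
TYPE_OF = {"P1": "soft drink", "P2": "soft drink", "P3": "biscuit"}
CATEGORY_OF = {"soft drink": "beverage", "biscuit": "food"}


def roll_up(sales_by_product: dict[str, int]) -> dict[str, int]:
    """Aggregate a quantity from product level to category level."""
    totals: dict[str, int] = {}
    for product, qty in sales_by_product.items():
        category = CATEGORY_OF[TYPE_OF[product]]
        totals[category] = totals.get(category, 0) + qty
    return totals
```

Because each product has exactly one type and each type exactly one category, every product-level value lands in exactly one category-level total.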
A non-dimension attribute contains additional information about an attribute of the hierarchy; it cannot be used for aggregation.
Ex: size: aggregating sales according to the size of the product would not make sense!
[Figure: fact schema SALE with non-dimension attributes. Same hierarchies on date, product and store; the attributes size, address and telephone are shown as non-dimension attributes.]
Optional arcs (marked by a dash) express optional relationships between pairs of attributes (useful for logical design).
Ex: diet, promotion. The diet attribute takes a value (such as cholesterol-free, gluten-free, or sugar-free) only for food products; for the other products, it is undefined.
[Figure: fact schema SALE with optional arcs. Same hierarchies on date, product and store, plus the optional attribute diet and the optional promotion with attributes discount, advertising, startDate, endDate and cost.]
A cross-dimensional attribute is a dimensional or descriptive attribute whose value is defined by the combination of 2 or more dimensional attributes, possibly belonging to different hierarchies.
Ex: if a product's Value Added Tax (VAT) depends both on the product category and on the country where the product is sold, you can use a cross-dimensional attribute to represent it:
[Figure: fact schema SALE with a cross-dimensional attribute. VAT is attached to both category (product hierarchy) and country (store hierarchy).]
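One way to picture the VAT example is a lookup keyed by the pair (category, country): neither attribute alone determines the value. The table `VAT`, the function `vat_for` and all rates are invented for illustration only.

```python
# VAT rate keyed by (product category, country): neither attribute alone
# determines the rate. All rates below are made up for illustration.
VAT = {
    ("food", "France"): 0.055,
    ("food", "Italy"): 0.04,
    ("clothing", "France"): 0.20,
    ("clothing", "Italy"): 0.22,
}


def vat_for(category: str, country: str) -> float:
    """Value of the cross-dimensional attribute for one combination."""
    return VAT[(category, country)]
```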
A convergence takes place when 2 dimensional attributes within a hierarchy are connected by 2 or more alternative paths of many-to-one associations (graphically, arrows are used).
Ex: in the store dimension, stores are grouped into sales districts and no inclusion relationship exists between districts and states, but each district is part of only one country:
store -> salesDistrict -> country
or
store -> storeCity -> state -> country
[Figure: fact schema SALE with a convergence. Both paths from store (through salesDistrict, and through storeCity and state) end in country.]
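A convergence implies that both paths must yield the same country for every store. The check below is a sketch under that reading; the stores, districts, cities and states are hypothetical sample data.

```python
# Path 1: store -> salesDistrict -> country (sample data, invented).
DISTRICT_OF = {"S1": "North", "S2": "South"}
COUNTRY_OF_DISTRICT = {"North": "Italy", "South": "Italy"}

# Path 2: store -> storeCity -> state -> country (sample data, invented).
CITY_OF = {"S1": "Bologna", "S2": "Bari"}
STATE_OF = {"Bologna": "Emilia-Romagna", "Bari": "Puglia"}
COUNTRY_OF_STATE = {"Emilia-Romagna": "Italy", "Puglia": "Italy"}


def country_via_district(store: str) -> str:
    return COUNTRY_OF_DISTRICT[DISTRICT_OF[store]]


def country_via_city(store: str) -> str:
    return COUNTRY_OF_STATE[STATE_OF[CITY_OF[store]]]


def paths_converge(stores: list[str]) -> bool:
    """True when both alternative paths lead every store to the same country."""
    return all(country_via_district(s) == country_via_city(s) for s in stores)
```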
Shared hierarchies exist when entire portions of hierarchies are replicated 2 or more times in fact schemata.
In particular with time hierarchies: 2 or more date-type dimensions with different meanings can easily exist in the same fact, and a month-year hierarchy would have to be built on each of them
=> an abbreviation is introduced
Ex: calling and called phone numbers …
[Figure: fact schema CALL, drawn twice. Full version: measures number and duration; dimensions hour, date (with month and year), callingNumber (with callingNumberType and callingNumberDistrict) and calledNumber (with calledNumberType and calledNumberDistrict). Abbreviated version: a single shared hierarchy telNumber (with type and district), reached through the roles calling and called.]
A multiple arc models a many-to-many association between the 2 dimensional attributes it connects (graphically, it is denoted by doubling the arc).
Ex: consider a fact schema modeling the sales of books, whose dimensions are date and book. It would certainly be interesting to aggregate and select sales on the basis of book authors. However, it would not be accurate to model author as a dimensional child attribute of book, because a book can have several authors and an author can write many books. The relationship between books and authors is therefore modeled as a multiple arc:
[Figure: fact schema SALE for books. Dimensions: date (with day, week, holiday, month, quarter, year) and book (with genre); author is connected to book by a multiple arc. Measures: quantity, receipts, unitPrice, numberOfCustomer.]
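The sketch below shows why a many-to-many association needs special care when aggregating: attributing each sale to every author of the book makes the per-author totals exceed the overall total. The books, authors and quantities are hypothetical.

```python
# Many-to-many association between books and authors (sample data, invented).
AUTHORS_OF = {"B1": ["Ann", "Bob"], "B2": ["Ann"]}
SALES = {"B1": 10, "B2": 5}  # quantity sold per book


def quantity_by_author() -> dict[str, int]:
    """Attribute the full quantity of each book to each of its authors."""
    totals: dict[str, int] = {}
    for book, qty in SALES.items():
        for author in AUTHORS_OF[book]:
            totals[author] = totals.get(author, 0) + qty
    return totals
```

Here the per-author totals add up to 25 while only 15 copies were sold, because the sales of the co-authored book B1 are counted once per author.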
3 types of measure:
  - Flow measures: refer to a timeframe (ex: number of products sold in a day)
  - Level measures: evaluated at a particular time (ex: number of products in inventory)
  - Unit measures: evaluated at a particular time but expressed in relative terms (ex: product unit price, discount percentage)
Suitable operators for aggregation:

                   Temporal hierarchies    Non-temporal hierarchies
  Flow measures    SUM, AVG, MIN, MAX      SUM, AVG, MIN, MAX
  Level measures   AVG, MIN, MAX           SUM, AVG, MIN, MAX
  Unit measures    AVG, MIN, MAX           AVG, MIN, MAX

3 natures of measure:
  - additive along a dimension when the SUM aggregation operator can be used
  - non-additive along a dimension if the aggregation operator is not SUM (ex: inventory level)
  - a non-additive measure is non-aggregable if no aggregation operator exists (ex: unitPrice product)
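The operator table can be encoded as a small lookup, which makes the additivity rule easy to check. The names `ALLOWED`, `can_aggregate` and `is_additive` are assumptions made for this sketch.

```python
ALL_OPS = frozenset({"SUM", "AVG", "MIN", "MAX"})
NO_SUM = ALL_OPS - {"SUM"}

# (measure type, hierarchy kind) -> operators allowed by the table.
ALLOWED = {
    ("flow", "temporal"): ALL_OPS,
    ("flow", "non-temporal"): ALL_OPS,
    ("level", "temporal"): NO_SUM,
    ("level", "non-temporal"): ALL_OPS,
    ("unit", "temporal"): NO_SUM,
    ("unit", "non-temporal"): NO_SUM,
}


def can_aggregate(measure_type: str, hierarchy: str, operator: str) -> bool:
    """True when the table lists `operator` for this measure type and hierarchy."""
    return operator in ALLOWED[(measure_type, hierarchy)]


def is_additive(measure_type: str, hierarchy: str) -> bool:
    """A measure is additive along a hierarchy when SUM can be used."""
    return can_aggregate(measure_type, hierarchy, "SUM")
```

For instance, an inventory level (a level measure) is additive along non-temporal hierarchies but not along time.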
• In the DFM, measures are additive (operator SUM) along all the dimensions by default
• A non-additive measure can be explicitly specified together with the operator(s) used for its aggregation, other than SUM (Ex: AVG and MIN for inventory level)
[Figure: fact schema INVENTORY. Measures: level (non-additive measure, marked AVG, MIN) and incomingQuantity. Attributes shown: date, week, month, quarter, year; warehouse, city, country, address; product, type, category, department, brand, weight, packaging, ItemPerPallet.]
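A small sketch of what non-additivity means for an inventory level: summing across warehouses for one week is meaningful, whereas along time an operator such as AVG or MIN is used instead of SUM. The weeks, warehouses and levels are invented sample data.

```python
from statistics import mean

# (week, warehouse) -> inventory level of one product (sample data, invented).
LEVEL = {
    ("W1", "WH1"): 100, ("W1", "WH2"): 40,
    ("W2", "WH1"): 80, ("W2", "WH2"): 60,
}


def total_per_week() -> dict[str, int]:
    """SUM across warehouses: total stock held during each week."""
    totals: dict[str, int] = {}
    for (week, _warehouse), level in LEVEL.items():
        totals[week] = totals.get(week, 0) + level
    return totals


def over_time(operator) -> dict[str, float]:
    """Aggregate along time with AVG or MIN (not SUM), per warehouse."""
    per_warehouse: dict[str, list[int]] = {}
    for (_week, warehouse), level in LEVEL.items():
        per_warehouse.setdefault(warehouse, []).append(level)
    return {wh: operator(levels) for wh, levels in per_warehouse.items()}
```

`over_time(mean)` gives the average level per warehouse and `over_time(min)` the lowest level reached; adding the two weekly levels of a warehouse would not describe any real stock.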
• Different facts are represented in different fact schemata
• Queries the user formulates on the DW may require comparing fact attributes taken from distinct, though related, schemata (drill across in OLAP)
• 2 fact schemata are said to be compatible if they share at least one dimension attribute
• 2 compatible schemata F and G may be overlapped to create a resulting schema H
• Provided the attribute dependencies in the 2 schemata do not conflict:
  - the set of fact attributes in H is the union of the sets in F and G
  - the dimensions in H are the intersection of those in F and G, assuming that a given dimension is common to F and G if at least one dimension attribute is shared
  - each hierarchy in H includes all and only the dimension attributes included in the corresponding hierarchies of both F and G.
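The overlap rules can be sketched with set operations: union for fact attributes, intersection for hierarchies. The class `Schema`, the functions `compatible` and `overlap`, and the sample schemata (loosely modeled on two employee schemata) are assumptions for illustration; the sketch also assumes that each hierarchy of F shares attributes with at most one hierarchy of G.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    name: str
    measures: set[str]
    # hierarchy label -> attributes of that hierarchy (dimension included)
    hierarchies: dict[str, set[str]]


def compatible(f: Schema, g: Schema) -> bool:
    """Compatible schemata share at least one dimension attribute."""
    attrs_f = set().union(*f.hierarchies.values())
    attrs_g = set().union(*g.hierarchies.values())
    return bool(attrs_f & attrs_g)


def overlap(f: Schema, g: Schema, name: str) -> Schema:
    """Union of measures; per hierarchy, only the attributes found in both."""
    hierarchies: dict[str, set[str]] = {}
    for label, attrs_f in f.hierarchies.items():
        for attrs_g in g.hierarchies.values():
            shared = attrs_f & attrs_g
            if shared:  # the dimension is common to F and G
                hierarchies[label] = shared
    return Schema(name, f.measures | g.measures, hierarchies)


F = Schema("EMPLOYEES", {"number of emp.", "max. salary"},
           {"time": {"month", "year"}, "job": {"job"},
            "store": {"store", "city", "state"}})
G = Schema("NON-EUROPEAN EMPLOYEES", {"numb. of non-eur. emp."},
           {"time": {"quarter", "year"}, "job": {"job"},
            "store": {"city", "state"}, "sex": {"sex"},
            "nation": {"nation", "continent"}})
H = overlap(F, G, "ALL EMPLOYEES")
```

With this sample data, H keeps only year on the time hierarchy and city and state on the store hierarchy, while it carries the measures of both schemata.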
Consider the 2 fact schemata:
• F represents all employees of an enterprise
• G represents only the non-European employees
F and G are compatible: they share the time, job and store dimensions
[Figure: fact schemata F and G. F (EMPLOYEES): measures number of emp. and max. salary; attributes job, month, year, store, city, state; aggregation labels AVG and MAX. G (NON-EUROPEAN EMPLOYEES): measure number of emp.; attributes job, quarter, year, city, state, sex, age range, nation, continent; aggregation label AVG.]
• The schema resulting from overlapping F and G is H
• H can be used, for instance, to calculate the percentage of non-European employees for each city, job and year.
[Figure: schema H (ALL EMPLOYEES), the overlap of F and G. Measures: number of emp., max salary of emp., numb. of non-eur. emp.; attributes job, year, city, state; aggregation labels AVG and MAX.]
Representing query patterns on a fact schema
The basic OLAP operators for formulating typical queries on DWs are roll-up, drill down, drill across and slice-and-dice; they are used, respectively, to aggregate fact attributes in order to view data at a higher level of abstraction, to disaggregate fact attributes in order to introduce further detail, to relate and compare distinct facts, and to select and project facts so as to reduce their dimensionality [2].
On a fact schema, a query may be represented by a query pattern, which consists of a set of markers placed on the dimension attributes. One or more markers can be placed within each hierarchy, to indicate at what level(s) fact instances must be aggregated. A dimension may also contain no markers, to indicate that none of its attributes is involved in the query. Non-dimension attributes need not be shown on the query pattern.
The data shown as a result of a query may be any combination of fact attributes, and/or the result of any computation made on them. Figure 4 shows the query pattern representing the following query: "total quantity sold and average returns per unit sold for each week and for each type of product". The average returns per unit sold is the ratio between the total returns and the quantity sold.
[Figure 4: Query pattern on the SALE fact schema. Result: qty sold and returns/qty sold. Attributes shown: product, type, category, manufacturer; week, month; store, city, state, sales manager.]
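The query "total quantity sold and average returns per unit sold for each week and for each type of product" can be sketched as a group-by on the two marked attributes. The rows are invented sample data, and the function name `run_query` is an assumption; the sketch also assumes every group has a non-zero quantity sold.

```python
from collections import defaultdict

# Primary fact instances: (week, product type, qty sold, returns).
# Sample data, invented for illustration.
ROWS = [
    ("W1", "drink", 10, 1),
    ("W1", "drink", 30, 1),
    ("W1", "snack", 20, 4),
    ("W2", "drink", 5, 0),
]


def run_query(rows):
    """Group by (week, type); return (total qty sold, returns per unit sold)."""
    qty = defaultdict(int)
    returns = defaultdict(int)
    for week, product_type, q, r in rows:
        qty[(week, product_type)] += q
        returns[(week, product_type)] += r
    # The ratio is computed on the totals, not averaged row by row.
    return {key: (qty[key], returns[key] / qty[key]) for key in qty}
```

The markers of the query pattern correspond to the grouping key (week, type); dimensions without markers, such as store, simply do not appear in the key.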
• In some cases, aggregation along a dimension can be carried out at different abstraction levels even if the corresponding dimension attributes are not explicitly shown.
• Ex: given a month attribute within a time hierarchy, fact instances can be aggregated by quarter, semester and year by performing a simple calculation.
• Thus, given the F and G fact schemata, the attribute quarter could in principle be added to the time dimension in the resulting schema H
• On the other hand, the designer must keep in mind that, with this solution, the time for extracting data by quarter will increase significantly
• Thus, the best solution would probably be to add the quarter attribute explicitly to the time hierarchy in the employee fact schema.
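The "simple calculation" that derives a quarter from a month can be sketched as follows, assuming calendar quarters and an additive measure; the function names and sample figures are hypothetical.

```python
def quarter_of(month: int) -> int:
    """Calendar quarter (1-4) of a month number (1-12)."""
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month: {month}")
    return (month - 1) // 3 + 1


def by_quarter(by_month: dict[tuple[int, int], int]) -> dict[tuple[int, int], int]:
    """Roll an additive measure up from (year, month) to (year, quarter)."""
    totals: dict[tuple[int, int], int] = {}
    for (year, month), value in by_month.items():
        key = (year, quarter_of(month))
        totals[key] = totals.get(key, 0) + value
    return totals
```

Computing the quarter at query time costs extra work on every extraction, which is the trade-off against storing quarter explicitly in the hierarchy.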
Fact schema INVENTORY, fact schema SHIPMENT, and the fact schema overlapping INVENTORY and SHIPMENT:
Additivity
Aggregation requires defining a proper operator to compose the measure values characterizing primary fact instances into measure values characterizing each secondary fact instance.
Definition 4. Given a fact schema $f$, measure $m_j \in M$ is said to be aggregable on dimension $d_k \in Dim(f)$ if $\exists\,(m_j, d_k, \Omega) \in S$, non-aggregable otherwise. Measure $m_j$ is said to be additive on $d_k$ if $\exists\,(m_j, d_k, \text{'SUM'}) \in S$, non-additive otherwise.
As a guideline, most measures in a fact schema should be additive. An example of an additive measure in the sale schema is qty sold: the quantity sold for a given sales manager is the sum of the quantities sold for all the stores managed by that sales manager.
A measure may be non-additive on one or more dimensions. Examples of this are all the measures expressing a level, such as an inventory level, a temperature, etc. An inventory level is non-additive on time, but it is additive on the other dimensions. A temperature measure is non-additive on all the dimensions, since adding up two temperatures hardly makes sense. However, this kind of non-additive measure can still be aggregated by using operators such as average, maximum, minimum; Figure 5 shows an example where both operators AVG and MIN can be used for aggregation; measure qty expresses, for each product, the number of copies present within each warehouse during each week.
[Figure 5: The INVENTORY fact schema. Measure: qty (marked AVG, MIN). Attributes shown: product, type, category, brand, weight, package size, package type, units per pallet; week, month, year, season; warehouse, address, city, state.]
For other measures, aggregation is inherently impossible for conceptual reasons. Consider the measure number of customers in the sale example, estimated for a given product, day and store by counting the number of purchase tickets for that product printed on that day in that store. Since the same ticket may include other products, adding or averaging the number of customers for two or more products would lead to an inconsistent result. Thus, number of customers is non-aggregable on the product dimension (while it is additive on the time and the stores dimensions). In this case, the reason for non-aggregability is that the relationship between purchase tickets and products is many-to-many instead of many-to-one: measure number of customers cannot be consistently …
Fig. 8. The SHIPMENT scheme (a) and its overlap with INVENTORY (b). [Figure not reproduced. (a) Fact SHIPMENT with measure qty shipped (further measures elided in the figure); attributes shown include product, type, category, brand, weight, package size, diet, department, manager; date, month, quarter, year, season; customer, corporate, ship to, ship from, address, city, state, contact person; ship mode, carrier, type; deal, terms, incentive, allowance; order date, invoice number. (b) Overlap of SHIPMENT and INVENTORY with measures qty shipped and inventory qty (annotated AVG, MIN); attributes shown: product, type, category, brand, weight, package size; month, year, season.]
• The measures in f are the union of those in f' and f". Thus, the fact on which f is centred may be considered as a sort of "macro-fact" embracing both f' and f".
• Each hierarchy in f includes all and only the attributes included in the corresponding hierarchies of both f' and f". The functional dependencies expressed by the inter-attribute links in f' and f" are preserved.
• The domain of each dimension attribute in f is the intersection of the domains of the corresponding attributes in f' and f".
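The first two rules can be sketched in a few lines of Python. This is a simplified illustration, not the formal definition: the dictionary encoding of a fact scheme, the function name `overlap`, and the pairing of hierarchies by a shared name ("product", "time") are all assumptions, and attribute domains are not modelled.

```python
def overlap(f1, f2):
    """Overlap two fact schemes encoded as {'measures': set, 'hierarchies': {name: set}}."""
    shared = f1["hierarchies"].keys() & f2["hierarchies"].keys()
    return {
        # Measures of the overlap: union of the measures of both schemes.
        "measures": f1["measures"] | f2["measures"],
        # Hierarchies of the overlap: only attributes present in both schemes.
        "hierarchies": {
            name: f1["hierarchies"][name] & f2["hierarchies"][name]
            for name in sorted(shared)
        },
    }


# Hypothetical, reduced versions of the SHIPMENT and INVENTORY schemes.
shipment = {
    "measures": {"qty shipped"},
    "hierarchies": {
        "product": {"product", "type", "category", "brand"},
        "time": {"date", "month", "quarter", "year", "season"},
        "customer": {"customer", "corporate"},
    },
}
inventory = {
    "measures": {"inventory qty"},
    "hierarchies": {
        "product": {"product", "type", "category", "brand"},
        "time": {"week", "month", "year", "season"},
        "warehouse": {"warehouse", "city", "state"},
    },
}

combined = overlap(shipment, inventory)
```

With this sample data, the time hierarchy of the overlap keeps only month, year and season, since date, quarter and week each belong to one scheme only.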
• Building the attribute tree
• Pruning and grafting the attribute tree
• Defining dimensions
• Defining measures (fact attributes)
• Defining the granularity of data (hierarchies)
The steps to derive DF schemata from a relational schema are:
• 1. Finding and defining facts from the relational schema
For each fact:
• 2. Building the attribute tree from the relational schema
• 3. Building the fact schema from the attribute tree
Note that the steps to derive DF schemata from an E/R schema are very similar: the main difference concerns the algorithm used to build the attribute tree.
• Facts correspond to events occurring dynamically
• Within a relational schema, a fact is represented by a table:
• Tables representing frequently updated archives are good candidates to define facts
• Tables representing nearly static archives or structural properties of the domain (such as STORE and CITY) are not candidates to define facts
• Each fact identified in the relational schema becomes the root of an attribute tree, which in turn becomes a fact schema.
Ex: in the sale example, the most important fact is a product sale, represented by the SALES table.
For each fact defined, the attribute tree is built as follows:
• Each node of the attribute tree corresponds to one or more attributes of the relational schema
• The root of the attribute tree corresponds to the primary key of the fact table F
• For each node v, the corresponding attribute functionally determines all the attributes that correspond to the descendants of v (functional dependencies)
Relational schema of the DVD rental DB:
• CARDS (cardNumber, expiry)
• CUSTOMERS (cardNumber:CARDS, name, gender, address, telephone, personalDocument)
• MOVIES (moviesCode, title, category, director, lengh, mainActor)
• COPIES (positionOnShelf, movieCode:MOVIES)
• RENTALS (positionOnShelf:COPIES, cardNumber:CARDS, date, time)
The table RENTALS is the only candidate for expressing facts. The associated attribute tree is:
positionOnShelf(RENTALS)
  date
  time
  positionOnShelf(COPIES)
    movieCode
      title
      category
      director
      lengh
      mainActor
  cardNumber(CARDS)
    expiry
    cardNumber(CUSTOMER)
      name
      gender
      address
      telephone
      personalDocument
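The construction of this tree can be sketched in Python, under stated assumptions. The schema encoding (`key`, `attrs`, `refs`, `key_is_ref_to`) and the functions `build_tree` and `render` are hypothetical names introduced for illustration. The key of RENTALS is taken to be positionOnShelf only to match the root label of the tree, and MOVIES is keyed by movieCode as in the tree. The sketch follows many-to-one links (foreign keys) and one-to-one links (a table whose primary key references the current table).

```python
# DVD rental schema, encoded by hand (hypothetical encoding).
SCHEMA = {
    "CARDS": {"key": "cardNumber", "attrs": ["expiry"], "refs": []},
    "CUSTOMERS": {
        "key": "cardNumber",
        "attrs": ["name", "gender", "address", "telephone", "personalDocument"],
        "refs": ["CARDS"],
        "key_is_ref_to": "CARDS",  # primary key is also a foreign key: one-to-one link
    },
    "MOVIES": {
        "key": "movieCode",
        "attrs": ["title", "category", "director", "lengh", "mainActor"],
        "refs": [],
    },
    "COPIES": {"key": "positionOnShelf", "attrs": [], "refs": ["MOVIES"]},
    "RENTALS": {
        "key": "positionOnShelf",
        "attrs": ["date", "time"],
        "refs": ["COPIES", "CARDS"],
    },
}


def build_tree(table, schema, visited=frozenset()):
    """Return (label, children) for `table`, following functional dependencies."""
    visited = visited | {table}
    spec = schema[table]
    # Non-key attributes are determined by the key: they become leaves.
    children = [(attr, []) for attr in spec["attrs"]]
    # Many-to-one links: expand each referenced table below this node.
    for target in spec["refs"]:
        if target not in visited:
            children.append(build_tree(target, schema, visited))
    # One-to-one links: tables whose primary key references this table.
    for other, other_spec in schema.items():
        if other not in visited and other_spec.get("key_is_ref_to") == table:
            children.append(build_tree(other, schema, visited))
    return (f"{spec['key']}({table})", children)


def render(node, depth=0):
    """Return the tree as a list of indented lines."""
    label, children = node
    lines = ["  " * depth + label]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


tree = build_tree("RENTALS", SCHEMA)
print("\n".join(render(tree)))
```

Because `visited` is passed down each branch rather than shared globally, a table reachable through two different links (for example an airport used both as departure and arrival) would be expanded once per link, which is the behaviour an attribute tree needs.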
Relational schema of the Flight DB:
• FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS)
• FLIGHT_INSTANCES (FlightNumber:FLIGHTS, date)
• AIRPORTS (IATAcode, name, city, country)
• TICKETS (ticketNumber, (flightNumber, date):FLIGHT_INSTANCES, seat, fare, passengersFirstName, passengersSurname, passengersGender)
• CHECK_IN (ticketNumber:TICKETS, CheckInTime, numberOfBags)
The tables that are candidates for expressing facts are:
• FLIGHTS
• FLIGHT_INSTANCES
• TICKETS
• CHECK_IN
Attribute Tree 1 (FLIGHTS):
flightNumber(FLIGHTS)
  airline
  carrier
  departureTime
  fromAirport
    name
    city
    country
  toAirport
    name
    city
    country
Attribute Tree 2 (FLIGHT_INSTANCES):
flightNumber(FLIGHT_INSTANCES)
  date
  flightNumber(FLIGHTS)
    airline
    carrier
    departureTime
    fromAirport
      name
      city
      country
    toAirport
      name
      city
      country
Attribute Tree 3 (TICKETS):
ticketNumber(TICKETS)
  fare
  passengerGender
  passengerLastName
  passengerFirstName
  ticketNumber(CHECK_IN)
    checkInTime
    numberOfBags
  flightNumber(FLIGHT_INSTANCES)
    date
    flightNumber(FLIGHTS)
      airline
      carrier
      departureTime
      fromAirport
        name
        city
        country
      toAirport
        name
        city
        country
Attribute Tree 4 (CHECK_IN):
[Figure not reproduced: the tree contains the same nodes as Attribute Tree 3, with its root in ticketNumber(CHECK_IN).]
Facts TICKETS and CHECK_IN are the best choices because the existing functional dependencies allow trees 3 and 4 to include the maximum number of attributes.
For each fact:
• 3.1. Pruning and grafting the attribute tree:
  • We can retain or graft any nodes corresponding to composite keys
  • We can modify, add, or delete a functional dependency
  • We can add one or more functional dependencies if a non-normalized table exists in the relational schema
• 3.2. Defining the fact schema dimensions (fact dimensions)
• 3.3. Defining the fact schema measures (fact attributes)
• 3.4. Defining the fact schema granularity of data (dimension hierarchies)
The steps to derive DF schemata from an E/R schema are very similar: the main difference concerns the algorithm used to build the attribute tree.
3.1: Pruning and grafting the attribute tree:
• movieCode and title are inverted
• cardNumber(CARDS) and name (renamed customer) are inverted
• positionOnShelf(COPIES) and cardNumber(CARDS) are grafted
• time, expiry, telephone, address, personalDocument, movieCode and cardNumber(CUSTOMERS) are pruned
[Figure not reproduced: the attribute tree of RENTALS before and after editing. Nodes shown after editing: positionOnShelf(RENTALS) as root, date, customer, cardNumber(CUSTOMER), gender, title, category, lengh, director, mainActor.]
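The two editing operations can be sketched in Python, assuming their usual meaning in the DFM methodology: pruning drops a node together with its whole subtree, while grafting drops a node but keeps its children by attaching them to its parent. The nested-dictionary encoding and the function names `prune` and `graft` are assumptions for illustration; the inversion of two nodes is not covered, and the tree below is a reduced version of the RENTALS tree.

```python
def prune(tree, label):
    """Return a copy of `tree` without node `label` and its whole subtree."""
    return {k: prune(v, label) for k, v in tree.items() if k != label}


def graft(tree, label):
    """Return a copy of `tree` where node `label` is removed and its
    children are attached to the parent of `label`."""
    result = {}
    for k, v in tree.items():
        subtree = graft(v, label)
        if k == label:
            result.update(subtree)  # children move up one level
        else:
            result[k] = subtree
    return result


# Reduced RENTALS attribute tree: {node label: {child label: ...}}.
tree = {
    "positionOnShelf(RENTALS)": {
        "date": {},
        "time": {},
        "positionOnShelf(COPIES)": {"movieCode": {"title": {}, "category": {}}},
        "cardNumber(CARDS)": {
            "expiry": {},
            "cardNumber(CUSTOMERS)": {"name": {}, "gender": {}},
        },
    }
}

edited = graft(tree, "positionOnShelf(COPIES)")
edited = graft(edited, "cardNumber(CARDS)")
edited = prune(edited, "time")
edited = prune(edited, "expiry")
```

Both functions return new dictionaries and leave the input untouched, so the tree before editing stays available for comparison.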
Fact schema “RENTAL”:
[Figure not reproduced: the edited attribute tree and the resulting fact schema. Fact: RENTAL. Measure: number. Dimensions: customer (with attribute gender), title (with attributes category, lengh, director, mainActor), date.]
SQL measure glossary for fact schema “RENTAL”:
number = SELECT COUNT(*)
         FROM RENTALS R
         INNER JOIN COPIES C ON R.positionOnShelf = C.positionOnShelf
         INNER JOIN MOVIES F ON C.movieCode = F.moviesCode
         INNER JOIN CUSTOMERS U ON R.cardNumber = U.cardNumber
         GROUP BY F.title, R.date, U.name;
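The measure glossary can be tried out on a toy database. The following Python sketch uses the standard `sqlite3` module with an in-memory database; the sample rows, the reduced column lists and the alias U for CUSTOMERS are assumptions made for illustration, and the grouping columns are added to the SELECT list so that each count can be read next to its dimension values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced versions of the DVD rental tables, with hypothetical sample rows.
conn.executescript("""
CREATE TABLE MOVIES (moviesCode TEXT PRIMARY KEY, title TEXT);
CREATE TABLE COPIES (positionOnShelf TEXT PRIMARY KEY, movieCode TEXT);
CREATE TABLE CUSTOMERS (cardNumber TEXT PRIMARY KEY, name TEXT);
CREATE TABLE RENTALS (positionOnShelf TEXT, cardNumber TEXT, date TEXT, time TEXT);

INSERT INTO MOVIES VALUES ('m1', 'Alien');
INSERT INTO COPIES VALUES ('s1', 'm1');
INSERT INTO COPIES VALUES ('s2', 'm1');
INSERT INTO CUSTOMERS VALUES ('c1', 'Ann');
INSERT INTO RENTALS VALUES ('s1', 'c1', '2013-09-01', '10:00');
INSERT INTO RENTALS VALUES ('s2', 'c1', '2013-09-01', '18:00');
INSERT INTO RENTALS VALUES ('s1', 'c1', '2013-09-02', '09:00');
""")

# One row per (title, date, customer) cell of the RENTAL fact, with its measure.
rows = conn.execute("""
    SELECT F.title, R.date, U.name, COUNT(*) AS number
    FROM RENTALS R
    INNER JOIN COPIES C ON R.positionOnShelf = C.positionOnShelf
    INNER JOIN MOVIES F ON C.movieCode = F.moviesCode
    INNER JOIN CUSTOMERS U ON R.cardNumber = U.cardNumber
    GROUP BY F.title, R.date, U.name
    ORDER BY R.date
""").fetchall()

for row in rows:
    print(row)
```

Two copies of the same movie rented on the same day by the same customer end up in a single cell with number = 2, which shows how the grouping columns fix the granularity of the fact.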
Pruning and grafting the attribute tree:
• country is now the child of city
• checkIn is now a Boolean added to the tree when the ticketNumber(CHECK_IN) node was grafted: its value is TRUE only for tickets whose passengers have checked in.
[Figure not reproduced: edited attribute tree rooted in ticketNumber(TICKETS). Nodes shown: fare, seat, passengerGender, numberOfBags, check-in, date, flightNumber, airline, carrier, departureTime, fromAirport and toAirport, each with city and then country.]
Fact schema “TICKET ISSUE”:
[Figure not reproduced. Fact: TICKET ISSUE. Measures: numberOfFlights, numberOfBags, receipts. Other nodes shown: flightNumber, airline, carrier, departureTime, arrivalTime, from and to Airport, city, country, date, passengerGender, check-in.]
SQL measure glossaries for fact schema “TICKET ISSUE”:
numberOfFlights = SELECT COUNT(*)
                  FROM TICKETS T
                  INNER JOIN FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date
                  GROUP BY T.passengerGender, I.date, T.flightNumber;
numberOfBags = SELECT SUM(C.numberOfBags)
               FROM TICKETS T
               INNER JOIN FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date
               INNER JOIN CHECK_IN C ON T.ticketNumber = C.ticketNumber
               GROUP BY T.ticketNumber, I.date, T.flightNumber;
receipts = SELECT SUM(T.fare)
           FROM TICKETS T
           INNER JOIN FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date
           GROUP BY T.passengerGender, I.date, T.flightNumber;
The following relational logical schema describes an operational database for car rentals:
• RENTAL_OFFICES (OfficeName, City, Area, State, Country)
• CARS (LicensePlate, Category, Model, Brand, Fuel, RegistrationDate)
• HAVE_OPTIONAL (LicensePlate:CARS, Optional)
• RENTALS (LicensePlate:CARS, PickupDate, DropoffDate, PickupPlace:RENTAL_OFFICES, DropoffPlace:RENTAL_OFFICES, Miles)
• DRIVERS (LicenseNumber, LicenseExpiration, DriverName, Birthdate)
• DRIVE (LicenseNumber:DRIVERS, (LicensePlate, PickupDate):RENTALS)
• INSURANCES (Risk, (LicensePlate, PickupDate):RENTALS, Cost)
• PAYMENTS ((LicensePlate, PickupDate):RENTALS, Amount, Discount, PaymentMode)
Some hidden functional dependencies hold: City->State->Country->Area, and Model->Brand. Inspect and normalize the source schema, then choose a fact of interest and design its fact schema.
Choosing either RENTALS or PAYMENTS as the fact is equivalent here, because these two tables are related by a one-to-one link.
[Figure not reproduced. Attribute tree rooted in RENTALS (LicensePlate+PickupDate); nodes shown: Amount, Discount, PaymentMode, Date (pickup, dropoff), Miles, LicensePlate, Category, Model, Brand, Fuel, registration, OfficeName (pickup, dropoff), City, State, Country, Area. Edited tree rooted in RENTALS; nodes shown: Amount, Discount, PaymentMode, Date, Month, Year, Miles, Car, Category, Model, Brand, Fuel, registration, Office (pickup, dropoff), City, State, Country, Area, Duration.]
In the edited attribute tree, the drop-off date is pruned and replaced by a Duration attribute computed as the number of days between the pick-up and the drop-off dates.
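As a small illustration of this derived attribute, here is a Python sketch that replaces DropoffDate by Duration in one rental record. The dictionary layout, the sample values and the function name `replace_dropoff_with_duration` are assumptions for illustration.

```python
from datetime import date


def replace_dropoff_with_duration(rental: dict) -> dict:
    """Return a copy of `rental` where DropoffDate is replaced by Duration (days)."""
    edited = {k: v for k, v in rental.items() if k != "DropoffDate"}
    edited["Duration"] = (rental["DropoffDate"] - rental["PickupDate"]).days
    return edited


# Hypothetical rental record.
rental = {
    "LicensePlate": "AB123CD",
    "PickupDate": date(2013, 9, 2),
    "DropoffDate": date(2013, 9, 5),
    "Miles": 240,
}
edited = replace_dropoff_with_duration(rental)
```

Unlike a drop-off date, Duration is a number that can be summed or averaged, so it can be chosen as a measure.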
Fact schema “RENTAL”:
[Figure not reproduced. Fact: RENTAL, with Amount, Discount, Duration and Miles. Other nodes shown: PaymentMode, Date, Month, Year, Car, Category, Model, Brand, Fuel, registration, Office (pickup, dropoff), City, State, Country, Area.]
The attribute tree with root in LIFTS is depicted below.
[Figure not reproduced. Nodes shown: LIFTS, Date, Time, LiftSystemCode, Type, Length, Duration, TicketCode, purchase, Cost, OfficeCode, Name, Address, Points, OfficeCode+PurchaseDate+ProgressiveNumber, ProgressiveNumber, PurchaseTime, HolderName, Discount, TypeCode, Description, Days.]
Note that, in the edited tree, the functional dependency from LiftSystemCode to Points has been removed to make Points a child of the root, so that it can be chosen as a measure. The same is done for Duration. The ticket and the skipass granularities are removed, and a SkipassOrTicket flag is added to