
Data Warehouse/Data Mart Conceptual Modeling and Design

(4)

Bernard ESPINASSE, Professor at Aix-Marseille Université (AMU)

Ecole Polytechnique Universitaire de Marseille

September 2013

• Methodological Framework
• Conceptual Modelling: the Dimensional Fact Model (DFM)
• Conceptual Design: from Relational schema to DFM


1. Methodological Framework
• Conceptual Design & Logical Design
• Top-Down Versus Bottom-Up Approach
• Design Phases and schemata derivations

2. Conceptual Modelling: The Dimensional Fact Model (DFM)
• Fact schema
• Dimension hierarchies
• Additive, semi-additive and non-additive attributes
• Overlapping compatible fact schemata
• Representing query patterns on a fact schema

3. Conceptual Design: From Relational schema to DFM of Data Mart
• Building the attribute tree, pruning and grafting the attribute tree
• Defining dimensions, measures and granularity of data


• Books
  - Golfarelli M., Rizzi S., “Data Warehouse Design: Modern Principles and Methodologies”, McGraw-Hill, 2009.
  - Kimball R., Ross M., “Entrepôts de données : guide pratique de modélisation dimensionnelle”, 2nd edition, Ed. Vuibert, 2003.
  - Rizzi S., “Conceptual modeling solutions for the data warehouse”. In Data Warehousing and Mining: Concepts, Methodologies, Tools, and Applications, J. Wang (Ed.), Information Science Reference, pp. 208-227, 2008.
  - Golfarelli M., Maio D., Rizzi S., “Conceptual Design of Data Warehouses from E/R Schemes”. Proceedings 31st Hawaii International Conference on System Sciences (HICSS-31), vol. VII, Kona, Hawaii, pp. 334-343, 1998.

• Courses
  - Course of M. Golfarelli and S. Rizzi, University of Bologna
  - Courses of M. Böhlen and J. Gamper, Free University of Bolzano


• Conceptual Design & Logical Design
• Life-Cycle
• Top-Down, Bottom-Up and Mixed Strategies
• Design Phases
• Schemata derivations for DMs design


• Entity-Relationship (E/R) models are not very useful in modeling DWs
• It is now universally recognized that a DW is based on a multidimensional view of data:
  - But there is still no agreement on HOW to implement its conceptual design
• Most of the time, DW design is carried out at the logical level: a multidimensional model (star/snowflake schema) is directly designed:
  - But a star schema is nothing but a relational schema: it contains only the definition of a set of relations and integrity constraints

• A better approach:
  - 1) first design a conceptual model: Conceptual Design
  - 2) which is then translated into a logical model: Logical Design


• Building a DW is a very complex task, which requires an accurate planning aimed at devising satisfactory answers to organizational and architectural questions

• A large number of organizations lack experience and skills that are required to meet the challenges involved in DW projects

• A major cause of DW failures lies in the absence of a global view of the design process, that is, of a design methodology

• Design methodologies are necessary to minimize the risk of failure

• Three main strategies for DW design:
  - Top-Down strategy
  - Bottom-Up strategy
  - Mixed strategy


Top-Down Approach:
1. Design of DW
2. Design of DMs

Bottom-Up Approach:
1. Design of DMs
2. Integration of DMs in DW
3. Maybe no physical DW

Mixed Approach:
1. Design of DW for DM1
2. Design of DM2 and integration with DW
3. Design of DM3 and integration with DW
4. ...

[Figure: the Top-Down, Bottom-Up and Mixed approaches shown across three layers: existing databases and systems (OLTP; DB1 to DB4 with their applications), the global Data Warehouse (DW, fed through "Trans."), and the Data Marts (DM1 to DM3 with their applications)]


Analyze global business needs, plan how to develop a DW, design it, and implement it as a whole with its DMs

(+) Strengths:
  - Promising: it is based on a global picture of the goal to achieve, and in principle it ensures a consistent, well-integrated DW

(-) Weaknesses:
  - High cost estimates and long implementation times discourage company managers from embarking on this kind of project.
  - Analyzing and integrating all relevant sources at the same time is a very difficult task: they are unlikely to all be available and stable at the same time.
  - It is extremely difficult to forecast the specific needs of every department involved in a project, which leads to specific DMs
  - As no working DW system is going to be delivered in the short term, users cannot check that the project is useful, so they lose trust and interest in it.


• The DW is built incrementally and several DMs are created iteratively
• Each DM is based on a set of facts that are linked to a specific department and that can be interesting for a user group

(+) Strengths:
  - Leads to concrete results in a short time
  - Does not require huge investments
  - Enables designers to investigate one area at a time
  - Gives managers quick feedback about the actual benefits of the system being built
  - Keeps the interest in the project constantly high

(-) Weakness:
  - May lead to a partial vision of the business domain

=> Mixed strategy …


Top-Down and Bottom-Up strategies should be mixed:
• When planning a DW, a bottom-up strategy should be followed
• One Data Mart (DM) at a time is identified and prototyped according to a top-down strategy by building a conceptual schema for each fact of interest

• The first DM (DM1) to prototype:
  - is the one playing the most strategic role for the enterprise
  - should be a backbone for the whole DW
  - should lean on available and consistent data sources


Phase 1: Goal setting and planning of the DW
• set system goals, borders, and size
• select an approach for design and implementation
• estimate costs and benefits
• analyze risks and expectations
• examine the skills of the working team

Phase 2: Infrastructure design of the DW
• analyze and compare the possible architectural solutions
• assess the available technologies and tools
• create a preliminary plan of the whole DW system

Phase 3: Design and development of DMs
• Every iteration causes a new DM and new applications to be created and progressively added to the DW system

[Figure: Phase 1: Goal setting and planning, then Phase 2: Infrastructure design, then Phase 3: Design and development of Data Marts]


Each Data Mart (DM) will be designed according to these steps:

[Figure: Data Mart design phases (J. Gamper, Free University of Bolzano, DWDM 2012/13): source analysis and integration, requirement analysis, conceptual design, workload and data volume, logical design, ETL design, physical design; actors involved: business user, designer, db administrator]

[Figure: Methodological framework: analysis of the operational db, requirement specification, conceptual design, workload refinement, logical design, physical design; actors involved: final user, designer, db administrator. DWs are based on a pre-existing information system]

[Figure: Methodological framework (2), schemata derivations: E/R scheme and relational scheme → CONCEPTUAL DESIGN (facts, preliminary workload) → conceptual scheme → LOGICAL DESIGN (workload, target logical model) → logical scheme → PHYSICAL DESIGN (workload, target DBMS) → physical scheme]


• Fact schema
• Dimension hierarchies
• Fact schema and fact instances
• Additive attributes
• Semi-additive and non-additive attributes
• Overlapping compatible fact schemata
• Representing query patterns on a fact schema


• Conceptual Design is based on the documentation of the underlying operational information system (IS):
  - Relational schemata, or
  - E/R schemata

• Steps:
  1. Find facts
  2. For each fact:
     a) Navigate functional dependencies
     b) Drop useless attributes
     c) Define dimensions and measures


The Dimensional Fact Model (DFM) was proposed by M. Golfarelli and S. Rizzi to support the Conceptual Design of DWs

The DFM is a graphical conceptual model for Data Mart design

The aim of the DFM is to:

1. Provide efficient support to Conceptual Design
2. Create an environment in which user queries may be formulated intuitively
3. Make communication possible between designers and end users with the goal of formalizing requirement specifications
4. Build a stable platform for logical design (independently of the target logical model)
5. Provide clear and expressive design documentation

The conceptual representation generated by the DFM consists of a set of fact schemata that basically model facts, measures, dimensions, and hierarchies.


Ex: a simple 3-dimensional fact schema « SALE » for a chain of stores:

• A fact schema is structured as a tree whose root is a fact
• A Conceptual Model of a DW consists of a set of fact schemata

[Figure: fact schema SALE, with the fact SALE, the dimensions date, store and product, and the measures quantity, receipts, unitPrice and numberOfCustomer]


A fact is a concept relevant to decision-making processes:
• It models a set of events (ex: in a company: sales, shipments, purchases, ...)
• It has dynamic properties or evolves in some way over time
• It has one or more numeric and continuously valued attributes which "measure" the fact from different points of view

• a measure is a numerical property of a fact and describes a quantitative aspect of the fact that is relevant to analysis: Ex: every sale is quantified by its quantity, receipts, unitPrice, numberOfCustomer

• a dimension is a fact property with a finite domain and describes an analysis coordinate of the fact: Ex: typical dimensions for the sales fact are product, store, and date

[Figure: the SALE fact schema, labelling the fact, its dimensions (date, store, product) and its measures]
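The vocabulary above (fact, dimensions, measures) can be written down as a small data structure. This is a minimal sketch and not part of the DFM itself: the class name `FactSchema` and the helper `describe` are hypothetical, and only the SALE names come from the example.

```python
from dataclasses import dataclass


@dataclass
class FactSchema:
    """Illustrative container for a DFM fact schema (hypothetical class)."""
    fact: str          # name of the fact, root of the schema
    dimensions: list   # analysis coordinates of the fact
    measures: list     # numerical properties of the fact


# The SALE fact schema of the chain-of-stores example
sale = FactSchema(
    fact="SALE",
    dimensions=["date", "store", "product"],
    measures=["quantity", "receipts", "unitPrice", "numberOfCustomer"],
)


def describe(schema):
    """One-line textual summary: FACT(dimensions) -> measures."""
    return f"{schema.fact}({', '.join(schema.dimensions)}) -> {', '.join(schema.measures)}"


print(describe(sale))
```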


• A hierarchy determines how fact instances may be aggregated and selected in a way that is meaningful for the decision-making process, and determines the granularity adopted for representing facts.

• Hierarchies are subtrees rooted in dimensions:

[Figure: the SALE fact schema with hierarchies rooted in the dimensions date, store and product; hierarchy attributes shown: day, week, month, quarter, year, holiday, storeCity, state, country, salesDistrict, salesManager, type, category, department, marketingGroup, brand, brandCity; size is marked as a non-dimension attribute]


In dimension hierarchies :

• nodes represented by circles are dimension attributes which may assume a discrete set of values.

Ex : week, month, product, …

• arcs represent relationships between pairs of attributes: these relationships are functional dependencies:

Ex: product -> type; type -> category; category -> department …

• dimension attributes in the nodes along each sub-path of the hierarchy starting from the dimension define progressively coarser granularities.
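The arcs of a hierarchy can be read as a parent map, and rolling up a measure means following one arc. A minimal sketch, assuming an additive measure: the dependencies product -> type -> category -> department come from the example above, while the product names and quantities are made up.

```python
# Functional dependencies of the product hierarchy (from the example above)
PARENT = {"product": "type", "type": "category", "category": "department"}


def path_from(dimension):
    """Attributes met when walking from a dimension towards coarser granularities."""
    path = [dimension]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


# Instance-level mapping product -> type (hypothetical sample data)
PRODUCT_TYPE = {"cola": "soft drink", "lemonade": "soft drink", "bread": "bakery"}


def roll_up(values, mapping):
    """Aggregate an additive measure one level up the hierarchy with SUM."""
    totals = {}
    for key, value in values.items():
        parent = mapping[key]
        totals[parent] = totals.get(parent, 0) + value
    return totals


quantity_by_product = {"cola": 10, "lemonade": 5, "bread": 7}
quantity_by_type = roll_up(quantity_by_product, PRODUCT_TYPE)
```

Each step along `path_from("product")` gives a coarser granularity, so fewer and larger groups.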


Non-dimension attributes contain additional information about an attribute of the hierarchy; they cannot be used for aggregation.
Ex: size: aggregating sales according to the size of the product would not make sense.

[Figure: the SALE fact schema with the non-dimension attributes size, address and telephone]


Optional arcs (marked by a dash) express optional relationships between pairs of attributes (useful for logical design).
Ex: diet, promotion. The diet attribute takes a value (such as cholesterol-free, gluten-free, or sugar-free) only for food products; for the other products, it is undefined.

[Figure: the SALE fact schema with optional arcs towards diet and promotion; promotion carries the attributes discount, advertising, startDate, endDate and cost]


A cross-dimensional attribute is a dimensional or descriptive attribute whose value is defined by the combination of 2 or more dimensional attributes, possibly belonging to different hierarchies.
Ex: if a product's Value Added Tax (VAT) depends both on the product category and on the country where the product is sold, you can use a cross-dimensional attribute to represent it:

[Figure: the SALE fact schema with the cross-dimensional attribute VAT, defined by category and country]


A convergence takes place when 2 dimensional attributes within a hierarchy are connected by 2 or more alternative paths of many-to-one associations (graphically, the arcs are drawn as arrows).
Ex: in the store dimension, stores are grouped into sales districts and no inclusion relationship exists between districts and states, but each district is part of only one country:

store -> salesDistrict -> country
or
store -> storeCity -> state -> country

[Figure: the SALE fact schema with the convergence of the two paths on country]


Shared hierarchies exist when entire portions of hierarchies are replicated 2 or more times in a fact schema. This is frequent with time hierarchies: 2 or more date-type dimensions with different meanings can easily exist in the same fact, and a month-year hierarchy must then be built on each of them.

=> an abbreviation is introduced
Ex: calling and called phone numbers …

[Figure: fact schema CALL (measures number and duration; dimensions date with month and year, hour, callingNumber, calledNumber), drawn first with the type and district attributes repeated for each number (callingNumberType, callingNumberDistrict, calledNumberType, calledNumberDistrict), then with a shared hierarchy telNumber (type, district) reached through the roles calling and called]


A multiple arc models a many-to-many association between the 2 dimensional attributes it connects (graphically, it is denoted by doubling the arc).
Ex: consider a fact schema modeling the sales of books, whose dimensions are date and book. It would certainly be interesting to aggregate and select sales on the basis of book authors. However, it would not be accurate to model author as a dimensional child attribute of book, because many different authors can write many books. The relationship between books and authors is therefore modeled as a multiple arc:

[Figure: fact schema SALE with dimensions date (day, week, month, quarter, year, holiday) and book (genre), and a multiple arc from book to author]


3 types of measure:
  - Flow measures: refer to a timeframe (ex: number of products sold in a day)
  - Level measures: evaluated at a particular time (ex: number of products in inventory)
  - Unit measures: evaluated at a particular time but expressed in relative terms (ex: product unit price, discount percentage)

Suitable operators for aggregation:

                 Temporal hierarchies    Nontemporal hierarchies
Flow measures    SUM, AVG, MIN, MAX      SUM, AVG, MIN, MAX
Level measures   AVG, MIN, MAX           SUM, AVG, MIN, MAX
Unit measures    AVG, MIN, MAX           AVG, MIN, MAX

3 natures of measure:
  - additive along a dimension when the SUM aggregation operator can be used
  - non-additive along a dimension when SUM cannot be used but other operators can (ex: inventory level)
  - non-aggregable when no aggregation operator can be used (ex: unitPrice along product)
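The operator table can be encoded as a lookup, which makes the additivity rule checkable. A minimal sketch: the dictionary layout and the function name `is_additive` are illustrative choices, and the inventory levels are made-up numbers.

```python
ALL = {"SUM", "AVG", "MIN", "MAX"}
NO_SUM = {"AVG", "MIN", "MAX"}

# (measure type, is the hierarchy temporal?) -> operators allowed by the table above
OPERATORS = {
    ("flow", True): ALL,      ("flow", False): ALL,
    ("level", True): NO_SUM,  ("level", False): ALL,
    ("unit", True): NO_SUM,   ("unit", False): NO_SUM,
}


def is_additive(measure_type, temporal):
    """A measure is additive along a hierarchy when SUM is an allowed operator."""
    return "SUM" in OPERATORS[(measure_type, temporal)]


# Hypothetical inventory levels of one product in one warehouse on three days.
# Summing them over time would be meaningless; AVG is an allowed alternative.
levels = [100, 80, 120]
average_level = sum(levels) / len(levels)
```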


• In the DFM, measures are additive along all dimensions by default (operator SUM)

• A non-additive measure can be explicitly specified together with the operator(s) used for its aggregation, other than SUM (Ex: AVG and MIN for inventory level)

[Figure: fact schema INVENTORY with measures level and incomingQuantity; the non-additive measure level is annotated with AVG, MIN; attributes shown: date, week, month, quarter, year, warehouse, city, country, address, product, brand, type, category, department, ItemPerPallet, packaging, weight]


• Different facts are represented in different fact schemata

• Queries the user formulates on the DW may require comparing fact attributes taken from distinct, though related, schemata (drill across in OLAP)

• 2 fact schemata are said to be compatible if they share at least one dimension attribute

• 2 compatible schemata F and G may be overlapped to create a resulting schema H

• Without conflict between attribute dependencies in the 2 schemata:

• the set of the fact attributes in H is the union of the sets in F and G

• the dimensions in H are the intersection of those in F and G, assuming that a given dimension is common to F and G if at least one dimension attribute is shared

• each hierarchy in H includes all and only the dimension attributes included in the corresponding hierarchies of both F and G.
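The three overlap rules can be sketched on a simplified representation where each dimension is just the set of attributes of its hierarchy. This is an illustration under assumptions, not the authors' algorithm: the dictionary layout, the function names and the attribute sets (loosely modelled on an employees example) are all hypothetical, and conflicts between dependencies are not checked.

```python
def overlap(f, g):
    """Overlap two fact schemata given as {"measures": set, "dimensions": {name: set}}."""
    dimensions = {}
    for f_name, f_attrs in f["dimensions"].items():
        for g_attrs in g["dimensions"].values():
            shared = f_attrs & g_attrs
            if shared:  # common dimension: at least one shared attribute
                dimensions[f_name] = shared  # keep only attributes present in both
    return {"measures": f["measures"] | g["measures"], "dimensions": dimensions}


def compatible(f, g):
    """Two schemata are compatible if they share at least one dimension attribute."""
    return bool(overlap(f, g)["dimensions"])


F = {
    "measures": {"numberOfEmp", "maxSalary"},
    "dimensions": {"time": {"month", "year"}, "job": {"job"},
                   "store": {"store", "city", "state"}},
}
G = {
    "measures": {"numberOfNonEurEmp"},
    "dimensions": {"time": {"quarter", "year"}, "job": {"job"},
                   "store": {"city", "state"}, "origin": {"nation", "continent"}},
}
H = overlap(F, G)
```

In `H` the time hierarchy keeps only year, because month and quarter each appear in just one of the two schemata.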


Consider the 2 fact schemata:
• F represents all employees of an enterprise
• G represents only the non-European employees.

F and G are compatible: they share the time, job and store dimensions, and the attribute dependencies in the two schemata do not conflict.

[Figure 3. Scheme overlapping: (a) EMPLOYEES, (b) NON-EUROPEAN EMPLOYEES, (c) ALL EMPLOYEES]

Consider the two fact schemes in Figures 3.a and 3.b: the first represents all employees of an enterprise, the second only the non-European employees. Although these schemes are aimed at extracting different information, they are compatible; in fact they share the time, job and store dimensions. The scheme resulting from overlapping is shown in Figure 3.c; it can be used, for instance, to calculate the percentage of non-European employees for each city, job and year.

In some cases, aggregation along a dimension can be carried out at different abstraction levels even if the corresponding dimension attributes were not explicitly shown. For instance, given a month attribute within a time hierarchy, fact instances can be aggregated by quarter, semester and year by performing a simple calculation. Thus, given the two compatible fact schemes in Figure 3, attribute quarter could in principle be added to the time dimension in the resulting scheme. On the other hand, the designer must keep in mind that, by adopting this solution, the time for extracting data by quarter will increase significantly; thus, the best solution would probably be to add explicitly the quarter attribute to the time hierarchy in the employee fact scheme.

3.3. Representing query patterns on a fact scheme

The basic OLAP operators for formulating typical queries on DWs are roll-up, drill down, drill across and slice-and-dice; they are used, respectively, to aggregate fact attributes in order to view data at a higher level of abstraction, disaggregate fact attributes in order to introduce further detail, relate and compare distinct facts, select and project facts so as to reduce their dimensionality [2].

On a fact scheme, a query may be represented by a query pattern, which consists of a set of markers placed on the dimension attributes. One or more markers can be placed within each hierarchy, to indicate at what level(s) fact instances must be aggregated. A dimension may also contain no markers, to indicate that none of its attributes is involved in the query. Non-dimension attributes need not be shown on the query pattern.

The data shown as a result of a query may be any combination of fact attributes, and/or the result of any computation made on them. Figure 4 shows the query pattern representing the following query: "total quantity sold and average returns per unit sold for each week and for each type of product". The average returns per unit sold is the ratio between the total returns and the quantity sold.

[Figure 4. Query pattern: the SALE fact scheme (product, type, category, manufacturer; week, month; store, city, state, sales manager), showing qty sold and returns/qty sold]







• The schema resulting from overlapping F and G is H (ALL EMPLOYEES)

• H can be used, for instance, to calculate the percentage of non-European employees for each city, job and year.








• In some cases, aggregation along a dimension can be carried out at different abstraction levels even if the corresponding dimension attributes were not explicitly shown.

• Ex: given a month attribute within a time hierarchy, fact instances can be aggregated by quarter, semester and year by performing a simple calculation.

• Thus, given the F and G fact schemata, attribute quarter could in principle be added to the time dimension in the resulting schema H

• On the other hand, the designer must keep in mind that, by adopting this solution, the time for extracting data by quarter will increase significantly

• thus, the best solution would probably be to add explicitly the quarter attribute to the time hierarchy in the employee fact schema.
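The "simple calculation" can be sketched as follows: the quarter is derived from the month at query time instead of being stored as an explicit attribute. A minimal sketch, assuming months encoded as (year, month number) pairs and an additive measure; the figures are made up.

```python
def quarter_of(month):
    """Derive the quarter (year, 1..4) from a month given as (year, 1..12)."""
    year, number = month
    return (year, (number - 1) // 3 + 1)


def by_quarter(monthly_values):
    """Aggregate an additive measure by quarter, computing the quarter on the fly."""
    totals = {}
    for month, value in monthly_values.items():
        quarter = quarter_of(month)
        totals[quarter] = totals.get(quarter, 0) + value
    return totals


monthly = {(2013, 1): 10, (2013, 2): 20, (2013, 4): 5}
quarterly = by_quarter(monthly)
```

The derivation is recomputed for every fact instance at each query, which is the extra extraction cost the slide warns about.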


Fact schema INVENTORY, fact schema SHIPMENT, and fact schema overlapping INVENTORY and SHIPMENT:

3.3. Additivity

Aggregation requires defining a proper operator to compose the measure values characterizing primary fact instances into measure values characterizing each secondary fact instance.

Definition 4. Given a fact scheme $f$, measure $m_j \in M$ is said to be aggregable on dimension $d_k \in Dim(f)$ if $\exists (m_j, d_k, \Omega) \in S$, non-aggregable otherwise. Measure $m_j$ is said to be additive on $d_k$ if $\exists (m_j, d_k, \text{'SUM'}) \in S$, non-additive otherwise.

As a guideline, most measures in a fact scheme should be additive. An example of additive measure in the sale scheme is qty sold: the quantity sold for a given sales manager is the sum of the quantities sold for all the stores managed by that sales manager.

A measure may be non-additive on one or more dimensions. Examples of this are all the measures expressing a level, such as an inventory level, a temperature, etc. An inventory level is non-additive on time, but it is additive on the other dimensions. A temperature measure is non-additive on all the dimensions, since adding up two temperatures hardly makes sense. However, this kind of non-additive measures can still be aggregated by using operators such as average, maximum, minimum; Figure 5 shows an example where both operators AVG and MIN can be used for aggregation; measure qty expresses, for each product, the number of copies present within each warehouse during each week.

[Fig. 5. The INVENTORY fact scheme: measure qty annotated with AVG, MIN; attributes product, type, category, brand, weight, package size, package type, units per pallet, week, month, year, season, warehouse, city, state, address]

For other measures, aggregation is inherently impossible for conceptual reasons. Consider the measure number of customers in the sale example, estimated for a given product, day and store by counting the number of purchase tickets for that product printed on that day in that store. Since the same ticket may include other products, adding or averaging the number of customers for two or more products would lead to an inconsistent result. Thus, number of customers is non-aggregable on the product dimension (while it is additive on the time and the stores dimensions). In this case, the reason for non-aggregability is that the relationship between purchase tickets and products is many-to-many instead of many-to-one: measure number of customers cannot be consistently …

[Fig. 8. The SHIPMENT scheme (a) and its overlap with INVENTORY (b)]

• The measures in f are the union of those in f' and f". Thus, the fact on which f is centred may be considered as a sort of "macro-fact" embracing both f' and f".

• Each hierarchy in f includes all and only the attributes included in the corresponding hierarchies of both f' and f". The functional dependencies expressed by the inter-attribute links in f' and f" are preserved.

• The domain of each dimension attribute in f is the intersection of the domains of the corresponding attributes in f' and f".



• Building the attribute tree
• Pruning and grafting the attribute tree
• Defining dimensions
• Defining measures (fact attributes)
• Defining the granularity of data (hierarchies).


The steps to derive DF schemata from a Relational schema are:

• 1. Finding and defining facts from the Relational schema

For each fact:

• 2. Building the Attribute Tree from the Relational schema

• 3. Building the Fact Schema from the Attribute Tree

Note that the steps to derive DF schemata from an E/R schema are very similar: the main difference concerns the algorithm used to build the attribute tree


• Facts correspond to events occurring dynamically

• Within a Relational schema, a fact is represented by a table:

• Tables representing frequently updated archives are good candidates to define facts

• Tables representing nearly-static archives or representing structural properties of the domain (such as STORE and CITY) are not candidates to define facts

• Each fact identified in the Relational schema becomes the root of an attribute tree, which then becomes a fact schema.

Ex: here the most important fact, a product sale, is represented by the SALES table


For each fact defined, the attribute tree is built as follows:

• Each node of the attribute tree corresponds to one or more Relational schema attributes

• The root of the attribute tree corresponds to the primary key of the fact table F

• For each node v, the corresponding attribute functionally determines all the attributes that correspond to the descendants of v (functional dependencies)
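The construction can be sketched as a recursive walk over foreign keys. This is a simplified illustration and not the authors' algorithm: the schema encoding, the function names and the assumed primary key of RENTALS (positionOnShelf, date, time) are hypothetical; tables whose single-attribute key references the current table are treated as one-to-one extensions.

```python
# Hypothetical encoding of a DVD rental schema: key, non-key attributes, foreign keys
SCHEMA = {
    "CARDS": {"key": ["cardNumber"], "attrs": ["expiry"], "fks": {}},
    "CUSTOMERS": {"key": ["cardNumber"],
                  "attrs": ["name", "gender", "address", "telephone", "personalDocument"],
                  "fks": {"cardNumber": "CARDS"}},
    "MOVIES": {"key": ["movieCode"],
               "attrs": ["title", "category", "director", "lengh", "mainActor"],
               "fks": {}},
    "COPIES": {"key": ["positionOnShelf"], "attrs": ["movieCode"],
               "fks": {"movieCode": "MOVIES"}},
    "RENTALS": {"key": ["positionOnShelf", "date", "time"],  # assumed primary key
                "attrs": ["cardNumber"],
                "fks": {"positionOnShelf": "COPIES", "cardNumber": "CARDS"}},
}


def label(table, schema):
    """Node label in the style keyAttribute(TABLE)."""
    return f"{schema[table]['key'][0]}({table})"


def subtree(table, schema, path=()):
    """Children of the node for `table`, as a nested dict {node label: children}."""
    spec = schema[table]
    path = path + (table,)
    children = {}
    for attr in spec["key"] + spec["attrs"]:
        ref = spec["fks"].get(attr)
        if ref is not None:
            if ref not in path:  # do not walk back up the path
                children[label(ref, schema)] = subtree(ref, schema, path)
        elif attr not in spec["key"] or len(spec["key"]) > 1:
            children[attr] = {}  # plain attribute determined by the key
    # One-to-one extensions: tables whose single key attribute references this table
    for other, o in schema.items():
        if other not in path and len(o["key"]) == 1 and o["fks"].get(o["key"][0]) == table:
            children[label(other, schema)] = subtree(other, schema, path)
    return children


tree = subtree("RENTALS", SCHEMA)
```

Every node in `tree` is functionally determined by its ancestors, which is the property the third bullet requires.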


Relational schema of the DVD rental DB:
• CARDS (cardNumber, expiry)
• CUSTOMERS (cardNumber:CARDS, name, gender, address, telephone, personalDocument)
• MOVIES (movieCode, title, category, director, lengh, mainActor)
• COPIES (positionOnShelf, movieCode:MOVIES)
• RENTALS (positionOnShelf:COPIES, cardNumber:CARDS, date, time)

The table RENTALS is the only candidate for expressing facts; the associated attribute tree is:

[Figure: attribute tree rooted in positionOnShelf(RENTALS), with nodes date, time, cardNumber(CARDS), expiry, cardNumber(CUSTOMER), name, gender, address, telephone, personalDocument, positionOnShelf(COPIES), movieCode, title, category, lengh, director, mainActor]


Relational schema of the Flight DB:
• FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS)
• FLIGHT_INSTANCES (flightNumber:FLIGHTS, date)
• AIRPORTS (IATAcode, name, city, country)
• TICKETS (ticketNumber, (flightNumber, date):FLIGHT_INSTANCES, seat, fare, passengersFirstName, passengersSurname, passengersGender)
• CHECK_IN (ticketNumber:TICKETS, CheckInTime, numberOfBags)

The tables that are candidates for expressing facts are:
• FLIGHTS
• FLIGHT_INSTANCES
• TICKETS
• CHECK_IN


[Figure: Attribute Tree 1 (FLIGHTS), rooted in flightNumber(FLIGHTS), and Attribute Tree 2 (FLIGHT_INSTANCES), rooted in flightNumber(FLIGHT_INSTANCES) with date; nodes shown: airline, carrier, departureTime, fromAirport and toAirport, each with name, city and country]

[Figure: Attribute Tree 3 (TICKETS), rooted in ticketNumber(TICKETS); it adds date, fare, passengerGender, the passenger's first and last names, and ticketNumber(CHECK_IN) with checkInTime and numberOfBags]

[Figure: Attribute Tree 4 (CHECK_IN), rooted in ticketNumber(CHECK_IN)]

Facts TICKETS and CHECK_IN are the best choices because the existing functional dependencies allow a maximum number of attributes to be included in trees 3 and 4.


For each fact:
• 3.1. Pruning and grafting the attribute tree:
  - We can retain or graft any nodes corresponding to composite keys
  - We can modify, add, or delete a functional dependency
  - We can add one or more functional dependencies if a non-normalized table exists in the relational schema
• 3.2. Defining the Fact Schema dimensions (fact dimensions)
• 3.3. Defining the Fact Schema measures (fact attributes)
• 3.4. Defining the Fact Schema granularity of data (dimension hierarchies).


3.1: Pruning and grafting the attribute tree:

• movieCode and title are inverted
• cardNumber(CARDS) and name (renamed customer) are inverted
• positionOnShelf(COPIES) and cardNumber(CARDS) are grafted
• time, expiry, telephone, address, personalDocument, movieCode and cardNumber(CUSTOMERS) are pruned

[Figure: the attribute tree of RENTALS before and after pruning and grafting; the resulting tree is rooted in positionOnShelf(RENTALS) and keeps date, customer, gender, title, category, lengh, director and mainActor]
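The two operations differ in what happens to the children of the removed node: pruning drops the whole subtree, grafting reattaches the children to the parent. A minimal sketch on a nested-dict tree; the function names and the small tree fragment are illustrative only.

```python
def prune(tree, name):
    """Remove node `name` together with its whole subtree."""
    return {node: prune(children, name)
            for node, children in tree.items() if node != name}


def graft(tree, name):
    """Remove node `name` but keep its children, attached to its parent."""
    result = {}
    for node, children in tree.items():
        kept = graft(children, name)
        if node == name:
            result.update(kept)  # children move up one level
        else:
            result[node] = kept
    return result


# Fragment of a rental attribute tree (children of the root), for illustration
fragment = {
    "positionOnShelf(COPIES)": {"movieCode": {"title": {}, "category": {}}},
    "date": {},
    "time": {},
}

grafted = graft(fragment, "positionOnShelf(COPIES)")
pruned = prune(grafted, "time")
```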


Fact schema “RENTAL”:

[Figure: from the pruned attribute tree rooted in positionOnShelf(RENTALS) to the fact schema RENTAL, with the fact RENTAL, the measure number, and the dimension attributes date, customer, gender, title, category, lengh, director and mainActor]


SQL measure glossaries for fact schema “RENTAL”:

number =
  SELECT COUNT(*)
  FROM RENTALS R
    INNER JOIN COPIES C ON R.positionOnShelf = C.positionOnShelf
    INNER JOIN MOVIES F ON C.movieCode = F.movieCode
    INNER JOIN CUSTOMERS U ON R.cardNumber = U.cardNumber
  GROUP BY F.title, R.date, U.name;
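The glossary query for the measure number can be tried on a toy database. A minimal sketch using SQLite through Python's `sqlite3` module: the sample rows are made up, the tables are reduced to the columns the query needs, and the join of COPIES to MOVIES on movieCode and the alias U for CUSTOMERS are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MOVIES (movieCode TEXT PRIMARY KEY, title TEXT);
CREATE TABLE COPIES (positionOnShelf TEXT PRIMARY KEY, movieCode TEXT);
CREATE TABLE CUSTOMERS (cardNumber TEXT PRIMARY KEY, name TEXT);
CREATE TABLE RENTALS (positionOnShelf TEXT, cardNumber TEXT, date TEXT, time TEXT);
INSERT INTO MOVIES VALUES ('M1', 'Alien'), ('M2', 'Brazil');
INSERT INTO COPIES VALUES ('A1', 'M1'), ('A2', 'M1'), ('B1', 'M2');
INSERT INTO CUSTOMERS VALUES ('C1', 'Ann'), ('C2', 'Bob');
INSERT INTO RENTALS VALUES
  ('A1', 'C1', '2013-09-01', '10:00'),
  ('A2', 'C1', '2013-09-01', '18:00'),
  ('B1', 'C2', '2013-09-01', '11:00');
""")

# One row per (title, date, customer) combination, i.e. per fact instance
QUERY = """
SELECT F.title, R.date, U.name, COUNT(*) AS number
FROM RENTALS R
  INNER JOIN COPIES C ON R.positionOnShelf = C.positionOnShelf
  INNER JOIN MOVIES F ON C.movieCode = F.movieCode
  INNER JOIN CUSTOMERS U ON R.cardNumber = U.cardNumber
GROUP BY F.title, R.date, U.name
ORDER BY F.title, U.name
"""
rows = conn.execute(QUERY).fetchall()
```

Ann rented two different copies of the same movie on the same day, so the two rentals collapse into one fact instance with number = 2.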


Pruning and grafting the attribute tree:

• country is now the child of city
• checkIn is now a boolean added to the tree when the number node was grafted: its value is TRUE only for tickets whose passengers have checked in.

[Figure: the pruned and grafted attribute tree, with nodes flightNumber, date, airline, carrier, departureTime, fromAirport and toAirport (each with city and country), fare, numberOfBags, passengerGender, seat and check-in]


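Fact schema “TICKET ISSUE”:

[Figure: fact schema TICKET ISSUE with measures numberOfFlights, numberOfBags and receipts, and dimension attributes flightNumber, date, passengerGender, check-in, airline, carrier, departureTime, arrivalTime, and the from and to Airport roles with city and country]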


• FLIGHTS (flightNumber, airline, fromAirport:AIRPORTS)
• FLIGHT_INSTANCES (flightNumber:FLIGHTS, date)
• AIRPORTS (IATAcode, name, city, country)
• TICKETS (ticketNumber, flightNumber:FLIGHT_INSTANCES, seat, fare, passengersFirstName, passengersSurname, passengersGender)
• CHECK-IN (ticketNumber:TICKETS, CheckInTime, numberOfBags)

SQL measure glossaries for fact schema “TICKET ISSUE”:
numberOfFlights = SELECT COUNT(*)
FROM TICKETS T
INNER JOIN FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date
GROUP BY T.passengerGender, I.date, T.flightNumber;
numberOfBags = SELECT SUM(C.numberOfBags)
FROM TICKETS T
INNER JOIN FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date
INNER JOIN CHECK_IN C ON T.ticketNumber = C.ticketNumber
GROUP BY T.passengerGender, I.date, T.flightNumber;
receipts = SELECT SUM(T.fare)
FROM TICKETS T
INNER JOIN FLIGHT_INSTANCES I ON T.flightNumber = I.flightNumber AND T.date = I.date
GROUP BY T.passengerGender, I.date, T.flightNumber;
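The checkIn boolean introduced when editing the tree can be derived from the presence of a matching CHECK_IN row. The sketch below is one possible way to do it, not the glossary itself: it swaps the INNER JOIN for a LEFT JOIN so that tickets without check-in are kept, and computes the three measures in a single query. It assumes, as the glossary queries do, that TICKETS carries a date column; the sample rows are invented and the join with FLIGHT_INSTANCES is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TICKETS (ticketNumber INTEGER PRIMARY KEY, flightNumber TEXT, date TEXT,
                      seat TEXT, fare REAL, passengerGender TEXT);
CREATE TABLE CHECK_IN (ticketNumber INTEGER PRIMARY KEY REFERENCES TICKETS(ticketNumber),
                       checkInTime TEXT, numberOfBags INTEGER);

-- Invented sample data: ticket 2 has no check-in row.
INSERT INTO TICKETS VALUES
    (1, 'AZ100', '2013-09-01', '1A', 100.0, 'F'),
    (2, 'AZ100', '2013-09-01', '1B', 120.0, 'F'),
    (3, 'AZ100', '2013-09-01', '2A',  80.0, 'M');
INSERT INTO CHECK_IN VALUES (1, '08:00', 2), (3, '08:10', 1);
""")

# LEFT JOIN keeps tickets that were never checked in; for those rows
# C.ticketNumber is NULL, so the checkIn flag evaluates to 0.
QUERY = """
SELECT T.flightNumber, T.date, T.passengerGender,
       (C.ticketNumber IS NOT NULL) AS checkIn,
       COUNT(*) AS numberOfFlights,
       COALESCE(SUM(C.numberOfBags), 0) AS numberOfBags,
       SUM(T.fare) AS receipts
FROM TICKETS T
LEFT JOIN CHECK_IN C ON T.ticketNumber = C.ticketNumber
GROUP BY T.flightNumber, T.date, T.passengerGender, (C.ticketNumber IS NOT NULL)
ORDER BY T.passengerGender, checkIn
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

With an INNER JOIN on CHECK_IN, ticket 2 would disappear from the result, so its fare would be missing from receipts; this is why the flag is computed with an outer join here.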


The following relational logical schema describes an operational database for car rentals:

• RENTAL_OFFICES (OfficeName, City, Area, State, Country)
• CARS (LicensePlate, Category, Model, Brand, Fuel, RegistrationDate)
• HAVE_OPTIONAL (LicensePlate:CARS, Optional)
• RENTALS (LicensePlate:CARS, PickupDate, DropoffDate, PickupPlace:RENTAL_OFFICES, DropoffPlace:RENTAL_OFFICES, Miles)
• DRIVERS (LicenseNumber, LicenseExpiration, DriverName, Birthdate)
• DRIVE (LicenseNumber:DRIVERS, (LicensePlate, PickupDate):RENTALS)
• INSURANCES (Risk, (LicensePlate, PickupDate):RENTALS, Cost)
• PAYMENTS ((LicensePlate, PickupDate):RENTALS, Amount, Discount, PaymentMode)

Some hidden functional dependencies hold: City->State->Country->Area, and Model->Brand. Inspect and normalize the source schema, then choose a fact of interest and design its fact schema.
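When inspecting the source schema, a suspected hidden dependency such as Model->Brand can be tested against the data. The sketch below is a minimal check over in-memory rows; the function name `fd_holds` and the sample cars are hypothetical. Note that data can only refute a dependency: a sample in which it holds does not prove that it holds in general.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[attribute] for attribute in lhs)
        value = tuple(row[attribute] for attribute in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Invented sample of the CARS table, for illustration only.
cars = [
    {"LicensePlate": "AA-001", "Model": "Clio", "Brand": "Renault", "Fuel": "petrol"},
    {"LicensePlate": "AA-002", "Model": "Clio", "Brand": "Renault", "Fuel": "diesel"},
    {"LicensePlate": "AA-003", "Model": "Golf", "Brand": "Volkswagen", "Fuel": "diesel"},
]

model_determines_brand = fd_holds(cars, ["Model"], ["Brand"])
model_determines_fuel = fd_holds(cars, ["Model"], ["Fuel"])
print(model_determines_brand, model_determines_fuel)
```

In this sample Model->Brand is not contradicted, whereas Model->Fuel is refuted by the two Clio rows.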


Choosing RENTALS or PAYMENTS as the fact is equivalent here, because the two tables are related by a one-to-one link.

[Figure: attribute tree rooted in RENTALS (identified by LicensePlate, PickupDate), with nodes Amount, Discount, PaymentMode, Date (pickup and dropoff), Miles, LicensePlate, Category, Model, Brand, Fuel, registration, OfficeName (pickup and dropoff), City, State, Country, Area]


In the edited attribute tree, the drop-off date is pruned and replaced by a Duration attribute, computed as the number of days between the drop-off and pick-up dates:

[Figure: edited attribute tree rooted in RENTALS, with nodes Amount, Discount, PaymentMode, Duration, Miles, Date, Month, Year, Car, Category, Model, Brand, Fuel, registration, Office (pickup and dropoff), City, State, Country, Area]
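The Duration attribute can be computed during data staging with plain date arithmetic. A minimal sketch, assuming pick-up and drop-off are available as calendar dates; the function name `rental_duration` is hypothetical.

```python
from datetime import date


def rental_duration(pickup: date, dropoff: date) -> int:
    """Number of days between the pick-up and the drop-off dates."""
    if dropoff < pickup:
        raise ValueError("drop-off date precedes pick-up date")
    return (dropoff - pickup).days


three_days = rental_duration(date(2013, 9, 1), date(2013, 9, 4))
same_day = rental_duration(date(2013, 9, 1), date(2013, 9, 1))
print(three_days, same_day)
```

A car returned on the day it was picked up gets a duration of 0 under this definition; whether such rentals should count as one day is a design choice left to the requirements.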


Fact schema “RENTAL”:

[Figure: fact schema RENTAL. Measures: Amount, Discount, Duration, Miles. Attributes shown: PaymentMode, Date, Month, Year, Car, Category, Model, Brand, Fuel, registration, Office (dropoff and pickup), City, State, Country, Area]

The attribute tree with root in LIFTS is depicted below.

[Figure: attribute tree rooted in LIFTS, with nodes Date, Time, LiftSystemCode, Type, Length, Duration, TicketCode, Cost, OfficeCode, Name, Address, Points, purchase (identified by OfficeCode, PurchaseDate, ProgressiveNumber), PurchaseTime, HolderName, Discount, TypeCode, Description, Days]

Note that, in the edited tree, the functional dependency from LiftSystemCode to Points has been removed to make Points a child of the root, so that it can be chosen as a measure. The same is done for Duration. The ticket and the skipass granularities are removed, and a SkipassOrTicket flag is added to …
