Contents of this slideshow :


Transcript of Contents of this slideshow :

Page 1: Contents of this slideshow :

Contents of this slideshow:

• What is a data warehouse?

• Multi-dimensional data modeling

Page 2: Contents of this slideshow :

An example of a data warehouse:

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Products: Product#, Product-name, Price
Dimension Orders: Order#, Ordertype
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name

A star schema data warehouse has a central table (the fact table) surrounded by dimension tables with one-to-many relationships towards the fact table.

The fixed database structure implies that application programs (drilling functions/aggregates) can be generated automatically!
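A minimal sketch of how such a generator could work, assuming a hypothetical schema description (`FACT`, `DIMENSIONS`) and column names such as `ProductNo` in place of `Product#`, because `#` is not valid in unquoted SQLite identifiers. The helper `build_aggregate_query` is illustrative and not part of the slides; it only shows that the aggregate query follows mechanically from the star structure.

```python
import sqlite3

# Hypothetical description of the star schema: the fact table and, for each
# dimension table, the key column it shares with the fact table.
FACT = "Orderdetails"
DIMENSIONS = {"Products": "ProductNo", "Salesmen": "SalesmanNo"}

def build_aggregate_query(group_by, measure="SUM(Qty * Price)"):
    """Generate an aggregate query for a list of (dimension, column) pairs."""
    joins = "".join(
        f" JOIN {dim} ON {FACT}.{key} = {dim}.{key}"
        for dim, key in DIMENSIONS.items()
    )
    cols = ", ".join(f"{dim}.{col}" for dim, col in group_by)
    select = f"{cols}, " if cols else ""
    tail = f" GROUP BY {cols} ORDER BY {cols}" if cols else ""
    return f"SELECT {select}{measure} AS Turnover FROM {FACT}{joins}{tail}"

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Products (ProductNo INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
CREATE TABLE Salesmen (SalesmanNo INTEGER PRIMARY KEY, SalesmanName TEXT);
CREATE TABLE Orderdetails (ProductNo INTEGER, OrderNo INTEGER, Qty INTEGER,
                           DateNo INTEGER, SalesmanNo INTEGER);
INSERT INTO Products VALUES (1, 'Screw', 2.0), (2, 'Bolt', 5.0);
INSERT INTO Salesmen VALUES (1, 'Smith'), (2, 'Jones');
INSERT INTO Orderdetails VALUES (1, 100, 10, 1, 1), (2, 100, 4, 1, 1), (1, 101, 5, 1, 2);
""")

# Three aggregation levels generated from the same schema description.
top_level = con.execute(build_aggregate_query([])).fetchall()
by_product = con.execute(
    build_aggregate_query([("Products", "ProductName")])).fetchall()
by_product_salesman = con.execute(build_aggregate_query(
    [("Products", "ProductName"), ("Salesmen", "SalesmanName")])).fetchall()
print(by_product)
```

Only the list passed as `group_by` changes between the three calls, which is the sense in which drilling functions can be generated rather than hand-written.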

Page 3: Contents of this slideshow :

Dimension hierarchies:

A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:

Fact table Orderdetails: Product#, Order#, Qty, Price
Dimension hierarchy: Orders (Order#, Customer#, Date) -> Customers (Customer#, Customer-name)

In a dimension hierarchy it is possible to aggregate data from the fact table to the different levels of the hierarchy.

Drill-down = “de-aggregate” = break an aggregate into its constituents.

Roll-up = aggregate along one or more dimensions.
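The following sketch illustrates roll-up and drill-down along the Orderdetails -> Orders -> Customers hierarchy. The table contents are invented for illustration, and the column names (`OrderNo`, `CustomerNo`, `OrderDate`) are assumed stand-ins for `Order#`, `Customer#` and `Date`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Customers (CustomerNo INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE Orders (OrderNo INTEGER PRIMARY KEY, CustomerNo INTEGER, OrderDate TEXT);
CREATE TABLE Orderdetails (ProductNo INTEGER, OrderNo INTEGER, Qty INTEGER, Price REAL);
INSERT INTO Customers VALUES (1, 'Acme'), (2, 'Bravo');
INSERT INTO Orders VALUES (10, 1, '2024-01-05'), (11, 1, '2024-01-09'), (12, 2, '2024-01-09');
INSERT INTO Orderdetails VALUES (1, 10, 2, 5.0), (2, 10, 1, 10.0), (1, 11, 4, 5.0), (2, 12, 3, 10.0);
""")

# Lower level of the hierarchy: one aggregate per order.
per_order = con.execute("""
SELECT o.OrderNo, SUM(d.Qty * d.Price)
FROM Orderdetails d JOIN Orders o ON d.OrderNo = o.OrderNo
GROUP BY o.OrderNo ORDER BY o.OrderNo
""").fetchall()

# Roll-up: follow the next one-to-many relationship and aggregate per customer.
per_customer = con.execute("""
SELECT c.CustomerName, SUM(d.Qty * d.Price)
FROM Orderdetails d
JOIN Orders o ON d.OrderNo = o.OrderNo
JOIN Customers c ON o.CustomerNo = c.CustomerNo
GROUP BY c.CustomerName ORDER BY c.CustomerName
""").fetchall()

print(per_order)
print(per_customer)
```

Reading the two results in the opposite order, from `per_customer` to `per_order`, corresponds to a drill-down: each customer aggregate is broken into the order aggregates it consists of.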

Page 4: Contents of this slideshow :

Two different types of drilling:

• Drilling in dimension hierarchies.

• Drilling between dimensions.

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Products: Product#, Product-name, Price
Dimension Orders: Order#, Ordertype
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name

Page 5: Contents of this slideshow :

Which star schemas or data marts can be built by using the illustrated integrated E-commerce/ERP data model?

Which star schema would you recommend to be implemented first?

Entities in the data model:

Location: Location#, Address
UserSession: Session#, IPaddress, #Click, Timestamp
Product: Product#, ProductName, Price
Order: Order#, OrderDate, Balance, State
Order-DetailHistory: Inv-Item#, Order#, Seq#, State, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Shipping: Shipping#, ShipMethod, ShipCharge, State, ShipDate
CreditCard: Card#, HolderName, ExpireDate
Payment: Payment#, Ammount, State, Timestamp
InvoyceHistory: Invoice#, Timestamp, State, Notes
Address: Address#, Name, Add1, Add2, City, State, Zip
Invoice: Invoice#, CreationDate
Product-Stock: Product#, Location#, Qty
Customer: Customer#, Kredit-Limit, Balance

Relationship labels in the diagram: Billing, Shipping

Page 6: Contents of this slideshow :

Data mart = Kimball's term for any multidimensional database/star schema.

A galaxy is a set of multidimensional databases with conformed (i.e. jointly adapted) dimensions:

Fact table Sale-Orderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#
Fact table Storage-per-product: Product#, Date#, End-of-day-storage-qty
Fact table Purchase-orderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
Time dimension hierarchy: Day (yy, mm, dd) -> Month (yy, mm) -> Year (yy)
Product dimension hierarchy: Products (Product#, Product-name) -> Productgroups (Product-group#, Product-group-name)
Figure label: The value chain

Suppose an enterprise has a data mart for Purchase and another data mart for Sale as illustrated above. Is it possible to calculate the revenue per month for the last year by using such a galaxy?

Page 7: Contents of this slideshow :

Conformed dimensions = dimensions designed to be common for different data marts in order to make drill across operations possible.

Conformed facts = measures with common units of measurement and granularities that make it possible to integrate measures from different fact tables.
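A drill-across operation can be sketched as follows, assuming both fact tables reference the same `Day` dimension through a shared `DateNo` key and record amounts in the same currency. The table contents are invented, the `Discount` column is ignored for simplicity, and the helper `monthly_total` is hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Day (DateNo INTEGER PRIMARY KEY, yy INTEGER, mm INTEGER, dd INTEGER);
CREATE TABLE SaleOrderdetails (ProductNo INTEGER, SaleOrderNo INTEGER, Qty INTEGER,
                               Discount REAL, SalePrice REAL, DateNo INTEGER);
CREATE TABLE PurchaseOrderdetails (ProductNo INTEGER, PurchaseOrderNo INTEGER,
                                   PurchasePrice REAL, Qty INTEGER, DateNo INTEGER);
INSERT INTO Day VALUES (1, 2024, 1, 15), (2, 2024, 1, 20), (3, 2024, 2, 3);
INSERT INTO SaleOrderdetails VALUES (1, 500, 10, 0.0, 3.0, 1), (1, 501, 5, 0.0, 3.0, 3);
INSERT INTO PurchaseOrderdetails VALUES (1, 900, 2.0, 20, 1), (1, 901, 2.0, 5, 2);
""")

def monthly_total(fact, amount_expr):
    """Aggregate one fact table to the (year, month) level of the shared Day dimension."""
    sql = f"""SELECT d.yy, d.mm, SUM({amount_expr})
              FROM {fact} f JOIN Day d ON f.DateNo = d.DateNo
              GROUP BY d.yy, d.mm"""
    return {(yy, mm): total for yy, mm, total in con.execute(sql)}

# Step 1: aggregate each data mart separately to the same conformed level.
sales = monthly_total("SaleOrderdetails", "f.Qty * f.SalePrice")
purchases = monthly_total("PurchaseOrderdetails", "f.Qty * f.PurchasePrice")

# Step 2: drill across by matching the aggregates on the conformed (yy, mm) key.
drill_across = {
    month: (sales.get(month, 0.0), purchases.get(month, 0.0))
    for month in sorted(set(sales) | set(purchases))
}
print(drill_across)
```

The matching in step 2 is only meaningful because both marts use the same month keys (conformed dimension) and the same unit of measurement (conformed facts).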

Fact table Sale-Orderdetails: Product#, Sale-order#, Qty, Discount, Sale-price, Date#
Fact table Storage-per-product: Product#, Date#, End-of-day-storage-qty
Fact table Purchase-orderdetails: Product#, Purchase-order#, Purchase-price, Qty, Date#
Time dimension hierarchy: Day (yy, mm, dd) -> Month (yy, mm) -> Year (yy)
Product dimension hierarchy: Products (Product#, Product-name) -> Productgroups (Product-group#, Product-group-name)
Figure label: The value chain

Is it possible to calculate the revenue per month for the last year if the data mart for Purchase and the data mart for Sale do not have conformed dimensions or facts?

Page 8: Contents of this slideshow :

Contents of this slideshow:

• What is a data warehouse?

• Multi-dimensional data modeling

Page 9: Contents of this slideshow :

Data warehouse aggregating to the product level:

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Products: Product#, Product-name, Price
Dimension Orders: Order#, Ordertype
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name

SELECT Products.Product#, SUM(Qty*Price) AS Turnover
FROM Orderdetails JOIN Products ON Orderdetails.Product# = Products.Product#
GROUP BY Products.Product#;
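A runnable version of the product-level aggregation, sketched with Python's built-in sqlite3 module. The rows are invented for illustration; identifiers containing `#` or `-` are double-quoted because SQLite does not accept them unquoted.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Products ("Product#" INTEGER PRIMARY KEY, "Product-name" TEXT, Price REAL);
CREATE TABLE Orderdetails ("Product#" INTEGER, "Order#" INTEGER, Qty INTEGER,
                           "Date#" INTEGER, "Salesman#" INTEGER);
INSERT INTO Products VALUES (1, 'Screw', 2.0), (2, 'Bolt', 5.0);
INSERT INTO Orderdetails VALUES (1, 100, 10, 1, 1), (2, 100, 4, 1, 1), (1, 101, 5, 2, 2);
""")

# Aggregate the fact table to the product level; Price comes from the dimension.
turnover_per_product = con.execute("""
SELECT p."Product#", SUM(d.Qty * p.Price) AS Turnover
FROM Orderdetails d JOIN Products p ON d."Product#" = p."Product#"
GROUP BY p."Product#" ORDER BY p."Product#"
""").fetchall()

print(turnover_per_product)
```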

Page 10: Contents of this slideshow :

Drill down to the Product per Salesman level:

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Products: Product#, Product-name, Price
Dimension Orders: Order#, Ordertype
Dimension Salesmen: Salesman#, Salesman-name
Dimension Time: Date#, Date-Name

SELECT Products.Product#, Salesmen.Salesman#, SUM(Qty*Price) AS Turnover
FROM Orderdetails
JOIN Products ON Orderdetails.Product# = Products.Product#
JOIN Salesmen ON Orderdetails.Salesman# = Salesmen.Salesman#
GROUP BY Products.Product#, Salesmen.Salesman#;

Where should the Price be stored?

Page 11: Contents of this slideshow :

Dimension hierarchies:

A dimension hierarchy is a set of tables connected by one-to-many relationships towards the fact table:

Fact table Orderdetails: Product#, Order#, Qty, Price
Dimension hierarchy: Orders (Order#, Customer#, Date) -> Customers (Customer#, Customer-name)

In contrast to a star schema, a snowflake schema may have dimension hierarchies.

Describe the advantages and disadvantages of using dimension hierarchies/snowflake schemas.

Page 12: Contents of this slideshow :

Snowflake schema with branches:

A Snowflake schema may have branches in the dimension hierarchies:

Fact table Orderdetails: Product#, Order#, Qty
Dimension hierarchy: Orders (Order#, Customer#, Date) -> Customers (Customer#, Customer-name)
Dimension hierarchy: Products (Product#, Product-name, Price, Group#) -> Productgroups (Group#, Group-name, Department#) -> Departments (Department#, Department-name)
Snowflake hierarchy: Salesmen (Salesman#, Salesman-name, Branch-office#) -> Branchoffices (Branch-office#, Region#) -> Regions (Region#, Region-name)

Are Customers related to the regions?

Page 13: Contents of this slideshow :

The aggregation level is the list of arguments to the GROUP BY clause.

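Result columns: x1 x2 … xn (the GROUP BY arguments), followed by aggregated data and non-aggregated data. In the example, Salesman# and Productname are the GROUP BY arguments, Turnover is aggregated data and Branch-office# is non-aggregated data.

Salesman#   Productname   Turnover   Branch-office#
Smith       Screw         10,000     LA
Smith       Bolt          30,000     LA
Smith       Nut           60,000     LA
Jones       Screw         20,000     SF
Jones       Nut           40,000     SF
. . .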

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Dimension Products: Product#, Product-name, Price
Dimension Orders: Order#, Ordertype
Dimension Salesmen: Salesman#, Salesman-name, Branch-Office#
Dimension Time: Date#, Date-Name
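The result table of this slide can be reproduced with a query in which the aggregation level (salesman, product) is the GROUP BY list. This is a sketch with invented quantities and prices, chosen so that the totals match the five rows shown; column names such as `SalesmanNo` are assumed stand-ins for `Salesman#`.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Products (ProductNo INTEGER PRIMARY KEY, ProductName TEXT, Price REAL);
CREATE TABLE Salesmen (SalesmanNo INTEGER PRIMARY KEY, SalesmanName TEXT, BranchOffice TEXT);
CREATE TABLE Orderdetails (ProductNo INTEGER, SalesmanNo INTEGER, Qty INTEGER);
INSERT INTO Products VALUES (1, 'Screw', 10.0), (2, 'Bolt', 15.0), (3, 'Nut', 20.0);
INSERT INTO Salesmen VALUES (1, 'Smith', 'LA'), (2, 'Jones', 'SF');
INSERT INTO Orderdetails VALUES (1, 1, 400), (1, 1, 600), (2, 1, 2000),
                                (3, 1, 3000), (1, 2, 2000), (3, 2, 2000);
""")

# Aggregation level = (SalesmanName, ProductName).
# Turnover is aggregated data. BranchOffice is non-aggregated data: it depends
# only on the salesman, so MIN() simply returns its single value per group.
rows = con.execute("""
SELECT s.SalesmanName, p.ProductName,
       SUM(d.Qty * p.Price) AS Turnover,
       MIN(s.BranchOffice) AS BranchOffice
FROM Orderdetails d
JOIN Products p ON d.ProductNo = p.ProductNo
JOIN Salesmen s ON d.SalesmanNo = s.SalesmanNo
GROUP BY s.SalesmanName, p.ProductName
ORDER BY s.SalesmanName, p.ProductName
""").fetchall()

for row in rows:
    print(row)
```

Wrapping the non-aggregated attribute in `MIN()` is one portable way to select it; some database systems also accept it as a bare column when it is functionally dependent on the GROUP BY arguments.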

Page 14: Contents of this slideshow :

Drilling in dimension hierarchies:

Fact table Orderdetails: Product#, Order#, Qty
Dimension hierarchy: Orders (Order#, Customer#, Date) -> Customers (Customer#, Customer-name)
Dimension hierarchy: Products (Product#, Product-name, Price, Group#) -> Product groups (Group#, Group-name, Department#) -> Departments
Snowflake hierarchy: Salesmen (Salesman#, Salesman-name, Branch-office#) -> Branch offices (Branch-office#, Region#)

Branch-office#   Turnover
LA               400,000
SF               200,000

Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF
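The two result tables above can be sketched as two queries over the same data, one per level of the Salesmen -> Branch offices hierarchy. As a simplifying assumption, the fact table here stores a ready-made `Amount` per row instead of Qty and Price, and the individual fact rows are invented so that they sum to the figures shown.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Salesmen (SalesmanNo INTEGER PRIMARY KEY, SalesmanName TEXT, BranchOffice TEXT);
CREATE TABLE Sales (SalesmanNo INTEGER, Amount REAL);
INSERT INTO Salesmen VALUES (1, 'Smith', 'LA'), (2, 'Jones', 'LA'), (3, 'Adams', 'SF');
INSERT INTO Sales VALUES (1, 40000), (1, 60000), (2, 300000), (3, 150000), (3, 50000);
""")

# Salesman level of the hierarchy.
per_salesman = con.execute("""
SELECT s.SalesmanName, SUM(f.Amount), s.BranchOffice
FROM Sales f JOIN Salesmen s ON f.SalesmanNo = s.SalesmanNo
GROUP BY s.SalesmanName, s.BranchOffice ORDER BY s.SalesmanName
""").fetchall()

# Roll up one level: the same facts grouped by branch office.
per_branch = con.execute("""
SELECT s.BranchOffice, SUM(f.Amount)
FROM Sales f JOIN Salesmen s ON f.SalesmanNo = s.SalesmanNo
GROUP BY s.BranchOffice ORDER BY s.BranchOffice
""").fetchall()

print(per_branch)
print(per_salesman)
```

Going from `per_branch` to `per_salesman` is the drill-down: the LA aggregate of 400,000 is broken into the aggregates of Smith and Jones.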

Page 15: Contents of this slideshow :

Drilling between dimension hierarchies:

Fact table Orderdetails: Product#, Order#, Qty
Dimension hierarchy: Orders (Order#, Customer#, Date) -> Customers (Customer#, Customer-name)
Dimension: Products
Snowflake hierarchy: Salesmen (Salesman#, Salesman-name, Branch-office#) -> Branch offices (Branch-office#, Region#)

Salesman#   Turnover   Branch-office#
Smith       100,000    LA
Jones       300,000    LA
Adams       200,000    SF

Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones       Screw          20,000     SF
Jones       Nut            40,000     SF
. . .

Page 16: Contents of this slideshow :

Roll up to the top level:

Roll-up can be executed by removing one or more arguments from the GROUP BY clause.

Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones       Screw          20,000     SF
Jones       Nut            40,000     SF
. . .

Roll up to the product level:

Productname   Turnover
Screw         100,000
Bolt          200,000
Nut           300,000

Roll up to the top level:

Top level turnover: 600,000
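The effect of removing GROUP BY arguments can be sketched in plain Python. The hypothetical `aggregate` function below groups rows by whichever columns are listed; only the five detail rows shown on the slide are used, so the totals are smaller than the slide's product-level figures, which also include the rows elided by ". . .".

```python
from collections import defaultdict

# Detail rows from the slide: (salesman, product, branch_office, turnover).
COLUMNS = ("salesman", "product", "branch_office")
ROWS = [
    ("Smith", "Screw", "LA", 10_000),
    ("Smith", "Bolt", "LA", 30_000),
    ("Smith", "Nut", "LA", 60_000),
    ("Jones", "Screw", "SF", 20_000),
    ("Jones", "Nut", "SF", 40_000),
]

def aggregate(rows, group_by):
    """Sum the turnover per distinct combination of the group_by columns."""
    positions = [COLUMNS.index(name) for name in group_by]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[-1]
    return dict(totals)

# Each call removes one argument, i.e. rolls up one step further.
per_salesman_product = aggregate(ROWS, ["salesman", "product"])
per_product = aggregate(ROWS, ["product"])
top_level = aggregate(ROWS, [])

print(per_product)
print(top_level)
```

With an empty argument list every row falls into the same group, which is the top level of the roll-up.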

Page 17: Contents of this slideshow :

Non-linear dimensions, e.g. the date dimension:

• The granularity is day.

• Many different hierarchies.

• Two major problems:
– Calendar Week does not aggregate to year.
– Type of Day distinguishes between working days and holidays. However, holidays are independent of the other dimensions (e.g. Easter).

Levels shown in the figure, all above the Day granularity: Day of Week; Type of Day; Fiscal Week, Fiscal Month, Fiscal Quarter, Fiscal Year; Calendar Week; Calendar Month, Calendar Quarter, Calendar Year.

What aggregation level would you use to calculate the average sale on non-holiday Mondays per month?
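The first problem can be illustrated with a small sketch that derives date-dimension attributes from a calendar day. It assumes ISO 8601 week numbering for Calendar Week, takes holidays from a caller-supplied set, and treats weekends as a separate day type; the fiscal levels are omitted because they depend on company-specific rules. The function `date_dimension_row` is hypothetical.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def date_dimension_row(d, holidays=frozenset()):
    """Build one date-dimension record for the day d."""
    iso_year, iso_week, _ = d.isocalendar()
    if d in holidays:
        type_of_day = "holiday"
    elif d.weekday() >= 5:
        type_of_day = "weekend"
    else:
        type_of_day = "working day"
    return {
        "day": d.isoformat(),
        "day_of_week": DAY_NAMES[d.weekday()],
        "type_of_day": type_of_day,
        "calendar_week": f"{iso_year}-W{iso_week:02d}",
        "calendar_month": f"{d.year}-{d.month:02d}",
        "calendar_quarter": f"{d.year}-Q{(d.month - 1) // 3 + 1}",
        "calendar_year": d.year,
    }

# 30 December 2024 belongs to calendar year 2024 but to ISO week 1 of 2025,
# so weeks cannot be rolled up to years.
year_end = date_dimension_row(date(2024, 12, 30))

# Easter Monday 2025 must be looked up; it cannot be derived from the other levels.
easter_monday = date_dimension_row(date(2025, 4, 21), holidays={date(2025, 4, 21)})

print(year_end["calendar_week"], year_end["calendar_year"])
print(easter_monday["day_of_week"], easter_monday["type_of_day"])
```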

Page 18: Contents of this slideshow :

The time dimension:

• The granularity is minute.

• The top level is a whole day.

Levels shown in the figure: Minute, Hour, Day Part, AM/PM Flag.

Why do you think Kimball recommends separating the date and time dimensions?

Page 19: Contents of this slideshow :

Degenerated dimension = a dimension that is not created because nobody wants to aggregate data to the degenerated level.

Example: The Order dimension should be deleted while the Time and Customer attributes should be created as new dimensions to which it is meaningful to aggregate data.

Fact table Orderdetails: Product#, Order#, Qty, Date#, Salesman#
Products: Product#, Product-name, Price
Orders: Order#, Time, Customer#
Salesmen: Salesman#, Salesman-name

Page 20: Contents of this slideshow :

Exercise:

The figure illustrates an ER-diagram of a car rental company like Hertz or Avis.

Design a snowflake schema, star schema or galaxy for the car rental company!

Entities in the ER diagram: Customers, Car types, Reservations, Orders, Branch offices, Cars, Garages, Garage services, Pick up, Contracts, Car return.

Page 21: Contents of this slideshow :

Major problems in data warehouse design:

Drilling in many-to-many relationships and tree structures.

Inconsistencies caused by "slowly changing dimensions".

Page 22: Contents of this slideshow :

Slowly Changing Dimensions (SCD)

Fact table Bank accounts: Account#, Interest-last-year, Cost-last-year, Branch#
Dimension Branch-offices: Branch#, Branch-name, Branch-size

If the attributes of a dimension are dynamic (i.e. they may be updated), we say that they are slowly changing.

May the Branch-size of a Branch-office change after e.g. a renovation?

May the Branch-name of a Branch-office change?

Page 23: Contents of this slideshow :

Exercise in SCD:

Suppose the attribute Branch-size is dynamic and aggregations are made to the levels (Branch-size, Year) or (Branch-size, Month).

Does this aggregation make sense and how would you solve possible problems?

Fact table Bank accounts: Account#, Interest-last-year, Cost-last-year, Branch#
Dimension Branch-offices: Branch#, Branch-name, Branch-size
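One way to see the possible problem is to overwrite Branch-size and repeat the aggregation. The sketch below assumes the fact table has an explicit `Year` column and a single `Interest` measure, and uses an invented branch that grows from 'Small' to 'Large' in 2023.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE BranchOffices (BranchNo INTEGER PRIMARY KEY, BranchName TEXT, BranchSize TEXT);
CREATE TABLE BankAccounts (AccountNo INTEGER, Year INTEGER, Interest REAL, BranchNo INTEGER);
INSERT INTO BranchOffices VALUES (1, 'Downtown', 'Small');
INSERT INTO BankAccounts VALUES (7, 2022, 100.0, 1);
""")

# Aggregation to the level (Branch-size, Year).
QUERY = """
SELECT b.BranchSize, f.Year, SUM(f.Interest)
FROM BankAccounts f JOIN BranchOffices b ON f.BranchNo = b.BranchNo
GROUP BY b.BranchSize, f.Year ORDER BY f.Year
"""
before = con.execute(QUERY).fetchall()

# Renovation in 2023: the dimension record is overwritten in place.
con.execute("UPDATE BranchOffices SET BranchSize = 'Large' WHERE BranchNo = 1")
con.execute("INSERT INTO BankAccounts VALUES (7, 2023, 150.0, 1)")
after = con.execute(QUERY).fetchall()

print(before)
print(after)
```

After the update, the 2022 interest is reported under 'Large' even though the branch was small in 2022, so the aggregate for historical periods changes when the dimension changes.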

Page 24: Contents of this slideshow :

Exercise:

Is the region of the customer a dynamic attribute of the customer?

Does it make sense to aggregate the rental revenue to the region of the customers?

Entities in the ER diagram: Customers, Car types, Reservations, Orders, Branch offices, Cars, Garages, Garage services, Pick up, Contracts, Car return.

Page 25: Contents of this slideshow :

It is possible to cheat the application generator. That is, special, very complicated data structures may function as many-to-many or network relationships when they are treated as one-to-many relationships.

How would you recommend designing a data warehouse in which it is possible to aggregate Sale to the Stock locations used for the sale?

Entities in the data model:

Location: Location#, Address
UserSession: Session#, IPaddress, #Click, Timestamp
Product: Product#, ProductName, Price
Order: Order#, OrderDate, Balance, State
Order-DetailHistory: Inv-Item#, Order#, Seq#, State, Timestamp
UserAccount: Salesman#, PassWord, Timestamp, #visits, #trans, Ttl-tr-amount
Order-Detail: Product#, Order#, Qty, Price, Timestamp
Shipping: Shipping#, ShipMethod, ShipCharge, State, ShipDate
CreditCard: Card#, HolderName, ExpireDate
Payment: Payment#, Ammount, State, Timestamp
InvoyceHistory: Invoice#, Timestamp, State, Notes
Address: Address#, Name, Add1, Add2, City, State, Zip
Invoice: Invoice#, CreationDate
Product-Stock: Product#, Location#, Qty
Customer: Customer#, Kredit-Limit, Balance

Relationship labels in the diagram: Billing, Shipping

Page 26: Contents of this slideshow :

Exercise: Design a data warehouse for a travel agency.

Labels in the ER diagram: Customers; Reservations; Orders; Departures/Hotel rooms/Car rentals/etc.; Flight routes/Room types/Car types/service types; Buyer; Bookings; Traveler; Product owners.

Page 27: Contents of this slideshow :

Design a data warehouse (or galaxy) for an ERP system with as many meaningful dimensions as possible:

Entities in the sales module: Orders, Accounts, Customers, Orderlines, Products, Stocks per product per location, Account items.

The account module offers services to the other ERP modules.

Page 28: Contents of this slideshow :

End of session

Thank you!!!

Page 29: Contents of this slideshow :

Evaluation of the SCD responses. Evaluation criteria: is historical information preserved, aggregation performance, storage consumption.

Response 1, where dimension records are overwritten
• Is historical information preserved: No
• Aggregation performance: In the evaluation, we define this solution to have average performance
• Storage consumption: Only the current dimension record version is stored. No redundant data is stored

Response 2, where new versions are created
• Is historical information preserved: Yes
• Aggregation performance: Version records make performance slower, proportional to the number of changes
• Storage consumption: All old versions of dimension records are stored, often with redundant attributes

Response 3, where only one historical version is saved
• Is historical information preserved: The current version and a single history destroying version are saved
• Aggregation performance: No performance degradation occurs if either the current or the historical version is used in a query
• Storage consumption: Normally, only a single extra attribute version is stored

Response 4, which uses the top of a dynamic dimension hierarchy as a new static dimension
• Is historical information preserved: Yes
• Aggregation performance: Better or worse depending on whether both dimension tables are used in a query
• Storage consumption: The relatively large fact table must have an extra foreign key attribute

Response 5, with dimension data as fact data
• Is historical information preserved: Yes
• Aggregation performance: Better or worse depending on whether the new fact data are used in a query
• Storage consumption: The relatively large fact table must have an extra attribute for each dynamic dimension attribute

Response 6, which uses fine granularity in combination with response 1 or 3
• Is historical information preserved: The finer the granularity, the more historical state information is preserved
• Aggregation performance: The finer the granularity, the slower the performance
• Storage consumption: The finer the granularity, the more storage consumption

Response 7, which stores dynamic dimension data as static facts in another data mart
• Is historical information preserved: Yes
• Aggregation performance: Better or worse depending on whether both fact tables are used in a drill-across query
• Storage consumption: This is the most storage consuming solution, as at least a new fact and foreign key are stored in the new fact table

Page 30: Contents of this slideshow :

Where do the responses to SCDs store historic information?

• Response 1 does not store historic information.

• Response 2 stores historic information in a new record version.

• Response 3 stores one historic value in a new dimension attribute.

• Response 4 stores historic information in a new dimension relationship.

• Response 5 stores historic information in a new fact attribute.

• Response 6 can sometimes diminish the aggregation error of response 1, as a state fact with finer granularity can be related more accurately to the right dimension record.

• Response 7 stores historic information in a new fact table.
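As an illustration of response 2, the sketch below stores each change of Branch-size as a new version record. It assumes a surrogate key `BranchKey` that the fact table references and a `ValidFrom` year on each version; these names, the explicit `Year` column and the rows are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE BranchOffices (BranchKey INTEGER PRIMARY KEY, BranchNo INTEGER,
                            BranchName TEXT, BranchSize TEXT, ValidFrom INTEGER);
CREATE TABLE BankAccounts (AccountNo INTEGER, Year INTEGER, Interest REAL, BranchKey INTEGER);

-- Version 1 of branch 1, and a fact recorded while it was valid.
INSERT INTO BranchOffices VALUES (1, 1, 'Downtown', 'Small', 2022);
INSERT INTO BankAccounts VALUES (7, 2022, 100.0, 1);

-- Renovation in 2023: a new version record is created instead of overwriting.
INSERT INTO BranchOffices VALUES (2, 1, 'Downtown', 'Large', 2023);
INSERT INTO BankAccounts VALUES (7, 2023, 150.0, 2);
""")

# Each fact joins to the version that was current when the fact was recorded.
result = con.execute("""
SELECT b.BranchSize, f.Year, SUM(f.Interest)
FROM BankAccounts f JOIN BranchOffices b ON f.BranchKey = b.BranchKey
GROUP BY b.BranchSize, f.Year ORDER BY f.Year
""").fetchall()

print(result)
```

The 2022 interest stays under 'Small' because the old version record is kept, at the cost of storing Branch-name redundantly in both versions.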