
7/31/2019 Data Warehousing Design Considerations


Data Warehouse Design Considerations

M. Tech. Course Seminar Report

Submitted in partial fulfillment of the requirements

for the degree of

Master of Technology

by

Abhishek Sugandhi

Roll No: 04305016

under the guidance of

Prof. N.L.Sarda

Department of Computer Science and Engineering

Indian Institute of Technology, Bombay

Mumbai



Acknowledgment

I would like to thank my seminar guide, Prof. N. L. Sarda, for his valuable guidance and encouragement, without which it would not have been possible for me to complete my work.


Abstract

A data warehouse is a complex information system used primarily in the decision-making process by means of On-Line Analytical Processing (OLAP) applications. Over the last few years, data warehouses have been getting a lot of attention from both the industrial and the research community. The reason lies in their great importance: making predictions about the (near) future has always been desirable for business companies. In Chapter 1, I discuss the basics of the data warehouse and its modeling techniques.

Decision support places rather different requirements on database technology than traditional on-line transaction processing. Data warehouses are usually modeled using dimensional modeling, for better understandability and easy extensibility. As data warehouses store huge amounts of both current and historical data, special attention should be given to changing dimensions, time and date dimensions, and hierarchical dimensions while modeling a data warehouse. In Chapter 2, I focus on handling these issues while modeling the data warehouse.

Software vendors have quickly developed products and services for improving the efficiency of querying data warehouses. In Chapter 3, I discuss the querying features provided by Oracle 9i for improving the efficiency of aggregate queries, and the querying features provided by MDX. MDX stands for Multidimensional Expressions. It is a language used to manipulate multidimensional information in Microsoft SQL Server 2000 Analysis Services.


Contents

1 Introduction
1.1 What is Data Warehouse
1.2 Warehouse Schemas
1.3 Dimensional Model Vs. ER Model

2 Data Warehouse Design Issues
2.1 How to model time and date dimension
2.2 Dimension normalization
2.3 Surrogate keys
2.4 Slowly Changing Dimensions
2.4.1 Type 1: Overwrite the Value
2.4.2 Type 2: Add a new Dimension Row
2.4.3 Type 3: Add a new Dimension Column
2.5 Rapidly Changing Dimension
2.6 Handling Hierarchies
2.6.1 Fixed Depth Hierarchy
2.6.2 Variable Depth Hierarchy
2.7 Multivalued Dimension
2.8 Heterogeneous Dimension
2.9 Dimension Role Playing
2.10 Conformed Dimensions

3 Querying on Data Warehouse
3.1 Oracle 9i SQL extension for Aggregation Queries in data warehouse
3.1.1 Syntax
3.1.2 Applications in building Cross-Tabular Report
3.2 Writing MDX queries for Data Warehouse
3.2.1 Common Terms
3.2.2 Rules
3.2.3 MDX Query Structure
3.2.4 Specifying Axis Dimensions
3.2.5 Establishing Cube Context
3.2.6 Specifying Slicer Dimensions
3.2.7 Example Queries
3.2.8 Difference of MDX with SQL

4 Conclusion


[Figure 1.1: Star Schema [?]. A central Sales fact table (customer_id, time_id, channel_id, promotion_id, quantity sold, cost, amount) linked to the Customer, Time, Channels and Promotion dimension tables, each holding its key and other attributes.]

requirements, dimension attributes are usually short identifiers that are foreign keys into other tables called dimension tables. Usually, a fact table is associated with many dimension tables and contains a foreign key for each of them. The fact table is kept highly normalized to reduce space requirements, whereas dimension tables are highly denormalized to ease browsing among the different attributes of a dimension and to let us write simple, easily understandable queries.

The resulting schema, with a fact table, multiple dimension tables, and foreign keys from the fact table to the dimension tables, is called a star schema. If we normalize the dimension tables, so that a dimension table contains foreign keys to other dimension tables, the resulting schema has multiple levels of dimension tables and is called a snowflake schema. Some complex data warehouses may have more than one fact table [?].
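As a rough illustration of the star schema described above, the following sketch builds a tiny Sales fact table with two of its dimension tables in an in-memory SQLite database and runs one aggregate query. The key columns follow Figure 1.1; the table names `times` and `channels`, the attributes `calendar_month` and `channel_desc`, the sample rows, and the use of SQLite are assumptions made only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one surrogate key plus descriptive attributes.
CREATE TABLE times    (time_id INTEGER PRIMARY KEY, calendar_month TEXT);
CREATE TABLE channels (channel_id INTEGER PRIMARY KEY, channel_desc TEXT);
-- The fact table holds only foreign keys and numeric measures.
CREATE TABLE sales (
    time_id       INTEGER REFERENCES times(time_id),
    channel_id    INTEGER REFERENCES channels(channel_id),
    quantity_sold INTEGER,
    amount        REAL
);
INSERT INTO times    VALUES (1, '2000-01'), (2, '2000-02');
INSERT INTO channels VALUES (1, 'Direct'), (2, 'Internet');
INSERT INTO sales    VALUES (1, 1, 2, 20.0), (1, 2, 1, 10.0),
                            (2, 1, 3, 30.0), (2, 1, 1, 10.0);
""")

# Every dimension is exactly one join away from the fact table.
star_query = """
SELECT t.calendar_month, c.channel_desc, SUM(s.amount)
FROM sales s
JOIN times t    ON t.time_id = s.time_id
JOIN channels c ON c.channel_id = s.channel_id
GROUP BY t.calendar_month, c.channel_desc
ORDER BY t.calendar_month, c.channel_desc
"""
monthly_sales = conn.execute(star_query).fetchall()
print(monthly_sales)
```

Because each dimension table joins directly to the fact table, the query has a single obvious join path, which is the property the dimensional model relies on.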

1.3 Dimensional Model Vs. ER Model

The main difference between the dimensional model and the ER model is that dimension tables in the dimensional model are denormalized, whereas tables in the ER model are highly normalized. The ER design technique seeks to remove redundancy in the data by normalizing the relations, so that less disk space is wasted and there are no insert or update anomalies. In a data warehouse, however, if the dimension tables are normalized into typical snowflake structures, two bad things happen. First, the data model becomes too complex to be presented to the user. Second, linking the elements among the various branches of the snowflake compromises browsing performance. Even when a long text string appears redundantly in the dimension table and could be moved to an outrigger table (a table formed by normalization), you won't save enough disk space to justify moving it, because most of the disk space is consumed by the fact table (which is highly normalized) [?].

In many cases, normalization can actually increase the storage requirements. If the cardinality of the repeated dimension data element is high (in other words, there are just a few duplications), the outrigger table may be nearly as big as the main dimension table, and we have introduced another key structure that is now repeated in both tables [?].

Another argument given for normalizing the dimensions is to improve insert or update performance. This is rarely important in a decision-support environment. You update the dimension tables only once per night (typically), and the processing associated with loading perhaps millions of fact records dominates the really minor processing associated with inserting or updating dimension records. A dimensional database design has a fixed structure with no alternative join paths. This greatly simplifies the optimization and evaluation of queries on these schemas [?].

The fact table in a dimensional model represents a many-to-many relationship among the dimension tables. An ER model can be converted into a dimensional model wherever such a many-to-many relationship is present, and such relationships are always present in data warehouses.


Chapter 2

Data Warehouse Design Issues

2.1 How to model time and date dimension

The date dimension is one of the most important dimensions in a data warehouse. It is guaranteed to be present in every data mart, because virtually every data mart is a time series. Instead of keeping the date as an attribute in the fact table or in another dimension table, we should build a separate dimension table, because it allows analysts to query the data warehouse on special attributes such as holidays or major events. SQL date functions do not support filtering by these attributes, so if the business process needs to slice the data by such nonstandard date attributes, an explicit dimension table is essential. Calendar logic belongs in the dimension table rather than in application code [?]. Unlike most other dimensions, the date dimension can be built in advance. Every day is represented as a row in the date dimension, so keeping 10 years of history needs only about 3,650 rows. The date dimension key should be an integer rather than a date data type; this is explained further in the section on surrogate keys.

If we want access to the time of a transaction for day-part analysis, then instead of keeping time-of-day attributes such as hour and minute as fields in the date dimension table, we should handle them through a separate Time Of Day dimension joined to the fact table [?]. This saves a good amount of space: instead of keeping 24 * 60 = 1440 rows per day in the date dimension table to hold information about every minute (3650 * 1440 = 5,256,000 rows for 10 years, which is very large for a dimension table), we build a single Time Of Day table that contains only 1440 rows in total. The date dimension and the Time Of Day dimension are completely independent.
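The row counts above can be checked with a small sketch that builds both dimensions in advance. The attribute names (`is_weekend`, `is_holiday`, and so on), the start date, and the two sample holidays are assumptions chosen for illustration; a real date dimension would carry whatever calendar attributes the business needs.

```python
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]
HOLIDAYS = {date(2000, 1, 26), date(2000, 8, 15)}  # assumed sample holidays

def build_date_dimension(start, days):
    """Return one row per calendar day, keyed by a sequential integer."""
    rows = []
    for key in range(1, days + 1):
        day = start + timedelta(days=key - 1)
        rows.append({
            "date_key": key,                      # integer key, not a date type
            "full_date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "is_weekend": day.weekday() >= 5,
            "is_holiday": day in HOLIDAYS,        # calendar logic lives in the table
        })
    return rows

def build_time_of_day_dimension():
    """Return one row per minute of a day, independent of any date."""
    return [{"time_key": h * 60 + m + 1, "hour": h, "minute": m}
            for h in range(24) for m in range(60)]

date_dim = build_date_dimension(date(2000, 1, 1), 3650)
time_dim = build_time_of_day_dimension()
# Size of a single combined dimension with one row per minute of every day.
combined_rows = len(date_dim) * len(time_dim)
print(len(date_dim), len(time_dim), combined_rows)
```

Keeping the two dimensions separate costs 3650 + 1440 rows, whereas a combined date-and-minute dimension would need their product.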


2.2 Dimension normalization

Dimension table normalization is usually referred to as snowflaking. Redundant attributes are removed from the flat, denormalized dimension table and placed in normalized secondary dimension tables. But we should generally avoid snowflaking, for the following reasons [?]:

1. Snowflaked tables make for a much more complex representation.

2. Numerous tables and joins usually translate into slow query performance.

3. The minor space savings associated with snowflaked dimension tables are insignificant, as dimension tables are generally small and most of the space is consumed by the fact table.

4. It slows down the user's ability to browse within a dimension.

5. Finally, snowflaking defeats the use of bitmap indexes. Bitmap indexes are very useful when indexing low-cardinality fields in our dimension tables.

There are times, however, when snowflaking is permissible, for instance when a clump of correlated attributes is used repeatedly in various independent roles. For example, in a promotion dimension we would need to store both a promotion begin date and a promotion end date attribute [?]. Another example is a multivalued attribute, for which we would need a bridge table.

2.3 Surrogate keys

Surrogate keys are integers that are assigned sequentially as needed to populate a dimension. The surrogate keys merely serve to join the dimension tables to the fact table. Surrogate keys are beneficial for the following reasons [?]:

1. We should avoid operational codes or other smart keys as data warehouse keys, because these operational codes are normally recycled after some period, say one year, whereas the data warehouse retains data for years. One of the primary benefits of surrogate keys is that they buffer the data warehouse environment from operational changes. If we rely on operational codes, we are also vulnerable to key overlap problems.

2. Smaller surrogate keys translate into smaller fact tables, smaller fact table indexes, and more fact table rows per I/O operation.

3. Surrogate keys can be used to record dimension conditions that have no operational code, for example when our dimensional model has dates that are yet to be determined. There is no SQL date value for this condition, but it can be handled with surrogate keys: we keep one more row in the date dimension table, with its own unique key, to identify the YET TO DETERMINE condition, and so avoid a null date dimension key in the fact table.
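The third point can be sketched as follows. The schema, the choice of key 0 for the special row, and the sample orders are assumptions made for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,   -- surrogate key, assigned sequentially
    full_date TEXT,                  -- NULL only for the special row
    label     TEXT
);
CREATE TABLE order_fact (
    order_date_key INTEGER NOT NULL REFERENCES date_dim(date_key),
    ship_date_key  INTEGER NOT NULL REFERENCES date_dim(date_key),
    amount         REAL
);
-- One extra dimension row stands for "date not known yet".
INSERT INTO date_dim VALUES (0, NULL, 'YET TO DETERMINE');
INSERT INTO date_dim VALUES (1, '2000-01-01', 'regular day'),
                            (2, '2000-01-02', 'regular day');
-- The second order has not shipped yet, so it points at key 0, not NULL.
INSERT INTO order_fact VALUES (1, 2, 50.0), (2, 0, 75.0);
""")

unshipped = conn.execute("""
    SELECT f.amount, d.label
    FROM order_fact f
    JOIN date_dim d ON d.date_key = f.ship_date_key
    WHERE d.label = 'YET TO DETERMINE'
""").fetchall()
print(unshipped)
```

Because the fact row carries a real key, an ordinary inner join still returns it; a NULL foreign key would have dropped the unshipped order from the result.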

2.4 Slowly Changing Dimensions

While dimension attributes are relatively static, they are not fixed forever. Dimension attributes change, albeit rather slowly, over time. Accurate tracking of change is necessary so that business users can see the impact of each and every dimension change. When we need to keep track of changes, it is unacceptable to put everything in the fact table and make every dimension time-dependent. Instead, we can take advantage of the fact that most dimensions are nearly constant over time. We can preserve the independent dimensional structure with only relatively minor adjustments to contend with the changes. We refer to these nearly constant dimensions as slowly changing dimensions [?].

For each attribute in a dimension table, we must specify a strategy to handle change. There are three basic techniques for dealing with attribute changes [?]:

1. Overwrite the value

2. Add a Dimension Row

3. Add a Dimension Column

For example, suppose that manufacturing operations makes a slight change in the packaging of SKU 38 (the unique product number given by the organization), and the packaging description changes from glued box to pasted box. Along with this change, manufacturing operations decides not to change the SKU number of the product or the bar code (UPC) printed on the box. Let us see how each of the above methods handles this changing dimension.

2.4.1 Type 1: Overwrite the Value

With Type 1, we merely overwrite the old attribute value in the dimension row, replacing it with the current value. The attribute then always reflects the most recent assignment. The Type 1 response is the simplest and fastest to implement, but it does not maintain any history of prior attribute values [?]. Nevertheless, overwriting is frequently used when the data warehouse team legitimately decides that the old value of the changed dimension attribute is not interesting [?].

In the above example, the original row

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 glued box    ABC922

will be updated as

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 pasted box   ABC922
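A minimal sketch of the Type 1 response, using the row from the example; the table name `product_dim` and its column names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, description TEXT,
    packaging TEXT, sku_no TEXT)""")
conn.execute(
    "INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922')")

# Type 1: overwrite the attribute in place; the old value is lost.
conn.execute("UPDATE product_dim SET packaging = ? WHERE sku_no = ?",
             ("pasted box", "ABC922"))

type1_rows = conn.execute("SELECT * FROM product_dim").fetchall()
print(type1_rows)
```

After the update there is still a single row under the same key, so every fact row, old or new, is now reported under the new packaging.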

2.4.2 Type 2: Add a new Dimension Row

The second technique is the most common and has a number of powerful advantages. If the data warehouse team decides to track the change of an attribute, it issues another record (a row in the dimension table) with the changed value of the attribute. Apart from the surrogate key, the only difference between the two records is in the changed attribute. Even the operational codes are the same.

This technique for tracking slowly changing dimensions is very powerful because new dimension records automatically partition history in the fact table. The old version of the dimension record points to all history in the fact table prior to the change. The new version of the dimension record points to all history after the change [?]. There is no need for a time stamp in the product table to record the change; it is best recorded by the fact table records, which carry the key of the newly added record [?].

Another advantage of this technique is that you can gracefully track as many changes to a dimensional item as you wish. Each change generates a new dimension record, and each record partitions history perfectly. The main drawbacks of the technique are the requirement to generalize the dimension key, and the growth of the dimension table itself [?].

Using the Type 2 technique for the previous example, we would have two product dimension rows (the original and the updated one):

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 glued box    ABC922
34567         Scent                 pasted box   ABC922
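The way a new dimension row partitions fact history can be sketched as below. The two product rows are taken from the example; the `sales_fact` table and its quantities are assumptions added so that the effect is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, description TEXT,
    packaging TEXT, sku_no TEXT);
CREATE TABLE sales_fact (product_key INTEGER, quantity INTEGER);
INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
INSERT INTO sales_fact VALUES (12345, 10), (12345, 5);   -- before the change
""")

# Type 2: add a new row under a new surrogate key; the SKU is unchanged.
conn.execute(
    "INSERT INTO product_dim VALUES (34567, 'Scent', 'pasted box', 'ABC922')")
# Facts loaded after the change carry the new key.
conn.execute("INSERT INTO sales_fact VALUES (34567, 7)")

# Grouping by the changed attribute separates old and new history.
by_packaging = conn.execute("""
    SELECT p.packaging, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.packaging ORDER BY p.packaging
""").fetchall()
# Grouping by the unchanged operational code sees the product as a whole.
by_sku = conn.execute("""
    SELECT p.sku_no, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.sku_no
""").fetchall()
print(by_packaging, by_sku)
```

No time stamp is needed in the dimension: which fact rows belong to which version is determined entirely by the surrogate key they carry.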

2.4.3 Type 3: Add a new Dimension Column

While the Type 2 response partitions history, it does not allow us to associate the new attribute value with old fact history or vice versa. However, we sometimes want the ability to see fact data as if the change never occurred. We can meet this requirement not by creating a new dimension record, as in the Type 2 technique, but by creating a new current-value field. The Type 3 technique allows us to see new and historical fact data by either the new or the prior attribute value [?].

Using the Type 3 technique for the previous example, we would update the original row to

Product Key   Product Description   Current Packaging   Previous Packaging   SKU No.
12345         Scent                 pasted box          glued box            ABC922
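A sketch of the Type 3 response on the same row. The column names and the `sales_fact` rows are assumptions for illustration; the point is only that both attribute values remain attached to all of the facts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product_dim (
    product_key INTEGER PRIMARY KEY, description TEXT,
    current_packaging TEXT, previous_packaging TEXT, sku_no TEXT);
CREATE TABLE sales_fact (product_key INTEGER, quantity INTEGER);
INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', NULL, 'ABC922');
INSERT INTO sales_fact VALUES (12345, 10), (12345, 7);
""")

# Type 3: keep the old value in a second column, then overwrite the current one.
# The right-hand sides of SET are evaluated against the row as it was before
# the update, so previous_packaging receives the old current_packaging.
conn.execute("""
    UPDATE product_dim
    SET previous_packaging = current_packaging,
        current_packaging  = ?
    WHERE product_key = ?
""", ("pasted box", 12345))

type3_row = conn.execute(
    "SELECT current_packaging, previous_packaging FROM product_dim").fetchone()
by_current = conn.execute("""
    SELECT p.current_packaging, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p USING (product_key)
    GROUP BY p.current_packaging
""").fetchall()
by_previous = conn.execute("""
    SELECT p.previous_packaging, SUM(f.quantity)
    FROM sales_fact f JOIN product_dim p USING (product_key)
    GROUP BY p.previous_packaging
""").fetchall()
print(type3_row, by_current, by_previous)
```

Unlike Type 2, the full fact history is reported under either value, which is what "as if the change never occurred" amounts to.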

2.5 Rapidly Changing Dimension

Normally, we do not use any of the techniques described above for handling a changing dimension if the dimension already contains millions of rows. Unfortunately, huge dimensions are also more likely to change than moderately sized ones. We sometimes call this situation a rapidly changing monster dimension [?].

The solution is to break the frequently analyzed or frequently changing attributes out into a separate dimension, referred to as a minidimension [?]. There is one row in the minidimension for every unique combination of the frequently analyzed attribute values encountered in the data (not one row per customer). We leave the more constant or less frequently queried attributes behind in the original huge customer table. When creating the minidimension, continuously changing variables should be converted to banded ranges. In other words, we force the attributes in the minidimension to take a relatively small number of discrete values [?]. Although this restricts analysis to the predefined bands, it drastically reduces the number of combinations in the minidimension.

Every time we build a fact table row, we include two foreign keys related to the dimension: the regular dimension key and the minidimension key. The minidimension key should be part of the fact table's set of foreign keys to provide efficient access to the fact table.

This design delivers browsing and constraining performance benefits by providing a smaller point of entry to the fact table, and we can avoid joins to the huge dimension table if the (static) attributes from that table are not constrained. When the minidimension key participates as a foreign key in the fact table, another benefit is that the fact table serves to capture the changes in the minidimension's attributes. When an attribute of the dimension changes, we simply load the new minidimension key into subsequent fact rows; earlier fact rows still carry the old minidimension key. Thus we keep track of history as well [?].

[Figure 2.1: Example of Handling Rapidly Changing large Dimension [?]. The Customer Dimension (Customer Key, Customer ID, Customer Name, ..., Age, Gender, Annual Income) becomes a Customer Dimension (Customer Key, Customer ID, Customer Name, ...) and a Customer Minidimension (Customer Minidimension Key, Customer Age Band, Customer Gender, Customer Income Band); the Fact Table carries Customer Key, Customer Minidimension Key, more foreign keys, and the facts.]
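The banding step can be sketched in a few lines. The band boundaries, band labels, and sample customers below are assumptions; only the attribute choice (age band, gender, income band) follows Figure 2.1.

```python
def age_band(age):
    """Map an exact age to an assumed band label."""
    if age < 21:
        return "under 21"
    if age < 41:
        return "21-40"
    if age < 61:
        return "41-60"
    return "over 60"

def income_band(income):
    """Map an annual income to an assumed band label."""
    if income < 20000:
        return "low"
    if income < 60000:
        return "medium"
    return "high"

customers = [
    {"customer_key": 1, "age": 25, "gender": "F", "income": 30000},
    {"customer_key": 2, "age": 38, "gender": "F", "income": 45000},
    {"customer_key": 3, "age": 52, "gender": "M", "income": 90000},
    {"customer_key": 4, "age": 25, "gender": "F", "income": 31000},
]

minidimension = {}     # (age band, gender, income band) -> minidimension key
customer_to_mini = {}  # key to load into fact rows next to the customer key
for c in customers:
    combo = (age_band(c["age"]), c["gender"], income_band(c["income"]))
    # One minidimension row per combination seen in the data, not per customer.
    mini_key = minidimension.setdefault(combo, len(minidimension) + 1)
    customer_to_mini[c["customer_key"]] = mini_key

print(minidimension)
print(customer_to_mini)
```

Four customers collapse into two minidimension rows here. When a customer moves to another band, new fact rows simply receive a different minidimension key and the large customer table is left untouched.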

2.6 Handling Hierarchies

In many dimensions, a hierarchy is inherent. We consider two approaches to handling hierarchies. The first is straightforward and handles the hierarchy adequately with a simplistic approach. The second is much more advanced and complicated, but also much more extensible.

2.6.1 Fixed Depth Hierarchy

Occasionally we are confronted with a dimension whose hierarchy is highly predictable, with a fixed number of levels (say N). In this case, we can keep N attributes in the dimension, corresponding to these N levels [?]. If some records of the dimension table do not have a hierarchy up to the maximum number of levels, we duplicate the lower-level attributes into the higher-level attributes. In this way, we can report at any level of the hierarchy for every record of the dimension.

2.6.2 Variable Depth Hierarchy

Representing an arbitrary variable-depth hierarchy is an inherently difficult task in a relational environment. A simple computer science approach to storing such information would add a Parent Key field to the Customer dimension. The Parent Key field would be a recursive pointer containing the key value of the parent of any given customer. A special null value would be required for the topmost customer in any given overall enterprise [?].

The problem with this recursive pointer approach is that it cannot be used effectively with standard SQL. The standard SQL GROUP BY clause cannot follow the recursive pointers downward to aggregate an additive fact in the fact table [?]. Instead of using a recursive pointer, we can solve this modeling problem by inserting a bridge table (helper table) between the dimension table and the fact table. The bridge table contains one record for each separate path from each node in the organization tree to itself and to every node below it. Each path row contains the key of the parent roll-up entity, the key of the subsidiary entity, the number of levels between the parent and the subsidiary, a bottom-most flag indicating that there are no further nodes beneath the subsidiary, and a top-most flag indicating that there are no further nodes above the parent [?].

Now, if we want to descend the hierarchy, we join the dimension table to the bridge table by connecting the dimension's primary key with the bridge table's parent key. We can then constrain the dimension to any particular node and request an aggregate measure of all nodes at or below it. The number-of-levels attribute can be used to control the depth of analysis. Similarly, when we want to ascend the hierarchy, we reverse the join by connecting the dimension key with the bridge table's subsidiary key [?].

When a group of nodes is moved from one part of an organization to another, only the bridge table rows that refer to paths from outside parents into the moved structure need to be altered. Rows referring to paths within the moved structure are not affected. Rows must be added for the paths from the structure's new parents.

When issuing SQL statements that use the bridge table, we need to be cautious about overcounting the facts. When connecting the tables, we must constrain the customer dimension to a single value and then join to the bridge table [?].

[Figure 2.2: Handling hierarchy through bridge table [?]. The Customer Dimension (Customer Key, Customer ID, Customer Name) joins to the Hierarchy Bridge (Parent Customer Key, Subsidiary Customer Key, Level, Bottom Flag, Top Flag), which joins to the Fact table (Customer Key, Date Key).]
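The bridge table described in this section can be derived from the recursive parent pointers once, at load time, and then queried with plain joins. The sketch below assumes a four-node customer tree and invented sales amounts; for brevity it constrains the bridge table's parent key directly instead of going through the customer dimension.

```python
import sqlite3

parent_of = {1: None, 2: 1, 3: 1, 4: 2}   # customer_key -> parent key (assumed tree)
sales = [(1, 100.0), (2, 40.0), (3, 25.0), (4, 10.0)]   # (customer_key, amount)

def build_bridge(parent_of):
    """One row per path from every node to itself and to each node below it."""
    has_children = {p for p in parent_of.values() if p is not None}
    rows = []
    for node in parent_of:
        ancestor, levels = node, 0
        while ancestor is not None:
            rows.append((ancestor, node, levels,
                         int(node not in has_children),       # bottom flag
                         int(parent_of[ancestor] is None)))   # top flag
            ancestor, levels = parent_of[ancestor], levels + 1
    return rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bridge (parent_key INTEGER, subsidiary_key INTEGER,
                     levels INTEGER, bottom_flag INTEGER, top_flag INTEGER);
CREATE TABLE sales_fact (customer_key INTEGER, amount REAL);
""")
conn.executemany("INSERT INTO bridge VALUES (?, ?, ?, ?, ?)",
                 build_bridge(parent_of))
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", sales)

def total_at_or_below(customer_key, max_levels=None):
    """Descend: constrain the parent to one node, then join to the facts."""
    sql = """SELECT SUM(f.amount)
             FROM bridge b
             JOIN sales_fact f ON f.customer_key = b.subsidiary_key
             WHERE b.parent_key = ?"""
    params = [customer_key]
    if max_levels is not None:
        sql += " AND b.levels <= ?"     # limit the depth of analysis
        params.append(max_levels)
    return conn.execute(sql, params).fetchone()[0]

print(total_at_or_below(1), total_at_or_below(2), total_at_or_below(1, 1))
```

Constraining the parent to a single node is what prevents overcounting: without that filter, each fact row would be joined once for every ancestor it has in the bridge table.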


2.7 Multivalued Dimension

There are situations where we need to attach a multivalued dimension table to the fact table, for example when many customers are associated with one account, or when multiple diagnoses are associated with a single patient. Database designers usually take one of the following approaches to handling multivalued dimension attributes [?]:

1. Choose one value and omit the other values.

2. Extend the dimension list to have a fixed number of multivalued dimensions.

3. Put a bridge (helper) table between the fact table and the multivalued dimension table.

Frequently, designers choose a single value (the first approach). If we take this approach, the modeling problem goes away, but we are left in doubt whether the multivalued dimension data is useful at all.

The second approach, creating a fixed number of additional multivalued dimension slots in the fact table key, is also not a good idea, as there can be situations where the number of values exceeds the slots we have allocated. Also, we cannot easily query across the multiple separate dimension slots [?].

A bridge table placed between the multivalued dimension and the fact table is the best solution. The multivalued dimension key in the fact table is changed to a multivalued dimension group key. The helper table in the middle is the multivalued dimension group table. It has one record for each member of each group [?].

The group table is joined to the original multivalued dimension on the multivalued dimension key. The group table contains a very important numeric attribute: the weighting factor. In the patient example, the weighting factor allows reports to be created that don't double count the Billed Amount in the fact table.

We can assign the weighting factors equally within a group. If there are three diagnoses in a group, each gets a weighting factor of 1/3. If we have some other rational basis for assigning the weighting factors differently, we can change the factors, as long as all the factors in a Diagnosis Group add up to one.
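The effect of the weighting factor can be seen in a small sketch of the diagnosis example. The table names, the two diagnoses, and the billed amounts are assumptions; weights of 0.5 are used so that the arithmetic is exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diagnosis_dim (diagnosis_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE diagnosis_group (diagnosis_group_key INTEGER,
                              diagnosis_key INTEGER, weight REAL);
CREATE TABLE billing_fact (diagnosis_group_key INTEGER, billed_amount REAL);
INSERT INTO diagnosis_dim VALUES (1, 'A'), (2, 'B');
-- Group 10 has two diagnoses with equal weights; group 20 has only one.
INSERT INTO diagnosis_group VALUES (10, 1, 0.5), (10, 2, 0.5), (20, 1, 1.0);
INSERT INTO billing_fact VALUES (10, 100.0), (20, 60.0);
""")

report = conn.execute("""
    SELECT d.name,
           SUM(f.billed_amount * g.weight) AS weighted,
           SUM(f.billed_amount)            AS unweighted
    FROM billing_fact f
    JOIN diagnosis_group g ON g.diagnosis_group_key = f.diagnosis_group_key
    JOIN diagnosis_dim d   ON d.diagnosis_key = g.diagnosis_key
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()

weighted_total = sum(row[1] for row in report)
unweighted_total = sum(row[2] for row in report)
print(report, weighted_total, unweighted_total)
```

The weighted column adds up to the 160.0 actually billed, while the unweighted column counts the 100.0 bill of group 10 once per diagnosis and so adds up to 260.0.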

[Figure 2.3: Handling Multivalued Diagnosis Dimension through bridge table [?]. The Patient Fact Table (Diagnosis Group Key, other attributes) joins to the Diagnosis Group Helper Table (Diagnosis Group Key, Diagnosis Key, Weight), which joins to the Diagnosis Dimension (Diagnosis Key, other attributes).]

2.8 Heterogeneous Dimension

Often in the real world a business provides heterogeneous services or products. For example, a retail bank offers a variety of products, such as mortgages and checking accounts, to the same customer. These products have attributes and facts that are specific to them, and also general attributes and facts that are common to all of them. In this case, business users typically require two different perspectives that are difficult to present in a single fact table. The first perspective is the global view, including the ability to slice and dice all general facts simultaneously, regardless of product type. The second perspective is the line-of-business view, which focuses on the in-depth details of one business, such as checking or mortgages [?].

There is a long list of attributes and facts specific to each line of business. We cannot add all these special facts to one fact table; if we did so for each line of business, we would end up with several hundred facts, most of which would be null in any given row. Likewise, if we attempted to include the line-of-business-specific attributes in a single dimension table, it would have hundreds of attributes, almost all of which would be empty for any given row.

The solution to this dilemma is to create a custom schema for the checking line of business that is limited to just checking accounts. Both the custom checking fact table and the corresponding product dimension are widened to describe all the specific facts and attributes that make sense only for checking products [?].

These custom tables also contain the core attributes and facts, so that we can avoid joining the core and custom schemas in order to get the complete set of facts and attributes. The keys of the custom product dimension are the same as those used in the core product dimension, which contains all possible product keys [?]. As conformed dimensions are essential, each custom product dimension is a subset of the rows of the core product dimension table.

A family of core and custom fact tables is needed when a business has heterogeneous products that have naturally different facts and descriptors but a single customer base that demands an integrated view.

We can consider the handling of the line-of-business-specific attributes as a context-dependent outrigger to the core dimension. We can isolate the core attributes in the base product dimension table, and include a snowflake key in each base record that points to its proper custom dimension outrigger [?].

If the lines of business are kept separate, so that the custom tables cannot reside in the same space as the core tables, the data in the core fact table needs to be duplicated exactly once to implement all the custom tables. Otherwise, we can avoid duplicating both the core fact keys and the core facts in the custom line-of-business fact tables [?].

[Figure 2.4: Handling Heterogeneous Dimensions [?]. A Fact Table (Dimension 1 Key, Dimension 2 Key, Date Key, more foreign keys, core facts, custom facts) joins to General Dimension 1 and General Dimension 2 (key, core attributes), each of which has a specific line-of-business dimension holding its custom attributes.]

2.9 Dimension Role Playing

A role in a data warehouse is a situation in which a single dimension appears several times in the same fact table. In certain kinds of fact tables, Date can appear repeatedly. For example, a typical fact table can include Order Date, Packaging Date, Shipping Date, Delivery Date, Payment Date, Return Date, and Refer to Collection Date, alongside its facts [?].

We cannot join these seven foreign keys to the same table. SQL would interpret such a seven-way simultaneous join as requiring that all of the dates be the same. Instead of a seven-way join, we have to create the illusion of seven independent Date dimension tables. We even need to go to the length of labeling all of the columns in each of the tables uniquely. If we don't label the columns uniquely, we will not be able to tell the columns apart when several of them have been dragged into a report [?].

For the user, we can create the illusion of seven independent time tables in a couple of ways. We can either make seven identical physical copies of the time table, or create seven virtual copies of it with the SQL SYNONYM command. Regardless of the approach, once we have made these seven clones, we still have to define a SQL view on each copy in order to make the field names unique [?].

Now that we have seven differently described Time dimensions, they can be used as if they were independent. They can have completely unrelated constraints, and they can play different roles in a report.
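A reduced sketch of role playing with two roles instead of seven. It defines one view per role directly over a single physical date table, which is one possible way to obtain uniquely labeled columns; the table, view, and column names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE order_fact (order_date_key INTEGER, ship_date_key INTEGER,
                         amount REAL);
INSERT INTO date_dim VALUES (1, '2000-01-01'), (2, '2000-01-02'),
                            (3, '2000-01-03');
INSERT INTO order_fact VALUES (1, 2, 10.0), (1, 3, 20.0), (2, 3, 30.0);
-- One view per role, each with uniquely labeled columns.
CREATE VIEW order_date AS
    SELECT date_key AS order_date_key, full_date AS order_date FROM date_dim;
CREATE VIEW ship_date AS
    SELECT date_key AS ship_date_key, full_date AS ship_date FROM date_dim;
""")

# Each role is constrained independently of the other.
role_rows = conn.execute("""
    SELECT o.order_date, s.ship_date, f.amount
    FROM order_fact f
    JOIN order_date o ON o.order_date_key = f.order_date_key
    JOIN ship_date s  ON s.ship_date_key  = f.ship_date_key
    WHERE o.order_date = '2000-01-01' AND s.ship_date = '2000-01-03'
""").fetchall()
print(role_rows)
```

Joining both foreign keys to the one `date_dim` table under a single name would instead force the order date and the ship date to be equal, which is the problem the separate roles avoid.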

2.10 Conformed Dimensions

A conformed dimension is a dimension that means the same thing with every possible fact table to which it can be joined. Generally this means that a conformed dimension is identical in each data mart. A major responsibility of the central data warehouse design team is to establish, publish, maintain, and enforce the conformed dimensions.

Conformed dimensions are enormously important to the data warehouse. Without strict adherence to conformed dimensions, the data warehouse cannot function as an integrated whole. Conformed dimensions make it possible for a single dimension table to be used against multiple fact tables in the same database space, and provide consistent user interfaces and data content whenever the dimension is used, and a consistent interpretation of attributes and, therefore, of roll-ups across data marts [?].

It is possible to create a subset of a conformed dimension table for certain data marts if you know that the domain of the associated fact table only contains that subset. For example, the master Product table can be restricted to just those products manufactured at a particular location if the data mart in question pertains only to that location. We could call this a simple data subset, because the reduced dimension table preserves all the attributes of the original dimension and exists at the original granularity [?].


Chapter 3

Querying on Data Warehouse

3.1 Oracle 9i SQL extension for Aggregation Queries in data warehouse

In this section, all example queries are run on the Sales History schema in Figure 1.1. All examples and theory are taken from [?].

Aggregation is a fundamental part of data warehousing. To improve aggregation

performance in your warehouse, Oracle provides the following extensions to the GROUP

BY clause to make query reporting faster and easier:

ROLLUP Extension ROLLUP calculates aggregate functions such as SUM, COUNT,
MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed
level up to a grand total, following the grouping list specified in the ROLLUP clause.
It is very helpful for subtotaling along a hierarchical dimension such as time or
geography.

CUBE Extension CUBE is an extension similar to ROLLUP that enables a single statement
to calculate all possible combinations of aggregations. CUBE takes a specified set of
grouping columns and creates subtotals for all of their possible combinations, so it can
generate the information needed in cross-tabulation reports with a single query. In terms
of multidimensional analysis, CUBE generates all the subtotals that could be calculated
for a data cube with the specified dimensions. CUBE is typically most suitable in queries
that use columns from multiple dimensions rather than columns representing different
levels of a single dimension. Multiple SELECT statements combined with UNION ALL
could provide the same information gathered through CUBE or ROLLUP, but this might
require many SELECT statements. The more columns used in a CUBE or ROLLUP clause,
the greater the savings compared to the UNION ALL approach.

GROUPING Functions The GROUPING functions help you identify the group each


row belongs to, and they enable sorting of subtotal rows and filtering of results. GROUPING
distinguishes NULL values created by CUBE or ROLLUP from stored NULL values. It also
lets a program determine the level of aggregation of a given subtotal. GROUPING returns 1
when it encounters a NULL value created by a ROLLUP or CUBE operation; that is, if the
NULL indicates that the row is a subtotal, GROUPING returns 1. Any other type of value,
including a stored NULL, returns 0.

GROUPING_ID Function GROUPING_ID returns a single number that enables you to
determine the exact GROUP BY level. For each row, GROUPING_ID takes the
set of 1s and 0s that would be generated if you used the appropriate GROUPING
functions and concatenates them, forming a bit vector. The bit vector is treated as a
binary number, and the number's base-10 value is returned by the GROUPING_ID
function.

GROUPING SETS Expression Computing a full cube creates a heavy processing
load, so replacing cubes with grouping sets can significantly increase performance. You
can selectively specify the set of groups that you want to create using a GROUPING SETS
expression within a GROUP BY clause. This allows precise specification across multiple
dimensions without computing the whole CUBE. CUBE and ROLLUP can be thought of
as grouping sets with very specific semantics.
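The way GROUPING flags combine into a GROUPING_ID value can be sketched outside the database. The following Python fragment is only an illustration of the bit-vector arithmetic described above, not Oracle's implementation; the helper names grouping_flags and grouping_id are hypothetical, and the column names are those of the Sales History examples.

```python
def grouping_flags(group_by_columns, grouping_set):
    """Return one GROUPING-style flag per column: 1 if the column is
    aggregated away at this level (its value is a generated NULL), else 0."""
    return [0 if col in grouping_set else 1 for col in group_by_columns]


def grouping_id(flags):
    """Concatenate the flags into a bit vector and read it as a binary
    number; the leftmost column is the most significant bit."""
    value = 0
    for flag in flags:
        value = (value << 1) | flag
    return value


columns = ["channel_desc", "country_id"]
# The four aggregation levels produced by CUBE(channel_desc, country_id).
levels = [
    ("channel_desc", "country_id"),  # detail rows
    ("channel_desc",),               # subtotal per channel
    ("country_id",),                 # subtotal per country
    (),                              # grand total
]
for level in levels:
    flags = grouping_flags(columns, level)
    print(level, flags, grouping_id(flags))
```

Each aggregation level receives a distinct number, which is what makes it possible to filter or sort subtotal rows by level with a single expression.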

3.1.1 Syntax

Extension        Syntax
ROLLUP           SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
PARTIAL ROLLUP   GROUP BY expr1, ROLLUP(expr2, expr3)
CUBE             SELECT ... GROUP BY CUBE(grouping_column_reference_list)
PARTIAL CUBE     GROUP BY expr1, CUBE(expr2, expr3)
GROUPING         SELECT ... [GROUPING(dimension_column) ...] ...
                 GROUP BY ... {CUBE | ROLLUP}(dimension_column)
GROUPING SETS    GROUP BY [GROUPING SETS(dimension_column ...)]

3.1.2 Applications in building Cross-Tabular Report

These extensions are used to generate cross-tabular reports easily and efficiently.

For example, consider the cross-tabular report in figure 3.1, which shows the total sales by
country_id and channel_desc for the US and UK through the Internet and Direct Sales
channels in September 2004. The report needs four subtotals and one grand total in addition
to the four detail values.

Figure 3.1: Cross Tabular Report (rows: Channel, with Direct Sales, Internet and Total; columns: Country, with UK, US and Total; grand total 420000)

Five of the nine values needed for this report would not be calculated by a query that
requested SUM(amount_sold) and did a GROUP BY (channel_desc, country_id). To get the
higher-level aggregates we would require additional queries. However, we can generate all
these subtotals and the grand total with a single query by using the CUBE extension in its
GROUP BY clause.

The query is:

SELECT channel_desc, country_id, SUM(amount_sold)
FROM sales, customers, times, channels
WHERE sales.time_id = times.time_id AND
      sales.cust_id = customers.cust_id AND
      sales.channel_id = channels.channel_id AND
      channels.channel_desc IN ('Direct Sales', 'Internet') AND
      times.calendar_month_desc = '2004-09' AND
      country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, country_id);

The result of this query is shown in the table below.


We can also generate the above table using the GROUPING SETS extension. With a
GROUPING SETS expression, we have to explicitly specify the levels of aggregation we
wish to perform. The query is:

SELECT channel_desc, country_id, SUM(amount_sold)
FROM sales, customers, times, channels
WHERE sales.time_id = times.time_id AND
      sales.cust_id = customers.cust_id AND
      sales.channel_id = channels.channel_id AND
      channels.channel_desc IN ('Direct Sales', 'Internet') AND
      times.calendar_month_desc = '2004-09' AND
      country_id IN ('UK', 'US')
GROUP BY GROUPING SETS((channel_desc, country_id), (channel_desc), (country_id), ());
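To make concrete what the four grouping sets in this query compute, the aggregation can be imitated in a few lines of Python. This is a rough sketch of the semantics only: the detail rows are hypothetical illustrative figures (not data from the Sales History schema), the function name aggregate is my own, and None stands in for the NULL that the database places in a rolled-up column.

```python
from collections import defaultdict

# Hypothetical detail rows: (channel_desc, country_id, amount_sold).
rows = [
    ("Direct Sales", "UK", 10),
    ("Direct Sales", "US", 20),
    ("Internet", "UK", 5),
    ("Internet", "US", 7),
    ("Internet", "US", 3),
]
COLUMNS = ("channel_desc", "country_id")


def aggregate(rows, grouping_sets):
    """Sum amount_sold once per grouping set; None marks a rolled-up column."""
    totals = defaultdict(int)
    for channel, country, amount in rows:
        values = {"channel_desc": channel, "country_id": country}
        for gset in grouping_sets:
            key = tuple(values[c] if c in gset else None for c in COLUMNS)
            totals[key] += amount
    return dict(totals)


# The same four grouping sets as in the query above.
grouping_sets = [("channel_desc", "country_id"), ("channel_desc",), ("country_id",), ()]
result = aggregate(rows, grouping_sets)
for key, total in result.items():
    print(key, total)
```

The sketch yields four detail values, four subtotals and one grand total in a single pass over the rows, which mirrors the shape of the cross-tabular report. It cannot tell a generated None from a stored one; in SQL that distinction is what the GROUPING function provides.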

Both CUBE and ROLLUP can be thought of as GROUPING SETS with very specific
semantics.

CUBE(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c),
(a), (b), (c), ())

ROLLUP(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a), ())
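How CUBE and ROLLUP expand into grouping sets can be sketched mechanically. The Python functions below are a small illustration under my own naming (cube and rollup are hypothetical helpers, not part of any Oracle API): CUBE enumerates every subset of its columns, while ROLLUP enumerates only the prefixes of the column list.

```python
from itertools import combinations


def cube(*cols):
    """All 2^n subsets of the columns, largest first, column order preserved."""
    return [combo
            for size in range(len(cols), -1, -1)
            for combo in combinations(cols, size)]


def rollup(*cols):
    """The n+1 prefixes of the column list, from full detail down to ()."""
    return [cols[:size] for size in range(len(cols), -1, -1)]


print(cube("a", "b", "c"))
print(rollup("a", "b", "c"))
```

For n columns, CUBE therefore produces 2^n grouping sets and ROLLUP produces n + 1, which is why a full cube becomes expensive as columns are added.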

3.2 Writing MDX queries for Data Warehouse

MDX stands for Multidimensional Expressions. It is a syntax that supports the definition
and manipulation of multidimensional objects and data. MDX is similar in many ways
to the Structured Query Language (SQL) syntax, but is not an extension of the SQL
language; in fact, some of the functionality that is supplied by MDX can be supplied,
although not as efficiently or intuitively, by SQL.

As with an SQL query, an MDX query is built from a SELECT clause, a FROM clause
and an optional WHERE clause. These and other keywords provide the tools used to
extract specific portions of data from a cube (multidimensional structure) for analysis.
MDX also supplies a robust set of functions for the manipulation of retrieved data, as
well as the ability to extend MDX with user-defined functions.


Figure 3.2: Multidimensional Structure : Cube [?]

3.2.1 Common Terms

Cube Cube is a multidimensional structure that contains dimensions and measures. Di-

mensions define the structure of the cube, while measures provide the numerical

values of interest to the end user. Cell positions in the cube are defined by the in-

tersection of dimension members, and the measure values are aggregated to provide

the values in the cells [?].

Member A member is the lowest level of reference when describing cell data in a Cube. A
member is an item in a dimension representing one or more occurrences of data.
Members are combined to form Tuples and Tuples are combined to form Sets. These
Sets are used in the SELECT clause of an MDX query for retrieving data from a Cube [?].

Tuples A Tuple is used to define a slice of data from a Cube; it is composed of an ordered


collection of one Member from one or more dimensions. A Tuple is used to identify

specific sections of multidimensional data from a cube; a tuple composed of one

member from each dimension in a cube completely describes a cell value [?].

Sets A Set is an ordered collection of zero, one or more Tuples. A Set is most commonly
used to define Axis and Slicer dimensions in an MDX query, and as such may have
only a single Tuple or may be, in certain cases, empty. In MDX syntax, tuples are
enclosed in braces to construct a set [?].

Axis and Slicer Dimensions The SELECT clause is used to select the Dimensions
and Members to be returned, referred to as Axis dimensions. The WHERE clause
is used to restrict the returned data to specific Dimension and Member criteria,
referred to as a slicer dimension. An axis dimension is expected to return data for
multiple members, while a slicer dimension is expected to return data for a single
member [?].

3.2.2 Rules

Rules for specifying Members

1. By specifying the actual name or the alias, for example [Packages]. If the member
name starts with a number or contains spaces, it should be enclosed in square brackets.

2. By specifying the dimension name or any one of the ancestor member names as a prefix
to the member name, for example [Measures].[Packages]. (The Measures dimension is
associated with all the facts.)

3. By specifying the name of a calculated member defined in the WITH section [?].

Rules for specifying Tuples

1. A tuple consists of one or more members.

2. If a tuple is composed of members from more than one dimension, the members
represented by the tuple must be enclosed in parentheses, for example
(Time.[2nd half], Route.nonground.air).

3. If a tuple consists of only one member, the parentheses can be omitted [?].


Rules for specifying Sets

1. A set consists of one or more tuples enclosed in braces, except in some cases where
the set is represented by an MDX function that returns a set [?]. For example,
{ (Time.[1st half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) }.

2. A set can contain more than one occurrence of the same tuple, for example
{ Time.[2nd half], Time.[2nd half] }.

3. When a set has more than one tuple, the members of each tuple must represent the
same dimensions as the members of the other tuples of the set. Additionally, the
dimensions must be represented in the same order. In other words, each tuple of the
set must have the same dimensionality [?]. For example,
{ (Time.[1st half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) }.

4. A set can also be a collection of sets, and it can also be empty (containing no tuples)
[?].
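The dimensionality rule for sets can be illustrated with a small check. The Python sketch below is my own modelling of the rule, not an MDX engine: a member is represented as a hypothetical (dimension, member path) pair, a tuple as a Python tuple of such pairs, and a set as a list of tuples.

```python
def dimensionality(tup):
    """Ordered dimension names of a tuple, e.g. ('Time', 'Route')."""
    return tuple(dimension for dimension, _member in tup)


def is_valid_set(tuples):
    """An empty set is valid; otherwise every tuple must have the same
    dimensions, in the same order, as the first tuple."""
    if not tuples:
        return True
    first = dimensionality(tuples[0])
    return all(dimensionality(t) == first for t in tuples)


FIRST_HALF_AIR = (("Time", "1st half"), ("Route", "nonground.air"))
SECOND_HALF_SEA = (("Time", "2nd half"), ("Route", "nonground.sea"))
SWAPPED_ORDER = (("Route", "nonground.air"), ("Time", "1st half"))

print(is_valid_set([FIRST_HALF_AIR, SECOND_HALF_SEA]))  # same dimensionality
print(is_valid_set([FIRST_HALF_AIR, SWAPPED_ORDER]))    # same dimensions, different order
```

Repeating a tuple does not affect validity, which matches rule 2, and an empty list passes, which matches rule 4.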

3.2.3 MDX Query Structure

A basic Multidimensional Expressions (MDX) query is structured in a fashion similar to
the following example [?]:

SELECT [<axis_specification> [, <axis_specification> ...]]
FROM [<cube_specification>]
WHERE [<slicer_specification>]

In MDX, the SELECT statement is used to specify a dataset containing a subset of
multidimensional data. To specify a dataset, an MDX query must contain information
about:

1. The number of axes. You can specify up to 128 axes in an MDX query.

2. The members from each dimension to include on each axis of the MDX query.

3. The name of the cube that sets the context of the MDX query.

4. The members from a slicer dimension on which data is sliced for members from the
axis dimensions [?].


3.2.4 Specifying Axis Dimensions

Axis dimensions determine the layout of query results from a database. Multidimensional
Expressions (MDX) uses the SELECT clause to specify axis dimensions by assigning a set
to a particular axis. Each axis specification in the SELECT clause defines one axis
dimension, so the number of axes in the dataset is equal to the number of axis
specifications in the MDX query. An MDX query can support up to 128 specified axes,
but very few MDX queries will use more than 5 axes [?].

The breakdown of the syntax is:

<axis_specification> ::= <set> ON <axis_name>
<axis_name> ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS | AXIS(<index>)

Each axis dimension is associated with a number: 0 for the x-axis, 1 for the y-axis, 2
for the z-axis, and so on. The index value is the axis number. For the first 5 axes, the
aliases COLUMNS, ROWS, PAGES, SECTIONS, and CHAPTERS can be used in place
of AXIS(0), AXIS(1), AXIS(2), AXIS(3), and AXIS(4), respectively [?].

An MDX query cannot skip axes. That is, a query that includes one or more axis
specifications must not exclude lower-numbered or intermediate axes. For example, a
query cannot have a ROWS axis without a COLUMNS axis, or have COLUMNS and
PAGES axes without a ROWS axis [?].
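The alias mapping and the rule against skipping axes can be expressed as a short check. The Python sketch below only models the rules stated above; the function names axis_number and axes_are_contiguous are hypothetical, and a real MDX parser would of course do far more.

```python
AXIS_ALIASES = {"COLUMNS": 0, "ROWS": 1, "PAGES": 2, "SECTIONS": 3, "CHAPTERS": 4}
MAX_AXES = 128


def axis_number(name):
    """Map an alias such as ROWS, or an explicit AXIS(n), to its axis number."""
    upper = name.strip().upper()
    if upper in AXIS_ALIASES:
        return AXIS_ALIASES[upper]
    if upper.startswith("AXIS(") and upper.endswith(")"):
        return int(upper[5:-1])
    raise ValueError(f"not an axis name: {name!r}")


def axes_are_contiguous(axis_names):
    """True if the axes are exactly 0, 1, ..., n-1 with no gaps or repeats."""
    numbers = sorted(axis_number(name) for name in axis_names)
    return len(numbers) <= MAX_AXES and numbers == list(range(len(numbers)))


print(axes_are_contiguous(["COLUMNS", "ROWS"]))   # both lower axes present
print(axes_are_contiguous(["ROWS"]))              # COLUMNS is missing
print(axes_are_contiguous(["COLUMNS", "PAGES"]))  # ROWS is skipped
```

Because aliases and AXIS(n) resolve to the same numbers, a query written with ON AXIS(0) and ON AXIS(1) passes the same check as one written with ON COLUMNS and ON ROWS.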

3.2.5 Establishing Cube Context

To establish cube context, indicate the cube on which you want the Multidimensional

Expressions (MDX) query to run. The FROM clause in an MDX query determines the

cube context. The following syntax indicates which cube supplies the context for the

MDX query [?] :

FROM <cube_specification>

The cube specification is completed with the name of a single cube.

For example, if an MDX query is to be run against the SalesCube cube, the FROM

clause would be:

FROM SalesCube

3.2.6 Specifying Slicer Dimensions

Slicer dimensions are used optionally in the WHERE clause of the query to limit the query
to a specific area of the database. Dimensions that are not explicitly assigned to an axis
are assumed to be slicer dimensions and filter with their default members. The default
member is usually the All member if an (All) level exists, or else an arbitrary member of
the highest level. The breakdown of the WHERE clause syntax is [?]:

WHERE [<slicer_specification>]

A slicer dimension can accept only expressions that evaluate to a single tuple. This
does not mean that only a single tuple can be explicitly stated in the slicer dimension.
For example:

WHERE ( [Time].[1st half], [Route].[nonground] )

If the slicer specification cannot be resolved into a single tuple, an error will occur [?].

3.2.7 Example Queries

For example, in the cube shown in figure 3.2, if we want to calculate the total Unit Sales
and total Store Sales for all USA CA stores in the years 1997 and 1998 for a sales schema,
we would give the following query:

SELECT
  { [Measures].[Unit Sales], [Measures].[Store Sales] }
  ON COLUMNS,
  { [Time].[1997], [Time].[1998] }
  ON ROWS
FROM Sales
WHERE ( [Store].[USA].[CA] )

This query will return the result shown in the following table:

        Unit Sales   Store Sales
1997         75000        100000
1998        140000        200000

We can also rewrite the above query as:

SELECT
  { [Measures].[Unit Sales], [Measures].[Store Sales] }
  ON AXIS(0),
  { [Time].[1997], [Time].[1998] }
  ON AXIS(1)
FROM Sales
WHERE ( [Store].[USA].[CA] )
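To show how the axes and the slicer together shape the result, the example query can be imitated over a tiny in-memory cube. This Python sketch is an illustration under simplifying assumptions: cell values are stored already aggregated per (measure, year, store path), the [USA].[CA] figures are the ones from the result table above, the [USA].[WA] figures are hypothetical filler, and run_query is my own helper rather than any MDX API.

```python
# Cell values keyed by (measure, year, store path).
CELLS = {
    ("Unit Sales", "1997", ("USA", "CA")): 75000,
    ("Store Sales", "1997", ("USA", "CA")): 100000,
    ("Unit Sales", "1998", ("USA", "CA")): 140000,
    ("Store Sales", "1998", ("USA", "CA")): 200000,
    # Hypothetical cells for another state, which the slicer should exclude.
    ("Unit Sales", "1997", ("USA", "WA")): 11111,
    ("Store Sales", "1997", ("USA", "WA")): 22222,
}


def run_query(cells, columns, rows, slicer):
    """Build one grid row per ROWS member and one grid column per COLUMNS
    member, reading only the cells that match the slicer member."""
    return [[cells[(measure, year, slicer)] for measure in columns]
            for year in rows]


years = ["1997", "1998"]
grid = run_query(
    CELLS,
    columns=["Unit Sales", "Store Sales"],  # ON COLUMNS, i.e. AXIS(0)
    rows=years,                             # ON ROWS, i.e. AXIS(1)
    slicer=("USA", "CA"),                   # WHERE ( [Store].[USA].[CA] )
)
for year, values in zip(years, grid):
    print(year, values)
```

The axis sets decide the shape of the grid, while the slicer fixes the one remaining dimension to a single member, so every cell in the grid is identified by a complete tuple.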


3.2.8 Differences between MDX and SQL

The main differences between MDX and SQL are:

1. The principal difference between SQL and MDX is the ability of MDX to reference
multiple dimensions. SQL refers to only two dimensions, columns and rows, when
processing queries. Because SQL was designed to handle only two-dimensional tabular
data, the terms column and row have meaning in SQL syntax. MDX, in comparison,
can process one, two, three, or more dimensions in queries. Because multiple
dimensions can be used in MDX, each dimension is referred to as an axis [?].

2. In SQL, the SELECT clause is used to define the column layout for a query, while

the WHERE clause is used to define the row layout. However, in MDX the SELECT

clause can be used to define several axis dimensions, while the WHERE clause is

used to restrict multidimensional data to a specific dimension or member [?].

3. In SQL, the WHERE clause is used to filter the data returned by a query. In
MDX, the WHERE clause is used to provide a slice of the data returned by a query.
While the two concepts are similar, they are not equivalent. The SQL query uses the
WHERE clause to contain an arbitrary list of items that should (or should not) be
returned in the result set. While a long list of conditions in the filter can narrow
the scope of the data that is retrieved, there is no requirement that the elements
in the clause will produce a clear and concise subset of data. In MDX, however,
the concept of a slice means that each member in the WHERE clause identifies a
distinct portion of data from a different dimension. Because of the organizational
structure of multidimensional data, it is not possible to request a slice for multiple
members of the same dimension. Because of this, the WHERE clause in MDX can
provide a clear and concise subset of data [?].


Chapter 4

Conclusion

Design of the data warehouse greatly influences the quality of the analysis that is possible

with data in it. If invalid or corrupt data is allowed to get into the data warehouse, the

analysis done with this data is likely to be invalid. So, while designing the data
warehouse, special attention should be given to the issues discussed here, such as slowly
changing dimensions, rapidly changing dimensions and multivalued dimensions.

Dimensional modeling should be used for designing a data warehouse instead of ER
modeling, because the main focus in a data warehouse is not on removing redundancy
from dimensions but on queries that are simple to understand and easy to write.

After the rapid acceptance of data warehousing systems during the past three years, there
will continue to be many more enhancements and adjustments to the data warehousing
system model. Further evolution of hardware and software technology will also
continue to greatly influence the capabilities that are built into data warehouses.


Bibliography

[1] Basic MDX. World Wide Web, http://www.msdn.microsoft.com/library.

[2] Essbase Analytic Services Database Administrator's Guide. World Wide Web,
http://dev.hyperion.com/techdocs/essbase/essbase 71/Docs/dbag/frameset.htm.

[3] Ralph Kimball. Dimensional Modelling Manifesto. World Wide Web,
http://www.dbmsmag.com.

[4] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. Second edition, 2004.

[5] Paul Lane. Oracle 9i Data Warehousing Guide. Release 1 (9.0.1) edition, 2001.

[6] Michael J. Corey, Michael Abbey, Ian Abramson and Ben Taub. Oracle 8 Data
Warehousing. Oracle Press edition, 1998.

[7] Silberschatz, Korth and Sudarshan. Database System Concepts. Fourth edition, 2002.