7/31/2019 Data Warehousing Design Considerations
Data Warehouse Design Considerations
M. Tech. Course Seminar Report
Submitted in partial fulfillment of the requirements
for the degree of
Master of Technology
by
Abhishek Sugandhi
Roll No: 04305016
under the guidance of
Prof. N.L.Sarda
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai
Acknowledgment
I would like to thank my seminar guide, Prof. N. L. Sarda, for his valuable guidance
and encouragement, without which it would not have been possible for me to complete my work.
Abstract
A data warehouse is a complex information system used primarily in the decision-making
process by means of On-Line Analytical Processing (OLAP) applications. Over the last few
years, data warehouses have received a lot of attention from both the industrial and the
research community. The reason lies in their great importance: making predictions about the
(near) future has always been desirable for business companies. In chapter 1, I will
discuss the basics of the data warehouse and its modeling techniques.
Decision support places some rather different requirements on database technology
compared to traditional on-line transaction processing. Data warehouses are usually
modeled using dimensional modeling, for better understandability and easy extensibility.
As data warehouses store huge amounts of both current and historical data, special
attention should be given to changing dimensions, time and date dimensions, and hierarchical
dimensions while modeling a data warehouse. In chapter 2, I focus on handling these
issues while modeling the data warehouse.
Software vendors have quickly developed products and services for improving the
efficiency of querying data warehouses. In chapter 3, I will discuss the querying features
provided by Oracle 9i for improving the efficiency of aggregate queries, and the querying
features provided by MDX. MDX stands for Multidimensional Expressions. It is a language
used to manipulate multidimensional information in Microsoft SQL Server 2000
Analysis Services.
Contents
1 Introduction
1.1 What is Data Warehouse
1.2 Warehouse Schemas
1.3 Dimensional Model Vs. ER Model
2 Data Warehouse Design Issues
2.1 How to model time and date dimension
2.2 Dimension normalization
2.3 Surrogate keys
2.4 Slowly Changing Dimensions
2.4.1 Type 1: Overwrite the Value
2.4.2 Type 2: Add a new Dimension Row
2.4.3 Type 3: Add a new Dimension Column
2.5 Rapidly Changing Dimension
2.6 Handling Hierarchies
2.6.1 Fixed Depth Hierarchy
2.6.2 Variable Depth Hierarchy
2.7 Multivalued Dimension
2.8 Heterogeneous Dimension
2.9 Dimension Role Playing
2.10 Conformed Dimensions
3 Querying on Data Warehouse
3.1 Oracle 9i SQL extension for Aggregation Queries in data warehouse
3.1.1 Syntax
3.1.2 Applications in building Cross-Tabular Report
3.2 Writing MDX queries for Data Warehouse
3.2.1 Common Terms
3.2.2 Rules
3.2.3 MDX Query Structure
3.2.4 Specifying Axis Dimensions
3.2.5 Establishing Cube Context
3.2.6 Specifying Slicer Dimensions
3.2.7 Example Queries
3.2.8 Difference of MDX with SQL
4 Conclusion
[Figure: a central Sales fact table (customer_id, time_id, channel_id, promotion_id,
quantity sold, cost, amount) linked to the Customer, Channels, Promotion, and Time
dimension tables, each keyed by its id column and carrying other attributes.]
Figure 1.1: Star Schema [?]
requirements, dimension attributes are usually short identifiers that are foreign keys
into other tables called dimension tables. Usually, a fact table is associated with many
dimension tables and contains a foreign key for each of them. The fact table is
kept highly normalized to reduce space requirements, whereas dimension tables are highly
denormalized to ease browsing among the different attributes of a dimension and to enable
us to write simple and easily understandable queries.
The resulting schema, with a fact table, multiple dimension tables, and foreign keys
from the fact table to the dimension tables, is called a star schema. If we normalize the
dimension tables, so that a dimension table contains foreign keys to other dimension tables,
the resulting schema has multiple levels of dimension tables and is called a
snowflake schema. Some complex data warehouses may have more than one
fact table [?].
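As a rough illustration of the star schema described above, the following sketch builds a
tiny version of the Sales schema of Figure 1.1 in an in-memory SQLite database and runs one
aggregate query over it. The table names, column names other than the keys shown in the
figure, and all data values are assumptions made up for this example.

```python
import sqlite3


def build_star_schema():
    """Create a tiny star schema in memory (names and data are illustrative only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customer  (customer_id  INTEGER PRIMARY KEY, name TEXT, city TEXT);
        CREATE TABLE channels  (channel_id   INTEGER PRIMARY KEY, channel_desc TEXT);
        CREATE TABLE promotion (promotion_id INTEGER PRIMARY KEY, promo_name TEXT);
        CREATE TABLE times     (time_id      INTEGER PRIMARY KEY, calendar_year INTEGER);
        -- Fact table: one foreign key per dimension plus the numeric measures.
        CREATE TABLE sales (
            customer_id   INTEGER REFERENCES customer(customer_id),
            time_id       INTEGER REFERENCES times(time_id),
            channel_id    INTEGER REFERENCES channels(channel_id),
            promotion_id  INTEGER REFERENCES promotion(promotion_id),
            quantity_sold INTEGER,
            amount        REAL
        );
    """)
    conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                     [(1, "Asha", "Mumbai"), (2, "Ravi", "Pune")])
    conn.executemany("INSERT INTO channels VALUES (?, ?)",
                     [(1, "Store"), (2, "Internet")])
    conn.executemany("INSERT INTO promotion VALUES (?, ?)", [(1, "None")])
    conn.executemany("INSERT INTO times VALUES (?, ?)", [(1, 2004), (2, 2005)])
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, 1, 2, 20.0),
        (2, 1, 2, 1, 1, 15.0),
        (1, 2, 2, 1, 3, 45.0),
    ])
    return conn


def amount_by_channel_and_year(conn):
    """Join the fact table to two of its dimensions and aggregate the amount."""
    return conn.execute("""
        SELECT ch.channel_desc, t.calendar_year, SUM(s.amount)
        FROM sales s
        JOIN channels ch ON ch.channel_id = s.channel_id
        JOIN times t     ON t.time_id = s.time_id
        GROUP BY ch.channel_desc, t.calendar_year
        ORDER BY ch.channel_desc, t.calendar_year
    """).fetchall()
```

Every query follows the same pattern: the fact table is joined directly to whichever
dimensions are constrained or grouped on, with no join paths between the dimensions themselves.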
1.3 Dimensional Model Vs. ER Model
The main difference between the dimensional model and the ER model is that dimension
tables in the dimensional model are denormalized, whereas the corresponding tables in an ER
model are highly normalized. The ER design technique seeks to remove redundancy in the data
by normalizing the relations, so that less disk space is wasted and there are no insert
or update anomalies. In data warehouses, however, if the dimension tables are normalized
into typical snowflake structures, two bad things happen. First, the data
model becomes too complex to be presented to the user. Second, linking the elements
among the various branches of the snowflake compromises browsing performance. Even
when a long text string appears redundantly in the dimension table and could be moved
to an outrigger table (a table formed by normalization), you won't save enough
disk space to justify moving it, because most of the disk space is consumed
by the fact table (which is highly normalized) [?].
In many cases, normalization can actually increase the storage requirements. If the
cardinality of the repeated dimension data element is high (in other words, there are just
a few duplications), the outrigger table may be nearly as big as the main dimension table,
and we have introduced another key structure that is now repeated in both tables [?].
Another argument given for normalizing the dimensions is to improve insert or update
performance. This is rarely important in a decision-support environment. You update
the dimension tables only once per night (typically), and the processing associated with
loading perhaps millions of fact records dominates the really minor processing associated
with inserting or updating dimension records. A dimensional database design has a fixed
structure that has no alternative join paths. This greatly simplifies the optimization and
evaluation of queries on these schemas [?].
The fact table in a dimensional model represents a many-to-many relationship among the
dimension tables. We can convert an ER model into a dimensional model in the presence of
such a many-to-many relationship, and such a relationship is always present in data warehouses.
Chapter 2
Data Warehouse Design Issues
2.1 How to model time and date dimension
The date dimension is one of the most important dimensions in a data warehouse. It is
guaranteed to be present in every data mart because virtually every data mart is a time
series. Instead of keeping the date as an attribute in the fact table or in another
dimension table, we should build a separate dimension table, because it allows analysts to
query the data warehouse on special attributes such as holidays or major events. SQL date
functions do not support filtering by these attributes, so if the business process needs to
slice the data by such nonstandard date attributes, an explicit dimension table
is essential. Calendar logic belongs in the dimension table rather than in application
code [?]. Unlike most other dimensions, the date dimension can be built in advance. Every
day is represented as a row in the date dimension, so keeping 10 years of history needs
only about 3,650 rows. The date dimension key should be an integer rather than a date data
type; this is explained further in the surrogate keys section.
If we want access to the time of the transaction for day-part analysis, then instead of
keeping time-of-day attributes such as hour and minute as fields in the Date dimension
table, we should handle them through a separate Time Of Day dimension joined to the fact
table [?]. This saves a good amount of space: instead of keeping 24 * 60 = 1440 rows per day
in the Date dimension table to describe every minute (3650 * 1440 = 5,256,000 rows for 10
years, which is very large for a dimension table), we build a single Time Of Day table
that contains only 1440 rows in total. The Date dimension and the Time Of Day dimension
are completely independent.
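The row counts above can be checked with a small sketch that generates both dimensions in
advance. The attribute names (date_key, is_weekend, is_holiday, and so on) are hypothetical;
a real date dimension would carry many more calendar attributes, and the holiday flag would
be filled from a business calendar rather than left false.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def build_date_dimension(start, end):
    """One row per calendar day between start and end (inclusive)."""
    rows = []
    key = 1  # integer surrogate key, not a date value
    day = start
    while day <= end:
        rows.append({
            "date_key": key,
            "full_date": day.isoformat(),
            "day_of_week": DAY_NAMES[day.weekday()],
            "month": day.month,
            "year": day.year,
            "is_weekend": day.weekday() >= 5,
            "is_holiday": False,  # placeholder: would come from a business calendar
        })
        key += 1
        day += timedelta(days=1)
    return rows


def build_time_of_day_dimension():
    """One row per minute of a day: 24 * 60 = 1440 rows in total."""
    return [{"time_key": hour * 60 + minute + 1, "hour": hour, "minute": minute}
            for hour in range(24) for minute in range(60)]
```

Ten calendar years give a few thousand date rows (the exact count depends on leap years),
while the time-of-day table stays at 1440 rows regardless of how much history is kept.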
2.2 Dimension normalization
Dimension table normalization is usually referred to as snowflaking. Redundant attributes
are removed from the flat, denormalized dimension table and placed in normalized
secondary dimension tables. We should generally avoid snowflaking for the following
reasons [?]:
1. Snowflaked tables make for a much more complex presentation.
2. Numerous tables and joins usually translate into slower query performance.
3. The minor space savings associated with snowflaked dimension tables are insignificant,
as dimension tables are generally much smaller than the fact table, which consumes
most of the space.
4. It slows down the user's ability to browse within a dimension.
5. Finally, snowflaking defeats the use of bitmap indexes, which are very useful
when indexing low-cardinality fields in our dimension tables.
But there are times when snowflaking is permissible, such as when a clump of correlated
attributes is used repeatedly in various independent roles. For example, in a promotion
dimension we would need to store both a promotion begin date and a promotion end date
attribute [?]. As another example, when we have to store a multivalued attribute, we need
a bridge table.
2.3 Surrogate keys
Surrogate keys are integers that are assigned sequentially as needed to populate a
dimension. They merely serve to join the dimension tables to the fact table. Surrogate
keys are beneficial for the following reasons [?]:
1. We should avoid operational codes or other smart keys as data warehouse keys,
because operational codes are normally recycled after some period, say one year,
whereas the data warehouse retains data for years. One of the primary benefits of surrogate
keys is that they buffer the data warehouse environment from operational changes.
If we rely on operational codes, we are also vulnerable to key overlap problems.
2. Smaller surrogate keys translate into smaller fact tables, smaller fact table
indexes, and more fact table rows per I/O operation.
3. Surrogate keys can be used to record dimension conditions that have no operational
code, for example when our dimensional model has dates that are yet to be
determined. There is no SQL date value for such a date, but with surrogate keys
we can simply keep one more row in the date dimension table, with its own
unique key, to identify the YET TO BE DETERMINED condition and so avoid a null date
dimension key in the fact table.
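The points above can be sketched as a small key-assignment helper. The class name, the
reserved key value 0 for the "yet to be determined" row, and the string dates are all
assumptions chosen for illustration, not part of any particular product.

```python
class SurrogateKeyMap:
    """Assigns sequential integer keys to operational (natural) keys as they appear."""

    def __init__(self, first_key=1):
        self._next_key = first_key
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key, or hand out the next integer.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = self._next_key
            self._next_key += 1
        return self._by_natural_key[natural_key]


# Hypothetical reserved key for the extra dimension row that stands for
# "date yet to be determined", so the fact table never stores a null key.
TO_BE_DETERMINED_KEY = 0


def date_key_for(operational_date, date_keys):
    """Return the surrogate date key, falling back to the reserved row."""
    if operational_date is None:
        return TO_BE_DETERMINED_KEY
    return date_keys.key_for(operational_date)
```

Because the warehouse key is independent of the operational code, a recycled or reformatted
operational code only affects the mapping step, not the fact rows already loaded.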
2.4 Slowly Changing Dimensions
While dimension attributes are relatively static, they are not fixed forever. Dimension
attributes change, albeit rather slowly, over time. Accurate tracking of change is necessary
so that business users can see the impact of each and every dimension change. When we
need to keep track of changes, it is unacceptable to put everything in the fact table and
make every dimension time-dependent. Instead, we can take advantage of
the fact that most dimensions are nearly constant over time. We can preserve the independent
dimensional structure with only relatively minor adjustments to contend with the changes. We
refer to these nearly constant dimensions as slowly changing dimensions [?].
For each attribute in a dimension table, we must specify a strategy to handle change.
There are three basic techniques for dealing with attribute changes [?]:
1. Overwrite the value
2. Add a Dimension Row
3. Add a Dimension Column
For example, suppose that manufacturing operations makes a slight change in the packaging
of SKU 38 (the unique product number given by the organization), and the packaging description
changes from glued box to pasted box. Along with this change, manufacturing
operations decides not to change the SKU number of the product or the bar code (UPC) that is
printed on the box. Let us see how this changing dimension is handled by each of
the above methods.
2.4.1 Type 1: Overwrite the Value
With Type 1, we merely overwrite the old attribute value in the dimension row,
replacing it with the current value. In doing so, the attribute always reflects the most
recent assignment. The Type 1 response is the simplest and fastest to implement, but it does
not maintain any history of prior attribute values [?].
Nevertheless, overwriting is frequently used when the data warehouse team legitimately
decides that the old value of the changed dimension attribute is not interesting [?]. In the
above example, the original row

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 glued box    ABC922

will be updated as

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 pasted box   ABC922
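A minimal sketch of the Type 1 response, using an in-memory SQLite table whose name and
column names are assumed for illustration: the change is a plain UPDATE, after which no
trace of the old packaging value remains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key
        description TEXT,
        packaging   TEXT,
        sku_no      TEXT
    );
    INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
""")


def apply_type1_change(conn, sku_no, new_packaging):
    """Type 1: overwrite the attribute in place; the prior value is lost."""
    conn.execute("UPDATE product_dim SET packaging = ? WHERE sku_no = ?",
                 (new_packaging, sku_no))


apply_type1_change(conn, "ABC922", "pasted box")
rows_after_change = conn.execute(
    "SELECT product_key, packaging FROM product_dim").fetchall()
```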
2.4.2 Type 2: Add a new Dimension Row
The second technique is the most common and has a number of powerful advantages. If the
data warehouse team decides to track the change of an attribute, it issues another record
(row in the dimension table) with the changed value of the attribute. Apart from the new
surrogate key, the only difference between the records is in the changed attribute; even
the operational codes are the same.
This technique for tracking slowly changing dimensions is very powerful because new
dimension records automatically partition history in the fact table. The old version of
the dimension record points to all history in the fact table prior to the change. The new
version of the dimension record points to all history after the change [?]. There is no need
for a time-stamp in the product table to record the change; it is best recorded by the
fact table records that carry the key of the newly added record [?].
Another advantage of this technique is that you can gracefully track as many changes
to a dimensional item as you wish. Each change generates a new dimension record, and
each record partitions history perfectly. The main drawbacks of the technique are the
requirement to generalize the dimension key, and the growth of the dimension table itself
[?].
Using the Type 2 technique for the previous example, we would have two product dimension
rows (the original and the updated one):

Product Key   Product Description   Packaging    SKU No.
12345         Scent                 glued box    ABC922
34567         Scent                 pasted box   ABC922
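The partitioning of fact history can be seen in a small sketch. The table names, column
names, and sales amounts are assumptions for illustration; the keys 12345 and 34567 are
those of the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
        description TEXT,
        packaging   TEXT,
        sku_no      TEXT                   -- operational code, shared by versions
    );
    CREATE TABLE sales_fact (product_key INTEGER, amount REAL);
    INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', 'ABC922');
    INSERT INTO sales_fact VALUES (12345, 100.0);   -- sold before the change
""")


def apply_type2_change(conn, old_key, new_key, new_packaging):
    """Type 2: copy the old row under a new surrogate key with the new value."""
    conn.execute(
        """INSERT INTO product_dim
           SELECT ?, description, ?, sku_no
           FROM product_dim WHERE product_key = ?""",
        (new_key, new_packaging, old_key))


apply_type2_change(conn, 12345, 34567, "pasted box")
conn.execute("INSERT INTO sales_fact VALUES (34567, 40.0)")  # sold after the change

dimension_rows = conn.execute(
    "SELECT product_key, packaging, sku_no FROM product_dim "
    "ORDER BY product_key").fetchall()
sales_by_packaging = conn.execute("""
    SELECT p.packaging, SUM(f.amount)
    FROM sales_fact f JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.packaging ORDER BY p.packaging
""").fetchall()
```

Grouping by the SKU number instead of the packaging would still combine both versions,
since the operational code is unchanged.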
2.4.3 Type 3: Add a new Dimension Column
While the Type 2 response partitions history, it does not allow us to associate the new
attribute value with old fact history, or vice versa. However, we sometimes want the ability
to see fact data as if the change never occurred. We can address this requirement not by
creating a new dimension record, as in the Type 2 technique, but by adding a new field for
the current value while keeping the prior value in its own field. The Type 3 technique
allows us to see new and historical fact data by either the new or the prior attribute
value [?].
Using the Type 3 technique for the previous example, we would update the original row as

Product Key   Product Description   Current Packaging   Previous Packaging   SKU No.
12345         Scent                 pasted box          glued box            ABC922
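A minimal sketch of the Type 3 response, again with assumed table and column names: the
single row is kept, the old value is moved into the "previous" column, and the new value
is written into the "current" column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_dim (
        product_key        INTEGER PRIMARY KEY,
        description        TEXT,
        current_packaging  TEXT,
        previous_packaging TEXT,
        sku_no             TEXT
    );
    INSERT INTO product_dim VALUES (12345, 'Scent', 'glued box', NULL, 'ABC922');
""")


def apply_type3_change(conn, product_key, new_packaging):
    """Type 3: keep one row; shift the old value into the 'previous' column.

    In a SQL UPDATE the right-hand sides are evaluated against the row as it
    was before the update, so previous_packaging receives the old value.
    """
    conn.execute(
        """UPDATE product_dim
           SET previous_packaging = current_packaging,
               current_packaging  = ?
           WHERE product_key = ?""",
        (new_packaging, product_key))


apply_type3_change(conn, 12345, "pasted box")
row_after_change = conn.execute(
    "SELECT product_key, current_packaging, previous_packaging "
    "FROM product_dim").fetchone()
```

All fact rows, old and new, still join to the same product key, so they can be grouped by
either column.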
2.5 Rapidly Changing Dimension
Normally, we will not use any of the techniques mentioned previously for handling changing
dimensions if the dimension already contains millions of rows. Unfortunately, huge
dimensions are also more likely to change than moderately sized dimensions. We sometimes
call this situation a rapidly changing monster dimension [?].
The solution to such a problem is to break off the frequently analyzed or frequently
changing attributes into a separate dimension, referred to as a minidimension [?]. There is
one row in the minidimension for every unique combination of these attribute
values encountered in the data (not one row per customer). We leave behind the
more constant or less frequently queried attributes in the original huge customer table. When
creating the minidimension, continuously valued attributes should be converted to banded
ranges. In other words, we force the attributes in the minidimension to take on a relatively
small number of discrete values [?]. Although this restricts us to the predefined bands, it
drastically reduces the number of combinations in the minidimension.
Every time we build a fact table row, we include two foreign keys related to the
dimension: the regular dimension key and the minidimension key. The minidimension key
should be part of the fact table's set of foreign keys to provide efficient access to the
fact table.
This design delivers browsing and constraining performance benefits by providing a
smaller point of entry to the fact table, and we can avoid joins to the huge dimension
table if the (static) attributes from that table are not constrained. When the minidimension
key participates as a foreign key in the fact table, another benefit is that the fact table
serves to capture the minidimension's attribute changes. When an attribute of a customer
changes, we simply load the new minidimension key into subsequent fact rows, while earlier
fact rows still carry the old minidimension key. Thus we keep track of history as
well [?].
[Figure: the original Customer Dimension (Customer Key, Customer ID, Customer Name, ...,
Age, Gender, Annual Income) becomes a Customer Dimension (Customer Key, Customer ID,
Customer Name, ...) plus a Customer Minidimension (Customer Minidimension Key, Customer
Age Band, Customer Gender, Customer Income Band). The Fact Table carries Customer Key,
Customer Minidimension Key, more foreign keys, and facts.]
Figure 2.1: Example of Handling Rapidly Changing large Dimension [?]
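The banding step and the one-row-per-combination rule can be sketched as follows. The band
boundaries, band labels, and class name are hypothetical choices made for this example.

```python
def age_band(age):
    """Map a continuous age to a hypothetical predefined band."""
    if age < 20:
        return "under 20"
    if age < 30:
        return "20-29"
    if age < 40:
        return "30-39"
    return "40 and over"


def income_band(annual_income):
    """Map a continuous income to a hypothetical predefined band."""
    if annual_income < 20000:
        return "low"
    if annual_income < 60000:
        return "medium"
    return "high"


class MiniDimension:
    """Holds one row per distinct (age band, gender, income band) combination."""

    def __init__(self):
        self._key_by_combination = {}

    def key_for(self, age, gender, annual_income):
        combination = (age_band(age), gender, income_band(annual_income))
        # A new combination gets the next sequential key; an existing one is reused.
        return self._key_by_combination.setdefault(
            combination, len(self._key_by_combination) + 1)

    def rows(self):
        return [(key,) + combination
                for combination, key in self._key_by_combination.items()]
```

A fact row would then store both the customer key and the minidimension key returned by
`key_for` at load time, so a customer who moves from one age band to the next simply starts
receiving a different minidimension key in later fact rows.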
2.6 Handling Hierarchies
In many dimensions, a hierarchy is inherent. We will take two approaches to handling
hierarchies. The first is straightforward and handles the hierarchy adequately with a
simplistic approach. The second is much more advanced and complicated, but also much
more extensible.
2.6.1 Fixed Depth Hierarchy
This situation is rare, but sometimes we are confronted with a dimension whose hierarchy is
highly predictable, with a fixed number of levels (say N). In this case, we can keep N
attributes in the dimension, corresponding to these N levels [?]. If some records in the
dimension table do not have a hierarchy up to the maximum number of levels, we duplicate
the lower-level attributes into the higher-level attributes. In this way, we can report at
any level of the hierarchy for every record of that dimension.
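One possible reading of the padding rule is sketched below: a record lists its known levels
from the lowest (most detailed) upward, and any missing higher levels are filled by
repeating the highest level that is present. The function name and the sample values are
assumptions for illustration.

```python
def flatten_hierarchy(levels, depth):
    """Pad a record's hierarchy to exactly `depth` level attributes.

    `levels` lists the known values from the lowest level upward. Missing
    higher levels are filled by repeating the highest known level.
    """
    if not levels:
        raise ValueError("at least one level is required")
    if len(levels) > depth:
        raise ValueError("record has more levels than the fixed depth")
    return list(levels) + [levels[-1]] * (depth - len(levels))
```

With every record padded to N attributes, a GROUP BY on any level column returns a value
for each record, including those with a shallower hierarchy.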
2.6.2 Variable Depth Hierarchy
Representing an arbitrary variable-depth hierarchy is an inherently difficult task in a
relational environment. A simple computer science approach to storing such information
would add a Parent Key field to the Customer dimension. The Parent Key field would be
a recursive pointer that contains the key value of the parent of any given
customer. A special null value would be required for the topmost customer in any given
overall enterprise [?].
The problem with this recursive pointer approach is that it cannot be used effectively
with standard SQL. The standard SQL GROUP BY clause cannot follow the recursive pointers
downward to aggregate an additive fact in the fact table [?]. Instead of using a
recursive pointer, we can solve this modeling problem by inserting a bridge table (helper
table) between the dimension table and the fact table. The bridge table contains one record
for each separate path from each node in the organization tree to itself and to every node
below it. Each path row contains the key of the parent roll-up entity, the key of the
subsidiary entity, the number of levels between the parent and the subsidiary, a bottom-most
flag indicating that there are no further nodes beneath the subsidiary, and finally a
top-most flag indicating that there are no further nodes above the parent [?].
Now, if we want to descend the hierarchy, we join the dimension table to the bridge table
by connecting the dimension's primary key with the bridge table's parent key, and join the
bridge table's subsidiary key to the fact table. We can then constrain the dimension to any
particular member (for example, one customer) and request an aggregate measure of all
members at or below it. We can use the number-of-levels attribute to control the depth of
analysis. Similarly, when we want to ascend the hierarchy, we reverse the join by connecting
the dimension key with the bridge table's subsidiary key [?].
When a group of nodes is moved from one part of an organization to another, only the
bridge table rows that refer to paths from outside parents into the moved structure need to
be altered. Rows referring to paths within the moved structure are not affected. We
need to add rows if the moved structure has a new parent.
When issuing a SQL statement that uses the bridge table, we need to be cautious about
overcounting the facts. When connecting the tables, we must constrain the customer
dimension to a single value and then join to the bridge table [?].
[Figure: the Customer Dimension (Customer Key, Customer ID, Customer Name) is joined to the
Hierarchy Bridge (Parent Customer Key, Subsidiary Customer Key, Level Name, Bottom Flag,
Top Flag), which is joined to the Fact table (Customer Key, Date Key).]
Figure 2.2: Handling hierarchy through bridge table [?]
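The bridge table described in this section can be derived from the recursive parent
pointers. The sketch below does so in plain Python and then rolls up a fact for a single
constrained parent; the function names, field names, and the four-node sample tree in the
usage note are assumptions for illustration.

```python
def build_bridge(parent_of):
    """Build bridge rows from a mapping of node key to parent key (None at the top).

    One row is produced for every path from a node to itself and from each
    of its ancestors down to it.
    """
    nodes_with_children = {p for p in parent_of.values() if p is not None}
    rows = []
    for node in parent_of:
        ancestor, levels = node, 0
        while ancestor is not None:
            rows.append({
                "parent_key": ancestor,
                "subsidiary_key": node,
                "levels": levels,                               # distance parent to subsidiary
                "bottom_flag": node not in nodes_with_children, # nothing beneath subsidiary
                "top_flag": parent_of[ancestor] is None,        # nothing above parent
            })
            ancestor = parent_of[ancestor]
            levels += 1
    return rows


def total_at_or_below(bridge, facts, parent_key, max_levels=None):
    """Sum the fact amounts of one parent and everything beneath it.

    Constraining to a single parent means each subsidiary matches one bridge
    row only, which avoids overcounting. `facts` is a list of (key, amount).
    """
    subsidiaries = {
        row["subsidiary_key"] for row in bridge
        if row["parent_key"] == parent_key
        and (max_levels is None or row["levels"] <= max_levels)
    }
    return sum(amount for key, amount in facts if key in subsidiaries)
```

For a tree in which customer 1 is the top, customers 2 and 3 report to 1, and customer 4
reports to 2, `build_bridge({1: None, 2: 1, 3: 1, 4: 2})` yields eight path rows, and
passing `max_levels=1` limits the roll-up to a parent and its direct subsidiaries.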
2.7 Multivalued Dimension
There are situations where we need to attach a multivalued dimension table to the fact
table, for example when many customers are associated with one account, or when
multiple diagnoses are associated with a single patient. Database designers usually take
one of the following approaches to handling multivalued dimension attributes [?]:
1. Choose one value and omit the other values.
2. Extend the dimension list to have a fixed number of multivalued dimensions.
3. Put a bridge (helper) table between the fact table and the multivalued dimension
table.
Frequently, designers choose a single value (the first approach). If we take this approach,
the modeling problem goes away, but it remains doubtful whether the multivalued
dimension data is still useful, because the omitted values are lost.
The second approach, creating a fixed number of additional multivalued dimension
slots in the fact table key, is also not a good idea, as there can be situations where
the number of values exceeds the slots we have allocated. Also, we cannot
easily query across the multiple separate dimension slots [?].
A bridge table placed between the multivalued dimension and the fact table is the best
solution. The multivalued dimension key in the fact table is changed to a multivalued
dimension group key. The helper table in the middle is the multivalued dimension
group table. It has one record for each member of a group (for example, one record per
diagnosis in a diagnosis group) [?].
The group table is joined to the original multivalued dimension
on the multivalued dimension key. The group table contains a
very important numeric attribute: the weighting factor. The weighting factor allows
reports to be created that don't double count additive facts, such as the Billed Amount in
the patient example.
We can assign the weighting factors equally within a group.
If there are three members in a group, then each gets a weighting factor of 1/3. If we
have some other rational basis for assigning the weighting factors differently, then we can
change the factors, as long as all the factors in a group (a Diagnosis Group, in the patient
example) always add up to one.
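The effect of the weighting factor can be checked with a small sketch of the patient
example. The group keys, diagnosis labels, and billed amounts are made up for illustration;
exact fractions are used so that the weights add up to exactly one.

```python
from fractions import Fraction

# Bridge (helper) table rows: (diagnosis_group_key, diagnosis_key, weight).
# Group 1 has three diagnoses weighted equally; group 2 has a single one.
diagnosis_group = [
    (1, "A", Fraction(1, 3)),
    (1, "B", Fraction(1, 3)),
    (1, "C", Fraction(1, 3)),
    (2, "A", Fraction(1, 1)),
]

# Fact rows: (diagnosis_group_key, billed_amount).
billing_facts = [(1, 300), (2, 50)]


def weights_add_up_to_one(bridge_rows):
    """Check the rule that the weights within every group sum to one."""
    totals = {}
    for group_key, _diagnosis_key, weight in bridge_rows:
        totals[group_key] = totals.get(group_key, 0) + weight
    return all(total == 1 for total in totals.values())


def weighted_billed_by_diagnosis(facts, bridge_rows):
    """Allocate each billed amount across the diagnoses of its group."""
    totals = {}
    for group_key, amount in facts:
        for bridge_group, diagnosis_key, weight in bridge_rows:
            if bridge_group == group_key:
                totals[diagnosis_key] = totals.get(diagnosis_key, 0) + amount * weight
    return totals


billed_by_diagnosis = weighted_billed_by_diagnosis(billing_facts, diagnosis_group)
```

Summing the weighted amounts over all diagnoses gives back the total billed amount, which
would not hold if each fact row were counted once per diagnosis without the weight.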
[Figure: the Patient Fact Table (Diagnosis Group Key, other attributes) joins to the
Diagnosis Group helper table (Diagnosis Group Key, Diagnosis Key, Weight), which joins to
the Diagnosis Dimension table (Diagnosis Key, other attributes).]
Figure 2.3: Handling Multivalued Diagnosis Dimension through bridge table [?]
2.8 Heterogeneous Dimension
Often, in the real world, a business provides heterogeneous services or products. For
example, a retail bank offers a variety of products, such as mortgages
or checking accounts, to the same customer. These products have attributes and facts
specific to each of them, as well as general attributes and facts that are common among
them. In this case, business users typically require two different perspectives that are
difficult to present in a single fact table. The first perspective is the global view,
including the ability to slice and dice all general facts simultaneously, regardless of
product type. The second perspective is the specific line-of-business view, which focuses on
the in-depth details of one business, such as checking or mortgages [?].
There is a long list of attributes and facts specific to each line of business. We cannot
add these special facts to one fact table; if we did so for each line of business, we would
end up with several hundred facts, most of which would be null in any specific row.
Likewise, if we attempted to include line-of-business-specific attributes in a single
dimension table, we would have hundreds of attributes, almost all of which would be empty
for any given row.
The solution to this dilemma is to create a custom schema for the checking line of
business that is limited to just checking accounts. Both the custom checking fact
table and the corresponding product dimension are widened to describe all the specific
facts and attributes that make sense only for checking products [?].
These custom tables also contain the core attributes and facts, so that we can avoid
joining the core and custom schemas in order to get the complete set of facts and attributes.
The keys of the custom product dimension are the same as those used in the core product
dimension, which contains all possible product keys [?]. As conformed dimensions are
essential, each custom product dimension is a subset of the rows of the core product
dimension table.
A family of core and custom fact tables is needed when a business has heterogeneous
products that have naturally different facts and descriptors, but a single customer base
that demands an integrated view.
We can also consider handling the specific line-of-business attributes as a
context-dependent outrigger to the core dimension. We isolate the core attributes in the
base product dimension table and include a snowflake key in each base record that
points to its proper custom dimension outrigger [?].
If the lines of business are physically separate, the custom tables cannot reside in the
same space as the core tables; in this case, the data in the core fact table needs to be
duplicated exactly once to implement all the custom tables. Otherwise, we can avoid
duplicating both the core fact keys and the core facts in the custom line-of-business fact
tables [?].
[Figure: a Fact Table (Dimension 1 Key, Dimension 2 Key, Date Key, more foreign keys, core
facts, custom facts) joined to General Dimension 1 and General Dimension 2 (key, core
attributes) and to the specific line-of-business Dimension 1 and Dimension 2 tables
(Specific Dimension 1 Key, Specific Dimension 2 Key, custom attributes).]
Figure 2.4: Handling Heterogeneous Dimensions [?]
2.9 Dimension Role Playing
A role in a data warehouse is a situation in which a single dimension appears several times
in the same fact table. In certain kinds of fact tables, Date can appear repeatedly. For
example, a typical fact table can include Order Date, Packaging Date, Shipping Date,
Delivery Date, Payment Date, Return Date, and Refer to Collection Date, along with other
facts [?].
We cannot join these seven foreign keys to the same table. SQL would interpret such
a seven-way simultaneous join as requiring that all of the dates be the same. Instead of a
seven-way join, we have to create the illusion of seven independent Date dimension tables.
We even need to go to the length of labeling all of the columns in each of the tables
uniquely. If we don't, we will not be able to tell the
columns apart if several of them have been dragged into a report [?].
For the user, we can create the illusion of seven independent Date tables in a couple
of ways. We can either make seven identical physical copies of the Date table, or we
can create seven virtual copies of it with the SQL SYNONYM command.
Regardless of the approach, once we have made these seven clones, we still have to define
a SQL view on each copy in order to make the field names uniquely different [?].
Now that we have seven differently described Date dimensions, they can be used as
if they were independent. They can have completely unrelated constraints, and they can
play different roles in a report.
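The following sketch shows two of the seven roles using SQLite, which has no SYNONYM
command, so views over the single physical date table stand in for both the clones and the
renaming step. The table names, view names, column names, and data are assumptions for
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT, month_name TEXT);
    CREATE TABLE order_fact (order_date_key INTEGER, ship_date_key INTEGER, amount REAL);

    -- One view per role, each with uniquely labeled columns.
    CREATE VIEW order_date_dim AS
        SELECT date_key AS order_date_key, full_date AS order_date,
               month_name AS order_month
        FROM date_dim;
    CREATE VIEW ship_date_dim AS
        SELECT date_key AS ship_date_key, full_date AS ship_date,
               month_name AS ship_month
        FROM date_dim;

    INSERT INTO date_dim VALUES (1, '2005-01-30', 'January');
    INSERT INTO date_dim VALUES (2, '2005-02-02', 'February');
    INSERT INTO order_fact VALUES (1, 2, 100.0);  -- ordered in January, shipped in February
    INSERT INTO order_fact VALUES (1, 1, 50.0);   -- ordered and shipped in January
""")

# Each role joins through its own view, so the two dates are constrained
# and reported independently of each other.
amount_by_order_and_ship_month = conn.execute("""
    SELECT od.order_month, sd.ship_month, SUM(f.amount)
    FROM order_fact f
    JOIN order_date_dim od ON od.order_date_key = f.order_date_key
    JOIN ship_date_dim  sd ON sd.ship_date_key  = f.ship_date_key
    GROUP BY od.order_month, sd.ship_month
    ORDER BY sd.ship_month
""").fetchall()
```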
2.10 Conformed Dimensions
A conformed dimension is a dimension that means the same thing with every possible
fact table to which it can be joined. Generally this means that a conformed dimension is
identical in each data mart. A major responsibility of the central data warehouse designteam is to establish, publish, maintain, and enforce the conformed dimensions.
Conformed dimensions are enormously important to the data warehouse. Without
a strict adherence to conformed dimensions, the data warehouse cannot function as an
integrated whole. Conformed dimensions make possible a single dimension table to be
used against multiple fact tables in the same database space, consistent user interfaces and
data content whenever the dimension is used, and a consistent interpretation of attributes
and, therefore, roll ups across data marts [?].
It is possible to create a subset of a conformed dimension table for certain data marts
if you know that the domain of the associated fact table only contains that subset. Forexample, the master Product table can be restricted to just those products manufactured
at a particular location if the data mart in question pertains only to that location. We
could call this a simple data subset, because the reduced dimension table preserves all
the attributes of the original dimension and exists at the original granularity [?].
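A simple data subset of this kind can be sketched as a row-restricted view. The example below is a minimal illustration in Python with SQLite; the product table, its columns, and the plant names are hypothetical and serve only to show that the subset keeps every attribute and the original grain.

```python
import sqlite3

# Hypothetical master Product dimension.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (product_key INTEGER PRIMARY KEY, name TEXT, plant TEXT)"
)
conn.executemany(
    "INSERT INTO product VALUES (?, ?, ?)",
    [(1, "Widget", "Mumbai"), (2, "Gadget", "Pune"), (3, "Sprocket", "Mumbai")],
)

# The data mart for one location sees only its own products, but with
# exactly the same columns and one row per product, as in the master table.
conn.execute(
    "CREATE VIEW product_mumbai AS SELECT * FROM product WHERE plant = 'Mumbai'"
)


def columns(name):
    """Return the column names of a table or view."""
    return [column[0] for column in conn.execute(f"SELECT * FROM {name}").description]


master_columns = columns("product")
subset_columns = columns("product_mumbai")
subset_rows = conn.execute(
    "SELECT product_key, name FROM product_mumbai ORDER BY product_key"
).fetchall()
```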
Chapter 3
Querying on Data Warehouse
3.1 Oracle 9i SQL Extensions for Aggregation Queries
in Data Warehouses
In this section, all example queries are run against the Sales History schema shown in
Figure 1.1. All examples and theory are taken from [?].
Aggregation is a fundamental part of data warehousing. To improve aggregation
performance in your warehouse, Oracle provides the following extensions to the GROUP
BY clause to make query reporting faster and easier:
ROLLUP Extension ROLLUP calculates aggregate functions such as SUM, COUNT,
MAX, MIN, and AVG at increasing levels of aggregation, from the most detailed
level up to a grand total. It is very helpful for subtotaling along a hierarchical
dimension such as time or geography. It creates subtotals that roll up from the most
detailed level to a grand total, following the grouping list specified in the ROLLUP clause.
CUBE Extension CUBE is an extension similar to ROLLUP, enabling a single statement
to calculate all possible combinations of aggregations. CUBE can generate the
information needed in cross-tabulation reports with a single query. CUBE is typically
most suitable in queries that use columns from multiple dimensions rather than
columns representing different levels of a single dimension. CUBE takes a specified
set of grouping columns and creates subtotals for all of their possible combinations.
In terms of multidimensional analysis, CUBE generates all the subtotals that could
be calculated for a data cube with the specified dimensions. Multiple SELECT
statements combined with UNION ALL could provide the same information
gathered through CUBE or ROLLUP. However, this might require many
SELECT statements. The more columns used in a CUBE or ROLLUP clause, the
greater the savings compared to the UNION ALL approach.
GROUPING Functions The GROUPING functions help you identify the group each
row belongs to and enable sorting subtotal rows and filtering results. GROUPING
serves two purposes. First, it distinguishes NULL values created by CUBE or ROLLUP
from stored NULL values. Second, it lets a program determine the level of
aggregation of a given subtotal. GROUPING returns 1 when it encounters a NULL
value created by a ROLLUP or CUBE operation. That is, if the NULL indicates
that the row is a subtotal, GROUPING returns 1. Any other type of value, including
a stored NULL, returns 0.
GROUPING_ID Function GROUPING_ID returns a single number that enables you to
determine the exact GROUP BY level. For each row, GROUPING_ID takes the
set of 1s and 0s that would be generated if you used the appropriate GROUPING
functions and concatenates them, forming a bit vector. The bit vector is treated as a
binary number, and the number's base-10 value is returned by the GROUPING_ID
function.
GROUPING SETS Expression Computing a full cube creates a heavy processing
load, so replacing cubes with grouping sets can significantly increase performance. You
can selectively specify the set of groups that you want to create using a GROUPING
SETS expression within a GROUP BY clause. This allows precise specification
across multiple dimensions without computing the whole CUBE. CUBE and
ROLLUP can be thought of as grouping sets with very specific semantics.
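The bit-vector computation behind GROUPING_ID can be illustrated with a short sketch. It is written in Python rather than SQL and only mimics the arithmetic described above; the helper names grouping_flags and grouping_id are hypothetical, and the column names are borrowed from the Sales History examples in this chapter.

```python
def grouping_flags(grouping_set, columns):
    """Return one GROUPING flag per column: 1 if the column was aggregated
    away in this grouping set (it would show a generated NULL), else 0."""
    return [0 if column in grouping_set else 1 for column in columns]


def grouping_id(flags):
    """Concatenate the flags into a bit vector (first column is the most
    significant bit) and return its base-10 value."""
    value = 0
    for flag in flags:
        value = (value << 1) | flag
    return value


COLUMNS = ("channel_desc", "country_id")

detail_id = grouping_id(grouping_flags(("channel_desc", "country_id"), COLUMNS))
channel_subtotal_id = grouping_id(grouping_flags(("channel_desc",), COLUMNS))
country_subtotal_id = grouping_id(grouping_flags(("country_id",), COLUMNS))
grand_total_id = grouping_id(grouping_flags((), COLUMNS))
```

With two grouping columns, the four possible levels map to the values 0 (detail row), 1 (subtotal per channel), 2 (subtotal per country), and 3 (grand total), so a report can sort or filter on a single number.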
3.1.1 Syntax
Extension         Syntax
ROLLUP            SELECT ... GROUP BY ROLLUP(grouping_column_reference_list)
PARTIAL ROLLUP    GROUP BY expr1, ROLLUP(expr2, expr3)
CUBE              SELECT ... GROUP BY CUBE(grouping_column_reference_list)
PARTIAL CUBE      GROUP BY expr1, CUBE(expr2, expr3)
GROUPING          SELECT ... [GROUPING(dimension_column) ...] ...
                  GROUP BY ... {CUBE | ROLLUP} (dimension_column)
GROUPING SETS     GROUP BY GROUPING SETS (dimension_column ...)
3.1.2 Applications in building Cross-Tabular Report
These extensions are used to generate cross-tabular reports easily and efficiently.

                       Country
Channel          UK        US        Total
Internet         75000     45000     120000
Direct Sales     100000    200000    300000
Total            175000    245000    420000
Figure 3.1: Cross Tabular Report

For example, consider the cross-tabular report in Figure 3.1, which shows the total sales by
country_id and channel_desc for the US and UK through the Internet and Direct Sales channels
in September 2004. The report needs four subtotals and one grand total. Five of the nine
values in this report (the subtotals and the grand total) would not be calculated by a
query that requested SUM(amount_sold) and did a GROUP BY (channel_desc, country_id). To
get the higher-level aggregates, we would require additional queries. With the CUBE
extension in the GROUP BY clause, however, a single query generates all of these subtotals
and the grand total.
The query is:
SELECT channel_desc, country_id, SUM(amount_sold)
FROM sales, customers, times, channels
WHERE sales.time_id = times.time_id AND
sales.cust_id = customers.cust_id AND
sales.channel_id = channels.channel_id AND
channels.channel_desc IN ('Direct Sales', 'Internet') AND
times.calendar_month_desc = '2004-09' AND
country_id IN ('UK', 'US')
GROUP BY CUBE(channel_desc, country_id);
The result of this query appears in the table shown below.
We can also generate the above table using the GROUPING SETS extension. With a GROUPING
SETS expression, we have to specify explicitly the levels of aggregation we wish to
compute. The query is:
SELECT channel_desc, country_id, SUM(amount_sold)
FROM sales, customers, times, channels
WHERE sales.time_id = times.time_id AND
sales.cust_id = customers.cust_id AND
sales.channel_id = channels.channel_id AND
channels.channel_desc IN ('Direct Sales', 'Internet') AND
times.calendar_month_desc = '2004-09' AND
country_id IN ('UK', 'US')
GROUP BY GROUPING SETS((channel_desc, country_id), (channel_desc), (country_id), ());
Both CUBE and ROLLUP can be thought of as GROUPING SETS with very specific
semantics.
CUBE(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a, c), (b, c),
(a), (b), (c), ())
ROLLUP(a, b, c) is equivalent to GROUPING SETS ((a, b, c), (a, b), (a), ())
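The two equivalences can be checked mechanically. The sketch below is a plain Python emulation, not Oracle code: rollup_sets and cube_sets expand a column list into grouping sets, and aggregate sums a measure once per grouping set. The function names and the four sample rows are hypothetical.

```python
from itertools import combinations


def rollup_sets(columns):
    """ROLLUP drops columns from the right: (a, b, c), (a, b), (a), ()."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """CUBE keeps every combination of the columns, largest first."""
    sets = []
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


def aggregate(rows, grouping_sets, measure):
    """Sum the measure for every grouping set.
    Keys are (grouping_set, values_of_the_grouping_columns)."""
    totals = {}
    for grouping_set in grouping_sets:
        for row in rows:
            key = (grouping_set, tuple(row[column] for column in grouping_set))
            totals[key] = totals.get(key, 0) + row[measure]
    return totals


# Hypothetical detail rows, one per (channel, country) cell.
rows = [
    {"channel_desc": "Direct Sales", "country_id": "UK", "amount_sold": 10},
    {"channel_desc": "Direct Sales", "country_id": "US", "amount_sold": 20},
    {"channel_desc": "Internet", "country_id": "UK", "amount_sold": 30},
    {"channel_desc": "Internet", "country_id": "US", "amount_sold": 40},
]
totals = aggregate(rows, cube_sets(["channel_desc", "country_id"]), "amount_sold")
```

For two grouping columns with two values each, the cube yields nine result rows: four detail cells, four subtotals, and one grand total, which is the shape of the cross-tabular report discussed above.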
3.2 Writing MDX queries for Data Warehouse
MDX stands for Multidimensional Expressions. It is a syntax that supports the definition
and manipulation of multidimensional objects and data. MDX is similar in many ways
to the Structured Query Language (SQL) syntax, but it is not an extension of the SQL
language; in fact, some of the functionality that is supplied by MDX can be supplied,
although not as efficiently or intuitively, by SQL.
As with an SQL query, each MDX query has a SELECT clause and a FROM
clause, and optionally a WHERE clause. These and other keywords provide the tools used to
extract specific portions of data from a cube (multidimensional structure) for analysis.
MDX also supplies a robust set of functions for the manipulation of retrieved data, as
well as the ability to extend MDX with user-defined functions.
Figure 3.2: Multidimensional Structure : Cube [?]
3.2.1 Common Terms
Cube A cube is a multidimensional structure that contains dimensions and measures.
Dimensions define the structure of the cube, while measures provide the numerical
values of interest to the end user. Cell positions in the cube are defined by the
intersection of dimension members, and the measure values are aggregated to provide
the values in the cells [?].
Member A member is the lowest level of reference when describing cell data in a cube. A
member is an item in a dimension representing one or more occurrences of data.
Members are combined to form tuples, and tuples are combined to form sets. These
sets are used in the SELECT clause of an MDX query to retrieve data from a cube [?].
Tuples A tuple is used to define a slice of data from a cube; it is composed of an ordered
collection of members, one from each of one or more dimensions. A tuple is used to identify
specific sections of multidimensional data from a cube; a tuple composed of one
member from each dimension in a cube completely describes a cell value [?].
Sets A set is an ordered collection of zero, one, or more tuples. A set is most commonly
used to define axis and slicer dimensions in an MDX query, and as such may have
only a single tuple or may be, in certain cases, empty. In MDX syntax, tuples are
enclosed in braces to construct a set [?].
Axis and Slicer Dimensions The SELECT clause is used to select the dimensions
and members to be returned, referred to as axis dimensions. The WHERE clause
is used to restrict the returned data to specific dimension and member criteria,
referred to as a slicer dimension. An axis dimension is expected to return data for
multiple members, while a slicer dimension is expected to return data for a single
member [?].
3.2.2 Rules
Rules for specifying Members
1. By specifying the actual name or the alias, for example, [Packages].
If the member name starts with a number or contains spaces, it should be enclosed in
square brackets.
2. By specifying the dimension name or any one of the ancestor member names as a prefix
to the member name, for example, [Measures].[Packages]. (The Measures dimension is
associated with all the facts.)
3. By specifying the name of a calculated member defined in the WITH section [?].
Rules for specifying Tuples
1. A tuple consists of one or more members.
2. If a tuple is composed of members from more than one dimension, the members
represented by the tuple must be enclosed in parentheses, for example, (Time.[2nd
half], Route.nonground.air).
3. If a tuple consists of only one member, we can omit the parentheses [?].
Rules for specifying Sets
1. A set consists of one or more tuples enclosed in braces, except in some cases where
the set is represented by an MDX function that returns a set [?]. For example,
{ (Time.[1st half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) }
[?].
2. A set can contain more than one occurrence of the same tuple, for example,
{ Time.[2nd half], Time.[2nd half] }.
3. When a set has more than one tuple, the members in each tuple of the set must
represent the same dimensions as the members of the other tuples of the set. Additionally,
the dimensions must be represented in the same order. In other words, each
tuple of the set must have the same dimensionality [?]. For example, { (Time.[1st
half], Route.nonground.air), (Time.[2nd half], Route.nonground.sea) } [?].
4. A set can also be a collection of sets, and it can also be empty (containing no tuples)
[?].
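The dimensionality rule for sets can be made concrete with a small check. The sketch below is written in Python and does not parse MDX; it assumes a hypothetical representation in which a tuple is a sequence of (dimension, member) pairs, and the helper names are illustrative only.

```python
def dimensionality(mdx_tuple):
    """Return the ordered dimensions that a tuple draws its members from."""
    return tuple(dimension for dimension, _member in mdx_tuple)


def is_valid_set(tuples):
    """A set is well formed when it is empty or when every tuple has the
    same dimensions in the same order. Duplicate tuples are allowed."""
    return len({dimensionality(t) for t in tuples}) <= 1


first_half_air = (("Time", "1st half"), ("Route", "air"))
second_half_sea = (("Time", "2nd half"), ("Route", "sea"))
# Same dimensions as second_half_sea, but listed in a different order.
sea_second_half = (("Route", "sea"), ("Time", "2nd half"))

matching = is_valid_set([first_half_air, second_half_sea])
reordered = is_valid_set([first_half_air, sea_second_half])
duplicated = is_valid_set([first_half_air, first_half_air])
empty = is_valid_set([])
```

Only the reordered case is rejected: its two tuples cover the same dimensions, but not in the same order, so they do not share one dimensionality.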
3.2.3 MDX Query Structure
A basic Multidimensional Expressions (MDX) query is structured in a fashion similar to
the following example [?] :
SELECT [axis specification], [axis specification], ...
FROM [cube specification]
WHERE [slicer specification]
In MDX, the SELECT statement is used to specify a dataset containing a subset of
multidimensional data. To specify a dataset, an MDX query must contain information
about:
1. The number of axes. You can specify up to 128 axes in an MDX query.
2. The members from each dimension to include on each axis of the MDX query.
3. The name of the cube that sets the context of the MDX query.
4. The members from a slicer dimension on which data is sliced for members from the
axis dimensions [?].
3.2.4 Specifying Axis Dimensions
Axis dimensions determine the layout of query results from a database. Multidimensional
Expressions (MDX) uses the SELECT clause to specify axis dimensions by assigning a set
to a particular axis. In the following syntax, each [axis specification] value
defines one axis dimension. The number of axes in the dataset is equal to the number of
[axis specification] values in the MDX query. An MDX
query can support up to 128 specified axes, but very few MDX queries will use more than
5 axes [?].
The breakdown of the syntax is:
[axis specification] ::= [set] ON [axis name]
[axis name] ::= COLUMNS | ROWS | PAGES | SECTIONS | CHAPTERS |
AXIS([index])
Each axis dimension is associated with a number: 0 for the x-axis, 1 for the y-axis, 2
for the z-axis, and so on. The [index] value is the axis number. For the first 5 axes, the
aliases COLUMNS, ROWS, PAGES, SECTIONS, and CHAPTERS can be used in place
of AXIS(0), AXIS(1), AXIS(2), AXIS(3), and AXIS(4), respectively [?].
An MDX query cannot skip axes. That is, a query that includes one or more [axis name]
values must not exclude lower-numbered or intermediate axes. For example, a query cannot
have a ROWS axis without a COLUMNS axis, or have COLUMNS and PAGES axes
without a ROWS axis [?].
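The axis numbering and the no-skipping rule can be expressed as a short check. This is a sketch in Python, not part of any MDX engine: axis_index and axes_are_valid are hypothetical helpers that map axis names to numbers and verify that the numbers used are exactly 0, 1, 2, and so on with no gaps.

```python
# Aliases for the first five axes, as listed in the text above.
AXIS_ALIASES = {"COLUMNS": 0, "ROWS": 1, "PAGES": 2, "SECTIONS": 3, "CHAPTERS": 4}
MAX_AXES = 128


def axis_index(axis_name):
    """Translate COLUMNS, ROWS, ... or AXIS(n) into the axis number."""
    name = axis_name.strip().upper()
    if name in AXIS_ALIASES:
        return AXIS_ALIASES[name]
    if name.startswith("AXIS(") and name.endswith(")"):
        return int(name[len("AXIS("):-1])
    raise ValueError(f"unknown axis name: {axis_name}")


def axes_are_valid(axis_names):
    """True when the axes are numbered 0..n-1 without gaps or repeats
    and there are at most MAX_AXES of them."""
    indexes = sorted(axis_index(name) for name in axis_names)
    return len(indexes) <= MAX_AXES and indexes == list(range(len(indexes)))


columns_and_rows = axes_are_valid(["COLUMNS", "ROWS"])
rows_only = axes_are_valid(["ROWS"])
columns_and_pages = axes_are_valid(["COLUMNS", "PAGES"])
numbered = axes_are_valid(["AXIS(0)", "AXIS(1)", "AXIS(2)"])
```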
3.2.5 Establishing Cube Context
To establish cube context, indicate the cube on which you want the Multidimensional
Expressions (MDX) query to run. The FROM clause in an MDX query determines the
cube context. The following syntax indicates which cube supplies the context for the
MDX query [?] :
FROM [cube specification]
The [cube specification] is the name of a single cube.
For example, if an MDX query is to be run against the SalesCube cube, the FROM
clause would be:
FROM SalesCube
3.2.6 Specifying Slicer Dimensions
Slicer dimensions are used optionally in the WHERE clause of the query to limit the query
to a specific area of the database. Dimensions that are not explicitly
assigned to an axis are assumed to be slicer dimensions and filter with their default
members. The default member is usually the All member if an (All) level exists, or else an
arbitrary member of the highest level.
The breakdown of the WHERE clause syntax is [?]:
WHERE [(slicer specification)]
A slicer dimension can accept only expressions that evaluate to a single tuple. This
does not mean that only a single explicitly stated tuple is allowed, as long as the
expression evaluates to a single tuple.
For example, WHERE ( [Time].[1st half], [Route].[nonground] )
If the slicer specification cannot be resolved into a single tuple, an error will occur [?].
3.2.7 Example Queries
For example, in the cube shown in Figure 3.2, if we want to calculate the total Unit Sales and
total Store Sales for all USA CA stores in the years 1997 and 1998 for a sales schema, then
we would give the following query:
SELECT
{ [Measures].[Unit Sales], [Measures].[Store Sales] }
ON COLUMNS,
{ [Time].[1997], [Time].[1998] }
ON ROWS
FROM Sales
WHERE( [Store].[USA].[CA] )
This query will return the result as shown in the following table :
        Unit Sales    Store Sales
1997    75000         100000
1998    140000        200000
We can also rewrite the above query as
SELECT
{ [Measures].[Unit Sales], [Measures].[Store Sales] }
ON AXIS(0),
{ [Time].[1997], [Time].[1998] }
ON AXIS(1)
FROM Sales
WHERE( [Store].[USA].[CA] )
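How the axes and the slicer shape the result can be seen in a toy evaluation. The sketch below is plain Python over an in-memory dictionary, not an MDX engine; the cell layout and the evaluate helper are hypothetical. The CA values are the ones from the result table above, and the OR values are invented purely to show that the slicer excludes them.

```python
# Cells keyed by (measure, year, state). CA values follow the result table
# in the text; OR values are made up for illustration.
CELLS = {
    ("Unit Sales", "1997", "CA"): 75000,
    ("Store Sales", "1997", "CA"): 100000,
    ("Unit Sales", "1998", "CA"): 140000,
    ("Store Sales", "1998", "CA"): 200000,
    ("Unit Sales", "1997", "OR"): 1000,
    ("Store Sales", "1997", "OR"): 2000,
    ("Unit Sales", "1998", "OR"): 3000,
    ("Store Sales", "1998", "OR"): 4000,
}


def evaluate(column_members, row_members, slicer_member):
    """Return one list per row member, with one value per column member.
    The single slicer member fixes the remaining dimension."""
    return [
        [CELLS[(measure, year, slicer_member)] for measure in column_members]
        for year in row_members
    ]


# Mirrors: measures ON COLUMNS, years ON ROWS, WHERE ([Store].[USA].[CA]).
grid = evaluate(["Unit Sales", "Store Sales"], ["1997", "1998"], "CA")
```

The axis sets each contribute several members and therefore several rows or columns, while the slicer contributes exactly one member and only narrows the cells that are read.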
3.2.8 Differences between MDX and SQL
The main differences between MDX and SQL are:
1. The principal difference between SQL and MDX is the ability of MDX to reference
multiple dimensions. SQL refers to only two dimensions, columns and rows, when
processing queries. Because SQL was designed to handle only two-dimensional tabular
data, the terms "column" and "row" have meaning in SQL syntax. MDX, in
comparison, can process one, two, three, or more dimensions in queries. Because
multiple dimensions can be used in MDX, each dimension is referred to as an axis
[?].
2. In SQL, the SELECT clause is used to define the column layout for a query, while
the WHERE clause is used to define the row layout. However, in MDX the SELECT
clause can be used to define several axis dimensions, while the WHERE clause is
used to restrict multidimensional data to a specific dimension or member [?].
3. In SQL, the WHERE clause is used to filter the data returned by a query. In
MDX, the WHERE clause is used to provide a slice of the data returned by a query.
While the two concepts are similar, they are not equivalent. The SQL query uses the
WHERE clause to contain an arbitrary list of items that should (or should not) be
returned in the result set. While a long list of conditions in the filter can narrow
the scope of the data that is retrieved, there is no requirement that the elements
in the clause will produce a clear and concise subset of data. In MDX, however,
the concept of a slice means that each member in the WHERE clause identifies a
distinct portion of data from a different dimension. Because of the organizational
structure of multidimensional data, it is not possible to request a slice for multiple
members of the same dimension. Because of this, the WHERE clause in MDX can
provide a clear and concise subset of data [?].
Chapter 4
Conclusion
The design of a data warehouse greatly influences the quality of the analysis that is possible
with the data in it. If invalid or corrupt data is allowed into the data warehouse, the
analysis done with this data is likely to be invalid. So, while designing the data
warehouse, special attention should be given to issues such as slowly changing dimensions,
rapidly changing dimensions, and multivalued dimensions, which are discussed in this report.
Dimensional modeling should be used for designing a data warehouse instead of ER
modeling, because the main focus in a data warehouse is not on removing redundancy
from dimensions but on queries that are simple to understand and easy to write.
After the rapid acceptance of data warehousing systems during the past three years, there
will continue to be many more enhancements and adjustments to the data warehousing
system model. Further evolution of hardware and software technology will also
continue to greatly influence the capabilities that are built into data warehouses.
Bibliography
[1] Basic MDX. World Wide Web, http://www.msdn.microsoft.com/library.
[2] Essbase Analytic Services Database Administrator's Guide. World Wide Web,
http://dev.hyperion.com/techdocs/essbase/essbase 71/Docs/dbag/frameset.htm.
[3] Ralph Kimball. Dimensional Modelling Manifesto. World Wide Web,
http://www.dbmsmag.com.
[4] Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. Second edition, 2004.
[5] Paul Lane. Oracle 9i Data Warehousing Guide. Release 1 (9.0.1) edition, 2001.
[6] Michael J. Corey, Michael Abbey, Ian Abramson, and Ben Taub. Oracle 8 Data
Warehousing. Oracle Press edition, 1998.
[7] Silberschatz, Korth, and Sudarshan. Database System Concepts. Fourth edition, 2002.