08 - Dimensional Modeling


Transcript of 08 - Dimensional Modeling

Page 1: 08 - Dimensional Modeling

Copyright © 2006 by Microsoft Corporation. All rights reserved. By using or providing feedback on these materials, you agree to the attached license agreement. Please provide feedback at BI Feedback Alias.

8 Dimensional Modeling

Microsoft Confidential. © 2006 Microsoft Corporation. All rights reserved. These materials are confidential to and maintained as a trade secret by Microsoft Corporation. Information in these materials is restricted to Microsoft authorized recipients only. Any use, distribution or public discussion of, and any feedback to, these materials is subject to the terms of the attached license. By providing any feedback on these materials to Microsoft, you agree to the terms of that license.

Page 2: 08 - Dimensional Modeling


Microsoft Corporation Technical Documentation License Agreement (Standard)

READ THIS! THIS IS A LEGAL AGREEMENT BETWEEN MICROSOFT CORPORATION ("MICROSOFT") AND THE RECIPIENT OF THESE MATERIALS, WHETHER AN INDIVIDUAL OR AN ENTITY ("YOU"). IF YOU HAVE ACCESSED THIS AGREEMENT IN THE PROCESS OF DOWNLOADING MATERIALS ("MATERIALS") FROM A MICROSOFT WEB SITE, BY CLICKING "I ACCEPT", DOWNLOADING, USING OR PROVIDING FEEDBACK ON THE MATERIALS, YOU AGREE TO THESE TERMS. IF THIS AGREEMENT IS ATTACHED TO MATERIALS, BY ACCESSING, USING OR PROVIDING FEEDBACK ON THE ATTACHED MATERIALS, YOU AGREE TO THESE TERMS. 1. For good and valuable consideration, the receipt and sufficiency of which are acknowledged, You and Microsoft agree as follows: (a) If You are an authorized representative of the corporation or other entity designated below ("Company"), and such Company has executed a Microsoft Corporation Non-Disclosure Agreement that is not limited to a specific subject matter or event ("Microsoft NDA"), You represent that You have authority to act on behalf of Company and agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that Company will treat the Confidential Information accordingly; (b) If You are an individual, and have executed a Microsoft NDA, You agree that the Confidential Information, as defined in the Microsoft NDA, is subject to the terms and conditions of the Microsoft NDA and that You will treat the Confidential Information accordingly; or (c)If a Microsoft NDA has not been executed, You (if You are an individual), or Company (if You are an authorized representative of Company), as applicable, agrees: (a) to refrain from disclosing or distributing the Confidential Information to any third party for five (5) years from the date of disclosure of the Confidential Information by Microsoft to Company/You; (b) to refrain from reproducing or summarizing the Confidential Information; and (c) to take reasonable security precautions, at least as great as the precautions it takes to protect its own confidential information, but no less than reasonable care, to keep confidential the Confidential Information. You/Company, however, may disclose Confidential Information in accordance with a judicial or other governmental order, provided You/Company either (i) gives Microsoft reasonable notice prior to such disclosure and to allow Microsoft a reasonable opportunity to seek a protective order or equivalent, or (ii) obtains written assurance from the applicable judicial or governmental entity that it will afford the Confidential Information the highest level of protection afforded under applicable law or regulation. Confidential Information shall not include any information, however designated, that: (i) is or subsequently becomes publicly available without Your/Company’s breach of any obligation owed to Microsoft; (ii) became known to You/Company prior to Microsoft’s disclosure of such information to You/Company pursuant to the terms of this Agreement; (iii) became known to

You/Company from a source other than Microsoft other than by the breach of an obligation of confidentiality owed to Microsoft; or (iv) is independently developed by You/Company. For purposes of this paragraph, "Confidential Information" means nonpublic information that Microsoft designates as being confidential or which, under the circumstances surrounding disclosure ought to be treated as confidential by Recipient. "Confidential Information" includes, without limitation, information in tangible or intangible form relating to and/or including released or unreleased Microsoft software or hardware products, the marketing or promotion of any Microsoft product, Microsoft's business policies or practices, and information received from others that Microsoft is obligated to treat as confidential. 2. You may review these Materials only (a) as a reference to assist You in planning and designing Your product, service or technology ("Product") to interface with a Microsoft Product as described in these Materials; and (b) to provide feedback on these Materials to Microsoft. All other rights are retained by Microsoft; this agreement does not give You rights under any Microsoft patents. You may not (i) duplicate any part of these Materials, (ii) remove this agreement or any notices from these Materials, or (iii) give any part of these Materials, or assign or otherwise provide Your rights under this agreement, to anyone else. 3. These Materials may contain preliminary information or inaccuracies, and may not correctly represent any associated Microsoft Product as commercially released. All Materials are provided entirely "AS IS." To the extent permitted by law, MICROSOFT MAKES NO WARRANTY OF ANY KIND, DISCLAIMS ALL EXPRESS, IMPLIED AND STATUTORY WARRANTIES, AND ASSUMES NO LIABILITY TO YOU FOR ANY DAMAGES OF ANY TYPE IN CONNECTION WITH THESE MATERIALS OR ANY INTELLECTUAL PROPERTY IN THEM. 4. If You are an entity and (a) merge into another entity or (b) a controlling ownership interest in You changes, Your right to use these Materials automatically terminates and You must destroy them.

Page 3: 08 - Dimensional Modeling


5. You have no obligation to give Microsoft any suggestions, comments or other feedback ("Feedback") relating to these Materials. However, any Feedback you voluntarily provide may be used in Microsoft Products and related specifications or other documentation (collectively, "Microsoft Offerings") which in turn may be relied upon by other third parties to develop their own Products. Accordingly, if You do give Microsoft Feedback on any version of these Materials or the Microsoft Offerings to which they apply, You agree: (a) Microsoft may freely use, reproduce, license, distribute, and otherwise commercialize Your Feedback in any Microsoft Offering; (b) You also grant third parties, without charge, only those patent rights necessary to enable other Products to use or interface with any specific parts of a Microsoft Product that incorporate Your Feedback; and (c) You will not give Microsoft any Feedback (i) that You have reason to believe is subject to any patent, copyright or other intellectual property claim or right of any third party; or (ii) subject to license terms which seek to require any Microsoft Offering incorporating or derived from such Feedback, or other Microsoft intellectual property, to be licensed to or otherwise shared with any third party. 6. Microsoft has no obligation to maintain confidentiality of any Microsoft Offering, but otherwise the confidentiality of Your Feedback, including Your identity as the source of such Feedback, is governed by Your NDA. 7. This agreement is governed by the laws of the State of Washington. Any dispute involving it must be brought in the federal or state superior courts located in King County, Washington, and You waive any defenses allowing the dispute to be litigated elsewhere. If there is litigation, the losing party must pay the other party’s reasonable attorneys’ fees, costs and other expenses. If any part of this agreement is unenforceable, it will be considered modified to the extent necessary to make it enforceable, and the

remainder shall continue in effect. This agreement is the entire agreement between You and Microsoft concerning these Materials; it may be changed only by a written document signed by both You and Microsoft.

Page 4: 08 - Dimensional Modeling


In this Chapter

● Choosing an appropriate schema type and designing it

● Dimensions, Facts and Granularity

● Slowly Changing Dimensions

● Modeling Time and Date Dimension

Overview

Dimensional modeling is the design approach used by many data warehouse designers to build their data warehouses. The dimensional model is the underlying data model used by many of the commercial OLAP products on the market today, and it provides a method for making databases simple and understandable. The major purpose of creating a data warehouse from transactional systems is to create intelligence out of day-to-day activity; the warehouse is not intended for extracting operational reports and cannot be treated merely as a report store.

Terminology

The terms used frequently in this document are explained below. Details of each term are given in later sections of this chapter.

Dimension table: A table representing a business entity of the source system. Multiple normalized tables in the source system may represent a single business entity. Example: Customer dimension, Product dimension.

Fact table: A central table in a data warehouse schema that contains numerical measures and keys relating facts to dimension tables.

Grain: The level of the measures within a fact table, represented by the lowest level of the dimensions; it is the level of detail at which facts are stored in the fact table. For example, a fact table may hold a lower grain, such as daily transactions recording which customer purchased which books and how many, or a higher grain that simply states the number of books sold without the details of who bought which book.

Measures: Numeric attributes typically derived from transactions; they are generally placed in the fact table of the dimensional structure. Example: number of books sold today.

Hierarchies and levels: A hierarchy is a group of attributes that logically relate to each other in the form of a tree structure. Examples: City, State, Country; Day, Month, Year. Levels describe the relation between the members of the hierarchy; in the example above, City is at a lower level in the hierarchy and Country is at a higher level.

Physical Schema Structures

Two types of structures are popularly used when designing the physical dimensional model: the star schema and the snowflake schema.

Star Schema. The simplest relational schema for querying a data warehouse database is the star schema. The star schema has a center, represented by a fact table, and the points of the star, represented by the dimension tables. From a technical perspective, the advantages of a star schema are that joins between the

Page 5: 08 - Dimensional Modeling


dimensions and the fact table are simple, query performance is good, data can easily be sliced, and the data is easy to understand.

The following diagram depicts a sample star schema:

Snowflake Schema. A variation of a star schema is called a snowflake. In a snowflake, the dimension tables are normalized. From a reporting perspective, there is no logical difference because all the tables of a snowflake dimension can be logically joined into a single star dimension. From a performance perspective, the snowflake may result in slower queries because of the additional joins required.
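As a minimal sketch of how the sample star schema in the diagram might be declared in T-SQL (data types are assumptions; only the Product dimension is shown, and the other dimensions follow the same pattern):

CREATE TABLE DimProduct (
    ProductKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    ProductName         nvarchar(100) NOT NULL,
    ProductManufacturer nvarchar(100) NULL,
    StartDate           datetime NOT NULL                        -- validity start
);

CREATE TABLE FactSales (
    ProductKey  int NOT NULL REFERENCES DimProduct (ProductKey),
    CustomerKey int NOT NULL,   -- references DimCustomer
    StoreKey    int NOT NULL,   -- references DimStore
    DateKey     int NOT NULL,   -- references DimDate
    SalesAmount money NULL      -- measure
);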

Designing a Warehouse Schema

Following are the basic steps for designing a data warehouse based on facts and

dimensions.

Identify Facts and Fact Table Grain

o Identify business processes for analysis, such as sales trends over a period or cross-selling

[Diagram: A sample star schema - a Sales Fact table (ProductKey, CustomerKey, StoreKey, DateKey) joined directly to a Product Dimension (ProductKey, Product Name, StartDate, ProductManufacturer), Customer Dimension (CustomerKey, CustomerName, StartDate, CustomerLocation), Store Dimension (StoreKey, StoreName, StartDate, StoreLocation), and Date Dimension (DateKey, CalendarDate, Month, Day). A second diagram shows the snowflake variant, in which the Customer and Store dimensions instead carry CustomerLocKey and LocKey references to separate CustLoc and Loc dimension tables that hold the location attributes and StartDate.]

Page 6: 08 - Dimensional Modeling


o Determine the facts that are needed directly or indirectly for the identified

business analysis

o Fact identification is the process of looking at the source database to see whether it can satisfy the business needs; for example, the source may have no information on the product name/brand sold, but only consolidated information on the kind of product that is sold

o Identify and list out source facts typically stored in transaction tables

o Determine the lowest level of granularity needed for the facts; for example, the business analysis may need only weekly sales information

o Determine the entities behind the source transactions. They can be identified from the parameters of the source measures; for example, product cost is based on product, date, and store

o Business entities are termed dimensions, and numeric attributes of the transactional tables are called measures. A fact table is a set of dimension keys and measures.

Identify Dimension Keys and Attributes

A dimension logically consists of a primary (unique) key that maps to a foreign key in

the fact table. The dimension also includes additional attributes about that key. For

example, in a Customer dimension, the Customer ID may be the logical primary key.

Some of the attributes provide the basis for grouping the facts. These are the

groupable attributes. For example, Gender, City, State, and Country may be

groupable attributes of a customer. Other attributes are descriptive. For example,

Mobile Phone Number is an attribute of a customer, but a report is unlikely to group

by phone numbers. Some of the groupable attributes can be organized into a

hierarchy, allowing drill-down reporting. For example, City rolls up to State, which

rolls up to Country. Not all attributes can be organized into a single hierarchy. For

example, Gender does not have any correlation to City.

o Because OLTP is normalized, there can be more than one table for each business entity, such as Products and Product Manufacturer details

o Extract the required attributes of each business entity; for example, Products may need only Product Name and Manufacturer, even though other attributes exist, because they are not needed for the analysis

o Denormalize the business entity (flatten the normalized tables to produce a unique set of attributes for each business entity)

o Prepare a logical data model showing how business entities relate to each other and which entities the source measures depend on

o It is not required to use all the collected business entities for each fact

o In a star schema, hierarchies for a dimension are stored within the dimension table

o In a star schema, every dimension has a surrogate key as its primary key, and there are no child/parent tables for the dimensions

o Fact tables have a composite key containing the dimension keys

Steps in Designing a Snowflake Schema

It is similar to the star schema, but with a degree of normalization applied to the dimension tables.

Page 7: 08 - Dimensional Modeling


One or more dimension tables, each with a generated primary key, contain one or more descriptive fields used to determine the level attributes for each dimension. In addition, a dimension table may contain a join to one or more child dimension tables that supply additional level attributes for the dimension. Example: manufacturer information is kept in a separate dimension, and the Product dimension has a reference to Manufacturer among its attributes. In a star schema, the manufacturer information would instead have been stored as attributes of the Product dimension (not a separate dimension).

In a snowflake schema, more than one table is required to completely define the set of attributes for the dimension.

In a snowflake schema, a dimension table may have a relationship to foreign keys in more than one dimension table. For example, a Geography table can join to the GeographyKey of Customer, Reseller, and Manufacturer.

Whether a dimension is designed as a single table (star) or multiple tables (snowflake) has no effect on the fact table.

In a fully-normalized snowflake dimension, you can clearly see the hierarchical relationship between attributes. In a star dimension, you may not see the hierarchical relationships.

Guidelines for choosing Star vs. Snowflake

Consider a snowflake schema to reduce storage requirements for large dimension tables. For example, storing the information about the manufacturer in the product dimension table would make the product dimension unnecessarily large, because one manufacturer may provide multiple products.

Consider snowflaking a dimension if one entity is referenced by multiple dimensions. For example, Customers, Resellers, and Manufacturers may all have a Geography component. Extracting geography information into a normalized table allows for consistency between geographical definitions for the different dimensions. A snowflaked dimension accessed from multiple dimensions is called a reference dimension.

Another reason for using a snowflake schema is to supply information, not about the level attributes of a dimension, but about the secondary attributes of a level.

Consideration of the Operational Data Store (ODS)

The Operational Data Store is a topic covered in the Extraction guide.

o The ODS technique is adopted when business requirements demand OLTP granularity for a shorter duration of time.

o An ODS should be considered only when information retrieval must be near real time and the OLTP systems cannot be accessed directly because their resources are critical.

o The ODS is a logically separate data store from the warehouse.

o The data in the ODS is at a finer grain than, or equal to, the granularity of the warehouse. The ODS does not maintain the history of attributes, and the retention of ODS data is typically on the order of the synchronization interval. Example: if data is loaded to the warehouse once a day, then intraday transactions are maintained in the ODS.

o The ODS does not maintain surrogate keys, only business keys.

Page 8: 08 - Dimensional Modeling


o If an ODS exists, it is used for loading the warehouse rather than any other staging source

o The ODS is populated through an ETL process (data extraction, transformation, and loading)

Dimensions Explained

OLTP data sources typically use an entity-relationship (E-R) schema technique for modeling transactional data, because OLTP workloads usually involve a large number of transactions, each touching a small amount of data. Data warehouses, on the other hand, usually involve fewer transactions over very large amounts of data, so the E-R schema technique is not as efficient. Instead, data warehouses generally employ one of two approaches to schema design, referred to as the star schema and the snowflake schema.

Dimension Types

There are various terms relating to dimensions that you should be familiar with. Following

is a brief summary. Most of these will be discussed in more detail elsewhere.

Changing Dimension: A changing dimension is a dimension that has at least one attribute whose value changes over time.

Slowly Changing Dimensions: Attributes of a dimension that undergo changes over time. Whether the history of changes for a particular attribute should be preserved in the warehouse depends on the business requirement. Example: an employee's work location attribute may change over time, but not frequently. Such an attribute is called a Slowly Changing Attribute, and a dimension containing such an attribute is called a Slowly Changing Dimension.

Rapidly Changing Dimensions: A dimension attribute that changes frequently is a

Rapidly Changing Attribute. For example, a CurrentLocation attribute for an Employee

may change very frequently. If you don’t need to track the changes, the Rapidly

Changing Attribute is no problem, but if you do need to track the changes, using a

standard Slowly Changing Dimension technique can result in a huge inflation of the size

of the dimension. One solution is to move the attribute to its own dimension, with a

separate foreign key in the fact table. This new dimension is called a Rapidly Changing

Dimension. Deciding if an attribute should be moved into a Rapidly Changing Dimension

or kept in a Slowly Changing Dimension is up to the data architect. The decision depends

on factors such as disk space, performance, expected data volume, and history preservation.

Junk Dimensions: A junk dimension is a single table with a combination of different and

unrelated attributes to avoid having a large number of foreign keys in the fact table. Junk

dimensions are often created to manage the foreign keys created by Rapidly Changing

Dimensions. If you have bit-mapped indexes available, you may not need to create

separate Junk dimensions, because the multiple, small-cardinality foreign keys can be

managed efficiently.

Inferred Dimensions: While loading fact records, a dimension record may not yet be

ready. One solution is to generate a surrogate key with NULL for all the other attributes.

Page 9: 08 - Dimensional Modeling


This should technically be called an inferred member, but is often called an inferred

dimension.

Conformed Dimensions: A Dimension that is used in multiple locations is called a

conformed dimension. A conformed dimension may be used with multiple fact tables in a

single database, or across multiple data marts or data warehouses. Conformed

dimensions are discussed in detail later in this chapter.

Degenerate Dimensions: A degenerate dimension is a dimension attribute that is stored as part of the fact table rather than in a separate dimension table. Common examples

include Order Line Number or Transaction Number. These are essentially dimension

keys for which there are no other attributes. In a data warehouse, these are often used as

the result of a drill through query to analyze the source of an aggregated number in a

report. You can use these values to trace back to transactions in the OLTP system.

Role Playing Dimensions: A role-playing dimension is one where the same dimension

key—along with its associated attributes—can be joined to more than one foreign key in

the fact table. For example, a fact table may include foreign keys for both Ship Date and

Delivery Date. But the same date dimension attributes apply to each foreign key, so you

can join the same dimension table to both foreign keys. Here the date dimension is taking

multiple roles to map ship date as well as delivery date, and hence the name of Role

Playing dimension.
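As an illustrative sketch (the fact table and key names here are assumptions), the same Date dimension can play both roles by joining it to the fact table twice under different aliases:

SELECT ShipDt.CalendarDate     AS ShipDate,
       DeliveryDt.CalendarDate AS DeliveryDate,
       f.SalesAmount
FROM FactSales AS f
INNER JOIN DimDate AS ShipDt     ON f.ShipDateKey     = ShipDt.DateKey
INNER JOIN DimDate AS DeliveryDt ON f.DeliveryDateKey = DeliveryDt.DateKey;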

Shrunken Dimensions: A shrunken dimension is a subset of another dimension. For

example, the Orders fact table may include a foreign key for Product, but the Target fact

table may include a foreign key only for ProductCategory, which is in the Product table,

but much less granular. Creating a smaller dimension table, with ProductCategory as its

primary key, is one way of dealing with this situation of heterogeneous grain. If the

Product dimension is snowflaked, there is probably already a separate table for

ProductCategory, which can serve as the Shrunken Dimension.

Outriggers: Sometimes a dimension attribute can have multiple values for a single

dimension key member. For example, a customer may have multiple accounts. One

solution is to create a separate table with Customer Account as its logical primary key. An

outrigger dimension is the logical inverse of a shrunken dimension.

Static Dimensions: Static dimensions are not extracted from the original data source,

but are created within the context of the data warehouse. A static dimension can be

loaded manually—for example with Status codes—or it can be generated by a procedure,

such as a Date or Time dimension.
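A hedged sketch of generating a Date dimension by a procedure rather than extracting it from a source (the table and column names mirror the sample schema; the year range is an assumption):

-- Populate DimDate for one calendar year; run again each year to extend it.
DECLARE @d datetime;
SET @d = '20060101';
WHILE @d < '20070101'
BEGIN
    INSERT INTO DimDate (DateKey, CalendarDate, [Month], [Day])
    VALUES (CONVERT(int, CONVERT(char(8), @d, 112)),  -- e.g. 20060101
            @d, MONTH(@d), DAY(@d));
    SET @d = DATEADD(day, 1, @d);
END;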

Identifying dimensions

Dimensions provide the 'who, what, where, when and why' that is applied to the 'how much' fact.

Lookup tables and master tables, which generally act as primary key tables in OLTP, usually end up as dimensions in the warehouse

For normalization reasons, there can be more than one table for a single entity in OLTP, such as a customer table and a customer address table; both can be combined into one dimension in the warehouse as part of the denormalization process

Dimensions are typically highly denormalized and descriptive. They are typically text fields, and serve as context for the facts

Dimensions can be left as degenerate dimensions whenever they make sense by themselves (usually a matter of opinion), provided that they do not have additional attributes. One reason for creating a separate dimension table is if you want to report on "missing" values from the fact table: the dimension table can include all

Page 10: 08 - Dimensional Modeling


possible values, not just the ones for which there are fact transactions. By creating an outer join, you can find members that do not have fact transactions.

Dimensions are sometimes referred to as causal: they contain the causal factors responsible for the collected measures

If the attributes are connected in one or more hierarchies then these are usually captured in the dimension table

Dimension attributes are business descriptors users believe are of significance to understanding the measures in the fact table

Dimension attributes are usually discrete (not continuous) and are often used as the source of constraints. Some dimension attributes may be continuous, for example the list price for products. Continuous values are not useful for grouping, but you can often convert a continuous value into discrete buckets, for example converting prices into High, Medium, and Low (see the sketch after this list).

Dimension attributes usually have a limited cardinality at least when compared to facts, which come from OLTP transactions

User input and evaluation of existing business reports help define the dimensions to include in the data warehouse. The words "by", "for", "of", and "in" often flag a dimension.

o FOR EX 1: A user who wants to see data by sales region and by product has just identified two dimensions (geography and product).

o FOR EX 2: Business reports that group sales by salesperson or sales by customer identify two more dimensions (salesforce and customer).

o FOR EX 3: The question, "What were the sales of food items in the northwest region for the third quarter of 1999?" represents the use of three dimensions (product, geography, time) to specify the information to be summarized
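A small sketch of the bucketing idea mentioned in the list above; the price boundaries and the ListPrice column are assumptions, not values from these materials:

SELECT ProductKey,
       ListPrice,
       CASE WHEN ListPrice >= 100 THEN 'High'
            WHEN ListPrice >= 20  THEN 'Medium'
            ELSE 'Low'
       END AS PriceRange
FROM DimProduct;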

Dimension Maintenance

Following are the guidelines pertaining to dimension maintenance.

Avoid removing attributes from a dimension’s schema. Consider not removing an attribute from a dimension without analyzing the impact on other objects such as analytical queries, report queries, and cubes. Even if the attribute is discontinued at the source, you can add a default value in the warehouse for new rows, rather than remove the attribute from dimension. Business keys should always be retained in the dimension tables, as these provide a mechanism for tracing the report values back to the source.

As an example, if the source schema is modified to remove an attribute such as Date of Birth, data for the attribute probably already exists in the dimension table. If the attribute is kept in the dimension, new rows will simply have NULL for it; if it is removed, you lose historical data.

Adding a new attribute to the dimension should not change the granularity of the dimension. For example: Adding Customer location should not change the granularity of the customer dimension. If there is only one record in customer dimension before adding location, it should still have only one record per customer in the dimension even after adding location attribute to it. If adding a new attribute takes the dimension to a new lower granularity then consider managing the value as a slowly changing attribute, or add an outrigger table to the dimension. Changing the dimension granularity would affect the fact tables with references to this dimension

Page 11: 08 - Dimensional Modeling


It must be possible to add attributes to a dimension incrementally. Assume all attributes on the dimension are Type I; if a new attribute is added that could potentially change the dimension to Type II, this would impact the ETL logic. Consider such maintenance changes while developing the ETL framework: an attribute that is Type I today may become Type II tomorrow, or vice versa. Type I and Type II are part of slowly changing dimension design, which is covered in a later section.

Data warehouse architectures must be designed to accommodate ongoing data updates, and to allow for future expansion with minimum impact on existing design. Fortunately, the dimensional model and its straightforward schemas simplify these activities.

Records are added to the fact table in periodic batches, often with little effect on most dimensions. For example, a sale of an existing product to an existing customer at an existing store will not affect the product, customer, or store dimensions at all. If the customer is new, a new record should have been added to the customer dimension table before adding the fact record. The historical nature of data warehouses means that records almost never have to be deleted from tables except to correct errors. Errors in source data are often detected in the extraction and transformation processes and are corrected before the data is loaded into the data warehouse database.

The date and time dimensions are often created and maintained in the data warehouse independent of the other dimension tables or fact tables – updating date and time dimensions may involve only a simple annual task to physically add the records for the next year.

An entirely new schema can be added to a data warehouse. A new business subject area can be added by designing and creating a fact table and any new dimensions specific to the subject area. Existing dimensions can be reused without modification to maintain conformity throughout the entire warehouse. If the new fact table has a different grain, you may need to create a shrunken dimension (or add an outrigger table) while still conforming with the master dimension.

Dimension Guidelines

Consider the following guidelines when designing your dimension tables:

Denormalization

This is the process of flattening the dimensions to form one single unit of all the related

entities. Normalization is generally performed on OLTP systems to reduce redundancy

and reduce the risk of update anomalies while managing many transactions. On the

contrary, denormalization introduces redundancy and reduces joins, allowing for faster

retrieval and more user-friendly structures

Unlike a transaction system, where new transactions can be launched from many processes, in a data warehouse only the ETL process adds or updates records, and the update frequency is much lower than that of a typical OLTP system.

Benefits of selectively violating normalization rules and introducing redundancy:

Redundancy can reduce data retrieval time

Redundancy can create a more user-friendly model

Potential disadvantages with denormalizing the database:

It may cause update, delete and insert performance to suffer

It will take up more space

Page 12: 08 - Dimensional Modeling


In a typical data warehouse, the advantages of denormalization far outweigh the disadvantages.

Avoid encoding record information in Surrogate key

A surrogate key should be a simple running sequence number and should not have any

encoded information. If the surrogate key is an integer, the query manager can perform

faster joins between the fact table and the dimension table. The surrogate key should carry no meaning of its own, so that a new surrogate key can be applied freely to any new records added because of Type II changes. Encoding record information in the surrogate key has an adverse effect when the record information changes.

For example, a surrogate key that includes the combination of Employee ID and Location

ID would need to change if the employee location changes. This defeats the decoupling

purpose of the surrogate key.

Denormalize snowflake dimensions for end users

If you decide to store a dimension as a snowflake—in multiple tables—use a view to

denormalize it into a star before representing it to end users. A simple dimension

structure that consists of simply a dimension key and a list of attributes is easier for users

to understand than a complex relationship of tables with joins.

For example, an Employee dimension may be stored as an Employee table and a

Department table. If the Department ID is a foreign key in the Employee table, it is

logically part of the Employee dimension. Rather than show the end user two separate

entities, show the dimension as if it were a single table.

Do not show surrogate keys to the end user because they do not have any business

meaning.
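A minimal sketch of such a view, assuming the Employee dimension is stored as an Employee table with a DepartmentKey foreign key into a Department table (all names are illustrative):

CREATE VIEW dbo.DimEmployee
AS
SELECT e.EmployeeKey,       -- surrogate key: needed for joins, but not shown in reports
       e.EmployeeName,
       d.DepartmentName     -- attribute pulled in from the snowflaked Department table
FROM dbo.Employee AS e
INNER JOIN dbo.Department AS d
    ON e.DepartmentKey = d.DepartmentKey;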

Consider adding a zero key to dimension tables

Add a dummy record to each dimension as the first record, with a surrogate key value of zero, the so-called zero key. The business key and other attributes default to NULL or to a default value such as 'Not Available'.

The fact table typically contains the transactions recorded at the source system. Because of the nature of the extraction process implemented, the business key may not be available in the dimension at the time the fact table is loaded. In such a case, the default zero key is assumed for the dimension (as the dimension is not yet populated with the required key). Once the dimension record arrives in a later load, the corresponding zero key in the fact is updated with the surrogate key of the business key. Zero keys are also added to the date and time dimensions; if the date/time information is not available at the time of fact table population, the zero keys of the corresponding dimensions are assumed.
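A hedged sketch of the zero-key technique (the table, column, and staging names are assumptions): seed the dimension with a zero-key member, and once the real dimension member has been loaded, re-point the fact rows that were loaded against the zero key.

-- Seed the zero-key member once, with defaulted attributes.
SET IDENTITY_INSERT dbo.DimCustomer ON;
INSERT INTO dbo.DimCustomer (CustomerKey, CustomerId, CustomerName)
VALUES (0, NULL, 'Not Available');
SET IDENTITY_INSERT dbo.DimCustomer OFF;

-- After the dimension member arrives, update the fact rows that carry the zero key,
-- matching them back to the staged transactions to recover the business key.
UPDATE f
SET    f.CustomerKey = d.CustomerKey
FROM   dbo.FactSales AS f
JOIN   dbo.StagingSales AS s ON s.TransactionId = f.TransactionId
JOIN   dbo.DimCustomer  AS d ON d.CustomerId    = s.CustomerId
WHERE  f.CustomerKey = 0;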

Do not combine attributes from different business entities into a single dimension

One dimension represents one entity and its details. Combining two entities into one dimension in order to reduce the number of dimensions does not adequately model the relationship.

For example, suppose that a single Employee can participate in multiple Projects, and a

single Project can include multiple Employees. This means that Employees and Projects

Page 13: 08 - Dimensional Modeling


are different entities. The relationship between Employees and Projects must be from an

association table or a fact table. If you try to include Project information in the Employee

dimension, you would need a separate Employee member for each Employee Project.

Create a surrogate key as the primary key of each dimension

Surrogate keys act as the primary key for the dimension. They are required because of the following advantages:

Consolidating from multiple sources: When data is consolidated from multiple sources, the same business key may be repeated across sources. To identify each record uniquely, a surrogate key is created. This is a single column incremented for each added record, and its value is not based on the business key.

Slowly changing members: Dimension members can change over time. In Type II dimensions, where the history of a Type II attribute is tracked, the business key is repeated. Surrogate keys uniquely identify each dimension record irrespective of the type of change in the attribute columns.

Inferred dimension support: When the dimension record is a late arrival after its fact records, a new dummy record (having the business key and all other columns NULL) is inserted in the dimension table. The surrogate key allows this record to be referenced in the fact table; be aware that the dummy dimension record is updated once the late-arriving dimension record is received.

Natural key may change: The natural key (source key) may be modified in the source system; if it were referenced in the fact table, this would lead to chaos, because the entire fact table would have to be updated wherever the natural key is referenced. With a surrogate key that is independent of the natural key, nothing has to change even if the natural key is modified in the source system.

Performance: An integer surrogate key in the dimension can improve the performance of data access queries, both because of the integer data type and because a single attribute represents the dimension record. Consider a natural key at the source that is not a single column but a group of columns acting as a composite key. Carrying that composite key in the dimension and referencing it from the facts would not only increase disk usage but also require joins on all of the columns to reach the dimension record.

Use a simple integer for Surrogate keys

Integer key joins are faster than joins on other data types; the SQL query engine is optimized for integer comparison.

If possible, use an Identity constraint to generate the surrogate key. This is much more efficient than creating the key by using ETL code

Each time a member has a new value for a Type II attribute, you add a new record, which generates another surrogate key. This allows older fact table rows to keep the old attribute value, while new fact table rows have the new value.

Surrogate keys can be defined as unique across source systems, so that a John Reynolds from system XYZ and a John Reese from system ABC might both be identified in their respective systems as JR. However, once they are loaded into the data warehouse, they are each assigned a unique, non-overlapping value.
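A hedged example of the Identity approach (names, sizes, and types are assumptions): the engine generates the surrogate key, while the business key from the source is kept as an ordinary attribute.

CREATE TABLE dbo.DimCustomer (
    CustomerKey      bigint IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
    CustomerId       nvarchar(20) NOT NULL,      -- business (natural) key from the source
    CustomerName     nvarchar(100) NULL,
    CustomerLocation nvarchar(100) NULL,
    StartDate        datetime NOT NULL,          -- start of the Type II validity period
    EndDate          datetime NULL               -- NULL marks the current row
);

-- Every insert, including new Type II versions, receives the next key automatically.
INSERT INTO dbo.DimCustomer (CustomerId, CustomerName, CustomerLocation, StartDate, EndDate)
VALUES ('C-1001', 'Dave', 'Redmond', GETDATE(), NULL);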

Page 14: 08 - Dimensional Modeling


Identify type and relationship between the dimensions and the facts

Consider the following guidelines when defining the relationship between dimensions and facts:

If an attribute in a dimension is a numeric value that can be meaningfully aggregated in user reports, consider adding the numeric value to a fact table

A dimension can have numeric values that don’t need to be aggregated in reports. These numeric values can remain in the dimension.

A fact table may contain only foreign keys, but no measures. This is called a factless fact table and exists when the business measures consist of only counts. This is a valid fact table design, but you may need to use distinct counts to aggregate the counts, rather than simple fact row counts (see the sketch after this list).

The purpose for a dimension table is the ability to have multiple attributes for the dimension key. If a dimension key does not have any additional attributes, you can implement the dimension as a degenerate dimension in the fact table.
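A short sketch of the distinct-count point from the list above (the factless fact table and its keys are assumptions): counting distinct customers who visited a store on a day, rather than raw event rows.

SELECT DateKey,
       StoreKey,
       COUNT(*)                    AS VisitRows,
       COUNT(DISTINCT CustomerKey) AS DistinctCustomers
FROM   FactStoreVisit
GROUP BY DateKey, StoreKey;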

Creating separate snowflake dimension tables for a variable number of entity attributes

For heterogeneous dimension items such as products, some products have 100

attributes to track and some products just 10. Having all these product details in the

same dimension table would make the dimension table very wide, with no values for most

attributes in any one row (sparse). One solution is to create one main dimension table

containing the shared attributes and create individual dimension tables that join to the

main dimension table to track the extended attributes

Create an association table when there is a many-to-many relationship between two dimensions

An association table maps the relationship between attributes from different dimensions.

Example:

Consider two dimensions, an Authors dimension and a BookTitle dimension. One author can have multiple book titles, and one book title can have multiple authors. This is a many-to-many relationship between authors and book titles.

Consider creating a separate association table which has the association between

Authors and Titles.

Association table

Also, the association is maintained without the need for any fact transactions "happening" at the source. In general, association tables map keys in a many-to-many relationship, but if the underlying dimensions are Type II, there is maintenance overhead. To reduce the overhead, we suggest creating the association table with business keys rather than surrogate keys (SKs). The correct surrogate key can easily be found in the Type II dimension with the help of the EndDate column.
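A hedged sketch of such an association table, keyed on business keys as suggested above (names and types are assumptions):

CREATE TABLE dbo.AuthorTitleAssociation (
    AuthorId nvarchar(20) NOT NULL,   -- business key of the Authors dimension
    TitleId  nvarchar(20) NOT NULL,   -- business key of the BookTitle dimension
    CONSTRAINT PK_AuthorTitle PRIMARY KEY (AuthorId, TitleId)
);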


Page 15: 08 - Dimensional Modeling


Guidelines for choosing an Outrigger table

If a dimension attribute is not unique for the business key, another dimension is created, also called a child dimension, which has a reference to the parent dimension and its own surrogate key. One business key record in the parent dimension can map to multiple records in the child dimension; the child dimension here is called an outrigger.

This is sometimes called a Multi-valued Snowflake Dimension or a Reference Dimension

An outrigger table is different from a snowflake table. A snowflake table has lower cardinality than the dimension key; an outrigger has higher cardinality. In a snowflake, the parent dimension (which is normalized from star to snowflake) holds a reference to the snowflaked table, whereas here the outrigger holds the reference to the parent dimension. For example, if the Employee dimension is snowflaked into an Emp dimension and a Status dimension, the Emp dimension has an attribute holding the Status dimension SK. In an outrigger (example below), the outrigger table AccountsOutrigger has an attribute that stores the Emp SK.

Outriggers are meant to improve maintainability, save space, or increase flexibility.

Outriggers are not the same as snowflakes. An outrigger is created only when there is more than one value for a dimension attribute, whereas a snowflake is created just to factor out some dimension attributes (normalization).

Avoid snowflaking the outrigger, as it could result in a complex, less stable, and more difficult to maintain model.

Example:

Employee Dimension (before creating the outrigger):

SK  EmployeeName  EmpId  Location  AccountId  AccountType
1   Dave          123    Redmond   9287       Savings
2   Dave          123    Redmond   4235       Checking
3   Jim           124    Redmond   8923       Savings

Here employee information is repeated because an employee has more than one account. To reduce the redundancy and have only one record per employee in the Employee dimension, create the following structure using an outrigger.

Employee Dimension:

SK  EmployeeName  EmpId  Location
1   Dave          123    Redmond
3   Jim           124    Redmond

AccountsOutrigger:

AccSK  EmpSK  AccountId  AccountType
1      1      9287       Savings
2      1      4235       Checking
3      3      8923       Savings

Page 16: 08 - Dimensional Modeling


Conformed (Shared) Dimensions

A conformed (or shared) dimension is one that is used with multiple fact tables, or in multiple data marts, or in multiple data warehouses. Shared does not mean only one copy is present; it means that only one copy is loaded through the formal loading process, and the other copies are replicated from the loaded copy. A dimension is said to be conformed if it is used across different schemas, such as data marts and data warehouses, while maintaining the logically same set of dimension attributes. Using conformed dimensions is critical to successful data warehouse design.

Business entities (also called dimensions here) can be reused for other facts; for example, the Products entity can be reused for Sales as well as Product Costs.

A conformed dimension ensures that the dimension data is sourced from only one process, and that any other dimensions derived from it for data mart purposes have consistent information.

When there are multiple fact tables or data marts in the system, all of them should access one logical dimension per business entity. Example: there should be only one Customer dimension in the system, used by all data marts, fact tables, outrigger tables, and other snowflake dimensions.

This dimension can be replicated or horizontally partitioned to suit the requirements of

other data marts. Horizontal partitioning of a dimension means that some of the rows are

in one copy and some are in another copy. For example, one dimension might have all

the East region customers, while another has all the West region customers.

The ETL process populates only one Customer dimension in the system, not one copy for each data mart or separate copies for other horizontally/vertically partitioned dimensions.

You should use conformed dimensions whenever possible instead of individual ones.

The data mart and the data warehouse use the same dimension, not multiple copies of it, and the loading process runs only once.

Conformed dimensions are designed to satisfy all data mart and data warehouse needs. A dimension can be taken to a higher granularity, but cannot be taken to a lower granularity once fact tables are populated.

[Diagram: the OLTP source feeds an ETL process that loads a single conformed Customer Dimension; horizontally partitioned copies are replicated to the individual data marts, and the complete dimension is used by the DW.]

Because the conformed dimensions are

Page 17: 08 - Dimensional Modeling


designed to be shared among the data marts and the data warehouse, they should be designed at the lowest granularity possible, as per the source database schema.

Changing Dimension

A data warehouse must deal with attributes that change values over time. This is not an

issue for most transaction systems, because they deal only with the present. But a data

warehouse will often present data historically for anywhere from two to ten years.

Organization changes, customer attribute changes, and other changes must be

managed.

Procedure for identifying changes in a dimension

Identifying the attributes that would change over time requires knowledge of the OLTP system. In principle, any attribute except the key can change over time. You need to plan what to do in case an attribute changes.

Identifying the attributes whose change history is to be maintained in the warehouse is a business analysis decision.

Changes are to be well modeled in the data model and the data mapping sheet.

Granularity and normalization are to be properly represented in the data model.

Assess whether the data type of the surrogate keys is large enough to support changing dimensions.

Dimension Size considerations

o Data must be split based on attribute changes and usage

o Data that is used together is kept together to reduce joins

o Splitting a fat dimension into several smaller dimensions can improve performance

o A snowflake schema with its separated dimension tables can perform slightly better than a star schema with its wide, flattened dimensions

If the rapidly changing attributes are not textual and discrete, but are numeric and continuous, then consider placing them in facts

The validity of the data warehouse is thus dependent on how well changes in its dimensions are tracked.

Slowly Changing Dimensions Explained

The Slowly Changing Dimension (SCD) should technically be called Slowly Changing

Attribute, because any attribute can be handled differently from any other. There are

logically three ways of dealing with a slowly changing attribute:

Type I: The existing record is updated with change. Replacing the data is very appropriate if the old data was erroneous or it is unlikely that the changes will materially affect important analyses. Type 1 slowly changing dimensions do require re-creating any aggregations that would be affected by the change.

Type II: History is preserved, a new record is added to dimension and old record is marked as inactive. EndDate or IsLatest is generally used to represent latest record of a business key. A new surrogate key is created for every change irrespective of the business key. Type II changes can cause a serious inflation in the number of members of a dimension.

Page 18: 08 - Dimensional Modeling


Type III: Type III preserves a version of the history by adding it to the dimension record in a different attribute. Typically, there are two states: current and previous. In some cases, you may have 4 or 5 states. When the number of defined states is exceeded, previous states are discarded. A Type III change does not create a new row. A Type III change does require extra columns in the table schema. As with a Type I change, a Type III change requires a dimension update, so you need to re-process all aggregations after a change.
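A hedged sketch of a Type II change for a single business key (the table and column names are assumptions; production ETL would normally process changes in sets rather than row by row): expire the current row, then insert the new version.

-- Expire the current row for this business key.
UPDATE dbo.DimCustomer
SET    EndDate = GETDATE()
WHERE  CustomerId = 'C-1001'
  AND  EndDate IS NULL;

-- Insert the new version; the IDENTITY column assigns a new surrogate key.
INSERT INTO dbo.DimCustomer (CustomerId, CustomerName, CustomerLocation, StartDate, EndDate)
VALUES ('C-1001', 'Dave', 'Seattle', GETDATE(), NULL);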

Slowly Changing Dimension Guidelines

Consider the following guidelines for addressing slowly changing dimensions:

Avoid Type II for dimensions with a large number of columns; use a number of smaller dimensions instead

If an SCD dimension has a Type II attribute and also has a large number of columns, then any change in that attribute results in adding the entire row again to the dimension, even though none of the other columns has changed. Create smaller dimensions as an alternative solution.

Consider loading with the correct SK for late-arriving facts

In the case of an inferred dimension member, where you create the member's surrogate key before the attributes are available, do not create a new Type II record when the attributes arrive; instead, update the inferred record in place, as if the initial attributes were Type I. For late-arriving facts, use the surrogate key that was current at the time of the transaction. For example, RecordX of Fact1 is delayed, and it corresponds to the customer at a time when the location was L1 (SK = 100). The customer's location has since changed to L2, creating a new surrogate key (SK = 101). When RecordX is finally added to Fact1, you should use SK = 100 instead of the latest SK (SK = 101) from the Customer dimension.
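A hedged sketch of looking up the historically correct surrogate key for a late-arriving fact, assuming the Type II dimension carries StartDate and EndDate validity columns (the names and the sample date are assumptions):

DECLARE @TransactionDate datetime;
SET @TransactionDate = '20060315';

SELECT d.CustomerKey            -- the SK that was current when the transaction occurred
FROM   dbo.DimCustomer AS d
WHERE  d.CustomerId = 'C-1001'
  AND  d.StartDate <= @TransactionDate
  AND (d.EndDate IS NULL OR @TransactionDate < d.EndDate);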

Consider not treating attributes as Type II attributes unless required

The idea behind Type II attributes is to maintain history so that business analysis can track changes over time. If the changes need not be tracked, treat the attribute as Type I (overwrite the attribute value every time it changes).

Consider size of dimension record when identifying Type II attributes

Having a larger number of Type II attributes in a single dimension results in adding the entire dimension record again for any change to any one Type II attribute. If a large number of Type II attributes is unavoidable, consider breaking up the dimension.

Use a sufficiently large data type for surrogate keys

When estimating the size of a data type for a surrogate key, be sure to include the

additional records that will be created from Type II changes.

Example: In a dimension with an estimated 9,000,000 customers, you might be tempted

to use a small integer for the key, but one Type II attribute, with only one change per

customer would result in 18,000,000 records. To be safe, you may want to use bigint

keys for dimensions.

Page 19: 08 - Dimensional Modeling


Use TYPE III SCD when you need current and previous values

Type III attribute changes are relatively rare in a data warehouse, but if you do need to

report only on the current and previous states, it can simplify reporting compared to a

Type II implementation.

Slowly Changing Dimensions – Scenarios and Solutions

Process Type I updates in the correct sequence

With a Type I attribute in the warehouse, it is possible to have a source system that

maintains multiple change entries. When applying the changes, be sure to apply them in

the correct sequence so that the warehouse maintains the last possible value. For

example, the Employee location was New York at 9AM. It was changed to Los Angeles at

1 PM and then changed again to Houston at 4 PM. When extraction is performed at 6

PM, it has all the changes starting from last extraction. When the data is applied to the

warehouse, it is critical that the final value for the dimension member be the last

change—Houston.

There are two ways to accomplish this

Batch Aggregation: Aggregate the changes hourly and load the hourly data at the end of the day. In such a case, if an update occurred multiple times within an hour, only the latest change is propagated to the warehouse.

Loading in Intervals: Consider loading the warehouse hourly from each hourly data extract and reduce the interval further if updates are frequent and need to be captured

Split a dimension table when attributes change at different frequencies

If a dimension contains both Employee information and Department information, consider splitting it into two tables, with a foreign key to the department table in the employee table. If employee and department attributes remain in the same dimension and change at different rates, the dimension can grow to a huge cardinality, particularly if both sets of attributes are Type II in nature.

Use a Rapidly Changing Dimension (RCD) or outrigger table for Type II attributes that change frequently

The best solution is to treat the frequently changing attribute as rapidly changing and create a separate Rapidly Changing Dimension (RCD) for it. Add the surrogate key of this new dimension to the fact table so that the current status is captured when each fact record is inserted. If, however, there is no fact record, it is difficult to know the latest value of the attribute. One solution is to create a separate fact table that contains only the main dimension key and the RCD surrogate key. (In essence, this becomes an outrigger dimension.)


When current information is required from a large Type II dimension table, consider joining on the business key

Reports that access the fact table join to Type II SCD dimensions to get the latest instance of each business key. With only the surrogate key stored in the fact, this requires a self join on the dimension to find the latest SK for the business key referenced by the fact record:

Fact.DimSK = Dim.DimSK and Dim.BusinessKey = SelfDim.BusinessKey and SelfDim.EndDate is NULL

(EndDate is typically a column on Type II dimensions that, together with a start date, marks the period during which each Type II version was current.) The solution is to add the business key to the fact table as an additional field, not as part of the composite key, which removes the need for the self join. This is particularly useful for Type II SCD dimensions with large volumes when reports access the underlying relational tables directly. The join condition then becomes:

Fact.BusinessKey = Dim.BusinessKey and Dim.EndDate is NULL

If only the surrogate key is used to join the facts and dimensions, getting the latest information from a Type II dimension requires a self join, such as:

Select Curr.CustomerName, Measure
From Fact
Inner Join Customer Old on Fact.CustomerSK = Old.CustomerSK
Inner Join Customer Curr on Old.CustomerId = Curr.CustomerId and Curr.EndDate is NULL

If the business key is stored in the fact table, the self join can be eliminated:

Select Curr.CustomerName, Measure
From Fact
Inner Join Customer Curr on Fact.CustomerId = Curr.CustomerId and Curr.EndDate is NULL

This method of joining on the additional business key (instead of the dimension surrogate key) is particularly useful when the "current" state is sought in the result set from a large Type II dimension table. It does not eliminate the need for the surrogate key in the fact table; it only suggests storing the dimension's business key as an additional fact attribute to improve query performance.

Use a Full Outer Join to identify Inserts, Updates, and Deletes

While loading the warehouse, you need to check whether each incoming record already exists in the warehouse. If the record does exist, you have a Type I or Type II effect depending on which attribute changed. If the record does not exist, you must add one. In addition, there is the case where a record exists in the warehouse but is not in the new extract; this signals a logical deletion. Comparing the existence of values in both the new and old data sets can be slow, particularly when the tables are large.

A good solution is to use a Full Outer Join between the new table and the existing

warehouse dimension table, using the business key for the join. You can then determine

Updates, Inserts, and Deletes in one query.

Example


SourceTable Full Outer Join Dim On SourceTable.BusinessKey = Dim.BusinessKey and

Dim.EndDate is NULL

If the resulting dimension Business Key is NULL then it means the corresponding record

does not exist in warehouse or that it has expired. In this case a new record is to be

inserted in the warehouse.

If the resulting source Business Key is NULL, then it means the corresponding record has

been deleted and should be flagged with an EndDate in the warehouse table.

If neither Business Key is NULL then it means the corresponding record exists in

warehouse and is active. In this case the appropriate Type I or Type II processing can be

performed.

Using the Full Outer Join approach combined with hash values computed over the attributes, you can efficiently identify the Type I and Type II attribute changes, as outlined below (a sketch follows the steps):

Step 1: Identify inserts, updates, and deletes using the Full Outer Join

Step 2: For inserts, insert the records directly into the dimension

Step 3: For deletes, update the dimension records to mark them as inactive

Step 4: For updates, compare the Type I and Type II hash values to determine which records receive a Type I effect and which receive a Type II effect
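The following is a minimal sketch of Step 1 in T-SQL. The table and column names (SourceCustomer, DimCustomer, BusinessKey, CustomerSK) and the hash columns (Type1Hash, Type2Hash, assumed to be computed earlier in the ETL) are illustrative assumptions.

Select Case
           When d.BusinessKey is NULL Then 'INSERT'      -- not in the warehouse, or expired
           When s.BusinessKey is NULL Then 'DELETE'      -- missing from the new extract
           When s.Type2Hash <> d.Type2Hash Then 'TYPE2'  -- history-tracked change
           When s.Type1Hash <> d.Type1Hash Then 'TYPE1'  -- overwrite-in-place change
           Else 'NOCHANGE'
       End As Action,
       s.*, d.CustomerSK
From   SourceCustomer s
Full Outer Join DimCustomer d
       on  d.BusinessKey = s.BusinessKey
       and d.EndDate is NULL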

Rapidly Changing Dimension

A dimension is considered to be a rapidly changing dimension (RCD) if one or more of its attributes changes frequently across many rows. Consider the following guidelines when dealing with an RCD.

Determine if an attribute is Rapidly Changing

When an attribute for which you need to maintain history (a Type II attribute) changes frequently, you can end up with a huge inflation of rows in the dimension table. To avoid inflating the dimension table, you can move the attribute to a separate dimension, called a Rapidly Changing Dimension (RCD). The threshold for when you would move an attribute to an RCD is based on the size of the dimension, the number of changing attributes, and the frequency of change. As a broad rule of thumb, if an attribute would cause a dimension to double in size more than once per year, you may want to move it to an RCD. The cut-off limit will vary depending on your resources and business requirements.

Slow down the rate of change by using Tracking bands

The number of changes to a rapidly changing attribute can be reduced if a range is stored instead of the precise value; these ranges are also termed tracking bands. A change is then recorded only when the value crosses a band boundary. Because the band changes far less frequently than the underlying value, the attribute no longer needs to be treated as rapidly changing, which eliminates the need to handle it in a separate set of dimensions. Defining ranges is not feasible for every rapidly changing attribute, but it can be applied wherever a range makes sense and the business requirements are met by the range definition. For example, if the business requirements do not demand the actual age of an employee, and age is considered a rapidly changing attribute, then consider


defining a range for the age. The ranges could include [10-20], [20-30], and so on. Even if an employee's age changes from 25 to 26, the band he or she belongs to does not change, as shown in the tables below.

Employee (source)

EmpId   EmpName   Age
1098    John      28

EmpDimension (Warehouse)

EmpSK   EmpId   EmpName   AgeRange
342     1098    John      20
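The band itself can be derived during the dimension load with a simple CASE expression. This is a minimal sketch; the staging table and column names are illustrative assumptions.

Select EmpId,
       EmpName,
       Case
           When Age >= 10 and Age < 20 Then '[10-20]'
           When Age >= 20 and Age < 30 Then '[20-30]'
           When Age >= 30 and Age < 40 Then '[30-40]'
           Else '[40+]'
       End As AgeRange   -- only a band boundary crossing triggers a Type II change
From   StagingEmployee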

Determine if Type II change is needed for Rapidly Changing attributes

Determine whether maintaining history for rapidly changing attributes is important for the business reports and analysis that are needed. If it is not required, a Type I change on such attributes does not create any cardinality or performance problems.

Use static time components rather than changing attributes

Rather than use Age as an attribute, you can store the Birth Date. Birth Date will not change, but it can be used to calculate the exact age as needed.
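As a minimal illustration (the EmpDimension table is taken from the example above; the BirthDate column is an assumption), the exact age can be derived at query time rather than stored:

Select EmpId,
       BirthDate,
       -- full years elapsed, reduced by one if the birthday has not yet occurred this year
       DATEDIFF(year, BirthDate, GETDATE())
         - Case When DATEADD(year, DATEDIFF(year, BirthDate, GETDATE()), BirthDate) > GETDATE()
                Then 1 Else 0 End As CurrentAge
From   EmpDimension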

Move Rapidly Changing Attributes to another dimension

When you have a rapidly changing attribute, you avoid inflating the dimension table by moving the attribute to its own dimension with its own surrogate key. In other words, break the offending attributes out of the dimension and create one or more new dimensions. Consider the following example.

Example:

Consider a customer dimension with a Status attribute that is identified as rapidly changing and for which retaining history is necessary. As a solution, create a CustomerStatusDimension and add the foreign key for that dimension to each row of the fact table as the fact data is loaded. Rather than add the CustomerStatusDimension key to the main fact table, you can also create a separate fact table containing the Date, the CustomerID, and the CustomerStatusDimensionID. This solution also allows end users to analyze the history of status changes.

Create Junk dimension to store multiple RCD attributes

If you have a large number of rapidly changing attributes and create a dimension for each, the number of dimensions can become very large, and the fact table can end up

with a large number of foreign keys. One approach is to combine several of these mini-

dimensions into a single physical dimension. This approach works because each RCD

typically has a very low cardinality. This is the same technique used to create what is

often called a junk dimension—a dimension that contains unrelated attributes to get them

out of the fact table. A junk dimension in the fact table reduces the number of extra

foreign keys in the fact table, but it is still necessary to use the fact table to resolve the

relationship between the customer and the attributes.

To keep the RCD attributes completely out of the main fact table, create a separate fact

table with a Date, the CustomerID, and the RCD composite key.
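A minimal sketch of such a separate fact table follows; all names (FactCustomerStatus, DateKey, CustomerSK, CustomerStatusSK) are illustrative assumptions rather than names defined in this module.

Create Table FactCustomerStatus
(
    DateKey           int not null,   -- surrogate key of the Date dimension
    CustomerSK        int not null,   -- surrogate key of the Customer dimension
    CustomerStatusSK  int not null,   -- surrogate key of the RCD / junk dimension
    Constraint PK_FactCustomerStatus
        Primary Key (DateKey, CustomerSK, CustomerStatusSK)
)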



Modeling Date and Time dimensions

There should be only one Date dimension and one Time dimension shared across all fact and dimension tables. Not all date and time fields need to be mapped to the Date and Time dimensions. Map a date-time field to a Date or Time dimension only when you need the extra attributes from the dimension.

Typically the lowest granularity of the Date dimension is Day, and the lowest granularity

of the Time dimension is Second. Many warehouses do not need a Time dimension, but

a Date dimension is used in virtually every data warehouse system.

In general, create separate dimensions for Date and Time

If it is necessary to extract contiguous chunks of time that cross day boundaries (for

example 11/24/2000 10 p.m. to 11/25/2000 6 a.m.), then it is easier if the hour and day

are in the same dimension. Otherwise, separate dimensions for Date and Time are easier

to manage and to query.

It is easier to analyze cyclical and recurring daily events if Date and Time are separate

dimensions.

Example:

Number of meetings active at 9AM this week. This is an easy query if Date and Time are

separate; it is much more complex if Date and Time are combined into a single

dimension.

The following are issues with having date and time together in one DateTime dimension:

Issue #1:

Combining Date and Time into a single dimension creates an extremely large dimension, particularly if you use seconds as the grain. The problem is not as great if you use the hour as the grain of the combined table.

Cardinality of Date and Time Dimensions (separated) for 10 years of data

Date Dimension: 10 * 365 = 3650

Time Dimension (granularity: seconds): 24 * 60 * 60 = 86400

Cardinality of Date and Time Dimensions (combined) for 10 years of data

DateTime Dimensions (granularity Hourly): 10 * 365 * 24 = 87600

DateTime Dimensions (granularity Seconds): 10 * 365 * 24 * 60 * 60 = 315,360,000

The more records in the DateTime dimension, the slower query performance will be.

Issue #2

Having both date and time in the same dimension can cause interpretation problems for fact tables that have a granularity of a day. It would be easy to inadvertently

enter two records into the fact table for the same day, because of the extra granularity in

the dimension table.


When do you need Time information in the Fact Table?

When business analysis needs time granularity for rollups, or for measures and aggregations based on the transaction date and time.

When the business analysis has requirements such as the number of transactions that occurred between 3 PM and 4 PM on 1/1/2006.

If no date/time related to the transaction can be extracted, the synch (extraction) date/time is used instead. Example: an employee's location changes in one of the employee master tables at the source, but the source did not capture the date/time of the change. To apply a Type II effect on the employee dimension, the synch date/time is used to represent when the location change occurred. This option is used only when date/time information is needed for the warehouse dimension and is not present in the source tables.

When the source transaction has date and time information and the fact is loaded with no aggregation.

If date/time information is not present, the synch date/time is used as the time the record changed or the event occurred.

Store Date keys in the Fact Table

Consider using a meaningful surrogate key for the Date dimension as an integer with the

structure YYYYMMDD. This makes it easy for ad hoc users and analysts to understand

the underlying date. You cannot use this method if both date and time are stored in one

dimension, because the structured key would be too long.

Typically a surrogate key should not carry any business meaning and should not be used in business rules. Dates are an exception, for ease of use: the suggested format for the Date dimension surrogate key is YYYYMMDD, stored as a 4-byte int.

Similarly, for the Time dimension the suggested format is HHMMSS; the time "4:23:47 PM" would have a surrogate key of 162347.

Although these keys are readable, avoid doing calculations on the surrogate keys directly without joining to the corresponding Date or Time dimension.
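A minimal sketch of deriving such keys in T-SQL during the load; the staging table and datetime column names are illustrative assumptions.

Select CONVERT(int, CONVERT(char(8), TransactionDateTime, 112)) As DateKey,                  -- YYYYMMDD
       CONVERT(int, REPLACE(CONVERT(char(8), TransactionDateTime, 108), ':', '')) As TimeKey -- HHMMSS
From   StagingTransaction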

A date dimension with one record per day will suffice if users do not need time granularity finer than a single day

If the business analysis or reports do not need the exact time of day the transaction occurred, then the Time dimension is not needed for this fact.

A separate Time dimension table should be constructed if a fine time granularity, such as minute or second, is needed

Consider mapping a Time dimension key to the facts only when the business requirements demand fine granularity of the facts. Granularity down to the second may significantly increase the data volume in the warehouse. If time granularity is not required, you can pre-aggregate the fact table to the day grain.


If exact event time is needed in the warehouse, the source system must provide data at the necessary grain

If the business needs, for example, the number of books purchased between 3 PM and 4 PM on a given day, then the transaction information extracted from the source tables must carry granularity at least down to the hour. Business analysis requirements can be met only within the limits of what is available in the source systems.

Guidelines on when to add timekey information to facts/dimensions

Just because a time-related field is available to put into the fact table does not mean it needs to be there. The following guidelines help you decide whether to add time key information:

Consider mapping to the Time dimension only if there is a need to aggregate based on Time dimension attributes. For example, a business requirement to group a report by hour would require the Hour attribute from the Time dimension.

If there is no granularity to the hour, minute, or second level in the fact table, do not map a Time dimension key.

Summarizing data for a range of days requires joining only the date dimension table to the fact table

Even if fact table rows have a grain of seconds, you do not need to join to the Time

dimension if you only need to aggregate at the day or higher grain.

Example

For example, suppose that the fact table contains the detailed information that Book1 sold at 5 hours 30 minutes 28 seconds on 2/1/2006. When these transactions are aggregated at the end of the week to the day level, the time information is dropped and only the day information is kept. In this case, there is no need to join to the Time dimension.

Example (aggregated)

Number of books sold on 2/1/2006
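A minimal sketch of such a day-level summary; the table and column names (SalesFact, DimDate, DateKey, CalendarDate, Quantity) are illustrative assumptions.

Select d.CalendarDate,
       SUM(f.Quantity) As BooksSold
From   SalesFact f
Inner Join DimDate d on d.DateKey = f.DateKey   -- no join to the Time dimension is needed
Where  d.CalendarDate between '20060201' and '20060207'
Group By d.CalendarDate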

Add convenient attributes to Date and Time dimensions

It is often convenient to maintain extra attribute columns in the Date and Time dimensions for business users, such as flags indicating whether a transaction happened on a holiday, on a specific day in the financial cycle, or at a peak time (for example, from 10 AM to 2 PM). You could even store aggregated values, such as the total number of holidays in the month, in each row of the Date dimension.

With multiple time zones, store local and GMT times.

If data comes from multiple time zones, adjust all times to a standard time zone, such as GMT. You can also store an additional attribute describing how to handle Daylight Saving Time for the source. With multiple time zones, consider using GMT time in the dimension and storing the time zone delta in a fact table column; in this case, the time zone shift is similar in function to a Rapidly Changing Dimension. Special care is needed when reporting times across zones, because the date can easily be misrepresented if the wrong time attribute (GMT or local) is used.


Take particular care when dealing with day aggregations when multiple time zones

are involved. Does the day begin at local time, or at a world-wide standard time?

Fact tables

Fact tables contain data that describes specific events within a business, such as bank transactions or product sales.

Types of Fact tables

There are basically two ways transaction-based fact table records are sourced, plus

another way for non-transaction counts.

Transaction table: These are the transactions as they happen on the source. Each transaction and its measures, such as price, are extracted, transformed, and loaded into the fact table at regular intervals. Ideally, if 100 transactions happened on a given day on the source, and the fact table is designed at the same level of granularity, you would see 100 records added to the transaction fact table once the ETL process finishes loading.

Snapshot table: This does not correspond exactly to a transaction on the source; it is generally used to capture the state of the system. Example: capturing every customer's account balance as of the end of the day. Even when there is no transaction on a particular account, the snapshot fact table gathers the current balance of every account, so that the balance of a given account on a given date is known exactly.

Factless fact: This fact is generally not sourced from an OLTP transaction. Occasionally you need only to capture whether an event happened. Example: to capture employee attendance, you only need to know which employees came to work and which were absent. Absent employees do not appear on the attendance sheet, so their rows are prepared separately, with the help of the employee table and the list of employees who attended, before loading into the factless fact. Factless facts, as the name says, have no measures. They are often used with counts, particularly distinct counts, for example, how many distinct employees worked during a given week.

Ideal fact measures are additive in nature, but it is not a requirement that a fact table have numeric measures. For example, consider the list of computers added to and deleted from an environment. The fact table may capture which computers were added and deleted; no measure is stored in the fact table for this kind of transaction, and the count of computers becomes the measure.

Identifying Facts and Measures

The attributes of a fact table generally consist of the surrogate keys of the related dimension tables.

The OLTP transactions or OLTP snapshots are the source of fact records and determine which business keys make up each transaction. During the load process, these business keys are matched to the business keys of the respective dimensions to look up the corresponding surrogate keys.

Numeric values that result from an OLTP transaction typically end up as measures in the fact table. One fact table can have more than one measure, including custom calculated measures based on measures derived from the OLTP system.

OLTP transactions are almost always appended. Existing records are not updated.


Calculated measures that involve only values from a single row may be precomputed during the load process and stored in the fact table, or they may be computed on the fly when used. Determining which measures should be precomputed is a design consideration; there are other considerations in addition to the usual tradeoff between storage space and computation time.

o All the facts have a specific granularity

o A set of foreign keys constitutes a logical concatenated key

o The fact table typically contains the largest volume of data

The needs of the analyst must be supported by the facts in the fact table; there must be measures that are relevant to the business goals the organization seeks to fulfill

Facts are measurable

Facts are by nature dynamic and variable over time

Fact tables have an effectively unlimited number of rows, as opposed to dimension tables, which have a finite cardinality, even if very large.

Fact data represents the measurable results of a business event or a business state, and dimensional data provides the context for that event or state.

Fact events vary depending on the business

It is important to understand the basic business process and identify the business events and states when developing a dimensional data warehouse.

Fact tables should not contain descriptive information or any data other than the numerical measurement fields and the foreign key pointers to the corresponding entries in the dimension tables.

Identifying Measures

The most useful measures to include in a fact table are numbers that are additive.

o Additive measures: Allow summary information to be obtained by adding various quantities of the measure, such as the sales of a specific item at a group of stores for a particular time period.

o Semi-additive measures: Additive across all dimensions except time. The classic example of a semi-additive fact is a daily account balance. It is perfectly legitimate to add the balances of every account on a specific day (the fact is additive across the Account dimension), but the end-of-month balance of a savings account is not the sum of its balances over the past month (the fact is non-additive across time). When aggregating across time you need a special function, such as the last non-empty value, the first non-empty value, or the average. If you include semi-additive facts in a dimensional database, you should ideally use a front-end tool that manages the appropriate semi-additive aggregation. At the very least, the front-end tool should warn users if they attempt to summarize a semi-additive measure across time.

o Non-additive measures: Measures that cannot be summarized the way additive and semi-additive measures can. In general, you want to convert non-additive measures to additive measures. You can often do that by adding a weighting factor: rather than store a rate, store the rate multiplied by a weight. After summing the weighted rate and the weights, you can easily calculate a weighted average rate for any level of summarization, as sketched below.
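A minimal sketch of the weighted-average technique; the fact table RateFact and its columns (ProductKey, Quantity, WeightedRate, where WeightedRate = Rate * Quantity is stored at load time) are illustrative assumptions.

-- Both Quantity and WeightedRate are additive, so the average rate
-- can be recovered at any level of summarization
Select ProductKey,
       SUM(WeightedRate) / NULLIF(SUM(Quantity), 0) As AverageRate
From   RateFact
Group By ProductKey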


Each fact table also includes a multipart index that contains as foreign keys the primary

keys of related dimension tables, which contain the attributes that describe the fact

records.

Granularity

Granularity is defined as the lowest level of detail of the data in the fact tables, particularly the measures. The more granular the data, the more detailed your reports and analysis can be. However, excess granularity consumes space and increases complexity. It is a trade-off, because once you aggregate a fact table, you cannot decompose the aggregates.

There are three types of grain: Transaction, Periodic Snapshots and Accumulating

Snapshots.

Transactions - This is the basic detail record of the warehouse, as it happened in the OLTP system. Example: a customer purchases a book, or money is transferred from savings to checking. The time grain of a transaction is essentially continuous.

Periodic Snapshots - This is a set of detail records that is repeated over time, for example to capture the number of existing customers or the inventory of the system at the end of each day. The grain of a snapshot is the interval at which the snapshot is taken, whether hourly, daily, weekly, monthly, or yearly.

Accumulating Snapshot - This is a snapshot record that can change over time. The common example is student enrollment: the student's data does not change, but the dates for enrollment, admission, and similar milestones are filled in as they occur.

Fact Table Guidelines

Consider the following guidelines when designing your fact tables:

Aggregating Data in Fact Tables

Data is generally aggregated to store summarized fact measures over a period of time, such as the number of units of a product sold in a given month. The source for these aggregations is the detailed granular fact table, which collects transactional information from the source at regular intervals. In some data warehouses the aggregation is performed directly by the ETL process, which moves the data into an aggregated fact table when the detailed transactional records have no value for the report users. For example, executives may not need to see the number of units sold during the sixth hour of a day. Consider the following guidelines when designing fact tables that contain aggregate data:

Aggregating data in the fact table should only be done after considering the consequences. Aggregation is the process of calculating summary data from detail records. It is often tempting to reduce the size of fact tables by aggregating data into summary records when the fact table is created, but detailed information is then no longer directly available to the analyst. If detailed information is needed, the detail rows that were summarized will have to be identified and located, possibly in the source system that provided the data. Fact table data should be maintained at the finest granularity feasible.

Avoid aggregation in warehouse tables if you will use an OLAP cube as the source for reports and analysis. In this case, the cube provides the aggregation functionality, but if you need to get to the details, they are still available in the fact table.


If the reports query the underlying relational structure then aggregation will make queries faster

It would make sense to archive only aggregated information rather than all the warehouse records.

To enhance relational queries, you may want to store the summary records in separate tables

Mixing aggregated and detailed data in the fact table can cause issues and complications when using the data warehouse. For example, a sales order often contains several line items and may contain a discount, tax, or shipping cost that is applied to the order total instead of individual line items, yet the quantities and item identification are recorded at the line item level. Summarization queries become more complex in this situation. If you are using a tool such as Analysis Services, leave the aggregation to the tool

Determine whether a source numeric field is a measure or an attribute

Sometimes during the design process, it is unclear whether a numeric data field from a

production data source is a measured fact or an attribute. Generally, if the numeric data

field is a measurement that changes each time we sample it, it is a fact. If it is a discretely

valued description of something that is more or less constant, it is a dimension attribute.

One way of making the distinction is whether it is a value you would want to use as a

grouping item for a report. If it is, then it is an attribute. Often, an aggregatable measure

also appears as an attribute by bucketing the values. For example, order quantity is

typically a measure, but Order Size (i.e., Order Quantity bucketed into 5 discrete buckets)

is an attribute.

Storage is cheap and time is expensive:

This is true of most data warehouse processes. It means that you should store the measures needed for reports as part of the fact measures. Ideally, all measures come directly from source transactions, but any calculations on these measures can be computed during the load and stored in additional measure columns of the fact. This reduces report query time because no further computation is involved. If reports query the relational tables directly, create aggregation tables with pre-computed totals so that reports can query at different grains with no computation during the reporting query.

Retain all measures coming from source in your fact

Retain the original measurement details that are used for calculating the additive and non-additive fields; they can be used for later aggregations. Example: if you keep all the measurements available from the source in the fact table, any custom measurement needed later can be added without changing the design of the fact table. This ensures that if a calculation is feasible against the source system, the warehouse provides the same capability.

EmployeeHours OLTP:

EmployeeId Date StartTime EndTime

34 1/1/2006 9 17

21 1/2/2006 12 19

56 1/1/2006 13 20.5


EmployeeHoursFact:

EmployeeKey DateKey NumberOfHours

11 45 8

12 46 7

13 45 7.5

In the fact table above, the original measures StartTime and EndTime are not included. Because of this, questions such as how many employees started working before 10 AM on 1/1/2006 cannot be answered, even though the number could be derived from the source schema. The guideline is that when you calculate custom measures for a fact, retain the source measures as part of the fact as well.

Keeping the source measures also makes it possible to show an audit trail of how the

value was calculated.

If the granularity of two fact tables is the same, consider combining them

As a general rule, when you start your dimensional modeling each business function should have its own fact table. As your dimensional model evolves, try combining fact tables where the granularity is the same. Combining the facts makes analysis easier and eliminates unnecessary joins.

InventoryFact

ProductKey DateKey Inventory

23 44 289

43 44 435

65 44 658

InvoiceFact

ProductKey DateKey Sold

23 44 32

43 44 45

65 44 59

InventoryFact and InvoiceFact have the same granularity (product, daily) and represent the same business entity, the product SKU. Even if nothing is sold for a product on a given day, a record can still be added to the combined fact, because the inventory fact is a snapshot fact that exists regardless of whether the product sold that day; the corresponding Sold measure is simply 0. The cardinality of the combined fact is therefore the same as that of InventoryFact.

The combined fact would look like the following


ProductSKUFact

ProductKey DateKey Inventory Sold

23 44 289 32

43 44 435 45

65 44 658 59

Combining these two fact tables into one also enables calculated measures, such as the percentage of SKUs sold today, that need measurements from both facts. Without the combination, this would require a cross-fact query and could lead to performance problems, as illustrated below.
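A minimal sketch of such a measure against the combined ProductSKUFact shown above; the exact definition of the percentage is an illustrative assumption.

Select DateKey,
       100.0 * SUM(Case When Sold > 0 Then 1 Else 0 End) / COUNT(*) As PctOfSkusSoldToday
From   ProductSKUFact
Group By DateKey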

Use indexed views to improve the performance of relational queries

The performance of queries against the relational data warehouse can be significantly improved when you use indexed views. An indexed view is a materialized, pre-computed result set comprising aggregated or joined data from fact and possibly dimension tables. It is one way of serving requests for summary values that were not otherwise pre-computed or aggregated.
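A minimal sketch of an indexed view over an assumed SalesFact table (the Quantity column is assumed NOT NULL); SQL Server requires WITH SCHEMABINDING, two-part table names, and a COUNT_BIG(*) column for indexed views that use GROUP BY.

Create View dbo.vw_DailySales
With SCHEMABINDING
As
Select DateKey,
       SUM(Quantity) As TotalQuantity,
       COUNT_BIG(*)  As RowCnt       -- required for an indexed view with GROUP BY
From   dbo.SalesFact
Group By DateKey
GO
Create Unique Clustered Index IX_vw_DailySales
    On dbo.vw_DailySales (DateKey)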

Fact table Index Guidelines

Consider a composite primary key for your fact table, typically composed of all the related dimension surrogate keys. Alternatively, you can add an identity column as the primary key of the fact table.

Some dimensional models also include business keys in the fact table to improve report performance. These business keys do not need to be part of the fact's composite key, because the corresponding dimension SKs are already declared in the composite key. A nonclustered index can cover these business keys for better report performance.

Business keys that are used by ad hoc queries should be part of a nonclustered index. Nonclustered indexes increase the efficiency and responsiveness of ad hoc queries. If non-key attributes are not covered by an index, queries fall back to a clustered index scan (assuming a clustered index exists), which is expensive. A sketch of these indexes appears after this list.
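The following T-SQL is a minimal sketch of these index guidelines; the table and column names (SalesFact, DateKey, CustomerSK, CustomerId) are illustrative assumptions, and the composite key assumes one row per customer per day.

-- Composite primary key (and clustered index) on the dimension surrogate keys
Alter Table dbo.SalesFact
    Add Constraint PK_SalesFact Primary Key Clustered (DateKey, CustomerSK)

-- Nonclustered index covering a business key used by ad hoc queries
Create Nonclustered Index IX_SalesFact_CustomerId
    On dbo.SalesFact (CustomerId)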

Avoid updating fact records

The fact table represents the transactional records created by source business processes. There should not be any record in the fact table that does not originate in a business transaction at the source. Likewise, any change to the value of a measure on the source system should result in a new record in the fact table, never an update to an existing one. If you use an OLAP tool, any update of fact table records (apart from an end date, if one exists) would require refreshing the cube, and refreshing a cube with a huge data volume may take a long time and may result in downtime of the system.

Keep the data types of the fact table columns as short as possible

Fact tables are already huge; for example, use tinyint where possible instead of int. It is suggested to adopt the same data types for measurements that the source system uses. If you create aggregate tables, be sure to enlarge the data types so they can handle the maximum aggregated values.


Fact loading should not create records that are not part of OLTP system

Capture only transactional records from OLTP. Avoid having the ETL process add new records that were not sourced from the OLTP systems. For example, if Customer1 purchased Book1, the fact table should have only one entry for this transaction. Adding records such as "Customer1 did not purchase Book2" or "Customer1 did not purchase Book3" adds no value and should be avoided.

Fact records are event records, that is, transactions processed on the OLTP system. No fact records should be created by the ETL; it only transforms them.

Business needs must determine the minimum granularity of the fact table

Fact table granularity is driven by business needs, although you cannot have finer granularity than the OLTP records provide. Example: if the business needs to track only aggregated values for the entire day rather than on an hourly basis, use the day-level aggregate as the base granularity.

Ensure fact table measures have relevance to the business goals

Though it is possible to create many measurements from the source measures, the fact measures should represent business requirements. Create only those measures that are required for the reporting and analytic needs of the business users.

Example: if Percentage Sold is not a requirement for the business users, do not create it as a measurement just because it can be derived from the available data.

Try to create additive measures and avoid non-additive measures in the fact tables where possible

The ideal fact table measures are numeric and additive, but some measurements are semi-additive or non-additive. Semi-additive measures can be summed across most dimensions, such as products, but not across time. Non-additive measurements include ratios, percentages, and the like, though all measures should still be numeric in nature. For non-additive and semi-additive measures, custom aggregations are created, and the original additive source measurements must remain part of the fact table so that those custom aggregations can be calculated.

Aggregation to higher grain

Aggregations of facts can be performed either by going to a higher grain within a dimension or by removing a dimension. If going to a higher grain, you may need to create a shrunken dimension table or split the higher-grain attributes out into a snowflaked table. As an example of aggregation on non-hierarchy attributes, a fact table may track products by color and location, so the product dimension has two surrogate-keyed records for two different colors of the same product at the same location. When the fact table is aggregated to remove color, there should be another dimension with only one record per product and location (with no color). As an example of aggregation on hierarchy attributes, consider a dimension containing city, state, and country. If the aggregation of the fact table is performed to the


state level, then this dimension cannot be used as-is, because its grain is finer than that of the aggregate. So, use a shrunken dimension or a snowflaked dimension table at the appropriate grain.

Determine when and where calculated measures are pre-computed.

Calculated measures are measures that are not directly available from the source but are computed from the measures that are. Example: Number Sold is the source measure; the percentage of items sold per city is a calculated measure based on the ratio of the items sold in the city to the total items sold on the given day.

Calculated measures can be created in the fact table during loading, which benefits reports that query the relational data warehouse tables directly.

If you are using an OLAP tool such as Analysis Services, it may be easier to do

calculations that involve aggregated values (such as percent of total) in the logical cube

design where both detailed and aggregated values are accessible.

Keys and Indexes

This section describes guidelines for creating keys and indexes on the different objects of the physical warehouse schema.

Types of Keys and Indexes

Consider the following types of keys:

Surrogate Key: A key that represents uniqueness in a dimension and acts as the means by which fact tables reference it. It is generally an integer data type that is incremented for each new row, acting as the identity column of the dimension. Surrogate keys are explained in detail in earlier sections of this chapter.

Natural Key: Also termed the business key. This is the primary key or unique key on the source system that identifies the entity uniquely. Typically it is also part of the physical implementation of the warehouse schema, participating in the dimension structure, but not as a key; the natural key is just another attribute of the dimension.

Consider the following types of Indexes:

Clustered: Indexes on the fact tables and dimensions are crucial in a warehouse and should be optimized for the data access queries. Typically the clustered index on a dimension is created on the surrogate key, and the surrogate keys used in the fact table form a composite key for the fact table.

Non-Clustered: There may be a requirement to create nonclustered indexes on measures and on other widely accessed columns of the fact table.

Covering Index: A covering index contains all the columns referenced by a query, in this context a query against the fact table. The idea behind creating a covering index is to boost performance for queries whose columns are not covered by any other index, at the cost of index size.

Key and Index Guidelines

Consider the following Key and Index guidelines for your dimensional model:


Avoid using GUIDs as keys. GUIDs may be used in data from distributed source

systems, but they are difficult to use as table keys. GUIDs use a significant amount of

storage (16 bytes each), cannot be efficiently sorted, and are difficult for humans to

read. Indexes on GUID columns may be relatively slower than indexes on integer

keys because GUIDs are four times larger. GUIDs are generally not accessed in queries, so there is no need to create indexes on GUID columns.

Create nonclustered indexes on the business keys if they are included in the facts

Create the clustered index on the dimension surrogate keys in the facts (or create the clustered index on an identity column and use nonclustered indexes for the surrogate-key foreign keys)

There is no need to create separate indexes on measures, because they are accessed via the surrogate keys or business keys

For each dimension table, add a clustered index on the surrogate key (primary key)

Make sure queries use index columns in the order in which the index was created. For example, if an index is created on columns F1, F2 in that order, have your query filter first on F1 and then on F2

Analyze the query execution plan for specific report queries and sample ad hoc queries before optimizing index usage

Keep index keys small; do not include all the columns of the fact table in a covering index

Use the Index Tuning Wizard to tune your indexes based on query load

Use indexed views to improve the performance of queries that access data through views