Dimensional Modeling Schema

  • 7/24/2019 Dimensional Modeling Schema

    1/23

    Dimensional Modeling Schema

    Written by DWBIConcepts Team

    Last Updated: 19 September 2014

    Now that we know the basic approach to dimensional modeling from our earlier article,
    let us spend some time understanding the various schemas possible in dimensional modeling.

    Requirement of different design schema

    In dimensional modeling, we can create different schemas to suit our requirements. We need
    various schemas to accomplish things such as accommodating the hierarchies of a dimension
    or maintaining change histories of information. In this article we will discuss 3
    different schemas, namely Star, Snowflake and Conformed, and we will also discuss how
    hierarchical information is modeled in these schemas. We will reserve the discussion on
    maintaining change histories for our next article.

    Storing hierarchical information in dimension tables

    From our previous article, we already know what a dimension is. Simply put, a dimension is
    something that qualifies a measure (a number). For example, if I say, "McDonald's sells 5000",
    that won't make any sense. But if I say, "McDonald's sells 5000 burgers per month", then
    that makes perfect sense. Here, "burger" and "month" are members of dimensions,
    and they qualify the number 5000 in this sentence.

    It is important to notice that "burger" and "month" are not dimensions themselves; they are
    just members of the dimensions "food" and "time" respectively. "Burger" is just one of
    the many kinds of "food" that McDonald's sells, and "month" is just one of the units by which
    time is measured. Typically a dimension has several members, and those members are
    stored in separate rows of the dimension table. So the "food" dimension table of
    McDonald's will have one row for burger, one row for fries, one row for drinks and so on.
    Similarly, the "time" dimension may contain 12 different months as its members.

    Often we find hierarchical relations among the members of a dimension.
    That is, certain members of the dimension can be grouped under one group whereas other
    members belong to a separate group. Consider this: french fries and twister fries
    are both "fries" and hence can be grouped under the same group, "fries". Similarly, chicken
    burger and fish burger can both be grouped as "burger".

    [Figure: French Fries and Twister Fries]

    This type of hierarchical relation can be stored in the model in two different
    ways. We can either store it in the same "food" dimension table (the star schema
    approach) or create a separate dimension table in addition to the "food" dimension,
    just to store the type of the food (the snowflake schema approach).

    STAR SCHEMA DESIGN

    Star schema is the simplest kind of schema, where one fact table sits in the center
    of the schema, surrounded by multiple dimension tables.

    In a star schema all the dimension tables are connected only with the fact table and no

    dimension table is connected with any other dimension table.

    Benefit of Star Schema Design

    Star schema is probably the most popular schema in dimensional modeling because of its
    simplicity and flexibility. In a star schema design, any dimension attribute can be reached
    from the fact table through a single join, which makes this type of schema well suited to
    information retrieval (faster query processing). Note that all the hierarchies (or levels) of the
    members of a dimension are stored in a single dimension table. That means, if
    you wish to group veggie burger and chicken burger in a "burger" category and french fries
    and twister fries in a "fries" category, you have to store that category information in the
    same dimension table.

    Star schema provides a de-normalized design

    Storing Hierarchy in star schema

    As depicted above, we will store hierarchical information in a flattened pattern in the single
    dimension table in star schema. So our food dimension table will look like this:

    Food

    KEY  NAME            TYPE
    1    Chicken Burger  Burger
    2    Veggie Burger   Burger
    3    French Fries    Fries
    4    Twister Fries   Fries

    SNOW-FLAKE SCHEMA DESIGN

    Snowflake schema is just like star schema; the difference is that here one or more
    dimension tables are connected to other dimension tables as well as to the central fact
    table. See the example of a snowflake schema below.

    Here we store the information in two dimension tables instead of one. We store
    the food type in one dimension (the "type" table shown below) and the food in another dimension.
    This is a snowflake design.

    Type

    KEY  TYPE_NAME
    1    Burger
    2    Fries

    Food

    KEY  TYPE_KEY  NAME
    1    1         Chicken Burger
    2    1         Veggie Burger
    3    2         French Fries
    4    2         Twister Fries

    If you are familiar with the concept of data normalization, you will recognize that
    snowflaking increases the level of normalization in the data. This has an obvious
    disadvantage in terms of information retrieval, since we need to read more tables (and
    traverse more SQL joins) to get the same information. For example, if you wish to find
    every food and food type sold from store 1, the SQL queries for the star and snowflake
    schemas will look like this:

    SQL Query For Star Schema

    -- Food name and type both come from the single food dimension.
    SELECT DISTINCT f.name, f.type
    FROM sales_fact t
    JOIN food f ON f.key = t.food_key
    WHERE t.store_key = 1

    SQL Query For SnowFlake Schema

    -- The type name needs one extra join to the type dimension.
    SELECT DISTINCT f.name, tp.type_name
    FROM sales_fact t
    JOIN food f ON f.key = t.food_key
    JOIN type tp ON tp.key = f.type_key
    WHERE t.store_key = 1

    As you can see in this example, compared to star schema, snowflake schema requires one
    more join (to connect one more table) to retrieve the same information. This is why
    snowflake schema generally performs worse than star schema for retrieval.
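
The comparison can be tried out with a small sketch using Python's built-in sqlite3 module. The table names food_star, food_snow and food_type, the column names, and the contents of sales_fact are assumptions made for illustration only; the article does not list the fact table's columns or rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Star schema: the food type is stored flat in the food dimension.
CREATE TABLE food_star (food_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
-- Snowflake schema: the type lives in its own dimension table.
CREATE TABLE food_type (type_key INTEGER PRIMARY KEY, type_name TEXT);
CREATE TABLE food_snow (food_key INTEGER PRIMARY KEY,
                        type_key INTEGER REFERENCES food_type(type_key),
                        name TEXT);
-- Hypothetical fact table and rows, invented for this sketch.
CREATE TABLE sales_fact (store_key INTEGER, food_key INTEGER, quantity INTEGER);

INSERT INTO food_star VALUES (1, 'Chicken Burger', 'Burger'), (2, 'Veggie Burger', 'Burger'),
                             (3, 'French Fries', 'Fries'),    (4, 'Twister Fries', 'Fries');
INSERT INTO food_type VALUES (1, 'Burger'), (2, 'Fries');
INSERT INTO food_snow VALUES (1, 1, 'Chicken Burger'), (2, 1, 'Veggie Burger'),
                             (3, 2, 'French Fries'),   (4, 2, 'Twister Fries');
INSERT INTO sales_fact VALUES (1, 1, 10), (1, 3, 25), (2, 2, 7);
""")

# Star schema: one join from the fact table to the food dimension.
star_rows = con.execute("""
    SELECT DISTINCT f.name, f.type
    FROM sales_fact s
    JOIN food_star f ON f.food_key = s.food_key
    WHERE s.store_key = 1
    ORDER BY f.name""").fetchall()

# Snowflake schema: the same question needs a second join to reach the type.
snow_rows = con.execute("""
    SELECT DISTINCT f.name, t.type_name
    FROM sales_fact s
    JOIN food_snow f ON f.food_key = s.food_key
    JOIN food_type t ON t.type_key = f.type_key
    WHERE s.store_key = 1
    ORDER BY f.name""").fetchall()

print(star_rows)
print(snow_rows)
```

Both queries return the same rows; the only difference is the extra join in the snowflake version.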

    Then why do we use snowflake schema? Let me give a quick and short answer. I
    won't explain it in detail right now but will leave it to you to work through. Suppose
    we have another fact table with the granularity store, food type and
    day. This fact table will use the key of the "type" dimension table instead of the "food" dimension table.
    Unless you have this dimension table in your schema, you won't have a "type" key to use. This is
    the reason we need to snowflake the "food" dimension into a "type" dimension.

    In our next article we will talk about preserving history in dimension tables (slowly or
    rapidly changing dimensions etc.).

    History Preserving in Dimensional Modeling

    Written by DWBIConcepts Team

    Last Updated: 15 September 2014

    In our earlier article we have seen how to design a simple dimensional data model for a
    point-of-sale system (as an example we took the case of a McDonald's fast-food shop). In this
    article we will begin with the same model and see how we may enhance it
    to store historical changes in the attributes of a dimension table.

    Nothing Lasts Forever

    One of the important objectives of data modeling is to develop a model that can
    capture the states of the system with respect to time. You know, nothing lasts forever!
    Product prices change over time; people change their addresses, marital status, employers
    and even their names. If you are doing data modeling for a data warehouse, where we
    are particularly interested in historical analysis, it is crucial to develop some
    method of capturing these changes in the data model. As an example, let's say we store the
    price of products in the "Food" dimension table that we created earlier and we want to be
    able to capture the historical changes in "Food" price. In this article we will see what changes
    we need to make in our data model to be able to do this.

    Note: The simple "Food" dimension we created earlier did not have any "Price" information.
    But to illustrate the point of this article, we will add a "price" column to our "Food"
    dimension table. So henceforth our "Food" dimension table will look like this:

    Food

    KEY  NAME            TYPE_KEY  PRICE
    1    Chicken Burger  1         3.70
    2    Veggie Burger   1         3.20
    3    French Fries    2         2.00
    4    Twister Fries   2         2.20

    In case you have not read the previous article and are wondering what "TYPE_KEY" means,
    it is a foreign key to another table that contains the type of the food, e.g.
    Burger, Fries etc. Also notice that the above table only tells us the price of the food as of the current
    point in time. It does not tell us what the price was, let's say, 6 months ago. If the price of
    Veggie Burger changes from $3.20 to $3.25 tomorrow, the new price will be updated in the
    table and we will then have no way to know what the earlier price was. So our objective is
    to change the above table structure so that we can store all the historical and
    future prices of the foods.

    Types of Changing Dimensions

    There are a few different ways to store the historical changes of values in a data model, and
    the way you adopt will depend on the type of changing dimension. For
    example, some dimensions can change quite rapidly, some dimensions do not change at all,
    but most dimensions change very slowly. That is why we differentiate dimensions into
    the 3 types described below.

    Unchanging Dimension

    There are some dimensions that do not change at all. For example, let's say you have
    created a dimension table called "Gender". Below are the structure and data of this
    dimension table:

    Gender

    ID  VALUE
    1   Male
    2   Female

    The "Value" column in the above dimension is the attribute of this table that won't normally
    change. This is an unchanging dimension: "male" will always be called "male" and "female"
    will always be called "female". Of course, one may for some reason wish to change
    the texts "Male"/"Female" to something else, e.g. "man"/"woman". But that is really not a
    change we should be concerned about, as such changes do not alter the "meaning" of
    the attribute (the words man/male still mean the same thing). So if such a change needs to
    be made, we can simply update the "Value" column in the dimension table. For all practical
    purposes, this dimension remains an "Unchanging dimension".

    Slowly Changing Dimension

    Here comes the most popular dimension: the "slowly changing dimension". These are
    dimensions where one or more attributes can change slowly with respect to time. Look at
    the "food" dimension from our earlier example. "Price" is one such attribute which is
    variable in this dimension. But the price of french fries or burgers does not change very often;
    maybe it changes once in a season. This is an example of a slowly changing dimension.

    Let me give you one more example. Let's say you have created a dimension table on
    employees, and in the "employee" dimension you have a column called "Marital_Status".
    This can definitely change (from unmarried to married, for example) with respect to time.
    But again, like the previous example, this is a slowly changing attribute; it doesn't change
    often.

    Later in the article, we will see how to make necessary changes in our dimension table

    design to store history for such slowly changing dimensions.

    Rapidly Changing Dimensions

    If you design a dimension table that has a rapidly changing attribute, then your dimension
    table becomes a rapidly changing dimension.

    For example, let's say you have a "Subscriber" dimension where you store the details of
    all the subscribers to a particular pre-paid mobile service plan. You have a "status" column
    in the "Subscriber" dimension table which can take several different values based on the
    current account balance of the subscriber. For example, if the balance is less than $0.10,
    the status becomes "No Outgoing call". If the balance is less than $5, the status becomes
    "Restricted Call Service". If the balance is less than $10, the status becomes "No Long
    Distance Call", and if the balance is greater than $10 the status becomes "Full Service",
    etc. Every month, the status of any subscriber changes multiple times based on
    his or her account balance, thereby making the "Subscriber" dimension a rapidly
    changing dimension.
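
The status rule above can be written as a small function. This is only a sketch: the function name subscriber_status and the sample balances are made up, and since the text does not say what happens at a balance of exactly $10, the sketch assumes it counts as "Full Service".

```python
def subscriber_status(balance: float) -> str:
    """Map a pre-paid account balance (in dollars) to the status from the example."""
    if balance < 0.10:
        return "No Outgoing call"
    if balance < 5:
        return "Restricted Call Service"
    if balance < 10:
        return "No Long Distance Call"
    # Assumption: a balance of exactly $10 is treated as full service.
    return "Full Service"


# Hypothetical balances of one subscriber over a month (top-ups and usage).
balances = [12.00, 8.50, 4.00, 0.05, 20.00]
statuses = [subscriber_status(b) for b in balances]

# Count how often the status attribute changed from one reading to the next.
changes = sum(1 for before, after in zip(statuses, statuses[1:]) if before != after)
print(statuses, changes)
```

Five balance readings already produce four status changes, which is why such an attribute makes the dimension rapidly changing.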

    One must remember that the way we design a rapidly changing dimension is often quite
    different from the way we design a slowly changing dimension. In the next article, however,
    we will only look into designing slowly changing dimensions.

    Dimensional Modeling Approach for Various Slowly Changing Dimensions

    Written by DWBIConcepts Team

    Last Updated: 15 September 2014

    In our earlier article we discussed the need for storing historical information in
    dimension tables. We also learnt about the various types of changing dimensions. In this
    article we will pick the "slowly changing dimension" only and learn in detail about the various types
    of slowly changing dimensions and how to design them.

    Slowly changing dimensions, referred to as SCD henceforth, can basically be modeled in 3
    different ways based on whether we want to store full history, partial history or no
    history. These types are called Type 2, Type 3 and Type 1 respectively. Next we
    will learn about them in detail.

    Also note that there are slight variations on the 3 basic SCD types that I show here. These
    variations (sometimes labelled Type 4, 5, 6, 7 etc.) differ mostly in terms of
    implementation and use cases. Don't worry about them now.

    SCD Type 1

    As mentioned above, we design a dimension as SCD Type 1 when we do not want to store
    the history. That is, whenever some attribute values are modified, we just
    overwrite the old values with the new values and do not care about storing the previous
    history.

    We do not store any history in SCD Type 1

    Please note that this is not the same as the "Unchanging Dimension" discussed in the previous article.
    In the case of an unchanging dimension, we assume that the values of the attributes of that
    dimension will not change at all. In the case of an SCD Type 1
    dimension, on the other hand, we assume that the values of the attributes will change slowly, but we are
    not interested in storing those changes. We are only interested in storing the current or latest
    value. So every time a value changes, we overwrite the old value with the new one.

    Handling SCD Type 1 Dimension in ETL Process

    Technically, from an ETL design perspective (if you don't know what ETL is, you don't
    have to bother about this paragraph and can go to the next section), SCD Type 1
    dimensions are loaded using a "Merge" operation, which is also known as "UPSERT",
    short for "Update else Insert".

    SCD Type 1 dimensions are loaded by Merge operations

    In the "UPSERT" method, each row coming from the source is compared with the records
    present in the target dimension table, based on the natural key, to check whether the source
    record already exists in the target. If the row exists in the target, the target row is
    updated with the new values coming from the source system. If the row is not present in
    the target, the source row is inserted into the target table.

    ANSI SQL has a particular statement that helps you achieve the UPSERT
    operation. It is called the "MERGE" statement:

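    MERGE INTO Target_Dimension_Table tgt
    USING source_table src
       ON tgt.natural_key = src.natural_key
    -- Row already exists in the target: overwrite its attributes (no history kept).
    WHEN MATCHED THEN
       UPDATE SET column1 = src.value1,
                  column2 = src.value2, ...
    -- Row is new: insert it, including its natural key for later comparisons.
    -- Target column names in the INSERT list are not qualified with the alias.
    WHEN NOT MATCHED THEN
       INSERT (natural_key, column1, column2, ...)
       VALUES (src.natural_key, src.value1, src.value2, ...)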

    As is obvious from this example, you have to store the natural key of the data in the target
    dimension table in order to perform this comparison. Later, I will write a separate article on
    ETL architecture design, where I will talk about this in more detail. But from a modeling
    perspective, please note that as a data modeler you should add one extra column to your
    target dimension table as a placeholder for the natural key of the data.
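
As a rough illustration of "update else insert" keyed on the natural key, here is a sketch using Python's built-in sqlite3 module. It does not use the MERGE statement itself; instead it issues an UPDATE and falls back to an INSERT when no row matched. The table name food_dim, the helper upsert and the natural key codes "VB" and "CB" are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE food_dim (
    food_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- key generated in the warehouse
    natural_key TEXT UNIQUE,                        -- identifier from the source system
    name        TEXT,
    price       REAL)""")


def upsert(con, natural_key, name, price):
    """SCD Type 1 load: overwrite the row if the natural key exists, else insert it."""
    cur = con.execute(
        "UPDATE food_dim SET name = ?, price = ? WHERE natural_key = ?",
        (name, price, natural_key))
    if cur.rowcount == 0:  # no existing row matched the natural key
        con.execute(
            "INSERT INTO food_dim (natural_key, name, price) VALUES (?, ?, ?)",
            (natural_key, name, price))


upsert(con, "VB", "Veggie Burger", 3.20)   # new row
upsert(con, "CB", "Chicken Burger", 3.70)  # new row
upsert(con, "VB", "Veggie Burger", 3.25)   # existing row, price overwritten

rows = con.execute(
    "SELECT natural_key, price FROM food_dim ORDER BY food_key").fetchall()
print(rows)
```

After the third call the table still has two rows, and the old Veggie Burger price of 3.20 is gone, which is exactly the "no history" behaviour of Type 1.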

    SCD Type 2

    Arguably, this is the most popular type of slowly changing dimensions. So we will try to

    learn this as clearly as possible.

    Let me take a step back here and remind you of our objective.
    As you may recall, in the previous articles we learnt how the values of the
    attributes (or columns) in a dimension table change with time. We are trying to store the
    histories of such changes for the purpose of analysis.

    In Type 1, we were not storing any history. Now we are going to learn how
    we may design a dimension table so that we can store the full history and extract the
    history of changes as and when we require it. We will take our "Food" dimension table as
    an example here, where "Price" is the variable attribute.

    Food

    KEY  NAME            TYPE_KEY  PRICE
    1    Chicken Burger  1         3.70
    2    Veggie Burger   1         3.20
    3    French Fries    2         2.00
    4    Twister Fries   2         2.20

    Design of SCD Type 2 Dimension

    In order to design the above table as SCD Type 2, we have to add 3 more columns to
    this table: "Date From", "Date To" and "Latest Flag". These columns are called Type 2
    metadata columns. See below:

    Food

    KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    LATEST_F
    1    Chicken Burger  1         3.70   01-Jan-11  31-Dec-99  Y
    2    Veggie Burger   1         3.20   01-Jan-11  31-Dec-99  Y
    3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
    4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y

    Notice how the values of these 3 new columns are populated. In the very beginning,
    when a new record is loaded into the table, we default "Date
    From" to the date of loading, "Date To" to some far-future date (e.g., 31st
    December 2099) and "Latest Flag" to "Y".

    What is the meaning of these 3 metadata columns?

    These 3 columns tell us whether a particular record in the table is the latest or not and
    the time period during which the record was the latest (also known as its active period).
    For example, the data in the above table says that all 4 records are the latest (active)
    and that they are active from the day of loading (in this case 1st January 2011) until an
    indefinite future date (31st December 2099).

    But how do these columns help us store the change history?

    Let's assume that today is 15 March 2011 and McDonald's has decided to increase the price of
    "Veggie Burger" from $3.20 to $3.25. When this happens, we do not simply overwrite the
    price of $3.20 with $3.25. Instead, to store this new information (and also keep the old
    information), we insert a new record in the "Food" dimension table, which will look like
    this:

    Food

    KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    LATEST_F
    1    Chicken Burger  1         3.70   01-Jan-11  31-Dec-99  Y
    2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  N
    3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
    4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y
    5    Veggie Burger   1         3.25   15-Mar-11  31-Dec-99  Y

    Observe the change in the records with Key 2 and 5. Record 2, which was the original
    record for the veggie burger, has been updated: its Latest Flag has become "N" and its
    "Date To" value has changed to "14-Mar-11". This means Record 2 is no longer
    the latest or active record (Latest Flag = "N") and that it was active during the period 1st Jan
    2011 (Date From) to 14 Mar 2011 (Date To).

    So, if Record 2 is not active, what is the latest record for "Veggie Burger" now? Record 5!
    Its Latest Flag is set to "Y" and it says that the record has been active since 15 March 2011.

    This record will remain active far into the future (until 31 Dec 2099), or at
    least until a new record is inserted with Latest Flag "Y" and this record is updated
    to Latest Flag "N". So next time, let's say on 20 Dec 2011, McDonald's
    decides to change the price of Veggie Burger back to $3.20 and increase the price of the
    chicken burger from $3.70 to $3.80. We will then see 2 more new records in the table, as
    below:

    Food

    KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    LATEST_F
    1    Chicken Burger  1         3.70   01-Jan-11  19-Dec-11  N
    2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  N
    3    French Fries    2         2.00   01-Jan-11  31-Dec-99  Y
    4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  Y
    5    Veggie Burger   1         3.25   15-Mar-11  19-Dec-11  N
    6    Chicken Burger  1         3.80   20-Dec-11  31-Dec-99  Y
    7    Veggie Burger   1         3.20   20-Dec-11  31-Dec-99  Y

    As you can see from the design above, it is now possible to go back to any date in
    history and find out what the value of the "Price" attribute of the "Food" dimension was at
    that point in time.
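
A point-in-time lookup of this kind might look like the sketch below, which loads the seven rows of the table above into an in-memory SQLite database. The dates are stored as ISO text (YYYY-MM-DD) so that they compare correctly as strings; that storage choice, the column name food_key and the helper price_as_of are assumptions of this sketch.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE food (
    food_key INTEGER PRIMARY KEY, name TEXT, type_key INTEGER, price REAL,
    date_from TEXT, date_to TEXT, latest_flag TEXT)""")
con.executemany("INSERT INTO food VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, "Chicken Burger", 1, 3.70, "2011-01-01", "2011-12-19", "N"),
    (2, "Veggie Burger",  1, 3.20, "2011-01-01", "2011-03-14", "N"),
    (3, "French Fries",   2, 2.00, "2011-01-01", "2099-12-31", "Y"),
    (4, "Twister Fries",  2, 2.20, "2011-01-01", "2099-12-31", "Y"),
    (5, "Veggie Burger",  1, 3.25, "2011-03-15", "2011-12-19", "N"),
    (6, "Chicken Burger", 1, 3.80, "2011-12-20", "2099-12-31", "Y"),
    (7, "Veggie Burger",  1, 3.20, "2011-12-20", "2099-12-31", "Y"),
])


def price_as_of(name, day):
    """Return (surrogate key, price) of the row that was active on the given day."""
    return con.execute(
        "SELECT food_key, price FROM food "
        "WHERE name = ? AND ? BETWEEN date_from AND date_to",
        (name, day)).fetchone()


print(price_as_of("Veggie Burger", "2011-02-01"))   # row active in February 2011
print(price_as_of("Veggie Burger", "2011-06-01"))   # row active in June 2011
print(price_as_of("Chicken Burger", "2012-01-01"))  # row active in January 2012
```

Because the active periods of one item never overlap, each lookup matches exactly one row.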

    Surrogate key for SCD Type 2 dimension

    Note from the above example that each time we generate a new row in the dimension
    table, we also assign a new key to the record. This is the key that flows down to the fact
    table in a typical star schema design. The values of this key, that is, numbers like 1, 2, 3,
    ..., 7, do not come from the source systems. Instead, they are
    sequential running numbers generated automatically at the time of inserting
    the records. These numbers are unique, so that they identify each record in the
    table, and are called the "Surrogate Key" of the table.

    Evidently, multiple surrogate keys may be related to the same item; however, each key
    relates to one particular state of that item in time. In the above example, keys 2, 5 and 7
    are all linked to "Veggie Burger", but they represent the state of the record in 3 different
    time spans. It is worth noting that there will be only one record with Latest Flag = "Y"
    among the multiple records of the same item.

    Alternate Design of SCD Type 2: Addition of Version Number

    A slight variation on the SCD Type 2 design is possible, in which we store the
    version numbers of the records. The initial record is called version 1 and, as and when
    new records are generated, we increment the version number by 1. In this design
    pattern, the record with the highest version number for an item is always its latest record. If we use this
    design in our earlier example, the dimension table will look like this:

    Food

    KEY  NAME            TYPE_KEY  PRICE  DATE_FROM  DATE_TO    VERS
    1    Chicken Burger  1         3.70   01-Jan-11  19-Dec-11  1
    2    Veggie Burger   1         3.20   01-Jan-11  14-Mar-11  1
    3    French Fries    2         2.00   01-Jan-11  31-Dec-99  1
    4    Twister Fries   2         2.20   01-Jan-11  31-Dec-99  1
    5    Veggie Burger   1         3.25   15-Mar-11  19-Dec-11  2
    6    Chicken Burger  1         3.80   20-Dec-11  31-Dec-99  2
    7    Veggie Burger   1         3.20   20-Dec-11  31-Dec-99  3

    Of course, we can also keep the "Latest Flag" column in the above table if we wish.

    Handling SCD Type 2 Dimension in ETL Process

    Again, if you do not know what ETL is, you can safely skip this section. But if you have
    some ETL background, then I suppose you have already spotted that, unlike
    SCD Type 1, Type 2 requires you to insert new records into the table as and when any
    attribute changes. In SCD Type
    1, we were only updating the record. Here, we need to update the old record (i.e.
    change the Latest Flag from "Y" to "N" and update the "Date To") and also
    insert a new record.

    As before, we can use the natural key to first check whether the source record exists in
    the target. If not, we simply insert the record into the target with a new surrogate
    key. If it already exists in the target, we have to check whether any attribute value
    has changed between source and target. If not, we can ignore the source record. If so,
    we have to mark the existing record as no longer latest (Latest Flag "N", with its "Date To"
    closed) and insert a new record with a new
    surrogate key. As I mentioned before, I will write a separate article on the ETL handling
    later.
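
The load logic just described can be sketched in a few lines of Python with sqlite3. This is a simplified illustration, not a full ETL design: the function load_scd2, the natural key code "VB", the ISO date storage and the convention of closing the old row on the day before the new row starts (as in the tables above) are assumptions of this sketch.

```python
import sqlite3
from datetime import date, timedelta

FAR_FUTURE = "2099-12-31"  # the "far future" Date To used for active rows

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE food (
    food_key    INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    natural_key TEXT, name TEXT, price REAL,
    date_from   TEXT, date_to TEXT, latest_flag TEXT)""")


def load_scd2(con, natural_key, name, price, load_date):
    """Apply one source record to the Type 2 dimension."""
    current = con.execute(
        "SELECT food_key, name, price FROM food "
        "WHERE natural_key = ? AND latest_flag = 'Y'",
        (natural_key,)).fetchone()
    if current is not None:
        if (current[1], current[2]) == (name, price):
            return  # no attribute changed: ignore the source record
        # Close the existing row on the day before the new row becomes active.
        day_before = (load_date - timedelta(days=1)).isoformat()
        con.execute(
            "UPDATE food SET date_to = ?, latest_flag = 'N' WHERE food_key = ?",
            (day_before, current[0]))
    # New item, or changed item: insert a fresh row with a new surrogate key.
    con.execute(
        "INSERT INTO food (natural_key, name, price, date_from, date_to, latest_flag) "
        "VALUES (?, ?, ?, ?, ?, 'Y')",
        (natural_key, name, price, load_date.isoformat(), FAR_FUTURE))


load_scd2(con, "VB", "Veggie Burger", 3.20, date(2011, 1, 1))   # first load
load_scd2(con, "VB", "Veggie Burger", 3.20, date(2011, 2, 1))   # unchanged, ignored
load_scd2(con, "VB", "Veggie Burger", 3.25, date(2011, 3, 15))  # price change

history = con.execute(
    "SELECT food_key, price, date_from, date_to, latest_flag "
    "FROM food ORDER BY food_key").fetchall()
print(history)
```

The result mirrors the Veggie Burger rows in the article's example: the original row is closed on 14 March 2011 with flag "N", and a new row with the new price is active from 15 March 2011.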

    Performance Considerations of SCD Type 2 Dimension

    SCD Type 2, by design, tends to increase the volume of the dimension tables considerably.
    Think of this: let's say you have an "employee" dimension table which you have designed
    as SCD Type 2. The employee dimension has 20 different attributes, and 10 of these
    attributes change at least once a year on average (e.g. employee
    grade, manager's name, department, salary, band, designation etc.). This means that if you
    have 1,000 employees in your company, at the end of just one year you are going to have
    roughly 10,000 records in this dimension table (assuming that each of the 10 attribute
    changes per employee per year results in a separate row in the dimension table).

    As you can see, this is not good performance-wise, since it can considerably slow
    down the loading of your fact table, which requires you to "look up" this dimension table during
    the fact load. One may argue that, even if we have 10,000 records, we will actually
    have only 1,000 records with Latest_Flag = 'Y', and since we only look up records with
    Latest_Flag = 'Y', the performance will not deteriorate. This is not entirely true. While using
    the Latest_Flag = 'Y' filter may decrease the size of the lookup cache, the database will
    generally need to do a full table scan (FTS) to identify the latest records. Moreover, in many
    cases the ETL developer will not be able to use the Latest_Flag = 'Y' filter, because
    the transactional records do not always belong to the latest period (e.g. late-arriving fact records,
    or loading the fact table at a later point in time, such as a month-end or week-end load). In those
    cases, a Latest_Flag = 'Y' filter is functionally incorrect, as you should determine
    the correct key to return on the basis of the "Date From" and "Date To" columns. (If you do not
    understand what I am talking about in this paragraph, just ignore it for now. I am going to
    explain these things later in some other article.)
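
To make the late-arriving case concrete, here is a small sketch that resolves the surrogate key for a sale both ways. It uses only the three Veggie Burger rows from the earlier example; the sale date of 10 April 2011, the ISO date storage and the two helper functions are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE food (
    food_key INTEGER PRIMARY KEY, name TEXT, price REAL,
    date_from TEXT, date_to TEXT, latest_flag TEXT)""")
con.executemany("INSERT INTO food VALUES (?, ?, ?, ?, ?, ?)", [
    (2, "Veggie Burger", 3.20, "2011-01-01", "2011-03-14", "N"),
    (5, "Veggie Burger", 3.25, "2011-03-15", "2011-12-19", "N"),
    (7, "Veggie Burger", 3.20, "2011-12-20", "2099-12-31", "Y"),
])


def key_by_latest_flag(name):
    """Lookup that always returns the currently active row."""
    return con.execute(
        "SELECT food_key FROM food WHERE name = ? AND latest_flag = 'Y'",
        (name,)).fetchone()[0]


def key_by_date(name, sale_date):
    """Lookup that returns the row active on the date of the transaction."""
    return con.execute(
        "SELECT food_key FROM food "
        "WHERE name = ? AND ? BETWEEN date_from AND date_to",
        (name, sale_date)).fetchone()[0]


# A sale made on 10 April 2011 that is only loaded into the fact table in 2012.
late_sale_date = "2011-04-10"
flag_key = key_by_latest_flag("Veggie Burger")
date_key = key_by_date("Veggie Burger", late_sale_date)
print(flag_key, date_key)
```

The flag-based lookup returns key 7 (price $3.20), while the date-based lookup returns key 5 (price $3.25), the row that was actually in force when the sale happened.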

    SCD Type 3

    As I mentioned before, the Type 3 design is used to store partial history. Although it is theoretically
    possible to use the Type 3 design to store full history, that is not
    practical. So, what is the Type 3 design? In the Type 2 design above, we have seen that whenever
    the values of the attributes change, we insert new rows into the table. In the case of Type 3,
    however, we add a new column to the table to store the history.

    So let's say we have a table which initially has 2 columns, "Key" and "Attribute".

    KEY  ATTRIBUTE
    1    A
    2    B
    3    C

    If record 1 changes its attribute from A to D, we add one extra column to the table
    to store this change.

    KEY  ATTRIBUTE  ATTRIBUTE_OLD
    1    D          A
    2    B
    3    C

    If the record changes its attribute value again, we again have to add a column to store the
    history of the changes.

    KEY  ATTRIBUTE  ATTRIBUTE_OLD  ATTRIBUTE_OLD_1
    1    E          D              A
    2    B
    3    C

    Isn't then SCD Type 3 very cumbersome?

    As you can see, storing the history in terms of changing the structure of the table in this

    way is quite cumbersome and after the attributes are changed a few times the table will

    become unnecessarily big and fat and difficult to manage. But that does not mean SCD Type

    3 design methodology is completely unusable. In fact, it is quite usable in a particular

    circumstance - where we just need to store the partial history information.

    Let's think about a special circumstance where we only need to know the "current value"

    and "previous value" of an attribute. That is, even though the value of that attribute may

    change numerous times, at any time we are only concerned about its current and previous

    values. In such circumstances, we can design the table as type 3 and keep only 2 columns -

    "current value" and "previous value" like below.

    KEY  CURRENT_VALUE  PREVIOUS_VALUE
    1    D              A
    2    B
    3    C
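
    The current/previous layout above can be sketched in a few lines of Python. This is only an illustration, assuming each dimension row is held as a dictionary; the function name apply_type3_change is hypothetical and not part of any tool.

```python
def apply_type3_change(row, new_value):
    """Return a copy of row with the old current value shifted to PREVIOUS_VALUE."""
    if new_value == row["CURRENT_VALUE"]:
        return dict(row)  # nothing changed, so there is no history to shift
    updated = dict(row)
    updated["PREVIOUS_VALUE"] = row["CURRENT_VALUE"]
    updated["CURRENT_VALUE"] = new_value
    return updated


# Record 1 starts with value A and no previous value.
record = {"KEY": 1, "CURRENT_VALUE": "A", "PREVIOUS_VALUE": None}
after_first = apply_type3_change(record, "D")        # current D, previous A
after_second = apply_type3_change(after_first, "E")  # current E, previous D
```

    After the second change the original value A is no longer available anywhere, which is exactly the "partial history" trade-off of this design.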

    I can't think of a very good example of this scenario right away. However, I can give you one

    example from one of my previous projects in the telecom domain, where a certain calculated

    field in the report depended on the latest and previous values of the customer status.

    That calculated attribute was called "Churn Indicator" (churn in the telecom business generally

    means leaving a telephone connection), and the rule to populate the churn indicator was (in

    a very simplified way) as below:

    Churn Indicator

    = "Voluntary Churn"

    (if customer's current status = 'Inactive' and previous status = 'Active')

    = "Involuntary Churn"

    (if customer's current status = 'Inactive' and previous status = 'Suspended')
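
    The rule above translates directly into code. The sketch below is a minimal Python version; the function name churn_indicator is hypothetical, and returning None for every other status combination is an assumption, since the simplified rule does not say what happens in those cases.

```python
def churn_indicator(current_status, previous_status):
    """Derive the churn indicator from only the current and previous status."""
    if current_status == "Inactive" and previous_status == "Active":
        return "Voluntary Churn"
    if current_status == "Inactive" and previous_status == "Suspended":
        return "Involuntary Churn"
    return None  # assumption: the simplified rule defines no other outcomes


voluntary = churn_indicator("Inactive", "Active")
involuntary = churn_indicator("Inactive", "Suspended")
```

    Both inputs come straight from the two columns of a type 3 row, so no join against a history table is needed.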

    As you can guess, in order to find out the correct value of the churn indicator, you do not need

    to know the complete history of changes to the customer's status. All you need to know is the

    current and previous status. In this kind of partial history scenario, SCD Type 3 design is

    very useful.

    Note that, compared to SCD Type 2, type 3 does not increase the number of records in the

    table, which eases performance concerns.

    Now that we have learnt about slowly changing dimensions, next we will

    discuss how to design a "Rapidly Changing Dimension" or RCD.

    What are Slowly Changing Dimensions?

    Slowly Changing Dimensions (SCD) are dimensions that change slowly over time, rather than on a regular, time-based schedule. In a data warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users to assign the proper dimension attribute value for a given date. Examples of such dimensions are customer, geography and employee.

    There are many approaches to dealing with SCD. The most popular are:

    Type 0 - The passive method

    Type 1 - Overwriting the old value

    Type 2 - Creating a new additional record

    Type 3 - Adding a new column

    Type 4 - Using historical table

    Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)

    Type 0 - The passive method. In this method no special action is performed upon dimensional changes. Some dimension data can remain the same as when it was first inserted, while other data may be overwritten.

    Type 1 - Overwriting the old value. In this method no history of dimension changes is kept in the database. The old dimension value is simply overwritten by the new one. This type is easy to maintain and is often used for data whose changes are caused by processing corrections (e.g. removing special characters, correcting spelling errors).

    Before the change:

    Customer_ID  Customer_Name  Customer_Type
    1            Cust_1         Corporate

    After the change:

    Customer_ID  Customer_Name  Customer_Type
    1            Cust_1         Retail
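
    As a rough illustration of type 1, the update is a plain in-place overwrite. The sketch below assumes the dimension is a list of dictionaries; apply_type1_change is a hypothetical helper name.

```python
def apply_type1_change(dimension, customer_id, **new_attributes):
    """Overwrite attributes of the matching row in place; no history is kept."""
    for row in dimension:
        if row["Customer_ID"] == customer_id:
            row.update(new_attributes)
    return dimension


customers_t1 = [
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Corporate"},
]
apply_type1_change(customers_t1, 1, Customer_Type="Retail")
```

    The row count stays the same and the old value "Corporate" cannot be recovered afterwards.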

    Type 2 - Creating a new additional record. In this methodology all history of dimension changes is kept in the database. You capture an attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or other durable identifier). Also 'effective date' and 'current

    indicator' columns are used in this method. There can be only one record per natural key with the current indicator set to 'Y'. For the 'effective date' columns, i.e. start_date and end_date, the end_date of the current record is usually set to 9999-12-31. Introducing changes to the dimensional model in type 2 can be a very expensive database operation, so it is not recommended for dimensions where a new attribute could be added in the future.

    Before the change:

    Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date    Current_Flag
    1            Cust_1         Corporate      22-07-2010  31-12-9999  Y

    After the change:

    Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date    Current_Flag
    1            Cust_1         Corporate      22-07-2010  17-05-2012  N
    2            Cust_1         Retail         18-05-2012  31-12-9999  Y
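
    The before/after tables can be reproduced with a short Python sketch. It assumes Customer_ID is the surrogate key and Customer_Name stands in for the natural key, as in the example; the helper name apply_type2_change and the "max key + 1" surrogate key generation are illustrative assumptions only.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended End_Date used for the current row


def apply_type2_change(dimension, customer_name, new_type, change_date):
    """Expire the current row for customer_name and append a new current row."""
    current = next(
        (r for r in dimension
         if r["Customer_Name"] == customer_name and r["Current_Flag"] == "Y"),
        None,
    )
    if current is None:
        raise ValueError(f"no current row for {customer_name!r}")
    if current["Customer_Type"] == new_type:
        return dimension  # attribute unchanged, nothing to record

    # Close the old version the day before the change takes effect.
    current["End_Date"] = change_date - timedelta(days=1)
    current["Current_Flag"] = "N"

    # The new version gets a new surrogate key; the natural key is carried over.
    dimension.append({
        "Customer_ID": max(r["Customer_ID"] for r in dimension) + 1,
        "Customer_Name": customer_name,
        "Customer_Type": new_type,
        "Start_Date": change_date,
        "End_Date": HIGH_DATE,
        "Current_Flag": "Y",
    })
    return dimension


customers_t2 = [{
    "Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Corporate",
    "Start_Date": date(2010, 7, 22), "End_Date": HIGH_DATE, "Current_Flag": "Y",
}]
apply_type2_change(customers_t2, "Cust_1", "Retail", date(2012, 5, 18))
```

    Unlike type 1, every change adds a row, so the table grows with the number of changes rather than with the number of customers.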

    Type 3 - Adding a new column. In this type usually only the current and previous values of the dimension attribute are kept in the database. The new value is loaded into the 'current/new' column and the old one into the 'old/previous' column. Generally speaking, the history is limited to the number of columns created for storing historical data. This is the least commonly needed technique.

    Before the change:

    Customer_ID  Customer_Name  Current_Type  Previous_Type
    1            Cust_1         Corporate     Corporate

    After the change:

    Customer_ID  Customer_Name  Current_Type  Previous_Type
    1            Cust_1         Retail        Corporate

    Type 4 - Using historical table. In this method a separate historical table is used to track all historical changes to a dimension's attributes, with one such table for each dimension. The 'main' dimension table keeps only the current data, e.g. customer and customer_history tables.

    Current table:

    Customer_ID  Customer_Name  Customer_Type
    1            Cust_1         Corporate

    Historical table:

    Customer_ID  Customer_Name  Customer_Type  Start_Date  End_Date
    1            Cust_1         Retail         01-01-2010  21-07-2010
    1            Cust_1         Other          22-07-2010  17-05-2012
    1            Cust_1         Corporate      18-05-2012  31-12-9999
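
    A minimal Python sketch of the two-table approach follows. It assumes the current table is a dictionary keyed by Customer_ID and the history table is a list of rows that also holds the open-ended current version, as in the example above; apply_type4_change is a hypothetical name.

```python
from datetime import date, timedelta

OPEN_END = date(9999, 12, 31)  # End_Date of the still-valid history row


def apply_type4_change(current, history, customer_id, new_type, change_date):
    """Overwrite the current table and record the change in the history table."""
    # Close the open history row for this customer, if there is one.
    for row in history:
        if row["Customer_ID"] == customer_id and row["End_Date"] == OPEN_END:
            row["End_Date"] = change_date - timedelta(days=1)
    # Append the new version to the history table.
    history.append({
        "Customer_ID": customer_id,
        "Customer_Name": current[customer_id]["Customer_Name"],
        "Customer_Type": new_type,
        "Start_Date": change_date,
        "End_Date": OPEN_END,
    })
    # The main table keeps only the latest value.
    current[customer_id]["Customer_Type"] = new_type


customer = {1: {"Customer_Name": "Cust_1", "Customer_Type": "Other"}}
customer_history = [
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Retail",
     "Start_Date": date(2010, 1, 1), "End_Date": date(2010, 7, 21)},
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Customer_Type": "Other",
     "Start_Date": date(2010, 7, 22), "End_Date": OPEN_END},
]
apply_type4_change(customer, customer_history, 1, "Corporate", date(2012, 5, 18))
```

    Queries that only need the latest state read the small current table, while historical reports go to customer_history.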

    Type 6 - Combine approaches of types 1,2,3 (1+2+3=6). In this type the dimension table has the following additional columns:

    current_type - for keeping the current value of the attribute. All history records for a given item have the same current value.

    historical_type - for keeping the historical value of the attribute. The history records for a given item can have different values.

    start_date - for keeping the start date of the 'effective date' of the attribute's history.

    end_date - for keeping the end date of the 'effective date' of the attribute's history.

    current_flag - for keeping information about the most recent record.

    In this method, to capture an attribute change we add a new record as in type 2. The current_type information is overwritten with the new value as in type 1. We store the history in a historical column as in type 3.

    Customer_ID  Customer_Name  Current_Type  Historical_Type  Start_Date  End_Date    Current_Flag
    1            Cust_1         Corporate     Retail           01-01-2010  21-07-2010  N
    2            Cust_1         Corporate     Other            22-07-2010  17-05-2012  N
    3            Cust_1         Corporate     Corporate        18-05-2012  31-12-9999  Y
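
    The three ingredients of type 6 can be seen side by side in a small Python sketch. As before, this is only an illustration under assumptions: rows are dictionaries, Customer_Name plays the role of the natural key, and apply_type6_change is a hypothetical helper.

```python
from datetime import date, timedelta

END_OF_TIME = date(9999, 12, 31)


def apply_type6_change(dimension, customer_name, new_type, change_date):
    """Apply one attribute change using the combined type 1 + 2 + 3 approach."""
    rows = [r for r in dimension if r["Customer_Name"] == customer_name]
    current = next((r for r in rows if r["Current_Flag"] == "Y"), None)
    if current is None:
        raise ValueError(f"no current row for {customer_name!r}")

    # Type 2 part: expire the current row.
    current["End_Date"] = change_date - timedelta(days=1)
    current["Current_Flag"] = "N"

    # Type 1 part: overwrite Current_Type on every existing row of this customer.
    for r in rows:
        r["Current_Type"] = new_type

    # Type 2 + 3 part: new row, with the value of that period in Historical_Type.
    dimension.append({
        "Customer_ID": max(r["Customer_ID"] for r in dimension) + 1,
        "Customer_Name": customer_name,
        "Current_Type": new_type,
        "Historical_Type": new_type,
        "Start_Date": change_date,
        "End_Date": END_OF_TIME,
        "Current_Flag": "Y",
    })
    return dimension


customers_t6 = [
    {"Customer_ID": 1, "Customer_Name": "Cust_1", "Current_Type": "Other",
     "Historical_Type": "Retail", "Start_Date": date(2010, 1, 1),
     "End_Date": date(2010, 7, 21), "Current_Flag": "N"},
    {"Customer_ID": 2, "Customer_Name": "Cust_1", "Current_Type": "Other",
     "Historical_Type": "Other", "Start_Date": date(2010, 7, 22),
     "End_Date": END_OF_TIME, "Current_Flag": "Y"},
]
apply_type6_change(customers_t6, "Cust_1", "Corporate", date(2012, 5, 18))
```

    Starting from the two rows valid before 18-05-2012, the call produces the three rows shown in the table above.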

    (C) 2008-2009 www.datawarehouse4u.info. All Rights Reserved.

    Junk Dimension

    In data warehouse design, we frequently run into a situation where there are yes/no indicator fields in the source system. Through business analysis, we know it is necessary to keep such information in the fact table. However, if we keep all those indicator fields in the fact table, not only do we need to build many small dimension tables, but the amount of information stored in the fact table also increases tremendously, leading to possible performance and management issues.

    A junk dimension is the way to solve this problem. In a junk dimension, we combine these indicator fields into a single dimension. This way, we'll only need to build a single dimension table, and the number of fields in the fact table, as well as the size of the fact table, can be decreased. The content of the junk dimension table is the combination of all possible values of the individual indicator fields.

    Let's look at an example. Assume that we have the following fact table:

    In this example, TXN_CODE, COUPON_IND, and PREPAY_IND are all indicator fields. In this existing format, each one of them is a dimension. Using the junk dimension principle, we can combine them into a single junk dimension, resulting in the following fact table:

    Note that now the number of dimensions in the fact table went from 7 to 5.

    The content of the junk dimension table would look like the following:

    In this case, we have 3 possible values for the TXN_CODE field, 2 possible values for the COUPON_IND field, and 2 possible values for the PREPAY_IND field. This results in a total of 3 x 2 x 2 = 12 rows for the junk dimension table.

    By using a junk dimension to replace the 3 indicator fields, we have decreased the number of dimensions by 2 and also decreased the number of fields in the fact table by 2. This results in a data warehousing environment that offers better performance and is easier to manage.
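
    The 12 rows are simply the cross product of the indicator values, which can be generated as in the sketch below. The concrete values ("1", "2", "3" and "Y", "N") and the JUNK_ID column name are placeholders chosen for illustration; only the counts 3, 2 and 2 come from the example.

```python
from itertools import product

# Placeholder values: the example only states how many values each field has.
TXN_CODES = ["1", "2", "3"]
COUPON_INDS = ["Y", "N"]
PREPAY_INDS = ["Y", "N"]

# One junk dimension row per combination of indicator values.
junk_dimension = [
    {"JUNK_ID": junk_id, "TXN_CODE": txn, "COUPON_IND": coupon, "PREPAY_IND": prepay}
    for junk_id, (txn, coupon, prepay) in enumerate(
        product(TXN_CODES, COUPON_INDS, PREPAY_INDS), start=1
    )
]

# Lookup used when loading the fact table: three indicators in, one key out.
junk_key = {
    (row["TXN_CODE"], row["COUPON_IND"], row["PREPAY_IND"]): row["JUNK_ID"]
    for row in junk_dimension
}
```

    During the fact load, the three source indicators are replaced by the single key found in junk_key.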

    Conformed Dimension

    A conformed dimension is a dimension that has exactly the same meaning and content when referred to from different fact tables. A conformed dimension can refer to multiple tables in multiple data marts within the same organization. For two dimension tables to be considered conformed, they must either be identical or one must be a subset of the other. There cannot be any other type of difference between the two tables. For example, two dimension tables that are exactly the same except for the primary key are not considered conformed dimensions.

    Why is a conformed dimension important? This goes back to the definition of a data warehouse being "integrated." Integrated means that even if a particular entity had different meanings and different attributes in the source systems, there must be a single version of this entity once the data flows into the data warehouse.
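
    One way to read the "identical or subset" rule is as a row-level check, sketched below in Python. This is an assumption about how to interpret the rule, not a standard test: it treats each table as a set of rows (keys included) and ignores the case where one table carries only a subset of the columns. The name is_conformed is hypothetical.

```python
def is_conformed(dim_a, dim_b):
    """True if the two tables are identical or one's rows are a subset of the other's."""
    rows_a = {tuple(sorted(row.items())) for row in dim_a}
    rows_b = {tuple(sorted(row.items())) for row in dim_b}
    return rows_a <= rows_b or rows_b <= rows_a


full_dim = [
    {"Date_Key": 1, "Month": "Jan"},
    {"Date_Key": 2, "Month": "Feb"},
]
mart_dim = [{"Date_Key": 1, "Month": "Jan"}]      # subset of full_dim
rekeyed_dim = [{"Date_Key": 9, "Month": "Jan"}]   # same content, different key

subset_ok = is_conformed(full_dim, mart_dim)
rekeyed_ok = is_conformed(full_dim, rekeyed_dim)
```

    Because the primary key is part of each compared row, two tables that differ only in their keys fail the check, in line with the example in the paragraph above.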

    The time dimension is a common conformed dimension in an organization. Usually the only rules to consider with the time dimension are whether there is a fiscal year in addition to the calendar year, and the definition of a week. Fortunately, both are relatively easy to resolve. In the case of fiscal vs. calendar year, one may go with either fiscal or calendar, or an alternative is to have two separate conformed dimensions, one for fiscal year and one for calendar year. The definition of a week is also something that can differ in large organizations: Finance may use Saturday to Friday, while Marketing may use Sunday to Saturday. In this case, we should decide on a definition and move on. The nice thing about the time dimension is that once these rules are set, the values in the dimension table will never change. For example, October 16th will never become the 15th day in October.

    Not all conformed dimensions are as easy to produce as the time dimension. An example is the customer dimension. In any organization with some history, there is a high likelihood that different customer databases exist in different parts of the organization. Achieving a conformed customer dimension means that those data must be compared against each other, rules must be set, and data must be cleansed. In addition, when we are doing incremental data loads into the data warehouse, we'll need to apply the same rules to the new values to make sure we are only adding truly new customers to the customer dimension.

    Building a conformed dimension is also part of the process in master data management, or MDM. In MDM, one must not only make sure the master data dimensions are conformed, but that conformity needs to be brought back to the source systems.
