Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin...

44
Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie

Transcript of Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin...

Page 1: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server

2008 – Level 200

Colin Hardie

Page 2: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Agenda……or how long do I have to sit here for?

• Who am I?

• Introduction – What is a Slowly Changing Dimension (SCD)?

• Types of SCD

• The Kimball Approach to SCDs – modelling an SCD in the Data Warehouse

• Alternative designs and why you might want to use them

• SCDs in the OLAP layer

• Loading an SCD 1 – The SSIS SCD Wizard

• Loading an SCD 2 – BYO (Build Your Own!)

•Loading an SCD 3 – BYO Reloaded - The one that works! (I hope)

• Loading an SCD 4 – SQL Server 2008 TSQL Approach – the MERGE statement!

• Questions (and possibly answers)

Page 3: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Who am I?The name is familiar from somewhere…

• Senior Business Intelligence Consultant with IMGROUP• Specialising in SSIS and SSAS with a smattering of RDBMS• MCTS & MCITP qualified for Business Intelligence• Co-Author of the most popular post on Chris Webb’s Blog – OLAP Jokes

http://cwebbbi.spaces.live.com/Blog/cns!1pi7ETChsJ1un_2s41jm9Iyg!290.entry

A girl goes speed dating and after a run of duff prospects a dimension sits down opposite her and they engage in sprightly conversation.

After the event is over, her friend asks "Did you get that dimension's number? You two really seemed to hit it off"

"No" replies the girl "He was really deep at times - had a lot of levels - but you know parent-child dimensions, just too much of a mummy's boy"

Page 4: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Subject Matter Specific Humour……otherwise known as the tenuous link

Q: Why did the dimension take all day to take off its suit and put on a pair of jeans?

A: It was a slowly-changing dimension!

(Yeah it was Chris’s)

Page 5: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

What is a Slowly Changing Dimension?Time to Ask Kimball…or not

• Somewhat cryptically, Kimball has started defining SCDs as “nearly constant dimensions”*

• A more helpful definition might be “a dimension that records change over time”

• For most people in the BI community there are nominally 3 different types of SCD, each with a different method of handling change

• Type 1 where change is handled by overwriting old data, hence you always get the latest view

• Type 2 where change is stored using a combination of effective dates and new rows of data (more to follow!)

• Type 3 where change is handled by adding a new column!

•Wikipedia also defines Types 0, 4 and 6

• Type 0 – where nothing is done at all (so this is a dimension then???)

• Type 4 – where “history tables” are used to store old records

• Type 6 – which is a hybrid (invented by Kimball) of Types 1 + 2 + 3 = 6

* The Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)

Page 6: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

TYPES OF SCD

Page 7: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Types of SCDType 1 SCD Handling

TECHNIQUE: Replace an old dimension attribute value with a new one

• Example is the Customer Dimension in the Adventure Works DW database

• After a freak accident one day, Jon Yang becomes Jane Yang and the row then becomes

Page 8: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Types of SCDType 2 Handling

TECHNIQUE: Add a new row with all of the same values apart from the changed one

•Kimball states that “A type 2 response is the predominant technique for supporting [the representation of prior data correctly]”• Example is the Employee dimension in the Adventure Works Data Warehouse database

• One day after an unfortunate incident at the office party, Guy get re-assigned to a new department with a new boss and a new title.

• Note that the new row receives a new surrogate key, the old row now has CurrentFlag status of 0 and a non-null EndDate

Page 9: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Types of SCDType 3 Handling

TECHNIQUE: Add a new column to capture the previous state of the attribute

• This technique is used in those cases when we want the ability to compare results using both the old and current values of an attribute• Type 2 cannot support this as it partitions the Fact table by attribute value. It is equivalent to a UNION of results for all attribute values.• Type 3 is the equivalent of a JOIN of results for current and prior attribute values.

Page 10: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Types of SCD Type 6 Handling

TECHNIQUE: Add new rows and columns with updates (1 + 2 + 3)

• To be used solely in the case where the solution is required to preserve all historical changes whilst also allowing you to compare historical data with the current view

Page 11: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

SummaryWhat do we know so far?

• Slowly Changing Dimensions are a method of handling change in dimensional data over time

• There are various “Types” of SCD that each involve one or more techniques including updates, new rows and new columns

• The most commonly used are Type 1 and Type 2

• For the remainder of the session we will be focusing on Type 2 and how they can be implemented

Page 12: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

SCDs IN THE RELATIONAL LAYER

Page 13: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsThe Relational Layer – The Kimball Approach

• SCDs like all other Dimensions in the Kimball approach to Data Warehouse design are predominantly modelled using a de-normalized star schema.• This can lead to some data redundancy (e.g. repeated attribute values) but generally on the basis of ease of use and query performance (TSQL that is) this is acceptable.• Kimball specifically recommends against the use of normalized or snowflake schemas on the grounds that “since dimension tables are geometrically smaller than fact tables, improving storage efficiency…has virtually no impact“*

* Data Warehouse Toolkit, Second Edition (Ralph Kimball et al, 2002)

Page 14: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsThe Relational Layer – The Kimball Approach

• An important point to note about the Dimension table is the use of a surrogate key (EmployeeKey) rather than the natural business key• Without a surrogate key in place, Type 2 changes would not be possible as the business key could not be repeated in a new row without change, at which point it would represent something else• The surrogate key becomes the basis for the join to the fact table and for the partitioning of the fact table data based on dimensional change• Note that in a single dimension table, both Type 1 and Type 2 attributes can be present

Page 15: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsWhen one table is never enough…

• In general, using a Star-Schema dimension is the best approach• They are easier to understand and in a relational sense perform better• ETL is more straightforward• Hierarchy relationships can be easily seen by reading across a row• However, it should be noted that Kimball tends to exclusively talk about relational BI systems• The use of an OLAP layer will significantly mitigate against the problems of a multi-table design• In particular there are two main scenarios where a multi-table approach offers a better solution

i. A long chain of Type 2 hierarchy attributesii. A large, rapidly changing Type 2 dimension

Page 16: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsA Long Type 2 Hierarchy Chain

• Consider the following scenario…• A large multinational has implemented a central BI tool to manage its global sales force• The sales force has the following structure

Global Head of Sales

Regional Head

Country Manager

Area Manager

Team Manager

Sales Rep

Page 17: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsA Long Type 2 Hierarchy Chain

• In a Star Schema format this type of chain is modelled as one table

• If all of the managers and efficiency targets are modelled as type 2, then data explosion can occur very quickly, especially for large numbers of granularity members• Large amounts of data redundancy is introduced

Page 18: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsA Long Type 2 Hierarchy Chain

• In a Snowflake Scheme this is modelled as a series of tables

• This approach has the advantage that if a change occurs in a mid-hierarchy attribute, then the subsequent changes are restricted to that attribute and below• This not only reduces storage space but also the burden placed on the ETL system• Analysis Services also tends to process slightly more performantly against Snowflake schemas than Star schemas

Page 19: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsA Long Type 2 Hierarchy Chain

• Suppose we have the following row of data from the Star schema implementation of the dimension

SalesRepKey 1SalesRepName Joe BloggsPostalCodeAssignment WC1AssignedProduct1 ACME Vacuum CleanerAssignedProduct2 ACME Dust BusterTeamManagerKey 1TeamManagerName Bill BloggsDistrictName LondonTeamEfficiencyTarget 25AreaManagerKey 1AreaManagerName Bruce BloggsAreaName LondonAreaEfficiencyTarget 25CountryManagerKey 1CountryManagerName Jack BloggsCountryName UKCountryEfficiencyTarget 25RegionManagerKey 1RegionManagerName Jean-Claude BloggsRegionName EuropeRegionEfficiencyTarget 25GlobalHeadKey 1GlobalHeadName Wayne BloggsGlobalEfficiencyTarget 25

2Joe BloggsWC1ACME Vacuum CleanerACME Dust Buster1Bill BloggsLondon301Bruce BloggsLondon251Jack BloggsUK251Jean-Claude BloggsEurope251Wayne Bloggs25

1Joe BloggsWC1ACME Vacuum CleanerACME Dust Buster1Bill BloggsLondon301Bruce BloggsLondon251Jill BloggsUK251Jean-Claude BloggsEurope251Wayne Bloggs25

Page 20: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

TeamManagerKey 1 2TeamManagerName Bill Bloggs Bill BloggsDistrictName London LondonTeamEfficiencyTarget 25 30AreaManagerKey 1 1

TeamManagerKey 1 2 3TeamManagerName Bill Bloggs Bill Bloggs Bill BloggsDistrictName London London LondonTeamEfficiencyTarget 25 30 30AreaManagerKey 1 1 2

CountryManagerKey 1 2CountryManagerName Jack Bloggs Jill BloggsCountryName UK UKCountryEfficiencyTarget 25 25RegionManagerKey 1 1

Modelling Type 2 SCDsA Long Type 2 Hierarchy Chain

SalesRepKey 1SalesRepName Joe BloggsPostalCodeAssignment WC1AssignedProduct1 ACME Vacuum CleanerAssignedProduct2 ACME Dust BusterTeamManagerKey 1

TeamManagerKey 1TeamManagerName Bill BloggsDistrictName LondonTeamEfficiencyTarget 25AreaManagerKey 1

AreaManagerKey 1AreaManagerName Bruce BloggsAreaName LondonAreaEfficiencyTarget 25CountryManagerKey 1

CountryManagerKey 1CountryManagerName Jack BloggsCountryName UKCountryEfficiencyTarget 25RegionManagerKey 1

RegionManagerKey 1RegionManagerName Jean-Claude BloggsRegionName EuropeRegionEfficiencyTarget 25GlobalHeadKey 1

GlobalHeadKey 1GlobalHeadName Wayne BloggsGlobalEfficiencyTarget 25

SalesRepKey 1 2SalesRepName Joe Bloggs Joe BloggsPostalCodeAssignment WC1 WC1AssignedProduct1 ACME Vacuum Cleaner ACME Vacuum CleanerAssignedProduct2 ACME Dust Buster ACME Dust BusterTeamManagerKey 1 2

SalesRepKey 1 2 3SalesRepName Joe Bloggs Joe Bloggs Joe BloggsPostalCodeAssignment WC1 WC1 WC1AssignedProduct1 ACME Vacuum Cleaner ACME Vacuum Cleaner ACME Vacuum CleanerAssignedProduct2 ACME Dust Buster ACME Dust Buster ACME Dust BusterTeamManagerKey 1 2 3

AreaManagerKey 1 2AreaManagerName Bruce Bloggs Bruce BloggsAreaName London LondonAreaEfficiencyTarget 25 25CountryManagerKey 1 2

Page 21: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsA Rapidly Changing Type 2 Dimension

• Kimball describes a scenario where you find yourself with what he terms a “rapidly changing dimension” - essentially very large dimensions with many Type 2 attributes that change frequently e.g. a Customer dimension• For multi-million row dimension tables, the standard Type 2 handling techniques fall down due to the data explosion they generate• Kimball’s suggested solution is to split the dimensions in two, with 2 keys attached to the fact table instead of one• In one table the original surrogate key and all of the slowly changing Type 2 attributes are left. •In the other, all of the remaining rapidly changing Type 2 attributes are stored with another surrogate key. The relationship between the two is then maintained through the fact table• This technique should only be considered for those scenarios described as “rapidly changing monster dimension”

Page 22: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsA Rapidly Changing Type 2 Dimension

• The split dimension technique involves taking one dimension with a combination of slowly and rapidly changing Type 2 attributes and creating two

Page 23: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

SCDs IN THE OLAP LAYER

Page 24: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsThe OLAP Layer

• Modelling Type 2 SCDs in SSAS is extremely straight forward• Because we use a surrogate key for the granularity of the dimension in the relational layer, all of the hard work has already been done• The changing surrogate key means that over time the “parent” attributes to which this granularity attribute belongs will change over time seamlessly and with no extra effort on our part• If we want to see a view across time for a particular member then we can also create an attribute hierarchy based on the business key of the dimension

Page 25: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsThe OLAP Layer – An Example

• Suppose we are a pharmaceutical company who want to track sales of a particular drug over time• We have a Type 2 SCD Product dimension that records details of the drug over time• And a Fact table detailing drug sales over time

Page 26: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Modelling Type 2 SCDsThe OLAP Layer – An Example

DEMO

Page 27: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

LOADING SCDs

Page 28: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDOur example dimension

• Over the next few slides and demos we’ll be looking at a fairly realistic scenario. What are the methods we can use to load an SCD and in this case a hybrid Type 1/ Type 2 SCD• The dimension table I’ll using as my test subject is one I created by using the new Red Gate SQL Generator.• It looks like this

• It has 1 million rows• Each method of loading the table will employ the same data “feed”, a SQL table containing• 100, 000 new rows• 100, 000 existing rows• 100, 000 update rows

Page 29: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDThe SCD Wizard

• Out of the box, SSIS comes with the Slowly Changing Dimension task• It allows you to rapidly develop Slowly Changing Dimension handling in a very short amount of time• It has one major drawback though…

IT’S RUBBISH!!!

Page 30: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDThe SCD Wizard

• In all but the smallest of tables with very few updates per cycle the task does not perform at all • The reason for this poor performance is twofold: 1. The Algorithm - column by column comparison between the destination and

incoming rows can quickly become incredibly inefficient and a performance bottleneck

2. Row by Row Processing - When a row has been identified as an update row it is sent

to an OLE DB Command which performs a single update statement for that row. All the traditional advantages of SQL set based processing are lost

Page 31: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDThe SCD Wizard

DEMO

Page 32: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDReplacing the SCD Task

• Any solution to the performance problem of the SCD wizard must be able to solve the indicated problems, whilst still retaining the accuracy of the logic• Essentially this means that any solution must be able to do the following: 1. Be able to differentiate between new or existing rows2. Be able to remove the need for column by column comparison3. Be able to leverage set based rather than row by row based processing

Page 33: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDUse MERGE JOINS & HASHING

• We can create our own solution using MERGE JOIN tasks that will mimic the behaviour of the SCD Task• We can combine this with a workflow that allows us to use set based rather than row based processing • We can also use CHECKSUM or HASHBYTES to ensure we only have to compare one column

Page 34: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDCHECKSUM or HASHBYTES

• CHECKSUM and HASHBYTES are both hashing algorithms within TSQL• Used for combining many columns together in one “hash” value and building hash indexes over• For our purposes this is important – it allows us to combine all of the incoming attributes into one column, thus instead of checking multiple columns we only need to check one• There is a problem though, the number of “collisions” with CHECKSUM – the number of times the same hash value will be created by different combinations of values is too high• Microsoft recommends the use of the HASHBYTES function instead, along with the SHA1 algorithm

Page 35: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDUse Lookups

DEMO

Page 36: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDUse Lookups

• There are obviously problems with this technique!

1. HASHBYTES, the VARBINARY(8000) it creates and the function used to transform it is a major bottleneck..

2. MERGE JOIN is not a particularly efficient JOIN algorithm

Page 37: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDTweaking the Solution

• To get around the second point we can use Lookup tasks• The Lookup task essentially caches a whole table or query in memory at runtime and then compares incoming rows with the result of that cache• This means queries can be very quickly resolved• However, in very large dimensions the MERGE JOIN may perform better due to the memory limitation of the Lookup• As for the first point, in practice due to the size of hashing output we are better off comparing the separate columns (assuming they total less than 8000 bytes!)

Page 38: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDTweaking the Solution

DEMO

Page 39: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDSQL Server 2008 New TSQL Features

• In SQL Server 2008 (as you have doubtless seen elsewhere today) there are additions to Transact SQL (TSQL) that are relevant for Slowly Changing Dimension handling• Specifically these are

1. The MERGE statement2. The INSERT from DML output

Page 40: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDSQL Server 2008 New TSQL Features - MERGE

• The MERGE statement performs insert, update, or delete operations on a target table based on the results of a join with a source table.• The MERGE statement provides three types of WHEN clauses:

1. WHEN MATCHED enables you to UPDATE or DELETE the given row in the target table when the source and target rows match some criteria or criterion.

2. WHEN NOT MATCHED [BY TARGET] enables you to INSERT a row into the target when it exists in the source but not in the target.

3. WHEN NOT MATCHED BY SOURCE enables you to UPDATE or DELETE the given row in the target table when it exists in the target but not in the source.

• You can specify a search condition with each of the WHEN clauses to choose which type of DML operation should be performed on the row. • The OUTPUT clause for the MERGE statement includes a new virtual column called $action that you can use to identify the DML action that was performed on each row.

Page 41: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDSQL Server 2008 New TSQL Features - INSERT

• In SQL Server 2008, INSERT has the new ability to consume the results of DML statements•If this new capability is combined with the output of a MERGE, we have the ability to perform Type 2 inserts based on Type 2 updates performed in the MERGE• If these pieces of syntax are nested in the right way, then we can perform hybrid updates as we have been doing previously

Page 42: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

Loading a Hybrid Type1/Type 2 SCDSQL Server 2008 New TSQL Features - INSERT

DEMO

Page 43: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

SummaryWhat have we learned?

• Slowly Changing Dimensions and in particular Type 2 SCDs are a powerful tool in tracking history in a data warehouse• There are various ways of modelling Type 2 SCDs in the relational layer• The easiest to understand is the Star Schema approach advocated by Kimball and generally this is the best choice. However in certain scenarios, a Snowflake or Split Dimension technique is more suitable• Whichever method you choose to model, in the OLAP layer the hard work has already been done and the Type 2 model should just drop out• SQL Server 2008 offers several options in loading SCDs• The in built SCD Wizard should only be used for the smallest of dimensions• It is possible to construct a performant alternative using Lookups (or MERGE JOINS for very large dimensions) and set based UPDATE statements• MERGE and INSERT functionality offers a viable alternative for loading in environments where a code based ETL approach has been taken

Page 44: Choices, Choices - Implementing Slowly Changing Dimensions in SQL Server 2008 – Level 200 Colin Hardie.

THE END

QUESTIONS?