IST722 Data Warehousing
Dimensional Modeling
Michael A. Fudge, Jr.
Pop Quiz: T/F
1. The business meaning of a fact table row is known as a dimension.
2. A dimensional data model is optimized for maximum query performance / ease of use.
3. An attribute is a business performance measurement.
4. Order date & Shipping date use the same data. This is an example of a conformed dimension.
5. A degenerate dimension represents a dimensional key with no attributes.
Pop Quiz: T/F - Answers
1. The business meaning of a fact table row is known as a dimension. False (Fact table grain)
2. A dimensional data model is optimized for maximum query performance / ease of use. True
3. An attribute is a business performance measurement. False (Fact)
4. Order date & Shipping date use the same data. This is an example of a conformed dimension. True
5. A degenerate dimension represents a dimensional key with no attributes. True
Objective: Define and Explain “dimensional modeling”
Recall: Kimball Lifecycle
Dimensional Modeling
• A logical design technique for structuring data with the following objectives:
  1. Intuitive: Easy for business users to understand
  2. Fast: Excellent query performance
E-R Models vs. Dimensional Models
Entity-Relationship
• Complex.
• Designed to eliminate data redundancy.
• Optimized for storage.
• Supports transaction processing.
• Operational data.
• Highly normalized.
Dimensional
• Easy to understand.
• Designed to support data redundancy.
• Optimized for information retrieval.
• Decision support processing.
• De-normalized.
Components of the Dimensional Model
• Fact Table – A database table of quantifiable performance measurements (facts).
  o Ex. Sales Amount, Days To Ship, Quantity on Hand
• Dimension Table – A table of contexts for the facts.
  o Ex. Date/Time, Location, Customer, Product
• Attribute – A characteristic of a dimension.
  o Ex. Product: Name, Category, Department
• Star Schema – Connections among facts and dimensions which define a business process.
  o Ex. Sales, Inventory Management
I like to think about it this way:
• Facts are the business process measurement events.
• Dimensions provide the context for that event.
“How many sneakers did we sell last week?”
• “How many”: Quantity (Fact)
• “sneakers”: Product Type (Attribute of a Product Dimension)
• “last week”: Duration of Time (Attribute of a Sales Date Dimension)
• “sell”: Business Process (Sales)
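The sneaker question can be sketched as a query against a tiny star schema. This is only an illustration using Python's built-in sqlite3 module; the table names (fact_sales, dim_product, dim_date), the week_label attribute, and the sample rows are all assumptions invented for this sketch, not part of the course material.

```python
import sqlite3

# In-memory database holding a hypothetical three-table star schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    product_type TEXT
);
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,
    full_date  TEXT,
    week_label TEXT
);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER
);
INSERT INTO dim_product VALUES (1, 'Trail Runner', 'Sneakers');
INSERT INTO dim_product VALUES (2, 'Dress Oxford', 'Shoes');
INSERT INTO dim_date VALUES (20130401, '2013-04-01', 'Last Week');
INSERT INTO dim_date VALUES (20130408, '2013-04-08', 'This Week');
INSERT INTO fact_sales VALUES (20130401, 1, 3);
INSERT INTO fact_sales VALUES (20130401, 2, 5);
INSERT INTO fact_sales VALUES (20130401, 1, 2);
INSERT INTO fact_sales VALUES (20130408, 1, 4);
""")


def sneakers_sold_last_week(connection):
    """Sum the quantity fact, constrained by two dimension attributes."""
    row = connection.execute("""
        SELECT SUM(f.quantity)
        FROM fact_sales f
        JOIN dim_product p ON p.product_key = f.product_key
        JOIN dim_date d ON d.date_key = f.date_key
        WHERE p.product_type = 'Sneakers'
          AND d.week_label = 'Last Week'
    """).fetchone()
    return row[0]


total = sneakers_sold_last_week(conn)
```

The fact (quantity) is aggregated, while the dimension attributes only filter which fact rows take part.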
3 Types of Facts
• Additive
  o Fact can be summed across all dimensions.
  o The most useful kind of fact.
  o Ex. Quantity Sold, Hours Billed
• Semi-Additive
  o Cannot be summed across all dimensions, such as time periods.
  o Sometimes these are averaged across the time dimension.
  o Ex. Account Balance, Quantity on Hand
• Non-Additive
  o Cannot be summed across any dimension.
  o These do not belong in the fact table, but with the dimension.
  o Ex. Building square footage, Product retail price
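A small sketch of why an account balance is semi-additive: it sums sensibly across accounts on one date, but across dates it is averaged rather than summed. The sample snapshot rows and function names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily balance snapshots: (date, account, balance).
balances = [
    ("2013-04-01", "A", 100.0),
    ("2013-04-01", "B", 50.0),
    ("2013-04-02", "A", 120.0),
    ("2013-04-02", "B", 30.0),
]


def total_by_date(rows):
    """Summing across the account dimension on one date is meaningful."""
    totals = defaultdict(float)
    for snapshot_date, _account, balance in rows:
        totals[snapshot_date] += balance
    return dict(totals)


def average_over_time(rows):
    """Across the time dimension, average the per-date totals instead."""
    per_date = total_by_date(rows)
    return sum(per_date.values()) / len(per_date)


daily_totals = total_by_date(balances)
average_total = average_over_time(balances)
```

Adding the two daily totals together would report 300.0, money the bank never held at any one time, which is why the time dimension is handled with an average.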
Is that a Fact?
• Not every numeric value is a fact.
• Good Fact-Detecting Rules
  o If it is additive (it sums up across dimensions), then it is a fact.
  o If it is used for filtering or labeling, then it’s not a fact but an attribute of a dimension.
    Ex: Basketball player’s height.
  o If it is used in calculations, then it should be treated as a fact.
    Ex: Employee hourly wage is used to calculate weekly pay.
Facts or Attributes? Additive? Semi? Non?
1. Number of page views on a website?
2. The amount of taxes withheld on an employee’s weekly paycheck?
3. Credit card balance?
4. Pants waist size? 32, 34, etc…
5. Tracking when a student attends class?
6. Product Retail Price?
7. Vehicle’s MPG rating?
8. The number of minutes late employees arrive to work each day?
Facts or Attributes? Additive? Semi? Non?
1. Number of page views on a website? F/A
2. The amount of taxes withheld on an employee’s weekly paycheck? F/A
3. Credit card balance? F/S
4. Pants waist size? 32, 34, etc… N/A
5. Tracking when a student attends class? F/A
6. Product Retail Price? N/A
7. Vehicle’s MPG rating? N/A
8. The number of minutes late employees arrive to work each day? F/A
Fact Table Design
• The Primary Key of your fact table uses the minimum number of columns possible & no surrogate keys. (Made up of FKs and Degenerate Dimensions)
• Referential Integrity is a must. Every foreign key in the fact table must have a value.
• Avoid NULLs in the foreign keys by using flags, which are special values in place of NULL.
  o Ex. “No Shopper Card” in Customer Dimension
• The granularity of your fact table should be at the lowest, most detailed atomic grain captured by a business process. (more on this later)
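The “No Shopper Card” flag can be sketched as a reserved dimension row that fact rows point to instead of carrying a NULL foreign key. The reserved key value 0, the sample customer, and the card numbers are assumptions for illustration only.

```python
# Hypothetical customer dimension keyed by surrogate key. Key 0 is reserved
# (an assumption of this sketch) for shoppers who did not present a card.
NO_SHOPPER_CARD_KEY = 0
customer_dim = {
    NO_SHOPPER_CARD_KEY: {"customer_name": "No Shopper Card"},
    1: {"customer_name": "Pat Jones"},
}

# Lookup from the natural key (card number) to the surrogate key.
customer_key_by_card = {"CARD-1001": 1}


def customer_key_for(card_number):
    """Return a valid customer key for every sale, never None."""
    if card_number is None:
        return NO_SHOPPER_CARD_KEY
    return customer_key_by_card[card_number]


fact_rows = [
    {"customer_key": customer_key_for("CARD-1001"), "sales_amount": 12.50},
    {"customer_key": customer_key_for(None), "sales_amount": 4.25},
]
```

Because every fact row joins to some dimension row, sales without a card still show up in reports, labeled “No Shopper Card”.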
Dimensions
• Dimensions provide context for our facts.
• We can easily identify dimensions because of the “by” and/or “for” words.
  o Ex. Total accounts receivable for the IT Department by Month.
• Dimensions have attributes which describe and categorize their values.
  o Ex. Student: Major, Year, Dormitory, Gender.
• The attributes help constrain and summarize facts.
Dimension Table Design
Characteristics of a Good Dimension Table
• Verbose labels with full words
• Descriptive columns
• Complete – no null / empty values
• Discretely valued – one value per row
• Quality Assured – data is clean and consistent
• Always has a Surrogate Primary Key
What's Wrong w/This Dimension?
Prod Id  Prod Name  Prod Cat  Prod Price  Prod Region Code
A        Apple      Fruit     $2.00       E
B        Carrot     Veg       $1.50       S
C        Cherries   Friut     $3.00       S
D        Lettuce    Veg       $1.50
E        Apple      Fruit     $2.00       E
Can you find the 6 things wrong with the implementation of this dimension?
What's Wrong w/This Dimension?
Prod Id  Prod Name  Prod Cat  Prod Price  Prod Reg Code
A        Apple      Fruit     $2.00       E
B        Carrot     Veg       $1.50       S
C        Cherries   Friut     $3.00       S
D        Lettuce    Veg       $1.50
E        Apple      Fruit     $2.00       E
• No Surrogate Key
• Not Verbose (What do S & E mean?)
• Incomplete
• Poor Data Quality
• Not Discretely Valued
• Poor Descriptions
Dimension Table Key
• Surrogate keys (identities, sequences e.g. 1,2,3,…) are used for the primary key constraint.
• They yield the best performance for the Star Schema:
  o most efficient joins
  o smaller indexes in the fact table
  o more rows per block in the fact table
• They have no dependency on the primary key in the operational source data.
  o Makes it easier to deal with changes to the source data.
• A dimension table always has a natural key used to identify a unique row.
  o Ex: Customer’s email address, Employee’s SSN.
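A minimal sketch of assigning surrogate keys during a load: each new natural key gets the next integer in a sequence, and a repeated natural key maps back to the key it already has. The class name and the sample email addresses are hypothetical, and the sketch ignores history tracking, where one natural key can own several surrogate keys.

```python
import itertools


class SurrogateKeyMap:
    """Hands out sequential surrogate keys, one per natural key."""

    def __init__(self):
        self._sequence = itertools.count(1)  # 1, 2, 3, ...
        self._key_by_natural = {}

    def key_for(self, natural_key):
        # Only draw a new number the first time a natural key is seen.
        if natural_key not in self._key_by_natural:
            self._key_by_natural[natural_key] = next(self._sequence)
        return self._key_by_natural[natural_key]


keys = SurrogateKeyMap()
first = keys.key_for("pat@example.com")
second = keys.key_for("lee@example.com")
repeat = keys.key_for("pat@example.com")
```

The fact table stores only the small integer, so a change to the source system's own key format never ripples into the facts.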
Conformed Dimensions
• Master or common reference dimensions.
• Shared across business processes (fact tables) in the DW.
• Reusable: can be used for drill-across, and lower the time to develop the next star schema.
• Two types:
  o Identical Dimensions – exactly the same dimensions (Ex. Dates)
  o Perfect Subset of an existing dimension.
Ex. Conformed Dimensions
Sales Fact Table
  Date key FK
  Product key FK
  … other FKeys…
  Sales quantity
  Sales amount
Product Dimension
  Product key PK
  Product description
  SKU number
  Brand description
  Class description
  Department description
Sales Forecast Fact Table
  Month key FK
  Brand key FK
  … other FKeys…
  Forecast quantity
  Forecast amount
Brand Dimension
  Brand key PK
  Brand description
  Class description
  Department description
Subset: the Brand Dimension is a subset of the Product Dimension.
Date and Time Dimensions
• Just about every fact table has a date dimension.
• This is the most common of the conformed dimensions.
• Usually generated programmatically during the ETL process or imported from a spreadsheet.
• Acceptable to use a PK in the form YYYYMMDD.
• If you need time of day, use a separate dimension.
• Time of day should only be used if there are meaningful textual descriptions of time.
  o Ex. Lunch, Dinner, 1st Shift, 2nd Shift, etc.
• Elapsed time intervals are facts, not attributes.
  o Ex. Minutes between when an order was received and shipped
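Generating the date dimension programmatically might look like the sketch below, which uses a YYYYMMDD integer as the key. The attribute names (full_date, month_number, weekday_indicator) are assumptions chosen for illustration; a real date dimension usually carries many more descriptive columns.

```python
from datetime import date, timedelta


def build_date_dimension(start, end):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    current = start
    while current <= end:
        rows.append({
            # Key in the form YYYYMMDD, e.g. 20130401.
            "date_key": int(current.strftime("%Y%m%d")),
            "full_date": current.isoformat(),
            "year": current.year,
            "month_number": current.month,
            # weekday() returns 5 for Saturday and 6 for Sunday.
            "weekday_indicator": "Weekend" if current.weekday() >= 5 else "Weekday",
        })
        current += timedelta(days=1)
    return rows


date_dim = build_date_dimension(date(2013, 3, 30), date(2013, 4, 2))
```

Storing a text label such as "Weekend" rather than a bare flag keeps the attribute verbose enough to use directly as a report label.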
Handling Time Zones?
• Express time in coordinated universal time (UTC).
• Express in local time, too.
• Other option: express all times in a single time zone (for example, ET).
Call Center Activity Fact
  Local call date key FK
  UTC call date key FK
  Local call time of day FK
  UTC call time of day FK
  …
Each foreign key points to its own dimension:
  Local call date dimension
  UTC call date dimension
  Local call time of day dimension
  UTC call time of day dimension
Degenerate Dimensions
• Occur in transaction fact tables that have a parent-child (One to Many) structure.
  o Ex. Order -> Order Detail, Airline Ticket -> Flights
• Dimensions we store in the fact table (because there are too many of them to justify their own dimension).
• Allow us to drill through to operational data.
• Usually end up as part of the primary key of the fact table.
Slowly Changing Dimensions
• Dimensional data changes infrequently, but when it does you need a strategy for addressing the change.
  o Ex: Customer has a new address, Employee has a name change
• 4 popular strategies:
  o Type 1: Overwrite the existing attribute
  o Type 2: Add a new Dimension row
  o Type 3: Add a new Dimension attribute
  o Mini-Dimension: Add a new Dimension
• These strategies are not mutually exclusive!
Type 1: Overwrite
• Appropriate for:
  o correcting mistakes or errors
  o changes where historical associations do not matter
  o the old value has no significance
• If the previous value matters, don’t use this strategy.
• Problems will occur with data aggregated on old values.
• Ex. Employee Name Changes, Corrections, Natural Key Edits.
Type 2: Add New Dimension Row
• Most popular strategy, preserves history
• Natural key is repeated.
• Old and new values are stored along with effective dates and an indicator of the current row

Product Key  Product Descr.  Product Code  Department       Effective Date  Expiration Date  Current Row
11981        Stapler, Red    ST901         Accessories      4/7/2010        9/1/2011         N
20344        Stapler, Red    ST901         Supplies         9/2/2011        3/31/2013        N
45393        Stapler, Red    ST901         Office Supplies  4/1/2013        12/31/9999       Y
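The first department change in the stapler table can be sketched as a two-step update: expire the current row, then append a new row with a new surrogate key. The function name, the dictionary layout, and passing the new key in by hand are assumptions of this sketch; a real ETL tool would draw the key from a sequence.

```python
from datetime import date, timedelta

HIGH_DATE = date(9999, 12, 31)  # open-ended expiration for the current row


def apply_type2_change(dimension_rows, product_code, changes, change_date, new_key):
    """Expire the current row for product_code and append a new current row."""
    current = next(
        row for row in dimension_rows
        if row["product_code"] == product_code and row["current_row"] == "Y"
    )
    # The old row stops being valid the day before the change takes effect.
    current["expiration_date"] = change_date - timedelta(days=1)
    current["current_row"] = "N"

    # Copy the old row, apply the changed attributes, then reset the
    # housekeeping columns for the new current row.
    new_row = {
        **current,
        **changes,
        "product_key": new_key,
        "effective_date": change_date,
        "expiration_date": HIGH_DATE,
        "current_row": "Y",
    }
    dimension_rows.append(new_row)
    return new_row


product_dim = [{
    "product_key": 11981,
    "product_descr": "Stapler, Red",
    "product_code": "ST901",
    "department": "Accessories",
    "effective_date": date(2010, 4, 7),
    "expiration_date": HIGH_DATE,
    "current_row": "Y",
}]
apply_type2_change(product_dim, "ST901", {"department": "Supplies"}, date(2011, 9, 2), 20344)
```

Fact rows loaded before the change keep pointing at key 11981, so historical sales still report under Accessories.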
Type 3: Add A New Dimension Attribute
• Infrequently used, preserves history
• Useful for “soft” changes where users might want to choose between the old and new attribute
• The new value is written to the existing column; the old value is stored in a new column.
• This way queries do not have to be re-written to access the new attribute.
• Ex. Redistricting sales territories. Re-charting accounting codes.
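A sketch of the Type 3 mechanics on a single dimension row: the old value moves to a new column and the existing column takes the new value. The "prior_" column prefix and the territory names are assumptions made for this example.

```python
def apply_type3_change(row, attribute, new_value):
    """Keep the old value in a 'prior_' column and overwrite the current one."""
    row["prior_" + attribute] = row[attribute]
    row[attribute] = new_value
    return row


customer = {"customer_key": 42, "sales_territory": "Northeast"}
apply_type3_change(customer, "sales_territory", "New England")
```

Existing queries that read sales_territory pick up the new value unchanged, while users who want the old grouping can read prior_sales_territory.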
Mini-Dimensions: Add a new Dimension
• If attributes change frequently consider placing them in their own “mini-dimensions”
• Most effective when you have banded values, or ranges of discrete values.
Fact Table
  Customer Key FK
  Customer Demographics Key FK
  … other FKeys…
  … Facts…
Customer Dimension
  Customer key PK
  Customer ID (Nat. Key)
  Customer Name
  …
Customer Demographics Dimension
  Customer Demographics Key PK
  Customer Age Band
  Customer Gender
  Customer Income Band
  …
Role-Playing Dimensions
• The same physical dimension plays more than one logical dimensional role.
• Common with the date dimension
• Stored in the same physical table, just aliased as a view.
• Examples:
  o Date: Order Date, Shipping Date, Delivery Date
  o Address: Ship to, Bill to
  o Airport: Arrival, Departure
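The "aliased as a view" idea can be sketched with sqlite3: one physical date table, two views that rename its columns for each role. Table, view, and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical date dimension.
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);

-- Two logical roles, each exposed as a view with role-specific column names.
CREATE VIEW dim_order_date AS
    SELECT date_key AS order_date_key, full_date AS order_date FROM dim_date;
CREATE VIEW dim_ship_date AS
    SELECT date_key AS ship_date_key, full_date AS ship_date FROM dim_date;

CREATE TABLE fact_orders (
    order_date_key INTEGER,
    ship_date_key  INTEGER,
    quantity       INTEGER
);

INSERT INTO dim_date VALUES (20130401, '2013-04-01');
INSERT INTO dim_date VALUES (20130403, '2013-04-03');
INSERT INTO fact_orders VALUES (20130401, 20130403, 2);
""")

# Each foreign key in the fact table joins to its own role of the date dimension.
rows = conn.execute("""
    SELECT o.order_date, s.ship_date, f.quantity
    FROM fact_orders f
    JOIN dim_order_date o ON o.order_date_key = f.order_date_key
    JOIN dim_ship_date s ON s.ship_date_key = f.ship_date_key
""").fetchall()
```

Renaming the columns in each view keeps report labels unambiguous when both roles appear in the same query.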
Junk Dimensions
• Miscellaneous flags and text attributes which do not fit within any other dimension.
• Place them in their own “Junk” dimension.

Invoice Indicator Id  Payment Terms  Order Mode  Ship Mode
1                     Net 10         Web         Freight
2                     Net 10         Web         Air
3                     Net 10         Fax         Freight
4                     Net 10         Fax         Air
5                     Net 10         Phone       Freight
6                     Net 10         Phone       Air
7                     Net 15         Web         Freight
8                     Net 15         Web         Air

Don’t Create a Junk Dimension Row Until You Need It
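"Don't create a row until you need it" can be sketched as a lookup that adds a junk dimension row only the first time a combination of flags actually appears in the incoming data. The class and method names are hypothetical.

```python
class JunkDimension:
    """Creates one row per flag combination, on first use only."""

    def __init__(self):
        self._key_by_combo = {}

    def key_for(self, payment_terms, order_mode, ship_mode):
        combo = (payment_terms, order_mode, ship_mode)
        if combo not in self._key_by_combo:
            # New combination: assign the next sequential key.
            self._key_by_combo[combo] = len(self._key_by_combo) + 1
        return self._key_by_combo[combo]

    def rows(self):
        """Return (key, payment_terms, order_mode, ship_mode) tuples."""
        return [(key, *combo) for combo, key in self._key_by_combo.items()]


invoice_indicators = JunkDimension()
key_a = invoice_indicators.key_for("Net 10", "Web", "Freight")
key_b = invoice_indicators.key_for("Net 10", "Web", "Air")
key_c = invoice_indicators.key_for("Net 10", "Web", "Freight")
```

With many flags the full cross product of values can be far larger than the set of combinations that ever occur, which is the reason for building rows lazily.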
Snowflake & Outrigger Dimensions
• When redundant attributes are moved to a separate table to eliminate redundancy, we get a snowflaked dimension.
• Pros: Data is back in 3NF, saves space
• Cons: More complex for users, decreased performance.
• Sometimes this is desirable when there are a significant number of attributes in the outrigger dimension. These are the exception, not the rule!
Product Dimension
  Product Key PK
  Product Name
  Product Size Key FK
Product Size Dimension
  Product Size Key PK
  Product Size (S,M,L)
  Product Size Fee
Hierarchies in Dimensions
• Fixed hierarchies – Simply de-normalize as attributes
  o Ex. Product: Department -> Type
• Variable-depth hierarchies – implement with a bridge table (used to resolve M-M relationships)
• Bridge tables should be used only when absolutely necessary
  o Negatively affects usability
  o Decreases performance
Customer Dimension
  Customer Key PK
  Customer Name
  ….
Fact Table
  Date Key FK
  Customer Key FK
  More Foreign Keys…
  Facts ….
Customer Hierarchy Bridge
  Parent Customer Key PK,FK
  Subsidiary Cust. Key PK,FK
  # Levels from Parent
  Bottom Flag
  Top Flag
Multi-Valued Dimensions
• Almost all Fact-Dimension relationships are M-1.
• Sometimes there’s an M-M relationship between fact and dimension.
• The weighting factor is between 0 and 1 and should add up to 1 for each unique group key.
Diagnosis Dimension
  Diagnosis Key PK
  ICD-9 Code
  Diagnosis Description
  ….
Health Care Billing Fact
  Billing Date Key FK
  Patient Key FK
  Diagnosis Group Key FK
  Bill Amount
  More Facts ….
Diagnosis Group Bridge
  Diagnosis Group Key PK,FK
  Diagnosis Key PK,FK
  Weighting Factor
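A sketch of how the weighting factor is used: first check that the weights in each diagnosis group add up to 1, then spread each bill amount across the group's diagnoses. The sample keys, weights, and bill amount are invented for illustration.

```python
import math
from collections import defaultdict

# Hypothetical bridge rows: (diagnosis_group_key, diagnosis_key, weighting_factor).
diagnosis_group_bridge = [
    (1, 101, 0.5),
    (1, 102, 0.25),
    (1, 103, 0.25),
]
billing_facts = [{"diagnosis_group_key": 1, "bill_amount": 1000.0}]


def weights_sum_to_one(bridge):
    """Check the rule: weights add up to 1 within each group key."""
    totals = defaultdict(float)
    for group_key, _diagnosis_key, weight in bridge:
        totals[group_key] += weight
    return all(math.isclose(total, 1.0) for total in totals.values())


def allocate_by_diagnosis(facts, bridge):
    """Split each bill amount across its diagnoses by weighting factor."""
    allocated = defaultdict(float)
    for fact in facts:
        for group_key, diagnosis_key, weight in bridge:
            if group_key == fact["diagnosis_group_key"]:
                allocated[diagnosis_key] += fact["bill_amount"] * weight
    return dict(allocated)


amount_by_diagnosis = allocate_by_diagnosis(billing_facts, diagnosis_group_bridge)
```

Because the weights sum to 1, the allocated amounts add back up to the original bill, so totals by diagnosis do not double count.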
What Kind of Dimension?
1. Customers (for orders and sales leads)?
2. The various classrooms on a college campus?
3. Items on a restaurant menu?
4. Parts required to repair an automobile as part of a service record?
5. The instructors who teach a college class?
Choices:
• Conformed?
• Degenerate?
• Slowly Changing? & Type?
• Role Playing?
• Junk?
• Outrigger?
• M-M (Bridge)?
Transaction Fact
• The most basic fact grain
• One row per line in a transaction
• Corresponds to a point in space and time
• Once inserted, it is not revisited for update
• Rows inserted into the fact table when the transaction occurs
• Examples:
  o Sales, Returns, Telemarketing, Registration Events
Periodic Snapshot Fact
• At predetermined intervals, snapshots of the same level of detail are taken and stacked consecutively in the fact table
• Snapshots can be taken daily, weekly, monthly, hourly, etc.
• Complements detailed transaction facts but does not replace them
• Shares the same conformed dimensions but has fewer dimensions
• Examples:
  o Financial reports, Bank account values, Semester class schedules, Daily classroom Lab Logins
Accumulating Snapshot Fact
• Less frequently used, application specific.
• Used to capture a business process workflow.
• Fact row is initially inserted, then updated as milestones occur
• Fact table has multiple date FKs that correspond to each milestone
• Special facts: milestone counters and lag facts for the length of time between milestones
• Examples:
  o Order fulfillment, Job Applicant tracking, Rental Cars
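An accumulating snapshot row for order fulfillment might be maintained as in this sketch: the row is inserted once, then revisited as each milestone date arrives, and the lag facts are recomputed. The milestone names and lag column names are assumptions for illustration.

```python
from datetime import date


def days_between(start, end):
    """Lag in days, or None while either milestone has not happened yet."""
    if start is None or end is None:
        return None
    return (end - start).days


def record_milestone(row, milestone, when):
    """Update one milestone date and refresh the lag facts on the same row."""
    row[milestone] = when
    row["order_to_ship_lag_days"] = days_between(row["order_date"], row["ship_date"])
    row["ship_to_delivery_lag_days"] = days_between(row["ship_date"], row["delivery_date"])
    return row


# Row is inserted when the order is placed; later milestones are still empty.
order_row = {
    "order_date": date(2013, 4, 1),
    "ship_date": None,
    "delivery_date": None,
    "order_to_ship_lag_days": None,
    "ship_to_delivery_lag_days": None,
}

record_milestone(order_row, "ship_date", date(2013, 4, 3))
lag_after_shipping = order_row["order_to_ship_lag_days"]
record_milestone(order_row, "delivery_date", date(2013, 4, 8))
```

This is the one fact table type where existing rows are updated, in contrast to transaction facts, which are never revisited after insert.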
Which Fact Table Grain?
1. Concert ticket purchases?
2. Voter exit polls in an election?
3. Mortgage loan application and approval?
4. Auditing software use in a computer lab?
5. Daily summaries of visitors to websites?
6. Tracking Law School applications?
7. Attendance at sporting events?
8. Admissions to sporting events at 15 minute intervals?
Choices:
• Transaction
• Periodic Snapshot
• Accumulating Snapshot
Which Fact Table Grain?
1. Concert ticket purchases? T
2. Voter exit polls in an election? T
3. Mortgage loan application and approval? AS
4. Auditing software use in a computer lab? T
5. Daily summaries of visitors to websites? PS
6. Tracking Law School applications? AS
7. Attendance at sporting events? T
8. Admissions to sporting events at 15 minute intervals? PS
Key:
• T = Transaction
• PS = Periodic Snapshot
• AS = Accumulating Snapshot
Facts of Different Granularity == NO
• A single fact table cannot have facts with different levels of granularity
• All measurements must be at the same level of detail
• Example:
  o Measurements are captured for each order line, except for the shipping charge, which is for the entire order
• Solutions:
  o Allocate higher level facts to a lower granularity (split the shipping charge among the items)
  o Create two separate fact tables (Orders fact & Order Line fact)
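The allocation solution can be sketched as follows. Splitting the charge in proportion to each line's amount is an assumption of this sketch (the slide only says to split it among the items), and the column names are hypothetical. The last line receives the remainder so the allocated pieces always add back up to the order-level charge.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def allocate_shipping(order_lines, shipping_charge):
    """Spread an order-level shipping charge across its order lines."""
    total = sum(line["extended_amount"] for line in order_lines)
    allocated = []
    running = Decimal("0.00")
    for index, line in enumerate(order_lines):
        if index == len(order_lines) - 1:
            # Last line absorbs any rounding difference.
            share = shipping_charge - running
        else:
            share = (shipping_charge * line["extended_amount"] / total).quantize(
                CENT, rounding=ROUND_HALF_UP
            )
            running += share
        allocated.append({**line, "allocated_shipping": share})
    return allocated


lines = [
    {"line_number": 1, "extended_amount": Decimal("30.00")},
    {"line_number": 2, "extended_amount": Decimal("20.00")},
    {"line_number": 3, "extended_amount": Decimal("10.00")},
]
allocated_lines = allocate_shipping(lines, Decimal("10.00"))
shares = [line["allocated_shipping"] for line in allocated_lines]
```

After allocation every measurement sits at the order line grain, so the shipping fact can live in the same fact table as the line-level facts.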
Multiple currencies / Units of Measure
• Measurements are provided in a local currency
• Measurements should be converted to a standardized currency, or else conversion rates must be stored
• Similarly, in case of multiple units of measure, conversions to all different units of measure should be provided
  o Ex. Items are received by the box (12 in a box = Received unit factor)
  o Received Price = Received unit factor x unit price
Factless Fact tables
• Business processes that do not generate quantifiable measurements
  o Ex: Student attendance, College admissions
• Can be easily converted into traditional fact tables by adding a Count fact, which is always equal to 1.
• Helps to perform aggregations
  o Ex: Attendance Count
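A sketch of the attendance example: each row only records that a student attended a class on a date, and the constant count of 1 lets ordinary summing produce attendance totals. The keys and column names are made up for illustration.

```python
from collections import Counter

# Hypothetical factless fact rows with the added count column (always 1).
attendance_fact = [
    {"date_key": 20130401, "student_key": 1, "class_key": 10, "attendance_count": 1},
    {"date_key": 20130401, "student_key": 2, "class_key": 10, "attendance_count": 1},
    {"date_key": 20130401, "student_key": 1, "class_key": 11, "attendance_count": 1},
]


def attendance_by_class(rows):
    """Sum the count fact grouped by the class dimension key."""
    totals = Counter()
    for row in rows:
        totals[row["class_key"]] += row["attendance_count"]
    return dict(totals)


attendance_totals = attendance_by_class(attendance_fact)
```

Summing the count behaves exactly like counting rows, but it lets the table be queried with the same SUM-based patterns as any other fact table.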
Consolidated fact tables
• Fact tables populated from different sources may be consolidated into a single fact table
  o Level of granularity must be the same
  o Measurements are listed side-by-side
  o Ex. By combining forecast and actual sales amounts, a forecast/actual sales variance amount can be easily calculated and stored
Sales Fact
  Date Key FK
  Customer Key FK
  Region Key FK
  Actual Sales
Forecast Fact
  Date Key FK
  Customer Key FK
  Region Key FK
  Forecast Sales
Sales & Forecast Fact
  Date Key FK
  Customer Key FK
  Region Key FK
  Actual Sales
  Forecast Sales
  Sales Variance
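Consolidating the two fact tables can be sketched as a merge on their shared grain (date, customer, region). Treating a missing measurement as 0 and computing variance as actual minus forecast are both assumptions of this sketch, as are the sample rows.

```python
def consolidate(sales_rows, forecast_rows):
    """Merge two fact tables that share the same grain into one."""

    def grain(row):
        return (row["date_key"], row["customer_key"], row["region_key"])

    actual = {grain(row): row["actual_sales"] for row in sales_rows}
    forecast = {grain(row): row["forecast_sales"] for row in forecast_rows}

    consolidated = []
    # Include every grain value found in either source.
    for key in sorted(actual.keys() | forecast.keys()):
        actual_sales = actual.get(key, 0)
        forecast_sales = forecast.get(key, 0)
        consolidated.append({
            "date_key": key[0],
            "customer_key": key[1],
            "region_key": key[2],
            "actual_sales": actual_sales,
            "forecast_sales": forecast_sales,
            # Sign convention (actual minus forecast) is an assumption.
            "sales_variance": actual_sales - forecast_sales,
        })
    return consolidated


sales = [
    {"date_key": 20130401, "customer_key": 1, "region_key": 7, "actual_sales": 1200},
]
forecasts = [
    {"date_key": 20130401, "customer_key": 1, "region_key": 7, "forecast_sales": 1000},
    {"date_key": 20130401, "customer_key": 2, "region_key": 7, "forecast_sales": 500},
]
combined = consolidate(sales, forecasts)
```

Storing the variance alongside both measurements saves report writers from joining two fact tables at query time.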
Finally: Do’s and Don’ts
• Do not take a “report centric” approach.
  o Reuse your dimensional models for multiple reports.
• Dimensional models should not be departmentally bound.
  o Reuse your dimensional models for multiple departments.
• Create dimensional models with the finest level of granularity.
  o This will be the most flexible and scalable option.
• Use conformed dimensions.
  o Helps with integration efforts.
  o Simplifies the process of creating the next data mart.