ISQS 6339, Business Intelligence Dimensional Modeling

Zhangxi Lin, Texas Tech University


Outline
◦ Principles of Dimensional Modeling
◦ Data Warehousing Methodology
◦ Three Phases of Dimensional Modeling

PRINCIPLES OF DIMENSIONAL MODELING

Dimensional Model
Also called star schema (but snowflake schema is also fine)
◦ Fact table is in the middle, with dimensions serving as the points on the star.
◦ A normalized fact table plus denormalized dimension tables
Reference: database normalization
◦ Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization and what we now know as the First Normal Form (1NF) in 1970. Codd went on to define the Second Normal Form (2NF) and Third Normal Form (3NF) in 1971, and Codd and Raymond F. Boyce defined the Boyce-Codd Normal Form (BCNF) in 1974.
◦ Informally, a relational database table is often described as "normalized" if it is in the Third Normal Form. Most 3NF tables are free of insertion, update, and deletion anomalies.

Star Schema Model
◦ Sales Fact Table (central fact table): Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...
◦ Product Table: Product_id, Product_desc, ...
◦ Time Table: Day_id, Month_id, Year_id, ...
◦ Item Table: Item_id, Item_desc, ...
◦ Store Table: Store_id, District_id, ...
The dimension tables are denormalized.
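A minimal sketch of the star schema slide, using Python's built-in sqlite3 module. The fact and dimension columns follow the slide (the Item table is left out for brevity); the table names `product_dim`, `store_dim`, `time_dim`, and all sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Denormalized dimension tables (columns abbreviated from the slide)
    CREATE TABLE product_dim (product_id INTEGER PRIMARY KEY, product_desc TEXT);
    CREATE TABLE store_dim   (store_id   INTEGER PRIMARY KEY, district_id INTEGER);
    CREATE TABLE time_dim    (day_id     INTEGER PRIMARY KEY, month_id INTEGER, year_id INTEGER);

    -- Central fact table: one foreign key per dimension plus numeric measures
    CREATE TABLE sales_fact (
        product_id   INTEGER REFERENCES product_dim(product_id),
        store_id     INTEGER REFERENCES store_dim(store_id),
        day_id       INTEGER REFERENCES time_dim(day_id),
        sales_amount REAL,
        sales_units  INTEGER
    );

    -- Hypothetical sample rows
    INSERT INTO product_dim VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO store_dim   VALUES (10, 100), (11, 100);
    INSERT INTO time_dim    VALUES (20240101, 1, 2024), (20240201, 2, 2024);
    INSERT INTO sales_fact  VALUES
        (1, 10, 20240101, 50.0, 5),
        (1, 11, 20240201, 30.0, 3),
        (2, 10, 20240201, 20.0, 1);
""")

# A typical star query: join the fact table to each dimension it needs,
# then group by dimension attributes and sum the measures.
sales_by_product_month = conn.execute("""
    SELECT p.product_desc, t.month_id, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN product_dim p ON p.product_id = f.product_id
    JOIN time_dim t    ON t.day_id = f.day_id
    GROUP BY p.product_desc, t.month_id
    ORDER BY p.product_desc, t.month_id
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps star queries simple.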

Snowflake Schema Model
◦ Sales Fact Table: Item_id, Store_id, Product_id, Week_id, Sales_amount, Sales_units
◦ Time Table: Week_id, Period_id, Year_id
◦ Product Table: Product_id, Product_desc
◦ Item Table: Item_id, Item_desc, Dept_id
◦ Dept Table: Dept_id, Dept_desc, Mgr_id
◦ Mgr Table: Dept_id, Mgr_id, Mgr_name
◦ Store Table: Store_id, Store_desc, District_id
◦ District Table: District_id, District_desc

Snowflake Schema Model
◦ Direct use by some tools
◦ More flexible to change
◦ Provides for speedier data loading
◦ Can become large and unmanageable
◦ Degrades query performance
◦ More complex metadata
Hierarchy example: Country, State, County, City

Facts
Definition
◦ Measure: a numeric quantity expressing some aspect of the organization’s performance
◦ Aggregate: formed by combining values from a given dimension or set of dimensions to create a single value
Facts are measurements associated with a specific business process.
Most facts are additive (calculative); others are semi-additive, non-additive, or descriptive (e.g. factless fact table).
Many facts can be derived from other facts, so non-additive facts can be avoided by calculating them from additive facts.
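The last point can be sketched in a few lines of Python. The rows and the "average unit price" measure are hypothetical; the point is only that a ratio should be derived from additive totals rather than stored and summed.

```python
# Hypothetical fact rows: (store, sales_amount, sales_units)
rows = [("A", 100.0, 10), ("B", 300.0, 15)]

# Additive facts can be summed across any dimension.
total_amount = sum(amount for _, amount, _ in rows)
total_units = sum(units for _, _, units in rows)

# Unit price is non-additive: averaging the per-row prices ignores volume.
naive_avg_price = sum(amount / units for _, amount, units in rows) / len(rows)

# Deriving it from the additive totals gives the volume-weighted price.
avg_price = total_amount / total_units
```

Here the naive average is 15.0 while the price derived from the totals is 16.0, because store B sold more units at the higher price.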

Fact Table Characteristics
◦ Contain numerical metrics of the business
◦ Can hold large volumes of data
◦ Can grow quickly
◦ Can contain base, derived, and summarized data
◦ Are typically additive
◦ Are joined to dimension tables through foreign keys that reference primary keys in the dimension tables
Example: Sales Fact Table (Product_id, Store_id, Item_id, Day_id, Sales_amount, Sales_units, ...)

The Three Fact Table Types
Transaction fact table
◦ The most basic and fundamental
◦ "One row per line in a transaction", e.g., every line on a receipt
◦ Holds data at the most detailed level
◦ Has a great number of dimensions associated with it
Periodic snapshot fact table
◦ Takes a "picture of the moment"
◦ Cumulative performance over specific time intervals
◦ Dependent on the transactional table
◦ Valuable for combining data across several business processes in the value chain
Accumulating snapshot fact table
◦ Used to show the activity of a process that has a well-defined beginning and end
◦ Constantly updated over time

Types of facts

Week  Date  Trans#  Change  OldBal  NewBal
1     1     A1-1      100    1000    1100
1     2     A1-2      -50    1100    1050
1     4     A1-3      200    1050    1250
2     2     A2-1     -120    1250    1130
2     2     A2-2      200    1130    1330
3     1     A3-1     -300    1330    1030
4     2     A4-1      -20    1030    1010
4     3     A4-2      100    1010    1110
4     3     A4-3      250    1110    1360
4     5     A4-4     -220    1360    1140

Transaction fact: each row

Periodic snapshot fact: (OldBal, NewBal) on each transaction

Accumulating snapshot fact: The average numbers in a week, such as average balance, number of transactions, average amount of transactions, the total amount of trading in a given period.
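As a rough illustration of how per-week figures such as the number of transactions and the ending balance can be derived from the transaction rows, here is a small Python sketch. The data is copied from the table on this slide; the summary field names are assumptions.

```python
from collections import defaultdict

# (week, date, trans_id, change, old_bal, new_bal), copied from the slide's table
transactions = [
    (1, 1, "A1-1", 100, 1000, 1100),
    (1, 2, "A1-2", -50, 1100, 1050),
    (1, 4, "A1-3", 200, 1050, 1250),
    (2, 2, "A2-1", -120, 1250, 1130),
    (2, 2, "A2-2", 200, 1130, 1330),
    (3, 1, "A3-1", -300, 1330, 1030),
    (4, 2, "A4-1", -20, 1030, 1010),
    (4, 3, "A4-2", 100, 1010, 1110),
    (4, 3, "A4-3", 250, 1110, 1360),
    (4, 5, "A4-4", -220, 1360, 1140),
]

# Group the transaction-grain rows by week, keeping their original order.
by_week = defaultdict(list)
for week, _date, _trans_id, change, _old_bal, new_bal in transactions:
    by_week[week].append((change, new_bal))

# One summary row per week, derived entirely from the transaction rows.
weekly_summary = {
    week: {
        "num_transactions": len(items),
        "net_change": sum(change for change, _ in items),
        "ending_balance": items[-1][1],  # balance after the week's last transaction
    }
    for week, items in by_week.items()
}
```

Note that the balance is semi-additive: the weekly ending balance is taken from the last row, not summed across rows.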

Dimensions
Definition: a categorization used to spread out an aggregate measure to reveal its constituent parts
◦ The foundation of the dimensional model, describing the objects of the business
◦ The nouns of the DW/BI system; business processes (facts) are the verbs of the business
◦ Dimension tables link to all the business processes.
◦ A dimension shared across all processes is called a conformed dimension.
◦ Analysis involving data from more than one business process is called drill-across.
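A drill-across can be sketched as combining the results of two business processes on a conformed dimension. In this hypothetical Python example, Product is the shared dimension and the Sales and Returns figures are made up.

```python
# Hypothetical per-process summaries, both keyed by the conformed Product dimension
sales_by_product = {"PC": 100, "Server": 40}
returns_by_product = {"PC": 5}

# Drill-across: line up both processes on the shared dimension values.
drill_across = {
    product: {
        "sales": sales_by_product.get(product, 0),
        "returns": returns_by_product.get(product, 0),
    }
    for product in sorted(set(sales_by_product) | set(returns_by_product))
}
```

This only works because both processes describe products with the same keys; if each process had its own product coding, the rows could not be lined up.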

Attributes
◦ An additional piece of information pertaining to a dimension member that is not the unique identifier or the description of the member
◦ Attributes can be used to more fully describe dimension members

Dimension Table Characteristics
Dimension tables have the following characteristics:
◦ Contain textual information that represents the attributes of the business
◦ Contain relatively static data
◦ Are joined to a fact table through a foreign key reference

Star Dimensional Model Characteristics
◦ The model is easy for users to understand.
◦ Primary keys represent a dimension.
◦ Nonforeign key columns are values.
◦ Facts are usually highly normalized.
◦ Dimensions are completely denormalized.
◦ Fast response to queries is provided.
◦ Performance is improved by reducing table joins.
◦ End users can express complex queries.
◦ Support is provided by many front-end tools.

The Time Dimension
Time is critical to the data warehouse. A consistent representation of time is required for extensibility.
Where should the element of time be stored?
[Figure: a Time dimension table linked to the Sales fact table]
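One common way to get a consistent representation of time is to generate the time dimension once, with one row per calendar day. The sketch below is an assumption about how such a table might be built in Python; the column names and the `YYYYMMDD` integer key are illustrative choices, not part of the slides.

```python
from datetime import date, timedelta

def build_time_dimension(start: date, end: date) -> list[dict]:
    """Return one dimension row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "day_id": int(day.strftime("%Y%m%d")),  # e.g. 20240101
            "date": day.isoformat(),
            "month": day.month,
            "quarter": (day.month - 1) // 3 + 1,
            "year": day.year,
        })
        day += timedelta(days=1)
    return rows

time_dim = build_time_dimension(date(2024, 1, 1), date(2024, 12, 31))
```

Fact rows then store only `day_id`, and every query groups by month, quarter, or year through the same table.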

Hierarchies
Meaningful, standard ways to group the data within a dimension
◦ Variable-depth hierarchies
◦ Frequently changing hierarchies
Examples of hierarchy in a dimension
◦ Address: street, city, state, country
◦ Organization: section, division, branch, region
◦ Time: year, quarter, month, date

Data Cube
◦ Data cubes are multidimensional extensions of 2-D tables, just as in geometry a cube is a three-dimensional extension of a square. The word cube brings to mind a 3-D object, and we can think of a 3-D data cube as being a set of similarly structured 2-D tables stacked on top of one another.
◦ Data cubes aren't restricted to just three dimensions. Most OLAP systems can build data cubes with many more dimensions; some allow up to 64.
◦ In practice, we often construct data cubes with many dimensions, but we tend to look at just three at a time. What makes data cubes so valuable is that we can index the cube on one or more of its dimensions.

Data Cube
[Figure: data cube with axes Time, Region, and Product]
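To make the idea of indexing a cube on its dimensions concrete, here is a toy Python sketch. It models a cube as a dictionary keyed by (time, region, product); the cell values and the helper names `slice_cube` and `roll_up` are hypothetical and do not correspond to any OLAP product's API.

```python
from collections import defaultdict

# Hypothetical cells: (time, region, product) -> sales
cube = {
    ("Q1", "East", "PC"): 10,
    ("Q1", "West", "PC"): 20,
    ("Q2", "East", "PC"): 30,
    ("Q2", "East", "Server"): 40,
}

def slice_cube(cube, time=None, region=None, product=None):
    """Keep only the cells matching the given dimension values (None means any)."""
    wanted = (time, region, product)
    return {
        key: value
        for key, value in cube.items()
        if all(w is None or w == k for w, k in zip(wanted, key))
    }

def roll_up(cube, axis):
    """Sum away every dimension except the one at position `axis`."""
    totals = defaultdict(int)
    for key, value in cube.items():
        totals[key[axis]] += value
    return dict(totals)

q1_cells = slice_cube(cube, time="Q1")
sales_by_region = roll_up(cube, axis=1)
```

Fixing one dimension value yields a 2-D table (a slice), and summing over the other dimensions yields totals along a single axis.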

OLAP system
OLAP allows users to retrieve information from data quickly for analysis purposes.
Features
◦ Multidimensional database
◦ Easily understood
Videos
◦ What is OLAP? (5'04")
◦ SQL OLAP Tutorial - Data Warehouse Schema Design (9'45")

Dimensional Modeling Process
High level dimensional model design
◦ Choosing business model in accordance with the analytic theme
◦ Declaring the grain
◦ Choosing dimensions
◦ Identifying the facts
Detailed dimensional model development
Dimensional model review and validation
◦ IS
◦ Core users
◦ Business community
Final design iteration

ISQS 6339, Data Mgmt & BI, Zhangxi Lin 21

DATA WAREHOUSING METHODOLOGY

Data Warehouse Development Approaches
Data warehouse development approaches
◦ Kimball Model: data mart approach (data marts → EDW)
◦ Inmon Model: EDW approach (EDW → data marts)
Which model is better?
◦ There is no one-size-fits-all strategy for data warehousing.
◦ One alternative is the hosted warehouse.

Comparison
Kimball Model
◦ Kimball’s model follows a bottom-up approach. The Data Warehouse (DW) is provisioned from Datamarts (DM) as and when they are available or required.
◦ The Datamarts are sourced from OLTP systems, which are usually relational databases in Third Normal Form (3NF).
◦ The Data Warehouse, which is central to the model, is a de-normalized star schema. The OLAP cubes are built on this DW.
Inmon Model
◦ Inmon’s model follows a top-down approach. The Data Warehouse (DW) is sourced from OLTP systems and is the central repository of data.
◦ The Data Warehouse in Inmon’s model is in Third Normal Form (3NF).
◦ The Datamarts (DM) are provisioned out of the Data Warehouse as and when required. Datamarts in Inmon’s model are in 3NF, and the OLAP cubes are built from them.

Strengths and Weaknesses
Scalable vs. structural
◦ Kimball’s model is more scalable because of the bottom-up approach: you can start small and scale up eventually. The ROI is usually faster with Kimball’s model. However, this approach makes it difficult to create reusable structures and ETL for different data marts.
◦ Inmon’s model, on the other hand, is more structured and easier to maintain, but it is rigid and takes more time to build. A significant advantage of Inmon’s model is that, because the DW is in 3NF, it is easier to build data mining models.
Both the Kimball and Inmon models agree and emphasize that the DW is the central repository of data and that OLAP cubes are built from de-normalized star schemas.
In conclusion, when it comes to data modeling, it is irrelevant which camp you belong to as long as you understand why you are adopting a specific model. Sometimes it makes sense to take a hybrid approach.

General Data Warehouse Development Approaches
◦ “Big bang” approach
◦ Incremental approach:
  ◦ Top-down incremental approach
  ◦ Bottom-up incremental approach

“Big Bang” Approach
◦ Analyze enterprise requirements
◦ Build enterprise data warehouse
◦ Report in subsets or store in data marts

Incremental Approach to Warehouse Development
◦ Multiple iterations
◦ Shorter implementations
◦ Validation of each phase
[Figure: Increment 1 iterates through Strategy, Definition, Analysis, Design, Build, and Production]

Top-Down Approach
◦ Analyze requirements at the enterprise level
◦ Develop conceptual information model
◦ Identify and prioritize subject areas
◦ Complete a model of selected subject area
◦ Map to available data
◦ Perform a source system analysis
◦ Implement base technical architecture
◦ Establish metadata, extraction, and load processes for the initial subject area
◦ Create and populate the initial subject area data mart within the overall warehouse framework

Bottom-Up Approach
◦ Define the scope and coverage of the data warehouse and analyze the source systems within this scope
◦ Define the initial increment based on political pressure, assumed business benefit, and data volume
◦ Implement base technical architecture and establish metadata, extraction, and load processes as required by the increment
◦ Create and populate the initial subject areas within the overall warehouse framework

THREE PHASES OF DATA WAREHOUSE DESIGN

Note: Data warehouse design involves many details that take considerable effort to learn. Because of the limited time available for this part, only some of them are covered here.

Data Warehouse Database Design Phases
◦ Phase 1: Defining the business model
◦ Phase 2: Defining the dimensional model
◦ Phase 3: Defining the physical model

Phase 1: Defining the Business Model
◦ Performing strategic analysis
◦ Defining the business analytic theme
◦ Creating the business model
◦ Documenting metadata

Performing Strategic Analysis
◦ Identify crucial business processes
◦ Understand business processes
◦ Prioritize and select the business processes to implement
[Figure: prioritization grid plotting business benefit (low to high) against feasibility (low to high)]

Creating the Business Model
Defining business requirements:
◦ Identifying the business measures
◦ Identifying the dimensions
◦ Identifying the grain
◦ Identifying the business definitions and rules
Verifying data sources

Business Requirements Drive the Design Process
◦ Primary input: business requirements
◦ Secondary input: existing metadata, production ERD model, research

Identifying Measures and Dimensions
Measures: the attribute varies continuously
◦ Balance
◦ Units Sold
◦ Cost
◦ Sales
Dimensions: the attribute is perceived as constant or discrete
◦ Product
◦ Location
◦ Time
◦ Size

Using a Business Process Matrix
Sample of business process matrix
◦ Business processes (columns): Sales, Returns, Inventory
◦ Business dimensions (rows): Customer, Date, Product, Channel, Promotion

Determining Granularity
YEAR? QUARTER? MONTH? WEEK? DAY?
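The choice of grain matters because facts stored at a fine grain can always be rolled up to a coarser one, but not the other way around. A small Python sketch with made-up daily sales illustrates the roll-up direction.

```python
from collections import defaultdict

# Hypothetical facts stored at DAY grain: (year, month, day, sales_amount)
daily_sales = [
    (2024, 1, 5, 10.0),
    (2024, 1, 20, 15.0),
    (2024, 2, 3, 7.5),
]

# Rolling up from DAY to MONTH is a simple aggregation.
monthly_sales = defaultdict(float)
for year, month, _day, amount in daily_sales:
    monthly_sales[(year, month)] += amount
monthly_sales = dict(monthly_sales)
```

If only `monthly_sales` had been stored, the individual daily amounts could not be recovered, so the grain should be set at the lowest level the business needs to analyze.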

Identifying Business Rules
Store: Store > District > Region
Time: Month > Quarter > Year
Location: geographic proximity
◦ 0 - 1 miles
◦ 1 - 5 miles
◦ > 5 miles
Product:

Type    Monitor  Status
PC      15 inch  New
Server  17 inch  Rebuilt
        19 inch  Custom
        None

Documenting Metadata
Documenting metadata should include:
◦ Documenting the design process
◦ Documenting the development process
◦ Providing a record of changes
◦ Recording enhancements over time

Metadata Documentation Approaches
◦ Automated: data modeling tools, ETL tools, end-user tools
◦ Manual

Phase 2: Defining the Dimensional Model
◦ Identify fact tables:
  ◦ Translate business measures into fact tables
  ◦ Analyze source system information for additional measures
◦ Identify dimension tables
◦ Link fact tables to the dimension tables
◦ Model the time dimension

Illustrative case: IMW Data
◦ Transaction fact table: the transaction table
◦ Periodic snapshot fact table: current records in Land & Office facts
◦ Accumulating snapshot fact table: N/A in this case

Steps in designing a fact table
◦ Identify a business process for analysis (like sales).
◦ Identify measures or facts (sales dollar), by asking questions like 'What number of XX are relevant for the business process?', replacing the XX with various options that make sense within the context of the business.
◦ Identify dimensions for facts (product dimension, location dimension, time dimension, organization dimension), by asking questions that make sense within the context of the business, like 'Analyse by XX', where XX is replaced with the subject to test.
◦ List the columns that describe each dimension (region name, branch name, business unit name).
◦ Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars).
An alternative approach is the four-step design process described in Kimball (check what it is).

Using Time in the Data Warehouse
◦ Defining standards for time is critical.
◦ Aggregation based on time is complex.

Using Data Modeling Tools
◦ Tools with a GUI enable definition, modeling, and reporting.
◦ Avoid a mix of modeling techniques caused by:
  ◦ Development pressures
  ◦ Developers with lack of knowledge
  ◦ No strategy
◦ Determine a strategy.
◦ Write and publish formally.
◦ Make available electronically.

Phase 3: Defining the Physical Model
Why
◦ Huge amounts of data must be processed and retrieved efficiently in real time.
How
◦ Translate the dimensional design to a physical model for implementation.
◦ Define storage strategy for tables and indexes.
◦ Perform database sizing.
◦ Define initial indexing strategy.
◦ Define partitioning strategy.
◦ Update metadata document with physical information.

Storage and Performance Considerations
◦ Database sizing
◦ Data partitioning
◦ Indexing
◦ Star query optimization

Database Sizing - Test Load Sampling
Analyze a representative sample of the data chosen using proven statistical methods. Ensure that the sample reflects:
◦ Test loads for different periods
◦ Day-to-day operations
◦ Seasonal data and worst-case scenarios
◦ Indexes and summaries

Data Partitioning
Breaking up of data into separate physical units that can be handled independently
Types of data partitioning
◦ Horizontal partitioning
◦ Vertical partitioning
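The difference between the two types can be sketched with plain Python data structures. The rows, the partitioning key (`year`), and the split into frequently and rarely used columns are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales rows
rows = [
    {"sale_id": 1, "year": 2023, "amount": 10.0, "note": "gift wrap"},
    {"sale_id": 2, "year": 2024, "amount": 20.0, "note": ""},
    {"sale_id": 3, "year": 2024, "amount": 5.0, "note": "rush"},
]

# Horizontal partitioning: split the ROWS by a key (here, one partition per year).
horizontal = defaultdict(list)
for row in rows:
    horizontal[row["year"]].append(row)

# Vertical partitioning: split the COLUMNS, repeating the key in each part
# so that the original rows can be rejoined on sale_id.
frequently_used = [{"sale_id": r["sale_id"], "year": r["year"], "amount": r["amount"]} for r in rows]
rarely_used = [{"sale_id": r["sale_id"], "note": r["note"]} for r in rows]
```

With the horizontal split, a query for 2024 touches only the 2024 partition; with the vertical split, a query on amounts never reads the notes.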

Indexing
Indexing is used for the following reasons:
◦ It can save substantial cost by greatly improving performance and scalability.
◦ It can replace a full table scan with a quick read of the index followed by a read of only those disk blocks that contain the rows needed.
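The effect on the access path can be observed in SQLite, used here only as a convenient stand-in for a warehouse database. This sketch compares the output of SQLite's `EXPLAIN QUERY PLAN` before and after creating an index; the table, the index name `idx_sales_store`, and the data are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (store_id INTEGER, sales_amount REAL)")
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

query = "SELECT sales_amount FROM sales_fact WHERE store_id = 42"

def plan(sql):
    """Return SQLite's query plan as text; the description is the last column."""
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(query)  # without an index, SQLite reports a scan of the table
conn.execute("CREATE INDEX idx_sales_store ON sales_fact (store_id)")
plan_after = plan(query)   # with the index, the plan names idx_sales_store
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for the index name than to compare whole strings.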

Parallelism
[Figure: parallel execution servers P1, P2, and P3 operating on the Sales table and the Customers table]

Using Summary Data
Designing summary tables offers the following benefits:
◦ Provides fast access to precomputed data
◦ Reduces use of I/O, CPU, and memory
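A summary table can be sketched as an aggregate that is computed once at load time and then queried instead of the detailed fact table. The example below uses Python's sqlite3 module; the table name `sales_by_store_month` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_fact (store_id INTEGER, month_id INTEGER, sales_amount REAL)"
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?)",
    [(1, 1, 10.0), (1, 1, 5.0), (1, 2, 7.0), (2, 1, 3.0)],
)

# Precompute the aggregate once, at load time.
conn.execute("""
    CREATE TABLE sales_by_store_month AS
    SELECT store_id, month_id,
           SUM(sales_amount) AS total_amount,
           COUNT(*)          AS num_rows
    FROM sales_fact
    GROUP BY store_id, month_id
""")

# Reports read the small summary table instead of re-aggregating the facts.
store1_month1_total = conn.execute(
    "SELECT total_amount FROM sales_by_store_month WHERE store_id = 1 AND month_id = 1"
).fetchone()[0]
summary_row_count = conn.execute("SELECT COUNT(*) FROM sales_by_store_month").fetchone()[0]
```

The trade-off is that the summary must be refreshed whenever new fact rows are loaded.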