Class Agenda: 02/13/2014

Page 1: Class Agenda:   02/13/2014

Class Agenda: 02/13/2014

Review goals of assignments.
  Technology: SQL Server, Tableau
  Internal Data Project
Questions about assignments
Discuss process of data warehouse design
Discuss issues in data warehouse design
Contrast different approaches to data warehouse design
Design a data warehouse

Page 2: Class Agenda:   02/13/2014

Goals for data warehouse design

Make complete and accurate information easily accessible.
Present information consistently.
Be adaptive and flexible to change.
Provide reasonable and expected performance for information to support decision making.
Protect/secure information.

Page 3: Class Agenda:   02/13/2014

How do we achieve those goals?

Use systems analysis and design techniques.

Have domain knowledge of required decision support systems.

Model the data in a variety of different forms.

Appropriate use (or non-use) of normalization.

Use an appropriate DBMS for implementation.

Page 4: Class Agenda:   02/13/2014
Page 5: Class Agenda:   02/13/2014

Three different “general” data models

Transaction (operational) data model: Contains current data required by separate and/or integrated operational systems. Supports the transactional processing of the organization. Is frequently used to support day-to-day decision making. 3rd normal form. Does not usually contain external data.

Reconciled (enterprise data warehouse) data model: Contains detailed, current data intended to be the single, authoritative source for all decision support applications. Usually in 3rd normal form. May contain data generated externally from the organization.

Derived (data mart) data model: Contains data that are selected, formatted and aggregated for end-user decision support applications. Star or snowflake schema. May not be normalized. May contain data generated externally from the organization.

Page 6: Class Agenda:   02/13/2014

Reconciled and Derived Data Models

Reconciled (EDW)
  Independent of specific decisions
  Centralized control; usually owned by IT
  Historical
  Not usually summarized
  Normalized
  Flexible
  Many data sources
  Long life
  Starts large, becomes larger

Derived (Data Mart)
  Specific decisions
  One central subject
  Usually accessed directly by users; usually decentralized into user area
  Closely defined subject area
  Detailed and/or summarized
  Usually denormalized
  Restrictive (few sources)
  Short life span
  Starts small, becomes large

Page 7: Class Agenda:   02/13/2014

Two general approaches to design

Enterprise Data Warehouse (Bill Inmon)
  Focus is on enterprise subjects that will be needed to support comprehensive decision making.
  Emphasis on creating design that is consistent among subject areas.
  Implementation is of a data mart.
  Uses ERD for modeling.
  Relies on comprehensive blueprint for interrelation of data.

Interrelated Data Marts (Ralph Kimball)
  Focus is on business subject area for data warehouse.
  Emphasis on creating simple design that can be implemented quickly.
  Implementation is of a data mart.
  Uses “dimensional model” for modeling. Kind of like an ERD with UML-type aspects.
  Relies on consistent interrelation of data by integration of existing data models.

Page 8: Class Agenda:   02/13/2014

Compare/Contrast Approaches

Similarities:
  Both focus on subject areas for development of data model.
  Both require extensive input from data warehouse stakeholders.
  Both produce a subject-oriented, non-volatile, time-related data warehouse.
  Both try to quickly implement a prototype data mart.

Differences:
  Inmon creates a more integrated and consistent data warehouse by attempting to design an enterprise-wide warehouse at the beginning of the first data warehouse project. This is called a “reconciled” DW design.
  Kimball relies on future project teams referencing existing data warehouse models for new projects.

Page 9: Class Agenda:   02/13/2014

What do both approaches yield?

A design for a data mart.
The design for a data mart is based on the concept of a data warehouse “cube.”
A cube is a logical construct containing a “fact” table that is accessed on multiple “dimension” tables.
A fact table contains values that a manager uses to make decisions.
A dimension table is used as a reference for the values in the fact table.
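
The fact/dimension relationship can be sketched as a small star schema. This is a minimal illustration using Python's built-in sqlite3 module rather than SQL Server; the table names, column names, and sample rows are hypothetical and only loosely modeled on the Replica Toys example.

```python
import sqlite3

# Hypothetical star schema: one fact table (registrations) that references
# two dimension tables (customer and product model).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, model  TEXT);
CREATE TABLE fact_registration (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    price_paid   REAL   -- the numeric measure a manager analyzes
);
INSERT INTO dim_customer VALUES (1, 'East'), (2, 'West');
INSERT INTO dim_product  VALUES (10, 'Racer'), (11, 'Truck');
INSERT INTO fact_registration VALUES (1, 10, 20.0), (1, 11, 35.0), (2, 10, 22.0);
""")

def total_by(dimension_table, key_column, label_column):
    """Sum the fact measure grouped by one attribute of one dimension."""
    sql = f"""
        SELECT d.{label_column}, SUM(f.price_paid)
        FROM fact_registration AS f
        JOIN {dimension_table} AS d ON d.{key_column} = f.{key_column}
        GROUP BY d.{label_column}
        ORDER BY d.{label_column}
    """
    return conn.execute(sql).fetchall()

# The same facts viewed along two different dimensions of the "cube".
by_region = total_by("dim_customer", "customer_key", "region")
by_model = total_by("dim_product", "product_key", "model")
```

The fact table holds only keys and measures; every descriptive attribute used for grouping lives in a dimension table.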

Page 10: Class Agenda:   02/13/2014

Process of data warehouse design

1. Identify the stakeholders that need data to support their decisions.
2. Define and describe the data needs of those stakeholders.
3. Define the subject area.
4. Choose (EDW and data mart) or just data mart, or some combination thereof.
5. Select the data of interest. May be internal, external. May be purchased. May be stored in a transaction database, or may not. May be generated just for the data warehouse.
6. Identify the dimensions (master data/strong entities).
7. Add element of time.
8. Determine granularity level.
9. Identify the fact data.
10. Add derived data if necessary or desired.

Page 11: Class Agenda:   02/13/2014

How do you identify those people within an organization who require data to support their decision making processes?

Page 12: Class Agenda:   02/13/2014

Define and describe the data needs

Usually termed “stakeholder analysis”.
Differing levels of decision making require differing sets of data.
  Internal vs. external data.
  Integrated vs. non-integrated data.
  Detailed vs. summarized data.
Different stakeholders require different access mechanisms.
  Online vs. reports.
  Pre-formatted vs. ad-hoc availability of data.
Different stakeholders require different timing.
  Online, real time vs. delay.
  Relative size of delay/timeliness is always an issue.

Page 13: Class Agenda:   02/13/2014

Stakeholder Analysis Table Example: Replica Toys

Stakeholder: Marketing Analyst
  Decision making responsibilities: Decide what features are most valuable to which customers. Determine trends in toy purchases.
  Existing information: No data related to features currently available. Customer order data by distribution outlet.
  Additional information: Features selected by customers. Purchases by toy by customer.
  Availability of additional information: Not in existing system and cannot be compiled manually. Maybe telephone survey? Maybe registration system?

Stakeholder: Distribution Manager
  Decision making responsibilities: Determine trends in use of distribution outlets. Determine distribution outlet profitability.
  Existing information: Customer order data by distribution outlet.
  Additional information: Purchases by toy by customer by distribution outlet. Purchase price by toy by customer by distribution outlet.
  Availability of additional information: Need customer order data with more specific parameters. See if available in customer order system.

Stakeholder: Quality control specialist
  Decision making responsibilities: Evaluate comparative defects of toys within and across product lines.
  Existing information: Support call data. Product return data.
  Additional information: Detailed problem reports including date, toy, problem, extent of damage.
  Availability of additional information: Not available in current support call and product return systems. Could be added.

Stakeholder: Development engineer
  Decision making responsibilities: Evaluate relative safety issues with existing product line. Determine potential safety issues with new product development.
  Existing information: Support call data. Product return data. Safety test data.
  Additional information: Detailed problem reports including date, toy, problem, injury, relative impact of injury, potential responsibility.
  Availability of additional information: Not available in current support call and product return systems. Could be added. Engineering safety test data is available.

Page 14: Class Agenda:   02/13/2014

Define the subject area

Potential subject areas common to many businesses:
  Customers: People and organizations who acquire and/or use the company’s products.
  Equipment: Machinery, devices, tools and their components.
  Facilities: Real estate and its components.
  Sales: Transactions that move a product from the company to a customer.
  Suppliers: Entities that provide a company with goods and services.
  Products: Goods and services that the company, or its competitors, provide to customers.
  Materials: Goods and services that the company uses to produce its products.
  Financials: Information about money that is received, retained, expended, invested or in any way tracked by the company.
  Human resources: Individuals who perform work for the company; may be employees, contractors, or simply positions.

Page 15: Class Agenda:   02/13/2014

Select the data of interest

Use the existing transaction database model.
Identify and understand the necessary business decisions.
Identify external data that could help support decisions.
Use tables to help sort available attributes.

Page 16: Class Agenda:   02/13/2014

Decision: Which toys will sell best next year? In three years?

Worksheet columns (rows left blank):
  Sample informational questions that might help answer the decision question
  Data attributes required to inform the decision
  Additional systems/processes that could be used to create and/or access data

Page 17: Class Agenda:   02/13/2014

Worksheet columns (rows left blank):
  Additional data attributes required
  Existing or new?
  Potential data sources
  Data costs
  Data ownership

Page 18: Class Agenda:   02/13/2014

Transform operational data to DW

Transient vs. Periodic Data
Transient: Data in which changes to existing records are written over previous records, thus destroying the previous data content. (Type 1 change)
  Most transaction systems are based on transient data.
  Most data warehouses avoid transient data.
Periodic: Data that are never physically altered or deleted once they have been added to the data store. (Type 2 change)
  Most data warehouses are based on periodic data.
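
The difference between the two kinds of data can be shown with plain Python structures. This is only a sketch: the customer, cities, and dates are invented, and `city_as_of` is a hypothetical helper name.

```python
from datetime import date

# Transient store: an update overwrites the record, so the old value is lost.
transient = {"C1": {"city": "Tucson"}}
transient["C1"]["city"] = "Phoenix"

# Periodic store: records are only appended, each with its effective date.
periodic = [{"customer": "C1", "city": "Tucson", "effective": date(2013, 1, 1)}]
periodic.append({"customer": "C1", "city": "Phoenix", "effective": date(2014, 2, 1)})

def city_as_of(rows, customer, when):
    """Return the city in effect for a customer on a given date, or None."""
    matches = [r for r in rows if r["customer"] == customer and r["effective"] <= when]
    return max(matches, key=lambda r: r["effective"])["city"] if matches else None
```

Only the periodic store can answer a question about the past, such as where the customer lived in mid-2013; the transient store knows the current value and nothing else.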

Page 19: Class Agenda:   02/13/2014

Data warehouse Periodic Data

Fact vs. dimension
A “fact” is a numeric measure.
  Replica example: A registration is a “fact” along with the price that was paid for the purchase that spawned the registration.
  Facts are “weak entities.”
  Facts are usually transactions.
A “dimension” is reference information that relates to the fact.
  Replica examples: customer, product model, feature, place of purchase.
  Dimensions are “strong entities.”
  Dimensions are also considered the “master data” of an organization.

Page 20: Class Agenda:   02/13/2014

Dimensions are different in DW-land

Slowly changing dimension: A dimension whose values change over time. The issue is how to maintain knowledge of the past.
Approaches:
  Type 1: Just replace old data with new (lose historical data).
  Type 2: Create a new dimension table row each time the dimension object changes, with all dimension characteristics at the time of change. Most common approach.
  Type 3: For each changing attribute, create a current value field and several old-valued fields (multivalued).
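
The new-row-per-change approach can be sketched as follows. This uses Python's sqlite3 module as a stand-in for SQL Server. The `valid_from`/`valid_to` columns, the surrogate key, and the `record_change` helper are assumptions made for illustration; closing the previous row by setting its end date is one common variant, not the only one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One surrogate key per row version; customer_id is the stable business key.
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,
    customer_id  TEXT NOT NULL,
    city         TEXT NOT NULL,
    valid_from   TEXT NOT NULL,
    valid_to     TEXT            -- NULL marks the current row
)""")

def record_change(customer_id, city, change_date):
    """Close the current row (if any) and add a new row with the new values."""
    conn.execute(
        "UPDATE dim_customer SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from) VALUES (?, ?, ?)",
        (customer_id, city, change_date),
    )

record_change("C1", "Tucson", "2013-01-01")
record_change("C1", "Phoenix", "2014-02-01")

history = conn.execute(
    "SELECT city, valid_from, valid_to FROM dim_customer ORDER BY customer_key"
).fetchall()
```

Fact rows recorded before the move keep pointing at the older surrogate key, so historical reports still show the customer in the original city.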

Page 21: Class Agenda:   02/13/2014

Other dimensional issues

Degenerate dimension: A dimension that has no interesting dimension attributes (e.g. serial number).
Multi-valued dimension: A dimension that needs to be qualified by a set of values (e.g. feature).
  May have a related hierarchy.
  Example: group -> category -> family -> product

Page 22: Class Agenda:   02/13/2014

Dimensions can be hierarchical

Page 23: Class Agenda:   02/13/2014

Dimensions are usually normalized

Page 24: Class Agenda:   02/13/2014

Conformed Dimensions for growth

Conformed dimension: One or more dimension tables associated with two or more fact tables.
  Dimensions must have the same meaning for all related fact tables.
  Very hard to achieve without good planning.
Goal of any data warehouse is to plan the dimensions so that they span business processes/decision areas.
  Enhances consistency of facts.
  Allows integration of diverse systems.
  Helps a designer to create data warehouse systems incrementally.
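
A conformed dimension is what makes it possible to combine measures from separate fact tables. The sketch below assumes two hypothetical fact tables (registrations and returns) that share one product dimension. All names and numbers are invented, and sqlite3 stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, model TEXT);
CREATE TABLE fact_registration (product_key INTEGER, quantity INTEGER);
CREATE TABLE fact_return       (product_key INTEGER, quantity INTEGER);
INSERT INTO dim_product VALUES (10, 'Racer'), (11, 'Truck');
INSERT INTO fact_registration VALUES (10, 50), (11, 40), (10, 30);
INSERT INTO fact_return       VALUES (10, 4), (11, 10);
""")

# Aggregate each fact table separately, then combine the results through the
# shared dimension. Joining the two fact tables directly would double-count.
return_rates = conn.execute("""
    SELECT p.model, ret.qty * 1.0 / reg.qty AS return_rate
    FROM dim_product AS p
    JOIN (SELECT product_key, SUM(quantity) AS qty
          FROM fact_registration GROUP BY product_key) AS reg
      ON reg.product_key = p.product_key
    JOIN (SELECT product_key, SUM(quantity) AS qty
          FROM fact_return GROUP BY product_key) AS ret
      ON ret.product_key = p.product_key
    ORDER BY p.model
""").fetchall()
```

The comparison is only meaningful because `product_key` means the same thing in both fact tables, which is the planning problem the slide describes.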

Page 25: Class Agenda:   02/13/2014

A Bus Matrix to help plan for Conformed Dimensions

Columns (dimensions): Date, Customer, Product Model, Purchase Place, Employee

Business process or decision issue:
  Registering a toy: X X X X
  Accepting a return: X X X X X
  Accepting a complaint: X X X X
  Marketing toys to distributors: X X X
  Accepting an order from a distributor: X X X

Page 26: Class Agenda:   02/13/2014

Time is a dimension

Data warehouse is a historical model rather than a current “point in time” model.
Must have a way to incorporate changes that occur over time.
Important issues:
  Fact table must include a time component.
  Ranges of time vs. effective period in time.
  Time also relates to dimension tables.
  May have to deal with differing time periods. Examples are fiscal years, “holiday rush,” billing cycle, etc.
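
One way to handle differing time periods is to precompute them as attributes of a date dimension. The sketch below makes several assumptions that are not from the slides: a fiscal year that starts in July and is labeled by the calendar year in which it ends, a "holiday rush" defined as November and December, and hypothetical column names.

```python
from datetime import date, timedelta

def build_date_dimension(start, end, fiscal_year_start_month=7):
    """Return one row per calendar day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        # Assumption: a fiscal year is labeled by the calendar year it ends in.
        if fiscal_year_start_month > 1 and day.month >= fiscal_year_start_month:
            fiscal_year = day.year + 1
        else:
            fiscal_year = day.year
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),
            "calendar_year": day.year,
            "month": day.month,
            "fiscal_year": fiscal_year,
            "holiday_rush": day.month in (11, 12),  # assumed definition
        })
        day += timedelta(days=1)
    return rows

dates = build_date_dimension(date(2013, 6, 29), date(2013, 7, 2))
```

Fact rows then carry only a `date_key`, and any report can group by calendar year, fiscal year, or holiday rush without recomputing those rules.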

Page 27: Class Agenda:   02/13/2014

Time is complex

Page 28: Class Agenda:   02/13/2014

Fact tables

Measures: Sale Flag Quantity

Can have a “factless” fact table
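
A “factless” fact table records that an event happened and holds only dimension keys, with no numeric measure. A minimal sketch, assuming a hypothetical complaint event and invented keys:

```python
from collections import Counter

# Each row records that a complaint was received. It holds dimension keys
# only; there is no measure column.
fact_complaint = [
    {"date_key": 20140201, "customer_key": 1, "product_key": 10},
    {"date_key": 20140201, "customer_key": 2, "product_key": 10},
    {"date_key": 20140203, "customer_key": 1, "product_key": 11},
]

# Analysis of a factless fact table is done by counting rows per dimension.
complaints_per_product = Counter(row["product_key"] for row in fact_complaint)
```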

Page 29: Class Agenda:   02/13/2014

Determine granularity level

What are the benefits and drawbacks of a low level of granularity?

What are the benefits and drawbacks of a high level of granularity?

What factors should be considered when determining the level of granularity in the data warehouse?
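
As one input to these questions, the sketch below rolls sale-level rows up to a monthly grain. The data and the `roll_up_to_month` helper are hypothetical.

```python
from collections import defaultdict

# Fine grain: one row per individual sale (date, model, amount).
sales = [
    ("2014-01-05", "Racer", 20.0),
    ("2014-01-05", "Racer", 22.0),
    ("2014-01-19", "Truck", 35.0),
    ("2014-02-02", "Racer", 21.0),
]

def roll_up_to_month(rows):
    """Aggregate sale-level rows to one total per (month, model)."""
    totals = defaultdict(float)
    for sale_date, model, amount in rows:
        totals[(sale_date[:7], model)] += amount
    return dict(totals)

monthly = roll_up_to_month(sales)
```

The monthly table is smaller, but the individual sales cannot be recovered from it. Detailed data can always be summarized later, whereas summarized data cannot be broken back down.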

Page 30: Class Agenda:   02/13/2014

Might have to “derive” facts

Derived data includes any kind of calculated field.
Facts are usually derived when storing the underlying detail would produce an overwhelming amount of data.
Examples: total sales; net sales amount; total funds raised; total cost of products.
Issues:
  Must be identified, defined and agreed upon by data warehouse stakeholders.
  Must be documented in metadata.
  Must be consistent.
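
A derived fact such as net sales amount is only useful if everyone computes it the same way. The sketch below fixes one possible definition in a single function. The formula (gross sales less discounts and returns) and the field names are assumptions for illustration, not a definition taken from the course.

```python
def net_sales_amount(quantity, unit_price, discount, returns_amount):
    """Net sales under one assumed definition: gross less discounts and returns."""
    gross = quantity * unit_price
    return gross - discount - returns_amount

order_lines = [
    {"quantity": 3, "unit_price": 20.0, "discount": 5.0, "returns_amount": 0.0},
    {"quantity": 1, "unit_price": 35.0, "discount": 0.0, "returns_amount": 35.0},
]

# Derive the fact once, during loading, so every report uses the same value.
net_by_line = [net_sales_amount(**line) for line in order_lines]
total_net_sales = sum(net_by_line)
```

Keeping the calculation in one documented place is a simple way to meet the requirements that a derived fact be agreed upon, recorded in metadata, and consistent.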