1
Class Agenda: 03/13 – 3/15
Review Database design – core concepts
Review design for ERD Scenarios #3 & #4
Review concepts of normalization.
Do practice design from forms using the Replica Toy database (ERD scenario #5).
Discuss issues in database design and normalization.
Discuss concepts of data warehouse design.
Establish environment surrounding DW design.
Contrast methods of DW design.
2
Goals for Transaction Database Design
Protect the integrity of the data.
Reduce data redundancy. Prevent data anomalies.
Provide for change.
Prevent inflexible data structures. Anticipate changes.
Provide access to complete data for decision making.
3
What is normalization?
Normalization is a formal, process-oriented approach to data modeling.
Normalization is the process of:
examining groups of data attributes; splitting them into appropriate entities; identifying the relationships between the entities;
and identifying appropriate primary and foreign keys.
4
Two methods of applying normalization
1. Use it to help in designing a database.
Normalization starts with a single entity.
Normalization breaks that entity into a series of additional entities.
More entities are discovered and named during the process.
Entities are linked during the process.
2. Use it to validate the design of a database.
Identify entities from the meaning of the data.
Create conceptual and logical data models.
Apply the rules of normalization to ensure a stable, non-redundant design.
Normalization Vocabulary: Functional Dependency and Determinants
A social security number determines your name and address. SSN → name, address.
A vehicle id number determines the make and model of a car. VIN → make, model.
Name and address are “functionally dependent” on SSN.
SSN “determines” name and address.
Functional dependency diagram format:
CrsNum → CrsDescription, CrsCredits
ZipCode → City, State (this implies that a zip code uniquely identifies a city and state in the U.S. postal system)
PatID, TrtDateTime → TstResults, TrtID, LocID
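A functional dependency can be tested against sample rows. The following is a minimal sketch, not part of the original slides: the function name `fd_holds` and the course rows are hypothetical. Sample data can only refute a dependency; it cannot prove that the dependency holds for every future row.

```python
def fd_holds(rows, determinant, dependents):
    """Return True if each determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependents)
        # setdefault stores the first value seen for this key and returns it;
        # a later, different value means the dependency is violated.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample rows (illustration only).
courses = [
    {"CrsNum": "IS301", "CrsDescription": "Database Design", "CrsCredits": 3},
    {"CrsNum": "IS301", "CrsDescription": "Database Design", "CrsCredits": 3},
    {"CrsNum": "IS410", "CrsDescription": "Data Warehousing", "CrsCredits": 3},
]

crsnum_determines = fd_holds(courses, ["CrsNum"], ["CrsDescription", "CrsCredits"])
credits_determines = fd_holds(courses, ["CrsCredits"], ["CrsNum"])
```

In this sample, `CrsNum` determines the description and credits, while `CrsCredits` does not determine `CrsNum`, because two different courses carry the same number of credits.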
6
Normal forms relevant to business oriented databases
First normal form: Remove repeating groups.
Second normal form: Remove partial functional dependencies.
Third normal form: Remove transitive dependencies.
7
First Normal Form
First normal form: Remove repeating groups.
A repeating group is an attribute or group of attributes that can have more than one value for an instance of an entity.
Example of repeating groups:
StudentID → StudentName, StudentAddress, CourseID1, DateTaken1, Grade1, CourseID2, DateTaken2, Grade2, CourseID3, DateTaken3, Grade3, CourseID4, DateTaken4, Grade4…
Other examples of a repeating group:
Serial# → model#, customer name, customer address, feature 1 chosen, feature 2 chosen, feature 3 chosen…
PatientID → name, address, zip, first insurance company, second insurance company, third insurance company…
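One way to picture the remedy for the student example is to move each numbered course slot into its own row. This is a sketch under assumptions: the helper `split_repeating_group` and the sample record are hypothetical, and empty slots are assumed to be stored as `None`.

```python
def split_repeating_group(record, key, fixed, group, slots):
    """Split one wide record into a parent row plus one child row per filled slot."""
    parent = {key: record[key], **{a: record[a] for a in fixed}}
    children = []
    for i in range(1, slots + 1):
        # Collect CourseID1/DateTaken1/Grade1, then CourseID2/..., and so on.
        values = {a: record.get(f"{a}{i}") for a in group}
        if any(v is not None for v in values.values()):
            children.append({key: record[key], **values})
    return parent, children


# Hypothetical wide record with three course slots, one of them empty.
wide = {
    "StudentID": 1001, "StudentName": "Pat Lee", "StudentAddress": "12 Elm St",
    "CourseID1": "IS301", "DateTaken1": "2006-01", "Grade1": "A",
    "CourseID2": "IS410", "DateTaken2": "2006-03", "Grade2": "B",
    "CourseID3": None, "DateTaken3": None, "Grade3": None,
}

student, enrollments = split_repeating_group(
    wide, "StudentID", ["StudentName", "StudentAddress"],
    ["CourseID", "DateTaken", "Grade"], slots=3,
)
```

The student row keeps only the non-repeating attributes, and each enrollment row carries `StudentID` so it can be linked back to the student. A student can then take any number of courses without changing the structure.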
8
9
To remedy a problem discovered with normalization
To get a data model into an appropriate normal form:
Identify the problem (repeating group, partial functional dependency, or transitive dependency) and place the “problem” attributes in one or more new separate entities in the model.
Identify a primary key for the new entity. The key may be concatenated if it is an associative entity, rather than a strong entity.
Create relationships between existing and new entities.
Divide m:n relationships with appropriate intersection entities.
10
Second Normal Form
Second normal form: Remove partial functional dependencies.
A partial functional dependency is a situation in which one or more non-key attributes are functionally dependent on part, but not all, of the primary key.
Partial functional dependencies occur only with entities that have concatenated primary keys.
Examples of partial functional dependencies:
PatID, TrtDateTime → PatName, TstResults, TrtType, TrtDescription, LocName, TrtID, LocID
CourseID, StudentID → CourseTitle, Grade
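The course example can be decomposed as follows. This sketch assumes that `CourseTitle` depends on `CourseID` alone while `Grade` depends on the full key; the slides list the example without naming the partial dependency, so that reading is an assumption. The helper `project` and the sample rows are hypothetical.

```python
def project(rows, attrs):
    """Project rows onto attrs, dropping duplicate rows while keeping order."""
    seen, out = set(), []
    for row in rows:
        t = tuple(row[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(zip(attrs, t)))
    return out


# Hypothetical rows keyed by (CourseID, StudentID); CourseTitle repeats per student.
enrollment_wide = [
    {"CourseID": "IS301", "StudentID": 1001, "CourseTitle": "Database Design", "Grade": "A"},
    {"CourseID": "IS301", "StudentID": 1002, "CourseTitle": "Database Design", "Grade": "B"},
    {"CourseID": "IS410", "StudentID": 1001, "CourseTitle": "Data Warehousing", "Grade": "B"},
]

# Attributes that depend on part of the key move to their own entity.
course = project(enrollment_wide, ["CourseID", "CourseTitle"])
# Attributes that depend on the whole key stay with the concatenated key.
enrollment = project(enrollment_wide, ["CourseID", "StudentID", "Grade"])
```

After the split, each course title is stored once in `course`, and `enrollment` keeps one row per student per course.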
11
Third Normal Form
Third normal form: Remove transitive dependencies.
A transitive dependency occurs when a non-key attribute is functionally dependent on one or more non-key attributes.
Examples of transitive dependencies:
TrackingNumber → ShipmentDate, OrderID, ItemID, ShipmentLocationID, LocationDescription, QuantityShipped
PatID, TrtDateTime → TstResults, TrtType, TrtDescription, LocName, TrtID, LocID
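A possible third normal form design for the shipment example is sketched below using Python's built-in `sqlite3` module. It assumes that the transitive dependency is ShipmentLocationID → LocationDescription; the slides do not say which attributes are involved. The table names, column types, and sample values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The attribute that depended on a non-key attribute gets its own entity.
CREATE TABLE ShipmentLocation (
    ShipmentLocationID  INTEGER PRIMARY KEY,
    LocationDescription TEXT NOT NULL
);
-- Shipment keeps only attributes that depend directly on TrackingNumber.
CREATE TABLE Shipment (
    TrackingNumber     TEXT PRIMARY KEY,
    ShipmentDate       TEXT NOT NULL,
    OrderID            INTEGER NOT NULL,
    ItemID             INTEGER NOT NULL,
    QuantityShipped    INTEGER NOT NULL,
    ShipmentLocationID INTEGER NOT NULL
        REFERENCES ShipmentLocation(ShipmentLocationID)
);
INSERT INTO ShipmentLocation VALUES (1, 'East warehouse'), (2, 'West warehouse');
INSERT INTO Shipment VALUES
    ('T-001', '2006-03-13', 500, 10, 4, 1),
    ('T-002', '2006-03-14', 501, 11, 2, 1),
    ('T-003', '2006-03-15', 502, 10, 1, 2);
""")

# Renaming a location is now a single-row update.
conn.execute(
    "UPDATE ShipmentLocation SET LocationDescription = ? WHERE ShipmentLocationID = ?",
    ("East distribution center", 1),
)

rows = conn.execute("""
    SELECT s.TrackingNumber, l.LocationDescription
    FROM Shipment AS s
    JOIN ShipmentLocation AS l USING (ShipmentLocationID)
    ORDER BY s.TrackingNumber
""").fetchall()
```

Because the description is stored once, the update reaches every shipment from that location, which avoids the update anomaly that the unnormalized layout allows.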
12
Issues in Database Design
Characteristics of business-oriented databases:
Used to store transactions.
Updated quickly and frequently, but not always accurately.
Accessed online real-time.
Support operational decision making.
Assuming that the data stored is accurate, what is the biggest potential problem with a transaction database in third normal form?
How do most organizations solve that problem?
What do organizations potentially lose when they solve that problem?
13
Major purposes of a data warehouse
To create a data store designed to facilitate managerial decision making.
Integrated data. Subject-oriented. Time-variant. Non-volatile.
To create a data store that has better quality, more consistent data than existing operational databases.
[Figure: data warehouse architecture. Labels: Operational (transaction) and external data sources; Extract/Transform/Load (ETL) processes; Reconciled enterprise data warehouse; Extract/Load processes; Data warehouse server; Data mart tier; User departments.]
15
Goals of data warehouse design
Make accurate information easily accessible.
Present information consistently.
Be adaptive and flexible to change.
Provide reasonable and expected performance for information to support decision making.
Minimize data redundancy.
Protect/secure information.
16
Three different data models
Transaction (operational) data model: Contains current data required by separate and/or integrated operational systems. Supports the transactional processing of the organization. Is frequently used to support day-to-day decision making. 3rd normal form.
Reconciled (enterprise data warehouse) data model: Contains detailed, current data intended to be the single, authoritative source for all decision support applications. Usually in 3rd normal form.
Derived (data mart) data model: Contains data that are selected, formatted and aggregated for end-user decision support applications. Star schema. Probably not normalized.
17
Comparison – Replica Toys
Transaction data model
Reconciled data model
Derived (data mart) data model
Reconciled and Derived Data Models
Reconciled (EDW)
Independent of specific decisions
Centralized control; usually owned by IT
Historical
Not summarized
Normalized
Flexible
Many data sources
Long life
Starts large, becomes larger
Derived (Data Mart)
Specific decisions
One central subject
Usually accessed directly by users; usually decentralized into user area
Closely defined subject area
Detailed and/or summarized
Usually denormalized
Restrictive – few sources
Short life span
Starts small, becomes large
Two approaches to design
Enterprise Data Warehouse (Inmon)
Focus is on enterprise subjects that will be needed to support comprehensive decision making.
Emphasis on creating design that is consistent among subject areas.
Implementation is of a data mart.
Uses ERD for modeling.
Relies on comprehensive blueprint for interrelation of data.
Interrelated Data Marts (Kimball)
Focus is on business subject area for data warehouse.
Emphasis on creating simple design that can be implemented quickly.
Implementation is of a data mart.
Uses “dimensional model” for modeling. Kind of like an ERD with UML-type aspects.
Relies on consistent interrelation of data by integration of existing data models.
20
Compare/Contrast Approaches
Similarities:
Both focus on subject areas for development of data model.
Both require extensive input from data warehouse stakeholders.
Both produce a subject-oriented, non-volatile, time-related data warehouse.
Both try to quickly implement a prototype data mart.
Differences:
Inmon creates a more integrated and consistent data warehouse by attempting to design an enterprise-wide warehouse at the beginning of the first data warehouse project. This is called a “reconciled” DW design.
Kimball relies on future project teams referencing existing data warehouse models for new projects.
21
What do both approaches yield?
A design for a data mart.
The design for a data mart relies on the concept of a data warehouse “cube.”
A cube is a logical construct containing a “fact” table that is accessed through multiple “dimension” tables.
A fact table contains values that a manager uses to make decisions.
A dimension table is used as a reference for the values in the fact table.
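The fact and dimension idea can be sketched with Python's built-in `sqlite3` module. Every table name, column, and value below is hypothetical and only loosely modeled on the Replica Toys scenario. The fact table holds the measures; the dimension tables supply the labels used to group them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimDate   (DateKey INTEGER PRIMARY KEY, CalendarDate TEXT, FiscalQuarter TEXT);
CREATE TABLE DimToy    (ToyKey INTEGER PRIMARY KEY, ToyName TEXT, ProductLine TEXT);
CREATE TABLE DimOutlet (OutletKey INTEGER PRIMARY KEY, OutletName TEXT);
-- Fact table: one row per toy, outlet, and date, holding the values used in decisions.
CREATE TABLE FactSales (
    DateKey     INTEGER REFERENCES DimDate(DateKey),
    ToyKey      INTEGER REFERENCES DimToy(ToyKey),
    OutletKey   INTEGER REFERENCES DimOutlet(OutletKey),
    Quantity    INTEGER,
    SalesAmount REAL
);
INSERT INTO DimDate   VALUES (1, '2006-03-13', 'Q1'), (2, '2006-04-02', 'Q2');
INSERT INTO DimToy    VALUES (1, 'Robot', 'Action'), (2, 'Train', 'Vehicles');
INSERT INTO DimOutlet VALUES (1, 'Web'), (2, 'Retail');
INSERT INTO FactSales VALUES
    (1, 1, 1, 2, 40.0),
    (1, 2, 2, 1, 25.0),
    (2, 1, 2, 3, 60.0),
    (2, 2, 1, 2, 50.0);
""")

# Sales amount by product line and fiscal quarter, read through two dimensions.
sales_by_line_and_quarter = conn.execute("""
    SELECT t.ProductLine, d.FiscalQuarter, SUM(f.SalesAmount)
    FROM FactSales AS f
    JOIN DimToy  AS t ON t.ToyKey  = f.ToyKey
    JOIN DimDate AS d ON d.DateKey = f.DateKey
    GROUP BY t.ProductLine, d.FiscalQuarter
    ORDER BY t.ProductLine, d.FiscalQuarter
""").fetchall()
```

Swapping `DimToy` for `DimOutlet` in the join would answer a distribution question from the same fact rows, which is the sense in which a fact table is accessed through several dimensions.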
22
Steps of data warehouse design
1. Identify the stakeholders that need data to support their decisions.
2. Define and describe the data needs of those stakeholders.
3. Define the subject area.
4. Choose between an EDW with data marts and data marts alone.
5. Select the data of interest.
6. Add element of time.
7. Add derived data.
8. Determine granularity level.
9. Summarize data.
10. Identify and attempt to solve potential performance issues.
23
How do you identify those people within an organization who require data to support their decision making processes?
24
Define and describe the data needs
Usually termed “stakeholder analysis”.
Differing levels of decision making require differing sets of data. Internal vs. external data. Integrated vs. non-integrated data. Detailed vs. summarized data.
Different stakeholders require different access mechanisms. Online vs. reports. Pre-formatted vs. ad-hoc availability of data.
Different stakeholders require different timing. Online, real time vs. delay. Relative size of delay/timeliness is always an issue.
Stakeholder Analysis Table Example – Replica Toys
Columns: Stakeholder; Decision Making Responsibilities; Existing Information?; Additional Information?; Availability of Additional Information?

Stakeholder: Marketing Analyst
Decision Making Responsibilities: Decide what features are most valuable to which customers. Determine trends in toy purchases.
Existing Information? No data related to features currently available. Customer order data by distribution outlet.
Additional Information? Features selected by customers. Purchases by toy by customer.
Availability of Additional Information? Not in existing system and cannot be compiled manually. Maybe telephone survey? Maybe registration system?

Stakeholder: Distribution Manager
Decision Making Responsibilities: Determine trends in use of distribution outlets. Determine distribution outlet profitability.
Existing Information? Customer order data by distribution outlet.
Additional Information? Purchases by toy by customer by distribution outlet. Purchase price by toy by customer by distribution outlet.
Availability of Additional Information? Need customer order data with more specific parameters. See if available in customer order system.

Stakeholder: Quality control specialist
Decision Making Responsibilities: Evaluate comparative defects of toys within and across product lines.
Existing Information? Support call data. Product return data.
Additional Information? Detailed problem reports including date, toy, problem, extent of damage.
Availability of Additional Information? Not available in current support call and product return systems. Could be added.

Stakeholder: Development engineer
Decision Making Responsibilities: Evaluate relative safety issues with existing product line. Determine potential safety issues with new product development.
Existing Information? Support call data. Product return data. Safety test data.
Additional Information? Detailed problem reports including date, toy, problem, injury, relative impact of injury, potential responsibility.
Availability of Additional Information? Not available in current support call and product return systems. Could be added. Engineering safety test data is available.
26
Define the subject area
Potential subject areas common to many businesses:
Customers: People and organizations who acquire and/or use the company’s products.
Equipment: Machinery, devices, tools and their components.
Facilities: Real estate and its components.
Sales: Transactions that move a product from the company to a customer.
Suppliers: Entities that provide a company with goods and services.
Products: Goods and services that the company, or its competitors, provide to customers.
Materials: Goods and services that the company uses to produce its products.
Financials: Information about money that is received, retained, expended, invested or in any way tracked by the company.
Human resources: Individuals who perform work for the company – may be employees, contractors, or simply positions.
27
Select the data of interest
Use the existing transaction database model.
Identify and understand the necessary business decisions.
Identify external data that could help support decisions.
Use tables to help sort available attributes.
Example: Table 4.1 on pgs 104-106 of chapter 4 in “Mastering Data Warehouse Design.”
28
Add element of time
Data warehouse is a historical model rather than a current “point in time” model.
Must have a way to incorporate changes that occur over time.
Important issues:
Fact table must include a time component.
Ranges of time vs. effective period in time.
Time also relates to dimension tables.
May have to deal with differing time periods. Examples are fiscal years, “holiday rush,” billing cycle, etc.
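One common way to carry history in a dimension is to give each version of a row an effective date range. The sketch below is an illustration only: the column names `valid_from` and `valid_to`, the half-open range convention, and the toy history are all assumptions, not something the slides prescribe.

```python
from datetime import date

# Hypothetical history for one toy whose product line changed at the start of 2006.
toy_history = [
    {"ToyID": 7, "ProductLine": "Classic",
     "valid_from": date(2005, 1, 1), "valid_to": date(2006, 1, 1)},
    {"ToyID": 7, "ProductLine": "Collector",
     "valid_from": date(2006, 1, 1), "valid_to": date.max},
]


def version_as_of(history, toy_id, on_date):
    """Return the row in effect on on_date (valid_from inclusive, valid_to exclusive)."""
    for row in history:
        if row["ToyID"] == toy_id and row["valid_from"] <= on_date < row["valid_to"]:
            return row
    return None


line_in_2005 = version_as_of(toy_history, 7, date(2005, 6, 15))["ProductLine"]
line_in_2006 = version_as_of(toy_history, 7, date(2006, 3, 13))["ProductLine"]
before_history = version_as_of(toy_history, 7, date(2004, 1, 1))
```

A fact row dated in 2005 is then reported under the product line that applied in 2005, even though the toy has since been reclassified.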
29
Add derived data
Derived data includes any kind of calculated field.
Examples: total sales; net sales amount; total funds raised; total cost of products.
Issues:
Must be identified, defined and agreed upon by data warehouse stakeholders.
Must be documented in metadata. Must be consistent.
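Consistency is easier to keep when each derived field has one agreed definition that every load process reuses. The sketch below assumes a definition of net sales amount (gross sales minus discounts minus returns) purely for illustration; the actual definition has to come from the stakeholders. The names `net_sales_amount` and `DERIVED_FIELD_METADATA` are hypothetical.

```python
from decimal import Decimal


def net_sales_amount(gross, discounts, returns):
    """Assumed definition: gross sales minus discounts minus returns."""
    return gross - discounts - returns


# Hypothetical metadata entry recording the agreed definition next to the calculation.
DERIVED_FIELD_METADATA = {
    "NetSalesAmount": {
        "definition": "Gross sales minus discounts minus returns (assumed for illustration)",
        "compute": net_sales_amount,
    },
}

compute = DERIVED_FIELD_METADATA["NetSalesAmount"]["compute"]
example_net = compute(Decimal("100.00"), Decimal("5.00"), Decimal("10.50"))
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.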
30
Determine granularity level
What are the benefits and drawbacks of a low level of granularity?
What are the benefits and drawbacks of a high level of granularity?
What factors should be considered when determining the level of granularity in the data warehouse?
31
Summarize (aggregate) data
What is summarized data?
How is data summarized?
Does summarized data save disk space?
Why summarize data?
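As a concrete starting point for these questions, the sketch below rolls hypothetical daily sales rows up to one row per month and toy. The data and the function name `summarize_by_month` are assumptions; the example shows one form of aggregation and does not settle the questions above.

```python
from collections import defaultdict

# Hypothetical detail rows at daily grain: (date, toy, sales amount).
daily_sales = [
    ("2006-03-13", "Robot", 40.0),
    ("2006-03-14", "Robot", 20.0),
    ("2006-03-14", "Train", 25.0),
    ("2006-04-02", "Robot", 60.0),
]


def summarize_by_month(rows):
    """Aggregate daily rows to one total per (month, toy)."""
    totals = defaultdict(float)
    for day, toy, amount in rows:
        month = day[:7]  # 'YYYY-MM' prefix of an ISO date string
        totals[(month, toy)] += amount
    return dict(totals)


monthly_sales = summarize_by_month(daily_sales)
```

Four detail rows become three summary rows here. The daily detail can no longer be recovered from the summary alone, which ties summarization back to the choice of granularity.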