1 Agenda: 04/22 and 04/24 Answer questions about Replica Toys post-sales project. Total points...

Agenda: 04/22 and 04/24

Answer questions about Replica Toys post-sales project.

Total points remaining for the project = 140. Currently split into 30% for part 2 (60 points) and 40% for part 3 (80 points).

Less risky approach: You can turn the parts in as scheduled: Part 2 due on 4/29 and Part 3 due on 5/12 (by noon).

Riskier approach: You can turn in both parts 2 and 3 on 5/12 (by noon) and get one grade for the entire 140 points.

Discuss topics related to physical database design.

What is physical database design?

The process of translating a logical description of data into technical specifications for storing and retrieving data.

Preparing documentation for actual implementation of tables in a database.

Physical vs. logical design

A physical design can look exactly like a logical design.

Small database: Logical design usually is the same as physical design.

Or a physical design can look different than a logical design.

Large database: Physical design will probably change entity structure to ensure good performance.

Differences between physical and logical design stem from:

Goals.

Constraints.

Database design goals

Review question: What are the design goals for logical database design?

New question: What are the design goals for physical database design?

Tasks in physical design

Convert entities into tables:
Identify all necessary data attributes.
Determine correct size and data type for each data attribute.
Choose an appropriate primary key.
Identify foreign keys necessary to sustain relationships.
Define necessary constraints.

Enhance performance:
Identify size and access methods of data.
Choose appropriate hardware.
Create indices.
De-normalize the design as necessary.
Create design and procedures for archiving data.
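The table-conversion tasks above can be sketched in a few lines. This is only an illustration, assuming SQLite through Python's built-in sqlite3 module rather than the DBMS used in class; the customer and customer_order tables, their columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical two-table schema; SQLite is an assumption, not the course DBMS.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,             -- primary key
    name        VARCHAR(40) NOT NULL             -- size and type chosen per attribute
);
CREATE TABLE customer_order (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer (customer_id),   -- foreign key sustains the relationship
    order_date  DATE NOT NULL,
    quantity    INTEGER NOT NULL CHECK (quantity > 0) -- constraint on allowed values
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Ada')")
conn.execute("INSERT INTO customer_order VALUES (10, 1, '2024-04-22', 3)")


def try_insert(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# An order for a customer that does not exist, and an order with quantity 0.
orphan_ok = try_insert("INSERT INTO customer_order VALUES (11, 99, '2024-04-22', 1)")
zero_qty_ok = try_insert("INSERT INTO customer_order VALUES (12, 1, '2024-04-22', 0)")
order_rows = conn.execute("SELECT COUNT(*) FROM customer_order").fetchone()[0]
print(orphan_ok, zero_qty_ok, order_rows)
```

Both rejected inserts are stopped by the database itself, which is the point of defining keys and constraints in the physical design instead of relying on application code.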

As a SQL programmer, what are some of the problems you might have with that database?

How could those problems be alleviated with a better physical design?

Physical design questions

How should the super-type of PERSON be related to the required sub-types? Separate tables or the same table?

How do you relate a sub-type of a generalization relationship (FACULTY) with a weak entity (COURSEOFFERING)?

What should you do with the concatenated keys in COURSEOFFERING and STUDENTENROLL?

What is the projected number of rows per table?

How often will the data be updated?

When will the data be removed from the system?

How will the rows in each table be accessed?

Choosing datatypes for attributes

A datatype is a name or label for a set of values and some operations which one can perform on that set of values.

Examples in SQL: varchar, date, number, money

Concept of “strongly data typed.”

Objectives for choosing an appropriate data type:

Minimize storage space.

Represent all possible values.

Improve data integrity.

Support all data manipulations.

Choosing an appropriate primary key

General rules:

Must be a unique value for each row in the table.

Cannot be null.

Should be static over the life of the row.

Physical primary key design heuristics:

Should be a single attribute.

Should be numeric.

Should not be “intelligent.”
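The heuristics above can be illustrated with a surrogate key. This is a minimal sketch, assuming SQLite through Python's sqlite3 module; the product table, its sku codes, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,   -- single attribute, numeric, carries no meaning
    sku        TEXT NOT NULL UNIQUE,  -- "intelligent" code kept as an ordinary attribute
    name       TEXT NOT NULL
);
""")

first = conn.execute(
    "INSERT INTO product (sku, name) VALUES ('TOY-RED-01', 'Red truck')"
).lastrowid
second = conn.execute(
    "INSERT INTO product (sku, name) VALUES ('TOY-BLU-02', 'Blue truck')"
).lastrowid

# The business code can change while the key that other tables reference stays static.
conn.execute("UPDATE product SET sku = 'TRK-RED-01' WHERE product_id = ?", (first,))
row = conn.execute(
    "SELECT product_id, sku FROM product WHERE name = 'Red truck'"
).fetchone()
print(first, second, row)
```

Because the key encodes nothing about the product, renumbering or recoding the sku never forces an update to foreign keys elsewhere.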

Overview of Database Performance

Key metrics for database performance:

Minimize response time to access data in a database.

Minimize response time to change contents in a database.

Most concerned with balancing disk access and memory capacity.

Improving performance

1. By optimizing use of existing resources.

2. By using better or more resources.

3. By creating indexes.

4. By denormalizing the database.

5. By storing derived data.

6. By creating procedures to archive data.

Group together files to better use memory and disk access time

[Figure: three layers of storage. Labels shown: LR1 to LR4, PR1 and PR2, read, write.]

Application buffers: Logical records (LRs).

DBMS buffers: Logical records (LRs) inside of physical records (PRs).

Operating system: Physical records (PRs) on disk, transferred by read and write operations.

-- Store customers and their orders together, grouped by customer_id.
CREATE CLUSTER ordering (clusterkey CHAR(6));

CREATE TABLE tbl_customer (
    customer_id CHAR(6) NOT NULL,
    address     VARCHAR(25)
) CLUSTER ordering (customer_id);

CREATE TABLE tbl_order (
    order_id    CHAR(6) NOT NULL,
    customer_id CHAR(6) NOT NULL,
    order_date  DATE
) CLUSTER ordering (customer_id);

Add or change resources to improve performance.

Will help a little: more processor power.

Will help the most: more, more, more memory.

Will really help: faster, more efficient disk.

Solid-state drives

RAID: Redundant arrays of inexpensive (or independent) disks.

A set of multiple physical disk drives that appear to the designer and user as a single storage unit.

Segments of data, called stripes, cut across all of the disk drives.

Access can occur concurrently.

http://www.acnc.com/raid

http://www.raidweb.com/

Indexes are the single most important tool a database programmer/administrator can use for improving the performance of a database.

Indexes are easy to create!!!

Can add an index to a database with a simple SQL command:

CREATE INDEX index_name ON table_name (column_name);

Understanding what happens when an index is created requires a basic understanding of indexing and file organization.
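One way to see what the command changes is to compare the query plan before and after the index exists. The sketch below assumes SQLite through Python's sqlite3 module (its EXPLAIN QUERY PLAN output is SQLite-specific); the tbl_order rows and the index name idx_order_customer are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_order (order_id INTEGER PRIMARY KEY, customer_id TEXT, order_date TEXT)"
)
conn.executemany(
    "INSERT INTO tbl_order (customer_id, order_date) VALUES (?, ?)",
    [(f"C{i % 100:05d}", "2024-01-01") for i in range(1000)],
)


def plan(sql):
    """Return the human-readable detail column of SQLite's query plan."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


query = "SELECT * FROM tbl_order WHERE customer_id = 'C00042'"
before = plan(query)  # without an index the whole table is scanned
conn.execute("CREATE INDEX idx_order_customer ON tbl_order (customer_id)")
after = plan(query)   # with the index the matching rows are looked up directly
print(before)
print(after)
```

The exact wording of the plan text varies between SQLite versions, so it is printed rather than quoted here.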

File organization and access concepts

File organization. The physical arrangement of data in a file into records and pages on secondary storage. File organization dictates the physical placement of records.

File access methods. The steps involved in retrieving records from a file. File access methods dictate how data can be retrieved from secondary storage. Options include:

Sequential access from beginning.
Sequential access from pre-defined point.
Backwards from end.
Backwards from pre-defined point.
Direct. (not really direct – has to go through a series of indices)

General file organization options

Sequential file organization. Records are stored one after another. Referred to as a “heap” or “pile.”

Indexed file organization. Records are stored either ordered or unordered, as in sequential organization. An additional structure, the index, is built on pre-determined keys for the records.

What is an index?

An additional physical file.

An index is a sorted list of pointers stored along with the actual data.

Benefit: Indexes provide faster direct data access.

Drawbacks: Indexes slow down data updates.

Indexes require periodic reorganization.
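The idea of "a sorted list of pointers" can be shown in plain Python. This is a toy model under stated assumptions: the records list stands in for the data file, list positions stand in for pointers, and the customer keys are invented. Real DBMSs typically use B-tree structures rather than a flat sorted list.

```python
from bisect import bisect_left

# The "data file": records stored in arrival order, not sorted by key.
records = [
    ("C00317", "Lee"),
    ("C00042", "Ada"),
    ("C00981", "Kim"),
    ("C00105", "Bo"),
]

# The index: (key, position) pairs kept sorted by key; the position is the pointer.
index = sorted((key, pos) for pos, (key, _) in enumerate(records))
index_keys = [key for key, _ in index]


def scan_lookup(key):
    """Sequential access: read records from the beginning until the key matches."""
    for record in records:
        if record[0] == key:
            return record
    return None


def index_lookup(key):
    """Indexed access: binary-search the sorted keys, then follow the pointer."""
    i = bisect_left(index_keys, key)
    if i < len(index_keys) and index_keys[i] == key:
        return records[index[i][1]]
    return None


def insert(key, name):
    """Each insert updates the index as well as the data: the cost of slower updates."""
    records.append((key, name))
    i = bisect_left(index_keys, key)
    index_keys.insert(i, key)
    index.insert(i, (key, len(records) - 1))


insert("C00500", "Zoe")
print(index_lookup("C00500"), scan_lookup("C00500"))
```

The lookup touches only a handful of index entries, while every insert pays for keeping the index sorted, which mirrors the benefit and drawback listed above.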

Rules of thumb for applying indexes

Use on larger tables.

Use when a relatively small percentage of the table will be accessed.

Index the primary key of each table.

Index frequently used search attributes.

Index attributes in SQL “ORDER BY” and “GROUP BY” commands.

Use indexes heavily for non-volatile databases; limit the use of indexes for volatile databases.

Avoid indexing attributes that consist of long character strings.

Issues in indexing

Indexes affect table maintenance performance.

Each time an add or delete is performed, the index must be updated along with the data.

Depending on the size of the database, these index updates can be extremely time-consuming.

Imagine the problems with having an index declared for every attribute.

Solutions:

Remove indexes prior to batch updates.

Recreate indexes after the batch update is finished.

Consider using a batch procedure to create indexes after a table has been updated, and before queries are run.
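The remove, load, recreate sequence might look like the following. It is a sketch only, assuming SQLite through Python's sqlite3 module; the table, the index name, and the generated batch are hypothetical, and whether the sequence actually saves time depends on the DBMS and the batch size.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl_order (order_id INTEGER PRIMARY KEY, customer_id TEXT)")
conn.execute("CREATE INDEX idx_order_customer ON tbl_order (customer_id)")


def index_names():
    """List the indexes currently defined on tbl_order."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'tbl_order'"
    )
    return [name for (name,) in rows]


batch = [(f"C{i % 500:05d}",) for i in range(10_000)]

conn.execute("DROP INDEX idx_order_customer")  # 1. remove the index
during = index_names()
conn.executemany("INSERT INTO tbl_order (customer_id) VALUES (?)", batch)  # 2. batch update
conn.execute("CREATE INDEX idx_order_customer ON tbl_order (customer_id)")  # 3. recreate it
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM tbl_order").fetchone()[0]
print(during, index_names(), row_count)
```

The index is rebuilt once over the finished table instead of being adjusted for each of the inserted rows.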

Improving performance with denormalization

Modify the degree of normalization.

Recognize that joins require much time when used in queries.

More joins = more time.

Combine entities with 1:1 relationship into a single entity.

Combine entities with 1:m relationship into a single entity. Usually done with brief repeating groups.

Example for denormalization

Example: A patient can have up to 4 insurance companies.

Patient is a strong entity. Insurance company is a strong entity.

Normally, the repeating group of insurance companies would be in a separate intersection entity relating a patient to one or more insurance companies.

Diagram on next page

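[Diagram: normalized design]

Patient: Patient ID (key); Patient Name, Patient Address, Patient Start Date, Patient Last Visit DateOfVisit

Patient_Insurance: Patient ID, Insurance ID (key); Insured Name, Insured Date, Coverage

Insurance: Insurance ID (key); Company Name

Relationships: Patient "Has" Patient_Insurance; Patient_Insurance "is of" Insurance.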

Insurance example - Denormalized

Patient: Patient ID (key); Patient Name, Patient Address, Patient Start Date, Patient Last Visit DateOfVisit, Insurance ID 1, Insured Name 1, Insured Date 1, Coverage 1, Insurance ID 2, Insured Name 2, Insured Date 2, Coverage 2, Insurance ID 3, Insured Name 3, Insured Date 3, Coverage 3, Insurance ID 4, Insured Name 4, Insured Date 4, Coverage 4

Insurance: Insurance ID (key); Company Name

Relationship: Patient "has" Insurance.
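The normalized and denormalized patient designs can be compared in a short sketch. It assumes SQLite through Python's sqlite3 module; table and column names loosely follow the slides, the sample rows are invented, and only the insurance ID and coverage slots are included to keep it brief.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized: the repeating group lives in the intersection table.
CREATE TABLE insurance (insurance_id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, patient_name TEXT);
CREATE TABLE patient_insurance (
    patient_id   INTEGER REFERENCES patient (patient_id),
    insurance_id INTEGER REFERENCES insurance (insurance_id),
    coverage     TEXT,
    PRIMARY KEY (patient_id, insurance_id)
);
-- Denormalized: up to four insurance slots folded into the patient row.
CREATE TABLE patient_denorm (
    patient_id INTEGER PRIMARY KEY, patient_name TEXT,
    insurance_id_1 INTEGER, coverage_1 TEXT,
    insurance_id_2 INTEGER, coverage_2 TEXT,
    insurance_id_3 INTEGER, coverage_3 TEXT,
    insurance_id_4 INTEGER, coverage_4 TEXT
);
INSERT INTO insurance VALUES (1, 'Acme Health'), (2, 'Beta Mutual');
INSERT INTO patient VALUES (7, 'Pat');
INSERT INTO patient_insurance VALUES (7, 1, 'primary'), (7, 2, 'secondary');
INSERT INTO patient_denorm
    VALUES (7, 'Pat', 1, 'primary', 2, 'secondary', NULL, NULL, NULL, NULL);
""")

# Normalized read: reaching the insurance rows requires a join.
normalized = conn.execute("""
    SELECT pi.insurance_id, pi.coverage
    FROM patient p
    JOIN patient_insurance pi ON pi.patient_id = p.patient_id
    WHERE p.patient_id = 7
    ORDER BY pi.insurance_id
""").fetchall()

# Denormalized read: one row and no join, but empty slots and a hard limit of four.
row = conn.execute("SELECT * FROM patient_denorm WHERE patient_id = 7").fetchone()
denormalized = [(row[i], row[i + 1]) for i in range(2, 10, 2) if row[i] is not None]
print(normalized, denormalized)
```

Both reads return the same two policies; the denormalized table trades the join for NULL-filled columns and a schema change if a fifth insurer is ever needed.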

Or Totally De-Normalize it

back to a spreadsheet...

Issues in denormalization

Can be risky.

Introduces potential for data redundancy.

Can result in data anomalies.

Should be documented.

This documentation must be maintained as an “audit path” to the actual implementation of the database.

The logical data model details the fully normalized database with an ERD.

The physical data model shows the denormalized database with an ERD.

Include in the documentation the reasons for denormalization.

Improving performance with derived data

Derived or calculated data is usually not included in a database.

Never included in a logical data model.

Examples of derived data include: extended price, total amount, total pay, etc.

Problems with including derived data in a database:

What happens when the underlying data is changed? How do you ensure that the derived data will also be changed?

For example, let’s say that the total of an order is kept in the database. What happens when an item quantity changes, or an item price changes? The order total, if stored, must also be changed to reflect those changes in the underlying data.

When to include derived data

Sometimes it is a good idea to include derived data in the physical database design:

Use when aggregate values are regularly retrieved.
Use when aggregate values are costly to calculate.
Permit updating only of source data.
Do not put derived rows in same table as table containing source data.

Examples of derived data frequently stored on databases:

Student class standing.
Order and invoice total.
Credit card balance.
Checking account balance.
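One common way to keep a stored order total consistent with its source rows is a trigger. The sketch below assumes SQLite through Python's sqlite3 module; the order_item and order_total tables, the trigger names, and the prices (held as whole cents) are hypothetical, and a complete design would also need a trigger for deleted items.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_item (          -- source data: the only table applications update
    order_id INTEGER, item_no INTEGER, quantity INTEGER, unit_price_cents INTEGER,
    PRIMARY KEY (order_id, item_no)
);
CREATE TABLE order_total (         -- derived data, kept apart from the source rows
    order_id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL
);
-- Recompute the stored total whenever a source row is added or changed.
CREATE TRIGGER trg_item_insert AFTER INSERT ON order_item
BEGIN
    DELETE FROM order_total WHERE order_id = NEW.order_id;
    INSERT INTO order_total
        SELECT NEW.order_id, SUM(quantity * unit_price_cents)
        FROM order_item WHERE order_id = NEW.order_id;
END;
CREATE TRIGGER trg_item_update AFTER UPDATE ON order_item
BEGIN
    DELETE FROM order_total WHERE order_id = NEW.order_id;
    INSERT INTO order_total
        SELECT NEW.order_id, SUM(quantity * unit_price_cents)
        FROM order_item WHERE order_id = NEW.order_id;
END;
""")


def stored_total(order_id):
    """Read the derived value without recalculating it."""
    row = conn.execute(
        "SELECT total_cents FROM order_total WHERE order_id = ?", (order_id,)
    ).fetchone()
    return row[0]


conn.execute("INSERT INTO order_item VALUES (1, 1, 2, 500)")
conn.execute("INSERT INTO order_item VALUES (1, 2, 1, 250)")
total_before = stored_total(1)

# Changing a quantity in the source data updates the derived total automatically.
conn.execute("UPDATE order_item SET quantity = 3 WHERE order_id = 1 AND item_no = 1")
total_after = stored_total(1)
print(total_before, total_after)
```

Queries read order_total directly, while all writes go to order_item, so the derived value cannot drift from its source.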

Organization must manage data resources

Types of data used by an organization:

Current transaction data.

Historical data for decision making.

Audit data for accounting and/or governmental regulations.

All must be designed, implemented and maintained.

Archive data for audit purposes

Not all data must be stored on a directly accessible data storage device (disk).

Examples of archived data:

Checking transactions.
Tax data.
Accounting audit trail.

Can store data on cheaper, slower, less accessible media.

Must have procedures for extracting, transforming and loading (ETL) data as necessary.

Archive database design is usually a copy of the transaction database design.
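An archiving procedure can be as simple as moving old rows into a table with the same design. This is a minimal sketch, assuming SQLite through Python's sqlite3 module; the txn and txn_archive tables, the sample rows, and the cutoff date are hypothetical, and in practice the archive would usually live on separate, cheaper storage.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE txn (
    txn_id INTEGER PRIMARY KEY, txn_date TEXT NOT NULL, amount_cents INTEGER
);
-- The archive table copies the transaction table's design.
CREATE TABLE txn_archive (
    txn_id INTEGER PRIMARY KEY, txn_date TEXT NOT NULL, amount_cents INTEGER
);
INSERT INTO txn VALUES
    (1, '2019-03-01', 1200), (2, '2020-07-15', 800), (3, '2024-02-10', 4500);
""")


def archive_before(cutoff):
    """Move rows dated before the cutoff into the archive in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO txn_archive SELECT * FROM txn WHERE txn_date < ?", (cutoff,)
        )
        moved = conn.execute("DELETE FROM txn WHERE txn_date < ?", (cutoff,)).rowcount
    return moved


moved = archive_before("2021-01-01")
live = conn.execute("SELECT COUNT(*) FROM txn").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM txn_archive").fetchone()[0]
print(moved, live, archived)
```

Dates are stored as ISO-8601 text so that string comparison matches date order; wrapping the copy and the delete in one transaction keeps a row from being lost or duplicated if the procedure fails halfway.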

Use a data warehouse

A data warehouse differs from a transaction database.

Used to support decision making.
Contains aggregated data.
Is frequently denormalized to improve performance.
Contains data in a format specific to answering queries.

Data warehouse is separate from transaction database.

A data warehouse is built from data stored in the transaction database.
Different design.
May use a data warehouse and a transaction database concurrently to answer queries.
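The step of building warehouse data from transaction data can be sketched as an aggregate table. This is a toy illustration, assuming SQLite through Python's sqlite3 module; the sale and monthly_sales tables and their rows are hypothetical, and a real warehouse would normally be a separate database loaded by an ETL process.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Transaction side: one row per sale.
CREATE TABLE sale (
    sale_id INTEGER PRIMARY KEY, sale_date TEXT, product TEXT, amount_cents INTEGER
);
INSERT INTO sale VALUES
    (1, '2024-01-05', 'truck', 1500), (2, '2024-01-20', 'truck', 1500),
    (3, '2024-01-22', 'robot', 3000), (4, '2024-02-03', 'truck', 1500);

-- Warehouse side: aggregated and shaped for one kind of question.
CREATE TABLE monthly_sales AS
    SELECT substr(sale_date, 1, 7) AS month,
           product,
           COUNT(*)          AS units,
           SUM(amount_cents) AS revenue_cents
    FROM sale
    GROUP BY substr(sale_date, 1, 7), product;
""")

summary = conn.execute(
    "SELECT month, product, units, revenue_cents FROM monthly_sales ORDER BY month, product"
).fetchall()
for line in summary:
    print(line)
```

Decision-support queries read the small summary table, while the sale table continues to serve day-to-day transactions.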