Zvi’s changes, if any, are marked in green, they are not copyrighted by the authors, and the...

Zvi’s changes, if any, are marked in Zvi’s changes, if any, are marked in green, they are not copyrighted by the green, they are not copyrighted by the

authors, and the authors are not authors, and the authors are not responsible for themresponsible for them

©Silberschatz, Korth and Sudarshan7.2Database System Concepts

!!Chapter 7: Relational Database DesignChapter 7: Relational Database Design

Our goal in this chapter is to formalize your intuition from the entity-relationship design approach. It’s actually kind of fun.

If you are into the intellectual aesthetic of what we are doing, notice that with ER, we captured notions that are common to all database applications: 1-1, 1-many, many-to-many relationships, ISA relationships, etc.

When I design a database application I draw those entities and relationships, but when I get down to the details, I use “normalization theory” (aka database design theory) to be sure I get this right.


First Normal FormFirst Normal Form

Domain is atomic if its elements are considered to be indivisible units

• Examples of non-atomic domains:

• Set of names, composite attributes

• Identification numbers like CS101 that can be broken up into parts

A relational schema R is in first normal form if the domains of all attributes of R are atomic

Non-atomic values complicate storage and encourage redundant (repeated) storage of data

• E.g. Set of accounts stored with each customer, and set of owners stored with each account

• We assume all relations are in first normal form


First Normal Form (Contd.)First Normal Form (Contd.)

Atomicity is actually a property of how the elements of the domain are used.

• E.g. Strings would normally be considered indivisible

• Suppose that students are given course numbers which are strings of the form CS0012 or EE1127

• If the first two characters are extracted to find the department, the domain of course numbers is not atomic.

• Doing so is a bad idea: leads to encoding of information in application program rather than in the database.

In practice people do it to answer many queries and that is why SQL DML allows it, but we are in relational algebra…

Story of a certain French bank….


Pitfalls in Relational Database DesignPitfalls in Relational Database Design

Relational database design requires that we find a “good” collection of relation schemas. A bad design may lead to

• Repetition of Information.

• Inability to represent certain information.

Design Goals:

• Avoid redundant data

• Ensure that relationships among attributes are represented

• Facilitate the checking of updates for violation of database integrity constraints.

04/18/23 08:39 PM ©Silberschatz, Korth and Sudarshan7.6Database System Concepts

A canonical exampleA canonical example

We first consider the relation schema R=R(E,G,S), where

• E: Employee number

• G: Grade (as in rank within a company)

• S: Salary

The users specified for us a set of semantic constraints:

• Each value of E has a single value of G associated with it.

• Each value of G has a single value of S associated with it.

For simplicity, we will sometimes refer to relations schemes by listing their attributes; thus we may write EGS instead of R above.

We will spend a lot of time discussing this example, if we understand it, we understand more than half of normalization theory needed in practice!

©Zvi M. Kedem


A sample instanceA sample instance

Consider a sample instance of EGS:

• EGS:E G S A 1 B 2 A 1 C 1

By examining the above sample instance, we will note that the design suffers from various deficiencies.

©Zvi M. Kedem


Deficiencies of the designDeficiencies of the design

Redundancy: The fact (or more technically, the constraint) that G = A implies S = 1 is stored twice in the relation.

General problems:

• Inability to store partial information: If the salary schedule is augmented by a new grade, G = D, with corresponding salary of S = 3, this fact cannot be stored in the relation, unless E has the value of null.

(But it is quite clear that in this relation E is the only possibly primary key,so we cannot have “several” tuples potentially with the value of null in this column.)

• More difficult to enforce consistent updating.

• Information may disappear: If employee leaves, we forget what salary is given to grade B.

©Zvi M. Kedem


DecompositionDecomposition

The above problems can be solved by decomposing the relation EGS into two relations:

• R1(EG)=EG (R): SELECT E, G FROM R;

• R2(GS)=GS(R): SELECT G, S FROM R;

Obtaining,• R1:

E G A B A C

• R2:G SA 1B 2C 1

©Zvi M. Kedem


Reconstructing the original relationReconstructing the original relation

How to reconstruct R from R1, R2? It is easy to see that it is enough to compute R = R1 R2 (the

natural join of R1 and R2, written using easily displayed fonts), that is:SELECT E, G, SFROM R1, R2WHERE R1.G = R2.G;

and indeed we obtain.• E G S

A 1 B 2 A 1 C 1

Will this be true of other decompositions?

©Zvi M. Kedem


A second decompositionA second decomposition

Let us try decomposing EGS into EG and ES. We obtain:

• ES:E S

1 2 1 1

• EG:

E G A B A C

The natural join gives the original relation back: EG ES = EGS.

©Zvi M. Kedem


A third decompositionA third decomposition

Let us attempt another “reasonable'' decomposition: EGS into ES and GS. We obtain:

• ES:E S 1 2 1 1

• GS:G SA 1B 2C 1

©Zvi M. Kedem


The natural join of the “small” relationsThe natural join of the “small” relations

Let us compute the natural join of ES and GS. We obtain:

• E G S A 1 C 1 B 2 A 1 C 1 A 1 C 1

This relation is not equal to the original EGS relation

©Zvi M. Kedem


Decompositions vs. reconstructionsDecompositions vs. reconstructions

By examining the 3 decompositions we observe that:

• Some decompositions allow us to reconstruct the original relation.

• Some decompositions do not allow us to reconstruct the original relation

• We want a decomposition that allows us to reconstruct the original relation (otherwise we get false facts).

©Zvi M. Kedem


Lossless join decompositionsLossless join decompositions

We say that some decomposition of a relation R into relations R1, ..., Rm, where each Ri is the projection of R on some attributes is a lossless join decomposition, if and only if:

• Each attribute of R appears in some Ri

• R is the natural join of R1, ..., Rm

We will also use the term “valid'' for lossless join

©Zvi M. Kedem


ExampleExample Consider the relation schema:

Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)

Redundancy:• Data for branch-name, branch-city, assets are repeated for each loan that a

branch makes

• Wastes space

• Complicates updating, introducing possibility of inconsistency of assets value

Null values• Cannot store information about a branch if no loans exist

• Can use null values, but they are difficult to handle.


DecompositionDecomposition

Decompose the relation schema Lending-schema into:

Branch-schema = (branch-name, branch-city,assets)

Loan-info-schema = (customer-name, loan-number, branch-name, amount)

All attributes of an original schema (R) must appear in the decomposition (R1, R2):

R = R1 R2

Lossless-join decomposition.For all possible relations r on schema R

r = R1 (r) R2 (r)

A very important property: If R = R1 R2, always:

r R1 (r) R2 (r) Therefore lossless means: you do not gain spurious tuples by joining, which seems the only reasonable way to reconstruct the original relation.


Example of Non Lossless-Join Decomposition Example of Non Lossless-Join Decomposition

Decomposition of R = (A, B)R2 = (A) R2 = (B)

A B

121

A

B

12

rA(r) B(r)

A (r) B (r)A B

1212


Goal — Devise a Theory for the FollowingGoal — Devise a Theory for the Following

Decide whether a particular relation R is in “good” form.

In the case that a relation R is not in “good” form, decompose it into a set of relations {R1, R2, ..., Rn} such that

• each relation is in good form

• the decomposition is a lossless-join decomposition

• And dependencies are preserved (more about this later)

Our theory is based on:

• functional dependencies

• multivalued dependencies (optional) Dennis doesn’t like these because he has never seen them hold in practice. But look up in the book if you like.


Functional DependenciesFunctional Dependencies

Constraints on the set of legal relations.

Require that the value for a certain set of attributes determines uniquely the value for another set of attributes.

A functional dependency is a generalization of the notion of a key.


Functional Dependencies (Cont.)Functional Dependencies (Cont.) Let R be a relation schema

R and R

The functional dependency

holds on R if and only if for any legal relations r(R), whenever any two tuples t1 and t2 of r agree on the attributes , they also agree on the attributes . That is,

t1[] = t2 [] t1[ ] = t2 [ ]

Example: Consider r(A,B) with the following instance of r.

On this instance, A B does NOT hold, but B A does hold.

1 41 53 7


An exampleAn example

Although functional dependencies are properties of a schema, and not only of the individual instances, we will practice, and check if certain FDs are satisfied by some sample instance:• A B C D E F

a1 b1 c1 d1 e1 f1a2 b1 c1 d2 e2 f2a2 b2 c3 d3 e3 f3a1 b1 c1 d1 e1 f4a1 b2 c2 d2 e4 f5a2 b3 c3 d2 e5 f6

• A C No

• AB C Yes

• E CD Yes

• D B No

• F ABC Yes

©Zvi M. Kedem


Relative power of some FDsRelative power of some FDs

An important note: A C is always at least as powerful as AB C

that is If a relation satisfies A C it must satisfy AB C

What we are really saying is that if C = f(A), then of course C = f(A,B)

If not obvious to you, then look at it this way. A C says that any two rows that agree on their A values also agree on their C values. AB C says that any two rows that agree on both their A and B values also agree on their C values. Well, if they agree on both their A and B values, then surely they agree on their A values, so if AC held, we would be done. The reverse does not hold.

©Zvi M. Kedem


Functional Dependencies (Cont.)Functional Dependencies (Cont.)

K is a superkey for relation schema R if and only if K R K is a candidate key for R if and only if

• K R, and

• for no proper subset K, R

Functional dependencies allow us to express constraints that cannot be expressed using superkeys. Consider the schema:

Loan-info-schema = (customer-name, loan-number, branch-name, amount).

We expect this set of functional dependencies to hold:

loan-number amountloan-number branch-name

but would not expect the following to hold:

loan-number customer-name


Use of Functional DependenciesUse of Functional Dependencies

We use functional dependencies to:

• test relations to see if they are legal under a given set of functional dependencies.

• If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.

• specify constraints on the set of legal relations

• We say that F holds on R if all legal relations on R satisfy the set of functional dependencies F.

Note: A specific instance of a relation schema may satisfy a functional dependency even if the functional dependency does not hold on all legal instances. For example, a specific instance of Loan-schema may, by chance, satisfy loan-number customer-name.


Trivial FDsTrivial FDs

An FD X Y, where X and Y are sets of attributes is trivial

if and only if

YX.

(Such FD gives no constraints, as it is always satisfied.)

Example: AB A is trivial.

©Zvi M. Kedem


Decomposition and union of some FDsDecomposition and union of some FDs

An FD X A1 A2 ... Am, where Ai's are individual attributes

is equivalent to

the set of FDs: X A1, X A2, ..., X Am.

Example: ABC DE is equivalent to ABC D and ABC E.

©Zvi M. Kedem


Logical implications of FDsLogical implications of FDs

It will be important to us to determine if a given set of FDs forces other FDs to be true (necessarily implies that additional FDs must hold).

Consider again the EGS relation. Which FDs are satisfied?

E G, G S, E S are all true in the real world.

If the real world tells you only:

E G and G S

Can you deduce on your own, without understanding the application thatE S?

©Zvi M. Kedem


Logical implications of FDs (cont.)Logical implications of FDs (cont.)

Yes, by simple logical argument: transitivity.

Take any (set of) tuples that are equal on E. Then given E G we know that they are equal on G. Then given G S we know that they are equal on S. So we have shown that E S must hold

We say that E G, G S logically imply E S and we write E G, G S |= E S.

©Zvi M. Kedem


Logical implications of FDs (cont.)Logical implications of FDs (cont.)

If the real world tells you only:

• E G and E S,

Can you deduce on your own, without understanding the application that

• G S

No, because of a counterexample:

• E G S A 1 A 2

This relation satisfies E G and E S, but violates G S.

©Zvi M. Kedem


Closure of a Set of Functional Closure of a Set of Functional DependenciesDependencies

Given a set F set of functional dependencies, there are certain other functional dependencies that are logically implied by F.

• E.g. If A B and B C, then we can infer that A C

The set of all functional dependencies logically implied by F is the closure of F.

We denote the closure of F by F+.


Closure of Attribute SetsClosure of Attribute Sets

Given a set of attributes define the closure of under F (denoted by +) as the set of attributes that are functionally determined by under F:

is in F+ +

Algorithm to compute +, the closure of under F (guilt by association approach)

result := ;while (changes to result) do

for each in F dobegin

if result then result := result end


Example of Attribute Set ClosureExample of Attribute Set Closure R = (A, B, C, G, H, I) F = {A B

A C CG HCG IB H}

(AG)+

1. result = AG

2. result = ABCG (A C and A B)

3. result = ABCGH (CG H and CG AGBC)

4. result = ABCGHI (CG I and CG AGBCH)

Is AG a candidate key? 1. Is AG a super key?

1. Does AG R? Or equivalently: (AG)+ = R?

2. Is any subset of AG a superkey?

1. Does A R? or equivalently: A+ = R?

2. Does G R? or equivalently: G+ = R?


!!Example Example (order of FDs slightly changed)(order of FDs slightly changed)

Let R = ABCDEFGHIJ Given FDs:

• F BG

• A DE

• B D

• C F

• I J Then

ABC+ = ABCDEFG

and ABC Z (for a set of attributes Z), if and only if Z ABCDEFG

Therefore:• ABC FC is true

• ABC IC is false

©Zvi M. Kedem


EGS againEGS again

We can easily show that

• {E G, G S}+ contains E S

• {E G, E S}+ does not contain G S

So what we did in an ad-hoc manner previously can be converted to an algorithmic procedure.

©Zvi M. Kedem


Uses of Attribute ClosureUses of Attribute Closure

There are several uses of the attribute closure algorithm:

Testing for superkey:

• To test if is a superkey, we compute +, and check if + contains all attributes of R.

Testing functional dependencies

• To check if a functional dependency holds (or, in other words, is in F+), just check if +.

• That is, we compute + by using attribute closure, and then check if it contains .

• Is a simple and cheap test, and very useful

Computing closure of F

• For each R, we find the closure +, and for each S +, we output a functional dependency S.


Keys of relationsKeys of relations

The notion of an FD allows us to formally define keys.

Given R, satisfying a set of FDs, a set of attributes X of R is a key, if and only if:

• X+ = R.

• For every Y X such that Y X, we have Y+R.

Note that if R does not satisfy any (nontrivial) FDs, then R is the only key of R.

In our example E G, G S, E S

E was the only key of EGS.

©Zvi M. Kedem


Review of EGSReview of EGS

Let us review the relation EGS. The “new relations” were:

• EG, with the key E and a non-trivial FD E G.

• GS, with the key G and a non-trivial FD G S.

• ES, with the key E and a non-trivial FD E S.

Each of those relations was “good,”

• I.e., there were no redundancies in any of them, each nontrivial FD contained a key on the left side

It turns out tha EG, GS is better than EG, ES as a decomposition. Can you see why?

©Zvi M. Kedem


!!Third Normal FormThird Normal Form

A relation schema R is in third normal form (3NF) if for all:

in F+

at least one of the following holds: is trivial (i.e., )

is a superkey for R, that is check whether + = R

• Each attribute A in – is contained in a candidate key for R.

(NOTE: each attribute may be in a different candidate key)

We now proceed to show how to obtain third normal form with a lossless join decomposition.


Canonical CoverCanonical Cover

Sets of functional dependencies may have redundant dependencies that can be inferred from the others• Eg: A C is redundant in: {A B, B C, A C}

Parts of a functional dependency may be redundant• E.g. on RHS: {A B, B C, A CD} can be simplified to

{A B, B C, A D} Note, this is really a redundant dependency issue, as we can say that in {A B, B C, A C, A D} , A C is redundant

• E.g. on LHS: {A B, B C, AC D} can be simplified to {A B, B C, A D}

This is NOT redundant dependency issue, but unnecessarily weak description of dependencies

Intuitively, a canonical cover of F is a “minimal” set of functional dependencies equivalent to F, with no redundant dependencies or having redundant parts of dependencies


Extraneous AttributesExtraneous Attributes Consider a set F of functional dependencies and the functional

dependency in F.• Attribute A is extraneous in if A

and F logically implies (F – { }) {( – A) }.

• Attribute A is extraneous in if A and the set of functional dependencies (F – { }) { ( – A)} logically implies F.

Note: implication in the opposite direction is trivial in each of the cases above, since a “stronger” functional dependency always implies a weaker one

Example: Given F = {A B, AB C }• B is extraneous in AB C because A C is implied by F

(equivalently C is in A+)

Example: Given F = {A C, AB CD}• C is extraneous in AB CD since AB C can be inferred even

after deleting C


Testing if an Attribute is ExtraneousTesting if an Attribute is Extraneous

Consider a set F of functional dependencies and the functional dependency in F.

• To test if attribute A (left side) is extraneous in

1. compute the attribute closure ( {} – A)+ using the dependencies in F including ,

2. check that ( {} – A)+ contains ; if it does, A is extraneous

• To test if attribute A (right side) is extraneous in

1. compute + using only the dependencies in F’ = (F – { }) { ( – A)},

2. check that + contains A; if it does, A is extraneous


Canonical Cover Algorithm (IMPORTANT)Canonical Cover Algorithm (IMPORTANT)

A canonical cover for F is a set of dependencies Fc such that

• F logically implies all dependencies in Fc, and

• Fc logically implies all dependencies in F, and

• No functional dependency in Fc contains an extraneous attribute, and

• Each left side of functional dependency in Fc is unique.

To compute a canonical cover for F:repeat

Use the union rule to replace any dependencies in F 1 1 and 1 2 with 1 1 2

Find a functional dependency with an extraneous attribute either in or in

If an extraneous attribute is found, delete it from until F does not change

Note: Union rule need not be applied until the end. Dennis prefers delaying the union because it makes it easier to spot redundant relations. But the book uses this approach.


Canonical Cover Algorithm (Dennis Preference)Canonical Cover Algorithm (Dennis Preference)

To compute a canonical cover for F:

Convert functional dependencies so right hand side of every functional dependency has just one attribute

repeatRemove redundant functional dependenciesFind extraneous attributes from the left side of functional

dependencies and remove themuntil F does not change

Apply the union rule.

You may use either algorithm. They are close but not the same.


The EmToPrHoSkLoRo relationThe EmToPrHoSkLoRo relation

The relation deals with employees who use tools on projects and work a certain number of hours per week. An employee may work in various locations and has a variety of skills. All employees having a certain skill and working in a certain location meet in a specified room once a week.

The attributes of the relation are:

• Em: Employee

• To: Tool

• Pr: Project

• Ho: Hours per week

• Sk: Skill

• Lo: Location

• Ro: Room for meeting

©Zvi M. Kedem


The FDs of the relationThe FDs of the relation

The relation satisfies the following FDs:

• Each employee uses a single tool: Em To

• Each employee works on a single project: Em Pr.

• Each tool can be used on a single project only: To Pr.

• An employee uses each tool for the same number of hours each week: EmTo Ho.

• All the employees working in a location having a certain skill always meet in the same room: SkLo Ro.

• Each room is in one location only: Ro Lo.

©Zvi M. Kedem


Run the algorithmRun the algorithm

No RHS can be combined, so we check whether there are any redundant attributes.

We start with FD 1.

• We check whether we can remove To from Em To. This is possible if we can derive Em To using

Em Pr

To Pr

EmTo Ho

SkLo Ro

Ro Lo

Computing the closure of Em using the above FDs gives us only EmPr, so the attribute To must be kept.

©Zvi M. Kedem



• We check whether we can remove Pr (from FD2). This is possible if we can derive Em Pr using

Em To

To Pr

EmTo Ho

SkLo Ro

Ro Lo

Computing the closure of Em using the above FDs gives us EmToPrHo, so FD 2 can be removed..

©Zvi M. Kedem©Zvi M. Kedem



We now have

1. Em To

2. To Pr

3. EmTo Ho

4. SkLo Ro

5. Ro Lo

The next one is 3.

• We check if Em can be removed from the left. This is possible if we can derive To Ho using all the FDs. Computing the closure of To using the FDs gives ToPr, and therefore Em cannot be removed

• We check if To can be removed. This is possible if we can derive Em Ho using all the FDs. Computing the closure of Em using the FDs gives EmToPrHo, and therefore To can be removed

©Zvi M. Kedem



We now attempt to remove an attribute from the LHS of 4, and an entire FD, but neither is possible. We try again, but we cannot change anything.

Therefore we are done To form the tables, simply combine the left hand and right hand

sides (after combining tables FDs having a common left hand side via the union rule). Then remove any relation whose attributes are subsets of some other relation.

In our example: EmToHo ToPr SkLoRo If any of these tables contains a superkey of all attributes, we

have a lossless join decomposition. That wasn’t the case so we add EmSkLo. (Check that attribute cover gives all attributes and this is minimal.)

©Zvi M. Kedem


Why is it necessary to store a global keyWhy is it necessary to store a global key

Why is it necessary to store the global key?

Consider the relation: LnFnVaSa:

• Ln: Last Name

• Fn First Name

• Va Vacation days per year

• Sa Salary

The functional dependencies are:

• Ln Va

• Fn Sa

The key is

LnFn

The relation is not in 3NF

©Zvi M. Kedem


Why is it necessary to store . . . (cont.)Why is it necessary to store . . . (cont.)

Our algorithm (without the key being included) will produce the decomposition

• LnVa

• FnSa

This is not a lossless-join decomposition

• In fact we do not know who the employees are (what are the valid pairs of LnFn)

So we decompose

• LnVa

• FnSa

• LnFn

©Zvi M. Kedem


How about EGSHow about EGS

Applying to algorithm to

1. E G

2. G S

3. E S

The only thing we can do is to find that in FD 3, attribute S is redundant, and therefore we get as canonical cover:

1. E G

2. G S

©Zvi M. Kedem


Decomposition of EGSDecomposition of EGS

Applying the algorithm to EGS, we get our desired decomposition:

• EG

• GS

©Zvi M. Kedem


Back to SQL DDLBack to SQL DDL

How are we going to express in SQL what we have learned?

We need to express:

• keys

• functional dependencies

Expressing keys is very easy, we use the PRIMARY KEY and UNIQUE keywords

Expressing functional dependencies is possible also by means of a CHECK condition

• What we need to say for the relation SkLoRo is that each tuple satisfies the following condition

There are no tuples in the relation with the same value of Ro and different values of Lo

©Zvi M. Kedem


Back to SQL DDL (cont.)Back to SQL DDL (cont.)

CREATE TABLE SkLoRo(Sk …,Lo …,Ro…,PRIMARY KEY (Sk,Lo),CHECK (NOT EXISTS SELECT *

FROM SkLoRo AS copyWHERE (SkLoRo.Ro = copy.Ro

AND NOT SkLoRo.Lo = copy.Lo)); Clever but I’ve never seen this in practice.

©Zvi M. Kedem


!!Design GoalsDesign Goals

Goal for a relational database design is:

• Third Normal Form.

• Lossless join.

• Dependency preservation.


Denormalization for PerformanceDenormalization for Performance

May want to use non-normalized schema for performance

E.g. displaying customer-name along with account-number and balance requires join of account with depositor

Alternative 1: Use denormalized relation containing attributes of account as well as depositor with all above attributes

• faster lookup

• Extra space and extra execution time for updates

• extra coding work for programmer and possibility of error in extra code

Alternative 2: use a materialized view defined as account depositor

• Benefits and drawbacks same as above, except no extra coding work for programmer and avoids possible errors


Other Design IssuesOther Design Issues

Some aspects of database design are not caught by normalization

Sometimes performance dictates combining several rows into one:

Instead of earnings(company-id, year, amount), use

• earnings-2000, earnings-2001, earnings-2002, etc., all on the schema (company-id, earnings).

• Above are in third normal form (in fact a slightly stronger optional form called BCNF), but make querying across years difficult and needs new table each year

• company-year(company-id, earnings-2000, earnings-2001, earnings-2002)

• Also in BCNF, but also makes querying across years difficult and requires new attribute each year.

• Is an example of a crosstab, where values for one attribute become column names

• Used in spreadsheets, and in data analysis tools


Vertical PartitioningVertical Partitioning

Can you see any reason why

Acctbal(acctid, bal), acctaddress(acctid, street, city, zip)might be better thanacct(acctid, bal, street, city, zip)?

Both are normalized

Zvi’s changes, if any, are marked in green, they are not copyrighted by the authors, and the...

Documents

Transcript of Zvi’s changes, if any, are marked in green, they are not copyrighted by the authors, and the...