Midterm Review II. Redundancy. –Information may be repeated unnecessarily in several tuples....

  • date post

    18-Dec-2015

Anomalies

The Movie relation:

title          year  length  filmType  studioName  starName
Star Wars      1977  124     color     Fox         Carrie Fisher
Star Wars      1977  124     color     Fox         Mark Hamill
Star Wars      1977  124     color     Fox         Harrison Ford
Mighty Ducks   1991  104     color     Disney      Emilio Estevez
Wayne’s World  1992  95      color     Paramount   Dana Carvey
Wayne’s World  1992  95      color     Paramount   Mike Meyers

• Redundancy.
  – Information may be repeated unnecessarily in several tuples.
  – E.g. length and filmType.

• Update anomalies.
  – We may change information in one tuple but leave it unchanged in other tuples.
  – E.g. we could change the length of Star Wars to 125 in the first tuple and forget to do the same in the second and third tuples.

• Deletion anomalies.
  – If a set of values becomes empty, we may lose other information as a side effect.
  – E.g. if we delete Emilio Estevez, we lose all the information about Mighty Ducks.

Decomposing Relations - Example

Movie1 relation:

title          year  length  filmType  studioName
Star Wars      1977  124     color     Fox
Mighty Ducks   1991  104     color     Disney
Wayne’s World  1992  95      color     Paramount

Movie2 relation:

title          year  starName
Star Wars      1977  Carrie Fisher
Star Wars      1977  Mark Hamill
Star Wars      1977  Harrison Ford
Mighty Ducks   1991  Emilio Estevez
Wayne’s World  1992  Dana Carvey
Wayne’s World  1992  Mike Meyers

• No true redundancy!

• The update anomaly disappeared. If we change the length of a movie, it is done only once.

• The deletion anomaly disappeared. If we delete all the stars of a movie from Movie2, we still have the other information about that movie in Movie1.
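The decomposition can be checked with a minimal Python sketch, assuming relations are modeled as sets of tuples. The helper name `project` and the variable names are illustrative, not part of the slides. It projects Movie onto the two schemas and verifies that joining them on (title, year) gives back exactly the original tuples.

```python
# Movie relation: (title, year, length, filmType, studioName, starName)
ATTRS = ("title", "year", "length", "filmType", "studioName", "starName")
movie = {
    ("Star Wars", 1977, 124, "color", "Fox", "Carrie Fisher"),
    ("Star Wars", 1977, 124, "color", "Fox", "Mark Hamill"),
    ("Star Wars", 1977, 124, "color", "Fox", "Harrison Ford"),
    ("Mighty Ducks", 1991, 104, "color", "Disney", "Emilio Estevez"),
    ("Wayne's World", 1992, 95, "color", "Paramount", "Dana Carvey"),
    ("Wayne's World", 1992, 95, "color", "Paramount", "Mike Meyers"),
}

def project(rows, attrs, keep):
    """Set projection: keep only the named columns; duplicates collapse."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(r[i] for i in idx) for r in rows}

movie1 = project(movie, ATTRS, ("title", "year", "length", "filmType", "studioName"))
movie2 = project(movie, ATTRS, ("title", "year", "starName"))

# Natural join on the shared attributes (title, year).
rejoined = {
    m1 + (m2[2],)
    for m1 in movie1
    for m2 in movie2
    if m1[:2] == m2[:2]
}

print(len(movie1), len(movie2))  # 3 6
print(rejoined == movie)         # True
```

Movie1 stores each movie's length once, which is why the update anomaly goes away.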

Boyce-Codd Normal Form

• The goal of decomposition is to replace a relation by several that do not exhibit anomalies.

• There is a simple condition under which the anomalies can be guaranteed not to exist.

• This condition is called Boyce-Codd Normal Form, or BCNF.

• A relation R is in BCNF if:
  – Whenever there is a nontrivial dependency A1 A2 … An → B1 B2 … Bm for R, it must be the case that the left-hand side {A1, A2, …, An} is a superkey for R.

Boyce-Codd Normal Form - Example

• Relation Movie in the previous figure is not in BCNF.

• Consider the FD, which violates BCNF: title year → length filmType studioName

• Unfortunately, the left side of the above dependency is not a superkey.
  – In particular, we know that title and year do not functionally determine starName.

• On the other hand, Movie1 is in BCNF.
  – The only key is {title, year}, and
  – title year → length filmType studioName is the only (non-trivial) FD that holds in the relation.

Decomposition into BCNF

• The decomposition strategy is:
  – Find a non-trivial FD A1 A2 … An → B1 B2 … Bm that violates BCNF, i.e. A1 A2 … An is not a superkey.
  – Decompose the relation schema into two overlapping relation schemas:
    • One is all the attributes involved in the violating dependency, and
    • the other is the left side and all the other attributes not involved in the dependency.

• By repeatedly choosing suitable decompositions, we can break any relation schema into a collection of smaller schemas in BCNF.

• The data in the original relation is represented faithfully by the data in the relations that result from the decomposition.
  – I.e. we can reconstruct the original relation exactly from the decomposed relations.

Boyce-Codd Normal Form - Example

Consider relation schema:

Movies(title, year, studioName, president, presAddr)

and functional dependencies:

title year → studioName

studioName → president

president → presAddr

The last two FDs violate BCNF. Why?

Compute {title, year}+, {studioName}+, and {president}+ and see whether each one gives all the attributes of the relation.

If it does not, the corresponding FD violates BCNF, and we need to break up the relation.
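The closure test can be sketched in Python as follows. This is a minimal illustration, assuming FDs are stored as (left side, right side) pairs of frozensets; the function name `closure` is our own choice.

```python
def closure(attrs, fds):
    """Return attrs+ under fds, where fds is a list of (lhs, rhs) frozenset pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left side is already in the closure, add the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

R = {"title", "year", "studioName", "president", "presAddr"}
FDS = [
    (frozenset({"title", "year"}), frozenset({"studioName"})),
    (frozenset({"studioName"}), frozenset({"president"})),
    (frozenset({"president"}), frozenset({"presAddr"})),
]

for lhs, rhs in FDS:
    verdict = "superkey" if closure(lhs, FDS) == R else "violates BCNF"
    print(sorted(lhs), "->", sorted(rhs), ":", verdict)
```

Only {title, year}+ reaches all five attributes, so the other two FDs have left sides that are not superkeys.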

Boyce-Codd Normal Form – Example

Let’s decompose starting with:

studioName → president

Let’s add to the right-hand side any other attributes in the closure of studioName (optional “rule of thumb”).

1. X = {studioName}. Use studioName → president.

2. X = {studioName, president}. Use president → presAddr.

3. X = {studioName}+ = {studioName, president, presAddr}

Boyce-Codd Normal Form – Example

From the closure we get:

studioName → president presAddr

We decompose the relation schema into the following two schemas:

Movies1(studioName, president, presAddr)

Movies2(title, year, studioName)

Movies2 is in BCNF, because we can’t find a “bad” FD holding there.

What about Movies1?

The following dependency violates BCNF:

president → presAddr

Why is it bad to leave the Movies1 table as is?

If many studios share the same president, then we would have redundancy, because presAddr is repeated for all those studios.

Boyce-Codd Normal Form – Example

We must decompose Movies1, using the FD:

president → presAddr

The resulting relation schemas, both in BCNF, are:

Movies11(president, presAddr)

Movies12(studioName, president)

In general, we must keep applying the decomposition rule as many times as needed, until all our relations are in BCNF.

So, finally, we get Movies11, Movies12, and Movies2.
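The repeated decomposition can be sketched in Python. This is a hedged illustration, not the slides' exact procedure: the names `find_violation` and `bcnf_decompose` are hypothetical, the search tries every attribute subset (exponential, fine only for small schemas), and it may pick violating FDs in a different order than the slides. For Movies it should still end with the same three schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """attrs+ under fds (list of (lhs, rhs) frozenset pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def find_violation(schema, fds):
    """Return (X, X+ within schema) for an X that determines something new
    but is not a superkey of schema; None if schema is in BCNF."""
    attrs = sorted(schema)
    for size in range(1, len(attrs)):
        for combo in combinations(attrs, size):
            x = set(combo)
            x_plus = closure(x, fds) & schema
            if x_plus != x and x_plus != schema:
                return x, x_plus
    return None

def bcnf_decompose(schema, fds):
    schema = set(schema)
    violation = find_violation(schema, fds)
    if violation is None:
        return [schema]
    x, x_plus = violation
    left = x_plus                    # attributes involved in the violating FD
    right = x | (schema - x_plus)    # left side plus all remaining attributes
    return bcnf_decompose(left, fds) + bcnf_decompose(right, fds)

FDS = [
    (frozenset({"title", "year"}), frozenset({"studioName"})),
    (frozenset({"studioName"}), frozenset({"president"})),
    (frozenset({"president"}), frozenset({"presAddr"})),
]
schemas = bcnf_decompose(
    {"title", "year", "studioName", "president", "presAddr"}, FDS
)
print(sorted(sorted(s) for s in schemas))
```

Taking the full closure as the right-hand side is the optional "rule of thumb" mentioned earlier.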

Finding FDs for the decomposed relations

• When we decompose a relation, we need to check that the resulting schemas are in BCNF.

• We can’t tell whether a relation is in BCNF unless we can determine the FDs that hold for that relation.

• Suppose S is one of the resulting relations in a decomposition of R. To find the FDs that hold in S:

• Consider each subset X of attributes of S.

• Compute X+ using the FDs on R.

• At the end, throw out the attributes of R that aren’t in S.

• Then, for each attribute B such that:
  • B is an attribute of S, and
  • B is in X+,

we have that the functional dependency X → B holds in S.
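The procedure can be sketched in Python. This is a minimal illustration under our own naming (`project_fds` is hypothetical); it lists every non-trivial X → B found this way, so the output is not a minimal basis.

```python
from itertools import combinations

def closure(attrs, fds):
    """attrs+ under fds (list of (lhs, rhs) frozenset pairs)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def project_fds(s_attrs, fds):
    """FDs X -> B holding in S, derived from the FDs on the original relation R."""
    s_attrs = set(s_attrs)
    found = []
    for size in range(1, len(s_attrs) + 1):
        for combo in combinations(sorted(s_attrs), size):
            x = set(combo)
            # X+ with the attributes of R that are not in S thrown out
            x_plus = closure(x, fds) & s_attrs
            for b in sorted(x_plus - x):   # skip trivial FDs where B is in X
                found.append((frozenset(x), b))
    return found

FDS = [
    (frozenset({"title", "year"}), frozenset({"studioName"})),
    (frozenset({"studioName"}), frozenset({"president"})),
    (frozenset({"president"}), frozenset({"presAddr"})),
]
movies1_fds = project_fds({"studioName", "president", "presAddr"}, FDS)
for lhs, b in movies1_fds:
    print(sorted(lhs), "->", b)
```

For Movies1 the list includes president → presAddr, the FD that forces the second decomposition step.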

Relational Algebra Operations

Operations of relational algebra fall into four broad classes:

1. The usual set operations:
  – union
  – intersection
  – difference

2. Operations that remove parts of a relation:
  – selection eliminates some rows (tuples)
  – projection eliminates some columns

3. Operations that combine the tuples of two relations:
  – Cartesian product pairs the tuples of two relations in all possible ways
  – join selectively pairs tuples from two relations

4. An operation called “renaming.”

Conditions for Set Operations on Relations

1. R and S must have schemas with identical sets of attributes.

2. Before applying the operations, the columns of R and S must be ordered so that the order of attributes is the same for both relations.

Projection

Produces from a relation R a new relation that has only some of R’s columns.

π_{A1, A2, …, An}(R) is a relation that has only the columns for attributes A1, A2, …, An of R.

Example:

Compute the expression π_{title, year, length}(Movies) on the table:

title          year  length  filmType  studioName  producerC#
Star Wars      1977  124     color     Fox         12345
Mighty Ducks   1991  104     color     Disney      67890
Wayne’s World  1992  95      color     Paramount   99999

Example (Continued)

Resulting relation:

title          year  length
Star Wars      1977  124
Mighty Ducks   1991  104
Wayne’s World  1992  95

What about π_{filmType}(Movies)?
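A small Python sketch of set projection, assuming relations are sets of tuples (the helper `project` is illustrative). Under set semantics, projecting onto filmType leaves a single tuple, because all three movies share the value color.

```python
ATTRS = ("title", "year", "length", "filmType", "studioName", "producerC#")
movies = {
    ("Star Wars", 1977, 124, "color", "Fox", 12345),
    ("Mighty Ducks", 1991, 104, "color", "Disney", 67890),
    ("Wayne's World", 1992, 95, "color", "Paramount", 99999),
}

def project(rows, attrs, keep):
    """Keep only the named columns; a set drops duplicate result tuples."""
    idx = [attrs.index(a) for a in keep]
    return {tuple(r[i] for i in idx) for r in rows}

three_cols = project(movies, ATTRS, ["title", "year", "length"])
film_types = project(movies, ATTRS, ["filmType"])

print(len(three_cols))  # 3
print(film_types)       # {('color',)}
```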

Selection

Selection, applied to a relation R, produces a new relation with a subset of R’s tuples.

The tuples in the result are those that satisfy some condition C.

Denote it with σ_C(R).

The schema for the resulting relation is the same as R’s schema.

Example: The expression σ_{length ≥ 100}(Movie) is:

title         year  length  filmType  studioName  producerC#
Star Wars     1977  124     color     Fox         12345
Mighty Ducks  1991  104     color     Disney      67890

Cartesian Product

The Cartesian product of two relations R and S is the set of pairs that can be formed by choosing the first element of the pair to be any element of R and the second an element of S. This is denoted as R × S.

Example:

R:
A  B
1  2
3  4

S:
B  C   D
2  5   6
4  7   8
9  10  11

R × S:
A  R.B  S.B  C   D
1  2    2    5   6
1  2    4    7   8
1  2    9    10  11
3  4    2    5   6
3  4    4    7   8
3  4    9    10  11
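The same product can be reproduced with a short Python sketch using `itertools.product`; modeling tuples as Python tuples is our assumption.

```python
from itertools import product

R = [(1, 2), (3, 4)]                      # schema (A, B)
S = [(2, 5, 6), (4, 7, 8), (9, 10, 11)]   # schema (B, C, D)

# Pair every tuple of R with every tuple of S; schema (A, R.B, S.B, C, D).
r_times_s = [r + s for r, s in product(R, S)]

for row in r_times_s:
    print(row)
```

The result has |R| × |S| = 6 tuples, in the same order as the table.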

Natural Join

Denoted as R ⋈ S.

Let A1, A2, …, An be the attributes in both the schema of R and the schema of S.

Then a tuple r from R and a tuple s from S are successfully paired if and only if r and s agree on each of the attributes A1, A2, …, An.

Example: The natural join of the relations R and S from the previous example is:

A  B  C  D
1  2  5  6
3  4  7  8
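A minimal Python sketch of the pairing rule, assuming each relation is given as a list of tuples plus a tuple of attribute names. The function `natural_join` is illustrative, not a library call.

```python
def natural_join(r_rows, r_attrs, s_rows, s_attrs):
    """Pair tuples that agree on all common attributes; keep each common column once."""
    common = [a for a in r_attrs if a in s_attrs]
    s_extra = [a for a in s_attrs if a not in common]
    out_attrs = list(r_attrs) + s_extra
    out_rows = []
    for r in r_rows:
        for s in s_rows:
            if all(r[r_attrs.index(a)] == s[s_attrs.index(a)] for a in common):
                out_rows.append(r + tuple(s[s_attrs.index(a)] for a in s_extra))
    return out_attrs, out_rows

R = [(1, 2), (3, 4)]
S = [(2, 5, 6), (4, 7, 8), (9, 10, 11)]
attrs, rows = natural_join(R, ("A", "B"), S, ("B", "C", "D"))
print(attrs)  # ['A', 'B', 'C', 'D']
print(rows)   # [(1, 2, 5, 6), (3, 4, 7, 8)]
```

The S tuple (9, 10, 11) pairs with nothing, so it does not appear in the result.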

Combining Operations to Form Queries

“What are the titles and years of movies made by Fox that are at least 100 minutes long?”

One way to compute the answer to this query is:

• Select those Movie tuples that have length ≥ 100.

• Select those Movie tuples that have studioName = ‘Fox’.

• Compute the intersection of the first and second steps.

• Project the relation from the third step onto attributes title and year.
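The four steps can be traced in Python on the three-movie table used in the projection example. This is a sketch under the assumption that relations are sets of tuples; the variable names are ours.

```python
# Movie tuples: (title, year, length, filmType, studioName, producerC#)
movie = {
    ("Star Wars", 1977, 124, "color", "Fox", 12345),
    ("Mighty Ducks", 1991, 104, "color", "Disney", 67890),
    ("Wayne's World", 1992, 95, "color", "Paramount", 99999),
}

long_movies = {t for t in movie if t[2] >= 100}     # selection on length
fox_movies = {t for t in movie if t[4] == "Fox"}    # selection on studioName
both = long_movies & fox_movies                     # set intersection
answer = {(t[0], t[1]) for t in both}               # projection onto title, year

print(answer)  # {('Star Wars', 1977)}
```

A single selection with the condition length ≥ 100 AND studioName = 'Fox' followed by the projection would give the same answer.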

Another Example

Consider two relations Movie1 and Movie2, with schemas:

Movie1(title, year, length, filmType, studioName)

Movie2(title, year, starName)

Suppose we want to know:

“Find the stars of the movies that are at least 100 minutes long.”

First, we join the two relations: Movie1 ⋈ Movie2.

Second, we select the movies with length at least 100 min.

Then we project onto starName.

Relational Algebra on Bags

• A bag is like a set, but an element may appear more than once.

• Example: {1,2,1,3} is a bag. {1,2,3} is also a bag that happens to be a set.

• Bags also resemble lists, but order in a bag is unimportant.
  – Example:
    • {1,2,1} = {1,1,2} as bags, but
    • [1,2,1] != [1,1,2] as lists.

Operations on Bags

• Selection applies to each tuple, so its effect on bags is like its effect on sets.

• Projection also applies to each tuple, but as a bag operator, we do not eliminate duplicates.

• Products and joins are done on each pair of tuples, so duplicates in bags have no effect on how we operate.

Bag Union, Intersection, Difference

• An element appears in the union of two bags as many times as the sum of the number of times it appears in each bag.

• Example: {1,2,1} ∪ {1,1,2,3,1} = {1,1,1,1,1,2,2,3}

• An element appears in the intersection of two bags as many times as the minimum of the number of times it appears in either.

• Example: {1,2,1} ∩ {1,2,3} = {1,2}.

• An element appears in difference A – B of bags as many times as it appears in A, minus the number of times it appears in B.

– But never less than 0 times.

• Example: {1,2,1} – {1,2,3} = {1}.
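Python's `collections.Counter` behaves like a bag, so the three examples can be checked directly. Using `Counter` as the bag model is our choice for illustration.

```python
from collections import Counter

a = Counter([1, 2, 1])
b = Counter([1, 1, 2, 3, 1])
c = Counter([1, 2, 3])

union = a + b   # counts are added
inter = a & c   # minimum of the two counts
diff = a - c    # counts are subtracted, never going below zero

print(sorted(union.elements()))  # [1, 1, 1, 1, 1, 2, 2, 3]
print(sorted(inter.elements()))  # [1, 2]
print(sorted(diff.elements()))   # [1]
```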

The Extended Algebra

1. δ = eliminate duplicates from bags.

2. τ = sort tuples.

3. Extended projection: arithmetic, duplication of columns.

4. γ = grouping and aggregation.

5. OUTERJOIN: avoids “dangling tuples” = tuples that do not join with anything.

Example: Outerjoin

R:
A  B
1  2
4  5

S:
B  C
2  3
6  7

(1,2) joins with (2,3), but the other two tuples are dangling.

R OUTERJOIN S:
A     B  C
1     2  3
4     5  NULL
NULL  6  7
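A minimal Python sketch of this outerjoin, written for these two schemas only and using `None` to stand for NULL (both are assumptions for illustration).

```python
R = [(1, 2), (4, 5)]   # schema (A, B)
S = [(2, 3), (6, 7)]   # schema (B, C)

result = []
matched_s = set()
for a, b in R:
    partners = [s for s in S if s[0] == b]
    if partners:
        for s in partners:
            result.append((a, b, s[1]))
            matched_s.add(s)
    else:
        result.append((a, b, None))        # dangling R tuple, padded with NULL
for s in S:
    if s not in matched_s:
        result.append((None, s[0], s[1]))  # dangling S tuple, padded with NULL

print(result)  # [(1, 2, 3), (4, 5, None), (None, 6, 7)]
```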

Aggregation Operators

• They apply to entire columns of a table and produce a single result.

• The most important examples:
  – SUM
  – AVG
  – COUNT
  – MIN
  – MAX

Example: Aggregation

R:
A  B
1  3
3  4
3  2

SUM(A) = 7

COUNT(A) = 3

MAX(B) = 4

MIN(B) = 2

AVG(B) = 3

Grouping Operator

• R1 := γ_L(R2).

• L is a list of elements that are either:
  1. Grouping attributes.
  2. AGG(A), where AGG is one of the aggregation operators and A is an attribute.

Semantics

• Group R2 according to all the grouping attributes on list L.
  – That is, form one group for each distinct list of values for those attributes in R2.

• Within each group, compute AGG(A) for each aggregation on list L.

• Result has grouping attributes and aggregations as attributes.
  – One tuple for each list of values for the grouping attributes and their group’s aggregations.

Example: Grouping/Aggregation

R:
A  B  C
1  2  3
4  5  6
1  2  5

γ_{A, B, AVG(C)}(R) = ??

First, group R:
A  B  C
1  2  3
1  2  5
4  5  6

Then, average C within groups:
A  B  AVG(C)
1  2  4
4  5  6
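The two steps (group, then aggregate) can be sketched in Python with a dictionary keyed by the grouping attributes. This is an illustration only; the variable names are ours.

```python
from collections import defaultdict

R = [(1, 2, 3), (4, 5, 6), (1, 2, 5)]   # schema (A, B, C)

# Step 1: one group per distinct (A, B) pair, collecting the C values.
groups = defaultdict(list)
for a, b, c in R:
    groups[(a, b)].append(c)

# Step 2: compute AVG(C) within each group.
result = [(a, b, sum(cs) / len(cs)) for (a, b), cs in groups.items()]

print(result)  # [(1, 2, 4.0), (4, 5, 6.0)]
```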

Example: Grouping/Aggregation

• StarsIn(title, year, starName)

• We want, for each star who has appeared in at least three movies, the earliest year in which he or she appeared.

• First we group, using starName as a grouping attribute.

• Then, we have to compute the MIN(year) for each group.

• However, we also need to compute the COUNT(title) aggregate for each group, in order to filter out those stars with fewer than three movies.

σ_{ctTitle ≥ 3}[γ_{starName, MIN(year) → minYear, COUNT(title) → ctTitle}(StarsIn)]

Aggregations in SQL

• SUM, AVG, COUNT, MIN, and MAX can be applied to a column in a SELECT clause to produce that aggregation on the column.

• Find the average length of movies from Disney.

SELECT AVG(length)

FROM Movie

WHERE studioName = 'Disney';

Eliminating Duplicates in an Aggregation

• DISTINCT inside an aggregation causes duplicates to be eliminated before the aggregation.

• Example: Find the number of different producers for Disney movies:

SELECT COUNT(DISTINCT producerc)

FROM Movie

WHERE studioname = 'Disney';

This is not the same as:

SELECT DISTINCT COUNT(producerc)

FROM Movie

WHERE studioname = 'Disney';
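The difference between the two queries can be seen on a tiny made-up table, using Python's built-in `sqlite3` module. The rows below are invented for illustration and the table has only the columns the queries need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (title TEXT, studioName TEXT, producerC INTEGER)")
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    # Hypothetical rows: producer 1 made two Disney movies, producer 2 made one.
    [("A", "Disney", 1), ("B", "Disney", 1), ("C", "Disney", 2)],
)

q1 = "SELECT COUNT(DISTINCT producerC) FROM Movie WHERE studioName = 'Disney'"
q2 = "SELECT DISTINCT COUNT(producerC) FROM Movie WHERE studioName = 'Disney'"

distinct_producers = conn.execute(q1).fetchone()[0]
counted_rows = conn.execute(q2).fetchone()[0]
print(distinct_producers, counted_rows)  # 2 3
```

In the second query, DISTINCT is applied to the one-row result of COUNT, so it removes nothing.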

NULL’s Ignored in Aggregation

• NULL never contributes to a sum, average, or count, and can never be the minimum or maximum of a column.

select SUM(networth)

from movieexec;

Example: Effect of NULL’s

The number of movies from Disney:

SELECT count(*)

FROM Movie

WHERE studioName = 'Disney';

The number of movies from Disney with a known length:

SELECT count(length)

FROM Movie

WHERE studioName = 'Disney';
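A quick check with Python's `sqlite3` module on invented rows (one Disney movie has a NULL length) shows the two counts diverging and AVG skipping the NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movie (title TEXT, length INTEGER, studioName TEXT)")
conn.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    # Hypothetical rows; None is stored as SQL NULL.
    [("A", 100, "Disney"), ("B", None, "Disney"), ("C", 90, "Fox")],
)

def one(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]

count_all = one("SELECT COUNT(*) FROM Movie WHERE studioName = 'Disney'")
count_len = one("SELECT COUNT(length) FROM Movie WHERE studioName = 'Disney'")
avg_len = one("SELECT AVG(length) FROM Movie WHERE studioName = 'Disney'")

print(count_all, count_len, avg_len)  # 2 1 100.0
```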

Grouping

• We may follow a SELECT-FROM-WHERE expression by GROUP BY and a list of attributes.

• The relation that results from the SELECT-FROM-WHERE is grouped according to the values of all those attributes, and any aggregation is applied only within each group.

• From Movie relation, find the average length for each studio:

SELECT studioName, AVG(length)

FROM Movie

GROUP BY studioName;

Example: Grouping

• Find, for each producer, the total length of the films produced.

SELECT name, SUM(length)

FROM Movie, MovieExec

WHERE producerc = cert

GROUP BY name;

Compute those tuples first (the result of FROM-WHERE), then group by name.

Restriction on SELECT Lists With Aggregation

• If any aggregation is used, then each element of the SELECT list must be either:

1. Aggregated, or

2. An attribute on the GROUP BY list.

Illegal Query Example

• We might think we could find the shortest movie of Disney as:

SELECT title, MIN(length)

FROM Movie

WHERE studioName = 'Disney';

• But this query is illegal in SQL. Why?

• Because title is neither aggregated nor on the GROUP BY list.

• We should do instead:

SELECT title, length

FROM Movie

WHERE studioName = 'Disney' AND length =

(SELECT MIN(length)

FROM Movie

WHERE studioName = 'Disney');

HAVING Clauses

• HAVING <condition> may follow a GROUP BY clause.

• If so, the condition applies to each group, and groups not satisfying the condition are eliminated.

• These conditions may refer to attributes that make sense within a group; i.e., they are either:
  1. Grouping attributes, or
  2. Aggregated attributes.

Example: HAVING

• Suppose that we didn’t wish to include all the producers in our table of aggregated movie lengths.

• Suppose for instance we want those producers who have at least one movie before 1973.

SELECT name, SUM(length)

FROM MovieExec, Movie

WHERE producerc = cert

GROUP BY name

HAVING MIN(year) < 1973;
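The query can be tried on a small invented database with Python's `sqlite3` module. The executives, movies, and reduced column lists below are assumptions made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MovieExec (name TEXT, cert INTEGER);
CREATE TABLE Movie (title TEXT, year INTEGER, length INTEGER, producerC INTEGER);
-- Hypothetical data: only 'Exec One' has a movie from before 1973.
INSERT INTO MovieExec VALUES ('Exec One', 1), ('Exec Two', 2);
INSERT INTO Movie VALUES
  ('Old Film', 1970, 100, 1),
  ('New Film', 1990, 120, 1),
  ('Recent Film', 1995, 90, 2);
""")

rows = conn.execute("""
    SELECT name, SUM(length)
    FROM MovieExec, Movie
    WHERE producerC = cert
    GROUP BY name
    HAVING MIN(year) < 1973
""").fetchall()

print(rows)  # [('Exec One', 220)]
```

The sum for 'Exec One' still covers both of that producer's movies: HAVING removes whole groups, not individual tuples.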