Database Normalization
Normalization
Constraints: There are 2 types of constraints:
1) Constraints that define the permitted values the attributes can have.
2) Constraints that define the relationships between the attributes.
Functional Dependency (FD): It is denoted by →. The FD X → Y means that X uniquely determines Y, where X and Y are simple or composite attributes. The dependency from X to Y is said to hold if the following is true:
If T1 and T2 are 2 tuples with the same value of X, then the value of Y must also be the same in T1 and T2, i.e., the relationship between X and Y is independent of the other attributes present in the table. In simple words, for a given X there is always a single value of Y.
For example,
(Street, City) → Pin Code: A Street of a City has a unique pin code; however, the reverse need not be true.
(ISBN, Title) → Author: Given an ISBN number, one can find the Title and name of the Author of the book.
Implications and Covers: The application can call for some functional dependencies which may imply additional functional dependencies.
If F is the set of FDs, then we define the closure of F, denoted as F+, to be the set of all FDs that are implied by F.
Written By Arnav Mukhopadhyay EMAIL: [email protected]
To find F+, given F, we have to find the inference rules for the FDs which are implied by F. The inference rules are very important for good database design for the following reasons:
Given F, one may like to determine whether X → Y is implied or not.
For computing the closure F+ of F.
Given F, we may want to remove those FDs which are redundant in F. An FD is redundant if it is implied by the other FDs in F.
While designing a database schema:
Find a minimal cover G of F.
A minimal cover G does not contain any redundant FDs, yet G+ is the same as F+. By computing a minimal cover G of F, we can ensure that when the DBMS enforces the constraints in G, it automatically enforces all the constraints implied by F.
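The redundancy test described above can be sketched in Python. This is only one step of finding a minimal cover (a full minimal cover would also split right-hand sides and shrink left-hand sides), and the FDs A → B, B → C, A → C as well as the helper names closure and remove_redundant are hypothetical, chosen purely for illustration.

```python
def closure(attrs, fds):
    """Return every attribute determined by `attrs` under the FDs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the whole left-hand side is known, the right-hand side follows.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def remove_redundant(fds):
    """Drop each FD that is already implied by the FDs that remain."""
    kept = list(fds)
    for fd in list(fds):
        lhs, rhs = fd
        others = [f for f in kept if f != fd]
        if set(rhs) <= closure(lhs, others):
            kept = others
    return kept


# Hypothetical set F: A -> B, B -> C, and A -> C (implied by the first two).
F = [(("A",), ("B",)), (("B",), ("C",)), (("A",), ("C",))]
G = remove_redundant(F)
print(G)  # A -> C has been dropped
```

Since A → C follows from A → B and B → C by transitivity, G keeps only the first two FDs while G+ stays equal to F+.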
Inference rules for FDs:
The inference rules are known as Armstrong’s Axioms, after Armstrong, who published them. These properties are as given below:
1. Reflexive property: X → Y is TRUE, if Y is a SUBSET of X.
2. Augmentation property: If X → Y is TRUE, then XZ → YZ is also TRUE.
3. Transitivity property: If X → Y and Y → Z are TRUE, then X → Z is implied.
4. Union property: If X → Y and X → Z are TRUE, then X → YZ is also TRUE. This property indicates that if the Right Hand Side of an FD contains many attributes, then an FD exists for each of them.
5. Decomposition property: If X → Y is implied and Z is a SUBSET of Y, then X → Z is implied. This property is the reverse of the Union property.
6. Pseudo transitivity property: If X → Y and WY → Z are given, then XW → Z is TRUE.
Example: Consider a college having a table STUDY with COURSE, TEACHER, ROOM NO and DEPARTMENT as attributes.
STUDY (COURSE, TEACHER, ROOMNO, DEPT). Here we identify a few FDs, namely:
Course → Teacher
Teacher → Department
Course → Room Number
Additional FDs can be derived from the above by using the inference properties:
By Reflexivity: (Course, Teacher) → Teacher
By Augmentation: (Course, Room Number) → (Teacher, Room Number)
By Transitivity: Course → Department
By Union: Course → (Teacher, Room Number)
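The derivations above can be checked mechanically by computing the closure of a set of attributes. The following is a minimal Python sketch under the three FDs stated for STUDY; the function names closure and implies are illustrative assumptions, not part of any library.

```python
def closure(attrs, fds):
    """Return every attribute determined by `attrs` under the FDs in `fds`."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """X -> Y is implied by `fds` exactly when Y lies inside the closure of X."""
    return set(rhs) <= closure(lhs, fds)


# The three FDs identified for the STUDY relation.
study_fds = [
    ({"Course"}, {"Teacher"}),
    ({"Teacher"}, {"Department"}),
    ({"Course"}, {"Room Number"}),
]

course_closure = closure({"Course"}, study_fds)
print(sorted(course_closure))
# ['Course', 'Department', 'Room Number', 'Teacher']

print(implies(study_fds, {"Course"}, {"Department"}))               # True (transitivity)
print(implies(study_fds, {"Course"}, {"Teacher", "Room Number"}))   # True (union)
print(implies(study_fds, {"Teacher"}, {"Course"}))                  # False
```

Because the closure of Course contains every attribute of STUDY, Course alone determines a whole row of the table.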
The main axioms of Armstrong are sound and complete, and are defined as:
1. Soundness property: If X → Y can be inferred from F using the above axioms, then X → Y will be TRUE in any relation in which F holds.
2. Completeness property: If X → Y cannot be inferred from F using the above axioms, then there is some relation in which F holds but X → Y is not TRUE.
Normalization: Consider the table shown below:
Order No.  Order Date  Item Lines
                       Item Code  Quantity  Price / Unit
1456       260289      3687       52        50.40
                       4627       38        60.20
                       3214       20        17.50
1886       040389      4629       45        20.25
                       4627       30        60.20
1788       040489      4627       40        60.20
Table 1: An UnNormalized Relation
In this relation, an order no. includes many items. The attribute “Item Lines” is not a single attribute but is composed of many attributes.
Besides this, the number of ITEM LINES is variable. This form is not suitable for storage as a file in a computer. Further, retrieval of data based on a component of a composite attribute is difficult. For example, to find out how many items with a specified item code are ordered, one must break up the composite attribute first before attempting a search. Thus a relation with a format such as in the above table is not allowed. It is said to be unnormalized.
To normalize this relation, a composite attribute is converted to individual attributes. The normalization step consists of first identifying the fields within a composite attribute as individual attributes. After doing this, the common attributes of a composite attribute are duplicated as many times as there are lines in the composite attribute. The normalized relation corresponding to the relation given in Table 1 above is shown in Table 2 below:
Table 2: Normalized form of the Relation given in Table 1
Order No. Order Date Item Code Quantity Price / Unit
1456 260289 3687 52 50.40
1456 260289 4627 38 60.20
1456 260289 3214 20 17.50
1886 040389 4629 45 20.25
1886 040389 4627 30 60.20
1788 040489 4627 40 60.20
The relation shown in Table 2 is said to be in First Normal Form, abbreviated as 1NF. This form is also called a flat file. There are no composite attributes, and every attribute is single and describes only one property.
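The conversion from Table 1 to Table 2 can be sketched in Python as follows. The nested representation of Table 1 and the names unnormalized and to_1nf are assumptions made for this illustration; the data values are those of Table 1.

```python
# Table 1 as nested data: each order carries a variable-length list of item lines.
unnormalized = [
    {"order_no": 1456, "order_date": "260289",
     "item_lines": [(3687, 52, 50.40), (4627, 38, 60.20), (3214, 20, 17.50)]},
    {"order_no": 1886, "order_date": "040389",
     "item_lines": [(4629, 45, 20.25), (4627, 30, 60.20)]},
    {"order_no": 1788, "order_date": "040489",
     "item_lines": [(4627, 40, 60.20)]},
]


def to_1nf(orders):
    """Repeat the common attributes once for every line of the composite attribute."""
    rows = []
    for order in orders:
        for item_code, quantity, price in order["item_lines"]:
            rows.append((order["order_no"], order["order_date"],
                         item_code, quantity, price))
    return rows


table2 = to_1nf(unnormalized)
for row in table2:
    print(row)
```

Each printed row corresponds to one line of Table 2, so a search by item code becomes a simple scan over a single column.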
Converting a relation to 1NF is the first essential step in normalization. There are successively higher normal forms known as 2NF, 3NF, BCNF, 4NF and 5NF. Each form is an improvement over the earlier form. In other words, 2NF is an improvement over 1NF, 3NF is an improvement over 2NF, and so on. The relations in a higher normal form are a SUBSET of those in a lower normal form, as shown in Figure 1.
Figure 1: Illustration of successive normal forms of a relation. Each higher normal form is nested inside the lower ones: 1NF ⊃ 2NF ⊃ 3NF ⊃ BCNF ⊃ 4NF ⊃ 5NF.
The higher order normalization steps are based on 3 important concepts:
1) Dependence among attributes in a relation.
2) Identification of an attribute or a set of attributes as the key of a relation.
3) Multivalued dependency between attributes.
Functional Dependency:
Remember that, there is no fool-proof algorithmic method to identify functional dependency.
Let X and Y be 2 attributes of a relation. Given a value of X, if there exists only one value of Y corresponding to it, then Y is said to be functionally dependent on X. This is indicated by the notation:
X → Y
For example, given the value of item code, there is only one value of item name for it. Thus item name is functionally dependent on item code. This is shown as:
Item Code → Item Name
Similarly in Table 2, given an Order Number, the date of the order is known. Thus:
Order No. → Order Date
Functional dependency may be based on a composite attribute. For example, if we write:
X, Z → Y
It means that there is only one value of Y corresponding to given values of X, Z. In other words, Y is functionally dependent on the composite X, Z. In Table 2, for example, Order No. and Item Code together determine Quantity and Price. Thus,
Order No., Item Code → Quantity, Price
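Whether a stated dependency is at least consistent with the stored rows can be tested with a few lines of Python. This is a sketch only: the function name fd_holds and the column names are assumptions, and a passing check does not prove a dependency, since a dependency is a property of the application and the rows can only contradict it.

```python
def fd_holds(rows, lhs, rhs):
    """True if rows that agree on `lhs` never disagree on `rhs`."""
    seen = {}
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        # Remember the first Y seen for this X; any later mismatch breaks the FD.
        if seen.setdefault(x, y) != y:
            return False
    return True


columns = ("order_no", "order_date", "item_code", "quantity", "price")
data = [
    (1456, "260289", 3687, 52, 50.40),
    (1456, "260289", 4627, 38, 60.20),
    (1456, "260289", 3214, 20, 17.50),
    (1886, "040389", 4629, 45, 20.25),
    (1886, "040389", 4627, 30, 60.20),
    (1788, "040489", 4627, 40, 60.20),
]
table2 = [dict(zip(columns, r)) for r in data]

print(fd_holds(table2, ["order_no"], ["order_date"]))               # True
print(fd_holds(table2, ["item_code"], ["price"]))                   # True
print(fd_holds(table2, ["order_no", "item_code"], ["quantity"]))    # True
print(fd_holds(table2, ["order_no"], ["item_code"]))                # False
```

The last check fails because Order No. 1456 appears with three different item codes, so Order No. alone does not determine Item Code.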
Example: Consider the relation:
Student (Roll No., Name, Address, Dept., Year of Study)
In this relation, Name is functionally dependent on Roll No. In fact, given the value of Roll No., the values of all the other attributes can be uniquely determined. Name and Department are not functionally dependent, because given the name of a student one cannot find his department uniquely. This is due to the fact that there may be more than one student with the same name. Name in this case is not a key. Department and Year of Study are not functionally dependent, as Year of Study pertains to a student whereas Department is an independent attribute. The functional dependency in this relation is shown in Figure 2 as a dependency diagram. Such dependency diagrams are very useful in normalization.
Relation Key: Given a relation, if the value of an attribute X uniquely determines the values of all other attributes in a row, then X is said to be the key of the relation. Sometimes more than one attribute is needed to uniquely determine the other attributes in a relation row. In that case, such a set of attributes is the key. In Table 2, Order No. and Item Code together determine Order Date, Quantity and Price. Thus the key is formed by the combination (Order No., Item Code). Similarly, in the relation Supplies (Vendor Code, Item Code, Quantity supplied, Date of Supply, Price/Unit), Vendor Code and Item Code together form the key. This dependency is shown in the dependency diagram of Figure 3:
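As a rough check on the key of Table 2, one can test whether a set of attributes identifies every stored row. This Python sketch uses a hypothetical helper identifies_rows; it can rule out a proposed key, but uniqueness in a small sample does not by itself establish that a set of attributes is the key.

```python
def identifies_rows(rows, attrs):
    """True if no two rows share the same values for `attrs`."""
    values = [tuple(row[a] for a in attrs) for row in rows]
    return len(values) == len(set(values))


# Order No. and Item Code columns of Table 2, in row order.
table2 = [
    {"order_no": 1456, "item_code": 3687},
    {"order_no": 1456, "item_code": 4627},
    {"order_no": 1456, "item_code": 3214},
    {"order_no": 1886, "item_code": 4629},
    {"order_no": 1886, "item_code": 4627},
    {"order_no": 1788, "item_code": 4627},
]

print(identifies_rows(table2, ["order_no"]))                # False
print(identifies_rows(table2, ["item_code"]))               # False
print(identifies_rows(table2, ["order_no", "item_code"]))   # True
```

Neither attribute alone distinguishes the rows, while the pair does, which agrees with the composite key (Order No., Item Code).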
Figure 2: Dependency diagram for the relation “Student”. (Roll No. → Name, Address, Department, Year of Study)
Observe that in Figure 3, the fact that Vendor Code and Item Code together form the composite key is clearly shown by enclosing them together in a rectangle.
Figure 3: Dependency diagram for the relation “Supplies”. ((Vendor Code, Item Code) → Quantity Supplied, Date of Supply, Price/Unit)
Why do we Normalize a Relation?
Relations are normalized so that when relations in a database are altered during the lifetime of the database, we do not lose information or introduce inconsistencies. The types of alterations normally needed for relations are:
1) Insertions of new data values to a relation. This should be possible without being forced to leave blank fields for some attributes.
2) Deletions of a tuple, namely, a row of a relation. This should be possible without losing vital information unknowingly.
3) Updating or changing a value of an attribute in a tuple. This should be possible without exhaustively searching all the tuples in the relation.
Ideal relations after normalization should have the following properties so that the problems mentioned above do not occur for relations in the (ideal) normalized form:
1) No data value should be duplicated in different rows unnecessarily.
2) A value must be specified (and required) for every attribute in a row.
3) Each relation should be self-contained. In other words, if a row from a relation is deleted, important information should not be accidentally lost.
4) When a row is added to a relation, other relations in the database should not be affected.
5) A value of an attribute in a tuple may be changed independent of other tuples in the relation and other relations.
The idea of normalizing relations to higher and higher normal forms is to attain the goals of having a set of ideal relations meeting the above criteria.
Second Normal Form (2NF) Relation
A relation is said to be in 2NF if:
1) It is already in 1NF form.
2) Non-key attributes are functionally dependent on Key attributes.
3) If the key has more than 1 attribute, then no non-key attribute should be functionally dependent upon only a part of the key.
Example: Consider Table 2, as it is in 1NF form.
Key = (Order No., Item Code)
The dependency diagram for attributes of this relation is shown in Figure 4.
The non-key attribute Price/Unit is functionally dependent on Item Code, which is a part of the relation key.
Also, the non-key attribute Order Date is functionally dependent on Order No., which is a part of the relation key.
As these non-key attributes depend only upon a part of the key, the relation in Table 2 is not in 2NF.
To obtain the 2NF form of Table 2, we split it into the relations shown in Table 3:
The relation “Orders” has Order No. as key. The relation “Order Details” has the composite key Order No. and Item Code. The relation “Prices” has Item Code as key. In all three relations, the non-key attributes are functionally dependent on the whole key.
Observe that by transforming to 2NF relations the repetition of Order Date in Table 2 has been removed.
Further, if an order for an item is cancelled, the price of an item is not lost.
Table 3: Splitting of Relation given in Table 2 into 2NF Relations.
Orders
Order No. Order Date
1456 260289
1886 040389
1788 040489
Order Details
Order No. Item Code Quantity
1456 3687 52
1456 4627 38
1456 3214 20
1886 4629 45
1886 4627 30
1788 4627 40
Prices
Item Code Price / Unit
3687 50.40
4627 60.20
3214 17.50
4629 20.25
Figure 4: Dependency diagram for the relation given in Table 2. (Order No. → Order Date; Item Code → Price/Unit; (Order No., Item Code) → Quantity)
If Order No. 1886 for Item Code 4629 is cancelled in Table 2, then the fourth row will be removed and the price of the item will be lost.
In Table 3, only the fourth row of the Order Details relation will be removed. The item price is present in the Prices relation of Table 3, which will not be touched, and hence the price of Item 4629 is not lost. The Date of Order is also not lost, as it is in the Orders relation of Table 3.
These relations in 2NF form meet all the “ideal” conditions specified. Observe that the 3 relations obtained are self-contained. There is no duplication of data within a relation.
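The split into Orders, Order Details and Prices, and the effect of cancelling an order line, can be sketched in Python. The helper project and the variable names are assumptions for illustration; the rows are those of Table 2.

```python
columns = ("order_no", "order_date", "item_code", "quantity", "price")
table2 = [
    (1456, "260289", 3687, 52, 50.40),
    (1456, "260289", 4627, 38, 60.20),
    (1456, "260289", 3214, 20, 17.50),
    (1886, "040389", 4629, 45, 20.25),
    (1886, "040389", 4627, 30, 60.20),
    (1788, "040489", 4627, 40, 60.20),
]


def project(rows, columns, keep):
    """Keep only the `keep` columns and drop duplicate rows, preserving order."""
    idx = [columns.index(c) for c in keep]
    out = []
    for row in rows:
        projected = tuple(row[i] for i in idx)
        if projected not in out:
            out.append(projected)
    return out


orders = project(table2, columns, ("order_no", "order_date"))
order_details = project(table2, columns, ("order_no", "item_code", "quantity"))
prices = project(table2, columns, ("item_code", "price"))

# Cancel Order No. 1886 for Item Code 4629.
# In the 1NF table the only row that mentions item 4629 disappears.
table2_after = [r for r in table2 if (r[0], r[2]) != (1886, 4629)]
price_lost_in_1nf = not any(r[2] == 4629 for r in table2_after)

# In the 2NF relations only Order Details changes; Prices is untouched.
order_details = [r for r in order_details if r[:2] != (1886, 4629)]

print(len(orders), len(order_details), len(prices))  # 3 5 4
print(price_lost_in_1nf)                             # True
print(dict(prices)[4629])                            # 20.25
```

The Order Date now appears once per order instead of once per item line, and the price of item 4629 survives the cancellation.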
Third Normal Form (3NF)
3NF normalization is needed when a non-key attribute in a relation is functionally dependent on another non-key attribute rather than only on the key. If 2 non-key attributes are functionally dependent, then there will be unnecessary duplication of data.
Consider the relation given in Table 4.
Table 4: A 2NF Form Relation
Roll No. Name Department Year Hostel Name
1784 Raman Physics 1 Ganga
1648 Krishnan Chemistry 1 Ganga
1768 Gopalan Mathematics 2 Kaveri
1848 Raja Botany 2 Kaveri
1682 Maya Geology 3 Krishna
1485 Singh Zoology 4 Godavari
Here, Roll No. is the Key and all other attributes are functionally dependent on it. Thus the relation is in 2NF.
If it is known that in the college all 1st Year students are accommodated in Ganga Hostel, all 2nd Year students in Kaveri, all 3rd Year students in Krishna, and all 4th Year students in Godavari, then the non-key attribute Hostel Name is dependent on the non-key attribute Year. This dependency is shown in Figure 5.
Observe that given the Year of a student, his Hostel is known and vice versa.
The dependency of Hostel on Year leads to duplication of data, as is evident from Table 4. If it is decided to ask all 1st Year students to move to Kaveri Hostel and all 2nd Year students to Ganga Hostel, this change has to be made in many places in Table 4. Also, when a student’s Year of study changes, the change in his Hostel must also be noted in Table 4. This is undesirable.
A table is said to be in 3NF if it is in 2NF, and no non-key attribute is functionally dependent on any other non-key attribute.
Therefore, Table 4 is not in 3NF.
To transform it into 3NF, we should introduce another relation, which includes the functionally related non-key attributes. This is shown in Table 5.
Figure 5: Dependency diagram for the relation given in Table 4. (Roll No. → Name, Department, Year, Hostel Name; Year → Hostel Name)
It should be stressed again that dependency between attributes is a semantic property and has to be stated in the problem specification. In this example, the dependency between Year and Hostel is clearly stated.
If the Hostel allocated to students is independent of their Year in college, then Table 4 is already in 3NF.
Example: The relation Employee is given below and its dependency diagram is shown in the Figure 6:
Employee (Employee Code, Employee Name, Dept., Salary, Project No., Termination Date of project)
Roll No. Name Department Year
1784 Raman Physics 1
1648 Krishnan Chemistry 1
1768 Gopalan Mathematics 2
1848 Raja Botany 2
1682 Maya Geology 3
1485 Singh Zoology 4
Year Hostel Name
1 Ganga
2 Kaveri
3 Krishna
4 Godavari
Table 5: Conversion of Table 4 into 3NF Relations.
As can be seen from Figure 6, the Termination Date of the project is dependent on the Project No. Thus the relation is not in 3NF. The 3NF relations are:
Employee (Employee Code, Employee Name, Dept., Salary, Project No.)
Project (Project No., Termination Date)
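The 3NF test applied to the Employee relation can be sketched in Python. The sketch assumes a single key and FDs written as (left side, right side) pairs; the function names non_key_dependencies and split_3nf are hypothetical.

```python
def non_key_dependencies(fds, key):
    """Return the FDs in which non-key attributes depend on non-key attributes."""
    key = set(key)
    return [(lhs, rhs) for lhs, rhs in fds
            if not (set(lhs) & key) and (set(rhs) - key)]


def split_3nf(attributes, fds, key):
    """Move each offending dependency into a relation of its own."""
    violations = non_key_dependencies(fds, key)
    moved = {a for _, rhs in violations for a in rhs}
    main = [a for a in attributes if a not in moved]
    extra = [list(lhs) + list(rhs) for lhs, rhs in violations]
    return main, extra


attributes = ["Employee Code", "Employee Name", "Dept.", "Salary",
              "Project No.", "Termination Date"]
key = ("Employee Code",)
employee_fds = [
    (("Employee Code",), ("Employee Name", "Dept.", "Salary", "Project No.")),
    (("Project No.",), ("Termination Date",)),
]

violations = non_key_dependencies(employee_fds, key)
main, extra = split_3nf(attributes, employee_fds, key)
print(violations)  # [(('Project No.',), ('Termination Date',))]
print(main)        # attributes left in Employee
print(extra)       # [['Project No.', 'Termination Date']]
```

Project No. stays in the Employee relation so that the two relations can still be linked to each other.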
Figure 6: Dependency diagram of Employee Relation. (Employee Code → Employee Name, Department, Salary, Project No.; Project No. → Termination Date)
Boyce-Codd Normal Form (BCNF)
Assume that a relation has more than 1 possible key. Assume further that the composite keys have a common attribute. If an attribute of a composite key is dependent on an attribute of the other composite key, a normalization called BCNF is needed.
Consider an example, the relation Professor:
Professor (Professor Code, Department, Head of Department, Percent Time)
It is assumed that:
1) A Professor can work in more than 1 department.
2) The Percentage of the time he spends in each department is given.
3) Each department has only 1 Head of the Department.
The relationship diagram is given in Figure 7. Table 6 gives the relation attributes. The 2 possible composite keys are Professor Code and Department, or Professor Code and Head of Department. Observe that Department as well as Head of Department are not non-key attributes. Each is a part of a composite key.
Table 6: Normalization of Relation – “Professor”
Professor Code Department Head of Department Percent Time
P1 Physics Ghosh 50
P1 Mathematics Krishnan 50
P2 Chemistry Rao 25
P2 Physics Ghosh 75
P3 Mathematics Krishnan 100
Figure 7: Dependency diagram of Professor Relation. (First diagram: key (Professor Code, Department) with Head of Department and Percent Time, and Department → Head of Department. Second diagram: key (Professor Code, Head of Department) with Department and Percent Time.)
The relation given in Table 6 is in 3NF. Observe, however, that the names of Department and Head of Department are duplicated. Further, if Professor P2 resigns, rows 3 and 4 are deleted. We then lose the information that Rao is the Head of Department of Chemistry.
The normalization of the relation is done by creating a new relation for Department and Head of Department, and deleting Head of Department from Professor Relation. The normalized relations are shown in Table 7:
Table 7: Normalized Professor Relation in BCNF
(a)
Professor Code Department Percent Time
P1 Physics 50
P1 Mathematics 50
P2 Chemistry 25
P2 Physics 75
P3 Mathematics 100
(b)
Department Head of Department
Physics Ghosh
Mathematics Krishnan
Chemistry Rao
The dependency diagram is shown in Figure 8.
Figure 8: Dependency diagram of Professor Relation. ((Professor Code, Department) → Percent Time; Department → Head of Department)
The dependency diagram gives the important clue to this normalization step, as is clear from Figures 7 and 8.
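That the two BCNF relations of Table 7 still carry all the information of Table 6 can be checked by joining them on Department. This Python sketch stores Table 7(b) as a dictionary, which assumes one head per department as stated in the example; the variable and function names are illustrative.

```python
# Table 7(a): (Professor Code, Department, Percent Time)
professor_time = [
    ("P1", "Physics", 50),
    ("P1", "Mathematics", 50),
    ("P2", "Chemistry", 25),
    ("P2", "Physics", 75),
    ("P3", "Mathematics", 100),
]

# Table 7(b): Department -> Head of Department
department_head = {"Physics": "Ghosh", "Mathematics": "Krishnan", "Chemistry": "Rao"}

# Table 6, the original relation.
table6 = [
    ("P1", "Physics", "Ghosh", 50),
    ("P1", "Mathematics", "Krishnan", 50),
    ("P2", "Chemistry", "Rao", 25),
    ("P2", "Physics", "Ghosh", 75),
    ("P3", "Mathematics", "Krishnan", 100),
]


def join_on_department(times, heads):
    """Rebuild the original rows by looking up each department's head."""
    return [(prof, dept, heads[dept], pct) for prof, dept, pct in times]


rebuilt = join_on_department(professor_time, department_head)
print(rebuilt == table6)  # True

# Professor P2 resigns: only rows of Table 7(a) are deleted.
remaining = [row for row in professor_time if row[0] != "P2"]
print(department_head["Chemistry"])  # Rao
```

After P2's rows are removed, the fact that Rao heads Chemistry is still stored, which was not the case in Table 6.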
Fourth and Fifth Normal Form (4NF and 5NF)
When attributes in a relation have multivalued dependencies, further normalization to 4NF and 5NF is required.
Example: Consider a vendor supplying many items to many projects in an organization. The following are the assumptions:
1) A Vendor is capable of supplying many Items.
2) A Project uses many Items.
3) A Vendor supplies for many Projects.
4) An Item may be supplied by many Vendors.
Table 8 gives the relation for this problem and Figure 9, the dependency diagram(s).
Table 8: Vendor-Supply-Project Relation
Vendor Code Item Code Project No.
V1 I1 P1
V1 I2 P1
V1 I1 P3
V1 I2 P3
V2 I2 P3
V2 I3 P1
V3 I1 P1
V3 I1 P2
The relation in Table 8 has a number of problems. For example:
If vendor V1 has to supply to project P2, but the item is not yet decided, then a row with a blank for Item Code has to be introduced.
The information about Item I1 is stored twice for Vendor V3.
Observe that the relation given in Table 8 is in 3NF and also in BCNF. It still has the problems mentioned above. The problem is reduced by expressing this relation as 2 relations in 4NF.
A relation is in 4NF if it has no more than one independent multivalued dependency with a functional dependency.
Table 8 can be expressed as the two 4NF relations given in Table 9. The fact that vendors are capable of supplying certain items and that they are assigned to supply for some projects is independently specified in the 4NF relations.
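The two problems noted for Table 8 can be illustrated in Python by projecting its rows onto the vendor-item and vendor-project pairs. This is a sketch only: the projections are computed mechanically from the rows of Table 8, and the helper name project is an assumption.

```python
# Rows of Table 8: (Vendor Code, Item Code, Project No.)
table8 = [
    ("V1", "I1", "P1"), ("V1", "I2", "P1"), ("V1", "I1", "P3"), ("V1", "I2", "P3"),
    ("V2", "I2", "P3"), ("V2", "I3", "P1"), ("V3", "I1", "P1"), ("V3", "I1", "P2"),
]


def project(rows, positions):
    """Distinct values of the chosen columns, in sorted order."""
    return sorted({tuple(row[i] for i in positions) for row in rows})


vendor_item = project(table8, (0, 1))      # which vendor can supply which item
vendor_project = project(table8, (0, 2))   # which vendor supplies to which project

# Table 8 records that V3 supplies I1 once per project; the projection records it once.
times_in_table8 = sum(1 for row in table8 if row[:2] == ("V3", "I1"))
times_in_projection = vendor_item.count(("V3", "I1"))
print(times_in_table8, times_in_projection)  # 2 1

# V1 can be assigned to project P2 before any item is decided, with no blank field.
vendor_project.append(("V1", "P2"))
print(vendor_item)
```

The vendor-item pairs computed here are the five rows of Table 9(a).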
Figure 9: Dependency Diagram of Vendor-Supply-Project relation (V = Vendor, I = Item, P = Project; the links in the diagram indicate multivalued dependency).
Table 9: Vendor-Supply-Project Relations in 4NF.
(a) Vendor Supply
Vendor Code Item Code
V1 I1
V1 I2
V2 I2
V2 I3
V3 I1
(b) Vendor Project
Vendor Code Project No.
V1 P1
V1 P3
V2 P1
V3 P1
V3 P2
These relations still have a problem. Even though Vendor V1’s capability to supply items and his allotment to supply for specified Projects are known, he may not actually be supplying an item to a project, as the project may not need it. We thus need another relation, which specifies this. This is called 5NF. The 5NF relations are the relations in Table 9(a) and (b) together with the relation given in Table 10.
Table 10: 5NF Additional Relation
Project No. Item Code
P1 I1
P1 I2
P2 I1
P3 I1
P3 I3
Table 11: Summary of Normalization Steps.
Input Relation | Transformation | Output Relation
All relations | Eliminate variable length records. Remove multi-attribute lines in the table. | 1NF
1NF | Remove dependency of non-key attributes on part of a multi-attribute key. | 2NF
2NF | Remove dependency of non-key attributes on other non-key attributes. | 3NF
3NF | Remove dependency of an attribute of a multi-attribute key on an attribute of another (overlapping) multi-attribute key. | BCNF
BCNF | Remove more than one independent multivalued dependency from the relation by splitting the relation. | 4NF
4NF | Add one relation relating attributes with multivalued dependency to the two relations with multivalued dependency. | 5NF