Post on 14-Jan-2016
Normalization
• Normalization is the gradual, sequential process of efficiently organizing data in a database according to the rules listed in the previous slide
  – It commonly involves applying three normal forms, in order: First, Second, and Third Normal Form (1NF, 2NF, 3NF)
  – This is commonly done during the early stages, on UML class diagrams
• The goal of normalization is to:
  – Eliminate the duplication of data (which makes a database large, inefficient, and slow), which in turn prevents data manipulation anomalies and loss of data integrity
    • Changes to the same data made in different places may not agree
  – This is done by creating tables, assigning a PK to each table, and making sure that each piece of information shows up only once in the database
• It eliminates redundant data (storing the same data in more than one table) and ensures that data dependencies are logical (only related data are stored in a table)
• Normalization reduces the amount of space a database consumes and ensures data is logically stored
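The data manipulation anomaly mentioned above can be illustrated with a minimal Python sketch. The row layout and the new address value ("1 New Street") are hypothetical and used only for illustration; the investigator and the original address come from the Analysis example used later in these slides.

```python
# Denormalized: the investigator's address is repeated on every row.
rows = [
    {"investigator": "John Wayne", "analysis": "XRF", "address": "3500 Pacific View Dr"},
    {"investigator": "John Wayne", "analysis": "Isotopic age", "address": "3500 Pacific View Dr"},
]

# An update that reaches only one copy leaves two conflicting addresses.
rows[0]["address"] = "1 New Street"  # hypothetical new address
addresses = {r["address"] for r in rows if r["investigator"] == "John Wayne"}
inconsistent = len(addresses) > 1

# Normalized: the address is stored once and referenced by a key.
investigators = {1: {"name": "John Wayne", "address": "3500 Pacific View Dr"}}
orders = [
    {"investigator_id": 1, "analysis": "XRF"},
    {"investigator_id": 1, "analysis": "Isotopic age"},
]
investigators[1]["address"] = "1 New Street"  # one update, seen by every order
normalized_addresses = {investigators[o["investigator_id"]]["address"] for o in orders}
```

In the denormalized layout the two rows now disagree, while in the normalized layout every order sees the single updated address.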
First Normal Form (1NF)
• 1NF deals with duplicative data across multiple columns!
• It sets the very basic rules to make sure that:
  – Separate tables are created for each group of related data (e.g., IsotopicAge, Fold, Rock)
    • i.e., each table should represent a distinct entity
1. Duplicative (repeating) columns containing the same type of data are removed from the same table
  • There should be no repeated data types: Mineral1, Mineral2, Mineral3 or cellPhone, homePhone, workPhone
  • These should go to a new table
2. All columns must contain a single value, i.e.,
  • All attributes must be atomic, not multi-valued. Each cell must hold only one value, e.g., XRF, not XRF, REE, Isotope
3. There should be a set of one or more columns that uniquely identify each row, i.e., there should be a primary key
Another example: Analysis table

Investigator      AnalysisType   Address
Hassan Babaie     XRF            24 Peachtree Center Ave, Atlanta, GA 30303
John Wayne        XRF, XRD, REE  3500 Pacific View Dr, Newport Beach, CA
Elizabeth Tucker  Petrography    1100 Angela Ra, Charlotte, NC,
John Wayne        Isotopic age   3500 Pacific View Dr, Newport Beach, CA

• Investigators submit their samples to an analyzing company. The company stores the above set of data for its customers
• What are the problems?
  – This is not in 1NF
  – The AnalysisType column does not represent a distinct entity
    • We can’t find out how many people ordered an XRF analysis; the values are all mixed together
  – The Address column is compound and needs to move out into another table. City depends on zip code.
  – There is no PK
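As a rough illustration of the atomicity rule, the sketch below splits the multi-valued AnalysisType cells of the Analysis table into one row per investigator and analysis, after which the XRF orders can be counted. The function name `to_atomic` is an assumption made for this example; the data are the rows shown in the table above.

```python
from collections import Counter

# (Investigator, AnalysisType) pairs as stored in the non-1NF table.
raw_rows = [
    ("Hassan Babaie", "XRF"),
    ("John Wayne", "XRF, XRD, REE"),
    ("Elizabeth Tucker", "Petrography"),
    ("John Wayne", "Isotopic age"),
]

def to_atomic(rows):
    """Return one (investigator, analysis) row per value in a multi-valued cell."""
    atomic = []
    for investigator, analysis_types in rows:
        for analysis in analysis_types.split(","):
            atomic.append((investigator, analysis.strip()))
    return atomic

atomic_rows = to_atomic(raw_rows)

# With atomic values, counting the orders per analysis type is straightforward.
counts = Counter(analysis for _, analysis in atomic_rows)
print(counts["XRF"])  # 2
```

Each cell now holds a single value, so a question such as "how many people ordered XRF" becomes a simple count.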
Second Normal Form (2NF)
• 2NF deals with redundancy across multiple rows!
• Second normal form (2NF) further addresses the concept of removing duplicative data
• Meet all the requirements of the first normal form (1NF)
• Identify columns whose data repeat in different places
  – Remove them to their own table
• In the next slide, we see that data for Joe Strat is repeated. The solution is to remove the Alum column (with its address and school) into their own tables, called Alum and School
• See next slide for more!
An improved Analysis Table
• Now we can query on the type of analysis
• There are still problems with the structure:
  • There are still redundancies
  • The company can only keep track of three types of analyses; four would not work!
  • Address is still compound; needs to be broken
  • It is difficult to determine the analysis order for each person.
    – Order in this case depends on non-PK columns

Investigator      Analysis1    Analysis2  Analysis3  orders  Address
Hassan Babaie     XRF                                        Department of Geosciences, GSU, Atlanta, GA 30303
John Wayne        XRF          XRD        REE                3500 Pacific View Dr, Newport Beach, CA
Elizabeth Tucker  Petrography                                1100 Angela Ra, Charlotte, NC,
John Wayne        Isotopic                                   3500 Pacific View Dr, Newport Beach, CA

Better solution
• We need to break the table into several tables:
  – Investigator, Analysis, Order, OrderItems, and Address

InvestigatorTable
investiID  lastName  firstName  affiliation
1          Wayne     John       ExHollywood
2          Babaie    Hassan     GSU

AnalysisTable
AnalysisID  AnalysisType
1           XRF
2           XRF

AddressTable
Number  Street                City           State  zipCode  Country
3500    Pacific View Dr.      Newport Beach  CA     92662    USA
24      Peachtree Center Ave  Atlanta        GA     30303
…

• Order and OrderItem Tables, partially shown

OrderItemTable
OrderItemID  OrderID  AnalysisID  Qty
1            1        1           2
2            2        2           1

OrderTable
OrderID  InvestiID  OrderDate  DeliveryDate
1        1          3/5/1960   4/30/1960
2        2          2/17/2013  3/12/2013

Some improvement
• Analysis: AnalysisID, AnalysisType
• OrderItem: OrderItemID, OrderID, AnalysisID, Qty
• Order: OrderID, InvestID, OrderDate, DeliveryDate
• Investigator: InvestID, FirstName, LastName, Address
• Address: AddressID, Number, Stree…
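The decomposition into Investigator, Analysis, Order, and OrderItem tables can be sketched with Python's built-in sqlite3 module. This is a minimal sketch, not a definitive schema: the column types, the in-memory database, and the foreign-key declarations are assumptions, while the rows are copied from the partial tables shown above. Order is quoted because it is an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE Investigator (
    investID INTEGER PRIMARY KEY, lastName TEXT, firstName TEXT, affiliation TEXT);
CREATE TABLE Analysis (
    analysisID INTEGER PRIMARY KEY, analysisType TEXT);
CREATE TABLE "Order" (
    orderID INTEGER PRIMARY KEY,
    investID INTEGER REFERENCES Investigator(investID),
    orderDate TEXT, deliveryDate TEXT);
CREATE TABLE OrderItem (
    orderItemID INTEGER PRIMARY KEY,
    orderID INTEGER REFERENCES "Order"(orderID),
    analysisID INTEGER REFERENCES Analysis(analysisID),
    qty INTEGER);
""")

# Rows as shown on the slides (dates kept as the slide's text).
conn.executemany("INSERT INTO Investigator VALUES (?, ?, ?, ?)",
                 [(1, "Wayne", "John", "ExHollywood"), (2, "Babaie", "Hassan", "GSU")])
conn.executemany("INSERT INTO Analysis VALUES (?, ?)", [(1, "XRF"), (2, "XRF")])
conn.executemany('INSERT INTO "Order" VALUES (?, ?, ?, ?)',
                 [(1, 1, "3/5/1960", "4/30/1960"), (2, 2, "2/17/2013", "3/12/2013")])
conn.executemany("INSERT INTO OrderItem VALUES (?, ?, ?, ?)", [(1, 1, 1, 2), (2, 2, 2, 1)])

# Who ordered which analysis, and how many? Joins follow the foreign keys.
order_lines = conn.execute("""
    SELECT i.firstName, i.lastName, a.analysisType, oi.qty
    FROM OrderItem AS oi
    JOIN "Order" AS o ON o.orderID = oi.orderID
    JOIN Investigator AS i ON i.investID = o.investID
    JOIN Analysis AS a ON a.analysisID = oi.analysisID
    ORDER BY oi.orderItemID
""").fetchall()
```

Because each order item is its own row, an investigator can order any number of analyses, which removes the three-analysis limit of the Analysis1, Analysis2, Analysis3 layout.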
Third Normal Form (3NF)
• Third normal form goes one large step further
• Meet all the requirements of the 2NF
• No transitive functional dependencies
  – Remove columns that are not dependent upon the primary key
    • Remove columns whose values depend on columns other than the PK
  – This means: remove subkeys
3NF, cont’d
• There should be no transitive functional dependencies
• If x → y, i.e., x functionally determines y (y is functionally dependent on x), then given x, we can find y.
  – For example, in the Address table, given the nine-digit zip code, we can find the city and state because they are functionally dependent on the zip code. The opposite is not true: given a city, we cannot find the zip code (note: some cities have several zip codes)
• By definition, a super key (primary key) functionally determines all other attributes in the table
• The zip code is a subkey (not a superkey) because it only determines the city and state part of the Address table, not the other attributes
• To take care of this functional dependency issue, take 3 steps:
  – Remove all the attributes that depend on the subkey from the table (e.g., city and state from the Address table)
  – Move them into a new table (e.g., call it ZipLocations, with zipCode, city, and state attributes)
  – Keep a copy of the subkey attribute (i.e., zipCode) in the original table as a foreign key
• The address table now has firstname, last name, street (these 3 make the PK), and zipCode (as FK to the other table).
• Summary: Subkeys always result in redundant data and must be removed!
• In other words, remove subsets of data that apply to multiple rows of a table and place them in separate tables
  – i.e., remove duplicative data
  – For example, break address into its independent constituents that do not depend on each other
• Create relationships between these new tables and their predecessors through the use of foreign keys
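The zip code example can be sketched in Python as follows. The helper `functionally_determines` and the dictionary layout are assumptions made for this illustration, and the third address row (street and zip code) is hypothetical, added only so that one city has two zip codes.

```python
def functionally_determines(rows, lhs, rhs):
    """Return True if each combination of `lhs` values maps to exactly one `rhs` value."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True

addresses = [
    {"street": "3500 Pacific View Dr.", "city": "Newport Beach", "state": "CA", "zipCode": "92662"},
    {"street": "24 Peachtree Center Ave", "city": "Atlanta", "state": "GA", "zipCode": "30303"},
    # Hypothetical row: a second zip code in the same city.
    {"street": "1 Example St", "city": "Atlanta", "state": "GA", "zipCode": "30309"},
]

zip_gives_city = functionally_determines(addresses, ["zipCode"], ["city", "state"])
city_gives_zip = functionally_determines(addresses, ["city"], ["zipCode"])

# Steps 1 and 2: move the attributes that depend on the subkey into ZipLocations.
zip_locations = {r["zipCode"]: (r["city"], r["state"]) for r in addresses}
# Step 3: keep zipCode in the original table as a foreign key.
address_table = [{"street": r["street"], "zipCode": r["zipCode"]} for r in addresses]
```

Here zipCode determines city and state, but city does not determine zipCode, so city and state move to ZipLocations and only zipCode stays behind as the foreign key.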
Fourth Normal Form (4NF)
• Normalizing a database to the 3NF is usually sufficient
• Finally, fourth normal form (4NF) has one additional requirement
• Meet all the requirements of the third normal form
• A relation is in 4NF if it has no multi-valued dependencies
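A multi-valued dependency can be sketched in Python as below. The relation (investigator, analysis type, phone type), the helper name `has_multivalued_dependency`, and the rows are all hypothetical, chosen to echo the earlier examples. The check assumes a three-column relation in which the first column is the determining attribute.

```python
from itertools import product

def has_multivalued_dependency(rows):
    """rows are (x, y, z) triples; True if y and z vary independently for each x (x ->> y)."""
    by_x = {}
    for x, y, z in rows:
        by_x.setdefault(x, set()).add((y, z))
    for pairs in by_x.values():
        ys = {y for y, _ in pairs}
        zs = {z for _, z in pairs}
        # The dependency holds only if every (y, z) combination is present.
        if pairs != set(product(ys, zs)):
            return False
    return True

# Hypothetical relation: analyses and phone types are independent facts about
# an investigator, so every combination has to be stored.
rows = [
    ("John Wayne", "XRF", "cellPhone"),
    ("John Wayne", "XRF", "workPhone"),
    ("John Wayne", "XRD", "cellPhone"),
    ("John Wayne", "XRD", "workPhone"),
]
violates_4nf = has_multivalued_dependency(rows)

# Decomposing into two tables stores each independent fact once.
investigator_analysis = sorted({(x, y) for x, y, _ in rows})
investigator_phone = sorted({(x, z) for x, _, z in rows})
```

The four combined rows shrink to two rows in each of the new tables, and adding a third analysis no longer requires one new row per phone type.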