Database Normalization Tips
Luke Chung
FMS, President
September 2002
Applies to:
Microsoft® Access
Summary: This article offers tips to developers to help them avoid some of the pitfalls when
designing Access tables. This article applies to Microsoft Access databases (.mdb) and Microsoft
Access projects (.adp).
Contents
Introduction
Understanding Your Data
What Data Do You Need?
What Are You Going to Do with the Data?
How Is Your Data Related to Each Other?
What Is Going to Happen to the Data Over Time?
Learn How to Use Queries
Database Normalization Concepts
Store Unique Information in One Place
Records are Free, New Fields are Expensive
Know When Data Needs to Be Duplicated
Use a Meaningless Field for the Key Field
Use Referential Integrity
Conclusion
Introduction
One of the most important steps in designing a database is ensuring that the data is properly
distributed among its tables. With proper data structures, the remainder of the application (the
queries, forms, reports, code, and so on) is significantly simplified. The formal name for proper table
design is database normalization.
This article is an overview of the basic database normalization concepts and some common pitfalls to
consider and avoid.
Understanding Your Data
Before proceeding with table design, it's important to understand what you're planning to do with your
data and how it will change over time. The assumptions you make will affect the eventual design.
What Data Do You Need?
When designing an application, it's critical to understand the final results to ensure that you have all
the necessary data and know where it comes from. For instance, what is the appearance of the
reports, where does each piece of data come from, and does all the data exist? Nothing is more
damaging to a project than the realization, late in the process, that data is missing for an important
report.
Once you know what data you need, you must determine where it comes from. Is the data imported
from another source? Does that data need to be cleaned or verified? Does the user enter data?
Having a firm grasp of what data is required and where it comes from is the first step in database
design.
What Are You Going to Do with the Data?
Will your users need to edit the data and, if so, how should the data be displayed for them to
understand and edit? Are there validation rules and related lookup tables? Are there auditing issues
associated with data entry that require keeping backups of edits and deletions? What kind of summary
information needs to be displayed to the user? Do you need to generate export files? With this
information, you can envision how the fields are related to each other.
How Is Your Data Related to Each Other?
Group your data into related fields (such as customer-related information, invoice-related information,
and so on). Each group of fields represents a future table. You should then consider how the tables are
related to each other. For instance, what tables are related in a one-to-many relationship (for example,
one customer may have multiple invoices)? What tables have a one-to-one relationship (often a
candidate for combining into one table)?
What Is Going to Happen to the Data Over Time?
When tables are designed, the impact of time is often overlooked, which can cause huge problems
later. Many table designs work perfectly well for immediate use. However, many designs break down
as users modify the data, as new data gets added, and as time passes. Often, developers find they
need to restructure their tables to accommodate these changes. When table structures change, all
their dependencies (queries, forms, reports, code, and so on) also need to be updated. By
understanding and anticipating change over time, a better design can be implemented to minimize the
problems.
Learn How to Use Queries
Understanding how you are going to analyze and manipulate the data is also important. You should
have a firm grasp of how queries work, how to use them to link data across multiple tables, how to use
them to group and summarize data, and how to use crosstab queries when you need to display data in
non-normalized format.
Ultimately, the goal of good data design is to balance storing the data efficiently over time
against retrieving and analyzing it easily. Understanding the power of queries helps significantly
with designing your tables properly.
Database Normalization Concepts
Rather than presenting a theoretical discussion about database normalization, this section explains
basic concepts involved in database normalization. How you apply them in your situation may differ
based on the needs of your application. The goal is to understand these basic concepts, apply them
when you can, and understand the issues when you need to deviate from them.
Store Unique Information in One Place
Most database developers understand the basic concept of data normalization. Ideally, you store
each piece of data in one place and refer to it with an ID wherever you need it. That way, if
some information changes, you change it in one place and the change is reflected throughout
your application.
For instance, a customer table would store a record for each customer, including name, address,
phone numbers, e-mail address, and other characteristics. The customer table would have a unique
CustomerID field (usually an AutoNumber field) that is its key field and used by other tables to refer
to the customer. Therefore, an invoice table, rather than storing all the customer information with
each invoice (because the same customer may have multiple invoices), would simply refer to the
customer ID value, which could be used to look up the customer details in the customer table. Access
makes it very easy to do this through its powerful forms that use combo boxes and subforms. If you
need to make a change to the customer's information (such as a new phone number), you can change
it in the customer table and know that any other part of your application that references that
information is automatically updated.
With a properly normalized database, changes to data over time are easily handled with a simple edit.
Improperly normalized databases often need code or queries to make the same change across
multiple records or tables. This not only requires more work to implement, but it also increases the
chances of the data becoming inconsistent if the code or queries don't execute properly.
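The customer and invoice example above can be sketched as follows. This is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for an Access database, and the table names, field names, and sample values are assumptions made for the example.

```python
import sqlite3

# SQLite stands in for Access/Jet here; all names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY,   -- plays the role of an AutoNumber key
        Name       TEXT NOT NULL,
        Phone      TEXT
    );
    CREATE TABLE Invoices (
        InvoiceID  INTEGER PRIMARY KEY,
        CustomerID INTEGER NOT NULL REFERENCES Customers(CustomerID),
        Amount     REAL NOT NULL
    );
    INSERT INTO Customers (CustomerID, Name, Phone) VALUES (1, 'Acme Corp', '555-0100');
    INSERT INTO Invoices (InvoiceID, CustomerID, Amount) VALUES (101, 1, 250.0), (102, 1, 75.5);
""")

# One edit in the customer table...
conn.execute("UPDATE Customers SET Phone = ? WHERE CustomerID = ?", ("555-0199", 1))

# ...is seen by every invoice, because invoices look the customer up by ID.
invoice_phones = conn.execute("""
    SELECT i.InvoiceID, c.Phone
    FROM Invoices AS i
    JOIN Customers AS c ON c.CustomerID = i.CustomerID
    ORDER BY i.InvoiceID
""").fetchall()
print(invoice_phones)  # [(101, '555-0199'), (102, '555-0199')]
```

Because the phone number is stored only in the customer table, no code is needed to keep the invoices in step with it.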
Records are Free, New Fields are Expensive
Databases should be designed so that over time, you simply add new records. Database tables are
designed to hold huge numbers of records. However, if you find you need to add more fields, you
probably have a design problem.
This often happens with spreadsheet experts who design databases the way they are accustomed to
designing spreadsheets. Creating a separate field for each value of something that grows over time
(such as each year, quarter, product, or salesperson) requires new fields to be added in the future.
The correct design is to transpose the information and store the changing value in a single field so
that more records can be added. For instance, rather than creating a separate field for each year,
create a Year field, and enter the value of each record's year in that field.
Adding fields is problematic because structural changes to a table affect other parts of the
application. When more fields are added to a table, the objects and code
that depend on the table also need to be updated. For instance, queries need to grab the extra fields,
forms need to display them, reports need to include them, and so on. However, if the data were
normalized, the existing objects would automatically retrieve the new data and calculate or display it
correctly. Queries are particularly powerful because they allow you to group on the Year field to show
summaries by year, no matter what years are in your table.
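As a rough sketch of this point, the example below stores one record per salesperson per year and summarizes with a grouping query. It uses Python's sqlite3 module in place of Access, and the Sales table, its fields, and the figures are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Normalized layout: one record per salesperson per year, instead of
# Sales2001, Sales2002, ... fields that would need a new column every year.
conn.execute(
    "CREATE TABLE Sales (SalesID INTEGER PRIMARY KEY, Salesperson TEXT, Year INTEGER, Amount REAL)"
)
conn.executemany(
    "INSERT INTO Sales (Salesperson, Year, Amount) VALUES (?, ?, ?)",
    [("Ann", 2001, 100.0), ("Bob", 2001, 50.0), ("Ann", 2002, 120.0)],
)

def totals_by_year(connection):
    """Summarize sales per year; the query never names a specific year."""
    return connection.execute(
        "SELECT Year, SUM(Amount) FROM Sales GROUP BY Year ORDER BY Year"
    ).fetchall()

before = totals_by_year(conn)

# A new year arrives as new records; neither the table nor the query changes.
conn.execute(
    "INSERT INTO Sales (Salesperson, Year, Amount) VALUES (?, ?, ?)", ("Bob", 2003, 80.0)
)
after = totals_by_year(conn)

print(before)  # [(2001, 150.0), (2002, 120.0)]
print(after)   # [(2001, 150.0), (2002, 120.0), (2003, 80.0)]
```

With one field per year, the same change would have required altering the table and then every query, form, and report that reads it.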
Data normalization does not mean, however, that you can't display or use data with time-sensitive or
time-dependent fields. Developers who need to show and display such information can often do so by
using crosstab queries. If you aren't familiar with crosstab queries, you should learn how to use
them. They are not the same as tables (in particular, you cannot edit the results of a crosstab query),
but they can certainly be used for displaying information in a datasheet (up to 255 fields). If you want
to use them in reports, it's more complicated because your report will need to accommodate the
additional or changing field names. That's why most reports will show data as separate groupings
within the report, rather than as separate columns. For those instances where you have no choice,
you'll have to invest the time to support this, and all parties should understand that such
decisions require additional resources over time.
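To show what a crosstab does, here is a minimal sketch that pivots normalized records into one column per year. In Access the crosstab query itself does this work (Access SQL offers TRANSFORM and PIVOT for it). SQLite has no equivalent, so the pivot below is done in Python; the Sales table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sales (Salesperson TEXT, Year INTEGER, Amount REAL)")
conn.executemany(
    "INSERT INTO Sales VALUES (?, ?, ?)",
    [("Ann", 2001, 100.0), ("Ann", 2002, 120.0), ("Bob", 2001, 50.0)],
)

def crosstab(connection):
    """Return (headings, rows) with one column per year found in the data."""
    years = [y for (y,) in connection.execute(
        "SELECT DISTINCT Year FROM Sales ORDER BY Year")]
    totals = {}
    for person, year, total in connection.execute(
        "SELECT Salesperson, Year, SUM(Amount) FROM Sales GROUP BY Salesperson, Year"
    ):
        totals.setdefault(person, {})[year] = total
    # Missing combinations (Bob has no 2002 sales) become None, like empty cells.
    rows = [[person] + [by_year.get(y) for y in years]
            for person, by_year in sorted(totals.items())]
    return ["Salesperson"] + years, rows

headings, rows = crosstab(conn)
print(headings)  # ['Salesperson', 2001, 2002]
print(rows)      # [['Ann', 100.0, 120.0], ['Bob', 50.0, None]]
```

The result is a display-only view. The column headings come from the data, which is why a report built on such output has to cope with headings that change as new years are added.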
So, that's why additional records are free (the big advantage of databases) and why additional fields
are so expensive. Databases can accommodate massive amounts of change, if they are designed
properly.
Know When Data Needs to Be Duplicated
Sometimes, data needs to be de-normalized to preserve information that may change over time.
In our simple example of an invoice linked to the customer table via a customer ID number, we may
need to preserve the customer address at the time the invoice is issued (not at the time it's
created, because the customer information may change between the two events). If we did not
preserve the address at the time the invoice was issued, and we had to update the customer
information in the future, we may not be able to confirm the exact address to which a particular
invoice was sent. This could be a huge business problem. Of course, some information, like the
customer's phone number, may not need to be preserved. Therefore, one should selectively determine
what data should be duplicated.
Another example in which data needs to be duplicated is when filling out the line items of an invoice.
Often a price list is used to pick the items the customer ordered. One could simply store the price list
ID to point to the price list with its product description, price, and other details. However, product
descriptions and prices change over time. If you don't copy the data from the price list into the line
items table, you cannot accurately reprint the original invoice in the future, which can be a big
problem if you haven't been paid yet.
So while normalization works well for keeping the same data in one place and simplifies editing, there
are situations in which such benefits are not desired. If you need a snapshot of your data for historic
reasons, it's critical you design it into your database at the beginning. Otherwise, once the data is
overwritten, you can't get it back.
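The line-item case can be sketched like this. The example again uses Python's sqlite3 module rather than Access. The PriceList and InvoiceLines tables, the add_line helper, and the sample prices are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PriceList (
        PriceListID INTEGER PRIMARY KEY,
        Description TEXT NOT NULL,
        Price       REAL NOT NULL
    );
    CREATE TABLE InvoiceLines (
        LineID      INTEGER PRIMARY KEY,
        InvoiceID   INTEGER NOT NULL,
        PriceListID INTEGER NOT NULL REFERENCES PriceList(PriceListID),
        Description TEXT NOT NULL,   -- copied on purpose (snapshot)
        Price       REAL NOT NULL,   -- copied on purpose (snapshot)
        Quantity    INTEGER NOT NULL
    );
    INSERT INTO PriceList VALUES (1, 'Widget', 10.0);
""")

def add_line(connection, invoice_id, price_list_id, quantity):
    """Copy the current description and price into the new line item."""
    connection.execute("""
        INSERT INTO InvoiceLines (InvoiceID, PriceListID, Description, Price, Quantity)
        SELECT ?, PriceListID, Description, Price, ?
        FROM PriceList WHERE PriceListID = ?
    """, (invoice_id, quantity, price_list_id))

add_line(conn, 101, 1, 3)

# The price list changes later; the invoice keeps what was actually charged.
conn.execute("UPDATE PriceList SET Price = 12.5 WHERE PriceListID = 1")

invoiced = conn.execute(
    "SELECT Description, Price, Quantity FROM InvoiceLines WHERE InvoiceID = 101"
).fetchall()
current_price = conn.execute(
    "SELECT Price FROM PriceList WHERE PriceListID = 1"
).fetchone()[0]
print(invoiced, current_price)  # [('Widget', 10.0, 3)] 12.5
```

The line item still carries the price list ID, so it can be traced back to the product, but reprinting the invoice reads only the copied fields.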
Use a Meaningless Field for the Key Field
For efficiency, each table should have a key field. The key field defines uniqueness in the table and is
used by indexes on its other fields to improve search performance. For instance, the customer table
could have a CustomerID field that defines a unique number for each customer. For the purposes of
this discussion, we are considering tables that have multiple fields and not a simple single table
lookup, such as a list of countries.
In general, a key field should have these characteristics:
Should be One Field
It is possible to define multiple fields as the key fields of a table, but a single field is preferable.
First, if multiple fields are necessary to define uniqueness, it takes up more space to store the
key. Second, additional indexes on the table also have to use the combination of the key fields,
which takes up more space than if it were a single field. Finally, identifying records in the table
requires grabbing a combination of fields. It's far better to have a CustomerID field than a
combination of other fields to define a customer.
Should be Numeric
Access offers an AutoNumber field type that is a Long Integer, which is ideal for key fields.
These values are automatically unique for each record, and they support multi-user data entry as
well.
Should Not Change Over Time
A key field should not change over time. Once assigned, like a Social Security number, it should
never change. A key field that changes makes it very difficult to use historic data because the
links break.
Should be Meaningless
To ensure that a key field doesn't change over time, it should have no meaning. A meaningless
key value is also helpful in situations in which the other data is incomplete. For instance, you can
assign a customer number without having someone's complete address. The remainder of your
application can work perfectly, and you can add the information when you receive it. If your table
used, as part of its key, the country or some other identifying field you didn't have, you would run
the risk of not being able to use your application.
So, for all the reasons listed above, we recommend using an AutoNumber field as the key field for
most of your tables. By using combo boxes and hidden columns, you can actually bind fields to the
AutoNumber field and hide it from the user.
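A minimal sketch of these characteristics follows, using Python's sqlite3 module. SQLite's INTEGER PRIMARY KEY AUTOINCREMENT column is used here as a rough analogue of an Access AutoNumber field; the two are not identical. The Customers table and the add_customer helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customers (
        CustomerID INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for AutoNumber
        Name       TEXT NOT NULL,
        Country    TEXT                                -- may be unknown at first
    )
""")

def add_customer(connection, name, country=None):
    """Insert a customer and return the generated, meaningless key."""
    cursor = connection.execute(
        "INSERT INTO Customers (Name, Country) VALUES (?, ?)", (name, country)
    )
    return cursor.lastrowid

first_id = add_customer(conn, "Acme Corp")          # country not known yet
second_id = add_customer(conn, "Globex", "Canada")

# Descriptive data can be filled in or corrected later; the key never changes.
conn.execute(
    "UPDATE Customers SET Name = ?, Country = ? WHERE CustomerID = ?",
    ("Acme Corporation", "USA", first_id),
)
row = conn.execute(
    "SELECT CustomerID, Name, Country FROM Customers WHERE CustomerID = ?", (first_id,)
).fetchone()
print(row)
```

The key is a single numeric field that says nothing about the customer, so the first record could be created before the country was known and renamed afterward without breaking any link to it.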
Use Referential Integrity
Once your tables are defined and you understand how they are related to each other, be sure to add
referential integrity to enforce the relationship. This prevents linked fields from being modified
incorrectly and leaving "orphaned" records. The Microsoft Jet Database Engine supports sophisticated
referential integrity, which allows you to have cascading updates and deletes. In general, you should
not be changing the ID field. Therefore, cascading updates are less of an issue, but cascading deletes
can be very helpful.
For instance, if you have an invoice table related to an orders table where one invoice can have an
unlimited number of orders (line items) and each order record contains the invoice number it is linked
to, cascading deletes allow you to delete the invoice record and automatically delete all its
corresponding order records. That ensures that you never have an order record without a
corresponding invoice record.
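The invoice and order example can be sketched as below. In Access you set referential integrity and cascading options on the relationship itself. This sketch instead uses Python's sqlite3 module, where the rule is declared in the table definition and foreign key enforcement must be switched on per connection. Table and field names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Invoices (
        InvoiceID   INTEGER PRIMARY KEY,
        InvoiceDate TEXT
    );
    CREATE TABLE Orders (
        OrderID   INTEGER PRIMARY KEY,
        InvoiceID INTEGER NOT NULL
                  REFERENCES Invoices(InvoiceID) ON DELETE CASCADE,
        Item      TEXT NOT NULL
    );
    INSERT INTO Invoices VALUES (101, '2002-09-01');
    INSERT INTO Orders (InvoiceID, Item) VALUES (101, 'Widget'), (101, 'Gadget');
""")

# An order that points at a nonexistent invoice is rejected outright.
try:
    conn.execute("INSERT INTO Orders (InvoiceID, Item) VALUES (999, 'Orphan')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the invoice automatically deletes its order records.
conn.execute("DELETE FROM Invoices WHERE InvoiceID = 101")
remaining_orders = conn.execute("SELECT COUNT(*) FROM Orders").fetchone()[0]
print(orphan_rejected, remaining_orders)  # True 0
```

Both behaviors come from the declared relationship rather than from application code, so no form or routine can leave an order record without its invoice.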
Conclusion
We hope you'll be able to apply these database design concepts early in your application design to
avoid the many problems, and the remedies they require, that arise when such designs are not
implemented. Good luck.