DBMS

History of Database

Database

It is often said that we live in an information society and that information is a very valuable resource (or, as some people say, information is power). In this information society, the term database has become a rather common term although its meaning seems to have become somewhat vague as the importance of database systems has grown. Some people use the term database of an organization to mean all the data in the organization (whether computerized or not). Other people use the term to mean the software that manages the data. We would use it to mean a collection of computerized information such that it is available to many people for various uses.

“A database is a well organized collection of data that are related in a meaningful way which can be accessed in different logical orders but are stored only once. The data in the database is therefore integrated, structured, and shared.”

Or

“A Database can be considered an organized collection of information.”

For example, it is easy to find a book in a library because books are arranged in an organized manner. The tools that help us

Store Data

Adding new Information

Changing the existing information

Deleting the existing information

Viewing the information

The data is managed by a Database Management System.

The phone book is a good example of a simple database. For example, to find the phone number of Mr. Jonathan Smith, living at No.5 4 th Avenue Apt 15, New York, NY, one may use the following process: Get the phone book for the city of New York. Flip to the section where the last name is ‘Smith’. More than one Mr. Smith may be listed. In that case, look for Mr. Smith whose first name is Jonathan. Look for Jonathan Smith living at No. 5 4 th Avenue Apt 15, New York, NY to get the corresponding phone number. In the example above, the phone

number can be found because the information in the phone book was organized in a certain manner.

In the early days of computing, computers were mainly used for solving numerical problems. Even in those days, it was observed that some of the tasks were common in many problems that were being solved. It therefore was considered desirable to build special subroutines which perform frequently occurring computing tasks such as computing Sin(x). This lead to the development of mathematical subroutine libraries that are now considered integral part of any computer system.

By the late fifties storage, maintenance and retrieval of non-numeric data had become very important. Again, it was observed that some of the data processing tasks occurred quite frequently e.g. sorting. This lead to the development of generalized routines for some of the most common and frequently used tasks. A general routine by its nature tends to be somewhat less efficient than a routine designed for a specific problem. In the late fifties and early sixties, when the hardware costs were high, use of general routines was not very popular since it became a matter of trade-off between hardware and software costs. In the last two decades, hardware costs have gone down dramatically while the cost of building software has gone up. It is no more a trade-off between hardware and software costs since hardware is relatively cheap and building reliable software is expensive.

Database management systems evolved from generalized routines for file processing as the users demanded more extensive and more flexible facilities for managing data. Database technology has undergone major changes during the 1970's. An interesting history of database systems is presented by Fry and Sibley (1976).

Types of Databases

There are many techniques and technologies for managing

DBMS (Database Management Systems)

dBase, Fox Pro

HDBMS (Hierarchical DBMS)

IMS

NDBMS (Network DBMS)

IDMS

RDBMS (Relational DBMS)

Oracle, SQL Server, Sybase, Informix, etc.

ORDBMS (Object Relational DBMS)

Oracle

DBMS: - Database Management Systems have been used for quite a while. Software packages like dBase; Clipper, etc had facilities for managing data. The user had to take care of two aspects – specify what data was to be retrieved and how to go and get the data.

HDBMS: - Managed the data in a hierarchical fashion. IBM’s product IMS is a Hierarchical Database. It is used in mainframe computers. It is still in use in some installations. Over a period of time the demands of the users prompted a search for alternate approaches.

NDBMS: - Managed the data according to network models. IBM’s product IDMS is a Network Database. It is also used in mainframe computers. Like the HDBMS (IMS), it is still used in a few installations. Over a period of time disadvantages of the network database technology prompted the search for an alternate solution. With more powerful personal computer systems and the inherent limitations of the Hierarchical and Network database models, the popularity of the Relational Database Model increased.

RDBMS: - Was invented by the team lead by Dr. Edmund F. Codd and funded by IBM in the early 1970’s. The Relational Model is based on the principles of relational algebra. This model also known as the Relational Database Management System is very popular and is in use by a majority of the Fortune 500 companies. This model will be used in great detail in this course. Many vendors sell products that conform to this model. Some of these vendors are: Oracle, SQL Server, Sybase, Informix, Ingress, Gupta SQL, DB2, and Microsoft Access.

ORDBMS: - This model has been proposed and is presently being implemented by the Oracle Corporation. This model addresses the shortcomings of the RDBMS model. The Object Relational Database Management System is not purely object oriented. However, it is implemented via the same relational engine that drives the Relational Database Management System.

Data is stored in tables. A table is the combination of columns and rows. Each column contains a certain type of data about a person. For example, the first column contains the last name, the second column contains the first name and

the third column contains the address. Each row contains all types of data about a person.

Client / Server Database Concepts

The Client computer sends requests to the server. The Server processes the requests and delivers results to the client.

Server: - This is usually a high powered computer. Typically it could consist of several processors (2, 3 or 4), a very large amount of RAM (2GB or 4GB), a very large amount of hard disk storage space (possibly a few hundred GB), high speed network data transfer rates, etc. The server need not be from a specific vendor. It could be purchased from any vendor. The server platform may be UNIX, Windows NT, AIX, Sun Solaris, Novell, etc.

Client: - These are usually personal computers. They have processing power of their own. The client operating system may be Windows 95, Windows 98, Windows 2000, Windows NT Workstation, Sun Sparc Station, Mac OS, etc.

Operation: - Operation of the client server network is as follows:

• The server is powered on and is ready to receive connections / requests. The network administrator configures the server.

• Each client is powered on and is required to provide user identification and authentication information (such as a user ID and a password) to connect to the server. In a multiple server environment, the client computer may also be required to specify the location of the server to which it must connect. This setup creates the infrastructure for other programs to be used on this configuration. In most environments, databases, e-mail servers, web servers are run on a server computer. Clients use the resources offered by these databases, websites, etc.

DBMS

( Database Management System )

As discussed earlier, a database is a well organized collection of data. To be able to carry out operations like insertion, deletion and retrieval, the database needs to be managed by a substantial package of software. This software is usually called a Database Management System (DBMS). The primary purpose of a DBMS is to allow a user to store, update and retrieve data in abstract terms and thus make it easy to maintain and retrieve information from a database. A DBMS relieves the user from having to know about exact physical

representations of data and having to specify detailed algorithms for storing, updating and retrieving data.

A DBMS is usually a very large software package that carries out many different tasks including the provision of facilities to enable the user to access and modify information in the database. The database is an intermediate link between the physical database, the computer and the operating system, and on the other hand, the users. To provide the various facilities to different types of users, a DBMS normally provides one or more specialized programming languages often called Database Languages. Different DBMS provide different database languages although a language called SQL has recently taken on the role of a de facto standard. Database languages come in different forms. A language is needed to describe the database to the DBMS as well as provide facilities for changing the database and for defining and changing physical data structures. Another language is needed for manipulating and retrieving data stored in the DBMS. These languages are called Data Description Languages (DDL) and Data Manipulation Languages (DML) respectively.

To summarize, a database system consists of

1. The database (data) 2. A DBMS (software) 3. A DDL and a DML (Part of the DBMS)Application programs

Some DBMS packages are marketed by computer manufacturers that will run only on that manufacturer's machines (e.g. IBM's IMS) but increasingly independent software houses are designing and selling DBMS software that would run on several different types of machines (e.g. ORACLE).

Evolution of Database Management Systems

The history of Database Management Systems (DBMS) is an evolutionary history. Each successive wave of innovation can be seen as a response either to some limiting aspect of the technology preceding it or to demands from new application domains. In economic terms, successful information technology lowers the costs incurred when implementing a solution to a particular problem. Lowering development costs increases the range of information management problems that can be dealt with in a cost-effective way, leading to new waves of systems development.

As an example of changing development costs, consider that early mainframe information systems became popular because they could make numerical calculations with unprecedented speed and accuracy (by the standards of the time). This had a major impact on how businesses managed financial assets and how engineering projects were undertaken. It took another evolutionary step—the invention of higher level languages such as COBOL—to get to the point that the economic return from a production management system justified the development cost. It took still another innovation—the adoption of relational DBMS before customer management and human resource systems could be

automated. Today it is difficult to imagine how you could use COBOL and ISAM to build the number and variety of e-commerce Web sites that have sprung up over the last three years. And looking ahead, as we move into an era of ubiquitous, mobile computers and the growing importance of digital media, it is difficult to see how RDBMS technology can cope.

Early Information Systems

Pioneering information systems of the late 1960s and early 1970s were constrained by the capacity of the computers on which they ran. Hardware was relatively expensive, and by today’s standards, quite limited. Consequently, considerable human effort was devoted to optimizing programming code. Diverting effort from functionality to efficiency constrained the ambition of early developers, but the diversion of effort was necessary because it was the only way to develop an information system that ran with adequate performance. Information systems of this era were implemented as monolithic programs combining user-interface, data processing logic, and data management operations into a single executable on a single machine. Such programs intermingled low-level descriptions of data structure details, how records were laid out on disk, how records were interrelated, with user-interface management code. User-interface management code would dictate, for example, where an element from a disk record went on the screen. Maintaining and evolving these systems, even to perform tasks as simple as adding new data fields and indexes, required programmers to make changes involving most aspects of the system. Experience has shown that development is the least expensive stage of an information system’s life cycle. Programming and debugging take time and money, but by far the largest percentage of the total cost of an information system is incurred after it goes into production. As the business environment changes, there is considerable pressure to change systems with it. Changing a pre-relational information system, either to fix bugs or to make enhancements, is extremely difficult. Yet, there is always considerable pressure to do so. In addition, hardware constantly gets faster and cheaper. A second problem with these early systems is that as their complexity increased, it became increasingly important to detect errors in the design stage. The most pervasive design problem was redundancy, which occurs when storage structure designs record an item of information twice.

Different DBMS Architectures

There are three types of Architectures in the Database. They are as under

Hierarchical Model

The hierarchical model is the basis of the oldest database management systems, which grew out of early attempts to organize data for the U.S space program. Since these databases were ad-hoc solution to immediate problems, they were

created without the strong theoretical foundations that later systems had. Their designers were familiar with file organizations and data structures, and used these concepts to solve the immediate data representation problems of users. The hierarchical model uses the tree as its basic data structure. Nodes of the trees in the hierarchical model represent data records or record segments, which are the portions of the data records. Relationships are represented as links or pointers between nodes.

The Network Model

The network uses a network or plex structure, which is data structure consisting of nodes and branches. Unlike a tree, a plex structure allows a node to have more than one parent. The nodes of the network represent records of various types. Relationships between records are represented as links, which become pointers in the implementation.

Relational Model

The relational model was proposed by Codd in 1970 and continues to be the subject of much research. It is now widely used by both mainframe and microcomputers-based DBMSs because of its simplicity from user’s point of view and its power. The relation model uses the theory of relations from mathematics and adapts it for use in database theory. In relation model both entities and relationships are represented by relations which are physically represented as tables or two-dimensional arrays, and attributes as columns of those tables.

Before going into the detail of the Relational and Object Relational Database Management Systems we should have a knowledge of the Relational Design.

RDBMS

( Relational Database Management System )

In 1970 an IBM researcher named Ted Codd wrote a paper that described a new approach to the management of “large shared data banks.” In his paper Codd identifies two objectives for managing shared data. The first of these is data independence, which dictates that applications using a database be maintained independent of the physical details by which the database organizes itself. Second, he describes a series of rules to ensure that the shared data is consistent by eliminating any redundancy in the database’s design. Codd’s paper deliberately ignores any consideration of how his model might be implemented. He was attempting to define an abstraction of the problem of information management: a general framework for thinking about and working with information. The Relational Model Codd described had three parts: a data model,

a means of expressing operations in a high-level language, and a set of design principles that ensured the elimination of certain kinds of redundant data problems. Codd’s relational model views data as being stored in tables containing a variable number of rows (or records), each of which has a fixed number of columns. Something like a telephone directory or a registry of births, deaths, and marriages, is a good analogy for a table. Each entry contains different information but is identical in its form to all other entries of the same kind. The relational model also describes a number of logical operations that could be performed over the data. These operations included a means of getting a subset of rows (all names in the telephone directory with the surname “Brown”), a subset of columns (just the name and number), or data from a combination of tables (a person who is married to a person with a particular phone number). By borrowing mathematical techniques from predicate logic, Codd was able to derive a series of design principles that could be used to guarantee that the database’s structure was free of the kinds of redundancy so problematic in other systems. Greatly expanded by later writers, these ideas formed the basis of the theory of normal forms. Properly applied, the system of normal form rules can ensure that the database’s logical design is free of redundancy and, therefore, any possibility of anomalies in the data.

Elements of a Relational Database

There are four basic elements which are necessary for the Relational Database. Those are given below.

Relations

Attributes

Domains

The Relational Operators.

Codd describe a data storage system that possessed three characteristics that were sorely needed at that time:

1. Logical data independence: - This desirable characteristic means that changes made to an attribute (column), for example, an increase or decrease in size have no perceivable effect on other attributes for the same relation (table). Logical data independence was attractive to data processing organizations because it could substantially reduce the cost of software maintenance.

2. Referential and data integrity: - Unlike other database systems, a relational database would relieve the application software of the burden of enforcing integrity constraints. Codd described two characteristics that would be maintained by a relational database, referential integrity and data integrity.

3. Ad hoc query: - This characteristic would enable the user to tell the database which data to retrieve without indicating how to accomplish the task.

Some time passed before a commercial product actually implemented some of the relational database features that Codd described. Today many vendors sell Relational Database Management Systems; some of the more well-known vendors are Oracle, Sybase, IBM, Informix, Microsoft, and Computer Associates. Of these vendors, Oracle has emerged as the leader. The Oracle RDBMS engine has been ported to more platforms than any other database product. Because of Oracle's multiplatform support, many application software vendors have chosen Oracle as their database platform. And now, Oracle Corporation has ported the same RDBMS engine to the desktop environment with its release of Personal Oracle.

Relational Database Implementation

Today's relational databases implement a number of extremely useful features that Codd did not mention in his original article. However, as of this writing, no commercially available database is a full implementation of Codd's rules for relational databases.

During the early 1970s, several parallel research projects set out to implement a working RDBMS. This turned out to be very hard. It wasn’t until the late 1980s that RDBMS products worked acceptably in the kinds of high-end, online transactions processing applications served so well by earlier technologies. Despite the technical shortcomings RDBMS technology exploded in popularity because even the earliest products made it cheaper, quicker, and easier to build information systems. For an increasing number of applications, economics favored spending more on hardware and less on people. RDBMS technology made it possible to develop information systems that, while desirable from a business management point of view, had been deemed too expensive. To emphasize the difference between the relational and pre-relational approaches, a four hundred line C program can be replaced by the SQL-92 expression in Listing below.

CREATE TABLE Employees (

Name VARCHAR(128),

DOB DATE,

Salary DECIMAL(10,2),

Address VARCHAR(128)

);

The code in list implements considerably more functionality than a C program because RDBMS provide transactional guarantees for data changes. They automatically employ locking, logging, and backup and recovery facilities to

guarantee the integrity of the data they store. Also, RDBMS provide elaborate security features. Different tables in the same database can be made accessible only by different groups of users. All of this built-in functionality means that developers focus more of their effort on their system’s functionality and less on complex technical details.

Relational Database Theory

With relational database, systems can be considered complete unless it discusses the basic concepts of relational database theory. In a nutshell:

Every entity has a set of attributes that uniquely define an instance of that entity.

No part of the primary key can be null.

Every entity has a set of attributes that uniquely identify each row in that entity.

A table cannot have duplicate rows.

Tables are related to one another.

The order of the rows in a table is arbitrary.

The order of the columns in a table is arbitrary.

Architecture of RDBMS

There are three standard levels for Relational Database management system

Conceptual level

External level

Internal level

External Level

The external level consist of many different external views or external models of the database. Each user has a model of the real world represented in a form that is suitable for that user. A particular user interacts with only certain entities in the real world and is interested in only some of their attributes and relationships. Therefore, that user’s view contain only information about those aspects of the real world.

Conceptual Level

The conceptual level consists of base tables, which are physical stored tables. These tables are created by Database Administrator using a CREATE Table

command. Once the table is created the DBA can create “VIEW” for Users. A view may be a subset of single base table or may be created by combining base tables or performing other Operations on them.

Internal Level

The level which covers the physical implementation of the database. It includes the data structure and file organization used to store data on physical storage devices. The internal schema, written in DDL, is a complete description of the internal model. It includes such items as how data is represented, how records are sequenced, what index exist, what pointers exist. An internal record is a single stored record. It is the unit that is passed up to the internal level.

Primary Key

Every entity has a set of attributes that uniquely define an instance of that entity. This set of attributes is called the primary key. The primary key may be composed of a single attribute

No Part of the Primary Key Is Null

A basic tenet of relational theory is that no part of the primary key can be null. If you think about that idea for a moment, it seems intuitive: The primary key must uniquely identify each row in an entity; therefore, if the primary key (or a part of it) is null, it wouldn't be able to identify anything.

Data Integrity

According to relational theory, every entity has a set of attributes that uniquely identify each row in that entity. Relational theory also states that no duplicate rows can exist in a table.

Referential Integrity

Tables are related to one another through foreign keys. A foreign key is one table column for which the set of possible values is found in the primary key of a second table. Referential integrity is achieved when the set of values in a foreign key column is restricted to the primary key that it references or to the null value. Once the database designer declares primary and foreign keys, enforcing data and referential integrity is the responsibility of the DBMS.

Five Normal Forms in Relational Database Theory

INTRODUCTION

The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if one is not using a relational database system. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality, and also to make them easier to understand. Our presentation

conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date.

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an un-normalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

FIRST NORMAL FORM

First normal form deals with the "shape" of a record type. Under first normal form, all occurrences of a record type must contain the same number of fields. First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory doesn't deal with records having a variable number of fields.

SECOND AND THIRD NORMAL FORMS

Second and third normal forms deal with the relationship between non-key and key fields. Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form. We deal now only with "single-valued" facts. The fact could be a one-to-many relationship, such as the department of an employee, or a one-to-one relationship, such as the spouse of an employee. Thus the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields, and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

Second Normal Form

Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

---------------------------------------------------| PART | WAREHOUSE | QUANTITY | WAREHOUSE-ADDRESS |====================-------------------------------

The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:

The warehouse address is repeated in every record that refers to a part stored in that warehouse. If the address of the warehouse changes, every record referring to a part stored in that warehouse

must be updated. Because of the redundancy, the data might become inconsistent, with different records showing

different addresses for the same warehouse. If at some point in time there are no parts stored in the warehouse, there may be no record in which

to keep the warehouse's address.

To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

------------------------------- --------------------------------- | PART | WAREHOUSE | QUANTITY | | WAREHOUSE | WAREHOUSE-ADDRESS |====================----------- =============--------------------

When a data design is changed in this way, replacing unnormalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus a set of records may be normalized with respect to second normal form but not with respect to third. The normalized design enhances the integrity of the data, by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the unnormalized form, the application searches one record type. With the normalized design, the application has to search two record types, and connect the appropriate pairs.

Third Normal Form

Third normal form is violated when a non-key field is a fact about another non-key field, as in

------------------------------------| EMPLOYEE | DEPARTMENT | LOCATION |============------------------------

The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form:

The department's location is repeated in the record of every employee assigned to that department. If the location of the department changes, every such record must be updated. Because of the redundancy, the data might become inconsistent, with different records showing

different locations for the same department. If a department has no employees, there may be no record in which to keep the department's

location.

To satisfy third normal form, the record shown above should be decomposed into the two records:

------------------------- -------------------------| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |============------------- ==============-----------

To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

Functional Dependencies

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X-value but different Y-

values. That is, a given X-value must always occur with the same Y-value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there can't be two records having the same X value.

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we don't provide unique identifiers for people, then there will not be a functional dependency in the data:

----------------------------------------------| PERSON | ADDRESS |-------------+--------------------------------| John Smith | 123 Main St., New York || John Smith | 321 Center St., San Francisco |----------------------------------------------

Although each person has a unique address, a given name can appear with several different addresses. Hence we do not have a functional dependency corresponding to our single-valued fact.

Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case the same person appears to be living at two different addresses, again precluding a functional dependency.

---------------------------------------| PERSON | ADDRESS |-------------+-------------------------| John Smith | 123 Main St., New York || John Smith | 123 Main Street, NYC |---------------------------------------

We are not defending the use of non-unique or non-singular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

------------------------------------------------------| EMPLOYEE | FATHER | FATHER'S-ADDRESS ||============------------+----------------------------| Art Smith | John Smith | 123 Main St., New York || Bob Smith | John Smith | 123 Main Street, NYC || Cal Smith | John Smith | 321 Center St., San Francisco |------------------------------------------------------

However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence no violation of third normal form.

FOURTH AND FIFTH NORMAL FORMS

Fourth and fifth normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills, and a skill may belong to several employees.

Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples to follow.

Fourth Normal Form

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.

The term "independent" will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

-------------------------------| EMPLOYEE | SKILL | LANGUAGE |===============================

Instead, they should be represented in the two records

-------------------- -----------------------| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |==================== =======================

Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record:

(1) A disjoint format, in which a record contains either a skill or a language, but not both:

-------------------------------| EMPLOYEE | SKILL | LANGUAGE ||----------+-------+----------|| Smith | cook | | | Smith | type | |

| Smith | | French || Smith | | German || Smith | | Greek |-------------------------------

This is not much different from maintaining two separate record types. (We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, or the field is not applicable to this employee, or the data is unknown, or, as in this case, the data may be found in another record.)

(2) A random mix, with three variations:

(a) Minimal number of records, with repetitions:

-------------------------------| EMPLOYEE | SKILL | LANGUAGE ||----------+-------+----------|| Smith | cook | French | | Smith | type | German || Smith | type | Greek |-------------------------------

(b) Minimal number of records, with null values:

-------------------------------| EMPLOYEE | SKILL | LANGUAGE ||----------+-------+----------|| Smith | cook | French | | Smith | type | German || Smith | | Greek |-------------------------------

(c) Unrestricted:

-------------------------------| EMPLOYEE | SKILL | LANGUAGE ||----------+-------+----------|| Smith | cook | French | | Smith | type | || Smith | | German || Smith | type | Greek |-------------------------------

(3) A "cross-product" form, where for each employee, there must be a record for every possible pairing of one of his skills with one of his languages:

-------------------------------| EMPLOYEE | SKILL | LANGUAGE ||----------+-------+----------|

| Smith | cook | French || Smith | cook | German || Smith | cook | Greek || Smith | type | French || Smith | type | German || Smith | type | Greek |-------------------------------

Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy:

If there are repetitions, then updates have to be done in multiple records, and they could become inconsistent.

Insertion of a new skill may involve looking for a record with a blank skill, or inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.

Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this doesn't leave two records with the same language and a blank skill), or deleting one or more records, coupled with a check that the last mention of some language hasn't also been deleted.

Fourth normal form minimizes such update problems.

1 Independence

We mentioned independent multi-valued facts earlier, and we now illustrate what we mean in terms of the example. The two many-to-many relationships, employee: skill and employee: language, are "independent" in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That's precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee could only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type in French, German, and Greek. Then the pairings of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct:

-------------------------------| EMPLOYEE | SKILL | LANGUAGE ||----------+-------+----------|| Smith | cook | French || Smith | type | French || Smith | type | German || Smith | type | Greek |-------------------------------

Thus the employee: skill and employee: language relationships are no longer independent. These records do not violate fourth normal form. When there is interdependence among the relationships, then it is acceptable to represent them in a single record.

2 Multivalued Dependencies

For readers interested in pursuing the technical background of fourth normal form a bit further, we mention that fourth normal form is defined in terms of multivalued dependencies, which correspond to our independent multi-valued facts. Multivalued dependencies, in turn, are defined essentially as relationships which accept the "cross-product" maintenance policy mentioned above. That is, for our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence: since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, that is, only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.

We should also point out that multivalued dependencies and fourth normal form apply as well to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:

An employee uses certain skills on certain projects. An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where "EP" represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E,P,S and E,P,L, respectively, would satisfy fourth normal form.

Fifth Normal Form

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but illustrate the central concept with a commonly used example, namely one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:

-----------------------------| AGENT | COMPANY | PRODUCT ||-------+---------+---------|| Smith | Ford | car | | Smith | GM | truck | -----------------------------

This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus we need the combination of three fields to know which combinations are valid and which are not.

But suppose that a certain rule was in effect: if an agent sells a certain product, and he represents a company making that product, then he sells that product for that company.

-----------------------------| AGENT | COMPANY | PRODUCT ||-------+---------+---------|| Smith | Ford | car | | Smith | Ford | truck |

| Smith | GM | car | | Smith | GM | truck | | Jones | Ford | car | -----------------------------

In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields:

------------------- --------------------- ------------------- | AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT ||-------+---------| |---------+---------| |------+---------|| Smith | Ford | | Ford | car | | Smith | car || Smith | GM | | Ford | truck | | Smith | truck || Jones | Ford | | GM | car | | Jones | car |------------------- | GM | truck | ------------------- ---------------------

These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the un-normalized form it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the un-normalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x+y new records to the normalized form, but xy new records to the un-normalized form.

It should be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we can't determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.

The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.

-----------------------------| AGENT | COMPANY | PRODUCT ||-------+---------+---------|| Smith | Ford | car |

| Smith | Ford | truck | | Smith | GM | car | | Smith | GM | truck | | Jones | Ford | car | | Jones | Ford | truck | | Brown | Ford | car | | Brown | GM | car | | Brown | Toyota | car | | Brown | Toyota | bus | ------------------------------------------------ --------------------- ------------------- | AGENT | COMPANY | | COMPANY | PRODUCT | | AGENT | PRODUCT ||-------+---------| |---------+---------| |-------+---------|| Smith | Ford | | Ford | car | | Smith | car | Fifth| Smith | GM | | Ford | truck | | Smith | truck | Normal| Jones | Ford | | GM | car | | Jones | car | Form| Brown | Ford | | GM | truck | | Jones | truck || Brown | GM | | Toyota | car | | Brown | car || Brown | Toyota | | Toyota | bus | | Brown | bus |------------------- --------------------- -------------------

Observe that:

Jones sells cars and GM makes cars, but Jones does not represent GM. Brown represents Ford and Ford makes trucks, but Brown does not sell trucks. Brown represents Ford and Brown sells buses, but Ford does not make buses.

Fourth and fifth normal forms both deal with combinations of multivalued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multivalued facts, it only recognizes them in pair wise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of this violating fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pairwise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence an information-preserving pairwise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.

UNAVOIDABLE REDUNDANCIES

Normalization certainly doesn't remove all redundancies. Certain redundancies seem to be unavoidable, particularly when several multivalued facts are dependent rather than independent. In the example shown, it seems unavoidable that we record the fact that "Smith can type" several times. Also, when the rule about agents, companies, and products is not in effect, it seems unavoidable that we record the fact that "Smith sells cars" several times.

INTER-RECORD REDUNDANCY

The normal forms discussed here deal only with redundancies occurring within a single record type. Fifth normal form is considered to be the "ultimate" normal form with respect to such redundancies.

Other redundancies can occur across multiple record types. For the example concerning employees, departments, and locations, the following records are in third normal form in spite of the obvious redundancy:

------------------------- -------------------------| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |============------------- ==============----------------------------------| EMPLOYEE | LOCATION |============-----------

In fact, two copies of the same record type would constitute the ultimate in this kind of undetected redundancy.

Inter-record redundancy has been recognized for some time , and has recently been addressed in terms of normal forms and normalization .

CONCLUSION

While we have tried to present the normal forms in a simple and understandable way, we are by no means suggesting that the data design process is correspondingly simple. The design process involves many complexities which are quite beyond the scope of this paper. In the first place, an initial set of data elements and records has to be developed, as candidates for normalization. Then the factors affecting normalization have to be assessed:

Single-valued vs. multi-valued facts. Dependency on the entire key. Independent vs. dependent facts. The presence of mutual constraints. The presence of non-unique or non-singular representations.

And, finally, the desirability of normalization has to be assessed, in terms of its performance impact on retrieval applications.

RDBMS & The Oracle Product Line

As the world's leading vendor of relational database software, Oracle Corporation supports its flagship product, the Oracle RDBMS, on more than 90 platforms. The release of Personal Oracle for Windows extends Oracle's reach to the most popular desktop operating system; Microsoft Windows.

The Oracle RDBMS is available in the following three configurations:

The Oracle Universal Server can support many users on highly scaleable platforms such as Sun, HP, Pyramid, and Sequent. The various options that are available with this configuration include the distributed option, by which several Oracle databases on separate computers can function as a single logical database. The Oracle Enterprise Server is available for a wide variety of operating systems and hardware configurations. Oracle Web Server ;an integrated system for dynamically generating HTML output from the content of an Oracle database; is also included with the Universal Server.

The Oracle Workgroup Server is designed for workgroups and is available on NetWare, Windows NT, SCO UNIX, and UnixWare. The Oracle Workgroup Server is a cost-effective and low-maintenance solution for supporting small groups of users. Oracle Web Server is also available as an option with the Oracle Workgroup Server.

Personal Oracle is a Windows-based version of the Oracle database engine that offers the same functionality that exists in the Oracle Universal Server and the Oracle Workgroup Server. Even though Personal Oracle cannot function as a database server by supporting multiple users, it still provides an excellent environment for experimentation and prototyping.

We are discussed here generally about personal oracle As in case of relational database management system

Oracle with Structured Query Language (SQL)

You communicate with Personal Oracle through Oracle's version of the Structured Query Language (SQL, usually pronounced sequel). SQL is a nonprocedural language; unlike C or COBOL in which you must describe exactly how to access and manipulate data, SQL specifies what to do. Internally, Oracle determines how to perform the request. SQL exists as an ANSI standard as well as an industry standard. Oracle's implementation of SQL adheres to Level 2 of the ANSI X3.135-1989/ISO 9075-1989 standard with full implementation of the Integrity Enhancement Feature. Oracle (as well as other database vendors) provides many extensions to ANSI SQL.

In addition, Oracle's implementation of SQL adheres to the U.S. government standard as described in the Federal Information Processing Standard Publication (FIPS PUB) 127, entitled Database Language SQL.

Oracle Data Types

The Oracle DBMS provides a number of different data types for the storage of the different forms of data in a manner that is most suitable to the manipulations that are likely to be performed. As part of the Data Modeling process it is important that the most appropriate data types are identified. Where there is any doubt (e.g. storing a year as a number or as a date), the data type chosen should reflect the way the data will be used most often. It is possible to inter-convert between data types, but this will reduce the efficiency of queries. The use of these data types in table design is "Manipulating Oracle Tables".

The available Oracle data types the most widely used are these:

VARCHAR2

CHAR

NUMBER

DATE

http://www.geo.ed.ac.uk/~kwm/practicals/dmds/ch4.html#s4p4




Long

VARCHAR2

This data type should be used for any columns which may contain characters. This includes alphabetic letters, together with _, -, !, ?, + etc. and numbers as a character representation (i.e. they look like numbers when displayed, but they cannot be subjected to arithmetic manipulation. Similarly an attempt to do a greater than (>), or less than (<) operation will not have the correct arithmetic result- VARCHAR2 values are compared character by character up to the first character that differs. Whichever value has the "greater" character in that position is considered "greater". Characters are compared via the ASCII character set with the largest ASCII value (i.e. 255) is considered greatest.) VARCHAR2 columns may be up to 2000 characters wide.

CHAR

This data type is now effectively replaced by VARCHAR2 - CHAR should no longer be used, although is still recognized and valid. The CHAR data type is fixed length up to a maximum of 255 characters, whereas VARCHAR2 is of variable length up to 2000 characters. The variable length of VARCHAR2 gives it significant storage and performance advantages over CHAR.

NUMBER

Any kind of number; positive, negative, integer or real. Up to 22 digits may be entered. For comparison, a larger value is considered greater than a smaller value, with all negative values smaller than all positive values.

DATE

A special data type with some of the characteristics of both Character and Number Data-types. It is used for the storage of date and time information. The operators =, > or <>

WHERE HIREDATE = '01-DEC-95'

the date value, on the right, must appear in quotes. For comparison, a later date is considered greater than an earlier date. Oracle dates must lie in the range 1st January, 4712 BC to 31st December, 4712 AD.

The standard Oracle date format is:

Digit Day followed by

Letter Month and

Digit Year

All should be separated by hyphens e.g. '01-JAN-95'

LONG

Can hold strings of characters up to 2 gigabytes long. Only one column of LONG data type can be defined per table. Binary data can only be inserted into these using the programming language interface (an example of binary data storage in a LONG data type would be storing a picture in an oracle table).

Importantly, there are significant restrictions on the use of character information stored in LONG data types. While the INSERT and UPDATE statements can be used to insert and modify LONG data types, and the SELECT statement to retrieve this data, no functions can be used to manipulate a retrieved a LONG data type and a LONG column can never be used in a WHERE clause.

When retrieving LONG data types the display length of the column is set to 80 characters, and anything stored in the column beyond this will be truncated. The following command is required to extend the display length of a LONG field, within SQL*Plus:

SQL> SET LONG xxxx

The value xxxx should be replaced with a number representing thelongest value likely to be stored in the LONG column.

Above are the general use data types in oracle, after similar with the data Types of oracle the next step in sense of RDBMS is that to clear knowledge of Creating the tables In Oracle

DBMS Vs RDBMS

The characteristic that differentiates a DBMS from an RDBMS is that the RDBMS provides a set-oriented database language. For most RDBMSs, this set-oriented database language is SQL. Set oriented means that SQL processes sets of data in groups. Two standards organizations, the American National Standards Institute (ANSI) and the International Standards Organization (ISO), currently promote SQL standards to industry. The ANSI-92 standard is the standard for the SQL used throughout this book. Although these standard-making bodies prepare standards for database system designers to follow, all database products differ from the ANSI standard to some degree. In addition, most systems provide some proprietary extensions to SQL that extend the language into a true procedural language.

Problems with RDBMS

Starting in the late 1980s, several deficiencies in relational DBMS products began receiving a lot of attention. The first deficiency is that the dominant relational language, SQL-92, is limiting in several important respects. For instance, SQL-92 supports a restricted set of built-in types that accommodate only numbers and strings, but many database applications began to include deal with complex objects such as

geographic points, text, and digital signal data. A related problem concerns how this data is used. Conceptually simple questions involving complex data structures turn into lengthy SQL-92 queries.

The second deficiency is that the relational model suffers from certain structural shortcomings. Relational tables are flat and do not provide good support for nested structures, such as sets and arrays. Also, certain kinds of relationships, such as sub typing, between database objects are hard to represent in the model. (Subtyping occurs when we say that one kind of thing, such as a SalesPerson, is a subtype of another kid of thing, such as an Employee.) SQL-92 supports only independent tables of rows and columns. The third deficiency is that RDBMS technology did not take advantage of object-oriented (OO) approaches to software engineering which have gained widespread acceptance in industry. OO techniques reduce development costs and improve information system quality by adopting an object-centric view of software development. This involves integrating the data and behavior of a real-world entity into a single software module or component. A complex data structure or algorithmically sophisticated operation can be hidden behind a set of interfaces. This allows another programmer to make use of the complex functionality without having to understand how it is implemented. The relational model did a pretty good job handling most information management problems. But for an emerging class of problems RDBMS technology could be improved upon.

Object-Oriented DBMS

Object-Oriented Database Management Systems (OODBMS) are an extension of OO programming language techniques into the field of persistent data management. For many applications, the performance, flexibility, and development cost of OODBMS are significantly better than RDBMS or ORDBMS. The chief advantage of OODBMS lies in the way they can achieve a tighter integration between OO languages and the DBMS. Indeed, the main standards body for OODBMS, the Object Database Management Group (ODMG) defines an OODBMS as a system that integrates database capabilities with object-oriented programming language capabilities. The idea behind this is that so far as an application developer is concerned, it would be useful to ignore not only questions of how an object is implemented behind its interface, but also how it is stored and retrieved.

All developers have to do is implement their application using their favorite OO programming language, such as C++, Smalltalk, or Java, and the OODBMS takes care of data caching, concurrency control, and disk storage. In addition to this seamless integration, OODBMS possess a number of interesting and useful features derived mainly from the object model. In order to solve the finite type system problem that constrains SQL-92, most OODBMS feature an extensible type system. Using this technique, an OODBMS can take the complex objects that are part of the application and store them directly. An OODBMS can be used to invoke methods on these objects, either through a direct call to the object or

through a query interface. And finally, many of the structural deficiencies in SQL-92 are overcome by the use of OO ideas such as inheritance and allowing sets and arrays. OODBMS products saw a surge of academic and commercial interest in the early 1990s, and today the annual global market for OODBMS products runs at about $50 million per year. In many application domains, most notably computer-aided design or manufacturing (CAD/CAM), the expense of building a monolithic system to manage both database and application is balanced by the kinds of performance such systems deliver.

Problems with OODBMS

Regrettably, much of the considerable energy of the OODBMS community has been expended relearning the lessons of twenty years ago. First, OODBMS vendors have rediscovered the difficulties of tying database design too closely to application design. Maintaining and evolving an OODBMS-based information system is an arduous undertaking. Second, they relearned that declarative languages such as SQL-92 bring such tremendous productivity gains that organizations will pay for the additional computational resources they require. You can always buy hardware, but not time. Third, they re-discovered the fact that a lack of a standard data model leads to design errors and inconsistencies. In spite of these shortcomings OODBMS technology provides effective solutions to a range of data management problems. Many ideas pioneered by OODBMS have proven themselves to be very useful and are also found in ORDBMS. Object-relational systems include features such as complex object extensibility, encapsulation, inheritance, and better interfaces to OO languages.

ORDBMS

( Object Relational Database Management System )

ORDBMS Evolution from RDBMS

One of the most popular sort algorithms is a called insertion sort. Insertion sort is an O (N2) algorithm. Roughly speaking, the time it takes to sort an array of records increases with the square of the number of records involved, so it should only be used for record sets containing less than about 25 rows. But insertion sort is extremely efficient when the input is almost in sorted order, such as when you are sorting the result of an index scan. List presents an implementation of the basic insertion sort algorithm for an array of integers.

InsertSort( integer arTypeInput[] )

{ integer nTypeTemp;

integer InSizeArray, nOuter, nInner;

nSizeArray = arTypeInput[].getSize();

for ( nOuter = 2; nOuter <= nSizeArray; nOuter ++)

{

vTypeTemp

DBMS

Documents

Transcript of DBMS