MC0077 Fall 2012 Full Assignment


Master of Computer Applications, Sikkim Manipal University, Directorate of Distance Education

Assignment

Name :

Registration No. :

Learning Center :

Learning Center Code :

Course : MCA

Subject : MC0077 – Advanced Database Systems

Semester : IV Semester

Module No. :

Date of submission :

Marks awarded :

____________________            ____________________            ____________________
Signature of Coordinator        Signature of Center             Signature of Evaluator


1. List and explain various Normal Forms. How does BCNF differ from Third Normal Form and Fourth Normal Form?

Answer 1: The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here in terms that do not require an understanding of relational theory. The design guidelines are meaningful even if a relational database system is not used. We present the guidelines without referring to the concepts of the relational model in order to emphasize their generality and to make them easier to understand. Our presentation conveys an intuitive sense of the intended constraints on record design, although in its informality it may be imprecise in some technical details. A comprehensive treatment of the subject is provided by Date.

The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance trade-offs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an un-normalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.

1. FIRST NORMAL FORM

First normal form deals with the "shape" of a record type. Under first normal form, all occurrences of a record type must contain the same number of fields. First normal form excludes variable repeating fields and groups. This is not so much a design guideline as a matter of definition. Relational database theory does not deal with records having a variable number of fields.

2. SECOND AND THIRD NORMAL FORMS

Second and third normal forms deal with the relationship between non-key and key fields. Under second and third normal forms, a non-key field must provide a fact about the key, the whole key, and nothing but the key. In addition, the record must satisfy first normal form.

We deal now only with "single-valued" facts. A single-valued fact could be a one-to-many relationship such as the department of an employee or a one-to-one relationship such as the spouse of an employee. Thus, the phrase "Y is a fact about X" signifies a one-to-one or one-to-many relationship between Y and X. In the general case, Y might consist of one or more fields and so might X. In the following example, QUANTITY is a fact about the combination of PART and WAREHOUSE.

2.1 SECOND NORMAL FORM

Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following inventory record:

PART   WAREHOUSE   QUANTITY   WAREHOUSE-ADDRESS
(key: PART + WAREHOUSE)

The key here consists of the PART and WAREHOUSE fields together, but WAREHOUSE-ADDRESS is a fact about the WAREHOUSE alone. The basic problems with this design are:

The warehouse address is repeated in every record that refers to a part stored in that warehouse.

If the address of the warehouse changes, every record referring to a part stored in that warehouse must be updated.

Because of the redundancy, the data might become inconsistent, with different records showing different addresses for the same warehouse.

If at some point in time there are no parts stored in the warehouse, there may be no record in which to keep the warehouse's address.

Page 4: MC0077 Fall 2012 Full Assignment

Master of Computer Applications Sikkim Manipal UniversityDirectorate of Distance Education

To satisfy second normal form, the record shown above should be decomposed into (replaced by) the two records:

PART   WAREHOUSE   QUANTITY
(key: PART + WAREHOUSE)

WAREHOUSE   WAREHOUSE-ADDRESS
(key: WAREHOUSE)

When a data design is changed in this way, i.e., replacing un-normalized records with normalized records, the process is referred to as normalization. The term "normalization" is sometimes used relative to a particular normal form. Thus, a set of records may be normalized with respect to second normal form but not with respect to third.

The normalized design enhances the integrity of the data by minimizing redundancy and inconsistency, but at some possible performance cost for certain retrieval applications. Consider an application that wants the addresses of all warehouses stocking a certain part. In the un-normalized form, the application searches one record type. With the normalized design, the application has to search two record types and connect the appropriate pairs.

2.2 THIRD NORMAL FORM

Third normal form is violated when a non-key field is a fact about another non-key field, as in

EMPLOYEE   DEPARTMENT   LOCATION
(key: EMPLOYEE)

The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT - in addition to being a fact about the EMPLOYEE. The problems with this design are the same as those caused by violations of second normal form.

The department's location is repeated in the record of every employee assigned to that department.

If the location of the department changes, every such record must be updated.

Because of the redundancy, the data might become inconsistent, e.g., different records showing different locations for the same department.

If a department has no employees, there may be no record in which to keep the department's location.

Page 5: MC0077 Fall 2012 Full Assignment

Master of Computer Applications Sikkim Manipal UniversityDirectorate of Distance Education

To satisfy third normal form, the record shown above should be decomposed into the two records:

EMPLOYEE   DEPARTMENT
(key: EMPLOYEE)

DEPARTMENT   LOCATION
(key: DEPARTMENT)

To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else.

2.3 FUNCTIONAL DEPENDENCIES

In relational database theory, second and third normal forms are defined in terms of functional dependencies, which correspond approximately to our single-valued facts. A field Y is "functionally dependent" on a field (or fields) X if it is invalid to have two records with the same X value but different Y values. That is, a given X value must always occur with the same Y value. When X is a key, then all fields are by definition functionally dependent on X in a trivial way, since there cannot be two records having the same X value.
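To make the definition concrete, here is a minimal sketch, assuming records are held as Python dicts; the helper and field names are illustrative, not part of the text:

```python
def functionally_dependent(records, x_fields, y_field):
    """True if y_field is functionally dependent on x_fields:
    records that agree on the X value never disagree on the Y value."""
    seen = {}
    for rec in records:
        x_value = tuple(rec[f] for f in x_fields)
        if x_value in seen and seen[x_value] != rec[y_field]:
            return False  # same X, different Y: dependency violated
        seen[x_value] = rec[y_field]
    return True

# The earlier inventory example: WAREHOUSE-ADDRESS is a fact about
# WAREHOUSE alone, while QUANTITY depends on the full composite key.
inventory = [
    {"PART": "bolt", "WAREHOUSE": "W1", "QUANTITY": 100, "ADDRESS": "1 Dock Rd"},
    {"PART": "nut",  "WAREHOUSE": "W1", "QUANTITY": 50,  "ADDRESS": "1 Dock Rd"},
    {"PART": "bolt", "WAREHOUSE": "W2", "QUANTITY": 20,  "ADDRESS": "9 Pier St"},
]
print(functionally_dependent(inventory, ["WAREHOUSE"], "ADDRESS"))   # True
print(functionally_dependent(inventory, ["WAREHOUSE"], "QUANTITY"))  # False
```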

There is a slight technical difference between functional dependencies and single-valued facts as we have presented them. Functional dependencies only exist when the things involved have unique and singular identifiers (representations). For example, suppose a person's address is a single-valued fact, i.e., a person has only one address. If we do not provide unique identifiers for people, then there will not be a functional dependency in the data.

PERSON       ADDRESS
John Smith   123 Main St., New York
John Smith   321 Center St., San Francisco

Although each person has a unique address, a given name can appear with several different addresses. Hence, we do not have a functional dependency corresponding to our single-valued fact. Similarly, the address has to be spelled identically in each occurrence in order to have a functional dependency. In the following case, the same person appears to be living at two different addresses, again precluding a functional dependency.

PERSON       ADDRESS
John Smith   123 Main St., New York
John Smith   123 Main Street, NYC

We are not defending the use of non-unique or nonsingular representations. Such practices often lead to data maintenance problems of their own. We do wish to point out, however, that functional dependencies and the various normal forms are really only defined for situations in which there are unique and singular identifiers. Thus, the design guidelines as we present them are a bit stronger than those implied by the formal definitions of the normal forms.

For instance, we as designers know that in the following example there is a single-valued fact about a non-key field, and hence the design is susceptible to all the update anomalies mentioned earlier.

EMPLOYEE    FATHER       FATHER'S-ADDRESS
Art Smith   John Smith   123 Main St., New York
Bob Smith   John Smith   123 Main Street, NYC
Cal Smith   John Smith   321 Center St., San Francisco

However, in formal terms, there is no functional dependency here between FATHER'S-ADDRESS and FATHER, and hence, no violation of third normal form.

3. FOURTH AND FIFTH NORMAL FORMS

Fourth and fifth normal forms deal with multi-valued facts. A multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). By "many-to-many" we mean that an employee may have several skills and/or a skill may belong to several employees. Note that we look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father.

In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key, as suggested by the examples that follow.

3.1 FOURTH NORMAL FORM

Page 7: MC0077 Fall 2012 Full Assignment

Master of Computer Applications Sikkim Manipal UniversityDirectorate of Distance Education

Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form. The term "independent" will be discussed after considering an example.

Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as

EMPLOYEE   SKILL   LANGUAGE
(key: all three fields)

Instead, they should be represented in the two records

EMPLOYEE   SKILL
(key: EMPLOYEE + SKILL)

EMPLOYEE   LANGUAGE
(key: EMPLOYEE + LANGUAGE)

Note that other fields, not involving multi-valued facts, are permitted to occur in the record, as in the case of the QUANTITY field in the earlier PART/WAREHOUSE example.

The main problem with violating fourth normal form is that it leads to uncertainties in the maintenance policies. Several policies are possible for maintaining two independent multi-valued facts in one record.

(1) A disjoint format, in which a record contains either a skill or a language, but not both.

EMPLOYEE   SKILL   LANGUAGE
Smith      cook
Smith      type
Smith              French
Smith              German
Smith              Greek

This is not much different from maintaining two separate record types. We note in passing that such a format also leads to ambiguities regarding the meanings of blank fields. A blank SKILL could mean the person has no skill, that the field is not applicable to this employee, that the data is unknown, or, as in this case, that the data may be found in another record.

(2) A random mix, with three variations

(a) Minimal number of records with repetitions.

EMPLOYEE   SKILL   LANGUAGE
Smith      cook    French
Smith      type    German
Smith      type    Greek

(b) Minimal number of records, with null values.

EMPLOYEE   SKILL   LANGUAGE
Smith      cook    French
Smith      type    German
Smith              Greek

(c) Unrestricted.

EMPLOYEE   SKILL   LANGUAGE
Smith      cook    French
Smith      type
Smith              German
Smith      type    Greek

(3) A "cross-product" form where, for each employee, there must be a record for every possible pairing of one of his skills with one of his languages.

EMPLOYEE   SKILL   LANGUAGE
Smith      cook    French
Smith      cook    German
Smith      cook    Greek
Smith      type    French
Smith      type    German
Smith      type    Greek

Other problems caused by violating fourth normal form are similar in spirit to those mentioned earlier for violations of second or third normal form. They take different variations depending on the chosen maintenance policy.

If there are repetitions, then updates have to be done in multiple records, and the records could become inconsistent.


Insertion of a new skill may involve looking for a record with a blank skill, inserting a new record with a possibly blank language, or inserting multiple records pairing the new skill with some or all of the languages.

Deletion of a skill may involve blanking out the skill field in one or more records (perhaps with a check that this does not leave two records with the same language and a blank skill) or deleting one or more records, coupled with a check that the last mention of some language has not been deleted also.

Fourth normal form minimizes such update problems.

3.1.1 INDEPENDENCE

We mentioned independent multi-valued facts earlier, and we now illustrate what we mean by that term. The two many-to-many relationships, employee:skill and employee:language, are independent in that there is no direct connection between skills and languages. There is only an indirect connection because they belong to some common employee. That is, it does not matter which skill is paired with which language in a record; the pairing does not convey any information. That is precisely why all the maintenance policies mentioned earlier can be allowed.

In contrast, suppose that an employee can only exercise certain skills in certain languages. Perhaps Smith can cook French cuisine only, but can type French, German, and Greek. Then the pairing of skills and languages becomes meaningful, and there is no longer an ambiguity of maintenance policies. In the present case, only the following form is correct.

EMPLOYEE   SKILL   LANGUAGE
Smith      cook    French
Smith      type    French
Smith      type    German
Smith      type    Greek

Thus, the employee:skill and employee:language relationships are no longer independent. These records do not violate fourth normal form. When there is an interdependence among the relationships, it is acceptable to represent them in a single record.

3.1.2 MULTIVALUED DEPENDENCIES

Fourth normal form is defined in terms of multi-valued dependencies that correspond to our independent multi-valued facts. Multi-valued dependencies, in turn, are defined essentially as relationships that accept the "cross-product" maintenance policy mentioned above. For our example, every one of an employee's skills must appear paired with every one of his languages. It may or may not be obvious to the reader that this is equivalent to our notion of independence; since every possible pairing must be present, there is no "information" in the pairings. Such pairings convey information only if some of them can be absent, i.e., only if it is possible that some employee cannot perform some skill in some language. If all pairings are always present, then the relationships are really independent.
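To make the cross-product criterion concrete, here is a small sketch (helper name and row layout are mine) that tests whether a set of (employee, skill, language) rows satisfies the multi-valued dependency EMPLOYEE →→ SKILL, i.e. whether, for every employee, each of his skills appears paired with each of his languages:

```python
from collections import defaultdict

def satisfies_mvd(rows):
    """rows: iterable of (employee, skill, language) tuples.
    True if, per employee, the rows are exactly the cross product
    of that employee's skills and languages."""
    skills = defaultdict(set)
    languages = defaultdict(set)
    pairs = defaultdict(set)
    for emp, skill, lang in rows:
        skills[emp].add(skill)
        languages[emp].add(lang)
        pairs[emp].add((skill, lang))
    return all(
        pairs[emp] == {(s, l) for s in skills[emp] for l in languages[emp]}
        for emp in pairs
    )

# Smith cooks French cuisine only, but types all three languages:
# the pairing carries information, so the dependency does not hold.
rows = [("Smith", "cook", "French"), ("Smith", "type", "French"),
        ("Smith", "type", "German"), ("Smith", "type", "Greek")]
print(satisfies_mvd(rows))  # False
```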

We should also point out that multi-valued dependencies and fourth normal form also apply to relationships involving more than two fields. For example, suppose we extend the earlier example to include projects, in the following sense:

An employee uses certain skills on certain projects. An employee uses certain languages on certain projects.

If there is no direct connection between the skills and languages that an employee uses on a project, then we could treat this as two independent many-to-many relationships of the form EP:S and EP:L, where EP represents a combination of an employee with a project. A record including employee, project, skill, and language would violate fourth normal form. Two records, containing fields E, P, S and E, P, L, respectively, would satisfy fourth normal form.

3.2 FIFTH NORMAL FORM

Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information which can be maintained with less redundancy. Second, third, and fourth normal forms also serve this purpose, but fifth normal form generalizes to cases not covered by the others.

We will not attempt a comprehensive exposition of fifth normal form, but will illustrate the central concept with a commonly used example, namely, one involving agents, companies, and products. If agents represent companies, companies make products, and agents sell products, then we might want to keep a record of which agent sells which product for which company. This information could be kept in one record type with three fields:

AGENT   COMPANY   PRODUCT
Smith   Ford      car
Smith   GM        truck

This form is necessary in the general case. For example, although agent Smith sells cars made by Ford and trucks made by GM, he does not sell Ford trucks or GM cars. Thus, we need the combination of all three fields to know which combinations are valid and which are not.

But suppose that a certain rule is in effect: if an agent sells a certain product and he represents the company making that product, then he sells that product for that company.

AGENT   COMPANY   PRODUCT
Smith   Ford      car
Smith   Ford      truck
Smith   GM        car
Smith   GM        truck
Jones   Ford      car

In this case, it turns out that we can reconstruct all the true facts from a normalized form consisting of three separate record types, each containing two fields.

AGENT   COMPANY
Smith   Ford
Smith   GM
Jones   Ford

AGENT   PRODUCT
Smith   car
Smith   truck
Jones   car

COMPANY   PRODUCT
Ford      car
Ford      truck
GM        car
GM        truck


These three record types are in fifth normal form, whereas the corresponding three-field record shown previously is not.

Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.

Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint such as the rule about agents, companies, and products. In the absence of such a constraint, a record type in fourth normal form is always in fifth normal form.

One advantage of fifth normal form is that certain redundancies can be eliminated. In the normalized form, the fact that Smith sells cars is recorded only once; in the un-normalized form, it may be repeated many times.

It should be observed that although the normalized form involves more record types, there may be fewer total record occurrences. This is not apparent when there are only a few facts to record, as in the example shown above. The advantage is realized as more facts are recorded, since the size of the normalized files increases in an additive fashion, while the size of the un-normalized file increases in a multiplicative fashion. For example, if we add a new agent who sells x products for y companies, where each of these companies makes each of these products, we have to add x + y new records to the normalized form, but x · y new records to the un-normalized form.
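A quick check of that arithmetic (the values are arbitrary):

```python
x, y = 10, 5          # products sold and companies represented by the new agent
normalized = x + y    # one AGENT-PRODUCT row per product, one AGENT-COMPANY row per company
unnormalized = x * y  # one (AGENT, COMPANY, PRODUCT) row per pairing
print(normalized, unnormalized)  # 15 new records vs 50
```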

It should also be noted that all three record types are required in the normalized form in order to reconstruct the same information. From the first two record types shown above we learn that Jones represents Ford and that Ford makes trucks. But we cannot determine whether Jones sells Ford trucks until we look at the third record type to determine whether Jones sells trucks at all.
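The reconstruction described here is a three-way join; a minimal sketch, with the relations held as Python sets of tuples (names invented):

```python
def reconstruct(agent_company, company_product, agent_product):
    """Join the three binary relations: a triple is valid only if
    all three pairwise facts hold (the agents/companies/products rule)."""
    return {(a, c, p)
            for (a, c) in agent_company
            for (c2, p) in company_product
            if c == c2 and (a, p) in agent_product}

agent_company   = {("Smith", "Ford"), ("Smith", "GM"), ("Jones", "Ford")}
company_product = {("Ford", "car"), ("Ford", "truck"),
                   ("GM", "car"), ("GM", "truck")}
agent_product   = {("Smith", "car"), ("Smith", "truck"), ("Jones", "car")}

for triple in sorted(reconstruct(agent_company, company_product, agent_product)):
    print(triple)
# (Jones, Ford, truck) is correctly absent: Jones does not sell trucks.
```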


The following example illustrates a case in which the rule about agents, companies, and products is satisfied, and which clearly requires all three record types in the normalized form. Any two of the record types taken alone will imply something untrue.

AGENT   COMPANY   PRODUCT
Smith   Ford      car
Smith   Ford      truck
Smith   GM        car
Smith   GM        truck
Jones   Ford      car
Jones   Ford      truck
Brown   Ford      car
Brown   GM        car
Brown   Toyota    car
Brown   Toyota    bus

AGENT   COMPANY   (fifth normal form)
Smith   Ford
Smith   GM
Jones   Ford
Brown   Ford
Brown   GM
Brown   Toyota

COMPANY   PRODUCT   (fifth normal form)
Ford      car
Ford      truck
GM        car
GM        truck
Toyota    car
Toyota    bus

AGENT   PRODUCT   (fifth normal form)
Smith   car
Smith   truck
Jones   car
Jones   truck
Brown   car
Brown   bus

Observe that:

Jones sells cars and GM makes cars, but Jones does not represent GM.

Brown represents Ford and Ford makes trucks, but Brown does not sell trucks.

Brown represents Ford and Brown sells buses, but Ford does not make buses.

Fourth and fifth normal forms both deal with combinations of multi-valued facts. One difference is that the facts dealt with under fifth normal form are not independent, in the sense discussed earlier. Another difference is that, although fourth normal form can deal with more than two multi-valued facts, it only recognizes them in pair-wise groups. We can best explain this in terms of the normalization process implied by fourth normal form. If a record violates fourth normal form, the associated normalization process decomposes it into two records, each containing fewer fields than the original record. Any of the resulting records that still violates fourth normal form is again decomposed into two records, and so on until the resulting records are all in fourth normal form. At each stage, the set of records after decomposition contains exactly the same information as the set of records before decomposition.

In the present example, no pair-wise decomposition is possible. There is no combination of two smaller records which contains the same total information as the original record. All three of the smaller records are needed. Hence, an information-preserving pair-wise decomposition is not possible, and the original record is not in violation of fourth normal form. Fifth normal form is needed in order to deal with the redundancies in this case.

Difference between BCNF and Third Normal Form

Both 3NF and BCNF are normal forms used in relational databases to minimize redundancy in tables. In a table that is in BCNF, for every non-trivial functional dependency of the form A → B, A is a super-key, whereas a table that complies with 3NF must be in 2NF, and every non-prime attribute must depend directly on every candidate key of that table. BCNF is considered a stronger normal form than 3NF, and it was developed to capture some of the anomalies that could not be captured by 3NF.
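As a sketch of what the BCNF condition means operationally, the following hypothetical checker computes the attribute closure of each dependency's left-hand side and tests whether it is a super-key; the STUDENT/COURSE/TEACHER relation used below is a standard example that is in 3NF but not in BCNF:

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds (a list of (lhs, rhs) frozensets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(all_attrs, fds):
    """BCNF: for every non-trivial FD lhs -> rhs, lhs must be a super-key."""
    for lhs, rhs in fds:
        if rhs <= lhs:  # trivial dependency, ignore
            continue
        if closure(lhs, fds) != set(all_attrs):
            return False
    return True

# Each teacher teaches one course, but {STUDENT, COURSE} is the key:
# COURSE is a prime attribute, so 3NF holds, yet TEACHER is not a super-key.
attrs = {"STUDENT", "COURSE", "TEACHER"}
fds = [(frozenset({"STUDENT", "COURSE"}), frozenset({"TEACHER"})),
       (frozenset({"TEACHER"}), frozenset({"COURSE"}))]
print(is_bcnf(attrs, fds))  # False
```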


Obtaining a table that complies with BCNF may require decomposing a table that is in 3NF. This decomposition results in additional join operations (or Cartesian products) when executing queries, which increases computation time. On the other hand, tables that comply with BCNF have fewer redundancies than tables that only comply with 3NF. Furthermore, most of the time it is possible to obtain a table that complies with 3NF without sacrificing dependency preservation and lossless joins. This is not always possible with BCNF.

Difference between BCNF and 4th Normal Form

A database must already be in 3NF to take it to BCNF, but it must be in both 3NF and BCNF to reach 4NF.

In fourth normal form, the tables contain no multi-valued dependencies, whereas tables in BCNF may still contain multi-valued dependencies.


2. What are the differences between Centralized and Distributed Database Systems? List the relative advantages of data distribution.

Answer 2:

Differences between Distributed and Centralized Databases

1. Centralized Control vs. Decentralized Control

In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who are responsible for their local databases.

2. Data Independence

In centralized databases, data independence means that the actual organization of the data is transparent to the application programmer. Programs are written against a "conceptual" view of the data (the "conceptual schema") and are unaffected by the physical organization of the data. In distributed databases, another aspect, distribution transparency, is added to the notion of data independence as used in centralized databases: programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another, although their speed of execution is affected.

3. Reduction of Redundancy

In centralized databases, redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable because (a) the locality of applications can be increased if the data is replicated at all sites where applications need it, and (b) the availability of the system is increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.
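The read-any-copy / write-all-copies discipline described here can be sketched as a toy model (no real DDBMS API is implied; site names are invented):

```python
import random

class ReplicatedTable:
    """Toy model of full replication: reads go to any copy,
    updates are applied to every copy to keep them consistent."""
    def __init__(self, sites):
        self.copies = {site: {} for site in sites}

    def read(self, key):
        site = random.choice(list(self.copies))  # any replica will do
        return self.copies[site].get(key)

    def write(self, key, value):
        for copy in self.copies.values():        # write-all
            copy[key] = value

t = ReplicatedTable(["site_A", "site_B", "site_C"])
t.write("part-42", {"warehouse": "Oslo", "quantity": 7})
print(t.read("part-42"))
```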


4. Complex Physical Structures and Efficient Access

In centralized databases, complex access structures like secondary indexes and inter-file chains are used to provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be written by the programmer or produced automatically by an optimizer. Problems faced in the design of such an optimizer fall into two categories: (a) global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites; (b) local optimization consists of deciding how to perform the local database accesses at each site.

5. Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

6. Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problems, plus two new aspects: (a) security (protection) problems intrinsic to the communication networks on which distributed systems depend; (b) sites with a high degree of "site autonomy" may feel more protected, because they can enforce their own protections instead of depending on a central database administrator.

7. Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context, the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.

8. Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information like fragmentation description, allocation description, mappings to local names, access method description, statistics on the database, protection and integrity constraints (consistency information) which are more detailed as compared to centralized databases.

Relative Advantages of Distributed Databases over Centralized Databases

1. Organizational and Economic Reasons

Many organizations are decentralized, and a distributed database approach fits the structure of the organization more naturally. The organizational and economic motivations are amongst the main reasons for the development of distributed databases. For organizations that already have several databases and feel the need for global applications, a distributed database is the natural choice.

2. Incremental Growth

In a distributed environment, expansion of the system in terms of adding more data, increasing database size, or adding more processors is much easier.

3. Reduced Communication Overhead


Many applications are local, and these applications do not have any communication overhead. Therefore, the maximization of the locality of applications is one of the primary objectives in distributed database design.

4. Performance Considerations

Data localization reduces the contention for CPU and I/O services and simultaneously reduces access delays involved in wide area networks. Local queries and transactions accessing data at a single site have better performance because of the smaller local databases. In addition, each site has a smaller number of transactions executing than if all transactions are submitted to a single centralized database. Moreover, inter-query and intra-query parallelism can be achieved by executing multiple queries at different sites, or breaking up a query into a number of sub queries that execute in parallel. This contributes to improved performance.

5. Reliability and Availability

Reliability is defined as the probability that a system is running (not down) at a certain time point. Availability is the probability that the system is continuously available during a time interval. When the data and DBMS software are distributed over several sites, one site may fail while other sites continue to operate. Only the data and software that exist at the failed site cannot be accessed. This improves both reliability and availability. Further improvement is achieved by judiciously replicating data and software at more than one site.

6. Management of Distributed Data with Different Levels of Transparency

In a distributed database, the following types of transparency are possible:

1. Distribution or Network Transparency

This refers to freedom for the user from the operational details of the network. It may be divided into location and naming transparency. Location transparency refers to the fact that the command used to perform a task is independent of the location of data and the location of the system where the command was issued. Naming transparency implies that once a name is specified, the named objects can be accessed unambiguously without additional specification.

2. Replication Transparency


Copies of the data may be stored at multiple sites for better availability, performance, and reliability. Replication transparency makes the user unaware of the existence of copies.

3. Fragmentation Transparency

The two main types of fragmentation are horizontal fragmentation, which distributes a relation into sets of tuples (rows), and vertical fragmentation, which distributes a relation into sub-relations, each defined by a subset of the columns of the original relation. A global query issued by the user must be transformed into several fragment queries. Fragmentation transparency makes the user unaware of the existence of fragments.
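A minimal illustration of the two fragmentation styles, with rows held as Python dicts; the predicate and column choices are invented for the example:

```python
employees = [
    {"id": 1, "name": "Joan",  "dept": "Sales",   "site": "Oslo"},
    {"id": 2, "name": "Svein", "dept": "Support", "site": "Bergen"},
]

# Horizontal fragmentation: each fragment is a subset of the tuples (rows).
oslo_fragment = [row for row in employees if row["site"] == "Oslo"]

# Vertical fragmentation: each fragment keeps a subset of the columns,
# plus the key so the original relation can be reconstructed by a join.
id_name_fragment = [{"id": r["id"], "name": r["name"]} for r in employees]
id_dept_fragment = [{"id": r["id"], "dept": r["dept"], "site": r["site"]}
                    for r in employees]

print(oslo_fragment)
print(id_name_fragment)
```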


3. Describe the concepts of Structural Semantic Data Model (SSM).

Answer 3:

Structural Semantic Data Model – SSM

The Structural Semantic Model, SSM, first described in Nordbotten (1993 a & b), is an extension and graphic simplification of the EER modeling tool first presented in the '89 edition of (Elmasri & Navathe, 2003). SSM was developed as a teaching tool and has been, and can continue to be, modified to include new modeling concepts. A particular requirement today is the inclusion of concepts and syntax symbols for modeling multimedia objects.

SSM Concepts

The current version of SSM belongs to the class of semantic data model types extended with concepts for the specification of user-defined data types and functions, UDT and UDF. It supports the modeling concepts defined in the following table.

Table: Data Modeling Concepts

Entity types:

· Entity (object): Something of interest to the information system about which data is collected. Examples: a person, student, computer, employee, department, product, exam, order.
· Entity type: A set of entities sharing common attributes. Examples: Citizen of Norway; PERSON {Name, Address, ...}.
· Subclass / superclass entity type: A subclass entity type is a specialization of, or alternatively a role played by, a superclass entity type. Examples: Student IS_A Person; Teacher IS_A Person.
· Shared subclass entity type: A shared subclass entity type has characteristics of two or more parent entity types. Example: a student assistant IS_BOTH a student and an employee.
· Category entity type: A subclass entity type of two or more distinct/independent superclass entity types. Example: an owner IS_EITHER a person or an organization.
· Weak entity type: An entity type dependent on another for its identification and existence. Example: Education is (can be) a weak entity type dependent on Person.

Attributes:

· Property: A characteristic of an entity. Example: Person.Name = John.
· Attribute: The name given to a property of an entity or relationship type. Example: Person {ID, Name, Address, Telephone, Age, Position, ...}.
· Atomic: An attribute having a single value. Example: Person.ID.
· Multivalued: An attribute with multiple values. Example: Telephone {home, office, mobile, fax}.
· Composite (compound): An attribute composed of several sub-attributes. Examples: Address {Street No, City, State, Pin}; Name {First, Middle, Last}.
· Derived: An attribute whose value depends on other values in the DB/environment. Examples: Person.Age = Current_Date - Birth_Date; Person.Salary, calculated in relation to current salary levels.

Relationships:

· Relationship: A relation between two or more entities. Examples: Joan MARRIED_TO Svein; Joan WORKS_FOR IFI; Course_Grade {John, 133, UiBDB, 19nn, 1.5, ...}.
· Associative relationship: A set of relationships between two or more entity types. Examples: Employee WORKS_FOR Department; Course_Grade :: Student, Course, Project.
· Hierarchic relationship: A superclass structure; a strict hierarchy has one path to each subclass entity type, while a lattice structure has multiple paths. Examples: Person => Student => Graduate-Student; Person => (Teacher, Student) => Assistant.

Constraints:

· Domain: The set of valid values for an attribute. Example: Person.Age :: [0-125].
· Primary key, PK (identifier, OID): The set of attributes whose values uniquely identify an entity. Example: Person.ID.
· Foreign key (reference key): An attribute containing the PK of an entity to which this entity is related. Example: Person {ID, ..., Manager, Department}.
· Relationship cardinality structure: A (min, max) association between an entity type and a relationship type. Example: a student may have many Course_Grades.
· Classification participation: [partial p | total t], [disjoint d | overlapping o]. Example: Person (p, o) => (Teacher, Student).

"(Data behaviour)" (DBMS action by event):

· User-defined function, UDF: A function triggered by use {storage, update, retrieval} of an attribute. Example: calculation of current data values, such as age from birth date.

The concepts and graphic syntax of SSM include:

1. Three types of entity specifications: base (root), subclass, and weak.
2. Four types of inter-entity relationships: n-ary associative, and three types of classification hierarchies.
3. Four attribute types: atomic, multi-valued, composite, and derived.
4. Domain type specifications in the graphic model, including standard data types, binary large objects (blob, text, image, ...), and user-defined types (UDT) and functions (UDF).
5. Cardinality specifications for entity to relationship-type connections and for multi-valued attribute types.
6. Data value constraints.

Figure: SSM Entity Relationships, Hierarchical and Associative. The diagram shows base, weak, and subclass entity types (PERSON, TEACHER, STUDENT, COURSE, REPORT, CV), hierarchic relationships, and associative relationships such as Teaches/Taught By and Writes/Written By with (min, max) cardinality specifications.

Figure: SSM Attribute and Data Types. The diagram shows a PERSON entity type with: a primary-key atomic attribute ID <PersonNo>; an atomic attribute Birth date <date>; a composite attribute Name {First <vchar(10)>, Last <vchar(20)>}; a multivalued attribute (0, 4) Telephone <dec(13)>; a multivalued composite attribute (1, 2) Address {Street <vchar(25)>, PostCode <Pcode>, geo-loc <Point>} using UDTs and spatial data types; a derived attribute age <integer>; and the image and text attributes picture <image> and CV <text>.
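As an informal rendering of the four SSM attribute types shown in the figure above (atomic, multi-valued, composite, derived), a PERSON entity might be sketched as follows; the class layout is illustrative only, not SSM syntax:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Address:                 # composite attribute: built from sub-attributes
    street: str
    post_code: str

@dataclass
class Person:
    person_no: int             # atomic attribute, also the primary key
    name: str                  # atomic attribute
    birth_date: date           # atomic attribute
    telephones: list[str] = field(default_factory=list)     # multi-valued, (0, 4)
    addresses: list[Address] = field(default_factory=list)  # multi-valued composite, (1, 2)

    @property
    def age(self) -> int:      # derived attribute: computed, not stored
        today = date.today()
        return today.year - self.birth_date.year - (
            (today.month, today.day) < (self.birth_date.month, self.birth_date.day))

p = Person(1, "Joan", date(1990, 5, 17), ["555-0101"], [Address("Main St 1", "5020")])
print(p.age)
```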


4. Describe the following with respect to Object-Oriented Databases:

Answer 4:

a. Query Processing in Object-Oriented Database Systems

One of the criticisms of first-generation object-oriented database management systems (OODBMSs) was their lack of declarative query capabilities. This led some researchers to compare first-generation OODBMSs to the earlier (network and hierarchical) DBMSs. It was commonly believed that the application domains that OODBMS technology targets do not need querying capabilities. This belief no longer holds, and declarative query capability is accepted as one of the fundamental features of OODBMSs. Indeed, most of the current prototype systems experiment with powerful query languages and investigate their optimization. Commercial products have started to include such languages as well, e.g. O2 and ObjectStore.

In this section we discuss the issues related to the optimization and execution of OODBMS query languages (which we collectively call query processing). Query optimization techniques are dependent upon the query model and language. For example, a functional query language lends itself to functional optimization, which is quite different from the algebraic, cost-based optimization techniques employed in relational as well as a number of object-oriented systems. The query model, in turn, is based on the data (or object) model, since the latter defines the access primitives which are used by the query model. These primitives, at least partially, determine the power of the query model. Despite this close relationship, in this unit we do not consider issues related to the design of object models, query models, or query languages in any detail.

Almost all object query processors proposed to date use optimization techniques developed for relational systems. However, there are a number of issues that make query processing more difficult in OODBMSs. The following are some of the more important issues:

Type System

Relational query languages operate on a simple type system consisting of a single aggregate type: the relation. The closure property of relational languages implies that each relational operator takes one or more relations as operands and produces a relation as a result. In contrast, object systems have richer type systems. The results of object algebra operators are usually sets of objects (or collections) whose members may be of different types. If the object languages are closed under the algebra operators, these heterogeneous sets of objects can be operands to other operators. This requires the development of elaborate type inference schemes to determine which methods can be applied to all the objects in such a set. Furthermore, object algebras often operate on semantically different collection types (e.g., set, bag, list), which imposes additional requirements on the type inference schemes to determine the type of the results of operations on collections of different types.

Encapsulation

Relational query optimization depends on knowledge of the physical storage of data (access paths) which is readily available to the query optimizer. The encapsulation of methods with the data that they operate on in OODBMSs raises (at least) two issues. First, estimating the cost of executing methods is considerably more difficult than estimating the cost of accessing an attribute according to an access path. In fact, optimizers have to worry about optimizing method execution, which is not an easy problem because methods may be written using a general-purpose programming language. Second, encapsulation raises issues related to the accessibility of storage information by the query optimizer. Some systems overcome this difficulty by treating the query optimizer as a special application that can break encapsulation and access information directly. Others propose a mechanism whereby objects “reveal” their costs as part of their interface.

Complex Objects and Inheritance

Objects usually have complex structures where the state of an object references other objects. Accessing such complex objects involves path expressions. The optimization of path expressions is a difficult and central issue in object query languages. We discuss this issue in some detail in this unit. Furthermore, objects belong to types related through inheritance hierarchies. Efficient access to objects through their inheritance hierarchies is another problem that distinguishes object-oriented from relational query processing.
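For example, a path expression such as employee.department.location navigates object references step by step; a toy rendering (all class and attribute names hypothetical):

```python
class Department:
    def __init__(self, name, location):
        self.name, self.location = name, location

class Employee:
    def __init__(self, name, department):
        self.name, self.department = name, department

hq = Department("Research", "Bergen")
emp = Employee("Joan", hq)

# The path expression employee.department.location: each step follows an
# object reference. An optimizer may instead evaluate such a path as a
# join over the extents of Employee and Department.
print(emp.department.location)  # "Bergen"
```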

Object Models


OODBMSs lack a universally accepted object model definition. Even though there is some consensus on the basic features that need to be supported by any object model (e.g., object identity, encapsulation of state and behavior, type inheritance, and typed collections), how these features are supported differs among models and systems. As a result, the numerous projects that experiment with object query processing follow quite different paths and are, to a certain degree, incompatible, making it difficult to amortize on the experiences of others. This diversity of approaches is likely to prevail for some time, therefore, it is important to develop extensible approaches to query processing that allow experimentation with new ideas as they evolve. We provide an overview of various extensible object query processing approaches.

b. Query Processing Architecture

In this section we focus on two architectural issues: the query processing methodology and the query optimizer architecture.

Query Processing Methodology: A query processing methodology similar to that of relational DBMSs, but modified to deal with the difficulties discussed in the previous section, can be followed in OODBMSs. Figure 6.1 depicts one such proposed methodology.

The steps of the methodology are as follows.

1. Queries are expressed in a declarative language.
2. They require no user knowledge of object implementations, access paths or processing strategies.
3. The query is first translated into a calculus expression.
4. Calculus optimization.
5. Calculus to algebra transformation.
6. Type check.
7. Algebra optimization.
8. Execution plan generation.
9. Execution.

Figure 6.1: Object Query Processing Methodology. The query declaration is optimized into a normalized calculus expression, transformed into an object algebra expression, type-checked into a type-consistent expression, optimized into an optimized algebra expression, and finally turned into an execution plan.
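The methodology is essentially a pipeline in which each phase consumes the previous phase's output; a skeletal sketch (the phase bodies are placeholders, not a real optimizer):

```python
def calculus_optimization(query):     return query  # normalize the calculus expression
def calculus_to_algebra(query):       return query  # map to an object algebra expression
def type_check(expression):           return expression  # verify type consistency
def algebra_optimization(expression): return expression  # cost-based rewriting
def plan_generation(expression):      return expression  # choose an execution plan

def process(declarative_query):
    """Run the phases of Figure 6.1 in order."""
    result = declarative_query
    for phase in (calculus_optimization, calculus_to_algebra,
                  type_check, algebra_optimization, plan_generation):
        result = phase(result)
    return result
```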


5. Describe the Differences between Distributed & Centralized Databases

Answer 5:

Differences between Distributed and Centralized Databases

1. Centralized Control vs. Decentralized Control

In centralized control, one "database administrator" ensures the safety of the data, whereas in distributed control it is possible to use a hierarchical control structure based on a "global database administrator", who has central responsibility for the whole database, along with "local database administrators", who are responsible for their local databases.

2. Data Independence

In centralized databases, data independence means that the actual organization of the data is transparent to the application programmer. Programs are written against a "conceptual" view of the data (the "conceptual schema") and are unaffected by the physical organization of the data. In distributed databases, another aspect, distribution transparency, is added to the notion of data independence as used in centralized databases: programs are written as if the data were not distributed. Thus the correctness of programs is unaffected by the movement of data from one site to another, although their speed of execution is affected.

3. Reduction of Redundancy

In centralized databases, redundancy is reduced for two reasons: (a) inconsistencies among several copies of the same logical data are avoided, and (b) storage space is saved. Reduction of redundancy is obtained by data sharing. In distributed databases, data redundancy is desirable because (a) the locality of applications can be increased if the data is replicated at all sites where applications need it, and (b) the availability of the system is increased, because a site failure does not stop the execution of applications at other sites if the data is replicated. With data replication, retrieval can be performed on any copy, while updates must be performed consistently on all copies.


4. Complex Physical Structures and Efficient Access

In centralized databases, complex access structures like secondary indexes and inter-file chains are used to provide efficient access to data. In distributed databases, efficient access requires accessing data from different sites. For this, an efficient distributed data access plan is required, which can be written by the programmer or produced automatically by an optimizer. Problems faced in the design of such an optimizer fall into two categories: (a) global optimization consists of determining which data must be accessed at which sites and which data files must consequently be transmitted between sites; (b) local optimization consists of deciding how to perform the local database accesses at each site.

5. Integrity, Recovery and Concurrency Control

A transaction is an atomic unit of execution, and atomic transactions are the means to obtain database integrity. Failures and concurrency are the two dangers to atomicity. Failures may cause the system to stop in the midst of transaction execution, thus violating the atomicity requirement. Concurrent execution of different transactions may permit one transaction to observe an inconsistent, transient state created by another transaction during its execution. Concurrent execution requires synchronization amongst the transactions, which is much harder in distributed systems.

6. Privacy and Security

In traditional databases, the database administrator, having centralized control, can ensure that only authorized access to the data is performed. In distributed databases, local administrators face the same problems, plus two new aspects: (a) security (protection) problems intrinsic to the communication networks on which distributed systems depend; (b) sites with a high degree of "site autonomy" may feel more protected, because they can enforce their own protections instead of depending on a central database administrator.

7. Distributed Query Processing

The DDBMS should be capable of gathering and presenting data from more than one site to answer a single query. In theory a distributed system can handle queries more quickly than a centralized one, by exploiting parallelism and reducing disc contention; in practice the main delays (and costs) will be imposed by the communications network. Routing algorithms must take many factors into account to determine the location and ordering of operations. Communications costs for each link in the network are relevant, as are variable processing capabilities and loadings for different nodes, and (where data fragments are replicated) trade-offs between cost and currency. If some nodes are updated less frequently than others, there may be a choice between querying the local out-of-date copy very cheaply and getting a more up-to-date answer by accessing a distant location. The ability to do query optimization is essential in this context, the main objective being to minimize the quantity of data to be moved around. As with single-site databases, one must consider both generalized operations on internal query representations and the exploitation of information about the current state of the database.

8. Distributed Directory (Catalog) Management

Catalogs for distributed databases contain information like fragmentation description, allocation description, mappings to local names, access method description, statistics on the database, protection and integrity constraints (consistency information) which are more detailed as compared to centralized databases.

6. Describe the following:


· Data Mining Functions
· Data Mining Techniques

Answer 6:

Data Mining Functions

Data mining methods may be classified by the function they perform or according to the class of application they can be used in. Some of the main techniques used in data mining are described in this section.

Classification

Data Mining tools have to infer a model from the database, and in the case of Supervised Learning this requires the user to define one or more classes. The database contains one or more attributes that denote the class of a tuple and these are known as predicted attributes whereas the remaining attributes are called predicting attributes. A combination of values for the predicted attributes defines a class.

Once the classes are defined, the system should infer rules that govern the classification; therefore the system should be able to find the description of each class. The descriptions should refer only to the predicting attributes of the training set, so that the positive examples satisfy the description and none of the negative examples do. A rule is said to be correct if its description covers all the positive examples and none of the negative examples of a class.

A rule is generally presented in the form: if the left-hand side (LHS) then the right-hand side (RHS), meaning that in all instances where the LHS is true, the RHS is also true, or at least very probable. The categories of rules are:

· Exact Rule – permits no exceptions so each object of LHS must be an element of RHS

· Strong Rule – allows some exceptions, but the exceptions have a given limit

· Probabilistic Rule – relates the conditional probability P(RHS|LHS) to the probability P(RHS)

Other types of rules are classification rules, where the LHS is a sufficient condition to classify objects as belonging to the concept referred to in the RHS.

Associations

Given a collection of items and a set of records, each of which contains some number of items from the given collection, an association function is an operation against this set of records which returns affinities or patterns that exist among the collection of items. These patterns can be expressed by rules such as "72% of all the records that contain items A, B and C also contain items D and E." The specific percentage of occurrences (in this case 72) is called the confidence factor of the rule. Also, in this rule, A, B and C are said to be on an opposite side of the rule to D and E. Associations can involve any number of items on either side of the rule.
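The confidence factor can be computed directly from the records; a small sketch with invented basket data:

```python
def confidence(records, lhs, rhs):
    """Fraction of records containing all LHS items that also contain all RHS items."""
    lhs, rhs = set(lhs), set(rhs)
    with_lhs = [r for r in records if lhs <= set(r)]
    if not with_lhs:
        return 0.0
    return sum(1 for r in with_lhs if rhs <= set(r)) / len(with_lhs)

baskets = [{"A", "B", "C", "D", "E"},
           {"A", "B", "C", "D"},
           {"A", "B", "C"},
           {"B", "D"}]
# Of the three baskets containing A, B and C, one also contains D and E.
print(confidence(baskets, {"A", "B", "C"}, {"D", "E"}))  # 0.333...
```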

Sequential/Temporal patterns

Sequential/temporal pattern functions analyze a collection of records over a period of time, for example to identify trends. Where the identity of the customer who made a purchase is known, an analysis can be made of the collection of related records of the same structure (i.e. consisting of a number of items drawn from a given collection of items). The records are related by the identity of the customer who made the repeated purchases. Such a situation is typical of a direct mail application where, for example, a catalogue merchant has, for each customer, the sets of products that the customer buys in every purchase order. A sequential pattern function will analyze such collections of related records and detect frequently occurring patterns of products bought over time. A sequential pattern operator could also be used to discover, for example, the set of purchases that frequently precedes the purchase of a microwave oven.
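A toy sketch of that last idea, discovering what frequently precedes a target purchase, might look like the following; the data layout, names, and example histories are assumptions for illustration only:

from collections import Counter

def items_preceding(purchase_histories, target):
    """Count items each customer bought before their first purchase of target."""
    counts = Counter()
    for orders in purchase_histories.values():   # orders are in time order
        seen = set()
        for order in orders:
            if target in order:
                counts.update(seen)
                break
            seen.update(order)
    return counts

histories = {
    "cust1": [{"toaster"}, {"kettle", "toaster"}, {"microwave"}],
    "cust2": [{"kettle"}, {"microwave"}],
}
print(items_preceding(histories, "microwave"))
# Counter({'kettle': 2, 'toaster': 1})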

Clustering/Segmentation

Clustering and Segmentation are the processes of creating a partition so that all the members of each set of the partition are similar according to some metric. A Cluster is a set of objects grouped together because of their similarity or proximity. Objects are often decomposed into an exhaustive and/or mutually exclusive set of clusters.
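As a minimal illustration, one common way of producing such a partition is the k-means algorithm; the sketch below applies scikit-learn's KMeans to invented basket data (number of items and revenue per basket):

from sklearn.cluster import KMeans

baskets = [[1, 20.0], [2, 35.5], [9, 180.0], [10, 210.0]]   # [items, revenue]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(baskets)
print(model.labels_)   # e.g. [0 0 1 1]: small baskets vs. large baskets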

IBM – Market Basket Analysis example

IBM has used segmentation techniques in its Market Basket Analysis of POS transactions, separating a set of untagged input records into reasonable groups according to product revenue by market basket; that is, the market baskets were segmented based on the number and type of products in the individual baskets.

Each segment reports total revenue and number of baskets; using a neural network, 275,000 transaction records were divided into 16 segments. The following types of analysis were also available:


1. Revenue by segment
2. Baskets by segment
3. Average revenue by segment, etc.

Data Mining Techniques

Cluster Analysis

In an unsupervised learning environment the system has to discover its own classes, and one way in which it does this is to cluster the data in the database. The first step is to discover subsets of related objects, and then to find descriptions, e.g. D1, D2, D3, etc., which describe each of these subsets.

Induction

A database is a store of information, but more important is the information that can be inferred from it. There are two main inference techniques: deduction and induction.

· Deduction is a technique to infer information that is a logical consequence of the information in the database, e.g. the join operator applied to two relational tables, where the first relates employees and departments and the second relates departments and managers, infers a relation between employees and managers (see the sketch after this list).

· Induction has been described earlier as the technique to infer information that is generalised from the database, as in the example above, where one may infer that each employee has a manager. This is higher-level information or knowledge, in that it is a general statement about objects in the database. The database is searched for patterns or regularities.
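A minimal sketch of deduction by join (relation contents and names are invented) derives the employee-to-manager relation from the two base relations mentioned above:

emp_dept = [("alice", "sales"), ("bob", "hr")]    # employee -> department
dept_mgr = [("sales", "carol"), ("hr", "dan")]    # department -> manager

# Join on the shared department attribute to infer employee -> manager.
emp_mgr = [(e, m) for (e, d1) in emp_dept
                  for (d2, m) in dept_mgr if d1 == d2]
print(emp_mgr)   # [('alice', 'carol'), ('bob', 'dan')]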

Induction has been used in the following ways within data mining.

Decision Trees


Decision Trees are a simple form of knowledge representation that classify examples into a finite number of classes. The nodes are labeled with attribute names, the edges are labeled with the possible values of the attribute, and the leaves are labeled with the different classes. An object is classified by following a path down the tree, taking the edges that correspond to the values of its attributes.

The following is an example of objects that describe the weather at a given time. The objects contain information on the outlook, humidity, etc. Some objects are positive examples, denoted by P, and others are negative, denoted by N. Classification is, in this case, the construction of a tree structure, illustrated in the diagram below, which can be used to classify all the objects correctly.

[Diagram: Decision Tree Structure]
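As a hedged sketch of the same idea, a small decision tree can be fitted to encoded weather-style objects with scikit-learn; the attribute encodings and example values below are invented:

from sklearn.tree import DecisionTreeClassifier, export_text

# Encoded attributes: [outlook (0=sunny, 1=overcast, 2=rain), humidity (0=normal, 1=high)]
X = [[0, 1], [0, 0], [1, 1], [2, 0], [2, 1]]
y = ["N", "P", "P", "P", "N"]   # P = positive example, N = negative example

tree = DecisionTreeClassifier().fit(X, y)
print(export_text(tree, feature_names=["outlook", "humidity"]))
print(tree.predict([[1, 0]]))   # classify a new object by walking the tree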

Neural Networks

Neural Networks are an approach to computing that involves developing mathematical structures with the ability to learn. The methods are the result of academic investigations to model nervous system learning. Neural Networks have the remarkable ability to derive meaning from complicated or imprecise data and can be used to extract patterns and detect trends that are too complex to be noticed by either humans or other computer techniques. A trained Neural Network can be thought of as an "expert" in the category of information it has been given to analyze. This expert can then be used to provide projections given new situations of interest and answer "what if" questions.

Neural Networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited for prediction or forecasting needs including:

· Sales Forecasting
· Industrial Process Control
· Customer Research
· Data Validation
· Risk Management
· Target Marketing, etc.

Neural Networks use a set of processing elements (or nodes) analogous to Neurons in the brain. These processing elements are interconnected in a network that can then identify patterns in data once it is exposed to the data, i.e. the network learns from experience just as people do. This distinguishes neural networks from traditional computing programs that simply follow instructions in a fixed sequential order.

The structure of a neural network looks something like the following:

[Diagram: Structure of a neural network]

The bottom layer represents the input layer, in this case with five inputs, labelled X1 through X5. In the middle is the hidden layer, with a variable number of nodes; it is the hidden layer that performs much of the work of the network. The output layer, in this case, has two nodes, Z1 and Z2, representing the output values we are trying to determine from the inputs, for example predicting sales (output) based on past sales, price and season (inputs).


Each node in the hidden layer is fully connected to the inputs, which means that what is learned in a hidden node is based on all the inputs taken together. Statisticians maintain that the network can pick up the interdependencies in the model. Inside a hidden node, the processing works as follows.

Simply speaking, a weighted sum is performed: X1·W1 + X2·W2 + ... + X5·W5. This weighted sum is performed for each hidden node and each output node, and it is how interactions are represented in the network.
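In code, the computation inside a single hidden node might look like the following minimal sketch; the input values, weights, and the choice of a sigmoid activation are illustrative assumptions:

import math

x = [0.5, 1.0, 0.2, 0.8, 0.1]    # inputs X1..X5
w = [0.4, -0.6, 0.1, 0.9, 0.3]   # weights W1..W5 learned by the network

s = sum(xi * wi for xi, wi in zip(x, w))   # X1*W1 + X2*W2 + ... + X5*W5
output = 1 / (1 + math.exp(-s))            # squash the sum with a sigmoid
print(s, output)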

The issue of where the network gets its weights from is important; suffice it to say that the network learns to reduce the error in its predictions of events already known (i.e. past history).

The problems of using neural networks have been summarized by Arun Swami of Silicon Graphics Computer Systems. Neural networks have been used successfully for classification, but suffer somewhat in that the resulting network is viewed as a black box and no explanation of the results is given. This lack of explanation inhibits confidence, acceptance and application of the results. He also notes that neural networks suffer from long learning times, which become worse as the volume of data grows.

The Clementine User Guide includes a simple diagram (7.6) summarizing a Neural Net trained to identify the risk of cancer from a number of factors.

On-Line Analytical Processing


A major issue in information processing is how to process larger and larger databases, containing increasingly complex data, without sacrificing response time. The client/server architecture gives organizations the opportunity to deploy specialized servers which are optimized for handling specific data management problems. Until recently, organizations have tried to target Relational Database Management Systems (RDBMSs) for the complete spectrum of database applications. It is, however, apparent that there are major categories of database applications which are not suitably serviced by relational database systems. Oracle, for example, has built a totally new Media Server for handling multimedia applications. Sybase uses an Object-Oriented DBMS (OODBMS) in its Gain Momentum product, which is designed to handle complex data such as images and audio. Another category of applications is that of On-Line Analytical Processing (OLAP). OLAP is a term coined by E. F. Codd (1993), who defined it as "the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data".

Codd has developed rules or requirements for an OLAP system:

· Multidimensional Conceptual View
· Transparency
· Accessibility
· Consistent Reporting Performance
· Client/Server Architecture
· Generic Dimensionality
· Dynamic Sparse Matrix Handling
· Multi-User Support
· Unrestricted Cross Dimensional Operations
· Intuitive Data Manipulation
· Flexible Reporting
· Unlimited Dimensions and Aggregation Levels
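As a small, hedged illustration of "the dynamic synthesis, analysis and consolidation of large volumes of multidimensional data", the sketch below consolidates invented sales figures along two dimensions with a pandas pivot table:

import pandas as pd

sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "revenue": [100, 120, 90, 130],
})

# Consolidate revenue along the region and quarter dimensions, with totals.
cube = sales.pivot_table(values="revenue", index="region",
                         columns="quarter", aggfunc="sum", margins=True)
print(cube)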